A corpus containing transcriptions of spoken language.
What is a spoken corpus?
The process of breaking text into words or sentences.
What is tokenization?
Words that often appear together (e.g., "fast food").
What are collocations?
Software to analyze word frequency (e.g., AntConc).
What is a concordancer?
Widely regarded as the first major computerized corpus (compiled at Brown University from texts published in 1961).
What is the Brown Corpus?
A corpus that pairs texts with their translations in one or more other languages, used in translation studies.
What is a parallel corpus?
The method of reducing words to their base form (e.g., "running" → "run").
What is lemmatization?
A word with multiple meanings (e.g., "bat").
What is polysemy?
Corpora are widely used to improve this type of AI-based language translation.
What is machine translation?
The linguist who criticized corpora for ignoring "possible" sentences.
Who is Noam Chomsky?
A collection of old texts used to study language change over time.
What is a historical corpus?
The tool used to find words in context within a corpus.
What is a concordancer?
The tendency of "cause" to pair with negative words ("trouble").
What is semantic prosody?
Corpus data helps in building these digital tools that define word meanings.
What are dictionaries?
This corpus includes conversational English and is often used for studying spoken language.
What is the Corpus of Contemporary American English?
A corpus built using student writing to analyze common mistakes.
What is a learner corpus?
A measure of how often a word appears in a corpus.
What is word frequency?
Fixed phrases like "on the other hand."
What are lexical bundles?
Using corpora to update dictionaries.
What is lexicography?
The father of modern corpus linguistics (created COBUILD).
Who is John Sinclair?
A corpus updated regularly to track new language trends (e.g., COCA).
What is a monitor corpus?
Tagging words with meanings (e.g., "bank" = financial institution vs. river).
What is semantic annotation?
Studying how words interact with grammar (e.g., "make a decision").
What is lexico-grammar?
Corpora are used to train AI models for this type of technology that lets computers understand and generate human language.
What is natural language processing (NLP)?
The organization that distributes corpora for NLP research.
What is the Linguistic Data Consortium (LDC)?
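
Several of the cards above (tokenization, word frequency, collocations, concordancer) describe procedures that are easy to try on a small scale. The following is a minimal sketch in Python using only the standard library. It is not how AntConc or any other tool is implemented; the function names, the sample sentence, and the very simple tokenization rule are all assumptions made for illustration.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens.

    Assumption: a token is a run of letters or apostrophes. Real
    tokenizers handle numbers, hyphens, and punctuation more carefully.
    """
    return re.findall(r"[a-z']+", text.lower())


def word_frequency(tokens):
    """Count how often each token appears."""
    return Counter(tokens)


def bigram_counts(tokens):
    """Count adjacent word pairs, a crude first step toward finding collocations."""
    return Counter(zip(tokens, tokens[1:]))


def concordance(tokens, keyword, width=2):
    """Return keyword-in-context lines with `width` tokens on each side."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append(f"{left} [{tok}] {right}".strip())
    return lines


# Hypothetical sample text, not taken from any real corpus.
sample = "Fast food is cheap. Fast food is everywhere, but fast cars are not."
tokens = tokenize(sample)

print(word_frequency(tokens).most_common(3))
print(bigram_counts(tokens).most_common(2))
print(concordance(tokens, "cars", width=2))
```

Raw pair counts only hint at collocations. Corpus tools usually go further and apply an association measure to separate genuine collocations from pairs that are frequent simply because both words are common.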