Language
Models
Feature
Training
... And Beyond
100

This challenge occurs when a word, phrase, or sentence can have multiple possible interpretations.

What is Ambiguity?

100

This type of machine learning model creates new content similar to its training data.

What is a Generative Model?
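
As a minimal sketch of the idea: a bigram model "trained" on a toy corpus (the corpus and sampling scheme here are purely illustrative) can generate new word sequences that resemble its training data.

```python
import random

# A minimal generative model: a bigram table built from a toy corpus,
# then sampled to produce new text resembling the training data.
corpus = "the cat sat . the dog sat . the cat ran .".split()
bigrams = {}
for prev, nxt in zip(corpus, corpus[1:]):
    bigrams.setdefault(prev, []).append(nxt)

def generate(start, n, seed=0):
    """Sample n words, each conditioned on the previous word."""
    rng = random.Random(seed)
    out = [start]
    for _ in range(n):
        out.append(rng.choice(bigrams[out[-1]]))
    return " ".join(out)

sample = generate("the", 4)
```

Every generated token comes from the training corpus, but the sequences themselves can be new combinations.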

100

An operation that converts data and objects, such as text, into a numerical representation.

What is a feature function?
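
A minimal sketch of a feature function, mapping a text to a numeric vector (the three features chosen here are hypothetical examples):

```python
def feature_function(text):
    """Map a text to a numeric feature vector (a toy choice of features)."""
    return [
        len(text),                # length in characters
        text.count(" ") + 1,      # rough word count
        int(text.endswith("?")),  # is it a question?
    ]

vec = feature_function("Is this a question?")
```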

100

In machine learning, this carefully reserved portion of data serves as the final, unseen benchmark to assess how well a model generalizes beyond its training examples.

What is the Test Set?
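
A minimal sketch of reserving a test set by shuffling and splitting (the 80/20 ratio is just a common convention):

```python
import random

def train_test_split(data, test_ratio=0.2, seed=0):
    """Shuffle the data and reserve a fraction as the held-out test set."""
    rng = random.Random(seed)
    shuffled = data[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_ratio))
    return shuffled[:cut], shuffled[cut:]

examples = list(range(10))
train, test = train_test_split(examples, test_ratio=0.2)
```

The test portion is never touched during training; it is only used for the final evaluation.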

100

This evaluation metric measures the proportion of actual positive instances that a machine learning model correctly identified.

What is Recall?
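
As a small worked example (toy labels), recall is the fraction of actual positives the model found:

```python
def recall(y_true, y_pred):
    """Recall = true positives / all actual positives."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp / (tp + fn)

# 3 actual positives, 2 of them found -> recall = 2/3
r = recall([1, 1, 1, 0, 0], [1, 1, 0, 1, 0])
```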

200

This linguistic phenomenon in NLP refers to the many ways humans can express the same core meaning using different words, sentence structures, or grammatical variations.

What is Variability?

200

This predictive modeling technique uses a graph of binary questions and their possible consequences, resembling a flowchart that breaks down complex choices.

What is a Decision Tree?
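
A hand-built sketch of the flowchart idea (the weather features and labels are the classic toy example, not a learned tree):

```python
def classify_weather(outlook, humidity, windy):
    """A tiny hand-built decision tree: each node asks one binary question."""
    if outlook == "sunny":
        return "stay_in" if humidity == "high" else "play"
    if outlook == "rainy":
        return "stay_in" if windy else "play"
    return "play"  # overcast

result = classify_weather("sunny", "high", False)
```

A learned tree chooses these questions automatically, typically by maximizing information gain at each split.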

200

This simple text representation model converts a document into an unordered set of tokens, ignoring grammar and order while tracking their frequencies.

What is "Bag of Words"?
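
A minimal bag-of-words sketch using a frequency counter (tokenization here is simple whitespace splitting):

```python
from collections import Counter

def bag_of_words(text):
    """Lowercase, split on whitespace, and count token frequencies,
    discarding word order entirely."""
    return Counter(text.lower().split())

bow = bag_of_words("the cat sat on the mat")
```

"the cat sat on the mat" and "the mat sat on the cat" produce identical bags, which is exactly the information this representation throws away.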

200

In this problematic scenario, a model performs exceptionally well on training data but fails dramatically when presented with new, independent datasets.

What is Overfitting?

200

This performance metric measures the proportion of positive predictions that are actually correct.

What is Precision?
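
The mirror image of recall, again with toy labels:

```python
def precision(y_true, y_pred):
    """Precision = true positives / all positive predictions."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    return tp / (tp + fp)

# 3 positive predictions, 2 of them correct -> precision = 2/3
p = precision([1, 1, 0, 0, 1], [1, 1, 1, 0, 0])
```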

300

This linguistic law states that the frequency of any word is inversely proportional to its rank in the frequency table of all words in a language.

What is Zipf's Law?
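
A sketch of the rank-frequency table the law describes (the corpus is a toy; the law itself only emerges clearly on large corpora):

```python
from collections import Counter

text = ("the cat and the dog and the bird " * 3).split()
ranked = Counter(text).most_common()  # [(word, freq)], sorted by frequency

# Under Zipf's law, frequency * rank is roughly constant; here we just
# build the rank-frequency table on which the law is stated.
table = [(rank, word, freq) for rank, (word, freq) in enumerate(ranked, start=1)]
```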

300

This advanced ensemble method combines multiple decision trees to create a more robust and accurate predictive model, reducing overfitting and improving generalization.

What is a Random Forest?
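
A sketch of the ensemble idea only: three hypothetical "trees" (plain functions standing in for trees trained on different data subsets) combined by majority vote.

```python
from collections import Counter

# Toy stand-ins for trained trees; a real forest learns each tree on a
# bootstrap sample with random feature subsets.
def tree_a(x): return 1 if x[0] > 0.5 else 0
def tree_b(x): return 1 if x[1] > 0.5 else 0
def tree_c(x): return 1 if x[0] + x[1] > 1.0 else 0

def forest_predict(x, trees=(tree_a, tree_b, tree_c)):
    """Aggregate the trees' votes and return the majority class."""
    votes = Counter(t(x) for t in trees)
    return votes.most_common(1)[0][0]

label = forest_predict([0.9, 0.2])
```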

300

This preprocessing step in bag-of-words often involves removing common words like "the", "a", "I", and converting all text to lowercase to reduce dimensionality and computational complexity.

What is Normalization & Stopword Removal?
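
A minimal preprocessing sketch (the stopword list here is a tiny illustrative subset, not a standard list):

```python
STOPWORDS = {"the", "a", "i", "an", "of"}  # tiny illustrative list

def preprocess(text):
    """Lowercase the text and drop stopwords before building bag-of-words."""
    return [tok for tok in text.lower().split() if tok not in STOPWORDS]

tokens = preprocess("The cat sat on a mat")
```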

300

This fundamental machine learning concept describes the balance between a model's tendency to oversimplify the learned patterns, and its sensitivity to fluctuations and fine changes in training data.

What is the Bias-Variance Tradeoff?

300

This metric, the harmonic mean of precision and recall, balances the trade-off between identifying all relevant instances and keeping positive predictions correct.

What is F1-Score?
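
The harmonic mean punishes imbalance between the two components, as a quick calculation shows:

```python
def f1_score(precision, recall):
    """F1 is the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Perfect recall cannot compensate for mediocre precision:
score = f1_score(0.5, 1.0)  # 2 * 0.5 / 1.5 = 2/3, not the 0.75 arithmetic mean
```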

400

In NLP, these are the unique words or vocabulary items in a text corpus.

What are Types?
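
The type/token distinction in two lines:

```python
tokens = "to be or not to be".split()  # every running word is a token
types = set(tokens)                    # each distinct word is a type

# 6 tokens, but only 4 types: {"to", "be", "or", "not"}
```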

400

This classification method calculates the probability of a data point belonging to a particular class by multiplying the individual feature probabilities with the class prior, assuming the features are conditionally independent.

What is Naive Bayes?
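
A sketch of the computation with hand-set, hypothetical probabilities (a real model estimates these from training counts, usually with smoothing and log-probabilities):

```python
# Toy spam classifier with hand-set probabilities.
prior = {"spam": 0.4, "ham": 0.6}
# P(word | class), assumed conditionally independent given the class.
likelihood = {
    "spam": {"win": 0.30, "money": 0.25, "hello": 0.05},
    "ham":  {"win": 0.02, "money": 0.05, "hello": 0.30},
}

def naive_bayes(words):
    """Score each class by prior * product of per-word likelihoods."""
    scores = {}
    for cls in prior:
        score = prior[cls]
        for w in words:
            score *= likelihood[cls][w]
        scores[cls] = score
    return max(scores, key=scores.get)

label = naive_bayes(["win", "money"])
```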

400

An advanced variation for text representation that weights a word's importance by comparing its frequency within a document to how many other documents in the corpus contain it.

What is TF-IDF?
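
A minimal TF-IDF sketch on a three-document toy corpus (one common variant of the weighting; real implementations differ in smoothing details):

```python
import math

docs = [
    ["the", "cat", "sat"],
    ["the", "dog", "ran"],
    ["the", "cat", "ran"],
]

def tf_idf(word, doc, docs):
    """Term frequency in the document, down-weighted by how many
    documents in the corpus contain the word."""
    tf = doc.count(word) / len(doc)
    df = sum(1 for d in docs if word in d)
    idf = math.log(len(docs) / df)
    return tf * idf

score_the = tf_idf("the", docs[0], docs)  # in every document -> idf = 0
score_cat = tf_idf("cat", docs[0], docs)  # rarer -> positive weight
```

Words that appear everywhere (like "the") get weight zero, which is the point: they carry no discriminating information.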

400

This process involves adjusting the configuration settings of a machine learning model that are not learned from the data itself, but are set before training begins.

What is Hyperparameter Tuning?
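
A sketch of the simplest tuning strategy, grid search; the scoring function below is a hypothetical stand-in for "train a model with these settings and evaluate it on validation data":

```python
from itertools import product

grid = {"learning_rate": [0.01, 0.1, 1.0], "depth": [2, 4]}

def evaluate(learning_rate, depth):
    # Hypothetical score surface, peaked at (0.1, 4), just for illustration.
    return -(learning_rate - 0.1) ** 2 - (depth - 4) ** 2

# Try every combination and keep the best-scoring configuration.
best = max(
    (dict(zip(grid, values)) for values in product(*grid.values())),
    key=lambda params: evaluate(**params),
)
```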

400

This metric measures the quality of binary and multi-class classifications and works effectively even with imbalanced datasets.

What is the Matthews Correlation Coefficient (MCC)?
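
MCC computed directly from the binary confusion matrix; it ranges from -1 (total disagreement) through 0 (chance level) to +1 (perfect prediction):

```python
import math

def mcc(tp, tn, fp, fn):
    """Matthews Correlation Coefficient from binary confusion counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

perfect = mcc(tp=5, tn=5, fp=0, fn=0)  # perfect classifier -> 1.0
```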

500

This phenomenon in linguistics and NLP refers to the use of language to narrow down or limit the set of possible referents or meanings, often seen with modifiers like adjectives or negations. 

What is Restrictivity?

500

This conceptual line or surface in feature space separates different class predictions in a classification model.

What is a Decision Boundary?
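
For a linear classifier the boundary is concrete: the set of points where w·x + b = 0 (the weights below are hypothetical, as if already learned):

```python
w = [1.0, -1.0]  # hypothetical learned weights
b = 0.0

def predict(x):
    """The predicted class flips as x crosses the line w . x + b = 0."""
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if score > 0 else 0

# Points on opposite sides of the line x1 = x2 get different labels.
a = predict([2.0, 1.0])
c = predict([1.0, 2.0])
```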

500

A text representation approach that captures semantic relationships between words and contextual nuances.

What are Word Embeddings?
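
A sketch of how embeddings encode similarity, using tiny hypothetical 3-dimensional vectors (real embeddings have hundreds of dimensions and are learned from data):

```python
import math

emb = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.8, 0.9, 0.1],
    "apple": [0.1, 0.0, 0.9],
}

def cosine(u, v):
    """Semantic similarity as the cosine of the angle between vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

royal = cosine(emb["king"], emb["queen"])
fruit = cosine(emb["king"], emb["apple"])
```

Semantically related words end up close together in the vector space, so `royal` comes out much larger than `fruit`.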

500

This dataset subset is used during model training to provide an intermediate performance evaluation and help prevent overfitting.

What is a Validation Set?

500

This evaluation technique repeatedly re-splits the dataset into different training and held-out folds to provide a more robust and statistically reliable assessment of model performance.

What is Cross-Validation?
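
A sketch of k-fold index generation: each fold serves once as the held-out evaluation set while the remaining folds train the model (real implementations usually shuffle first).

```python
def k_fold_indices(n, k):
    """Split indices 0..n-1 into k folds and yield (train, held_out) pairs."""
    indices = list(range(n))
    folds = [indices[i::k] for i in range(k)]
    splits = []
    for i, held_out in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        splits.append((train, held_out))
    return splits

splits = k_fold_indices(n=6, k=3)
```

Averaging the evaluation score over all k splits gives a more stable estimate than any single train/test split.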