Math (Prob/Stats/Linear Algebra)
Data Science Riddles/Resume Buzzwords
Machine Learning
Women in Data Science/DS News/Latest trends
SQL, Big Data/Data Engineering
100

The statistical rule of thumb stating that roughly 80% of consequences originate from 20% of causes

What is the Pareto Principle?

100

I make your training accuracy look amazing, but when new data arrives, I embarrass you.

What is overfitting?
100

The model that finds a hyperplane maximizing the margin between classes in a high-dimensional space

What is a Support Vector Machine (SVM)?

100

Pioneering U.S. Navy computer scientist who helped develop early compilers and popularized the idea of machine-independent programming languages

Who is Grace Hopper?

100

The clause used to filter groups after aggregation functions like COUNT or SUM are applied

What is HAVING?
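
To make the distinction concrete, here is a minimal sketch using Python's built-in `sqlite3` module. The `orders` table and its rows are hypothetical, invented purely for illustration. It shows `HAVING` keeping only the groups whose aggregate passes the test.

```python
import sqlite3

# Hypothetical in-memory table, invented for this illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("ana", 10.0), ("ana", 25.0), ("ben", 5.0),
     ("cy", 40.0), ("cy", 1.0), ("cy", 2.0)],
)

# WHERE filters rows before grouping; HAVING filters groups
# after COUNT/SUM have been computed for each group.
big_customers = conn.execute(
    """
    SELECT customer, COUNT(*) AS n_orders, SUM(amount) AS total
    FROM orders
    GROUP BY customer
    HAVING COUNT(*) >= 2
    ORDER BY customer
    """
).fetchall()
print(big_customers)
```

The single-order customer is dropped because the filter runs on the grouped result, not on individual rows.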

200

True or False? Given a matrix A, the dimension of Col(A) equals the number of pivot columns of A.

True
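
One way to check this on a small case is to row-reduce a matrix and count the pivots. The sketch below is a plain Gaussian elimination over exact fractions; the helper name `pivot_columns` and the example matrix are assumptions made up for this illustration.

```python
from fractions import Fraction

def pivot_columns(matrix):
    """Return the indices of the pivot columns found by Gaussian elimination."""
    rows = [[Fraction(x) for x in row] for row in matrix]
    n_cols = len(rows[0]) if rows else 0
    pivots = []
    r = 0  # index of the next row that can receive a pivot
    for c in range(n_cols):
        if r == len(rows):
            break
        # Look for a row at or below r with a nonzero entry in column c.
        p = next((i for i in range(r, len(rows)) if rows[i][c] != 0), None)
        if p is None:
            continue  # no pivot in this column
        rows[r], rows[p] = rows[p], rows[r]
        # Eliminate the entries below the pivot.
        for i in range(r + 1, len(rows)):
            factor = rows[i][c] / rows[r][c]
            rows[i] = [a - factor * b for a, b in zip(rows[i], rows[r])]
        pivots.append(c)
        r += 1
    return pivots

# Hypothetical example: the third column is the sum of the first two.
A = [[1, 2, 3],
     [2, 4, 6],
     [1, 0, 1]]
pivots = pivot_columns(A)
print(pivots, len(pivots))
```

Here the first two columns are pivot columns, so Col(A) has dimension 2, and those two columns of the original A form a basis for it.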

200

I am the average squared difference between predictions and actual values and punish large errors especially harshly. Linear regression models often try to minimize me.

What is Mean Squared Error (MSE)?
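
The "punishes large errors" part is easy to see numerically. In this small sketch (the function name and the sample values are illustrative assumptions), both prediction sets have the same total absolute error, yet the one with a single large miss scores worse.

```python
def mean_squared_error(y_true, y_pred):
    """Average of the squared differences between targets and predictions."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

y_true = [3.0, 5.0, 7.0]
# Three errors of size 1 (total absolute error 3).
small_errors = mean_squared_error(y_true, [4.0, 6.0, 8.0])
# One error of size 3 (total absolute error also 3).
one_big_error = mean_squared_error(y_true, [3.0, 5.0, 10.0])
print(small_errors, one_big_error)
```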

200

The method that reduces the dimensionality of data by projecting it onto orthogonal components that maximize variance

What is Principal Component Analysis (PCA)?

200

Stanford professor and AI pioneer who spearheaded the creation of the ImageNet dataset, widely credited with catalyzing the modern deep learning boom

Who is Fei-Fei Li?

200

When writing a window function in SQL, the clause that is used to divide the result set into subsets before the function is applied

What is PARTITION BY?

300

The probability distribution that is typically used to model waiting time in a call center

What is the exponential distribution?

300

I am a regularization technique that adds a penalty proportional to the sum of the absolute values of the coefficients, often shrinking some to exactly zero and so performing built-in feature selection.

What is LASSO Regression (L1 Regularization)?
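
Why exact zeros appear can be sketched with the soft-thresholding operator. This assumes a special case that is not stated in the clue: an orthonormal design matrix and the objective ½‖y − Xβ‖² + λ‖β‖₁, where the LASSO solution is the soft-thresholded least-squares coefficients. The coefficient values below are made up for illustration.

```python
def soft_threshold(z, lam):
    """Shrink z toward zero by lam, and clamp to exactly zero inside [-lam, lam]."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0

# Hypothetical least-squares coefficients (orthonormal-design assumption).
ols_coefs = [3.0, -0.4, 0.1, -2.5]
lasso_coefs = [soft_threshold(b, 0.5) for b in ols_coefs]
print(lasso_coefs)
```

Small coefficients land exactly on zero, which is the built-in feature selection; large ones survive but are shrunk by the penalty.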

300

The ensemble learning method that combines predictions from multiple models trained on bootstrap samples to improve performance and reduce variance

What is Bagging (Bootstrap Aggregating)?

300

Pioneering nurse and statistician who transformed Crimean War mortality data into the “Nightingale Rose Diagram,” revealing that most soldier deaths were caused by preventable disease rather than combat

Who is Florence Nightingale?

300

Window function that assigns a unique sequential number to rows within a partition, often used to select the nth record within each group

What is ROW_NUMBER()?
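
As a rough illustration of the "nth record within each group" pattern, the sketch below runs a window query through Python's built-in `sqlite3` module. It assumes the bundled SQLite supports window functions (version 3.25 or later), and the `sales` table is hypothetical.

```python
import sqlite3

# Hypothetical in-memory table, invented for this illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, rep TEXT, amount INTEGER)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [("east", "ana", 300), ("east", "ben", 500), ("east", "cy", 400),
     ("west", "dee", 200), ("west", "eli", 900)],
)

# Number the rows inside each region by descending amount,
# then keep the second row of every partition.
second_best = conn.execute(
    """
    SELECT region, rep, amount
    FROM (
        SELECT region, rep, amount,
               ROW_NUMBER() OVER (
                   PARTITION BY region ORDER BY amount DESC
               ) AS rn
        FROM sales
    ) AS ranked
    WHERE rn = 2
    ORDER BY region
    """
).fetchall()
print(second_best)
```

The numbering restarts at 1 in each partition, so filtering on `rn = 2` picks the runner-up per region.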

400

The statistical phenomenon that describes a trend that appears strongly in several separate groups of data but disappears or reverses when the groups are combined

What is Simpson’s Paradox?
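
A tiny numeric sketch shows how the reversal can happen. The success counts below are hypothetical, chosen only so that the group sizes are unbalanced.

```python
# (successes, trials) for two options across two groups; hypothetical counts.
data = {
    "A": {"group1": (8, 10), "group2": (30, 100)},
    "B": {"group1": (70, 100), "group2": (2, 10)},
}

def rate(successes, trials):
    return successes / trials

def pooled_rate(option):
    """Success rate after combining both groups for one option."""
    successes = sum(s for s, _ in data[option].values())
    trials = sum(n for _, n in data[option].values())
    return successes / trials

for group in ("group1", "group2"):
    print(group, rate(*data["A"][group]), rate(*data["B"][group]))
print("pooled", pooled_rate("A"), pooled_rate("B"))
```

Option A wins inside each group, yet B wins once the groups are pooled, because most of A's trials fall in the group where both options do poorly.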

400

I help neural networks learn by sending errors backward through the layers.

What is Backpropagation?

400

The optimization method that estimates the gradient of the expected loss using a randomly sampled training example or mini-batch

What is Stochastic Gradient Descent (SGD)?
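
A minimal sketch of the idea, assuming a one-feature linear model with squared-error loss and one randomly sampled example per step. The function name, learning rate, step count, and toy data are all illustrative choices, not a reference implementation.

```python
import random

def sgd_linear_fit(xs, ys, lr=0.05, steps=5000, seed=0):
    """Fit y ~ w * x + b by updating on one randomly chosen example per step."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    for _ in range(steps):
        i = rng.randrange(len(xs))      # sample a single training example
        err = (w * xs[i] + b) - ys[i]   # prediction error on that example
        # Gradient of the single-example loss err**2 with respect to w and b.
        w -= lr * 2 * err * xs[i]
        b -= lr * 2 * err
    return w, b

# Noise-free toy data generated from y = 2x + 1.
xs = [0.0, 0.5, 1.0, 1.5, 2.0]
ys = [2 * x + 1 for x in xs]
w, b = sgd_linear_fit(xs, ys)
print(w, b)
```

Each step uses a noisy but cheap gradient estimate; on this noise-free data the estimates settle close to the slope 2 and intercept 1 used to generate it.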

400

The data scientist who co-founded the global conference series Women in Data Science (WiDS)

Who is Margot Gerritsen?

400

In relational database design, the normalization step that requires that each table cell contain a single atomic value and eliminates repeating groups

What is First Normal Form (1NF)?

500

The expected number of Bernoulli trials needed to obtain the first success if the probability of success is 5/34

What is 34/5?
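
The answer follows from the mean of a geometric distribution, E[X] = 1/p. The sketch below computes it exactly and, as a sanity check, compares it with a truncated version of the defining sum Σ k·p·(1−p)^(k−1); the truncation point of 2000 terms is an arbitrary choice.

```python
from fractions import Fraction

p = Fraction(5, 34)

# Mean of a geometric distribution (number of trials up to the first success).
expected_trials = 1 / p
print(expected_trials, float(expected_trials))

# Numeric check: partial sum of k * P(first success on trial k).
pf = float(p)
partial_sum = sum(k * pf * (1 - pf) ** (k - 1) for k in range(1, 2000))
print(partial_sum)
```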

500

I help models learn step by step. Each new model focuses on correcting the mistakes of the previous one.

What is Gradient Boosting?

500

The highly popular tree-based ensemble algorithm that uses a regularized gradient boosting framework and became famous for dominating Kaggle tabular data competitions

What is XGBoost?

500

University of Washington statistician who coauthored the textbook An Introduction to Statistical Learning and researches causal inference and high-dimensional statistics

Who is Daniela Witten?

500

The distributed data optimization that ensures that records with the same key from multiple datasets are placed in the same partition, reducing expensive shuffle operations during joins

What is Copartitioning?
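
The idea can be sketched in plain Python without any distributed framework. This is not the API of any particular engine; the partition count, the CRC32-based partitioner, and the `users` and `orders` records are all assumptions made up for illustration.

```python
import zlib

NUM_PARTITIONS = 4  # both datasets must use the same count

def partition_of(key):
    """Deterministic partitioner shared by both datasets."""
    return zlib.crc32(key.encode("utf-8")) % NUM_PARTITIONS

def partition(records):
    """Place each (key, value) record in the partition chosen by its key."""
    parts = [[] for _ in range(NUM_PARTITIONS)]
    for key, value in records:
        parts[partition_of(key)].append((key, value))
    return parts

# Hypothetical datasets keyed by user id.
users = [("u1", "Ana"), ("u2", "Ben"), ("u3", "Cy")]
orders = [("u1", 30), ("u3", 12), ("u1", 7), ("u2", 99)]

user_parts = partition(users)
order_parts = partition(orders)

# Matching keys always share a partition index, so the join can run
# partition by partition with no data moving between partitions.
joined = []
for user_part, order_part in zip(user_parts, order_parts):
    names = dict(user_part)
    joined.extend((k, names[k], amt) for k, amt in order_part if k in names)
joined.sort()
print(joined)
```

Because both sides use the same partitioner and the same number of partitions, each partition pair can be joined locally, which is what avoids the shuffle.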