What is Data Science / Data Analysis
Big Data
Statistics
Probability
Gen AI Pt. 1
Gen AI Pt. 2
ML Pt. 1
ML Pt. 2
Data Visualization
Ethics
100

This is the practice of deriving valuable and actionable insights from data through interdisciplinary knowledge in programming, math/stats, and specialized domains.

Data science

100

Big data is largely defined by its volume, ____, and variety.

velocity

100

Statistical term for "average".

Mean

100

Two events are _____ if they cannot both occur simultaneously.

mutually exclusive

100

What does LLM stand for?

Large Language Model

100

What does RAG stand for?

Retrieval-Augmented Generation

100

Regression and classification are examples of this kind of machine learning.

supervised

100

Technique for limiting model complexity.

Regularization

100

This visualization displays counts of categories.

Bar chart

100

The problematic feature in the healthcare risk score algorithm discussed in "Dissecting racial bias in an algorithm used to manage the health of populations".

Medical cost

200

Describe the effect of the following:

df['column'] = df['column'].apply(lambda x: x**2)

Squares each value in the field 'column'.
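A minimal sketch of the clue above, assuming a small made-up DataFrame (the values are hypothetical; only the column name 'column' comes from the clue):

```python
import pandas as pd

# Hypothetical toy data; only the column name 'column' comes from the clue.
df = pd.DataFrame({"column": [1, 2, 3, 4]})

# apply() calls the lambda once per value and returns a new Series.
df["column"] = df["column"].apply(lambda x: x ** 2)

squared = df["column"].tolist()
print(squared)
```

For plain arithmetic like this, `df["column"] ** 2` gives the same result without a per-element Python call.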

200

When a data processing task is distributed across multiple computers, this is known as ____ processing.

parallel

200

The difference between the third and first quartiles. 

Interquartile range
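As a rough illustration, the interquartile range can be computed with the standard library. The data below is made up, and the choice of the "inclusive" quartile method is an assumption; other quartile conventions can give slightly different values.

```python
import statistics

# Hypothetical sample, chosen so the quartiles are easy to check by hand.
data = [1, 2, 3, 4, 5, 6, 7, 8, 9]

# quantiles(..., n=4) returns the three cut points Q1, Q2 (median), Q3.
q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")

# The IQR is the spread of the middle half of the data.
iqr = q3 - q1
print(q1, q3, iqr)
```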

200

You roll a die. 

P(roll 1) + P(roll 2) + P(roll 3) + P(roll 4) + P(roll 5) + P(roll 6) = ?

1

200

In simple terms, describe what an LLM does.

It excels at generating text by predicting the next token based on context information.

200

Where are the documents for a RAG system stored?

In a vector database

200

Simple linear regression minimizes this loss function.

Mean squared error
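A small sketch of what "minimizing mean squared error" means, using the closed-form least-squares fit for one predictor. The data points and function names are hypothetical, chosen only for illustration.

```python
def fit_simple_linear_regression(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def mean_squared_error(xs, ys, slope, intercept):
    """Average of the squared residuals (observed y minus predicted y)."""
    residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]
    return sum(r ** 2 for r in residuals) / len(residuals)


# Hypothetical data that roughly follows y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.1, 4.9, 7.0]

slope, intercept = fit_simple_linear_regression(xs, ys)
best_mse = mean_squared_error(xs, ys, slope, intercept)

# Any other line, such as one with a perturbed slope, has a larger MSE.
worse_mse = mean_squared_error(xs, ys, slope + 0.5, intercept)
```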

200

A way of validating model performance on several unseen datasets by dividing our training data into K sections.

K-fold cross-validation

200

This visualization displays counts of quantitative variables.

Histogram

200

Fairness through ______ is the practice of trying to achieve fairness & eliminate bias by removing sensitive features from a model.

unawareness

300

This is the first step of the data science lifecycle.

Ask a question

300

This big data computer technology is an extension of the MapReduce model and has in-memory processing capabilities. 

Apache Spark

300

True or False: A population's true mean will always fall within the range of values observed in a sample of that population.

False

300

Statistical philosophy that views probabilities as subjective beliefs that can be updated.

Bayesian

300

The pretraining process of an LLM is an example of ____ learning.

self-supervised

300

The process of splitting a document into smaller pieces is called _____.

chunking
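One simple way to chunk is by fixed character windows with some overlap. This is only a sketch; the function name and parameters are hypothetical, and real pipelines often split on sentences or tokens instead.

```python
def chunk_text(text, chunk_size, overlap=0):
    """Split text into windows of chunk_size characters that share `overlap` characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")
    step = chunk_size - overlap
    # The final window may be shorter than chunk_size.
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]


chunks = chunk_text("abcdefghij", chunk_size=4, overlap=1)
print(chunks)
```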

300

Clustering is an example of this kind of machine learning.

unsupervised

300

Algorithm for finding model parameters that minimize a loss function.

Gradient descent
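A bare-bones sketch of gradient descent on a one-parameter loss. The loss function, learning rate, and step count are assumptions picked so the result is easy to verify.

```python
def loss(w):
    # Hypothetical loss with a single minimum at w = 3.
    return (w - 3.0) ** 2


def gradient(w):
    # Derivative of the loss with respect to w.
    return 2.0 * (w - 3.0)


w = 0.0              # starting guess
learning_rate = 0.1  # step size (an assumed value)

for _ in range(100):
    # Move against the gradient, i.e. downhill on the loss surface.
    w -= learning_rate * gradient(w)

print(w, loss(w))
```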

300

A declarative visualization language in Python.

Altair

300

True or False: Redlining on a racial basis is illegal.

True

400

The task of communicating the results of analysis to the appropriate stakeholders is called the last ____ problem.

mile

400

This is the process of standardizing and harmonizing data across systems within an enterprise to boost data usability, integrity, and security.

Data governance

400

When using simple linear regression to predict y from x, this term describes the difference between the observed value of y and the predicted value of y for a given x.

Residual

400

P(A ∪ B) = ?

P(A) + P(B) - P(A ∩ B)
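The addition rule can be checked by brute force on a fair die. The two events below are hypothetical examples; exact fractions avoid rounding issues.

```python
from fractions import Fraction

outcomes = {1, 2, 3, 4, 5, 6}  # fair six-sided die
A = {2, 4, 6}                  # hypothetical event: roll is even
B = {4, 5, 6}                  # hypothetical event: roll is greater than 3


def prob(event):
    # Each outcome is equally likely, so probability is a simple ratio.
    return Fraction(len(event), len(outcomes))


p_union_direct = prob(A | B)
p_union_rule = prob(A) + prob(B) - prob(A & B)
print(p_union_direct, p_union_rule)
```

Subtracting the intersection corrects for outcomes (here 4 and 6) that would otherwise be counted twice.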

400

This parameter influences the conditional probability distribution for an LLM's next-word selection.

temperature
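A sketch of how temperature reshapes a next-token distribution, assuming the common formulation in which logits are divided by the temperature before a softmax. The logits are made up.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities after scaling by 1 / temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.1]  # hypothetical scores for three candidate tokens

sharp = softmax_with_temperature(logits, 0.5)  # low temperature: more peaked
flat = softmax_with_temperature(logits, 2.0)   # high temperature: more uniform
```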

400

Describe the primary use case for RAG systems.

Question answering (Q&A)

400

A reinforcement learning problem that requires a balance of exploration and exploitation.

The Multi-Armed Bandit

400

In logistic regression, this is the function that transforms values to be between 0 and 1.

sigmoid
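For reference, a minimal sketch of the sigmoid function. The branch on the sign of z is an implementation choice to avoid overflow for large negative inputs.

```python
import math


def sigmoid(z):
    """Map any real number to the interval (0, 1)."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    # Equivalent form that avoids exponentiating a large positive number.
    ez = math.exp(z)
    return ez / (1.0 + ez)


print(sigmoid(-5.0), sigmoid(0.0), sigmoid(5.0))
```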

400

This visualization measures magnitudes of values using a color gradient.

Heatmap

400

Joy Buolamwini researched fairness in this application of machine learning.

Facial recognition

500

Write an expression that groups a dataframe 'df' by 'column' and sums a column named 'count' in each group.

df.groupby('column')['count'].sum()
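The expression above can be tried on a toy frame. The data values are hypothetical; the column names 'column' and 'count' follow the clue.

```python
import pandas as pd

# Hypothetical toy data; column names follow the clue.
df = pd.DataFrame({
    "column": ["a", "a", "b"],
    "count": [1, 2, 5],
})

# One row per distinct value of 'column', holding the sum of 'count'.
totals = df.groupby("column")["count"].sum()

print(totals.to_dict())
```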

500

This is an online analytics data warehouse solution on the Google Cloud Platform.

BigQuery

500

This value is represented by an "r", and is bounded between -1 and 1.

Correlation coefficient
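A short sketch of the Pearson correlation coefficient computed from its definition. The data is made up so that the extreme values of r are easy to see.

```python
import math


def pearson_r(xs, ys):
    """Covariance of x and y divided by the product of their spreads."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical data: a perfectly increasing and a perfectly decreasing line.
r_pos = pearson_r([1, 2, 3, 4], [2, 4, 6, 8])
r_neg = pearson_r([1, 2, 3, 4], [8, 6, 4, 2])
```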

500

Name the rule:

P(A | B) = (P(A) * P(B | A)) / P(B)

Bayes'
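A worked instance of the rule with made-up numbers: A is "has the condition" and B is "tests positive". All three input probabilities are hypothetical.

```python
from fractions import Fraction

p_a = Fraction(1, 100)             # hypothetical prior P(A)
p_b_given_a = Fraction(9, 10)      # hypothetical P(B | A)
p_b_given_not_a = Fraction(1, 20)  # hypothetical P(B | not A)

# Law of total probability gives the denominator P(B).
p_b = p_a * p_b_given_a + (1 - p_a) * p_b_given_not_a

# Bayes' rule: P(A | B) = P(A) * P(B | A) / P(B)
p_a_given_b = p_a * p_b_given_a / p_b
print(p_b, p_a_given_b)
```

Even with a fairly accurate test, the posterior stays small here because the prior is small.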

500

This term is used to describe when an LLM generates false information.

Hallucination

500

RAGAs is a method for evaluating the quality of an LLM's generated output. It focuses on three aspects of answer quality. Name two of them.

faithfulness, answer relevance, context relevance

500

In the multiple linear regression setting, the vector of predicted values is the _____ of the observed vector on the span of the design matrix. (think back to the geometric visual I provided!)

orthogonal projection

500

Loss function used in logistic regression.

Cross-entropy
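A minimal sketch of binary cross-entropy (log loss) for predicted probabilities. The clipping constant `eps` and the example values are assumptions.

```python
import math


def binary_cross_entropy(y_true, y_prob, eps=1e-12):
    """Average negative log-likelihood of 0/1 labels under predicted probabilities."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)  # keep log() away from 0
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)


# Hypothetical labels with two sets of predictions.
uncertain = binary_cross_entropy([1, 0], [0.5, 0.5])
confident = binary_cross_entropy([1, 0], [0.9, 0.1])
```

Predictions that are confident and correct are rewarded with a lower loss than hedged ones.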

500

This visualization is suitable for time series analysis.

Line chart

500

Measure of a model's fairness also called the p% rule.

Disparate impact
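As a sketch, disparate impact is commonly computed as the ratio of positive-prediction rates between an unprivileged and a privileged group, with the p% rule asking that ratio to be at least p/100 (often 80%). The group predictions below are made up.

```python
def positive_rate(predictions):
    """Fraction of 0/1 predictions that are positive."""
    return sum(predictions) / len(predictions)


def disparate_impact(unprivileged_preds, privileged_preds):
    """Ratio of positive rates: unprivileged group over privileged group."""
    return positive_rate(unprivileged_preds) / positive_rate(privileged_preds)


# Hypothetical model outputs (1 = favorable outcome) for two groups.
ratio = disparate_impact([1, 0, 0, 0], [1, 1, 0, 0])

# Check against the commonly cited 80% threshold.
passes_80_percent_rule = ratio >= 0.8
print(ratio, passes_80_percent_rule)
```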