What is Data Science / Data Analysis
Big Data
Statistics
Probability
Gen AI Pt. 1
Gen AI Pt. 2
ML Pt. 1
ML Pt. 2
Data Visualization
Ethics
100

This is the practice of deriving valuable and actionable insights from data through interdisciplinary knowledge in programming, math/stats, and specialized domains.

Data science

100

Big data is largely defined by its volume, ____, and variety.

velocity

100

Statistical term for "average".

Mean

100

Two events are _____ if they cannot both occur simultaneously.

mutually exclusive

100

What does LLM stand for?

Large Language Model

100

LLM architecture that preceded the transformer

recurrent neural net (RNN)

100

Regression and classification are examples of this kind of machine learning.

supervised

100

This standard classification evaluation metric looks at the ratio of correctly classified examples to the total number of examples.

Accuracy

100

This visualization displays counts of categories.

Bar chart

100

The problematic feature in the healthcare risk score algorithm discussed in "Dissecting racial bias in an algorithm used to manage the health of populations".

Medical cost

200

Describe the effect of the following:

df['column'] = df['column'].apply(lambda x: x**2)

Squares each value in the field 'column'.
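
A minimal sketch of the expression above on a toy dataframe. The sample values are made up for illustration.

```python
import pandas as pd

# Hypothetical three-row dataframe; only the column name comes from the clue.
df = pd.DataFrame({"column": [1, 2, 3]})

# apply() calls the lambda once per value and returns a new Series,
# which is assigned back over the original column.
df["column"] = df["column"].apply(lambda x: x**2)

print(df["column"].tolist())  # [1, 4, 9]
```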

200

When a data processing task is distributed across multiple computers, this is known as ____ processing.

parallel

200

The difference between the third and first quartiles. 

Interquartile range
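
As a rough illustration, the interquartile range can be computed with Python's standard library. The data and the choice of the "inclusive" quartile method are assumptions; other quartile conventions can give slightly different values.

```python
import statistics

# Hypothetical sample, already sorted for readability.
data = [1, 2, 3, 4, 5, 6, 7, 8, 9]

# quantiles(..., n=4) returns the three cut points Q1, Q2 (median), Q3.
q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")

iqr = q3 - q1
print(q1, q3, iqr)  # 3.0 7.0 4.0
```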

200

You roll a die. 

P(roll 1) + P(roll 2) + P(roll 3) + P(roll 4) + P(roll 5) + P(roll 6) = ?

1

200

In simple terms, describe what an LLM does.

It excels at generating text by predicting the next token based on context information.

200

The name of the process for training an LLM to behave according to human values.

alignment fine-tuning

200

Simple linear regression optimizes this loss function.

Mean squared error

200

A way of validating model performance on several unseen datasets by dividing our training data into K sections.

K-fold cross validation

200

This visualization displays counts of quantitative variables.

Histogram

200

Fairness through ______ is the practice of trying to achieve fairness & eliminate bias by removing sensitive features from a model.

unawareness

300

This is the first step of the data science lifecycle.

Ask a question

300

Define ELT

Extract, load, transform

300

True or False: A population's true mean will always fall within a sample of the population.

False

300

Statistical philosophy that views probabilities as subjective beliefs that can be updated.

Bayesian

300

The pretraining process of an LLM is an example of ____ learning.

self-supervised

300

The mechanism used by transformers to dynamically construct the context for each word in a sentence in parallel is called _____.

self-attention

300

This method is a good baseline for classifying textual data.

Naive Bayes Classification

300

Computational algorithm for finding model parameters that minimize a loss function.

Gradient descent
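
A small sketch of gradient descent minimizing mean squared error for a one-feature linear model. The data, learning rate, and step count are illustrative assumptions, not values from the course.

```python
# Hypothetical noiseless data generated from y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [2.0 * x + 1.0 for x in xs]

w, b = 0.0, 0.0   # model parameters: slope and intercept
lr = 0.05         # learning rate (assumed; chosen small enough to converge here)
n = len(xs)

for _ in range(2000):
    # Prediction errors under the current parameters.
    errors = [(w * x + b) - y for x, y in zip(xs, ys)]
    # Partial derivatives of MSE = (1/n) * sum(error^2) with respect to w and b.
    grad_w = (2.0 / n) * sum(e * x for e, x in zip(errors, xs))
    grad_b = (2.0 / n) * sum(errors)
    # Step against the gradient to reduce the loss.
    w -= lr * grad_w
    b -= lr * grad_b

print(round(w, 3), round(b, 3))  # 2.0 1.0
```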

300

A declarative visualization language in Python.

Altair

300

Describe reject option classification.

A post-processing bias mitigation technique that adjusts a model's classification output for examples within a given confidence band. Protected examples are classified favorably and unprotected examples are classified unfavorably.

400

The task of communicating the results of analysis to the appropriate stakeholders is called the last ____ problem.

mile

400

This is the process of standardizing and harmonizing data across systems within an enterprise to boost data usability, integrity, and security.

Data governance

400

When using simple linear regression to predict y from x, this term describes the difference between the observed value of y and the predicted value of y for a given x.

Residual

400

P(A ∪ B) = ?

P(A) + P(B) - P(A ∩ B)
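
The addition rule can be checked by brute force on a fair six-sided die. The events chosen here (A = even roll, B = roll greater than 3) are assumptions made for illustration.

```python
from fractions import Fraction

outcomes = {1, 2, 3, 4, 5, 6}
A = {2, 4, 6}   # hypothetical event: roll is even
B = {4, 5, 6}   # hypothetical event: roll is greater than 3

def prob(event):
    # Each outcome of a fair die is equally likely.
    return Fraction(len(event), len(outcomes))

union_direct = prob(A | B)
union_by_rule = prob(A) + prob(B) - prob(A & B)

print(union_direct, union_by_rule)  # 2/3 2/3
```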

400

This parameter influences the conditional probability distribution for an LLM's next-word selection.

temperature
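
One common way temperature is applied is to divide the model's raw scores (logits) by it before the softmax. The sketch below assumes that formulation and uses made-up logits; real LLM sampling pipelines may add further steps.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Turn raw scores into probabilities, scaled by a temperature > 0."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)                       # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]                  # hypothetical scores for three candidate tokens

sharp = softmax_with_temperature(logits, 0.5)  # low temperature: more peaked
flat = softmax_with_temperature(logits, 2.0)   # high temperature: more uniform

print([round(p, 3) for p in sharp])
print([round(p, 3) for p in flat])
```

Lower temperatures concentrate probability on the top-scoring token, while higher temperatures spread it across the alternatives.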

400

This form of behavioral training uses a reward model to score multiple LLM outputs for each prompt drawn from a prompt dataset.

Reinforcement Learning from Human Feedback (RLHF)

400

In this kind of machine learning, a policy is learned to maximize an agent's reward.

Reinforcement Learning

400

In classification, this is used to summarize the number of true positives, false positives, true negatives, and false negatives.

confusion matrix
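
A minimal sketch of tallying the four confusion-matrix cells for a binary classifier, using invented labels and predictions. Accuracy falls out as the share of examples on the matrix diagonal.

```python
# Hypothetical ground-truth labels and model predictions (1 = positive class).
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

counts = {"TP": 0, "FP": 0, "TN": 0, "FN": 0}
for truth, pred in zip(y_true, y_pred):
    if pred == 1:
        counts["TP" if truth == 1 else "FP"] += 1
    else:
        counts["TN" if truth == 0 else "FN"] += 1

# Accuracy = correctly classified examples / all examples.
accuracy = (counts["TP"] + counts["TN"]) / len(y_true)

print(counts)    # {'TP': 3, 'FP': 1, 'TN': 3, 'FN': 1}
print(accuracy)  # 0.75
```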

400

This visualization measures magnitudes of values using a color gradient.

Heatmap

400

The general name for bias mitigation techniques that occur before a model is trained.

Pre-processing methods

500

Write an expression that groups a dataframe 'df' by 'column' and sums a column named 'count' in each group.

df.groupby('column')['count'].sum()
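
A quick sketch of the groupby expression on a toy dataframe. The rows are invented; only the names 'df', 'column', and 'count' come from the clue.

```python
import pandas as pd

# Hypothetical data with two groups, "a" and "b".
df = pd.DataFrame({
    "column": ["a", "b", "a", "b", "a"],
    "count": [1, 2, 3, 4, 5],
})

# Split rows by 'column', select 'count' in each group, then sum it.
totals = df.groupby("column")["count"].sum()

print(totals.to_dict())  # {'a': 9, 'b': 6}
```

The bracket form ['count'] is used because df.groupby('column').count is already a method name on the grouped object.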

500

This is an online analytics data warehouse solution on the Google Cloud Platform.

BigQuery

500

This value is represented by an "r", and is bound between -1 and 1.

Correlation coefficient

500

Name the rule:

P(A | B) = P(A) * P(B | A) / P(B)

Bayes'
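
A worked example of the rule with invented numbers: a condition with 1% prevalence, a test that flags 90% of true cases, and a 5% false-positive rate. All three figures are assumptions for illustration.

```python
from fractions import Fraction

p_a = Fraction(1, 100)              # P(A): prior probability of the condition
p_b_given_a = Fraction(9, 10)       # P(B | A): positive test given the condition
p_b_given_not_a = Fraction(5, 100)  # P(B | not A): false-positive rate

# Law of total probability gives the denominator P(B).
p_b = p_a * p_b_given_a + (1 - p_a) * p_b_given_not_a

# Bayes' rule: P(A | B) = P(A) * P(B | A) / P(B)
p_a_given_b = p_a * p_b_given_a / p_b

print(p_a_given_b)         # 2/13
print(float(p_a_given_b))  # roughly 0.154
```

Even with a fairly accurate test, the posterior stays low because the condition is rare to begin with.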

500

This term is used to describe when an LLM generates false information.

Hallucination

500

The founding father of artificial intelligence

Alan Turing

500

Multiple linear regression is _____

a statistical method for predicting numerical outputs based on two or more input features.

500

Loss function used in logistic regression.

Cross-entropy

500

This visualization is suitable for time series analysis.

Line chart

500

Measure of a model's fairness also called the p% rule.

Disparate impact
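
Disparate impact is often computed as the ratio of favorable-outcome rates between an unprivileged and a privileged group. The sketch below assumes that definition, made-up counts, and the widely used 80% threshold; the exact p used in a given course or library may differ.

```python
def disparate_impact(fav_unpriv, n_unpriv, fav_priv, n_priv):
    """Ratio of favorable-outcome rates: unprivileged group / privileged group."""
    rate_unpriv = fav_unpriv / n_unpriv
    rate_priv = fav_priv / n_priv
    return rate_unpriv / rate_priv

# Hypothetical outcomes: 30 of 100 unprivileged and 50 of 100 privileged
# applicants receive the favorable decision.
ratio = disparate_impact(30, 100, 50, 100)

threshold = 0.8  # assumed p = 80%, the common "four-fifths" convention
passes_rule = ratio >= threshold

print(round(ratio, 2), passes_rule)  # 0.6 False
```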