This is the practice of deriving valuable and actionable insights from data through interdisciplinary knowledge in programming, math/stats, and specialized domains.
Data science
Big data is largely defined by its volume, ____, and variety.
velocity
Statistical term for "average".
Mean
Two events are _____ if they cannot both occur simultaneously.
mutually exclusive
What does LLM stand for?
Large Language Model
LLM architecture that preceded the transformer
recurrent neural net (RNN)
Regression and classification are examples of this kind of machine learning.
supervised
This standard classification evaluation metric looks at the ratio of correctly classified examples to the total number of examples.
Accuracy
This visualization displays counts of categories.
Bar chart
The problematic feature in the healthcare risk score algorithm discussed in "Dissecting racial bias in an algorithm used to manage the health of populations".
Medical cost
Describe the effect of the following:
df['column'] = df['column'].apply(lambda x: x**2)
Squares each value in the field 'column'.
When a data processing task is distributed across multiple computers, this is known as ____ processing.
parallel
The difference between the third and first quartiles.
Interquartile range
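A minimal sketch of the interquartile range using only the Python standard library. The sample values are made up, and the choice of method="inclusive" is an assumption; other quartile conventions can give slightly different results.

```python
import statistics

def interquartile_range(values):
    # quantiles(n=4) returns the three quartile cut points [Q1, Q2, Q3].
    # method="inclusive" treats the data as a whole population
    # (an assumption made for this sketch).
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return q3 - q1

sample = [1, 2, 3, 4, 5, 6, 7, 8, 9]  # made-up data
iqr = interquartile_range(sample)
print(iqr)
```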
You roll a die.
P(roll 1) + P(roll 2) + P(roll 3) + P(roll 4) + P(roll 5) + P(roll 6) = ?
1
In simple terms, describe what an LLM does.
It generates text by repeatedly predicting the next token from the preceding context.
alignment fine-tuning
Simple linear regression minimizes this loss function.
Mean squared error
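As an illustration, mean squared error can be computed by averaging the squared differences between true and predicted values. The function name and the sample numbers below are hypothetical.

```python
def mean_squared_error(y_true, y_pred):
    # Average of the squared differences between targets and predictions.
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

actual = [3.0, 5.0, 7.0]      # made-up targets
predicted = [2.0, 5.0, 9.0]   # made-up predictions
mse = mean_squared_error(actual, predicted)
print(mse)
```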
A way of validating model performance on several held-out subsets by dividing the training data into K sections.
K-fold cross-validation
This visualization displays the distribution of a quantitative variable as counts within bins.
Histogram
Fairness through ______ is the practice of trying to achieve fairness & eliminate bias by removing sensitive features from a model.
unawareness
This is the first step of the data science lifecycle.
Ask a question
Define ELT
Extract, load, transform
False
Statistical philosophy that views probabilities as subjective beliefs that can be updated.
Bayesian
The pretraining process of an LLM is an example of ____ learning.
self-supervised
The mechanism used by transformers to dynamically construct the context for each word in a sentence in parallel is called _____.
self-attention
This method is a good baseline for classifying textual data.
Naive Bayes Classification
Computational algorithm for finding model parameters that minimize a loss function.
Gradient descent
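A rough sketch of gradient descent fitting a line y = w*x + b by minimizing mean squared error. The data, learning rate, and step count are assumptions chosen so this toy example converges; they are not general recommendations.

```python
def fit_line(xs, ys, learning_rate=0.05, steps=2000):
    # Start from zero parameters and repeatedly step against the gradient.
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        # Prediction errors under the current parameters.
        errors = [w * x + b - y for x, y in zip(xs, ys)]
        # Partial derivatives of mean squared error with respect to w and b.
        grad_w = 2.0 / n * sum(e * x for e, x in zip(errors, xs))
        grad_b = 2.0 / n * sum(errors)
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b

xs = [0.0, 1.0, 2.0, 3.0, 4.0]  # made-up inputs
ys = [1.0, 3.0, 5.0, 7.0, 9.0]  # generated from y = 2x + 1
w, b = fit_line(xs, ys)
print(w, b)
```

With these settings the fitted slope and intercept should land very close to 2 and 1.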
A declarative visualization language in Python.
Altair
Describe reject option classification.
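A post-processing bias mitigation technique that adjusts a model's classification output for examples whose predicted confidence falls within a given band around the decision boundary. Within that band, examples from the protected group are assigned the favorable outcome and examples from the unprotected group are assigned the unfavorable outcome.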
The task of communicating the results of analysis to the appropriate stakeholders is called the last ____ problem.
mile
This is the process of standardizing and harmonizing data across systems within an enterprise to boost data usability, integrity, and security.
Data governance
Residual
P(A ∪ B) = ?
P(A) + P(B) - P(A ∩ B)
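A small check of the addition rule above on a fair six-sided die. The events chosen here (A: roll is even, B: roll is greater than 3) are illustrative assumptions.

```python
from fractions import Fraction

outcomes = {1, 2, 3, 4, 5, 6}                # fair six-sided die
event_a = {o for o in outcomes if o % 2 == 0}  # A: roll is even
event_b = {o for o in outcomes if o > 3}       # B: roll is greater than 3

def prob(event):
    # Each outcome is equally likely, so probability is a ratio of counts.
    return Fraction(len(event), len(outcomes))

p_union_direct = prob(event_a | event_b)
p_union_rule = prob(event_a) + prob(event_b) - prob(event_a & event_b)
print(p_union_direct, p_union_rule)
```

Subtracting the overlap avoids counting outcomes that belong to both events twice.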
This parameter influences the conditional probability distribution for an LLM's next-word selection.
temperature
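One common formulation divides the model's raw scores (logits) by the temperature before applying softmax. The sketch below assumes that formulation; the logits are hypothetical and do not come from any real model.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    # Lower temperature sharpens the distribution; higher flattens it.
    scaled = [z / temperature for z in logits]
    # Subtract the maximum for numerical stability before exponentiating.
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]  # hypothetical next-token scores
sharp = softmax_with_temperature(logits, temperature=0.5)
flat = softmax_with_temperature(logits, temperature=2.0)
print(sharp, flat)
```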
This form of behavioral training uses a reward model to evaluate multiple LLM outputs to a prompt from a prompt dataset.
Reinforcement Learning from Human Feedback (RLHF)
Reinforcement Learning
confusion matrix
This visualization measures magnitudes of values using a color gradient.
Heatmap
Pre-processing methods
Write an expression that groups a dataframe 'df' by 'column' and sums a column named 'count' in each group.
df.groupby('column')['count'].sum()
BigQuery
This value is represented by an "r", and is bound between -1 and 1.
Correlation coefficient
Name the rule:
P(A | B) = P(B | A) * P(A) / P(B)
Bayes'
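A worked example of the rule with made-up numbers: a condition A with a 1% prior, a test B that is positive 90% of the time when A holds and 5% of the time when it does not. All three figures are assumptions for illustration.

```python
def bayes_posterior(prior_a, likelihood_b_given_a, evidence_b):
    # P(A | B) = P(B | A) * P(A) / P(B)
    return likelihood_b_given_a * prior_a / evidence_b

p_a = 0.01              # assumed prior P(A)
p_b_given_a = 0.90      # assumed P(B | A)
p_b_given_not_a = 0.05  # assumed P(B | not A)

# Law of total probability gives the evidence term P(B).
p_b = p_b_given_a * p_a + p_b_given_not_a * (1 - p_a)
posterior = bayes_posterior(p_a, p_b_given_a, p_b)
print(posterior)
```

Even with a fairly accurate test, the posterior stays well below one half because the prior is small.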
This term is used to describe when an LLM generates false information.
Hallucination
Often called the founding father of artificial intelligence.
Alan Turing
Multiple linear regression is _____
a statistical method for predicting numerical outputs based on two or more input features.
Loss function used in logistic regression.
Cross-entropy
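A minimal sketch of binary cross-entropy for a two-class problem, where each prediction is the estimated probability of class 1. The clipping constant eps and the sample values are assumptions.

```python
import math

def binary_cross_entropy(y_true, p_pred, eps=1e-12):
    total = 0.0
    for y, p in zip(y_true, p_pred):
        # Clip probabilities away from 0 and 1 so log() stays finite.
        p = min(max(p, eps), 1.0 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)

labels = [1, 0]  # made-up labels
good = binary_cross_entropy(labels, [0.9, 0.1])  # confident and correct
bad = binary_cross_entropy(labels, [0.1, 0.9])   # confident and wrong
print(good, bad)
```

Confident wrong predictions are penalized much more heavily than confident correct ones.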
This visualization is suitable for time series analysis.
Line chart
Measure of a model's fairness also called the p% rule.
Disparate impact
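Disparate impact is usually computed as the ratio of favorable-outcome rates between an unprivileged and a privileged group, with 0.8 (the "80% rule") as a commonly cited threshold. The sketch below assumes that definition; the group labels and outcomes are made up.

```python
def disparate_impact(outcomes, groups, unprivileged, privileged):
    def favorable_rate(group):
        # Share of favorable outcomes (1) among members of the group.
        selected = [o for o, g in zip(outcomes, groups) if g == group]
        return sum(selected) / len(selected)
    return favorable_rate(unprivileged) / favorable_rate(privileged)

outcomes = [1, 0, 0, 1, 1, 1, 0, 1]                # 1 = favorable decision
groups = ["a", "a", "a", "a", "b", "b", "b", "b"]  # hypothetical groups
ratio = disparate_impact(outcomes, groups, unprivileged="a", privileged="b")
print(ratio)
```

A ratio near 1 indicates similar favorable-outcome rates; here the ratio falls below the 0.8 threshold.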