This is the practice of deriving valuable and actionable insights from data through interdisciplinary knowledge in programming, math/stats, and specialized domains.
Data science
Big data is largely defined by its volume, ____, and variety.
velocity
Statistical term for "average".
Mean
Two events are _____ if they cannot both occur simultaneously.
mutually exclusive
What does LLM stand for?
Large Language Model
Regression and classification are examples of this kind of machine learning.
supervised
This standard classification evaluation metric looks at the ratio of correctly classified examples to the total number of examples.
Accuracy
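As a rough illustration of the ratio described above, accuracy can be computed by hand. The function name `accuracy` and the sample labels are made up for this sketch.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


# Illustrative labels only: 4 of the 5 predictions match.
y_true = [1, 0, 1, 1, 0]
y_pred = [1, 0, 0, 1, 0]
acc = accuracy(y_true, y_pred)
```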
The problematic label in the healthcare risk score algorithm discussed in "Dissecting racial bias in an algorithm used to manage the health of populations".
Medical cost
Describe the effect of the following:
df['column'] = df['column'].apply(lambda x: x**2)
Squares each value in the field 'column'.
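A small runnable version of the statement above, assuming the lambda is meant to compute `x ** 2` as the answer states. The sample values are invented for illustration.

```python
import pandas as pd

# Toy dataframe; the values are illustrative only.
df = pd.DataFrame({"column": [1, 2, 3]})

# apply() calls the lambda once per value in the Series.
df["column"] = df["column"].apply(lambda x: x ** 2)
squared = df["column"].tolist()
```

For a simple power like this, the vectorized form `df["column"] ** 2` gives the same result without a per-row function call.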
When a data processing task is distributed across multiple computers, this is known as ____ processing.
parallel
The difference between the third and first quartiles.
Interquartile range
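A minimal sketch of the calculation using the standard library. Quartile conventions differ between tools, so the exact value depends on the method; this uses the default of `statistics.quantiles`, and the data is made up.

```python
import statistics

data = [1, 2, 3, 4, 5, 6, 7, 8]  # illustrative sample

# quantiles(..., n=4) returns the three cut points Q1, Q2 (median), Q3.
q1, q2, q3 = statistics.quantiles(data, n=4)
iqr = q3 - q1
```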
You roll a die.
P(roll 1) + P(roll 2) + P(roll 3) + P(roll 4) + P(roll 5) + P(roll 6) = ?
1
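The sum above can be checked with exact fractions, assuming a fair six-sided die (the question does not say the die is fair, but the total is 1 for any die because the six faces are mutually exclusive and exhaustive).

```python
from fractions import Fraction

# Assumption: a fair die, so each face has probability 1/6.
p = {face: Fraction(1, 6) for face in range(1, 7)}
total = sum(p.values())
```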
In simple terms, describe what an LLM does.
It excels at generating text by predicting the next token based on context information.
Simple linear regression minimizes this loss function.
Mean squared error
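For reference, a hand-written sketch of the loss. The function name and the sample numbers are illustrative.

```python
def mean_squared_error(y_true, y_pred):
    """Average of the squared differences between targets and predictions."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


# Squared errors are 1, 0, and 4, so the mean is 5/3.
mse = mean_squared_error([3.0, 5.0, 7.0], [2.0, 5.0, 9.0])
```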
A way of validating model performance on several unseen datasets by dividing our training data into K sections.
K-fold cross-validation
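The splitting described above is commonly called K-fold cross-validation. A minimal sketch of the index bookkeeping, assuming contiguous, unshuffled folds; the helper name `k_fold_indices` is hypothetical.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, val_idx) pairs, one per fold.

    Each sample lands in exactly one validation fold; the remaining
    samples form the training set for that fold.
    """
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    # Spread the remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, val_idx
        start += size


folds = list(k_fold_indices(n_samples=10, k=3))
```

In practice the model is trained K times, once per fold, and the K validation scores are averaged.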
Fairness through ______ is the practice of trying to achieve fairness & eliminate bias by removing sensitive features from a model.
unawareness
This is the first step of the data science lifecycle.
Ask a question
Define ELT
Extract, load, transform
False
Statistical philosophy that views probabilities as subjective beliefs that can be updated.
Bayesian
The pretraining process of an LLM is an example of ____-supervised learning.
self
This method is a good baseline for classifying textual data.
Naive Bayes Classification
Computational algorithm for finding model parameters that minimize a loss function.
Gradient descent
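A bare-bones sketch of the idea on a one-parameter loss, L(w) = (w - 3)^2, whose gradient is 2(w - 3). The loss, learning rate, and step count are chosen only for illustration.

```python
def gradient_descent(grad, w0, learning_rate=0.1, n_steps=100):
    """Repeatedly step against the gradient, starting from w0."""
    w = w0
    for _ in range(n_steps):
        w -= learning_rate * grad(w)
    return w


# Gradient of the toy loss (w - 3)^2; its minimum is at w = 3.
w_min = gradient_descent(lambda w: 2 * (w - 3), w0=0.0)
```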
Describe reject option classification.
A post-processing bias mitigation technique that adjusts a model's classification output for examples whose scores fall within a given confidence band around the decision boundary. Within that band, examples from the protected group are classified favorably and examples from the unprotected group are classified unfavorably.
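A simplified sketch of the rule described above, not a reference implementation. The parameter names `threshold` and `margin` (the half-width of the confidence band) are assumptions made for this example.

```python
def reject_option_classify(score, is_protected, threshold=0.5, margin=0.1):
    """Return 1 (favorable) or 0 (unfavorable) for a single example.

    score is the model's predicted probability of the favorable outcome.
    Inside the band [threshold - margin, threshold + margin] the label is
    set by group membership; outside it, the model's own decision stands.
    """
    if abs(score - threshold) <= margin:
        return 1 if is_protected else 0
    return 1 if score >= threshold else 0


# Inside the band: group membership decides.
near_protected = reject_option_classify(0.45, is_protected=True)
near_unprotected = reject_option_classify(0.55, is_protected=False)
# Outside the band: the model's confident prediction is kept.
confident = reject_option_classify(0.9, is_protected=False)
```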
The task of communicating the results of analysis to the appropriate stakeholders is called the last ____ problem.
mile
This is the process of standardizing and harmonizing data across systems within an enterprise to boost data usability, integrity, and security.
Data governance
Residual
P(A ∪ B) = ?
P(A) + P(B) - P(A ∩ B)
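The addition rule above can be checked on a small example. Here the events are assumed for illustration: A is "roll is even" and B is "roll is greater than 3" on a fair six-sided die.

```python
from fractions import Fraction

outcomes = set(range(1, 7))  # fair six-sided die (assumption)
A = {2, 4, 6}                # roll is even
B = {4, 5, 6}                # roll is greater than 3


def prob(event):
    """Probability of an event when all outcomes are equally likely."""
    return Fraction(len(event), len(outcomes))


lhs = prob(A | B)                       # P(A or B), using set union
rhs = prob(A) + prob(B) - prob(A & B)   # subtract the double-counted overlap
```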
This parameter influences the conditional probability distribution for an LLM's next-word selection.
temperature
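One common way this works is to divide the model's logits by the temperature before applying softmax. The sketch below assumes that formulation; the logits are invented and do not come from a real model.

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    """Turn logits into probabilities after scaling by 1 / temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.1]  # hypothetical scores for three candidate tokens
sharp = softmax_with_temperature(logits, temperature=0.5)
flat = softmax_with_temperature(logits, temperature=2.0)
```

A lower temperature concentrates probability on the top-scoring token; a higher one spreads it more evenly across candidates.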
Reinforcement Learning
confusion matrix
Pre-processing methods
Write an expression that groups a dataframe 'df' by 'column' and sums a column named 'count' in each group.
df.groupby('column')['count'].sum()
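A quick runnable check of the expression above on a toy dataframe. The data is made up for illustration.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "column": ["a", "b", "a", "b", "a"],
        "count": [1, 2, 3, 4, 5],
    }
)

# One row per distinct value of 'column', holding the sum of 'count'.
totals = df.groupby("column")["count"].sum()
totals_dict = totals.to_dict()
```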
BigQuery
This value is represented by an "r", and is bound between -1 and 1.
Correlation coefficient
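A hand-rolled sketch of the Pearson correlation coefficient, to show where the bounds of -1 and 1 come from. The function name and the sample data are illustrative.

```python
import math


def pearson_r(xs, ys):
    """Covariance of xs and ys divided by the product of their spreads."""
    if len(xs) != len(ys):
        raise ValueError("xs and ys must have the same length")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


r_pos = pearson_r([1, 2, 3, 4], [2, 4, 6, 8])  # perfectly increasing line
r_neg = pearson_r([1, 2, 3, 4], [8, 6, 4, 2])  # perfectly decreasing line
```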
Name the rule:
P(A | B) = P(B | A) P(A) / P(B)
Bayes'
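A worked example of the rule with hypothetical numbers: A is "has the condition" and B is "tests positive". The prior, sensitivity, and false-positive rate below are invented for illustration.

```python
p_a = 0.01              # P(A): prior probability of the condition (assumed)
p_b_given_a = 0.90      # P(B | A): positive test given the condition (assumed)
p_b_given_not_a = 0.05  # P(B | not A): false-positive rate (assumed)

# Law of total probability gives the denominator P(B).
p_b = p_b_given_a * p_a + p_b_given_not_a * (1 - p_a)

# Bayes' rule: P(A | B) = P(B | A) P(A) / P(B)
posterior = p_b_given_a * p_a / p_b
```

With these numbers the posterior stays well below the test's sensitivity, because the condition is rare to begin with.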
This term is used to describe when an LLM generates false information.
Hallucination
Multiple linear regression is _____
a statistical method for predicting numerical outputs based on two or more input features.
Loss function used in logistic regression.
Cross-entropy
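A minimal sketch of the binary form of this loss. The clipping constant `eps` is an implementation detail added here to avoid log(0), and the sample predictions are made up.

```python
import math


def binary_cross_entropy(y_true, p_pred, eps=1e-12):
    """Mean negative log-likelihood of binary labels under predicted probabilities."""
    if len(y_true) != len(p_pred):
        raise ValueError("y_true and p_pred must have the same length")
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1 - eps)  # keep p strictly inside (0, 1)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)


labels = [1, 0, 1]
good = binary_cross_entropy(labels, [0.9, 0.1, 0.8])  # confident and correct
bad = binary_cross_entropy(labels, [0.2, 0.7, 0.3])   # confident and wrong
```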
Measure of a model's fairness also called the p% rule.
Disparate impact
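A sketch of one common way to compute this measure: the favorable-outcome rate of the protected group divided by that of the unprotected group. The function name and data are illustrative; a ratio below 0.8 is often flagged under the "80% rule".

```python
def disparate_impact(outcomes, protected):
    """Ratio of favorable-outcome rates: protected group over unprotected group.

    outcomes: 1 for a favorable prediction, 0 otherwise.
    protected: True if the example belongs to the protected group.
    """
    prot = [o for o, g in zip(outcomes, protected) if g]
    unprot = [o for o, g in zip(outcomes, protected) if not g]
    rate_prot = sum(prot) / len(prot)
    rate_unprot = sum(unprot) / len(unprot)
    return rate_prot / rate_unprot


# Toy data: 2 of 4 protected and 3 of 4 unprotected examples get the favorable outcome.
outcomes = [1, 0, 0, 1, 1, 1, 0, 1]
protected = [True, True, True, True, False, False, False, False]
di = disparate_impact(outcomes, protected)
```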