This is the practice of deriving valuable and actionable insights from data through interdisciplinary knowledge in programming, math/stats, and specialized domains.
Data science
Big data is largely defined by its volume, ____, and variety.
velocity
Statistical term for "average".
Mean
Two events are _____ if they cannot both occur simultaneously.
mutually exclusive
What does LLM stand for?
Large Language Model
What does RAG stand for?
Retrieval Augmented Generation
Regression and classification are examples of this kind of machine learning.
supervised
Technique for limiting model complexity.
Regularization
This visualization displays counts of categories.
Bar chart
The problematic feature in the healthcare risk score algorithm discussed in "Dissecting racial bias in an algorithm used to manage the health of populations".
Medical cost
Describe the effect of the following:
df['column'] = df['column'].apply(lambda x: x**2)
Squares each value in the field 'column'.
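A minimal sketch of the card above, assuming pandas is installed and using a made-up three-row dataframe (the data is hypothetical, only the column name comes from the card):

```python
import pandas as pd

# Hypothetical example data; 'column' matches the name used in the card.
df = pd.DataFrame({"column": [1, 2, 3]})

# apply() calls the lambda once per value and collects the results.
df["column"] = df["column"].apply(lambda x: x ** 2)

# Pull the values out as a plain list for inspection.
squared = df["column"].tolist()
```

For simple arithmetic like this, the vectorized form df["column"] ** 2 gives the same values and is usually faster than apply().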
When a data processing task is distributed across multiple computers, this is known as ____ processing.
parallel
The difference between the third and first quartiles.
Interquartile range
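As a rough illustration, the interquartile range can be computed with the standard library. The data below is made up, and the "inclusive" quantile method is one of several conventions, so other tools may report slightly different quartiles for the same data:

```python
import statistics

# Hypothetical sample data.
data = [1, 2, 3, 4, 5, 6, 7, 8]

# n=4 splits the data into quartiles and returns the three cut points.
q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")

# The IQR is the spread of the middle half of the data.
iqr = q3 - q1
```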
You roll a die.
P(roll 1) + P(roll 2) + P(roll 3) + P(roll 4) + P(roll 5) + P(roll 6) = ?
1
In simple terms, describe what an LLM does.
It excels at generating text by predicting the next token based on context information.
Where are the documents for a RAG system stored?
In a vector database
Simple linear regression minimizes this loss function.
Mean squared error
A way of validating model performance on several unseen datasets by dividing our training data into K sections.
K-fold cross-validation
This visualization displays the distribution of a quantitative variable by counting the values that fall in each bin.
Histogram
Fairness through ______ is the practice of trying to achieve fairness & eliminate bias by removing sensitive features from a model.
unawareness
This is the first step of the data science lifecycle.
Ask a question
This big data computing technology is an extension of the MapReduce model and has in-memory processing capabilities.
Apache Spark
False
Statistical philosophy that views probabilities as subjective beliefs that can be updated.
Bayesian
The pretraining process of an LLM is an example of ____ learning.
self-supervised
chunking
Clustering is an example of this kind of machine learning.
unsupervised
Algorithm for finding model parameters that minimize a loss function.
Gradient descent
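A small sketch of gradient descent, here minimizing mean squared error for a one-feature linear model. The toy data, learning rate, and step count are assumptions chosen for illustration, not values from the course:

```python
# Toy data generated from y = 2x + 1 (an assumption for illustration).
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [2.0 * x + 1.0 for x in xs]

def mse_gradients(w, b):
    """Partial derivatives of mean squared error with respect to w and b."""
    n = len(xs)
    dw = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
    db = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / n
    return dw, db

w, b = 0.0, 0.0          # starting guess
learning_rate = 0.05     # step size (hypothetical choice)

for _ in range(2000):
    dw, db = mse_gradients(w, b)
    # Step against the gradient to reduce the loss.
    w -= learning_rate * dw
    b -= learning_rate * db
```

With this data the parameters approach the slope and intercept used to generate it. A learning rate that is too large would make the updates diverge instead.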
A declarative visualization language in Python.
Altair
True or False: Redlining on a racial basis is illegal.
True
The task of communicating the results of analysis to the appropriate stakeholders is called the last ____ problem.
mile
This is the process of standardizing and harmonizing data across systems within an enterprise to boost data usability, integrity, and security.
Data governance
Residual
P(A ∪ B) = ?
P(A) + P(B) - P(A ∩ B)
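The rule above can be checked on a single die roll. The two events here (A = "even", B = "greater than 3") are chosen only for illustration:

```python
from fractions import Fraction

outcomes = {1, 2, 3, 4, 5, 6}
A = {2, 4, 6}   # roll is even
B = {4, 5, 6}   # roll is greater than 3

def prob(event):
    """Probability of an event when all outcomes are equally likely."""
    return Fraction(len(event), len(outcomes))

# Left side: probability of the union, counted directly.
lhs = prob(A | B)

# Right side: add both probabilities, then subtract the overlap
# so that outcomes in both A and B are not counted twice.
rhs = prob(A) + prob(B) - prob(A & B)
```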
This parameter influences the conditional probability distribution for an LLM's next-word selection.
temperature
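One common way temperature is applied is by dividing the model's logits by it before the softmax. The sketch below assumes that formulation; the function name and the logits are hypothetical:

```python
import math

def softmax_with_temperature(logits, temperature):
    """Turn raw scores into probabilities, scaled by a temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]  # hypothetical scores for three candidate tokens

sharp = softmax_with_temperature(logits, 0.5)    # low temperature
neutral = softmax_with_temperature(logits, 1.0)
flat = softmax_with_temperature(logits, 2.0)     # high temperature
```

Lower temperatures concentrate probability on the top-scoring token, while higher temperatures spread it more evenly across candidates.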
Describe the primary use case for RAG systems.
Question answering (Q&A)
A reinforcement learning problem that requires a balance of exploration and exploitation.
The Multi-Armed Bandit
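Epsilon-greedy is one simple strategy for this trade-off (it is not the only one, and the card does not name a specific method). The sketch below assumes Bernoulli rewards; the arm probabilities, step count, and epsilon are made up:

```python
import random

def epsilon_greedy(true_probs, steps=2000, epsilon=0.1, seed=0):
    """Play a Bernoulli bandit, exploring with probability epsilon."""
    rng = random.Random(seed)
    counts = [0] * len(true_probs)    # pulls per arm
    values = [0.0] * len(true_probs)  # running mean reward per arm
    for _ in range(steps):
        if rng.random() < epsilon:
            # Explore: try a random arm.
            arm = rng.randrange(len(true_probs))
        else:
            # Exploit: pick the arm with the best estimate so far.
            arm = max(range(len(true_probs)), key=lambda a: values[a])
        reward = 1.0 if rng.random() < true_probs[arm] else 0.0
        counts[arm] += 1
        # Incremental update of the running mean.
        values[arm] += (reward - values[arm]) / counts[arm]
    return counts, values

# Hypothetical win probabilities for three arms.
counts, values = epsilon_greedy([0.2, 0.5, 0.8])
```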
sigmoid
This visualization measures magnitudes of values using a color gradient.
Heatmap
Joy Buolamwini researched fairness in this application of machine learning.
Facial recognition
Write an expression that groups a dataframe 'df' by 'column' and sums a column named 'count' in each group.
df.groupby('column')['count'].sum()
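To see the expression above in action, here is a short sketch with a made-up dataframe (pandas assumed installed; only the names 'df', 'column', and 'count' come from the card):

```python
import pandas as pd

# Hypothetical data: two groups, "a" and "b".
df = pd.DataFrame({
    "column": ["a", "b", "a"],
    "count": [1, 2, 3],
})

# Group rows by 'column', then sum 'count' within each group.
totals = df.groupby("column")["count"].sum()

# Convert the resulting Series to a plain dict for inspection.
totals_dict = totals.to_dict()
```

The result is a Series indexed by the group labels, so totals["a"] gives the sum for group "a".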
BigQuery
This value is represented by an "r", and is bound between -1 and 1.
Correlation coefficient
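A minimal sketch of the Pearson correlation coefficient written out by hand, assuming that is the "r" the card refers to. The helper name and sample data are hypothetical:

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation: covariance scaled by both standard deviations."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

# A perfectly increasing and a perfectly decreasing linear relationship.
r_up = pearson_r([1, 2, 3, 4], [2, 4, 6, 8])
r_down = pearson_r([1, 2, 3, 4], [8, 6, 4, 2])
```

The two examples sit at the extremes of the range: r_up is at the upper bound and r_down at the lower bound.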
Name the rule:
P(A | B) = P(A) * P(B | A) / P(B)
Bayes'
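A worked example of the rule, using a diagnostic test. All three input numbers are invented for illustration; A is "has the condition" and B is "tests positive":

```python
prior = 0.01                # P(A): hypothetical base rate of the condition
sensitivity = 0.95          # P(B | A): positive test given the condition
false_positive_rate = 0.05  # P(B | not A): positive test without it

# P(B), by the law of total probability over A and not A.
evidence = prior * sensitivity + (1 - prior) * false_positive_rate

# Bayes' rule: P(A | B) = P(A) * P(B | A) / P(B)
posterior = prior * sensitivity / evidence
```

Even with an accurate test, the posterior stays fairly low here because the condition is rare to begin with.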
This term is used to describe when an LLM generates false information.
Hallucination
RAGAs is a method for evaluating the quality of an LLM's generated output. It focuses on three aspects of answer quality. Name two of them.
faithfulness, answer relevance, context relevance
In the multiple linear regression setting, the vector of predicted values is the _____ of the observed vector on the span of the design matrix. (think back to the geometric visual I provided!)
orthogonal projection
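The projection claim can be checked numerically: if the fitted values are the orthogonal projection of y onto the column space of the design matrix, the residual must be orthogonal to every column. A sketch assuming NumPy is installed, with made-up data:

```python
import numpy as np

# Hypothetical design matrix: an intercept column and one feature.
X = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0]])
y = np.array([1.0, 2.0, 2.0, 4.0])

# Least-squares coefficients.
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

# Fitted values lie in the span of X's columns.
y_hat = X @ beta

# The residual should be orthogonal to each column of X.
residual = y - y_hat
dot_products = X.T @ residual
```

The entries of dot_products come out as zero up to floating-point error, which is the orthogonality condition.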
Loss function used in logistic regression.
Cross-entropy
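For a binary label, cross-entropy can be sketched as below. The function name, the clipping constant, and the example predictions are assumptions for illustration:

```python
import math

def binary_cross_entropy(y_true, y_prob, eps=1e-12):
    """Average negative log-likelihood of the true labels."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        # Clip to avoid log(0) when a prediction is exactly 0 or 1.
        p = min(max(p, eps), 1 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

# Confident and correct predictions give a small loss...
good = binary_cross_entropy([1, 0], [0.9, 0.1])
# ...while confident and wrong predictions are penalized heavily.
bad = binary_cross_entropy([1, 0], [0.1, 0.9])
```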
This visualization is suitable for time series analysis.
Line chart
Measure of a model's fairness also called the p% rule.
Disparate impact
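One common way to compute this measure is as a ratio of positive-outcome rates between two groups, with 80% often used as the threshold. The sketch below assumes that definition; the group data and helper names are hypothetical:

```python
def positive_rate(outcomes):
    """Share of a group that received the positive outcome (1)."""
    return sum(outcomes) / len(outcomes)

def disparate_impact(unprivileged, privileged):
    """Ratio of positive-outcome rates between the two groups."""
    return positive_rate(unprivileged) / positive_rate(privileged)

# Hypothetical model decisions: 1 = approved, 0 = denied.
group_a = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]  # 30% approved
group_b = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0]  # 60% approved

ratio = disparate_impact(group_a, group_b)

# With p = 80, the rule asks for a ratio of at least 0.8.
passes_80_percent_rule = ratio >= 0.8
```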