This is the practice of deriving valuable and actionable insights from data through interdisciplinary knowledge in programming, math/stats, and specialized domains.
Data science
Big data is largely defined by its volume, ____, and variety.
velocity
Statistical term for "average".
Mean
Two events are _____ if they cannot both occur simultaneously.
mutually exclusive
NLP stands for ____.
natural language processing
Object in pandas that represents tabular data with rows and columns.
Dataframe
The main categories of machine learning are supervised, unsupervised, and ____.
reinforcement
Technique for limiting model complexity.
Regularization
This visualization displays counts of categories.
Bar chart
The problematic feature in the healthcare risk score algorithm discussed in "Dissecting racial bias in an algorithm used to manage the health of populations".
Medical cost
This is the name for precious data from humans that cannot be quantified.
Thick data
When a data processing task is distributed across multiple computers, this is known as ____ processing.
parallel
The difference between the third and first quartiles.
Interquartile range
You roll a die.
P(roll 1) + P(roll 2) + P(roll 3) + P(roll 4) + P(roll 5) + P(roll 6) = ?
1
Describing the positivity or negativity of a document using machine learning is an example of ____.
sentiment analysis
Synonym for a .py file.
Module
Simple linear regression optimizes this loss function.
Mean squared error
A way of validating model performance on several unseen datasets by dividing our training data into K sections.
This visualization displays counts of quantitative variables.
Histogram
Adage describing the removal of sensitive features from a dataset to eliminate bias.
Fairness through unawareness
This is the first step of the data science lifecycle.
Ask a question
This big data computer technology is an extension of the MapReduce model and has in-memory processing capabilities.
Apache Spark
False
collision
The Naive Bayes algorithm is an example of this kind of supervised machine learning.
Classification
A pandas function that allows you to see the distribution of a column's values.
value_counts
This machine learning algorithm selects the most important features to use in a model.
Algorithm for finding model parameters that minimize a loss function.
Gradient descent
As Josh would say, a "chlorophyll" map.
Choropleth
True or False: Redlining on a racial basis is illegal.
True
The task of communicating the results of analysis to the appropriate stakeholders is called the last ____ problem.
mile
This is the process of standardizing and harmonizing data across systems within an enterprise to boost data usability, integrity, and security.
Data governance
Residual
P(A uu B) = ?
P(A) + P(B) - P(A nn B)
"amus" is the ___ of "amused", "amusing", and "amusement".
stem
Write an expression that groups a dataframe 'df' by 'column' and sums by 'count' in each group.
df.groupby('column')['count'].sum()
This algorithm balances exploration with exploitation in the Multi-Armed Bandit problem.
Epsilon-greedy
(TP)/(TP + FP)
Precision
This visualization measures magnitudes of values using a color gradient.
Heatmap
Joy Buolamwini researched fairness in this application of machine learning.
Facial recognition
This is the unconscious belief of valuing the measurable over the immeasurable.
Quantification bias
Big Query
This value is represented by an "r", and is bound between -1 and 1.
Correlation coefficient
Name the rule:
P(A | B) = (P(A) * P(B | A) )/(P(B))
Bayes'
P(x_1, x_2 | c) = P(x_1 |c) * P(x_2|c)
implies x1 and x2 are conditionally ____ on c.
independent
Describe the effect of the following:
df['column'] = df['column'].apply(lambda x: x2)
Squares each value in the field 'column'.
In the multiple linear regression setting, the vector of predicted values is the _____ of the observed vector on the span of the design matrix.
orthogonal projection
Loss function used in logistic regression.
Cross-entropy
This visualization is suitable for time series analysis.
Line chart
Measure of a model's fairness also called the p% rule.
Disparate impact