What is Data Science
Big Data
Statistics
Probability
NLP
Data Mining
ML Pt. 1
ML Pt. 2
Data Visualization
Ethics
100

This is the practice of deriving valuable and actionable insights from data through interdisciplinary knowledge in programming, math/stats, and specialized domains.

Data science

100

Big data is largely defined by its volume, ____, and variety.

velocity

100

Statistical term for "average".

Mean

100

Two events are _____ if they cannot both occur simultaneously.

mutually exclusive

100

NLP stands for ____.

natural language processing

100

Object in pandas that represents tabular data with rows and columns.

DataFrame

100

The main categories of machine learning are supervised, unsupervised, and ____.

reinforcement

100

Technique for limiting model complexity.

Regularization

100

This visualization displays counts of categories.

Bar chart

100

The problematic feature in the healthcare risk score algorithm discussed in "Dissecting racial bias in an algorithm used to manage the health of populations".

Medical cost

200

This is the name for precious data from humans that cannot be quantified.

Thick data

200

When a data processing task is distributed across multiple computers, this is known as ____ processing.

parallel

200

The difference between the third and first quartiles. 

Interquartile range
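A minimal sketch of the calculation, assuming the "inclusive" quartile convention from Python's standard library (other conventions can give slightly different quartiles). The helper name `interquartile_range` and the sample data are illustrative only.

```python
import statistics

def interquartile_range(values):
    # quantiles(n=4) returns the three cut points Q1, Q2 (median), Q3.
    q1, _median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return q3 - q1

data = [1, 2, 3, 4, 5, 6, 7, 8, 9]  # made-up sample
iqr = interquartile_range(data)
print(iqr)
```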

200

You roll a die. 

P(roll 1) + P(roll 2) + P(roll 3) + P(roll 4) + P(roll 5) + P(roll 6) = ?

1

200

Describing the positivity or negativity of a document using machine learning is an example of ____.

sentiment analysis

200

Synonym for a .py file.

Module

200

Simple linear regression minimizes this loss function.

Mean squared error

200

A way of validating model performance on several unseen datasets by dividing our training data into K sections.

K-fold cross validation
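One way the splitting step might look, as a sketch that works on row indices only and does no shuffling. The function name `k_fold_indices` is hypothetical; libraries such as scikit-learn provide their own splitters.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, validation_indices) for each of the k folds."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, validation
        start += size

folds = list(k_fold_indices(10, 3))
for train, validation in folds:
    print("train:", train, "validate:", validation)
```

Each row is used for validation exactly once, and a model would be fit K times, once per training split.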
200

This visualization displays counts of quantitative variables.

Histogram

200

Adage describing the removal of sensitive features from a dataset to eliminate bias.

Fairness through unawareness

300

This is the first step of the data science lifecycle.

Ask a question

300

This big data computer technology is an extension of the MapReduce model and has in-memory processing capabilities. 

Apache Spark

300

True or False: A population's true mean will always fall within a sample of the population.

False

300

When a hash function assigns a single value to more than one element / individual, this is called a ____.

collision
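A toy illustration, assuming a deliberately weak hash function (`toy_hash` is made up for this example and is not how real hash tables work): because it only sums character codes, any two anagrams land in the same bucket.

```python
def toy_hash(text, buckets=8):
    # Sum of character codes, reduced to a bucket number.
    return sum(ord(ch) for ch in text) % buckets

first = toy_hash("ab")
second = toy_hash("ba")
print(first, second, first == second)
```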

300

The Naive Bayes algorithm is an example of this kind of supervised machine learning.

Classification

300

A pandas function that allows you to see the distribution of a column's values.

value_counts

300

This machine learning algorithm reduces a dataset to the combinations of features that explain the most variance.

Principal component analysis

300

Algorithm for finding model parameters that minimize a loss function.

Gradient descent
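A minimal sketch for the simple linear regression case, assuming mean squared error as the loss and a fixed learning rate. The function name, learning rate, step count, and data are all illustrative choices, not values from the course.

```python
def fit_line(xs, ys, learning_rate=0.05, steps=2000):
    """Fit y = slope * x + intercept by gradient descent on mean squared error."""
    slope, intercept = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        errors = [slope * x + intercept - y for x, y in zip(xs, ys)]
        # Partial derivatives of MSE with respect to each parameter.
        grad_slope = 2 / n * sum(e * x for e, x in zip(errors, xs))
        grad_intercept = 2 / n * sum(errors)
        # Step against the gradient.
        slope -= learning_rate * grad_slope
        intercept -= learning_rate * grad_intercept
    return slope, intercept

xs = [0, 1, 2, 3, 4]
ys = [2 * x + 1 for x in xs]  # points that lie exactly on y = 2x + 1
slope, intercept = fit_line(xs, ys)
print(slope, intercept)
```

A learning rate that is too large makes the updates diverge, so in practice it is tuned per problem.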

300

As Josh would say, a "chlorophyll" map.

Choropleth

300

True or False: Redlining on a racial basis is illegal.

True

400

The task of communicating the results of analysis to the appropriate stakeholders is called the last ____ problem.

mile

400

This is the process of standardizing and harmonizing data across systems within an enterprise to boost data usability, integrity, and security.

Data governance

400

When using simple linear regression to predict y from x, this term describes the difference between the observed value of y and the predicted value of y for a given x.

Residual

400

P(A ∪ B) = ?

P(A) + P(B) - P(A ∩ B)
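A quick check of this rule on a fair six-sided die, with events chosen purely for illustration: A is "roll is even" and B is "roll is greater than 3".

```python
from fractions import Fraction

outcomes = set(range(1, 7))  # faces of a fair die

def prob(event):
    # Every face is equally likely, so probability is a ratio of counts.
    return Fraction(len(event), len(outcomes))

A = {2, 4, 6}  # even roll
B = {4, 5, 6}  # roll greater than 3

union_directly = prob(A | B)
union_by_rule = prob(A) + prob(B) - prob(A & B)
print(union_directly, union_by_rule)
```

Subtracting the intersection corrects for 4 and 6 being counted in both A and B.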

400

"amus" is the ___ of "amused", "amusing", and "amusement".

stem
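A toy suffix stripper that reproduces this one example. It is only a sketch with a hand-picked suffix list; real stemmers such as the Porter stemmer apply many more rules.

```python
SUFFIXES = ("ement", "ing", "ed")  # illustrative list, longest first

def toy_stem(word):
    for suffix in SUFFIXES:
        # Keep at least three characters so short words are left alone.
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[:-len(suffix)]
    return word

stems = {toy_stem(w) for w in ["amused", "amusing", "amusement"]}
print(stems)
```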

400

Write an expression that groups a dataframe 'df' by 'column' and sums by 'count' in each group.

df.groupby('column')['count'].sum()
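The expression can be tried on a small made-up dataframe (the values here are invented for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    "column": ["a", "b", "a", "b", "a"],
    "count": [1, 2, 3, 4, 5],
})

# One row per distinct value of 'column', holding the sum of 'count'.
totals = df.groupby("column")["count"].sum()
print(totals)
```

The result is a Series indexed by the group labels.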

400

This algorithm balances exploration with exploitation in the Multi-Armed Bandit problem.

Epsilon-greedy
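A rough sketch of the idea, assuming Bernoulli rewards and running-average value estimates. The arm probabilities, epsilon, and function names are made up for illustration.

```python
import random

def choose_arm(estimates, epsilon, rng):
    # Explore: with probability epsilon, pick an arm uniformly at random.
    if rng.random() < epsilon:
        return rng.randrange(len(estimates))
    # Exploit: otherwise pick the arm with the highest estimated reward.
    return max(range(len(estimates)), key=lambda i: estimates[i])

def run_bandit(true_probs, epsilon=0.1, pulls=1000, seed=0):
    rng = random.Random(seed)
    counts = [0] * len(true_probs)
    estimates = [0.0] * len(true_probs)
    for _ in range(pulls):
        arm = choose_arm(estimates, epsilon, rng)
        reward = 1.0 if rng.random() < true_probs[arm] else 0.0
        counts[arm] += 1
        # Incremental update of the running average reward for this arm.
        estimates[arm] += (reward - estimates[arm]) / counts[arm]
    return counts, estimates

counts, estimates = run_bandit([0.2, 0.5, 0.8])
print(counts, estimates)
```

Setting epsilon to 0 gives a purely greedy policy, and setting it to 1 gives purely random exploration.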

400

TP / (TP + FP)

Precision
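A small sketch of the formula applied to binary labels, where 1 is the positive class. The labels are invented, and returning 0.0 when nothing is predicted positive is an assumption made here to avoid dividing by zero.

```python
def precision(y_true, y_pred):
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    if tp + fp == 0:
        return 0.0  # no positive predictions (assumed convention)
    return tp / (tp + fp)

y_true = [1, 0, 1, 1, 0, 0]
y_pred = [1, 1, 1, 0, 0, 1]
score = precision(y_true, y_pred)
print(score)
```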

400

This visualization measures magnitudes of values using a color gradient.

Heatmap

400

Joy Buolamwini researched fairness in this application of machine learning.

Facial recognition

500

This is the unconscious belief of valuing the measurable over the immeasurable.

Quantification bias

500

This is an online analytics data warehouse solution on the Google Cloud Platform.

BigQuery

500

This value is represented by an "r", and is bound between -1 and 1.

Correlation coefficient
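A sketch of how r can be computed by hand (the Pearson form), using made-up data in which y is an exact linear function of x, so r sits at the ends of its range.

```python
import math

def pearson_r(xs, ys):
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of cross-deviations and squared deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

r_up = pearson_r([1, 2, 3, 4], [2, 4, 6, 8])    # perfectly increasing
r_down = pearson_r([1, 2, 3, 4], [8, 6, 4, 2])  # perfectly decreasing
print(r_up, r_down)
```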

500

Name the rule:

P(A | B) = (P(A) * P(B | A)) / P(B)

Bayes'
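A worked example with invented numbers: suppose P(A) = 1/100, P(B | A) = 9/10, and P(B | not A) = 5/100. P(B) is obtained from the law of total probability before applying the rule.

```python
from fractions import Fraction

p_a = Fraction(1, 100)              # prior P(A), made-up value
p_b_given_a = Fraction(9, 10)       # P(B | A), made-up value
p_b_given_not_a = Fraction(5, 100)  # P(B | not A), made-up value

# Law of total probability: P(B) = P(B|A)P(A) + P(B|not A)P(not A)
p_b = p_b_given_a * p_a + p_b_given_not_a * (1 - p_a)

# The rule itself: P(A | B) = P(A) * P(B | A) / P(B)
posterior = p_a * p_b_given_a / p_b
print(posterior)
```

Even with a strong likelihood, the small prior keeps the posterior well below one half.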

500

P(x_1, x_2 | c) = P(x_1 | c) * P(x_2 | c)

implies x_1 and x_2 are conditionally ____ given c.

independent

500

Describe the effect of the following:

df['column'] = df['column'].apply(lambda x: x ** 2)

Squares each value in the field 'column'.
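A runnable version on a tiny made-up dataframe, taking the lambda body to be `x ** 2`:

```python
import pandas as pd

df = pd.DataFrame({"column": [1, 2, 3]})

# apply() calls the lambda once per element of the column.
df["column"] = df["column"].apply(lambda x: x ** 2)
squared = df["column"].tolist()
print(squared)
```

For a simple power like this, the vectorized form `df["column"] ** 2` gives the same values without a Python-level function call per element.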

500

In the multiple linear regression setting, the vector of predicted values is the _____ of the observed vector on the span of the design matrix.

  

orthogonal projection

500

Loss function used in logistic regression.

Cross-entropy
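A minimal sketch of the binary form of this loss, averaged over examples. The clipping constant `eps` is an assumption added here so that `log(0)` is never evaluated.

```python
import math

def binary_cross_entropy(y_true, y_prob, eps=1e-12):
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)  # keep p strictly inside (0, 1)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

uncertain = binary_cross_entropy([1, 0], [0.5, 0.5])
confident_right = binary_cross_entropy([1, 0], [0.9, 0.1])
confident_wrong = binary_cross_entropy([1, 0], [0.1, 0.9])
print(uncertain, confident_right, confident_wrong)
```

The loss grows sharply when the model assigns high probability to the wrong class.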

500

This visualization is suitable for time series analysis.

Line chart

500

Measure of a model's fairness also called the p% rule.

Disparate impact
