Data Science Technical Interview Questions

General Questions

Supervised Learning

Unsupervised Learning

Statistical Inference / Distributions

Time Series

100

The training score is noticeably better than the testing score.

What is overfitting?

100

The generic quantity we optimize when fitting a machine learning model.

What is a loss function?

100

The success of this algorithm can be affected by outliers.

What is K-Means clustering?

100

This distribution models single, binary events.

What is the Bernoulli distribution?

100

One defining characteristic of time series data, distinct from independent (cross-sectional) data.

What is autocorrelation?
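
As a rough illustration of the answer above, the lag-k autocorrelation of a series can be computed by hand. This is a minimal sketch; the function name and the toy trend series are assumptions made for the example.

```python
def autocorrelation(series, lag):
    """Sample autocorrelation of `series` at the given non-negative lag."""
    n = len(series)
    mean = sum(series) / n
    # Total variation of the series around its mean.
    var = sum((x - mean) ** 2 for x in series)
    # Co-variation between each value and the value `lag` steps earlier.
    cov = sum((series[t] - mean) * (series[t - lag] - mean) for t in range(lag, n))
    return cov / var

# Toy data (assumed): a steadily rising series is strongly autocorrelated.
trend = [float(t) for t in range(20)]
print(autocorrelation(trend, 1))
```

Values near 1 at short lags indicate that neighboring observations carry information about each other, which is what separates time series from independent samples.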

200

The training data for each tree in a Random Forest is drawn via this method.

What is bootstrapping?
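
Bootstrapping here means sampling rows with replacement until the sample is as large as the original data. A minimal sketch, assuming a hypothetical helper name and toy row indices:

```python
import random

def bootstrap_sample(rows, rng):
    """Draw len(rows) items from rows with replacement."""
    return rng.choices(rows, k=len(rows))

rng = random.Random(0)        # seeded only so the example is repeatable
rows = list(range(10))        # toy stand-in for training row indices
sample = bootstrap_sample(rows, rng)
print(sample)
```

Because sampling is with replacement, some rows typically appear more than once and others not at all, so each tree sees a slightly different data set.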

200

TP / (TP + FN)

What is sensitivity (recall)?
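
The formula can be checked with a few lines of code. This is a sketch with made-up counts; the function name is an assumption.

```python
def sensitivity(tp, fn):
    """True positive rate: TP / (TP + FN)."""
    if tp + fn == 0:
        raise ValueError("sensitivity is undefined when there are no actual positives")
    return tp / (tp + fn)

# Made-up counts: 80 positives caught, 20 missed.
print(sensitivity(tp=80, fn=20))  # 0.8
```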

200

This metric can be used to evaluate intra-cluster cohesion and inter-cluster separation.

What is the silhouette score?
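
A minimal sketch of the silhouette score for one-dimensional points, assuming at least two clusters and no fully duplicated data. For each point, a is the mean distance to its own cluster (cohesion) and b is the mean distance to the nearest other cluster (separation). The function name and toy data are assumptions.

```python
def silhouette_score(points, labels):
    """Mean of (b - a) / max(a, b) over all points (1-D, absolute distance)."""
    clusters = {}
    for p, label in zip(points, labels):
        clusters.setdefault(label, []).append(p)
    scores = []
    for p, label in zip(points, labels):
        own = clusters[label]
        if len(own) == 1:
            scores.append(0.0)  # convention for singleton clusters
            continue
        # Cohesion: mean distance to the other members of the same cluster.
        a = sum(abs(p - q) for q in own) / (len(own) - 1)
        # Separation: mean distance to the closest other cluster.
        b = min(
            sum(abs(p - q) for q in other) / len(other)
            for key, other in clusters.items()
            if key != label
        )
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)

points = [1.0, 2.0, 9.0, 10.0]                       # toy data
good = silhouette_score(points, [0, 0, 1, 1])        # tight, well separated
bad = silhouette_score(points, [0, 1, 0, 1])         # clusters interleaved
print(good, bad)
```

Scores close to 1 suggest compact, well-separated clusters; negative scores suggest points sit closer to another cluster than to their own.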

200

This distribution models the number of events occurring in a fixed time period.

What is the Poisson distribution?
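
As a quick illustration, the Poisson probability of observing k events when the average rate is lam per period can be written directly from its formula. The function name and the rate of 3 are assumptions for the example.

```python
import math

def poisson_pmf(k, lam):
    """P(K = k) for a Poisson distribution with mean rate lam."""
    return math.exp(-lam) * lam ** k / math.factorial(k)

# Assumed rate: 3 events per period on average.
for k in range(6):
    print(k, round(poisson_pmf(k, 3.0), 4))
```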

200

This algorithm can use exogenous variables to model a time series.

What is VAR (vector autoregression)?

300

This type of model fits a collection of weak learners sequentially, each one correcting the errors of those before it.

What is boosting?

300

This algorithm finds the optimal beta coefficients in a regularized linear regression model.

What is gradient descent?
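
A minimal sketch of gradient descent on a ridge-style objective, assuming a single feature with no intercept and the loss mean((y - beta*x)^2) + alpha*beta^2. The function name, learning rate, and toy data are assumptions for illustration.

```python
def ridge_gradient_descent(xs, ys, alpha=0.1, lr=0.01, steps=5000):
    """Minimise mean squared error plus an L2 penalty for y ≈ beta * x."""
    beta = 0.0
    n = len(xs)
    for _ in range(steps):
        # Gradient of the squared-error term plus gradient of the penalty.
        grad = (-2 / n) * sum(x * (y - beta * x) for x, y in zip(xs, ys)) + 2 * alpha * beta
        beta -= lr * grad
    return beta

xs = [1.0, 2.0, 3.0, 4.0]   # toy feature
ys = [2.0, 4.0, 6.0, 8.0]   # toy target, exactly 2 * x
beta = ridge_gradient_descent(xs, ys)
print(beta)
```

The fitted beta lands slightly below 2 because the penalty term pulls the coefficient toward zero.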

300

This clustering evaluation metric will decrease as the number of clusters increases.

What is inertia?
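
Inertia is the within-cluster sum of squared distances to each cluster's centroid, so splitting the data into more clusters can only shrink it. A minimal sketch on one-dimensional toy points, with hand-picked cluster labels (both assumptions):

```python
def inertia(points, labels):
    """Sum of squared distances from each 1-D point to its cluster mean."""
    clusters = {}
    for p, label in zip(points, labels):
        clusters.setdefault(label, []).append(p)
    total = 0.0
    for members in clusters.values():
        centroid = sum(members) / len(members)
        total += sum((p - centroid) ** 2 for p in members)
    return total

points = [1.0, 2.0, 9.0, 10.0]
one = inertia(points, [0, 0, 0, 0])    # 65.0 with a single cluster
two = inertia(points, [0, 0, 1, 1])    # 1.0 with two clusters
four = inertia(points, [0, 1, 2, 3])   # 0.0 when every point is its own cluster
print(one, two, four)
```

This is why inertia alone cannot pick the number of clusters; it is usually read with an elbow plot or alongside another metric.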

300

This statistical procedure tests whether the mean difference between two paired series of data is equal to zero.

What is a paired t-test?
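
The test statistic is the mean of the pairwise differences divided by its standard error. A minimal sketch that computes only the t statistic (not the p-value), using made-up before/after measurements and an assumed function name:

```python
import math
import statistics

def paired_t_statistic(before, after):
    """t = mean(d) / (sd(d) / sqrt(n)) for pairwise differences d."""
    diffs = [a - b for a, b in zip(after, before)]
    n = len(diffs)
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)      # sample standard deviation
    return mean_diff / (sd_diff / math.sqrt(n))

before = [10, 12, 9, 11]    # made-up measurements
after = [12, 13, 11, 14]
t_stat = paired_t_statistic(before, after)
print(t_stat)
```

The statistic is compared against a t distribution with n - 1 degrees of freedom to decide whether the mean difference is plausibly zero.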

300

The statistical test used to choose the "d" hyperparameter in ARIMA models.

What is the Augmented Dickey-Fuller test?

400

The least common type of missing data.

What is "Missing Completely at Random" (MCAR)?

400

The GLM link function for linear regression.

What is the identity function?

400

This method reuses what one machine learning model has learned in a separate model built for a different task.

What is transfer learning?

400

This type of inference uses new information to update our understanding of event probability.

What is Bayesian inference?
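
One compact way to see the updating step is the Beta-Binomial case, where a Beta(alpha, beta) prior on a success probability is updated by adding observed successes and failures. This is a sketch under that conjugate-prior assumption; the function names and counts are made up.

```python
def update_beta(alpha, beta, successes, failures):
    """Posterior Beta parameters after observing binary outcomes."""
    return alpha + successes, beta + failures

def beta_mean(alpha, beta):
    """Mean of a Beta(alpha, beta) distribution."""
    return alpha / (alpha + beta)

prior = (1, 1)                                        # uniform prior: no initial preference
posterior = update_beta(*prior, successes=7, failures=3)
print(beta_mean(*prior), beta_mean(*posterior))
```

The estimate moves from the prior mean of 0.5 toward the observed success rate as evidence accumulates.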

400

The null hypothesis of the Augmented Dickey-Fuller test.

What is "the series is not stationary (it has a unit root)"?

500

This tree-based ensemble builds the least correlated trees.

What is Extra Trees?

500

The loss function for logistic regression.

What is log loss / binary cross-entropy?
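
Log loss averages the negative log-probability assigned to the true label. A minimal sketch, assuming labels in {0, 1} and predicted probabilities of the positive class; the clipping constant eps is an assumption added to avoid log(0).

```python
import math

def log_loss(y_true, y_prob, eps=1e-15):
    """Binary cross-entropy averaged over all examples."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)   # keep p strictly inside (0, 1)
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return -total / len(y_true)

confident = log_loss([1, 0], [0.9, 0.1])   # good, confident predictions
unsure = log_loss([1, 0], [0.6, 0.4])      # correct but hesitant
print(confident, unsure)
```

Confident correct predictions score lower (better) than hesitant ones, and confident wrong predictions are penalized heavily.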

500

The relationship between any two principal components.

What is orthogonal (uncorrelated)?

500

This proposes a range of plausible values for an unknown parameter.

What is a confidence interval?
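
A minimal sketch of a confidence interval for a mean, assuming the normal approximation (mean plus or minus z times the standard error). For small samples a t critical value would give a somewhat wider interval. The function name and sample values are made up.

```python
import math
import statistics

def mean_confidence_interval(sample, confidence=0.95):
    """Normal-approximation interval for the population mean."""
    n = len(sample)
    mean = statistics.mean(sample)
    std_err = statistics.stdev(sample) / math.sqrt(n)
    # Two-sided critical value, e.g. about 1.96 for 95% confidence.
    z = statistics.NormalDist().inv_cdf(0.5 + confidence / 2)
    return mean - z * std_err, mean + z * std_err

sample = [4.8, 5.1, 5.0, 4.9, 5.2]   # made-up measurements
low, high = mean_confidence_interval(sample)
print(low, high)
```

Raising the confidence level widens the interval, since a larger range is needed to cover the unknown parameter more often.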

500

This model accounts for sudden shocks in a time series by regressing on past forecast errors.

What is a moving average (MA) model?