Data Analysis

Archetypes

Analysis Lifecycle

Machine Learning Modalities

Algorithmic Arsenal

Metric Mastery

100

This type of analysis describes current characteristics or patterns in the data to answer the fundamental question, "What happened?"

Descriptive Analysis

100

In this initial step, you must determine what you want to achieve to decide which specific technique is most appropriate for your data

Define your Data Analysis Goal

100

In this learning category, the model is trained on labeled data, meaning the input data is paired with the correct corresponding output labels

Supervised Learning

100

This algorithm uses a tree-like structure where internal nodes represent feature-based decisions and leaf nodes represent final class labels

Decision Tree
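
To make the clue concrete, here is a minimal sketch of how a sample travels through such a tree. The tree itself, its feature names (`outlook`, `windy`), and its labels are hypothetical and hand-built for illustration; a real tree would be learned from training data.

```python
# Hypothetical hand-built tree: internal nodes test one feature,
# leaf nodes hold the final class label.
tree = {
    "feature": "outlook",
    "branches": {
        "sunny": {
            "feature": "windy",
            "branches": {
                True: {"label": "stay in"},
                False: {"label": "play"},
            },
        },
        "rainy": {"label": "stay in"},
    },
}


def classify(node, sample):
    """Walk from the root to a leaf, following the branch matching each feature value."""
    while "label" not in node:
        value = sample[node["feature"]]
        node = node["branches"][value]
    return node["label"]


print(classify(tree, {"outlook": "sunny", "windy": False}))  # play
```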

100

This common metric represents the proportion of correctly classified instances out of the total number of instances

Accuracy
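
The definition in the clue translates directly into code. This is a small sketch; the function name `accuracy` and the sample labels are illustrative assumptions, not part of the original material.

```python
def accuracy(y_true, y_pred):
    """Proportion of predictions that match the true labels."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


# Made-up labels: 3 of the 5 predictions are correct.
y_true = [1, 0, 1, 1, 0]
y_pred = [1, 0, 0, 1, 1]
print(accuracy(y_true, y_pred))  # 0.6
```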

200

Taking patterns from descriptive analytics a step further, this process uses techniques like root cause analysis to discover "Why did this happen?"

Diagnostic Analysis

200

This step is vital because your choice here determines the specific features and limitations you will face during data gathering and processing

Choose your Data Analysis Tool/s

200

The goal of this learning type is to discover hidden structures or patterns, such as clustering, within data that has no associated target values

Unsupervised Learning

200

This probabilistic classifier is based on Bayes' Theorem and operates under the assumption that features are conditionally independent given the class

Naive Bayes
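
The independence assumption means a class score is just the prior multiplied by one likelihood per feature. The sketch below illustrates that arithmetic; the classes, words, and every probability in it are made-up values for illustration, not estimates from real data.

```python
from math import prod

# Hypothetical priors P(class) and likelihoods P(word | class).
priors = {"spam": 0.4, "ham": 0.6}
likelihoods = {
    "spam": {"free": 0.5, "meeting": 0.1},
    "ham": {"free": 0.1, "meeting": 0.4},
}


def posterior(features):
    """Return P(class | features), treating features as conditionally independent."""
    scores = {
        c: priors[c] * prod(likelihoods[c][f] for f in features)
        for c in priors
    }
    total = sum(scores.values())  # normalize so the posteriors sum to 1
    return {c: s / total for c, s in scores.items()}


result = posterior(["free", "meeting"])
print(max(result, key=result.get))  # ham
```

With these numbers the unnormalized scores are 0.4 × 0.5 × 0.1 = 0.02 for spam and 0.6 × 0.1 × 0.4 = 0.024 for ham, so ham wins narrowly.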

200

Also known as a Type I error, this occurs when the model incorrectly predicts a positive outcome for an actually negative case

False Positive
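
A false positive is one of the four cells of a binary confusion matrix. The following sketch counts all four; the function name `confusion_counts`, the `positive` parameter, and the sample labels are illustrative assumptions.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count true/false positives and negatives for a binary problem."""
    tp = fp = tn = fn = 0
    for t, p in zip(y_true, y_pred):
        if p == positive and t == positive:
            tp += 1
        elif p == positive:   # predicted positive, actually negative (Type I error)
            fp += 1
        elif t == positive:   # predicted negative, actually positive (Type II error)
            fn += 1
        else:
            tn += 1
    return {"TP": tp, "FP": fp, "TN": tn, "FN": fn}


y_true = [1, 0, 1, 1, 0]
y_pred = [1, 0, 0, 1, 1]
print(confusion_counts(y_true, y_pred))  # {'TP': 2, 'FP': 1, 'TN': 1, 'FN': 1}
```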

300

This analysis type uses historical and current data to forecast what might happen in the future, often utilizing regression models or time series analysis

Predictive Analysis

300

To gain a comprehensive understanding of patterns, you should ideally perform this step from several different perspectives

Analyze your Data

300

Used when the output is a continuous value, this supervised learning technique predicts things like house prices or stock trends

Regression
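
As one simple instance of regression, the sketch below fits a straight line by ordinary least squares. The helper `fit_line` and the house sizes and prices are hypothetical, chosen so the relationship is exactly linear; real data would be noisier and might need a different model.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up data: house size in square meters, price in thousands.
sizes = [50, 80, 100, 120]
prices = [150, 240, 300, 360]

slope, intercept = fit_line(sizes, prices)
predicted = slope * 90 + intercept  # continuous prediction for a 90 m^2 house
print(round(predicted, 2))
```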

300

This non-parametric method classifies a sample based on the majority class of its "K" closest neighbors within the feature space

K-Nearest Neighbors (KNN)
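
The majority vote described in the clue can be sketched in a few lines. This version assumes Euclidean distance and uses made-up two-dimensional training points; the name `knn_predict` and the default `k=3` are illustrative choices.

```python
from collections import Counter
from math import dist


def knn_predict(train, query, k=3):
    """Label a query point by majority vote among its k nearest training points."""
    neighbors = sorted(train, key=lambda item: dist(item[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]


# Hypothetical training set of (point, label) pairs.
train = [
    ((1.0, 1.0), "A"), ((1.5, 2.0), "A"), ((2.0, 1.0), "A"),
    ((6.0, 6.0), "B"), ((7.0, 5.5), "B"), ((6.5, 7.0), "B"),
]

print(knn_predict(train, (1.2, 1.4)))  # A
print(knn_predict(train, (6.4, 6.2)))  # B
```

Nothing is learned ahead of time here: the training points are simply stored and compared against at prediction time.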

300

This metric, also called sensitivity, measures the proportion of actual positive instances that the model correctly identified

Recall
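
In terms of counts, recall is TP / (TP + FN). A minimal sketch follows; the function name, the `positive` parameter, and the choice to return 0.0 when there are no actual positives are assumptions made for this example.

```python
def recall(y_true, y_pred, positive=1):
    """Share of actual positives that the model correctly identified."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    if tp + fn == 0:
        return 0.0  # assumption: define recall as 0 when no positives exist
    return tp / (tp + fn)


# Made-up labels: 4 actual positives, 3 of them found.
y_true = [1, 1, 1, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 1]
print(recall(y_true, y_pred))  # 0.75
```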

400

Often serving as an "expert opinion," this analysis suggests actionable takeaways to support decisions on what a business should do next

Prescriptive Analysis

400

This step applies techniques such as data cleaning to improve the quality and reliability of the data you have gathered

Preprocess your Data

400

This unsupervised learning technique involves reducing the number of features in a dataset while carefully retaining the most important information

Dimensionality Reduction (e.g., PCA)

400

This supervised model works by finding the "optimal hyperplane" to best separate different classes in a feature space

Support Vector Machine (SVM)

400

Known as a Type II error, this occurs when the model predicts a negative outcome for an instance that was actually positive

False Negative

500

Recommending specific games to a user based on their unique budget, hardware, and skill level is a complex application of this analysis category

Prescriptive Analysis

500

If your goals are found to be unmet during this final stage, you may be required to change your tools and restart the entire analysis process

Assess your Results

500

This specific unsupervised goal involves grouping similar data points together based on behavior, such as segmenting customers without predefined categories

Clustering
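
One well-known clustering method is k-means. The sketch below is a simplified one-dimensional version with fixed starting centroids; the function name `kmeans_1d`, the starting centroids, and the monthly spend figures are all hypothetical.

```python
def kmeans_1d(values, centroids, iterations=10):
    """Simplified 1-D k-means: assign each value to its nearest centroid, then re-average."""
    centroids = list(centroids)
    clusters = []
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for v in values:
            nearest = min(range(len(centroids)), key=lambda i: abs(v - centroids[i]))
            clusters[nearest].append(v)
        # Move each centroid to the mean of its cluster (keep it if the cluster is empty).
        centroids = [
            sum(c) / len(c) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return centroids, clusters


# Made-up monthly spend per customer; no segment labels are given.
spend = [12, 15, 14, 90, 95, 88]
centroids, clusters = kmeans_1d(spend, centroids=[0, 100])
print(clusters)  # [[12, 15, 14], [90, 95, 88]]
```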

500

Despite the word "regression" in its name, this statistical method is used for classification: it models the probability of a binary outcome

Logistic Regression
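
The probability comes from passing a weighted sum of the features through the sigmoid function. This sketch shows only the prediction step; the weights and bias are hypothetical values, not fitted to any data.

```python
from math import exp


def sigmoid(z):
    """Squash any real number into the open interval (0, 1)."""
    return 1.0 / (1.0 + exp(-z))


def predict_proba(features, weights, bias):
    """Probability of the positive class for one sample."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)


# Hypothetical model parameters for illustration only.
weights = [1.5, -0.5]
bias = -1.0

p = predict_proba([2.0, 1.0], weights, bias)  # z = -1.0 + 3.0 - 0.5 = 1.5
label = 1 if p >= 0.5 else 0                  # threshold at 0.5 for a binary decision
print(label)  # 1
```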

500

This metric is the harmonic mean of precision and recall; it is particularly useful for balancing both concerns in imbalanced datasets

F1 Score
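
The harmonic mean penalizes a large gap between precision and recall more than a plain average would. A minimal sketch; the function name `f1` and the convention of returning 0.0 when both inputs are zero are assumptions for this example.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0  # assumption: define F1 as 0 when both inputs are 0
    return 2 * precision * recall / (precision + recall)


# High precision cannot hide poor recall: the arithmetic mean of these
# two values would be 0.5, but the harmonic mean is far lower.
print(round(f1(0.9, 0.1), 2))  # 0.18
```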