Databases
Data Engineering
AI/ML Concepts
Modern Data Stack
Data Governance and Quality
100

This is the SQL keyword used to fetch data from a table

What is SELECT?

100

This is a series of automated steps to move and transform data

What is a pipeline?

100

This is the term for when a model fits training data too tightly

What is overfitting?

100

This is a tool that uses SQL to transform data models in your data warehouse

What is dbt?

100

This term describes the accuracy and completeness of data

What is data quality?

200

This is a column or set of columns that uniquely identifies a record

What is a primary key?

200

This open-source, in-memory, unified analytics engine is widely used for fast, distributed big data processing and has largely replaced MapReduce for Hadoop workloads

What is Apache Spark?

200

For LLMs, this is the process of breaking down text into discrete units like words or subwords

What is tokenization?

200

This is the cloud-based data warehouse offered by Microsoft

What is Azure Synapse Analytics?

200

This role is responsible for maintaining data standards

What is a data steward?

300

This is a database system optimized for documents, not tables

What is MongoDB?

300

This human-readable data format has largely replaced XML as the de facto standard for web APIs

What is JSON?

300

This term describes autonomous LLM-based systems that use tools and memory to complete tasks

What is an agent?

300

This is the term for combining data warehouse and data lake capabilities in a single platform

What is a data lakehouse?

300

This is a detailed inventory of an organization's data assets

What is a data catalog?

400

This columnar file format is commonly used for efficient storage and querying in data lakes

What is Parquet?

400

This Python library is often used to manipulate tabular data

What is pandas?

400

This technique improves LLM responses by injecting relevant external data at query time

What is RAG (Retrieval-Augmented Generation)?

400

This open-source Apache platform provides scalable, distributed event streaming and is widely offered as a managed cloud service

What is Apache Kafka?

400

This term describes tracking where data originated and how it has changed

What is data lineage?

500

This OLAP database runs in-process and can query Parquet files directly, without first importing them into a table

Hint: It's also Troy Schmidt's favorite DB

What is DuckDB?

500

This open-source Apache tool is used for orchestration of data workflows

What is Apache Airflow?

500

This is Google's latest video generation model

What is Veo 3?

500

This cloud-based data warehouse just announced adaptive compute to automatically handle cluster sizing

What is Snowflake?

500

This architecture approach, promoted by Microsoft among other vendors, provides a unified data management and governance layer across on-premises, multi-cloud, and edge environments

What is Data Fabric?
