This is the SQL keyword used to fetch data from a table
What is SELECT?
This is a series of automated steps to move and transform data
What is a pipeline?
This is the term for when a model fits its training data so closely that it generalizes poorly to new data
What is overfitting?
This is a tool that uses SQL-based models to transform data in your data warehouse
What is dbt?
This word describes the accuracy and completeness of data
What is data quality?
This is a column or set of columns that uniquely identifies a record
What is a primary key?
This open-source, in-memory, unified analytics engine is widely used for fast, distributed big data processing, often in place of Hadoop MapReduce
What is Apache Spark?
For LLMs, this is the process of breaking down text into discrete units like words or subwords
What is tokenization?
This is the cloud-based data warehouse offered by Microsoft
What is Azure Synapse?
This position/job title is responsible for maintaining data standards
What is a data steward?
This is a database system optimized for documents, not tables
What is MongoDB?
This human-readable data format has largely replaced XML as the de facto standard for web APIs
What is JSON?
This term describes autonomous LLM-based systems that use tools and memory to complete tasks
What is an agent?
This is the term for combining data warehouse and data lake capabilities in a single platform
What is a data lakehouse?
This is a detailed inventory of an organization's data assets
What is a data catalog?
This columnar file format is commonly used for efficient storage and querying in data lakes
What is Parquet?
This Python library is often used to manipulate tabular data
What is pandas?
This technique improves LLM responses by injecting relevant external data at query time
What is RAG (Retrieval-Augmented Generation)?
This open-source Apache platform provides scalable, distributed event streaming
What is Apache Kafka?
This term describes tracking where data originated and how it has changed
What is data lineage?
This OLAP database runs in-process and can query Parquet files directly, without importing them first
Hint: It's also Troy Schmidt's favorite DB
What is DuckDB?
This open-source Apache tool is used for orchestration of data workflows
What is Apache Airflow?
This is Google's latest video generation model
What is Veo 3?
This cloud-based data warehouse just announced adaptive compute to automatically handle cluster sizing
What is Snowflake?
This is the architecture approach that provides a unified data management and governance layer across on-premises, multi-cloud, and edge environments
What is a data fabric?