ETL
Extract, Transform, Load
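The three ETL steps can be sketched in plain Python (the CSV data and table layout here are made-up examples):

```python
import csv
import io
import sqlite3

# Extract: read raw rows from a CSV source (an in-memory example here).
raw = io.StringIO("name,price\nwidget,3.50\ngadget,7.25\n")
rows = list(csv.DictReader(raw))

# Transform: convert dollar strings to integer cents for exact storage.
records = [(r["name"], int(round(float(r["price"]) * 100))) for r in rows]

# Load: write the cleaned records into a database table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE products (name TEXT, price_cents INTEGER)")
conn.executemany("INSERT INTO products VALUES (?, ?)", records)
print(conn.execute("SELECT name, price_cents FROM products").fetchall())
# [('widget', 350), ('gadget', 725)]
```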
A group or collection of data points that share similar characteristics or properties is also known as a...
Cluster
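A toy sketch of clustering: assign each 1D point to its nearest center (the two centers are arbitrary example values, not output of a real clustering algorithm):

```python
# Group 1D data points by their nearest center -- a toy illustration of
# clustering (the centers 2.0 and 10.0 are arbitrary example values).
points = [1.0, 2.5, 3.0, 9.0, 10.5, 11.0]
centers = [2.0, 10.0]

clusters = {c: [] for c in centers}
for p in points:
    nearest = min(centers, key=lambda c: abs(p - c))
    clusters[nearest].append(p)

print(clusters)  # {2.0: [1.0, 2.5, 3.0], 10.0: [9.0, 10.5, 11.0]}
```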
The practice of automatically building (a build combines separate parts of code into a working program) and testing changes to the code is known by this two-word answer
Continuous integration
AWS
Amazon Web Services: Amazon's cloud computing platform, offering on-demand compute, storage, and other services.
This is referred to as Amazon's Data Warehouse
Amazon Redshift
AI
A broader concept that involves creating machines or systems capable of performing tasks that typically require human intelligence, including reasoning, problem-solving, understanding natural language, and more.
Information about other data, describing aspects such as its content, format, location, and characteristics, is also known as...
Metadata
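Metadata in miniature: a file's size, location, and format describe the data without being the data itself (the CSV contents below are just an example):

```python
import os
import tempfile

# Metadata describes the data itself: a file's size, location, and
# format, rather than its contents.
with tempfile.NamedTemporaryFile(delete=False, suffix=".csv") as f:
    f.write(b"name,price\nwidget,3.50\n")
    path = f.name

info = os.stat(path)
metadata = {
    "location": path,                     # where the data lives
    "size_bytes": info.st_size,           # how big the data is
    "format": os.path.splitext(path)[1],  # ".csv"
}
print(metadata["size_bytes"], metadata["format"])  # 23 .csv
os.remove(path)
```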
What is "Apache Airflow"?
An open-source tool for authoring, scheduling, and monitoring workflows: pipelines are defined as ordered lists of tasks (to-do lists), which Airflow then runs and automates
AWS DataSync
Transfers large amounts of data between on-premises storage systems and AWS storage services quickly, securely, and efficiently
AWS Glue
A service that helps you prepare data for use, especially by automating data-quality tasks (identifying inconsistencies, cataloging data, suggesting transformations, etc.). It does not build the data pipelines themselves, but aids in transformation and data quality in preparation for use.
ML
A subset of Artificial Intelligence (AI) in which systems learn from data to improve their performance on a specific task without being explicitly programmed.
this is a blueprint that organizes and defines how data is stored and accessed in a database
Schema
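A schema as a blueprint, sketched with SQLite (the table and columns are invented for illustration):

```python
import sqlite3

# A schema is the blueprint: table names, column names, types, and
# constraints that define how data is stored and accessed.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE orders (
        order_id   INTEGER PRIMARY KEY,
        customer   TEXT    NOT NULL,
        total_usd  REAL    CHECK (total_usd >= 0)
    )
""")

# The database now enforces the blueprint: this insert satisfies it,
# while a negative total would violate the CHECK constraint and fail.
conn.execute("INSERT INTO orders VALUES (1, 'alice', 19.99)")
cols = [row[1] for row in conn.execute("PRAGMA table_info(orders)")]
print(cols)  # ['order_id', 'customer', 'total_usd']
```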
This is an open-source tool that can use multiple programming languages to ingest and process data
Apache Spark
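Spark's style of chaining transformations over a dataset can be imitated with plain Python's map/filter/reduce; real Spark distributes the same shape of computation across a cluster (this stdlib analogy is only an illustration, not the PySpark API):

```python
from functools import reduce

# A Spark-like pipeline in miniature: transformations (map/filter) are
# chained over the data, then an action (reduce) produces one result.
data = [1, 2, 3, 4, 5, 6]

squared = map(lambda x: x * x, data)           # transform each element
evens = filter(lambda x: x % 2 == 0, squared)  # keep only even squares
total = reduce(lambda a, b: a + b, evens)      # aggregate to one value

print(total)  # 4 + 16 + 36 = 56
```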
Primarily for ML, but also supports pre-processing/feature engineering in data science workflows. Builds models using pre-built or custom algorithms and helps you train them on large datasets. Only prepared data can be used in this service, also known as
Amazon SageMaker
Processes and analyzes large amounts of data using popular open-source tools like Apache Spark, Hadoop, and others, offering a managed environment that simplifies the setup, scaling, and maintenance of clusters for big data processing and analytics.
Amazon EMR
What does VM mean/refer to?
A Virtual Machine is a software computer: a guest OS (and its apps) running on a hypervisor, which allows multiple operating systems to run on one piece of hardware. Virtualization of this kind is a foundation of cloud computing
A system or process that can handle increasing amounts of data or a growing number of users without sacrificing performance is described as...
Scalable
serverless computing
Enables developers to build applications faster by eliminating the need for them to manage infrastructure
Used to create serverless ETL processes in response to events and automate data-related tasks
AWS Lambda
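A Lambda handler in miniature: the runtime invokes a function with an event and a context; below, a simplified stand-in for an S3 upload event is passed to a handler locally (the event shape is abbreviated, not a full S3 event):

```python
import json

# A minimal AWS Lambda-style handler: the runtime calls this function
# with an event (e.g. an S3 upload notification) and a context object.
def handler(event, context):
    records = event.get("Records", [])
    processed = [r["s3"]["object"]["key"] for r in records]
    return {"statusCode": 200, "body": json.dumps({"processed": processed})}

# Simulate an invocation locally with a fake, simplified event.
fake_event = {"Records": [{"s3": {"object": {"key": "raw/data.csv"}}}]}
result = handler(fake_event, context=None)
print(result["statusCode"], json.loads(result["body"])["processed"])
# 200 ['raw/data.csv']
```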
Amazon S3
Simple Storage Service: a scalable storage service that stores structured and unstructured data as objects in buckets. Useful for organization, versioning, access control, and lifecycle management
What is EC2
EC2 (Elastic Compute Cloud) is a service managed by Amazon that provides virtual servers with a flexible pricing structure based on the end user's needs (e.g., rather than owning an entire server and paying for upkeep/power usage, you pay only for the capacity you need).
Allows multiple operating systems to run on a single physical computer at the same time by virtually separating and managing the computer's resources, like CPU, memory, and storage. This is also known as a...
Hypervisor
A platform, founded by the creators of Apache Spark, that builds on Spark's capabilities and adds machine learning/data science features
Databricks
Simplifies the process of cleaning and preparing data; can be used in conjunction with Glue or as a standalone tool. This is known as
AWS Glue DataBrew
Automatically scans data sources to discover and catalog their metadata
AWS Glue Crawler