ETL
Extract, Transform, Load
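A minimal sketch of the three ETL stages, assuming a small in-memory CSV as the source and SQLite as the target. The names `RAW_CSV`, `sales`, and `run_etl` are made up for illustration.

```python
import csv
import io
import sqlite3

# Hypothetical source data; a real pipeline would read a file, API, or database.
RAW_CSV = "name,amount\nalice, 10\nbob,\ncarol,7\n"

def extract(text):
    """Extract: read raw rows from the CSV source."""
    return list(csv.DictReader(io.StringIO(text)))

def transform(rows):
    """Transform: drop rows with no amount, tidy names, convert types."""
    cleaned = []
    for row in rows:
        amount = row["amount"].strip()
        if not amount:
            continue  # skip incomplete records
        cleaned.append((row["name"].strip().title(), int(amount)))
    return cleaned

def load(rows, conn):
    """Load: write the cleaned rows into the target table."""
    conn.execute("CREATE TABLE IF NOT EXISTS sales (name TEXT, amount INTEGER)")
    conn.executemany("INSERT INTO sales VALUES (?, ?)", rows)
    conn.commit()

def run_etl(text, conn):
    """Run the three stages in order and return what ended up in the table."""
    load(transform(extract(text)), conn)
    return conn.execute("SELECT name, amount FROM sales ORDER BY name").fetchall()
```

Calling `run_etl(RAW_CSV, sqlite3.connect(":memory:"))` runs the whole pipeline against a throwaway database.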
AWS
Amazon Web Services
Databricks
Commercial cloud platform that builds on Apache Spark's capabilities and adds machine learning/data science capabilities
AWS's cloud data warehouse service
Amazon Redshift
Translates code from a language understandable by people (e.g., Java, Python) into something a computer can read (e.g., bytecode or machine code)
Compiling
EC2
Amazon Elastic Compute Cloud: an AWS service that rents out virtual servers with a flexible pricing structure based on the end user's needs (e.g., paying for capacity as needed rather than owning an entire server)
Service that coordinates data processing workflows
AWS Step Functions
Apache Spark
Open-source distributed processing engine. Spark takes instructions written in different coding languages (Python, Scala, Java, R, SQL) and uses them to process data across a cluster.
Service that helps you prepare data for use, especially by automating data quality tasks (identifying inconsistencies, cataloging data, suggesting transformations, etc.)
AWS Glue
Jenkins
Helps build, test, and deploy code by automating repetitive processes such as compiling and regression testing. It also notifies dev teams if something goes wrong in the deployment process.
VM and what it allows
A Virtual Machine is a software-emulated computer: a guest OS (and the apps on it) running on top of a hypervisor. This allows multiple operating systems to run on one piece of hardware.
AWS DataSync
Data transfer service that makes it easy to automate moving data between on-prem storage and AWS
Open-source tool for defining, scheduling, and monitoring workflows: tasks are laid out like a to-do list with dependencies, then automated and run on a schedule
Apache Airflow
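A rough sketch of the idea behind a scheduled workflow: tasks plus their dependencies decide the run order. This uses Python's standard `graphlib` module, not Airflow's own API, and the task names are hypothetical.

```python
from graphlib import TopologicalSorter

# Each task maps to the set of tasks that must finish before it can start.
dependencies = {
    "extract": set(),
    "transform": {"extract"},
    "load": {"transform"},
    "notify": {"load"},
}

def run_order(deps):
    """Return the tasks in an order that respects every dependency."""
    return list(TopologicalSorter(deps).static_order())

def run_pipeline(deps, tasks):
    """Call each task's function in dependency order and collect the results."""
    return [tasks[name]() for name in run_order(deps)]
```

An orchestrator like Airflow adds scheduling, retries, and monitoring on top of this kind of dependency graph.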
Big data platform that runs popular frameworks like Hadoop, Spark, and Presto
Amazon EMR
Collects/ingests data from multiple sources in real time so that tools like Spark can digest, transform, and load it
Apache Kafka
AWS Glue Crawler
Automatically discovers and catalogs metadata
Service with real-time streaming capabilities, allowing data engineers to ingest, process, and analyze streaming data
Amazon Kinesis
AWS Lambda
A serverless computing platform that enables developers to run code without provisioning servers.
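A minimal sketch of what a Python function for Lambda can look like. The `(event, context)` signature is the standard shape of a Python handler; the `name` field in the event and the response layout are assumptions for illustration.

```python
import json

def lambda_handler(event, context):
    """Entry point the platform calls; no server setup appears in the code."""
    # "name" is a hypothetical field in the incoming event.
    name = event.get("name", "world")
    return {
        "statusCode": 200,
        "body": json.dumps({"message": f"Hello, {name}!"}),
    }
```

The function can be exercised locally by calling `lambda_handler({"name": "data team"}, None)`, since the handler is just a plain Python function.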
Infrastructure-as-code tool: uses code to provision and manage infrastructure; allows engineers to allocate infra resources
Terraform
Model in which the cloud provider (e.g., AWS) manages the infrastructure, so developers don't have to provision or maintain servers to run their code.
Serverless computing
Primarily for ML, but can also do pre-processing/feature engineering in data science workflows. Builds models; can use pre-built algorithms or custom-written algorithms
Amazon SageMaker
Practice of automatically building (a build is the combination of separate parts of code) and testing every change to the code
Continuous integration
Simplifies the process of cleaning and preparing data. Can be used in conjunction with Glue or as a standalone tool
AWS Glue DataBrew
Data warehouse software built on Hadoop; it runs across multiple pieces of hardware and can query large sets of data using a SQL-like language (HiveQL).
Apache Hive