Which of the following statements describes Delta Lake?
A. Delta Lake is an open source analytics engine used for big data workloads.
B. Delta Lake is an open format storage layer that delivers reliability, security, and performance.
C. Delta Lake is an open source platform to help manage the complete machine learning lifecycle.
D. Delta Lake is an open source data storage format for distributed data.
E. Delta Lake is an open format storage layer that processes data.
Answer:
B. Delta Lake is an open format storage layer that delivers reliability, security, and performance.
Which of the following data workloads will utilize a Bronze table as its source?
A. A job that aggregates cleaned data to create standard summary statistics
B. A job that queries aggregated data to publish key insights into a dashboard
C. A job that ingests raw data from a streaming source into the Lakehouse
D. A job that develops a feature set for a machine learning application
E. A job that enriches data by parsing its timestamps into a human-readable format
Answer:
E. A job that enriches data by parsing its timestamps into a human-readable format
A data engineer has configured a Structured Streaming job to read from a table, manipulate the data, and then perform a streaming write into a new table.
The code block used by the data engineer is below:
(spark.table("sales")
    .withColumn("avg_price", col("sales") / col("units"))
    .writeStream
    .option("checkpointLocation", checkpointPath)
    .outputMode("complete")
    ._____
    .table("new_sales")
)
If the data engineer only wants the query to execute a single micro-batch to process all of the available data, which of the following lines of code should the data engineer use to fill in the blank?
A. trigger(once=True)
B. trigger(continuous="once")
C. processingTime("once")
D. trigger(processingTime="once")
E. processingTime(1)
Answer:
A. trigger(once=True)
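For reference, a minimal sketch of a single-micro-batch streaming write with the blank filled in. To keep the sketch valid as a stand-alone stream, the read is expressed as spark.readStream.table and the output mode as append (the question's snippet differs on those two points); checkpointPath is assumed to be a writable checkpoint directory.

from pyspark.sql.functions import col

(spark.readStream.table("sales")
    .withColumn("avg_price", col("sales") / col("units"))
    .writeStream
    .option("checkpointLocation", checkpointPath)
    .outputMode("append")
    .trigger(once=True)  # process all available data in one micro-batch, then stop
    .table("new_sales")
)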
A data engineer is designing a data pipeline. The source system generates files in a shared directory that is also used by other processes. As a result, the files should be kept as is and will accumulate in the directory. The data engineer needs to identify which files are new since the previous run in the pipeline, and set up the pipeline to only ingest those new files with each run.
Which of the following tools can the data engineer use to solve this problem?
A. Databricks SQL
B. Delta Lake
C. Unity Catalog
D. Data Explorer
E. Auto Loader
Answer:
E. Auto Loader
A data engineer needs to dynamically create a table name string using three Python variables: region, store, and year. An example of a table name is below when region = "nyc", store = "100", and year = "2021":
nyc100_sales_2021
Which of the following commands should the data engineer use to construct the table name in Python?
A. "{region}+{store}+_sales_+{year}"
B. f"{region}+{store}+_sales_+{year}"
C. "{region}{store}_sales_{year}"
D. f"{region}{store}_sales_{year}"
E. {region}+{store}+"_sales_"+{year}
Answer:
D. f"{region}{store}_sales_{year}"
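For reference, a minimal sketch of the winning option in use (values taken from the question's example; the spark.sql call is shown only as a hypothetical usage):

region = "nyc"
store = "100"
year = "2021"

# The f-string interpolates each variable's value directly into the string
table_name = f"{region}{store}_sales_{year}"
print(table_name)  # nyc100_sales_2021

# The constructed name could then be used in a query, e.g.:
# spark.sql(f"SELECT * FROM {table_name}")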
Which of the following describes a benefit of a data lakehouse that is unavailable in a traditional data warehouse?
A. A data lakehouse provides a relational system of data management.
B. A data lakehouse captures snapshots of data for version control purposes.
C. A data lakehouse couples storage and compute for complete control.
D. A data lakehouse utilizes proprietary storage formats for data.
E. A data lakehouse enables both batch and streaming analytics.
Answer:
E. A data lakehouse enables both batch and streaming analytics.
An engineering manager uses a Databricks SQL query to monitor their team’s progress on fixes related to customer-reported bugs. The manager checks the results of the query every day, but they are manually rerunning the query each day and waiting for the results.
Which of the following approaches can the manager use to ensure the results of the query are updated each day?
A. They can schedule the query to run every 1 day from the Jobs UI.
B. They can schedule the query to refresh every 1 day from the query’s page in Databricks SQL.
C. They can schedule the query to run every 12 hours from the Jobs UI.
D. They can schedule the query to refresh every 1 day from the SQL endpoint’s page in Databricks SQL.
E. They can schedule the query to refresh every 12 hours from the SQL endpoint’s page in Databricks SQL.
Answer:
B. They can schedule the query to refresh every 1 day from the query’s page in Databricks SQL.
A data engineer has set up a notebook to process data automatically using a Job. The data engineer’s manager wants to version control the schedule due to its complexity.
Which of the following approaches can the data engineer use to obtain a version-controllable configuration of the Job’s schedule?
A. They can link the Job to notebooks that are a part of a Databricks Repo.
B. They can submit the Job once on a Job cluster.
C. They can download the JSON description of the Job from the Job’s page.
D. They can submit the Job once on an all-purpose cluster.
E. They can download the XML description of the Job from the Job’s page.
Answer:
C. They can download the JSON description of the Job from the Job’s page.
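For reference, a minimal sketch of pulling a Job's JSON definition programmatically via the Jobs 2.1 REST API, which returns the same description available from the Job's page (the workspace URL, token, and job_id are placeholders):

import json
import requests

workspace_url = "https://<your-workspace>.cloud.databricks.com"  # placeholder
token = "<personal-access-token>"                                # placeholder
job_id = 123                                                     # placeholder

resp = requests.get(
    f"{workspace_url}/api/2.1/jobs/get",
    headers={"Authorization": f"Bearer {token}"},
    params={"job_id": job_id},
)
resp.raise_for_status()

# The "settings" block includes the schedule, which can be committed to Git
print(json.dumps(resp.json()["settings"].get("schedule", {}), indent=2))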
A data engineering team is in the process of converting their existing data pipeline to utilize Auto Loader for incremental processing in the ingestion of JSON files. One data engineer comes across the following code block in the Auto Loader documentation:
streaming_df = (spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("cloudFiles.schemaLocation", schemaLocation)
    .load(sourcePath)
)
Assuming that schemaLocation and sourcePath have been set correctly, which of the following changes does the data engineer need to make to convert this code block to use Auto Loader to ingest the data?
A. The data engineer needs to change the format("cloudFiles") line to format("autoLoader").
B. There is no change required. Databricks automatically uses Auto Loader for streaming reads.
C. There is no change required. The inclusion of format("cloudFiles") enables the use of Auto Loader.
D. The data engineer needs to add the .autoLoader line before the .load(sourcePath) line.
E. There is no change required. The data engineer needs to ask their administrator to turn on Auto Loader
Answer:
C. There is no change required. The inclusion of format("cloudFiles") enables the use of Auto Loader.
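For reference, a minimal sketch pairing the cloudFiles read from the question with a streaming write (checkpointPath and the target table name are assumptions; schemaLocation and sourcePath are taken from the question):

streaming_df = (spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("cloudFiles.schemaLocation", schemaLocation)
    .load(sourcePath)
)

# Write the incrementally ingested records out to a Delta table
(streaming_df.writeStream
    .option("checkpointLocation", checkpointPath)
    .outputMode("append")
    .table("bronze_events")  # hypothetical target table
)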
A data engineer has ingested data from an external source into a PySpark DataFrame raw_df. They need to briefly make this data available in SQL for a data analyst to perform a quality assurance check on the data.
Which of the following commands should the data engineer run to make this data available in SQL for only the remainder of the Spark session?
A. raw_df.createOrReplaceTempView("raw_df")
B. raw_df.createTable("raw_df")
C. raw_df.write.save("raw_df")
D. raw_df.saveAsTable("raw_df")
E. There is no way to share data between PySpark and SQL
Answer:
A. raw_df.createOrReplaceTempView("raw_df")
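For reference, a minimal sketch of the temp view workflow (the stand-in DataFrame and the count query are assumptions for illustration):

# A throwaway DataFrame standing in for raw_df from the question
raw_df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])

# Register a temporary view that exists only for the current Spark session
raw_df.createOrReplaceTempView("raw_df")

# The analyst can now run SQL against it, e.g. a quick row count
spark.sql("SELECT COUNT(*) AS row_count FROM raw_df").show()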
A data architect is designing a data model that works for both video-based machine learning workloads and highly audited batch ETL/ELT workloads.
Which of the following describes how using a data lakehouse can help the data architect meet the needs of both workloads?
A. A data lakehouse requires very little data modeling.
B. A data lakehouse combines compute and storage for simple governance.
C. A data lakehouse provides autoscaling for compute clusters.
D. A data lakehouse stores unstructured data and is ACID-compliant.
E. A data lakehouse fully exists in the cloud.
Answer:
D. A data lakehouse stores unstructured data and is ACID-compliant.
Which of the following data workloads will utilize a Silver table as its source?
A. A job that enriches data by parsing its timestamps into a human-readable format
B. A job that queries aggregated data that already feeds into a dashboard
C. A job that ingests raw data from a streaming source into the Lakehouse
D. A job that aggregates cleaned data to create standard summary statistics
E. A job that cleans data by removing malformatted records
Answer:
D. A job that aggregates cleaned data to create standard summary statistics
Which of the following describes a scenario in which a data engineer will want to use a Job cluster instead of an all-purpose cluster?
A. An ad-hoc analytics report needs to be developed while minimizing compute costs.
B. A data team needs to collaborate on the development of a machine learning model.
C. An automated workflow needs to be run every 30 minutes.
D. A Databricks SQL query needs to be scheduled for upward reporting.
E. A data engineer needs to manually investigate a production error.
Answer:
C. An automated workflow needs to be run every 30 minutes.
Which of the following benefits does Delta Live Tables provide for ELT pipelines over standard data pipelines that utilize Spark and Delta Lake on Databricks?
A. The ability to declare and maintain data table dependencies
B. The ability to write pipelines in Python and/or SQL
C. The ability to access previous versions of data tables
D. The ability to automatically scale compute resources
E. The ability to perform batch and streaming queries
Answer:
A. The ability to declare and maintain data table dependencies
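For reference, a minimal sketch of how Delta Live Tables declares a dependency between two datasets in Python (table names and the source path are hypothetical; this code only runs inside a DLT pipeline):

import dlt
from pyspark.sql.functions import col

@dlt.table
def orders_raw():
    # Hypothetical raw source
    return spark.read.format("json").load("/data/orders")

@dlt.table
def orders_clean():
    # Referencing another dataset with dlt.read() is what lets DLT build and
    # maintain the dependency graph between tables automatically
    return dlt.read("orders_raw").where(col("order_id").isNotNull())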
A new data engineer has started at a company. The data engineer has recently been added to the company’s Databricks workspace as new.engineer@company.com. The data engineer needs to be able to query the table sales in the database retail. The new data engineer already has been granted USAGE on the database retail.
Which of the following commands can be used to grant the appropriate permissions to the new data engineer?
A. GRANT USAGE ON TABLE sales TO new.engineer@company.com;
B. GRANT CREATE ON TABLE sales TO new.engineer@company.com;
C. GRANT SELECT ON TABLE sales TO new.engineer@company.com;
D. GRANT USAGE ON TABLE new.engineer@company.com TO sales;
E. GRANT SELECT ON TABLE new.engineer@company.com TO sales;
Answer:
C. GRANT SELECT ON TABLE sales TO new.engineer@company.com;
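For reference, a minimal sketch of the grant executed from a Python notebook cell (run by the table owner or an administrator; in practice the principal usually needs backticks when the user name contains special characters):

spark.sql("GRANT SELECT ON TABLE retail.sales TO `new.engineer@company.com`")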
A data engineer has created a Delta table as part of a data pipeline. Downstream data analysts now need SELECT permission on the Delta table.
Assuming the data engineer is the Delta table owner, which part of the Databricks Lakehouse Platform can the data engineer use to grant the data analysts the appropriate access?
A. Repos
B. Jobs
C. Data Explorer
D. Databricks Filesystem
E. Dashboards
Answer:
C. Data Explorer
A data engineering team has been using a Databricks SQL query to monitor the performance of an ELT job. The ELT job is triggered by a specific number of input records being ready to process. The Databricks SQL query returns the number of minutes since the job’s most recent runtime.
Which of the following approaches can enable the data engineering team to be notified if the ELT job has not been run in an hour?
A. They can set up an Alert for the accompanying dashboard to notify them if the returned value is greater than 60.
B. They can set up an Alert for the query to notify when the ELT job fails.
C. They can set up an Alert for the accompanying dashboard to notify when it has not refreshed in 60 minutes.
D. They can set up an Alert for the query to notify them if the returned value is greater than 60.
E. This type of alerting is not possible in Databricks.
Answer:
D. They can set up an Alert for the query to notify them if the returned value is greater than 60.
A data engineering team needs to query a Delta table to extract rows that all meet the same condition. However, the team has noticed that the query is running slowly. The team has already tuned the size of the data files. Upon investigating, the team has concluded that the rows meeting the condition are sparsely located throughout each of the data files.
Based on the scenario, which of the following optimization techniques could speed up the query?
A. Data skipping
B. Z-Ordering
C. Bin-packing
D. Write as a Parquet file
E. Tuning the file size
Answer:
B. Z-Ordering
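For reference, a minimal sketch of applying Z-Ordering from Python (the table and filter column names are hypothetical):

# Co-locate rows by the commonly filtered column so data skipping can prune files
spark.sql("OPTIMIZE sales_events ZORDER BY (event_type)")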
A data engineer has three notebooks in an ELT pipeline. The notebooks need to be executed in a specific order for the pipeline to complete successfully. The data engineer would like to use Delta Live Tables to manage this process.
Which of the following steps must the data engineer take as part of implementing this pipeline using Delta Live Tables?
A. They need to create a Delta Live Tables pipeline from the Data page.
B. They need to create a Delta Live Tables pipeline from the Jobs page.
C. They need to create a Delta Live tables pipeline from the Compute page.
D. They need to refactor their notebook to use Python and the dlt library
E. They need to refactor their notebook to use SQL and CREATE LIVE TABLE keyword.
Answer:
B. They need to create a Delta Live Tables pipeline from the Jobs page.
A new data engineer new.engineer@company.com has been assigned to an ELT project. The new data engineer will need full privileges on the table sales to fully manage the project.
Which of the following commands can be used to grant full permissions on the table to the new data engineer?
A. GRANT ALL PRIVILEGES ON TABLE sales TO new.engineer@company.com;
B. GRANT USAGE ON TABLE sales TO new.engineer@company.com;
C. GRANT ALL PRIVILEGES ON TABLE new.engineer@company.com TO sales;
D. GRANT SELECT ON TABLE sales TO new.engineer@company.com;
E. GRANT SELECT CREATE MODIFY ON TABLE sales TO new.engineer@company.com;
Answer:
A. GRANT ALL PRIVILEGES ON TABLE sales TO new.engineer@company.com;
Which of the following describes how Databricks Repos can help facilitate CI/CD workflows on the Databricks Lakehouse Platform?
A. Databricks Repos can facilitate the pull request, review, and approval process before merging branches
B. Databricks Repos can merge changes from a secondary Git branch into a main Git branch
C. Databricks Repos can be used to design, develop, and trigger Git automation pipelines
D. Databricks Repos can store the single-source-of-truth Git repository
E. Databricks Repos can commit or push code changes to trigger a CI/CD process
Answer:
E. Databricks Repos can commit or push code changes to trigger a CI/CD process
A data engineer has developed a code block to perform a streaming read on a data source. The code block is below:
(spark
.read
.schema(schema)
.format("cloudFiles")
.option("cloudFiles.format", "json")
.load(dataSource) )
The code block is returning an error.
Which of the following changes should be made to the code block to configure it to successfully perform a streaming read?
A. The .read line should be replaced with .readStream.
B. A new .stream line should be added after the .read line.
C. The .format("cloudFiles") line should be replaced with .format("stream").
D. A new .stream line should be added after the spark line.
E. A new .stream line should be added after the .load(dataSource) line.
Answer:
A. The .read line should be replaced with .readStream.
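For reference, the corrected block with .read replaced by .readStream (schema and dataSource are assumed to be defined as in the question):

streaming_df = (spark
    .readStream
    .schema(schema)
    .format("cloudFiles")
    .option("cloudFiles.format", "json")
    .load(dataSource)
)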
A data engineer is overwriting data in a table by deleting the table and recreating the table. Another data engineer suggests that this is inefficient and the table should simply be overwritten instead.
Which of the following is NOT a reason to overwrite the table instead of deleting and recreating it?
A. Overwriting a table is efficient because no files need to be deleted.
B. Overwriting a table results in a clean table history for logging and audit purposes.
C. Overwriting a table maintains the old version of the table for Time Travel.
D. Overwriting a table is an atomic operation and will not leave the table in an unfinished state.
E. Overwriting a table allows for concurrent queries to be completed while in progress.
Answer:
B. Overwriting a table results in a clean table history for logging and audit purposes.
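For reference, a minimal sketch of overwriting a Delta table in place (the table name and stand-in DataFrame are hypothetical):

# Stand-in for the refreshed data
updated_df = spark.range(10).withColumnRenamed("id", "sale_id")

# Overwrite atomically; old files are logically replaced, not deleted up front
(updated_df.write
    .format("delta")
    .mode("overwrite")
    .saveAsTable("sales_summary")
)

# Earlier versions remain reachable through Time Travel, e.g.:
# spark.sql("SELECT * FROM sales_summary VERSION AS OF 0")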
A Delta Live Table pipeline includes two datasets defined using STREAMING LIVE TABLE. Three datasets are defined against Delta Lake table sources using LIVE TABLE.
The pipeline is configured to run in Development mode using the Triggered Pipeline Mode.
Assuming previously unprocessed data exists and all definitions are valid, which of the following is the expected outcome after clicking Start to update the pipeline?
A. All datasets will be updated once and the pipeline will shut down. The compute resources will be terminated.
B. All datasets will be updated at set intervals until the pipeline is shut down. The compute resources will be deployed for the update and terminated when the pipeline is stopped.
C. All datasets will be updated at set intervals until the pipeline is shut down. The compute resources will persist after the pipeline is stopped to allow for additional testing.
D. All datasets will be updated once and the pipeline will shut down. The compute resources will persist to allow for additional testing.
E. All datasets will be updated continuously and the pipeline will not shut down. The compute resources will persist with the pipeline.
Answer:
D. All datasets will be updated once and the pipeline will shut down. The compute resources will persist to allow for additional testing.
A dataset has been defined using Delta Live Tables and includes an expectations clause:
CONSTRAINT valid_timestamp EXPECT (timestamp > '2020-01-01')
Which of the following is the expected behavior when a batch of data containing records that violate this constraint is processed?
A. Records that violate the expectation are added to the target dataset and recorded as invalid in the event log.
B. Records that violate the expectation are dropped from the target dataset and recorded as invalid in the event log.
C. Records that violate the expectation cause the job to fail.
D. Records that violate the expectation are added to the target dataset and flagged as invalid in a field added to the target dataset.
E. Records that violate the expectation are dropped from the target dataset and loaded into a quarantine table.
Answer:
A. Records that violate the expectation are added to the target dataset and recorded as invalid in the event log.
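For reference, a minimal sketch of the same expectation expressed with the Python dlt API (table and source names are hypothetical; with a bare expectation like this, violating records are retained in the target and the violation counts surface in the event log):

import dlt

@dlt.table
@dlt.expect("valid_timestamp", "timestamp > '2020-01-01'")
def events_validated():
    return dlt.read("events_raw")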