MapReduce
Hadoop
HDFS Deep Dive: Storage & Fault Tolerance
YARN & Beyond: The Resource Manager
SPARK
100

This is the programming model Hadoop uses to process large datasets in parallel across a cluster.

What is MapReduce?

100

This is the distributed file system at the heart of Hadoop's storage layer.

What is HDFS (Hadoop Distributed File System)?

100

This Apache Hadoop component acts as the central authority for managing resources in a cluster.

What is YARN (Yet Another Resource Negotiator)?

100

This Apache Hadoop component acts as the central authority for managing resources in a cluster.

What is YARN (Yet Another Resource Negotiator)?

100

This open-source framework is known for its in-memory processing capabilities and is part of the Hadoop ecosystem.

What is Apache Spark?

200

The phase where MapReduce combines all values associated with the same key.

What is the reduce phase?

200

This is the distributed file system at the heart of Hadoop's storage layer.

What is HDFS (Hadoop Distributed File System)?

200

In YARN, this master daemon arbitrates cluster resources among all running applications.

What is the ResourceManager?

200

In YARN, this master daemon arbitrates cluster resources among all running applications.

What is the ResourceManager?

200

Spark’s primary data abstraction, which represents an immutable distributed collection of objects.

What is a Resilient Distributed Dataset (RDD)?

300

If a mapper outputs <"apple", 1> three times, the combiner’s output would be this.

What is <"apple", 3>? (Tests understanding of local aggregation.)
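The local aggregation in this clue can be sketched in plain Python. This is only an illustration, assuming a summing combiner (as in word count); the `combine` function and the sample data are hypothetical and are not part of the Hadoop API.

```python
from collections import defaultdict


def combine(pairs):
    """Sum the values for each key within a single mapper's output.

    Assumption: the combiner performs a sum, as in a word-count job.
    """
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    # Sort by key so the result is deterministic.
    return sorted(totals.items())


# One mapper emitted <"apple", 1> three times.
mapper_output = [("apple", 1), ("apple", 1), ("apple", 1)]
combined = combine(mapper_output)
print(combined)  # [('apple', 3)]
```

Only one pair per key then leaves the mapper's node, which is why a combiner reduces the amount of data sent over the network during the shuffle.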

300

The two primary roles of nodes in HDFS, where one stores metadata and the other stores actual data blocks.

What are the NameNode and DataNode?

300

This YARN component runs on each node and manages containers for tasks.

What is the NodeManager?

300

This YARN feature allows multiple applications to share cluster resources efficiently by enforcing limits.

What are Resource Queues (or Fair Scheduler/Capacity Scheduler)?

300

The programming language Spark was originally written in and still primarily supports.

What is Scala?

400

Unlike the mapper, this step runs after the shuffle phase and before the reducer.

What is the sort (merge) phase? (The partitioner, by contrast, runs on the map side, before the shuffle.)
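To see how partitioning and sorting fit together, here is a toy simulation in plain Python. It is a sketch under stated assumptions, not Hadoop code: `partition` stands in for a hash-style partitioner (using `zlib.crc32` so results are stable across runs), and `shuffle_and_sort` is a hypothetical helper that routes each pair to a reducer and then sorts each reducer's input by key.

```python
import zlib
from collections import defaultdict


def partition(key, num_reducers):
    """Pick a reducer for a key (toy stand-in for a hash partitioner)."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def shuffle_and_sort(map_output, num_reducers):
    """Route each (key, value) pair to its reducer, then sort by key."""
    partitions = defaultdict(list)
    for key, value in map_output:
        partitions[partition(key, num_reducers)].append((key, value))
    # Each reducer sees its pairs ordered by key before reduce() runs.
    return {
        reducer: sorted(pairs, key=lambda kv: kv[0])
        for reducer, pairs in partitions.items()
    }


map_output = [("pear", 1), ("apple", 1), ("pear", 1), ("fig", 1), ("apple", 1)]
reducer_inputs = shuffle_and_sort(map_output, num_reducers=2)
for reducer, pairs in sorted(reducer_inputs.items()):
    print(reducer, pairs)
```

Every occurrence of a key lands on the same reducer, and within each reducer the pairs arrive grouped and ordered by key.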

400

This Hadoop configuration file defines default block size, replication factor, and other HDFS settings.

What is hdfs-site.xml?

400

This YARN feature allows multiple applications to share cluster resources efficiently by enforcing limits.

What are Resource Queues (or Fair Scheduler/Capacity Scheduler)?

400

If a NodeManager fails, this YARN mechanism ensures tasks are rescheduled on healthy nodes.

What is fault tolerance (or container reallocation)?

400

This optimization technique in Spark avoids recomputation by persisting intermediate RDDs.

What is caching (or persistence)?

500

A job with 100 reducers runs slower than one with 10, likely due to this overhead.

What is excessive network traffic (or shuffle overhead)?

500

Unlike HDFS, this Hadoop component manages resources and schedules tasks across the cluster.

What is YARN (Yet Another Resource Negotiator)?

500

If a NodeManager fails, this YARN mechanism ensures tasks are rescheduled on healthy nodes.

What is fault tolerance (or container reallocation)?

500

This YARN concept represents a collection of physical resources (CPU, RAM) allocated to a task.

What is a Container?

500

If a Spark job fails due to a lost executor, this feature ensures recovery by recomputing lost partitions.


What is lineage (or RDD lineage)?
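The idea behind lineage can be illustrated with a toy model in plain Python. This is not Spark's API: `ToyRDD`, `compute`, and `lose_partition` are hypothetical names invented for this sketch. The point is that the dataset remembers its source partitions and the chain of transformations, so a lost partition can be rebuilt by replaying that chain.

```python
class ToyRDD:
    """Toy model of an immutable dataset that records its lineage."""

    def __init__(self, source_partitions, transforms=()):
        self.source_partitions = source_partitions  # list of lists
        self.transforms = tuple(transforms)          # the recorded lineage
        self.cache = {}                              # computed partitions

    def map(self, fn):
        # Return a new dataset; the original is left unchanged.
        return ToyRDD(self.source_partitions, self.transforms + (fn,))

    def compute(self, index):
        # Rebuild a partition from its source by replaying every transform.
        if index not in self.cache:
            data = self.source_partitions[index]
            for fn in self.transforms:
                data = [fn(x) for x in data]
            self.cache[index] = data
        return self.cache[index]

    def lose_partition(self, index):
        # Simulate an executor failure that drops one computed partition.
        self.cache.pop(index, None)


rdd = ToyRDD([[1, 2], [3, 4]]).map(lambda x: x * 10).map(lambda x: x + 1)
print(rdd.compute(1))   # [31, 41]
rdd.lose_partition(1)   # the partition is gone from the cache
print(rdd.compute(1))   # [31, 41], recomputed from lineage
```

Only the lost partition is recomputed; partition 0 is never touched in this example.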
