This is the programming model Hadoop uses to process large datasets in parallel across a cluster.
What is MapReduce?
This is the distributed file system at the heart of Hadoop's storage layer.
What is HDFS (Hadoop Distributed File System)?
This Apache Hadoop component acts as the central authority for managing resources in a cluster.
What is YARN (Yet Another Resource Negotiator)?
This Apache Hadoop component acts as the central authority for managing resources in a cluster.
What is YARN (Yet Another Resource Negotiator)?
This open-source framework is known for its in-memory processing capabilities and is part of the Hadoop ecosystem.
What is Apache Spark?
The phase where MapReduce combines all values associated with the same key.
What is the reduce phase?
This is the distributed file system at the heart of Hadoop's storage layer.
What is HDFS (Hadoop Distributed File System)?
In YARN, this component negotiates resources and manages application execution.
What is the ResourceManager?
In YARN, this component negotiates resources and manages application execution.
What is the ResourceManager?
Spark’s primary data abstraction, which represents an immutable distributed collection of objects.
What is a Resilient Distributed Dataset (RDD)?
If a mapper outputs <"apple", 1> three times, the combiner’s output would be this.
What is <"apple", 3>? (Tests understanding of local aggregation.)
The two primary roles of nodes in HDFS, where one stores metadata and the other stores actual data blocks.
What are the NameNode and DataNode?
This YARN component runs on each node and manages containers for tasks.
What is the NodeManager?
This YARN feature allows multiple applications to share cluster resources efficiently by enforcing limits.
What are Resource Queues (or Fair Scheduler/Capacity Scheduler)?
The programming language Spark was originally written in and still primarily supports.
What is Scala?
Unlike the mapper, this component runs after the shuffle phase and before the reducer.
What is the partitioner? (Or: What is the sort phase?)
This Hadoop configuration file defines default block size, replication factor, and other HDFS settings.
What is hdfs-site.xml?
This YARN feature allows multiple applications to share cluster resources efficiently by enforcing limits.
What are Resource Queues (or Fair Scheduler/Capacity Scheduler)?
If a NodeManager fails, this YARN mechanism ensures tasks are rescheduled on healthy nodes.
What is fault tolerance (or container reallocation)?
This optimization technique in Spark avoids recomputation by persisting intermediate RDDs.
What is caching (or persistence)?
A job with 100 reducers runs slower than one with 10, likely due to this overhead.
What is excessive network traffic (or shuffle overhead)?
Unlike HDFS, this Hadoop component manages resources and schedules tasks across the cluster.
What is YARN (Yet Another Resource Negotiator)?
if a NodeManager fails, this YARN mechanism ensures tasks are rescheduled on healthy nodes.
What is fault tolerance (or container reallocation)?
This YARN concept represents a collection of physical resources (CPU, RAM) allocated to a task.
What is a Container?
If a Spark job fails due to a lost executor, this feature ensures recovery by recomputing lost partitions.
What is lineage (or RDD lineage)?