13.5.15

Figures in Hadoop: The Definitive Guide

Figure 1-1. Structure of the book: there are various pathways through the book





Figure 2-1. MapReduce logical data flow


  • MapReduce works by breaking the processing into two phases:
    • the map phase and
    • the reduce phase
  • Each phase has key-value pairs as input and output, the types of which may be chosen by the programmer (see the sketch after this list)
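As a hedged illustration of these two phases, here is a minimal word-count sketch in Java (the class names are illustrative assumptions, not the book's weather example; both classes are shown in one listing for brevity):

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    // Map phase: turns input pairs of (file offset, line of text) into
    // output pairs of (word, 1). The key and value types of each phase
    // are chosen by the programmer via the generic parameters.
    public class WordCountMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String word : value.toString().split("\\s+")) {
                if (!word.isEmpty()) {
                    context.write(new Text(word), new IntWritable(1));
                }
            }
        }
    }

    // Reduce phase: receives each word together with all the 1s emitted
    // for it by the map phase, and sums them into a final count.
    class WordCountReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values,
                Context context) throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }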

Figure 2-2. Data-local (a), rack-local (b), and off-rack (c) map tasks


Figure 2-3. MapReduce data flow with a single reduce task



Figure 2-4. MapReduce data flow with multiple reduce tasks


Figure 2-5. MapReduce data flow with no reduce tasks


Figure 3-1. Accessing HDFS over HTTP directly and via a bank of HDFS proxies



Figure 3-2. A client reading data from HDFS


Figure 3-3. Network distance in Hadoop

Figure 3-4. A client writing data to HDFS

Figure 3-5. A typical replica pipeline


Figure 4-1. YARN applications



Figure 4-2. How YARN runs an application



Notice from Figure 4-2 that YARN itself does not provide any way for the parts of the application (client, master, and process) to communicate with one another.
The difference between schedulers is illustrated in Figure 4-3, which shows that under the FIFO Scheduler (i), the small job is blocked until the large job completes.
With the Capacity Scheduler (ii in Figure 4-3), a separate dedicated queue allows the small job to start as soon as it is submitted, at the cost of lower overall cluster utilization, since the queue capacity is reserved for jobs in that queue.
With the Fair Scheduler (iii in Figure 4-3), there is no need to reserve a set amount of capacity, since it will dynamically balance resources between all running jobs.

Figure 4-3. Cluster utilization over time when running a large job and a small job under the FIFO Scheduler (i), Capacity Scheduler (ii), and Fair Scheduler (iii)
Figure 4-3 contrasts the basic operation of the Capacity Scheduler and the Fair Scheduler.
As we saw in Figure 4-3, a single job does not use more resources than its queue's capacity.
Over time, all running applications get the same share of resources. We saw in Figure 4-3 how fair sharing works for applications in different queues.
Imagine two users, A and B, each with their own queue; see Figure 4-4. A starts a job, and it is allocated all the resources available, since there is no demand from B.

Figure 4-4. Fair sharing between user queues
Hadoop comes with a large selection of Writable classes, available in the org.apache.hadoop.io package. They form the class hierarchy shown in Figure 5-1.

Figure 5-1. Writable class hierarchy
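A minimal sketch of one of these Writables in use (IntWritable, which wraps a mutable Java int):

    import org.apache.hadoop.io.IntWritable;

    public class WritableDemo {
        public static void main(String[] args) {
            // IntWritable is Hadoop's serializable wrapper for an int.
            IntWritable writable = new IntWritable();
            writable.set(163);
            System.out.println(writable.get()); // prints 163
        }
    }
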
A sequence file consists of a header followed by one or more records (see Figure 5-2).

Figure 5-2. The internal structure of a sequence file with no compression and with record compression
Block compression compresses multiple records at once, so it has the opportunity to take advantage of similarities between records. (See Figure 5-3.) Records are added to a block until the block reaches a minimum size in bytes.

Figure 5-3. The internal structure of a sequence file with block compression
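A minimal sketch, assuming a default local Configuration and a hypothetical output path, of writing such a file with the SequenceFile API:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;

    public class SequenceFileWriteDemo {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Path path = new Path("numbers.seq"); // hypothetical output path
            // The writer emits the header, then appends key-value records.
            try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                    SequenceFile.Writer.file(path),
                    SequenceFile.Writer.keyClass(IntWritable.class),
                    SequenceFile.Writer.valueClass(Text.class))) {
                for (int i = 0; i < 100; i++) {
                    writer.append(new IntWritable(i), new Text("record-" + i));
                }
            }
        }
    }
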
In column-oriented storage, the rows in a file are broken up into splits, and each column's values are stored together. This is shown diagrammatically in Figure 5-4.

Figure 5-4. Row-oriented versus column-oriented storage
In a column-oriented layout, columns that are not accessed in a query can be skipped. Consider a query of the table in Figure 5-4 that processes only column 2: with column-oriented storage, only the column 2 portions of the file need to be read.
A screenshot of the home page is shown in Figure 6-1. The "Cluster Metrics" section gives a summary of the cluster.
Figure 6-1. Screenshot of the resource manager page
Clicking on a job's link takes us to the job page, illustrated in Figure 6-2.

Figure 6-2. Screenshot of the job page
The tasks page shows information for all of the map tasks on one page. The screenshot in Figure 6-3 shows this page for the job's map tasks.

Figure 6-3. Screenshot of the tasks page
The states of an Oozie workflow and the allowed transitions between them are shown in Figure 6-4.
Figure 6-4. Transition diagram of an Oozie workflow
The whole process is illustrated in Figure 7-1. At the highest level, there are five independent entities: the client, the YARN resource manager, the YARN node managers, the MapReduce application master, and the distributed filesystem.
Figure 7-1. How Hadoop runs a MapReduce job
The submit() method on Job creates an internal JobSubmitter instance and calls submitJobInternal() on it (step 1 in Figure 7-1). Having submitted the job, waitForCompletion() polls the job's progress once per second and reports the progress to the console if it has changed since the last report.
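A minimal, hedged sketch of the driver code that goes through this submission path (it uses the built-in identity Mapper and Reducer so the listing stays self-contained; input and output paths come from the command line):

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class SubmitDemo {
        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance();
            job.setJarByClass(SubmitDemo.class);
            job.setMapperClass(Mapper.class);    // identity map
            job.setReducerClass(Reducer.class);  // identity reduce
            job.setOutputKeyClass(LongWritable.class);
            job.setOutputValueClass(Text.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            // waitForCompletion() submits the job, then polls its progress
            // once per second until it finishes.
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }
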
Streaming runs special map and reduce tasks for the purpose of launching the user-supplied executable and communicating with it (Figure 7-2).

Figure 7-2. The relationship of the Streaming executable to the node manager and the task container
The process is illustrated in Figure 7-3.

Figure 7-3. How status updates are propagated through the MapReduce system
When the map function starts producing output, it is not simply written to disk; the process involves buffering writes in memory and doing some presorting for efficiency reasons. Figure 7-4 shows what happens.
Figure 7-4. Shuffle and sort in MapReduce
The goal is to merge the minimum number of files to get to the merge factor for the final round. The process is illustrated in Figure 7-5.
Figure 7-5. Efficiently merging 40 file segments with a merge factor of 10
The separators used in a Streaming MapReduce job are listed in Table 8-3 and shown in a diagram of the data flow path in Figure 8-1.

Figure 8-1. Where separators are used in a Streaming MapReduce job
FileInputFormat is the base class for all implementations of InputFormat that use files as their data source (see Figure 8-2). It provides two things: a place to define which files are included as the input to a job, and an implementation for generating splits for the input files.

Figure 8-2. InputFormat class hierarchy
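A minimal sketch of using this hierarchy to configure a job's input (the input path is a hypothetical example):

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

    public class InputSetup {
        public static void configure(Job job) throws Exception {
            // Pick a concrete InputFormat from the hierarchy in Figure 8-2.
            job.setInputFormatClass(TextInputFormat.class);
            // Define which files are included as the input to the job.
            FileInputFormat.addInputPath(job, new Path("/data/input"));
        }
    }
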
Figure 8-3 shows an example. A single file is broken into lines, and the line boundaries do not correspond with the HDFS block boundaries.
Figure 8-3. Logical records and HDFS blocks for TextInputFormat
Hadoop has output data formats that correspond to the input formats covered in the previous section. The OutputFormat class hierarchy appears in Figure 8-4.

Figure 8-4. OutputFormat class hierarchy
The data can be summarized as a histogram of temperature buckets. For example, Figure 9-1 shows the distribution for buckets of a fixed size.

Figure 9-1. Temperature distribution for the weather dataset
In the join output, fields from both datasets are inlined in each output row. This is illustrated in Figure 9-2.

Figure 9-2. Inner join of two datasets
A typical two-level network architecture for a Hadoop cluster is illustrated in Figure 10-1. Typically there are 30 to 40 servers per rack (only three are shown in the figure).

Figure 10-1. Typical two-level network architecture for a Hadoop cluster
For the network in Figure 10-1, the rack topology is described by two network locations, say /switch1/rack1 and /switch1/rack2.
Kerberos authentication relies on a trusted third party, the Key Distribution Center (KDC). The process is shown graphically in Figure 10-2.

Figure 10-2. The three-step Kerberos ticket exchange protocol
The checkpointing process proceeds as follows (and is shown schematically in Figure 11-1 for the edit log and image files).

Figure 11-1. The checkpointing process
(The difference between row-oriented and column-oriented storage is illustrated in Figure 5-4.) This chapter looks at Parquet in more depth, but there are other columnar formats that work with Hadoop, too.
Each column chunk is written in pages; this is illustrated in Figure 13-1.
Figure 13-1. The internal structure of a Parquet file
The system is illustrated in Figure 14-1.
Figure 14-1. Flume agent with a spooling directory source and a logger sink connected by a file channel
The configuration works just like in the previous examples. The flow is illustrated in Figure 14-2.
Figure 14-2. Flume agent with a spooling directory source and fanning out to an HDFS sink and a logger sink
A second tier of agents can aggregate events from the first tier before writing them to HDFS; see Figure 14-3. Further tiers may be warranted for very large numbers of source nodes.
Figure 14-3. Using a second agent tier to aggregate Flume events from the first tier
The system is illustrated in Figure 14-4.
Figure 14-4. Two Flume agents connected by an Avro sink-source pair
An agent may use multiple sinks for load balancing or failover purposes; see Figure 14-5. If a second-tier agent is unavailable, then events will be delivered to another second-tier agent.

Figure 14-5. Using multiple sinks for load balancing or failover
A diagram of the whole system is shown in Figure 14-6.
Figure 14-6. Load balancing between two agents
At a high level, Figure 15-1 demonstrates how Sqoop interacts with both the database source and Hadoop.
Figure 15-1. Sqoop’s import process
Rows are laid out on disk as shown in Figure 15-2. When scanning through rows to determine which rows match the criteria for a particular query, the entire contents of each row must be read from disk.
Large objects are often stored externally to their rows, as in Figure 15-3. Accessing a large object often requires "opening" it through the reference contained in the row.
Figure 15-2. Database tables are typically physically represented as an array of rows, with all the columns in a row stored adjacently

Figure 15-3. Large objects are usually held in a separate area of storage; the main row storage contains indirect references to the large objects
Sqoop performs exports in parallel, similar to the way it performs imports. (See Figure 15-4.) Before performing the export, Sqoop picks a strategy based on the database connect string.
Figure 15-4. Exports are performed in parallel using MapReduce
The relationship between Hive clients and Hive services is illustrated in Figure 17-1.

Figure 17-1. Hive architecture
There are several possible metastore configurations (see Figure 17-2).
Figure 17-2. Metastore configurations
The data flow with partial results for a UDAF is illustrated in Figure 17-3.

Figure 17-3. Data flow with partial results for a UDAF
The plan diagram generated from this pipeline is shown in Figure 18-1.
Figure 18-1. Plan diagram for a Crunch pipeline for calculating a histogram of word counts
A job run starts with a call to runJob() on the SparkContext (step 1 in Figure 19-1), which passes the call on to the scheduler that runs as a part of the driver.

Figure 19-1. How Spark runs a job
One of the transformations forces a shuffle stage. The resulting DAG is illustrated in Figure 19-2.

Figure 19-2. The stages and RDDs in a Spark job for calculating a histogram of word counts
Going back to Figure 19-1, once the DAG scheduler has constructed the complete DAG of stages, it submits each stage's set of tasks to the task scheduler.
Assigned tasks are launched through a scheduler backend (step 4 in Figure 19-1), which sends a remote launch-task message to the executor backend to tell the executor to run the task.
In YARN client mode, a SparkContext instance is constructed by the driver program (step 1 in Figure 19-3). The context submits a YARN application to the YARN resource manager.
Figure 19-3. How Spark executors are started in YARN client mode
The spark-submit client will launch the YARN application (step 1 in Figure 19-4), but it doesn't run any user code.

Figure 19-4. How Spark executors are started in YARN cluster mode
The HBase data model is shown in Figure 20-1.

Figure 20-1. The HBase data model, illustrated for a table storing photos
An HBase cluster is made up of an HBase master node orchestrating one or more regionserver workers (see Figure 20-2). The HBase master is responsible for bootstrapping a fresh install, assigning regions to registered regionservers, and recovering from regionserver failures.

Figure 20-2. HBase cluster members
Group membership is represented by a parent znode for the group, with child znodes named after the group members (servers). This is shown in Figure 21-1.

Figure 21-1. ZooKeeper znodes
A client may be connected to the leader, but it has no control over this and cannot even know if this is the case. See Figure 21-2.
Figure 21-2. Reads are satisfied by followers, whereas writes are committed by the leader
The ZooKeeper object transitions through different states in its lifecycle (see Figure 21-3). You can query its state at any time by using the getState() method, as in the sketch below:
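(A minimal sketch; the connection string and session timeout below are assumptions.)

    import org.apache.zookeeper.ZooKeeper;

    public class StateDemo {
        public static void main(String[] args) throws Exception {
            // Connect to a (hypothetical) local ensemble; a null watcher
            // is acceptable for this simple sketch.
            ZooKeeper zk = new ZooKeeper("localhost:2181", 15000, null);
            // getState() reports the handle's lifecycle state, for example
            // CONNECTING or CONNECTED (see Figure 21-3).
            ZooKeeper.States state = zk.getState();
            System.out.println(state);
            zk.close();
        }
    }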

Figure 21-3. ZooKeeper state transitions
If the session has expired, the ZooKeeper object will never reconnect (refer to Figure 21-3). We simply rethrow the exception and let the caller handle it.
The operational data flow is shown in Figure 22-1.
Figure 22-1. Operational data flow
Figure 22-2 shows how components can be wired into one another in novel ways.
Figure 22-2. Composable datasets and functions
eration”) sequencing technology; see Figure 22-3. Sequencing machines are scientific
Figure 22-3. Timeline of Big Data Technology and Cost of Sequencing a Genome
Each DNA base pairs with a fixed complement (A with T, and G with C); see Figure 22-4. This deterministic pairing enables a "copying mechanism": the DNA helix is unzipped and each strand serves as a template for building its complement.
Figure 22-4. DNA Double Helix Structure
The reads are aligned to the reference genome; see Figure 22-5. A complete human genome is about 3 billion base pairs long.

Figure 22-5. Aligning Reads to a Reference Genome
Data flows from one map to its reduce, then to the next map and the next reduce, and so on (Figure 22-6). That is, key-value pairs stream through a chain of MapReduce jobs.
Figure 22-6. Counting and sorting in MapReduce
In Cascading, operations are linked together by pipes (Figure 22-7).

Figure 22-7. Pipes linked by fields and tuples
There are several pipe types (Figure 22-8):

Figure 22-8. Pipe types
A pipe assembly is connected to input streams (sources) and output streams (sinks). See Figure 22-9.
Figure 22-9. A simple PipeAssembly
Pipes are linked with operations that are applied either to individual tuples or to groups of tuples (Figure 22-10):

Figure 22-10. Operation types
Operations are bound to pipes when the pipe assembly is created (Figure 22-11).
Figure 22-11. An assembly of operations
A pipe assembly is bound to source and sink Taps to form a Flow (Figure 22-12). That is, the same pipe assembly can be "instantiated" many times into new Flows.

Figure 22-12. A Flow
A Flow translates into chained MapReduce jobs, with intermediate files between them (Figure 22-13).
Figure 22-13. How a Flow translates to chained MapReduce jobs
In this overview, we will focus on the "log processing pipeline" (Figure 22-14).

Figure 22-14. The ShareThis log processing pipeline
Rather than just source and sink Taps, trap Taps were planned in (Figure 22-15). Normally, when an operation fails, the whole Flow fails; with traps, the offending data is saved for later inspection instead.

Figure 22-15. The ShareThis log processing Flow
