14.5.13

Storm?





  • Whereas Hadoop targets batch processing, Storm is an always-active service that receives and processes unbounded streams of data.
    • Like Hadoop, Storm is a distributed system that offers massive scalability for applications that store and manipulate big data. 
    • Unlike Hadoop, it delivers that data instantaneously, in realtime.
  • It is written primarily in Clojure and supports Java by default.
  • Use cases
    • Realtime analytics
    • Online machine learning
    • Continuous computation
    • Distributed RPC
    • ETL
  • How does Storm differ from Hadoop?
    • The simple answer is that Storm analyzes realtime data while Hadoop analyzes offline data.
    • In truth, the two frameworks complement one another more than they compete.
  • Hadoop
    • Provides its own file system (HDFS)
    • Manages both data and code/tasks.
    • It divides data into blocks and when a "job" executes, it pushes analysis code close to the data it is analyzing.
      • This is how Hadoop avoids the overhead of network communication in loading data -- keeping the analysis code next to the data enables Hadoop to read it faster by orders of magnitude.
    • MapReduce
      • Hadoop partitions data into chunks and passes those chunks to mappers that map keys to values.
      • Reducers then assemble those mapped key/value pairs into a usable output.
      • The MapReduce paradigm operates quite elegantly but is targeted at offline data analysis; a minimal word-count sketch follows this Hadoop section.
    • HDFS
      • In order to leverage the full power of Hadoop, application data must be stored in the HDFS file system.
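For contrast with Storm later on, here is a rough sketch of the classic word-count mapper and reducer written against Hadoop's org.apache.hadoop.mapreduce API. The class names are illustrative only, not from any particular project.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Mapper: maps each word in a line of input to the key/value pair (word, 1).
class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer tokens = new StringTokenizer(value.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            context.write(word, ONE);               // emit (word, 1)
        }
    }
}

// Reducer: assembles the mapped pairs into usable output by summing each word's counts.
class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable count : values) {
            sum += count.get();
        }
        context.write(key, new IntWritable(sum));   // emit (word, total)
    }
}
```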
  • Storm
    • Storm solves a different problem altogether.
    • Realtime
      • meaning right now
      • Storm is interested in understanding things that are happening in realtime, and interpreting them.
    • File System
      • Storm does not have its own file system.
    • Programming Paradigm
      • Its programming paradigm is quite a bit different from Hadoop's.
      • Storm is all about obtaining chunks of data from somewhere via components known as spouts, and passing that data through various processing components, known as bolts.
      • Storm's data processing mechanism is extremely fast and is meant to help you identify live trends as they are happening.
      • Unlike Hadoop, Storm doesn't care what happened yesterday or last week.
  • Architecture
    • At the highest level, Storm is comprised of topologies.
      • A topology is a graph of computations
        • Each node contains processing logic and each path between nodes indicates how data should be passed between nodes.
    • Inside topologies you have networks of streams, which are unbounded sequences of tuples.
      • Storm provides a mechanism to transform streams into new streams using spouts and bolts.
      • Spouts
        • Spouts generate streams; for example, a spout can pull data from a site like Twitter or Facebook and then publish it as a stream in an abstract format.
      • Bolts
        • Bolts consume input streams, process them, and then optionally generate new streams.
    • Tuples
      • Storm's data model is represented by tuples.
        • A tuple is a named list of values of any type.
        • Storm supports all primitive types, Strings, and byte arrays, and you can build your own serializer if you want to use your own object types.
      • Your spouts will "emit" tuples 
      • And your bolts will consume them.
      • Your bolts may also emit tuples if their output is destined to be processed by another bolt downstream.
      • Basically, emitting tuples is the mechanism for passing data from a spout to a bolt, or from a bolt to another bolt (a minimal sketch follows this list).
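Here is a minimal sketch of the spout/bolt/tuple model in Storm's Java API, as it looked in the 0.8/0.9-era backtype.storm packages (newer releases use org.apache.storm). WordSpout and ExclaimBolt are made-up names for illustration.

```java
import java.util.Map;

import backtype.storm.spout.SpoutOutputCollector;
import backtype.storm.task.TopologyContext;
import backtype.storm.topology.BasicOutputCollector;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.TopologyBuilder;
import backtype.storm.topology.base.BaseBasicBolt;
import backtype.storm.topology.base.BaseRichSpout;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Tuple;
import backtype.storm.tuple.Values;
import backtype.storm.utils.Utils;

// Spout: the source of a stream -- this one just emits random words forever.
public class WordSpout extends BaseRichSpout {
    private SpoutOutputCollector collector;
    private final String[] words = {"storm", "hadoop", "spout", "bolt"};

    @Override
    public void open(Map conf, TopologyContext context, SpoutOutputCollector collector) {
        this.collector = collector;
    }

    @Override
    public void nextTuple() {
        Utils.sleep(100);
        String word = words[(int) (Math.random() * words.length)];
        collector.emit(new Values(word));            // emit a one-field tuple
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("word"));        // name the tuple's single field
    }
}

// Bolt: consumes the spout's tuples and emits new tuples for any bolt downstream.
class ExclaimBolt extends BaseBasicBolt {
    @Override
    public void execute(Tuple tuple, BasicOutputCollector collector) {
        String word = tuple.getStringByField("word");
        collector.emit(new Values(word + "!"));      // transform and re-emit
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("exclaimed"));
    }
}

// Topology: the graph -- the spout feeds the bolt via a shuffle grouping.
class WordTopology {
    static TopologyBuilder build() {
        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("words", new WordSpout());
        builder.setBolt("exclaim", new ExclaimBolt()).shuffleGrouping("words");
        return builder;
    }
}
```

The spout emits tuples and the bolt consumes them; the bolt also emits tuples of its own, which is exactly the spout-to-bolt and bolt-to-bolt passing described above.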
This topology model is quite a normal architecture for an expandable or scalable computational system, or network. And basically, the name, topology (a graph), makes a lot of sense.

It's a directed graph. If we added a root node above the spouts, it would look like a tree, but with shared children. It reminds me of the Traveling Salesman Problem (TSP).
  • Storm Cluster
    • A Storm Cluster is somewhat similar to Hadoop clusters, but while a Hadoop cluster runs map-reduce jobs, Storm runs topologies.
      • Map-reduce jobs eventually end.
      • Topologies are destined to run until you explicitly kill them.
    • Storm clusters define two types of nodes
      • Master Node
        • This node runs a daemon process called Nimbus.
        • Nimbus is responsible for distributing code across the cluster, assigning tasks to machines, and monitoring the success and failure of units of work.
      • Worker Nodes
        • These nodes run a daemon process called the Supervisor.
        • A Supervisor is responsible for listening for work assignments for its machine.
        • It then starts and stops worker processes accordingly. Each worker process executes a subset of a topology, so that the execution of a topology is spread across a multitude of worker processes running on a multitude of machines (see the sizing sketch below).
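A short sketch of how that spreading over worker processes is configured, reusing the hypothetical WordSpout and ExclaimBolt from the earlier sketch: the parallelism hints control how many executor threads run each component, and setNumWorkers asks for a number of worker processes (JVMs) for the whole topology.

```java
import backtype.storm.Config;
import backtype.storm.topology.TopologyBuilder;

// Sketch: spreading one topology's execution over several worker processes.
class ClusterSizingSketch {
    static Config configure(TopologyBuilder builder) {
        // Parallelism hints: how many executor threads run each component.
        builder.setSpout("words", new WordSpout(), 2);
        builder.setBolt("exclaim", new ExclaimBolt(), 8).shuffleGrouping("words");

        Config conf = new Config();
        // Ask for 4 worker processes (JVMs); Nimbus assigns them to Supervisors,
        // and each worker process ends up executing a subset of the topology.
        conf.setNumWorkers(4);
        return conf;
    }
}
```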
This Nimbus/Supervisor layout is a typical cluster architecture. I once designed a system with the Master Node named MC (MultiController) and the Worker Nodes named LC (LocalController). MC takes care of job distribution, but MC doesn't monitor the success or failure of units of work. It monitors the LCs' health and relocates the Worker Nodes to ensure high availability.

My system was not designed for dynamic expansion; there are only three fixed layers. The LC, the Worker Node, runs as a daemon and supervises a list of Engines. Those Engines are created dynamically from a configuration package, the work assignments if you will. The LC starts and stops these Engines, the worker processes if you will. Those worker processes run basically independently, which made my cluster simpler than this one. I think having each worker process execute a subset of a topology, and coordinating those subsets, will be a challenge.

In my cluster, each Worker Node takes care of the success or failure of its own Worker Processes.
  • ZooKeeper
    • Sitting between the Nimbus and the various Supervisors is ZooKeeper.
    • ZooKeeper's goal is to enable highly reliable distributed coordination, mainly by acting as a centralized service for distributed cluster functionality.
    • Storm topologies are deployed to the Nimbus and then the Nimbus deploys spouts and bolts to Supervisors.
      • When it comes time to execute spouts and bolts, the Nimbus communicates with the Supervisors by passing messages through ZooKeeper.
      • ZooKeeper maintains all state for the topology, which allows the Nimbus and Supervisors to be fail-fast and stateless: if the Nimbus or a Supervisor process goes down, the state of processing is not lost.
      • Work is reassigned to another Supervisor and processing continues.
      • Then, when the Nimbus or a Supervisor is restarted, it simply rejoins the cluster and its capacity is added back to the cluster.
My first question would be: how many Nimbus nodes are there in the system? It seems there is only one. If this is true, we need some way to eliminate this single point of failure. So I think ZooKeeper acts as independent configuration storage outside of Nimbus. Nimbus relies on ZooKeeper to reach the deployed Supervisors and manage them. ZooKeeper itself is clustered and does not have the single-point-of-failure problem.

Of course, this is what I guess so far.

  • Storm Development Environment
    • Local Mode:
      • Storm executes topologies completely in-process by simulating worker nodes using threads.
    • Distributed Mode:
      • In distributed mode, Storm runs across a cluster of machines (see the submission sketch below).
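A sketch of submitting the same hypothetical topology in each mode: LocalCluster for local mode and StormSubmitter for distributed mode, again assuming the 0.8/0.9-era backtype.storm packages and the made-up WordSpout/ExclaimBolt components.

```java
import backtype.storm.Config;
import backtype.storm.LocalCluster;
import backtype.storm.StormSubmitter;
import backtype.storm.topology.TopologyBuilder;
import backtype.storm.utils.Utils;

// Sketch: local mode simulates the cluster in-process with threads;
// distributed mode ships the topology to Nimbus for a real cluster.
class SubmitSketch {
    public static void main(String[] args) throws Exception {
        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("words", new WordSpout());
        builder.setBolt("exclaim", new ExclaimBolt()).shuffleGrouping("words");
        Config conf = new Config();

        if (args.length == 0) {
            // Local mode: everything runs in this JVM -- handy for development and tests.
            LocalCluster cluster = new LocalCluster();
            cluster.submitTopology("word-topology", conf, builder.createTopology());
            Utils.sleep(10000);                       // let the topology run for a while
            cluster.killTopology("word-topology");    // topologies run until explicitly killed
            cluster.shutdown();
        } else {
            // Distributed mode: Nimbus distributes the code and assigns tasks to Supervisors.
            StormSubmitter.submitTopology(args[0], conf, builder.createTopology());
        }
    }
}
```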
