Hi All,
Bellow are some snapshots of nathanmarz video, at the time of launching storm and other content available on internet..
Please have look and let me know in case of any doubt.
Twitter/Backtype Storm
Storm, developed at Backtype before it was acquired by Twitter, is billed as the “Hadoop of realtime processing.” Storm is engineered to analyze near real-time, streaming data sources – like the Twitter firehose. Historically, Hadoop has been best used for analyzing big data sets rather than quickly updated streams of data. Hadoop is for running a job with a set end point, Storm is for processing jobs that are continuous because new data is constantly being added.Introduction
Storm, a distributed fault-tolerant and real-time
computational system currently used by Twitter to keep statistics on user
clicks for every URL and domain. It uses workers and queues paradigm.
Components of a Storm cluster
A Storm cluster is superficially
similar to a Hadoop cluster. Whereas on Hadoop you run "MapReduce
jobs", on Storm you run "topologies". "Jobs" and
"topologies" themselves are very different -- one key difference is
that a MapReduce job eventually finishes, whereas a topology processes messages
forever (or until you kill it).
There are two kinds of nodes on a
Storm cluster: the master node and the worker nodes. The master node runs a
daemon called "Nimbus" that is similar to Hadoop's
"JobTracker". Nimbus is responsible for distributing code around the
cluster, assigning tasks to machines, and monitoring for failures.
Each worker node runs a daemon
called the "Supervisor". The supervisor listens for work assigned to
its machine and starts and stops worker processes as necessary based on what
Nimbus has assigned to it. Each worker process executes a subset of a topology;
a running topology consists of many worker processes spread across many
machines.
Twitter/Backtype Storm
Storm, developed at Backtype before it was acquired by Twitter, is billed as the “Hadoop of realtime processing.” Storm is engineered to analyze near real-time, streaming data sources – like the Twitter firehose. Historically, Hadoop has been best used for analyzing big data sets rather than quickly updated streams of data. Hadoop is for running a job with a set end point, Storm is for processing jobs that are continuous because new data is constantly being added.Introduction
Storm, a distributed fault-tolerant and real-time
computational system currently used by Twitter to keep statistics on user
clicks for every URL and domain. It uses workers and queues paradigm.
Components of a Storm cluster
A Storm cluster is superficially
similar to a Hadoop cluster. Whereas on Hadoop you run "MapReduce
jobs", on Storm you run "topologies". "Jobs" and
"topologies" themselves are very different -- one key difference is
that a MapReduce job eventually finishes, whereas a topology processes messages
forever (or until you kill it).
There are two kinds of nodes on a
Storm cluster: the master node and the worker nodes. The master node runs a
daemon called "Nimbus" that is similar to Hadoop's
"JobTracker". Nimbus is responsible for distributing code around the
cluster, assigning tasks to machines, and monitoring for failures.
Each worker node runs a daemon
called the "Supervisor". The supervisor listens for work assigned to
its machine and starts and stops worker processes as necessary based on what
Nimbus has assigned to it. Each worker process executes a subset of a topology;
a running topology consists of many worker processes spread across many
machines.
All coordination between Nimbus and
the Supervisors is done through a Zookeeper
cluster. Additionally, the Nimbus daemon and Supervisor daemons are fail-fast
and stateless; all state is kept in Zookeeper or on local disk. This means you
can kill -9 Nimbus or the Supervisors and they'll start back up like nothing
happened. This design leads to Storm clusters being incredibly stable.
Topologies
To do realtime computation on Storm,
you create what are called "topologies". A topology is a graph of
computation. Each node in a topology contains processing logic, and links
between nodes indicate how data should be passed around between nodes.
Running a topology is straightforward.
First, you package all your code and dependencies into a single jar. Then, you
run a command like the following:
storm
jar all-my-code.jar backtype.storm.MyTopology arg1 arg2
This runs the class backtype.storm.MyTopology with the arguments arg1 and arg2.
The main function of the class defines the topology and submits it to Nimbus.
The storm jar part
takes care of connecting to Nimbus and uploading the jar.
Since
topology definitions are just Thrift structs, and Nimbus is a Thrift service, you
can create and submit topologies using any programming language.
Nimbus:
1. It is the point where you can submit
topology and code for execution.
2. Distribute the code among the cluster
3. Launch worker
Zookeper is 3rd party product.
Supervisor:
1.
These
are worker nodes that actually runs the computations.
2.
All
the nodes pass messages to each other.
A
group of nodes has supervisor who communicates with zookeeper and nibus to get
to know what is running on the machine.
Tweets are distributed to multiple queues servers.
Which worker going to
work on which tweet is random.
First set of worker
then choose queues using the hash map then these urls are allocated to
different workers.
1. One worker handles all updates regarding
one url as if, there will be more workers then data can be trapped on each
other and we can lose data.
2. Here database call is not used, we do
buffer the updates and batch the updates in casandra.
Scaling the system:
Workers are doing
Cassandra updates, so we need to add a worker.
Steps to be done to
add a new worker:
1. Deploy a worker
2. Add a new queue for that worker
3. Reconfigure/ redeploy the workers(1st
set of)
Example
of this scaling is:
Adding
a new function. And then that function needs to know that how other functions
are going to use that function. And if anytime we are updating the calling
function we need to update that function also.
This
is not only for this system it is for all real time systems.
1. Horizontal scalable: adding machine etc
should be very easy
2. No intermediate message broker:
If
the consumer of message lose the message then he can get the message from the
broker as it is persistent to the broker. But the reason why we should not use
this is that :
1. Slow: data should directly go from producer
to consumer
2. Complex: broker should persist the whole
architechture
3. Just Work:
For hadoop as it is batch processing
system, if the downtime is of some hours then its ok. There are lots of
installation issue which an experienced person also faces (new issue very
often). But it will not be a good thing for real time.
No comments:
Post a Comment