Sunday, December 2, 2012

Strom(BackType/Twitter) - Part1


 Hi All,

Bellow are some snapshots of nathanmarz video, at the time of launching storm and other content available on internet..

Please have look and let me know in case of any doubt.

Twitter/Backtype Storm

Storm, developed at Backtype before it was acquired by Twitter, is billed as the “Hadoop of realtime processing.” Storm is engineered to analyze near real-time, streaming data sources – like the Twitter firehose. Historically, Hadoop has been best used for analyzing big data sets rather than quickly updated streams of data. Hadoop is for running a job with a set end point, Storm is for processing jobs that are continuous because new data is constantly being added.

Introduction

Storm, a distributed fault-tolerant and real-time computational system currently used by Twitter to keep statistics on user clicks for every URL and domain. It uses workers and queues paradigm.

Components of a Storm cluster

A Storm cluster is superficially similar to a Hadoop cluster. Whereas on Hadoop you run "MapReduce jobs", on Storm you run "topologies". "Jobs" and "topologies" themselves are very different -- one key difference is that a MapReduce job eventually finishes, whereas a topology processes messages forever (or until you kill it).
There are two kinds of nodes on a Storm cluster: the master node and the worker nodes. The master node runs a daemon called "Nimbus" that is similar to Hadoop's "JobTracker". Nimbus is responsible for distributing code around the cluster, assigning tasks to machines, and monitoring for failures.
Each worker node runs a daemon called the "Supervisor". The supervisor listens for work assigned to its machine and starts and stops worker processes as necessary based on what Nimbus has assigned to it. Each worker process executes a subset of a topology; a running topology consists of many worker processes spread across many machines.

Twitter/Backtype Storm

Storm, developed at Backtype before it was acquired by Twitter, is billed as the “Hadoop of realtime processing.” Storm is engineered to analyze near real-time, streaming data sources – like the Twitter firehose. Historically, Hadoop has been best used for analyzing big data sets rather than quickly updated streams of data. Hadoop is for running a job with a set end point, Storm is for processing jobs that are continuous because new data is constantly being added.

Introduction

Storm, a distributed fault-tolerant and real-time computational system currently used by Twitter to keep statistics on user clicks for every URL and domain. It uses workers and queues paradigm.

Components of a Storm cluster

A Storm cluster is superficially similar to a Hadoop cluster. Whereas on Hadoop you run "MapReduce jobs", on Storm you run "topologies". "Jobs" and "topologies" themselves are very different -- one key difference is that a MapReduce job eventually finishes, whereas a topology processes messages forever (or until you kill it).
There are two kinds of nodes on a Storm cluster: the master node and the worker nodes. The master node runs a daemon called "Nimbus" that is similar to Hadoop's "JobTracker". Nimbus is responsible for distributing code around the cluster, assigning tasks to machines, and monitoring for failures.
Each worker node runs a daemon called the "Supervisor". The supervisor listens for work assigned to its machine and starts and stops worker processes as necessary based on what Nimbus has assigned to it. Each worker process executes a subset of a topology; a running topology consists of many worker processes spread across many machines.
All coordination between Nimbus and the Supervisors is done through a Zookeeper cluster. Additionally, the Nimbus daemon and Supervisor daemons are fail-fast and stateless; all state is kept in Zookeeper or on local disk. This means you can kill -9 Nimbus or the Supervisors and they'll start back up like nothing happened. This design leads to Storm clusters being incredibly stable.

Topologies

To do realtime computation on Storm, you create what are called "topologies". A topology is a graph of computation. Each node in a topology contains processing logic, and links between nodes indicate how data should be passed around between nodes.
Running a topology is straightforward. First, you package all your code and dependencies into a single jar. Then, you run a command like the following:
storm jar all-my-code.jar backtype.storm.MyTopology arg1 arg2
This runs the class backtype.storm.MyTopology with the arguments arg1 and arg2. The main function of the class defines the topology and submits it to Nimbus. The storm jar part takes care of connecting to Nimbus and uploading the jar.
Since topology definitions are just Thrift structs, and Nimbus is a Thrift service, you can create and submit topologies using any programming language.

Nimbus:  
1.       It is the point where you can submit topology and code for execution.
2.       Distribute the code among the cluster
3.       Launch worker



Zookeper is 3rd party product.
 
Supervisor:
1.       These are worker nodes that actually runs the computations.
2.       All the nodes pass messages to each other.
A group of nodes has supervisor who communicates with zookeeper and nibus to get to know what is running on the machine.



 
Tweets are distributed to multiple queues servers.
Which worker going to work on which tweet is random.

First set of worker then choose queues using the hash map then these urls are allocated to different workers. 
1.       One worker handles all updates regarding one url as if, there will be more workers then data can be trapped on each other and we can lose data.
2.       Here database call is not used, we do buffer the updates and batch the updates in casandra.





Scaling the system:





Workers are doing Cassandra updates, so we need to add a worker.

Steps to be done to add a new worker:
1.       Deploy a worker
2.       Add a new queue for that worker
3.       Reconfigure/ redeploy the workers(1st set of)
Example of this scaling is:
Adding a new function. And then that function needs to know that how other functions are going to use that function. And if anytime we are updating the calling function we need to update that function also.

This is not only for this system it is for all real time systems.
1.       Horizontal scalable: adding machine etc should be very easy
2.       No intermediate message broker:
If the consumer of message lose the message then he can get the message from the broker as it is persistent to the broker. But the reason why we should not use this is that :
1.       Slow: data should directly go from producer to consumer
2.       Complex: broker should persist the whole architechture

3.       Just Work:
For hadoop as it is batch processing system, if the downtime is of some hours then its ok. There are lots of installation issue which an experienced person also faces (new issue very often). But it will not be a good thing for real time.
 


No comments:

Post a Comment