With today’s data technologies, storing data and scaling the infrastructure have become non-issues thanks to HDFS, Hadoop, and related architectures. Hadoop provides a batch-processing framework, MapReduce, for processing the data. However, batch processing suffers from high read latency, which is a problem for use cases like real-time analytics, clickstream visualization, and machine learning. We needed a system that processes our customer- and system-generated data as it happens so we can make quick, informed business decisions. At Rocket Lawyer, we chose Apache Storm to supplement our data platform with real-time processing capabilities.
Apache Storm to the rescue
Storm provides simple, open source, scalable, fault-tolerant, and guaranteed data processing while hiding the technological complexity behind a simple API and visual management tools. In addition, data connectors called Spouts integrate Storm with many streaming data sources. Bolts perform the ETL and data processing. A Storm topology orchestrates the stream computation by wiring Spouts and Bolts together.
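To make the Spout/Bolt/topology relationship concrete, here is a minimal sketch of how a topology might be wired together using Storm's pre-Apache Java API (the backtype.storm packages, circa 2013). The EventSpout and SessionBolt classes are hypothetical placeholders, not our actual components.

import backtype.storm.Config;
import backtype.storm.StormSubmitter;
import backtype.storm.topology.TopologyBuilder;
import backtype.storm.tuple.Fields;

public class EventTopology {
    public static void main(String[] args) throws Exception {
        TopologyBuilder builder = new TopologyBuilder();

        // The spout streams raw events into the topology.
        builder.setSpout("events", new EventSpout(), 2);

        // Group tuples by session id so all clicks from one visitor
        // land on the same bolt instance for aggregation.
        builder.setBolt("sessionize", new SessionBolt(), 4)
               .fieldsGrouping("events", new Fields("sessionId"));

        Config conf = new Config();
        conf.setNumWorkers(2);
        StormSubmitter.submitTopology("event-stream", conf, builder.createTopology());
    }
}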
Real-time event stream processing with Storm, HBase, Redis, and D3.js
One common request we get from product managers and analysts is the ability to see how customers flow through our product funnels. A funnel can span multiple components: landing pages, search, our legal questionnaire framework, customer registration, checkout, and everything in between and after. Before Storm, the way to do this was to run batch queries against HBase, where our application and business events are stored. This was incredibly time-consuming, taking hours to process depending on the flow and the points tracked. And any change to the funnel meant rewriting the batch job.
Storm solved this problem with its real-time distributed computation framework. We created an HBase Spout that continuously streams data from HBase into Bolts, which identify customer sessions and aggregate clicks by traffic source, partner, campaign, and micro and macro conversion events. The aggregations are stored in Redis, whose efficient in-memory storage gives us extremely fast reads and joins. D3.js visualizes the customer event stream with dynamic querying capabilities. When product changes introduce new events, the Bolts recognize them and create new key-value pairs in Redis, which are immediately available for analysis and for tracking user behavior. From the moment data is created at the source to its visualization, the process completes within seconds.
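As an illustration of the aggregation step, here is a sketch of what such a Bolt could look like, again assuming the backtype.storm Java API and the Jedis client for Redis. The field names, Redis key layout, and host are illustrative assumptions, not our actual schema.

import java.util.Map;
import backtype.storm.task.OutputCollector;
import backtype.storm.task.TopologyContext;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.base.BaseRichBolt;
import backtype.storm.tuple.Tuple;
import redis.clients.jedis.Jedis;

public class ClickAggregationBolt extends BaseRichBolt {
    private transient Jedis redis;
    private OutputCollector collector;

    @Override
    public void prepare(Map stormConf, TopologyContext context, OutputCollector collector) {
        // Open the Redis connection on the worker, not at submission time,
        // because bolt instances are serialized when the topology is deployed.
        this.redis = new Jedis("redis-host", 6379);
        this.collector = collector;
    }

    @Override
    public void execute(Tuple tuple) {
        String trafficSource = tuple.getStringByField("trafficSource");
        String event = tuple.getStringByField("event");
        // Increment the per-source counter for this event; keying the hash by
        // traffic source lets a dashboard read all of its events in one HGETALL.
        redis.hincrBy("clicks:" + trafficSource, event, 1);
        collector.ack(tuple);
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        // Terminal bolt: aggregates are read from Redis, so no stream is emitted.
    }
}

With counters laid out this way, the visualization layer only needs cheap key lookups against Redis rather than scans, which is what keeps the end-to-end latency at seconds.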
Benefits
At Rocket Lawyer, Storm powers real-time data analysis, letting us spot trends in consumer behavior immediately after a product change launches and empowering the business to make quick decisions. From a technology perspective, Storm made real-time computation work out of the box while abstracting away the complexity, which significantly reduced our development time. We are expanding our Storm topologies to other data products such as real-time ad and revenue optimization, recommendation algorithms, and alerting systems.
Deepak Srinivasan is Senior Director of Data Engineering and Adam Moore is a Data Engineer at Rocket Lawyer.
December 18, 2013
Source: http://blog.rocketlawyer.com/using-apache-storm-for-real-time-analytics-at-rocket-lawyer-915600