Receiving data from a source that produces 5-10 GB per hour and presenting analysis results as the data streams in poses some interesting challenges.
We used MongoDB running on Amazon EC2 to house the data, Map/Reduce to analyze it, and Django-nonrel to present the results in near-real-time.
(Slides from my presentation at MongoDB Boston)
High Throughput Data Analysis
1. High-throughput data analysis: A Streaming Reports Platform. Authors: J Singh, Early Stage IT; David Zheng, Early Stage IT. Contributor: Satya Gupta, Virsec Systems. October 3, 2011
2. High-throughput data analysis A few examples of streaming data problems A concrete problem we solved How we solved it Take-away lessons
3. Streaming Data Data arrives continuously Must be processed continuously Emit analysis results or alerts as needed
7. High-throughput data analysis A few examples of streaming data problems A concrete problem we solved How we solved it Take-away lessons
10. Requirements Fast inserts into the database The nature and amount of analysis required was hard to judge in the beginning Previous experience with Map/Reduce in the Google App Engine environment had shown promise, but GAE was not appropriate for this application Slick, demo-worthy web interface for presenting results Stream-mode operation Start showing results within a few seconds of starting the Resolve Virtual Machine, and update them periodically as more data is collected and analyzed.
11. High-throughput data analysis A few examples of streaming data problems A concrete problem we solved How we solved it Take-away lessons
12. Key decisions Chunk up data into 1-second “slices” as it arrives Use a collection for signaling the availability of each data slice Process each chunk as it becomes available Use Map/Reduce for analysis Exploit the parallelism of the data by using as many processors as needed to maintain the “flow rate” Pipeline the various Map/Reduce jobs to preserve the sequential order of the data
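To make the slice-plus-signal idea concrete, here is a minimal sketch in Python with PyMongo. The collection and field names (`slices`, `signals`, `slice_id`) are my own for illustration, not the production schema: each slice's records go into a data collection, and a separate signaling collection marks the slice complete so downstream workers can pick slices up in order.

```python
import time
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
db = client.streaming  # database and collection names are illustrative

def store_slice(slice_id, records):
    # Write the 1-second slice, tagging every document with its slice id.
    db.slices.insert_many(dict(r, slice_id=slice_id) for r in records)
    # Signal availability in a separate collection; downstream workers
    # treat this as the "this slice is complete, process it" marker.
    db.signals.insert_one({"slice_id": slice_id, "ready_at": time.time()})

def next_ready_slice(last_processed):
    # A worker asks for the lowest-numbered unprocessed slice, which keeps
    # the pipeline sequential even when many workers run in parallel.
    return db.signals.find_one(
        {"slice_id": {"$gt": last_processed}},
        sort=[("slice_id", 1)],
    )
```

Keeping the signal separate from the data means workers never have to guess whether a slice is still being written; they only act on slices that have been explicitly marked ready.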
13. Pipeline Component: Listener Goal: push the data into MongoDB as fast as possible Receives the data from the Resolve Virtual Machine and stores it into MongoDB Self-describing data 12 different types of data fed over 12 different sockets Written in C++ Socket interface at one end, MongoDB C++ driver at the other
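The real listener was C++, but the shape of its loop is easy to show in Python. This sketch assumes newline-delimited JSON as the wire format purely for brevity; the actual feed was a custom self-describing protocol over 12 sockets.

```python
import json
import socket
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
db = client.streaming  # database name is illustrative

def listen(port, collection_name):
    """Accept one typed feed and push its records straight into MongoDB."""
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.bind(("0.0.0.0", port))
    srv.listen(1)
    conn, _ = srv.accept()
    buf = b""
    while True:
        chunk = conn.recv(65536)
        if not chunk:
            break
        buf += chunk
        # Assumed framing: newline-delimited JSON records. The real
        # listener parsed a self-describing binary format instead.
        *records, buf = buf.split(b"\n")
        docs = [json.loads(r) for r in records if r.strip()]
        if docs:
            db[collection_name].insert_many(docs)  # batched for insert speed
```

One such loop per socket (12 in all) keeps each data type flowing into its own collection independently, which is what makes the fast-insert goal achievable.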
25. Endpoint Stack Data Capture (Listener) Custom, preferably written in C++ or Java NoSQL Database MongoDB Well suited for high-speed inserts Calculation Platform MongoDB Map/Reduce Could use Hadoop, but startup times are a concern Presentation Django-nonrel
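For the calculation layer, here is roughly what one MongoDB Map/Reduce stage looks like driven from Python. This is a sketch under assumptions: the field and collection names are mine, and on current MongoDB versions you would reach for the aggregation pipeline instead, since server-side map/reduce has since been deprecated.

```python
from bson.code import Code
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
db = client.streaming

# Count events per type within a single slice (field names are assumptions).
mapper = Code("function () { emit(this.event_type, 1); }")
reducer = Code("function (key, values) { return Array.sum(values); }")

result = db.command(
    "mapReduce", "slices",
    map=mapper,
    reduce=reducer,
    query={"slice_id": 42},          # restrict the job to one data slice
    out={"merge": "event_counts"},   # fold results into a running collection
)
```

Using `out: {merge: ...}` is what makes this work in stream mode: each slice's job folds its counts into the same output collection, so the web tier always has an up-to-date view to render.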
26. About Us Involved with Map/Reduce and NoSQL technologies on several platforms Many students in J’s Database Systems class at WPI did a project on a NoSQL database DataThinks.org is a new service of Early Stage IT, building and operating “Big Data” analytics services Thanks