Receiving data from a source that produces 5-10 GB per hour and presenting analysis results as the data streams in poses some interesting challenges.
We used MongoDB running on Amazon EC2 to house the data, Map/Reduce to analyze it, and Django-nonrel to present the results in near-real-time.
(Slides from my presentation at MongoDB Boston)
High Throughput Data Analysis
1. High-throughput data analysis: A Streaming Reports Platform. Authors: J Singh, Early Stage IT; David Zheng, Early Stage IT. Contributor: Satya Gupta, Virsec Systems. October 3, 2011
2. High-throughput data analysis A few examples of streaming data problems A concrete problem we solved How we solved it Take-away lessons
3. Streaming Data Data arrives continuously Must be processed continuously Emit analysis results or alerts as needed
7. High-throughput data analysis A few examples of streaming data problems A concrete problem we solved How we solved it Take-away lessons
10. Requirements Fast inserts into the database The nature and amount of analysis required was hard to judge in the beginning Previous experience with Map/Reduce in the Google App Engine environment had shown promise, but GAE was not appropriate for this application Slick, demo-worthy web interface for presenting results Stream-mode operation Start showing results within a few seconds of starting the Resolve Virtual Machine, and update them periodically as more data is collected and analyzed.
11. High-throughput data analysis A few examples of streaming data problems A concrete problem we solved How we solved it Take-away lessons
12. Key decisions Chunk up data into 1-second “slices” as it arrives Use a collection for signaling the availability of each data slice Process each chunk as it becomes available Use Map/Reduce for analysis Exploit the parallelism of the data by using as many processors as needed to maintain the “flow rate” Pipeline the various Map/Reduce jobs to preserve the sequential order of the data
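To make the slice-plus-signal idea concrete, here is a minimal sketch in Python with PyMongo. The collection and field names (`slices`, `signals`, `slice_id`) are my own for illustration, not the production schema: each slice's records go into a data collection, and a separate signaling collection marks the slice complete so downstream workers can pick slices up in order.

```python
import time
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
db = client.streaming  # database and collection names are illustrative

def store_slice(slice_id, records):
    # Write the 1-second slice, tagging every document with its slice id.
    db.slices.insert_many(dict(r, slice_id=slice_id) for r in records)
    # Signal availability in a separate collection; downstream workers
    # treat this as the "this slice is complete, process it" marker.
    db.signals.insert_one({"slice_id": slice_id, "ready_at": time.time()})

def next_ready_slice(last_processed):
    # A worker asks for the lowest-numbered unprocessed slice, which keeps
    # the pipeline sequential even when many workers run in parallel.
    return db.signals.find_one(
        {"slice_id": {"$gt": last_processed}},
        sort=[("slice_id", 1)],
    )
```

Keeping the signal separate from the data means workers never have to guess whether a slice is still being written; they only act on slices that have been explicitly marked ready.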
13. Pipeline Component: Listener Goal: push the data into MongoDB as fast as possible Receives the data from the Resolve Virtual Machine and stores it into MongoDB Self-describing data 12 different types of data fed over 12 different sockets Written in C++ Socket interface at one end, MongoDB C++ driver at the other
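The real listener was C++, but the shape of its loop is easy to show in Python. This sketch assumes newline-delimited JSON as the wire format purely for brevity; the actual feed was a custom self-describing protocol over 12 sockets.

```python
import json
import socket
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
db = client.streaming  # database name is illustrative

def listen(port, collection_name):
    """Accept one typed feed and push its records straight into MongoDB."""
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.bind(("0.0.0.0", port))
    srv.listen(1)
    conn, _ = srv.accept()
    buf = b""
    while True:
        chunk = conn.recv(65536)
        if not chunk:
            break
        buf += chunk
        # Assumed framing: newline-delimited JSON records. The real
        # listener parsed a self-describing binary format instead.
        *records, buf = buf.split(b"\n")
        docs = [json.loads(r) for r in records if r.strip()]
        if docs:
            db[collection_name].insert_many(docs)  # batched for insert speed
```

One such loop per socket (12 in all) keeps each data type flowing into its own collection independently, which is what makes the fast-insert goal achievable.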
25. Endpoint Stack Data Capture (Listener) Custom, preferably written in C++ or Java NoSQL Database MongoDB Well suited for high-speed inserts Calculation Platform MongoDB Map/Reduce Could use Hadoop, but startup times are a concern Presentation Django-nonrel
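For the calculation layer, here is roughly what one MongoDB Map/Reduce stage looks like driven from Python. This is a sketch under assumptions: the field and collection names are mine, and on current MongoDB versions you would reach for the aggregation pipeline instead, since server-side map/reduce has since been deprecated.

```python
from bson.code import Code
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
db = client.streaming

# Count events per type within a single slice (field names are assumptions).
mapper = Code("function () { emit(this.event_type, 1); }")
reducer = Code("function (key, values) { return Array.sum(values); }")

result = db.command(
    "mapReduce", "slices",
    map=mapper,
    reduce=reducer,
    query={"slice_id": 42},          # restrict the job to one data slice
    out={"merge": "event_counts"},   # fold results into a running collection
)
```

Using `out: {merge: ...}` is what makes this work in stream mode: each slice's job folds its counts into the same output collection, so the web tier always has an up-to-date view to render.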
26. About Us Involved with Map/Reduce and NoSQL technologies on several platforms Many students in J’s Database Systems class at WPI did a project on a NoSQL database DataThinks.org is a new service of Early Stage IT, building and operating “Big Data” analytics services Thanks