SlideShare una empresa de Scribd logo
1 de 28
Making sense of streaming Big Data               Flume – HBase Real-Time Big Data Analytics with SLA -Dani Abel Rayan
Who am I ? Interned with Cloudera. Flume Contributor. HBaseUser. Work with KarstenSchwan @ GaTech Joining as “Big Data Engineer” in a lead role to manage exponential growing data for makers of League of Legends (Multiplayer Online Battle Arena)- recently received 400M
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported License.
Why near Real-Time? Activity stream data is a normal part of any website for reporting on usage of the site.  Activity data is things like page views, information about what content was shown, searches, etc.  This kind of thing is usually handled by logging the activity out to some kind of file and then periodically aggregating these files for analysis.  In recent years, however, activity data has become a critical part of the production features of websites, and a slightly more sophisticated set of infrastructure is needed.
The Big Picture Why you want to build this ? - Customer retargeting
The Big Picture Content serving by measuring current audience interests. Product Patterns – Twitter Streams S4 is being used for applications such as personalization, user feedback, malicious traffic detection, and real-time search Location based streams – find out people matching specific threshold almost near real-time - RealTime Shopping/Restaurants discount So many possibilities!
Million Impressions in a sec … 100 nodes (soon to be 500) in CERCS Each one can generate 10,000 impressions in one second. Specific products are given “a known”      impression rates and others are pseudo random  The challenging task is to ensure, that we can bucket the Product impression in proper ColFamilies within a SLA of few seconds.
Whats the storage ? HBase Currently used for real-time analytics in companies like Facebook (also FB messaging), Yahoo!, Twitter The high-throughput stream of immutable activity data represents a real computational challenge as the volume may easily be 10x or 100x larger than the next largest data source on a site.  Do we really need to store everything ? Nope! HBase has TTL for ColFamilies Wait! What are “Column Families” ?
HBase Data Model
HBase Data Model
HBase TTL DESCRIPTION                                                                     ENABLED  {NAME => 'test',  FAMILIES =>  [{NAME => 'host', BLOOMFILTER => 'NONE', REPLICATION_SCOPE => '0', COMPRESSION => 'NONE', VERSIONS => '3', TTL => '2147483647',  BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'},  {NAME => ' info', BLOOMFILTER => 'NONE', REPLICATION_SCOPE => '0', COMPRESSION => 'NONE', VERSIONS => '3', TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMORY => 'fal se', BLOCKCACHE => 'true'},  {NAME => 'log', BLOOMFILTER => 'NONE', REPLICATION _SCOPE => '0', VERSIONS => '3', COMPRESSION => 'NONE', TTL => '2147483647',       BL OCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'},  {NAME => 'pro duct', BLOOMFILTER => 'NONE', REPLICATION_SCOPE => '0', COMPRESSION => 'NONE', VERSIONS => '3', TTL => '10', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'}]}
HBase Architecture Why NoSQL ? Hbase ? Horizontal scalability  Commodity Hardware Hadoop based Only one index possibile– RowKey Very high write load ->   20 billion events per day (200,000 events per second) with a lag of less than 30 seconds. (at Facebook)
Flume Flume is a distributed, reliable, and available service for efficiently moving large amounts of data soon after the data is produced.  The primary use case for Flume is as a logging system that gathers a set of log files on every machine in a cluster and aggregates them to a centralized persistent store such as the Hadoop Distributed File System (HDFS). The system was designed with these four key goals in mind: Reliability  Scalability  Manageability  Extensibility
Where it is used ? Mozilla Shopzilla AOL Simple GEO Path ….
Flume Architecture
Flume Data Model Flume internally converts every external source of data into a stream of events.  Events are Flume’s unit of data and are a simple and flexible representation. An event is composed of a body and metadata.  The event body is a string of bytes representing the content of an event. For example, a line in a log file is represented as an event whose body was the actual byte representation of that line.  The event metadata is a table of key / value pairs that capture some detail about the event, such as the time it was created or the name of the machine on which it originated.  This table can be appended as an event travels along a Flume flow, and the table can be read to control the operation of individual components of that flow. For example, the machine name attached to an event can be used to control the output path where the event is written at the end of the flow. An event’s body can be up to 32KB long - although this limit can be controlled via a system property, it is recommended that it is not changed in order to preserve performance.
Flume – HBase connector This is challenging since we need to interface single dimensional key-value pairs into multi-dimensional key-value pair Many possible approaches: 1. "usage: hbase(quot;tablequot;, quot;rowkeyquot;, "  + "quot;cf1quot;," + " quot;c1quot;, quot;val1quot;[,quot;cf2quot;, quot;c2quot;, quot;val2quot;, ....]{, "  + KW_BUFFER_SIZE + "=int, " + KW_USE_WAL + "=true|false})"; 2. usage: attr2hbase(quot;tablequot; [,quot;sysFamilyquot;[, quot;writeBodyquot;[,quot;attrPrefixquot;[,quot;writeBufferSizequot;[,quot;writeToWalquot;]]]]])"; https://issues.cloudera.org/browse/FLUME-6
The Story so far … for a demo Deployed Flume Agent Nodes in ~ 100 machines Agents monitor a “specific” log directory in all the machines. Any logfile matching the *.hb” suffix will be continuously tailed. Deployed Flume Collector Nodes in 5 machines One HBase – psuedo distributed
Demo Single Event   - Choose a Machine 7,9,10,11,12   - Choose a VM 2,3, ….20 1000’s of Event    -  Lets do it again on multiple machines Failover chain Demo Inactive Demo
Evaluation Flume is an awesome product but it isn’t perfect and there are instances wherein certain flows though they appear “ACTIVE” aren’t spewing out any data and sometimes are inactive or work slower because of long queues. No SLA guarantees So are the other products like S4 from Yahoo! And Kafka from LinkedIN
Monalytics
SLA Monalytics - combined monitoring and analysis systems used for managing large-scale data center systems. Due to the scale and complexity on commodity software/hardware in data centers, the performance problems in the streaming MapReduce system are inevitable.  Solving those problems is extremely hard and much more stringent, if you need to meet SLA: “We will bucket the impressions in less than 2sec, no matter what”
So many moving parts Still need to be Adaptable Needs to be configurable with new stores End to End Monitoring of SLA Reliability Garbage Collection Pauses Good use case for Monalytics New technologies are really cool to implement, but once you past the initial honeymoon phase, complexities start surfacing up and one either enters the “cul-de-sac” mode to fall-back on the more traditional methods.
Demo MonalyticsLatency Start – Stop 100 nodes
Similar systems to Flume S4 from Yahoo! Kafka from LinkedIN FlumeBase – an extension to support SQL like constructs operating on Data Streams – another startup
Thanks ! dr@verticalengine.com Questions ?

Más contenido relacionado

La actualidad más candente

Migration Best Practices - Search Y 2019, Paris
Migration Best Practices - Search Y 2019, ParisMigration Best Practices - Search Y 2019, Paris
Migration Best Practices - Search Y 2019, ParisBastian Grimm
 
Super speed around the globe - SearchLeeds 2018
Super speed around the globe - SearchLeeds 2018Super speed around the globe - SearchLeeds 2018
Super speed around the globe - SearchLeeds 2018Bastian Grimm
 
Migration Best Practices - SMX West 2019
Migration Best Practices - SMX West 2019Migration Best Practices - SMX West 2019
Migration Best Practices - SMX West 2019Bastian Grimm
 
International Site Speed Tweaks - ISS 2017 Barcelona
International Site Speed Tweaks - ISS 2017 BarcelonaInternational Site Speed Tweaks - ISS 2017 Barcelona
International Site Speed Tweaks - ISS 2017 BarcelonaBastian Grimm
 
Whats Next in SEO & CRO - 3XE Conference 2018 Dublin
Whats Next in SEO & CRO - 3XE Conference 2018 DublinWhats Next in SEO & CRO - 3XE Conference 2018 Dublin
Whats Next in SEO & CRO - 3XE Conference 2018 DublinBastian Grimm
 
Migration Best-Practices: Successfully re-launching your website - SMX New Yo...
Migration Best-Practices: Successfully re-launching your website - SMX New Yo...Migration Best-Practices: Successfully re-launching your website - SMX New Yo...
Migration Best-Practices: Successfully re-launching your website - SMX New Yo...Bastian Grimm
 
Web Performance Madness - brightonSEO 2018
Web Performance Madness - brightonSEO 2018Web Performance Madness - brightonSEO 2018
Web Performance Madness - brightonSEO 2018Bastian Grimm
 
Welcome to a new reality - DeepCrawl Webinar 2018
Welcome to a new reality - DeepCrawl Webinar 2018Welcome to a new reality - DeepCrawl Webinar 2018
Welcome to a new reality - DeepCrawl Webinar 2018Bastian Grimm
 
Scaling a SaaS backend with PostgreSQL - A case study
Scaling a SaaS backend with PostgreSQL - A case studyScaling a SaaS backend with PostgreSQL - A case study
Scaling a SaaS backend with PostgreSQL - A case studyOliver Seemann
 
Structured Data & Schema.org - SMX Milan 2014
Structured Data & Schema.org - SMX Milan 2014Structured Data & Schema.org - SMX Milan 2014
Structured Data & Schema.org - SMX Milan 2014Bastian Grimm
 
How fast is fast enough - SMX West 2018
How fast is fast enough - SMX West 2018How fast is fast enough - SMX West 2018
How fast is fast enough - SMX West 2018Bastian Grimm
 
Everyone Screws Up HTTPS
Everyone Screws Up HTTPSEveryone Screws Up HTTPS
Everyone Screws Up HTTPSpatrickstox
 

La actualidad más candente (13)

Migration Best Practices - Search Y 2019, Paris
Migration Best Practices - Search Y 2019, ParisMigration Best Practices - Search Y 2019, Paris
Migration Best Practices - Search Y 2019, Paris
 
Super speed around the globe - SearchLeeds 2018
Super speed around the globe - SearchLeeds 2018Super speed around the globe - SearchLeeds 2018
Super speed around the globe - SearchLeeds 2018
 
Migration Best Practices - SMX West 2019
Migration Best Practices - SMX West 2019Migration Best Practices - SMX West 2019
Migration Best Practices - SMX West 2019
 
International Site Speed Tweaks - ISS 2017 Barcelona
International Site Speed Tweaks - ISS 2017 BarcelonaInternational Site Speed Tweaks - ISS 2017 Barcelona
International Site Speed Tweaks - ISS 2017 Barcelona
 
Whats Next in SEO & CRO - 3XE Conference 2018 Dublin
Whats Next in SEO & CRO - 3XE Conference 2018 DublinWhats Next in SEO & CRO - 3XE Conference 2018 Dublin
Whats Next in SEO & CRO - 3XE Conference 2018 Dublin
 
Migration Best-Practices: Successfully re-launching your website - SMX New Yo...
Migration Best-Practices: Successfully re-launching your website - SMX New Yo...Migration Best-Practices: Successfully re-launching your website - SMX New Yo...
Migration Best-Practices: Successfully re-launching your website - SMX New Yo...
 
Big Data Workshop
Big Data WorkshopBig Data Workshop
Big Data Workshop
 
Web Performance Madness - brightonSEO 2018
Web Performance Madness - brightonSEO 2018Web Performance Madness - brightonSEO 2018
Web Performance Madness - brightonSEO 2018
 
Welcome to a new reality - DeepCrawl Webinar 2018
Welcome to a new reality - DeepCrawl Webinar 2018Welcome to a new reality - DeepCrawl Webinar 2018
Welcome to a new reality - DeepCrawl Webinar 2018
 
Scaling a SaaS backend with PostgreSQL - A case study
Scaling a SaaS backend with PostgreSQL - A case studyScaling a SaaS backend with PostgreSQL - A case study
Scaling a SaaS backend with PostgreSQL - A case study
 
Structured Data & Schema.org - SMX Milan 2014
Structured Data & Schema.org - SMX Milan 2014Structured Data & Schema.org - SMX Milan 2014
Structured Data & Schema.org - SMX Milan 2014
 
How fast is fast enough - SMX West 2018
How fast is fast enough - SMX West 2018How fast is fast enough - SMX West 2018
How fast is fast enough - SMX West 2018
 
Everyone Screws Up HTTPS
Everyone Screws Up HTTPSEveryone Screws Up HTTPS
Everyone Screws Up HTTPS
 

Destacado

HBase Consistency and Performance Improvements
HBase Consistency and Performance ImprovementsHBase Consistency and Performance Improvements
HBase Consistency and Performance ImprovementsDataWorks Summit
 
Hadoop Summit 2012 | HBase Consistency and Performance Improvements
Hadoop Summit 2012 | HBase Consistency and Performance ImprovementsHadoop Summit 2012 | HBase Consistency and Performance Improvements
Hadoop Summit 2012 | HBase Consistency and Performance ImprovementsCloudera, Inc.
 
Time-Series Apache HBase
Time-Series Apache HBaseTime-Series Apache HBase
Time-Series Apache HBaseHBaseCon
 
Build a Time Series Application with Apache Spark and Apache HBase
Build a Time Series Application with Apache Spark and Apache  HBaseBuild a Time Series Application with Apache Spark and Apache  HBase
Build a Time Series Application with Apache Spark and Apache HBaseCarol McDonald
 
Hortonworks Technical Workshop: HBase For Mission Critical Applications
Hortonworks Technical Workshop: HBase For Mission Critical ApplicationsHortonworks Technical Workshop: HBase For Mission Critical Applications
Hortonworks Technical Workshop: HBase For Mission Critical ApplicationsHortonworks
 
Content Identification using HBase
Content Identification using HBaseContent Identification using HBase
Content Identification using HBaseHBaseCon
 
New Security Features in Apache HBase 0.98: An Operator's Guide
New Security Features in Apache HBase 0.98: An Operator's GuideNew Security Features in Apache HBase 0.98: An Operator's Guide
New Security Features in Apache HBase 0.98: An Operator's GuideHBaseCon
 
Design Patterns for Building 360-degree Views with HBase and Kiji
Design Patterns for Building 360-degree Views with HBase and KijiDesign Patterns for Building 360-degree Views with HBase and Kiji
Design Patterns for Building 360-degree Views with HBase and KijiHBaseCon
 
SE2016 Java Valerii Moisieienko "Apache HBase Workshop"
SE2016 Java Valerii Moisieienko "Apache HBase Workshop"SE2016 Java Valerii Moisieienko "Apache HBase Workshop"
SE2016 Java Valerii Moisieienko "Apache HBase Workshop"Inhacking
 
Meet HBase 1.0
Meet HBase 1.0Meet HBase 1.0
Meet HBase 1.0enissoz
 

Destacado (14)

HBase Consistency and Performance Improvements
HBase Consistency and Performance ImprovementsHBase Consistency and Performance Improvements
HBase Consistency and Performance Improvements
 
Apache HBase 0.98
Apache HBase 0.98Apache HBase 0.98
Apache HBase 0.98
 
Hbase Nosql
Hbase NosqlHbase Nosql
Hbase Nosql
 
Hadoop Summit 2012 | HBase Consistency and Performance Improvements
Hadoop Summit 2012 | HBase Consistency and Performance ImprovementsHadoop Summit 2012 | HBase Consistency and Performance Improvements
Hadoop Summit 2012 | HBase Consistency and Performance Improvements
 
IoT:what about data storage?
IoT:what about data storage?IoT:what about data storage?
IoT:what about data storage?
 
Time-Series Apache HBase
Time-Series Apache HBaseTime-Series Apache HBase
Time-Series Apache HBase
 
Build a Time Series Application with Apache Spark and Apache HBase
Build a Time Series Application with Apache Spark and Apache  HBaseBuild a Time Series Application with Apache Spark and Apache  HBase
Build a Time Series Application with Apache Spark and Apache HBase
 
Hortonworks Technical Workshop: HBase For Mission Critical Applications
Hortonworks Technical Workshop: HBase For Mission Critical ApplicationsHortonworks Technical Workshop: HBase For Mission Critical Applications
Hortonworks Technical Workshop: HBase For Mission Critical Applications
 
Content Identification using HBase
Content Identification using HBaseContent Identification using HBase
Content Identification using HBase
 
New Security Features in Apache HBase 0.98: An Operator's Guide
New Security Features in Apache HBase 0.98: An Operator's GuideNew Security Features in Apache HBase 0.98: An Operator's Guide
New Security Features in Apache HBase 0.98: An Operator's Guide
 
Design Patterns for Building 360-degree Views with HBase and Kiji
Design Patterns for Building 360-degree Views with HBase and KijiDesign Patterns for Building 360-degree Views with HBase and Kiji
Design Patterns for Building 360-degree Views with HBase and Kiji
 
Apache Flume
Apache FlumeApache Flume
Apache Flume
 
SE2016 Java Valerii Moisieienko "Apache HBase Workshop"
SE2016 Java Valerii Moisieienko "Apache HBase Workshop"SE2016 Java Valerii Moisieienko "Apache HBase Workshop"
SE2016 Java Valerii Moisieienko "Apache HBase Workshop"
 
Meet HBase 1.0
Meet HBase 1.0Meet HBase 1.0
Meet HBase 1.0
 

Similar a Streaming map reduce

Ldap Synchronization Connector @ 2011.RMLL
Ldap Synchronization Connector @ 2011.RMLLLdap Synchronization Connector @ 2011.RMLL
Ldap Synchronization Connector @ 2011.RMLLsbahloul
 
Hadoop Summit 2011 - Using a Hadoop Data Pipeline to Build a Graph of Users a...
Hadoop Summit 2011 - Using a Hadoop Data Pipeline to Build a Graph of Users a...Hadoop Summit 2011 - Using a Hadoop Data Pipeline to Build a Graph of Users a...
Hadoop Summit 2011 - Using a Hadoop Data Pipeline to Build a Graph of Users a...Bill Graham
 
Big Data, Ingeniería de datos, y Data Lakes en AWS
Big Data, Ingeniería de datos, y Data Lakes en AWSBig Data, Ingeniería de datos, y Data Lakes en AWS
Big Data, Ingeniería de datos, y Data Lakes en AWSjavier ramirez
 
ActiveWarehouse/ETL - BI & DW for Ruby/Rails
ActiveWarehouse/ETL - BI & DW for Ruby/RailsActiveWarehouse/ETL - BI & DW for Ruby/Rails
ActiveWarehouse/ETL - BI & DW for Ruby/RailsPaul Gallagher
 
Splice Machine Overview
Splice Machine OverviewSplice Machine Overview
Splice Machine OverviewKunal Gupta
 
Real-World Pulsar Architectural Patterns
Real-World Pulsar Architectural PatternsReal-World Pulsar Architectural Patterns
Real-World Pulsar Architectural PatternsDevin Bost
 
Dowling buso-feature-store-logical-clocks-spark-ai-summit-2020.pptx
Dowling buso-feature-store-logical-clocks-spark-ai-summit-2020.pptxDowling buso-feature-store-logical-clocks-spark-ai-summit-2020.pptx
Dowling buso-feature-store-logical-clocks-spark-ai-summit-2020.pptxLex Avstreikh
 
Building a Feature Store around Dataframes and Apache Spark
Building a Feature Store around Dataframes and Apache SparkBuilding a Feature Store around Dataframes and Apache Spark
Building a Feature Store around Dataframes and Apache SparkDatabricks
 
AWS Webcast - Tableau Big Data Solution Showcase
AWS Webcast - Tableau Big Data Solution ShowcaseAWS Webcast - Tableau Big Data Solution Showcase
AWS Webcast - Tableau Big Data Solution ShowcaseAmazon Web Services
 
Web 2.0 Lessonplan Day1
Web 2.0 Lessonplan Day1Web 2.0 Lessonplan Day1
Web 2.0 Lessonplan Day1Jesse Thomas
 
Build Data Lakes and Analytics on AWS
Build Data Lakes and Analytics on AWS Build Data Lakes and Analytics on AWS
Build Data Lakes and Analytics on AWS Amazon Web Services
 
How Adobe Does 2 Million Records Per Second Using Apache Spark!
How Adobe Does 2 Million Records Per Second Using Apache Spark!How Adobe Does 2 Million Records Per Second Using Apache Spark!
How Adobe Does 2 Million Records Per Second Using Apache Spark!Databricks
 
BigDataCloud meetup - July 8th - Cost effective big-data processing using Ama...
BigDataCloud meetup - July 8th - Cost effective big-data processing using Ama...BigDataCloud meetup - July 8th - Cost effective big-data processing using Ama...
BigDataCloud meetup - July 8th - Cost effective big-data processing using Ama...BigDataCloud
 
Hadoop World 2011: Building Web Analytics Processing on Hadoop at CBS Interac...
Hadoop World 2011: Building Web Analytics Processing on Hadoop at CBS Interac...Hadoop World 2011: Building Web Analytics Processing on Hadoop at CBS Interac...
Hadoop World 2011: Building Web Analytics Processing on Hadoop at CBS Interac...Cloudera, Inc.
 
Practical catalyst
Practical catalystPractical catalyst
Practical catalystdwm042
 
Creating Semantic Mashups Bridging Web 2 0 And The Semantic Web Presentation 1
Creating Semantic Mashups  Bridging Web 2 0 And The Semantic Web Presentation 1Creating Semantic Mashups  Bridging Web 2 0 And The Semantic Web Presentation 1
Creating Semantic Mashups Bridging Web 2 0 And The Semantic Web Presentation 1jward5519
 
Creating Semantic Mashups Bridging Web 2 0 And The Semantic Web Presentation 1
Creating Semantic Mashups  Bridging Web 2 0 And The Semantic Web Presentation 1Creating Semantic Mashups  Bridging Web 2 0 And The Semantic Web Presentation 1
Creating Semantic Mashups Bridging Web 2 0 And The Semantic Web Presentation 1jward5519
 
Sparkflows - Build E2E Data Analytics Use Cases in less than 30 mins
Sparkflows - Build E2E Data Analytics Use Cases in less than 30 minsSparkflows - Build E2E Data Analytics Use Cases in less than 30 mins
Sparkflows - Build E2E Data Analytics Use Cases in less than 30 minssparkflows
 
Evolving Your Big Data Use Cases from Batch to Real-Time - AWS May 2016 Webi...
Evolving Your Big Data Use Cases from Batch to Real-Time - AWS May 2016  Webi...Evolving Your Big Data Use Cases from Batch to Real-Time - AWS May 2016  Webi...
Evolving Your Big Data Use Cases from Batch to Real-Time - AWS May 2016 Webi...Amazon Web Services
 

Similar a Streaming map reduce (20)

Ldap Synchronization Connector @ 2011.RMLL
Ldap Synchronization Connector @ 2011.RMLLLdap Synchronization Connector @ 2011.RMLL
Ldap Synchronization Connector @ 2011.RMLL
 
Hadoop Summit 2011 - Using a Hadoop Data Pipeline to Build a Graph of Users a...
Hadoop Summit 2011 - Using a Hadoop Data Pipeline to Build a Graph of Users a...Hadoop Summit 2011 - Using a Hadoop Data Pipeline to Build a Graph of Users a...
Hadoop Summit 2011 - Using a Hadoop Data Pipeline to Build a Graph of Users a...
 
Big Data, Ingeniería de datos, y Data Lakes en AWS
Big Data, Ingeniería de datos, y Data Lakes en AWSBig Data, Ingeniería de datos, y Data Lakes en AWS
Big Data, Ingeniería de datos, y Data Lakes en AWS
 
ActiveWarehouse/ETL - BI & DW for Ruby/Rails
ActiveWarehouse/ETL - BI & DW for Ruby/RailsActiveWarehouse/ETL - BI & DW for Ruby/Rails
ActiveWarehouse/ETL - BI & DW for Ruby/Rails
 
Splice Machine Overview
Splice Machine OverviewSplice Machine Overview
Splice Machine Overview
 
Real-World Pulsar Architectural Patterns
Real-World Pulsar Architectural PatternsReal-World Pulsar Architectural Patterns
Real-World Pulsar Architectural Patterns
 
Dowling buso-feature-store-logical-clocks-spark-ai-summit-2020.pptx
Dowling buso-feature-store-logical-clocks-spark-ai-summit-2020.pptxDowling buso-feature-store-logical-clocks-spark-ai-summit-2020.pptx
Dowling buso-feature-store-logical-clocks-spark-ai-summit-2020.pptx
 
Building a Feature Store around Dataframes and Apache Spark
Building a Feature Store around Dataframes and Apache SparkBuilding a Feature Store around Dataframes and Apache Spark
Building a Feature Store around Dataframes and Apache Spark
 
AWS Webcast - Tableau Big Data Solution Showcase
AWS Webcast - Tableau Big Data Solution ShowcaseAWS Webcast - Tableau Big Data Solution Showcase
AWS Webcast - Tableau Big Data Solution Showcase
 
Scaling 101
Scaling 101Scaling 101
Scaling 101
 
Web 2.0 Lessonplan Day1
Web 2.0 Lessonplan Day1Web 2.0 Lessonplan Day1
Web 2.0 Lessonplan Day1
 
Build Data Lakes and Analytics on AWS
Build Data Lakes and Analytics on AWS Build Data Lakes and Analytics on AWS
Build Data Lakes and Analytics on AWS
 
How Adobe Does 2 Million Records Per Second Using Apache Spark!
How Adobe Does 2 Million Records Per Second Using Apache Spark!How Adobe Does 2 Million Records Per Second Using Apache Spark!
How Adobe Does 2 Million Records Per Second Using Apache Spark!
 
BigDataCloud meetup - July 8th - Cost effective big-data processing using Ama...
BigDataCloud meetup - July 8th - Cost effective big-data processing using Ama...BigDataCloud meetup - July 8th - Cost effective big-data processing using Ama...
BigDataCloud meetup - July 8th - Cost effective big-data processing using Ama...
 
Hadoop World 2011: Building Web Analytics Processing on Hadoop at CBS Interac...
Hadoop World 2011: Building Web Analytics Processing on Hadoop at CBS Interac...Hadoop World 2011: Building Web Analytics Processing on Hadoop at CBS Interac...
Hadoop World 2011: Building Web Analytics Processing on Hadoop at CBS Interac...
 
Practical catalyst
Practical catalystPractical catalyst
Practical catalyst
 
Creating Semantic Mashups Bridging Web 2 0 And The Semantic Web Presentation 1
Creating Semantic Mashups  Bridging Web 2 0 And The Semantic Web Presentation 1Creating Semantic Mashups  Bridging Web 2 0 And The Semantic Web Presentation 1
Creating Semantic Mashups Bridging Web 2 0 And The Semantic Web Presentation 1
 
Creating Semantic Mashups Bridging Web 2 0 And The Semantic Web Presentation 1
Creating Semantic Mashups  Bridging Web 2 0 And The Semantic Web Presentation 1Creating Semantic Mashups  Bridging Web 2 0 And The Semantic Web Presentation 1
Creating Semantic Mashups Bridging Web 2 0 And The Semantic Web Presentation 1
 
Sparkflows - Build E2E Data Analytics Use Cases in less than 30 mins
Sparkflows - Build E2E Data Analytics Use Cases in less than 30 minsSparkflows - Build E2E Data Analytics Use Cases in less than 30 mins
Sparkflows - Build E2E Data Analytics Use Cases in less than 30 mins
 
Evolving Your Big Data Use Cases from Batch to Real-Time - AWS May 2016 Webi...
Evolving Your Big Data Use Cases from Batch to Real-Time - AWS May 2016  Webi...Evolving Your Big Data Use Cases from Batch to Real-Time - AWS May 2016  Webi...
Evolving Your Big Data Use Cases from Batch to Real-Time - AWS May 2016 Webi...
 

Último

A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024Results
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfEnterprise Knowledge
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Servicegiselly40
 
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | DelhiFULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhisoniya singh
 
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024BookNet Canada
 
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...HostedbyConfluent
 
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...gurkirankumar98700
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonetsnaman860154
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Allon Mureinik
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)Gabriella Davis
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationMichael W. Hawkins
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonAnna Loughnan Colquhoun
 
Google AI Hackathon: LLM based Evaluator for RAG
Google AI Hackathon: LLM based Evaluator for RAGGoogle AI Hackathon: LLM based Evaluator for RAG
Google AI Hackathon: LLM based Evaluator for RAGSujit Pal
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesSinan KOZAK
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptxHampshireHUG
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024Rafal Los
 
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024BookNet Canada
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking MenDelhi Call girls
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountPuma Security, LLC
 

Último (20)

A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
 
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | DelhiFULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
 
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
 
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
 
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
Google AI Hackathon: LLM based Evaluator for RAG
Google AI Hackathon: LLM based Evaluator for RAGGoogle AI Hackathon: LLM based Evaluator for RAG
Google AI Hackathon: LLM based Evaluator for RAG
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen Frames
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path Mount
 

Streaming map reduce

  • 1. Making sense of streaming Big Data Flume – HBase Real-Time Big Data Analytics with SLA -Dani Abel Rayan
  • 2. Who am I ? Interned with Cloudera. Flume Contributor. HBaseUser. Work with KarstenSchwan @ GaTech Joining as “Big Data Engineer” in a lead role to manage exponential growing data for makers of League of Legends (Multiplayer Online Battle Arena)- recently received 400M
  • 3. This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported License.
  • 4. Why near Real-Time? Activity stream data is a normal part of any website for reporting on usage of the site. Activity data is things like page views, information about what content was shown, searches, etc. This kind of thing is usually handled by logging the activity out to some kind of file and then periodically aggregating these files for analysis. In recent years, however, activity data has become a critical part of the production features of websites, and a slightly more sophisticated set of infrastructure is needed.
  • 5. The Big Picture Why you want to build this ? - Customer retargeting
  • 6. The Big Picture Content serving by measuring current audience interests. Product Patterns – Twitter Streams S4 is being used for applications such as personalization, user feedback, malicious traffic detection, and real-time search Location based streams – find out people matching specific threshold almost near real-time - RealTime Shopping/Restaurants discount So many possibilities!
  • 7. Million Impressions in a sec … 100 nodes (soon to be 500) in CERCS Each one can generate 10,000 impressions in one second. Specific products are given “a known” impression rates and others are pseudo random The challenging task is to ensure, that we can bucket the Product impression in proper ColFamilies within a SLA of few seconds.
  • 8. Whats the storage ? HBase Currently used for real-time analytics in companies like Facebook (also FB messaging), Yahoo!, Twitter The high-throughput stream of immutable activity data represents a real computational challenge as the volume may easily be 10x or 100x larger than the next largest data source on a site. Do we really need to store everything ? Nope! HBase has TTL for ColFamilies Wait! What are “Column Families” ?
  • 11. HBase TTL DESCRIPTION ENABLED {NAME => 'test', FAMILIES => [{NAME => 'host', BLOOMFILTER => 'NONE', REPLICATION_SCOPE => '0', COMPRESSION => 'NONE', VERSIONS => '3', TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'}, {NAME => ' info', BLOOMFILTER => 'NONE', REPLICATION_SCOPE => '0', COMPRESSION => 'NONE', VERSIONS => '3', TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMORY => 'fal se', BLOCKCACHE => 'true'}, {NAME => 'log', BLOOMFILTER => 'NONE', REPLICATION _SCOPE => '0', VERSIONS => '3', COMPRESSION => 'NONE', TTL => '2147483647', BL OCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'}, {NAME => 'pro duct', BLOOMFILTER => 'NONE', REPLICATION_SCOPE => '0', COMPRESSION => 'NONE', VERSIONS => '3', TTL => '10', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'}]}
  • 12.
  • 13. HBase Architecture Why NoSQL ? Hbase ? Horizontal scalability Commodity Hardware Hadoop based Only one index possibile– RowKey Very high write load -> 20 billion events per day (200,000 events per second) with a lag of less than 30 seconds. (at Facebook)
  • 14. Flume Flume is a distributed, reliable, and available service for efficiently moving large amounts of data soon after the data is produced. The primary use case for Flume is as a logging system that gathers a set of log files on every machine in a cluster and aggregates them to a centralized persistent store such as the Hadoop Distributed File System (HDFS). The system was designed with these four key goals in mind: Reliability Scalability Manageability Extensibility
  • 15. Where it is used ? Mozilla Shopzilla AOL Simple GEO Path ….
  • 17. Flume Data Model Flume internally converts every external source of data into a stream of events. Events are Flume’s unit of data and are a simple and flexible representation. An event is composed of a body and metadata. The event body is a string of bytes representing the content of an event. For example, a line in a log file is represented as an event whose body was the actual byte representation of that line. The event metadata is a table of key / value pairs that capture some detail about the event, such as the time it was created or the name of the machine on which it originated. This table can be appended as an event travels along a Flume flow, and the table can be read to control the operation of individual components of that flow. For example, the machine name attached to an event can be used to control the output path where the event is written at the end of the flow. An event’s body can be up to 32KB long - although this limit can be controlled via a system property, it is recommended that it is not changed in order to preserve performance.
  • 18. Flume – HBase connector This is challenging since we need to interface single dimensional key-value pairs into multi-dimensional key-value pair Many possible approaches: 1. "usage: hbase(quot;tablequot;, quot;rowkeyquot;, " + "quot;cf1quot;," + " quot;c1quot;, quot;val1quot;[,quot;cf2quot;, quot;c2quot;, quot;val2quot;, ....]{, " + KW_BUFFER_SIZE + "=int, " + KW_USE_WAL + "=true|false})"; 2. usage: attr2hbase(quot;tablequot; [,quot;sysFamilyquot;[, quot;writeBodyquot;[,quot;attrPrefixquot;[,quot;writeBufferSizequot;[,quot;writeToWalquot;]]]]])"; https://issues.cloudera.org/browse/FLUME-6
  • 19. The Story so far … for a demo Deployed Flume Agent Nodes in ~ 100 machines Agents monitor a “specific” log directory in all the machines. Any logfile matching the *.hb” suffix will be continuously tailed. Deployed Flume Collector Nodes in 5 machines One HBase – psuedo distributed
  • 20.
  • 21. Demo Single Event - Choose a Machine 7,9,10,11,12 - Choose a VM 2,3, ….20 1000’s of Event - Lets do it again on multiple machines Failover chain Demo Inactive Demo
  • 22. Evaluation Flume is an awesome product but it isn’t perfect and there are instances wherein certain flows though they appear “ACTIVE” aren’t spewing out any data and sometimes are inactive or work slower because of long queues. No SLA guarantees So are the other products like S4 from Yahoo! And Kafka from LinkedIN
  • 24. SLA Monalytics - combined monitoring and analysis systems used for managing large-scale data center systems. Due to the scale and complexity on commodity software/hardware in data centers, the performance problems in the streaming MapReduce system are inevitable. Solving those problems is extremely hard and much more stringent, if you need to meet SLA: “We will bucket the impressions in less than 2sec, no matter what”
  • 25. So many moving parts Still need to be Adaptable Needs to be configurable with new stores End to End Monitoring of SLA Reliability Garbage Collection Pauses Good use case for Monalytics New technologies are really cool to implement, but once you past the initial honeymoon phase, complexities start surfacing up and one either enters the “cul-de-sac” mode to fall-back on the more traditional methods.
  • 26. Demo MonalyticsLatency Start – Stop 100 nodes
  • 27. Similar systems to Flume S4 from Yahoo! Kafka from LinkedIN FlumeBase – an extension to support SQL like constructs operating on Data Streams – another startup