Data Insights in Netflix
                      Danny Yuan (@g9yuayon)
                      Jae Bae




Friday, March 1, 13                            1
Who Am I?
    Member of Netflix’s Platform
    Engineering team, working on
    very large scale data
    infrastructure (@g9yuayon)

   Built and operated Netflix’s
   cloud crypto service

   Worked with Jae Bae on
   querying multi-dimensional data
   in real time




No Monitoring Metrics Today




Developers usually think about monitoring metrics when “real-time” data is
mentioned. We have powerful monitoring systems that track millions of metrics
per second, but I’m not going to talk about them today. Monitoring metrics are
crucial data, and they would warrant another multi-hour talk by our monitoring
team. :-)

photo credit: http://www.flickr.com/photos/decade_null/142235888/sizes/o/in/photostream/



Instead, I’m going to talk about logs. Why are they interesting at all?

1,500,000

During peak hours, our data pipeline collects over 1.5 million log events per second.

70,000,000,000

Or 70 billion a day on average.
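As a rough sanity check on these two figures, the daily total can be converted into an average per-second rate and compared with the peak. This is only back-of-the-envelope arithmetic on the numbers quoted on the slides; the variable names are illustrative.

```python
# Back-of-the-envelope arithmetic using only the two figures from the slides.
PEAK_EVENTS_PER_SECOND = 1_500_000   # peak rate quoted above
EVENTS_PER_DAY = 70_000_000_000      # daily average quoted above
SECONDS_PER_DAY = 24 * 60 * 60

average_events_per_second = EVENTS_PER_DAY / SECONDS_PER_DAY
peak_to_average = PEAK_EVENTS_PER_SECOND / average_events_per_second

print(f"average: {average_events_per_second:,.0f} events/s")
print(f"peak is about {peak_to_average:.2f}x the average")
```

In other words, 70 billion events a day works out to roughly 810,000 events per second on average, so the quoted peak is a little under twice the average rate.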

Highly Reliable Data Pipeline


Server Farms -> Log Collectors -> Log Filter -> Sink Plugin -> Hadoop
                                  Log Filter -> Sink Plugin -> Kafka -> Druid
                                  Log Filter -> Sink Plugin -> ElasticSearch




photo credit: http://www.flickr.com/photos/decade_null/142235888/sizes/m/in/photostream/

We have tens of thousands of machines, all of which send log data over a robust data
pipeline to highly reliable data collectors. The collectors then filter the data, transform
it, and dispatch it to different destinations for further processing.
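The filter, transform, and dispatch steps can be sketched in a few lines. This is a minimal in-memory illustration, not the actual collector: the names `Collector`, `Route`, and `ListSink`, and the event fields `level` and `msg`, are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

Event = Dict[str, object]


@dataclass
class ListSink:
    """Hypothetical sink plugin that simply buffers events in memory."""
    name: str
    events: List[Event] = field(default_factory=list)

    def write(self, event: Event) -> None:
        self.events.append(event)


@dataclass
class Route:
    """One filter -> transform -> sink chain."""
    accept: Callable[[Event], bool]
    transform: Callable[[Event], Event]
    sink: ListSink


class Collector:
    def __init__(self, routes: List[Route]) -> None:
        self.routes = routes

    def collect(self, event: Event) -> None:
        # Every route sees every event and decides independently.
        for route in self.routes:
            if route.accept(event):
                route.sink.write(route.transform(event))


hadoop = ListSink("hadoop")
search = ListSink("elasticsearch")
collector = Collector([
    # Archive everything unchanged.
    Route(lambda e: True, lambda e: e, hadoop),
    # Index only errors, tagging each copy on the way out.
    Route(lambda e: e.get("level") == "ERROR",
          lambda e: {**e, "indexed": True}, search),
])
collector.collect({"level": "INFO", "msg": "started"})
collector.collect({"level": "ERROR", "msg": "timeout"})
```

Keeping each destination behind its own filter and transform means a new sink can be added without touching the others, which is one plausible reason for a plugin-style design.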

A Humble Beginning




We didn’t build everything in one night. Actually, we had a humble start. I did a lot of log
scraping like this. I also used R to analyze logs. But these were specific tasks, and at some
point...

[Figure: ten boxes, each labeled “Application”]

Something happened. Our traffic turned into a hockey stick, and the number of applications
exploded. So, log traffic also exploded. Simple log scraping wouldn’t cut it any more.

So We Evolved




hgrep -C 10 -k 5,2,3 'users.*[1-9]{3}' *catalina.out s3//bucket




So we evolved. One thing we built was a Hadoop grep. This tool searches terabytes of data. It
is much more useful than the one provided by the Apache Hadoop distribution, because it
supports many more grep options, such as context and sorting by columns. And DSE’s
Hadoop-as-a-service greatly helps each team.
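To make "context" and "sorting by columns" concrete, here is a single-machine sketch of both options. It is not hgrep's implementation, which runs on Hadoop and is not shown in the talk. Treating the sort keys as 1-based, whitespace-separated columns (as `sort -k` does) is an assumption, and the sample log lines are made up.

```python
import re
from typing import Iterable, List, Sequence


def grep_with_context(lines: Sequence[str], pattern: str, context: int) -> List[str]:
    """Return matching lines plus `context` lines before and after each match."""
    regex = re.compile(pattern)
    keep = set()
    for i, line in enumerate(lines):
        if regex.search(line):
            lo = max(0, i - context)
            hi = min(len(lines), i + context + 1)
            keep.update(range(lo, hi))
    # Emit each kept line once, in original order.
    return [lines[i] for i in sorted(keep)]


def sort_by_columns(lines: Iterable[str], columns: Sequence[int]) -> List[str]:
    """Sort lines by 1-based whitespace-separated columns, in priority order."""
    def key(line: str) -> List[str]:
        fields = line.split()
        # Missing columns sort as empty strings.
        return [fields[c - 1] if c <= len(fields) else "" for c in columns]
    return sorted(lines, key=key)


log = [
    "10:00 web GET /users/123 500",
    "10:01 api GET /health 200",
    "10:02 api GET /users/456 200",
    "10:03 web GET /health 200",
]
hits = grep_with_context(log, r"users.*[1-9]{3}", context=0)
ordered = sort_by_columns(hits, columns=[5, 2])
```

Here `hits` holds the two `/users/...` lines, and `ordered` lists the one with status 200 before the one with status 500.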

We also developed a search tool that searches the logs of live instances.

Field Name      Field Value
Client          “API”
Server          “Cryptex”
StatusCode      200
ResponseTime    73

Hive becomes indispensable.

DSE Sting is a blessing.

Still, We Have a Real-Time Itch




So we built yet another tool to scratch it with the help of Druid.

Error summary for the past 10 seconds. You get to slice and dice through arbitrary
combinations of different dimensions across multiple time series.

Trends in search queries for “90210” by Canadians.

How many people started streaming any episode of House of Cards in the past hour, grouped

A query for all the users who started streaming House of Cards in the past three hours; the
results came back in seconds.
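The shape of such a query (restrict to a time window, filter on some dimensions, group by others, count) can be sketched over an in-memory list of events. This only illustrates the idea; it is not Druid's API or query language, and the field names `timestamp`, `title`, `country`, and `device` are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, Sequence

Event = Dict[str, object]


def count_by(events: Iterable[Event], now: float, window_seconds: float,
             filters: Dict[str, object], group_by: Sequence[str]) -> Counter:
    """Count events in the last `window_seconds` that match `filters`,
    grouped by the given dimensions."""
    counts: Counter = Counter()
    start = now - window_seconds
    for event in events:
        # Time-window restriction.
        if not (start <= event["timestamp"] <= now):
            continue
        # Every filtered dimension must match exactly.
        if any(event.get(dim) != value for dim, value in filters.items()):
            continue
        counts[tuple(event.get(dim) for dim in group_by)] += 1
    return counts


now = 10_000.0
events = [
    {"timestamp": now - 60, "title": "House of Cards", "country": "CA", "device": "tv"},
    {"timestamp": now - 120, "title": "House of Cards", "country": "US", "device": "tv"},
    {"timestamp": now - 300, "title": "House of Cards", "country": "US", "device": "phone"},
    {"timestamp": now - 7200, "title": "House of Cards", "country": "US", "device": "tv"},
    {"timestamp": now - 30, "title": "90210", "country": "CA", "device": "tv"},
]
starts = count_by(events, now, window_seconds=3600,
                  filters={"title": "House of Cards"}, group_by=["country"])
```

Changing `group_by` to `["country", "device"]` or swapping the filter is all it takes to ask a different question. A linear scan like this would not keep up at the event rates described earlier, which is where a purpose-built store comes in.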
Interested?




See You Tomorrow

If you’re interested in how we did the real-time interactive queries with the help of Druid, do
come to our talk. See you tomorrow.

Big data and APIs for PHP developers - SXSW 2011
 
Why and How to integrate Hadoop and NoSQL?
Why and How to integrate Hadoop and NoSQL?Why and How to integrate Hadoop and NoSQL?
Why and How to integrate Hadoop and NoSQL?
 
Go After 4 Years in Production - QCon 2015
Go After 4 Years in Production - QCon 2015Go After 4 Years in Production - QCon 2015
Go After 4 Years in Production - QCon 2015
 
Treasure Data and Heroku
Treasure Data and HerokuTreasure Data and Heroku
Treasure Data and Heroku
 
Dataiku - hadoop ecosystem - @Epitech Paris - janvier 2014
Dataiku  - hadoop ecosystem - @Epitech Paris - janvier 2014Dataiku  - hadoop ecosystem - @Epitech Paris - janvier 2014
Dataiku - hadoop ecosystem - @Epitech Paris - janvier 2014
 
State of Puppet
State of PuppetState of Puppet
State of Puppet
 
Watching Pigs Fly with the Netflix Hadoop Toolkit (Hadoop Summit 2013)
Watching Pigs Fly with the Netflix Hadoop Toolkit (Hadoop Summit 2013)Watching Pigs Fly with the Netflix Hadoop Toolkit (Hadoop Summit 2013)
Watching Pigs Fly with the Netflix Hadoop Toolkit (Hadoop Summit 2013)
 
Big Data and Hadoop - An Introduction
Big Data and Hadoop - An IntroductionBig Data and Hadoop - An Introduction
Big Data and Hadoop - An Introduction
 
Unleashing the Rails Asset Pipeline
Unleashing the Rails Asset PipelineUnleashing the Rails Asset Pipeline
Unleashing the Rails Asset Pipeline
 
What's this NetKernel Thing Anyway?
What's this NetKernel Thing Anyway?What's this NetKernel Thing Anyway?
What's this NetKernel Thing Anyway?
 
Hadoop Operations Powered By ... Hadoop (Hadoop Summit 2014 Amsterdam)
Hadoop Operations Powered By ... Hadoop (Hadoop Summit 2014 Amsterdam)Hadoop Operations Powered By ... Hadoop (Hadoop Summit 2014 Amsterdam)
Hadoop Operations Powered By ... Hadoop (Hadoop Summit 2014 Amsterdam)
 
Big data and hadoop
Big data and hadoopBig data and hadoop
Big data and hadoop
 
OSDC 2018 | The Computer science behind a modern distributed data store by Ma...
OSDC 2018 | The Computer science behind a modern distributed data store by Ma...OSDC 2018 | The Computer science behind a modern distributed data store by Ma...
OSDC 2018 | The Computer science behind a modern distributed data store by Ma...
 
[B6]heroku postgres-hgmnz
[B6]heroku postgres-hgmnz[B6]heroku postgres-hgmnz
[B6]heroku postgres-hgmnz
 
Intro elasticsearch taswarbhatti
Intro elasticsearch taswarbhattiIntro elasticsearch taswarbhatti
Intro elasticsearch taswarbhatti
 

Último

SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024Lorenzo Miniero
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Mattias Andersson
 
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxDigital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxLoriGlavin3
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningLars Bell
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenHervé Boutemy
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfAddepto
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsRizwan Syed
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity PlanDatabarracks
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxLoriGlavin3
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brandgvaughan
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteDianaGray10
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupFlorian Wilhelm
 
SALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICESSALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICESmohitsingh558521
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc
 
The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxLoriGlavin3
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek SchlawackFwdays
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024Stephanie Beckett
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsPixlogix Infotech
 

Último (20)

SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?
 
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxDigital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine Tuning
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache Maven
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdf
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL Certs
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity Plan
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brand
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test Suite
 
DMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special EditionDMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special Edition
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project Setup
 
SALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICESSALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICES
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
 
The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptx
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and Cons
 

Strata lightening-talk

  • 1. Data Insights in Netflix Danny Yuan (@g9yuayon) Jae Bae Friday, March 1, 13 1
  • 5. Who Am I? Member of Netflix’s Platform Engineering team, working on very large scale data infrastructure (@g9yuayon). Built and operated Netflix’s cloud crypto service. Worked with Jae Bae on querying multi-dimensional data in real time.
  • 7. No Monitoring Metrics Today: Developers usually think of monitoring metrics when “real-time” data is mentioned. We have powerful monitoring systems that track millions of metrics per second, but I’m not going to talk about them today. Monitoring metrics are crucial data, and they would warrant another multi-hour talk by our monitoring team. :-)
  • 8. Instead, I’m going to talk about logs. Why are they interesting at all? (Photo credit: http://www.flickr.com/photos/decade_null/142235888/sizes/o/in/photostream/)
  • 9. 1,500,000: During peak hours, our data pipeline collects over 1.5 million log events per second.
  • 10. 70,000,000,000: Or 70 billion a day on average.
  • 12. Highly Reliable Data Pipeline [Diagram labels: Server Farm (three times); Kafka; Log Collectors; Log Filter and Sink Plugin (three times); Hadoop, Druid, ElasticSearch] We have tens of thousands of machines, all of which send log data over a robust data pipeline to highly reliable data collectors. The collectors then filter the data, transform it, and dispatch it to different destinations for further processing. (Photo credit: http://www.flickr.com/photos/decade_null/142235888/sizes/m/in/photostream/)
  • 13. A Humble Beginning: We didn’t build everything in one night. Actually, we had a humble start. I did a lot of log scraping like this, and I also used R to analyze logs. But these are specific tasks, and at some point
  • 17. Something happened. Our traffic turned into a hockey stick, and the number of applications exploded, so log traffic exploded too. Simple log scraping wouldn’t cut it any more.
  • 23. So We Evolved [Command shown: hgrep -C 10 -k 5,2,3 'users.*[1-9]{3}' *catalina.out s3//bucket] One thing we built was a Hadoop grep. This tool searches terabytes of data. It is much more useful than the one provided by the Apache Hadoop distribution, because it supports many more grep options, such as context lines and sorting by columns. And DSE’s Hadoop-as-a-service greatly helps each team.
  • 24. We also developed a search tool that searches the logs of live instances.
  • 30. [Table of Field Name and Field Value: Client = “API”; Server = “Cryptex”; StatusCode = 200; ResponseTime = 73] Hive becomes indispensable.
  • 31. DSE Sting is a blessing.
  • 35. Still, We Have a Real-Time Itch: So we built yet another tool to scratch it, with the help of Druid.
  • 36. Error summary in the past 10 seconds. You get to slice and dice through arbitrary combinations of different dimensions across multiple time series. Trends over the search query “90210” by Canadians. How many people started streaming any episode of House of Cards in the past hour, grouped
  • 39. A query for all the users who started streaming House of Cards in the past three hours; the results came back in seconds.
  • 43. See You Tomorrow: If you’re interested in how we did the real-time interactive queries with the help of Druid, do come to our talk. See you tomorrow.
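
The slice-and-dice queries mentioned in the notes (for example, an error summary over the past 10 seconds, grouped by arbitrary dimensions) can be illustrated with a minimal in-memory sketch. This is not Druid's API, nor the tool described in the talk; it only shows the shape of such a query. The field names follow the table on the Hive slide, while the slice_and_dice function, the "ts" key, the sample events, and the "Web" and "Catalog" values are hypothetical.

```python
from collections import Counter
from datetime import datetime, timedelta


def slice_and_dice(events, since, group_by, where=lambda event: True):
    """Count events at or after `since` that satisfy `where`,
    grouped by the tuple of dimensions named in `group_by`."""
    counts = Counter()
    for event in events:
        if event["ts"] < since or not where(event):
            continue
        key = tuple(event[dim] for dim in group_by)
        counts[key] += 1
    return counts


# Made-up sample events; only the field names come from the slides.
EVENTS = [
    {"ts": datetime(2013, 3, 1, 12, 0, 1), "Client": "API", "Server": "Cryptex",
     "StatusCode": 200, "ResponseTime": 73},
    {"ts": datetime(2013, 3, 1, 12, 0, 3), "Client": "API", "Server": "Cryptex",
     "StatusCode": 500, "ResponseTime": 912},
    {"ts": datetime(2013, 3, 1, 12, 0, 7), "Client": "Web", "Server": "Cryptex",
     "StatusCode": 500, "ResponseTime": 850},
    {"ts": datetime(2013, 3, 1, 12, 0, 8), "Client": "API", "Server": "Catalog",
     "StatusCode": 503, "ResponseTime": 40},
    {"ts": datetime(2013, 3, 1, 11, 59, 30), "Client": "API", "Server": "Cryptex",
     "StatusCode": 500, "ResponseTime": 700},  # older than the 10-second window
]

# Pretend "now" is fixed so the example is reproducible.
NOW = datetime(2013, 3, 1, 12, 0, 10)

# Error summary for the past 10 seconds, grouped by server and status code.
error_summary = slice_and_dice(
    EVENTS,
    since=NOW - timedelta(seconds=10),
    group_by=("Server", "StatusCode"),
    where=lambda event: event["StatusCode"] >= 500,
)

for (server, status), count in error_summary.most_common():
    print(server, status, count)
```

Changing group_by or the where predicate gives a different slice of the same events, which is the interaction the notes describe. A real system would answer such queries from an indexed store rather than a linear scan.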