SlideShare una empresa de Scribd logo
1 de 37
Our journey with Druid
From initial research to full production scale
Danny Ruchman + Itai Yaffe
Nielsen
Introduction
Danny Ruchman Itai Yaffe
● Software Engineer
and team manager
● Focused on big
data processing
solutions
● Big Data
Infrastructure
Developer
● Dealing with Big
Data challenges for
the last 5 years
Nielsen Marketing Cloud (NMC)
● eXelate was acquired by Nielsen 3 years ago
● A Data company
● Machine learning models for insights
● Business decisions
● Targeting
Nielsen Marketing Cloud - questions we try to answer
● How many users of a certain profile can we reach
Campaign for fancy women sneakers -
● How many hits for a specific web page in a date range
NMC high-level architecture
NMC high-level architecture
NMC high-level architecture
The need
● Nielsen Marketing Cloud business question
○ How many unique devices we have encountered:
■ over a given date range
■ for a given set of attributes (segments, regions, etc.)
● Find the number of distinct elements in a data stream which
may contain repeated elements in real time
The need
● Store everything
● Store only 1 bit per device
○ 10B Devices-1.25 GB/day
○ 10B Devices*80K attributes - 100 TB/day
● Approximate
Possible solutions
Naive
Bit VectorApprox.
Our journey
● Elasticsearch
○ Indexing data
■ 250 GB of daily data, 10 hours
■ Affect query time
○ Querying
■ Low concurrency
■ Scans on all the shards of the
corresponding index
What we tried
● Preprocessing
● Statistical algorithms (e.g HyperLogLog)
● K Minimum Values (KMV)
● Estimate set cardinality
● Supports set-theoretic operations
X Y
● ThetaSketch mathematical framework - generalization of KMV
X Y
ThetaSketch
KMV intuition
Number of Std Dev 1 2
Confidence Interval 68.27% 95.45%
16,384 0.78% 1.56%
32,768 0.55% 1.10%
65,536 0.39% 0.78%
ThetaSketch error
K Value
Error
%ERROR
“Very fast highly scalable columnar data-store”
DRUID
Powered by Druid
Why is it cool?
● Store trillions of events, petabytes of data
● Sub-second analytic queries
● Highly scalable
● Cost effective
LongSumAggregator
2016-11-15
Timestamp Attribute Device ID
11111 3a4c1f2d84a5c179435c1fea86e6ae02
2016-11-15 11111 3a4c1f2d84a5c179435c1fea86e6ae02
2016-11-15 11111 5dd59f9bd068f802a7c6dd832bf60d02
2016-11-15 22222 5dd59f9bd068f802a7c6dd832bf60d02
2016-11-15 333333 5dd59f9bd068f802a7c6dd832bf60d02
Timestamp Attribute Simple Count
2016-11-15
2016-11-15
2016-11-15
11111
22222
33333
3
1
1
Roll-up -
Simple Count
Roll-up -
Count Distinct
ThetaSketchAggregator
2016-11-15
Timestamp Attribute Device ID
11111 3a4c1f2d84a5c179435c1fea86e6ae02
2016-11-15 11111 3a4c1f2d84a5c179435c1fea86e6ae02
2016-11-15 11111 5dd59f9bd068f802a7c6dd832bf60d02
2016-11-15 22222 5dd59f9bd068f802a7c6dd832bf60d02
2016-11-15 333333 5dd59f9bd068f802a7c6dd832bf60d02
Timestamp Attribute Count Distinct
2016-11-15
2016-11-15
2016-11-15
11111
22222
33333
2
1
1
Druid architecture
How do we use Druid
Query performance benchmark
Concurrent Queries
Avg.ResponseTime
Druid .
Elasticsearch
Guidelines and pitfalls
● Setup is not easy
Guidelines and pitfalls
● Monitoring your system
Guidelines and pitfalls
● Monitoring your system - important metrics (incomplete list) :
○ Broker query time
○ Historical query time
○ Historical query wait time
○ Pending segments
○ Broker query TTFB
○ ...
Guidelines and pitfalls
● Monitoring your system
Guidelines and pitfalls
● Data modeling
○ Reduce the number of intersections
○ Different datasources for different use cases
2016-11-15
2016-11-15
2016-11-15
Timestamp Attribute
Count
Distinct
Timestamp Attribute Region
Count
Distinct
US XXXXXX US
Porsche
Intent
XXXXXX
Porsche
Intent
... ......
XXXXXX
...
Guidelines and pitfalls
● Query optimization
○ Combine multiple queries into single query
○ Use filters
○ Use groupBy v2 engine (default since 0.10.0)
○ Use timeseries rather than groupBy queries
(where applicable)
Guidelines and pitfalls
● Batch Ingestion
○ EMR Tuning
■ 140-nodes cluster
● 85% spot instances => ~80% cost reduction
○ Druid input file format - Parquet vs CSV
■ Reduced indexing time by X4
■ Reduced used storage by X10
○ Concurrent ingestion tasks (one per EMR cluster and datasource)
■ Set worker select strategy to fillCapacityWithAffinity
Guidelines and pitfalls
● Batch Ingestion (WIP)
○ Action - pre-aggregating the data in Spark Streaming app
■ Aggregating the data by key
● groupBy().agg() for simple counts
● combineByKey() for distinct count (using the DataSketches packages)
Requires setting isInputThetaSketch=true on ingestion task
■ Increased micro-batch interval from 30 minutes to 1 hour
○ Result :
■ # of output records is ~2000X smaller and total size of output files is less
than 1%, compared to the previous version
■ 10X less nodes in the EMR cluster running the MapReduce ingestion job
Guidelines and pitfalls
● Community
Future work
● Improving accuracy for small set <-> big set intersections
● Improving query performance
○ groupBy V2
○ NVMe SSDs
○ Switching to timeseries query type where applicable
○ Apply less granular aggregation where applicable
(e.g 1 month rather than 1 day)
● Upgrading Druid to 0.11.0
● Exploring option of tiering of query processing nodes
○ Reporting vs interactive queries
○ Hot vs cold data
● Using SQL interface (experimental)
● Using Lookups (experimental)
DRUID ES
What have we learned?
● Druid is a columnar, time series data-store
● Can store trillions of events and serve analytic queries in sub-second
● Highly-scalable, cost-effective
● Widely used among Big Data companies
● Can be used for :
○ Distinct count (via ThetaSketch)
○ Simple counts
● Setup is not easy
● Ingestion has little effect on query performance (deep storage usage)
● Provides very good visibility
● Improve query performance by carefully designing your data model and building
your queries
QUESTIONS?
Join us - https://www.comeet.co/jobs/nielsen/33.000
Big Data Architect
Java & Machine Learning Developer
Junior Big Data Developer
And more...
THANK YOU!https://www.linkedin.com/in/danny-ruchman-70211a27/
https://www.linkedin.com/in/itaiy/
Druid vs ES
10TB/day
4 Hours/day
15GB/day
280ms-350ms
$55K/month
DRUID
250GB/day
10 Hours/day
2.5TB (total)
500ms-6000ms
$80K/month
ES

Más contenido relacionado

La actualidad más candente

ClickHouse Deep Dive, by Aleksei Milovidov
ClickHouse Deep Dive, by Aleksei MilovidovClickHouse Deep Dive, by Aleksei Milovidov
ClickHouse Deep Dive, by Aleksei MilovidovAltinity Ltd
 
Migration to ClickHouse. Practical guide, by Alexander Zaitsev
Migration to ClickHouse. Practical guide, by Alexander ZaitsevMigration to ClickHouse. Practical guide, by Alexander Zaitsev
Migration to ClickHouse. Practical guide, by Alexander ZaitsevAltinity Ltd
 
Understanding Query Plans and Spark UIs
Understanding Query Plans and Spark UIsUnderstanding Query Plans and Spark UIs
Understanding Query Plans and Spark UIsDatabricks
 
Adaptive Query Execution: Speeding Up Spark SQL at Runtime
Adaptive Query Execution: Speeding Up Spark SQL at RuntimeAdaptive Query Execution: Speeding Up Spark SQL at Runtime
Adaptive Query Execution: Speeding Up Spark SQL at RuntimeDatabricks
 
ClickHouse Query Performance Tips and Tricks, by Robert Hodges, Altinity CEO
ClickHouse Query Performance Tips and Tricks, by Robert Hodges, Altinity CEOClickHouse Query Performance Tips and Tricks, by Robert Hodges, Altinity CEO
ClickHouse Query Performance Tips and Tricks, by Robert Hodges, Altinity CEOAltinity Ltd
 
Scalable Monitoring Using Apache Spark and Friends with Utkarsh Bhatnagar
Scalable Monitoring Using Apache Spark and Friends with Utkarsh BhatnagarScalable Monitoring Using Apache Spark and Friends with Utkarsh Bhatnagar
Scalable Monitoring Using Apache Spark and Friends with Utkarsh BhatnagarDatabricks
 
Counting Unique Users in Real-Time: Here's a Challenge for You!
Counting Unique Users in Real-Time: Here's a Challenge for You!Counting Unique Users in Real-Time: Here's a Challenge for You!
Counting Unique Users in Real-Time: Here's a Challenge for You!DataWorks Summit
 
Building Better Data Pipelines using Apache Airflow
Building Better Data Pipelines using Apache AirflowBuilding Better Data Pipelines using Apache Airflow
Building Better Data Pipelines using Apache AirflowSid Anand
 
Iceberg + Alluxio for Fast Data Analytics
Iceberg + Alluxio for Fast Data AnalyticsIceberg + Alluxio for Fast Data Analytics
Iceberg + Alluxio for Fast Data AnalyticsAlluxio, Inc.
 
ClickHouse materialized views - a secret weapon for high performance analytic...
ClickHouse materialized views - a secret weapon for high performance analytic...ClickHouse materialized views - a secret weapon for high performance analytic...
ClickHouse materialized views - a secret weapon for high performance analytic...Altinity Ltd
 
Tame the small files problem and optimize data layout for streaming ingestion...
Tame the small files problem and optimize data layout for streaming ingestion...Tame the small files problem and optimize data layout for streaming ingestion...
Tame the small files problem and optimize data layout for streaming ingestion...Flink Forward
 
Rainbird: Realtime Analytics at Twitter (Strata 2011)
Rainbird: Realtime Analytics at Twitter (Strata 2011)Rainbird: Realtime Analytics at Twitter (Strata 2011)
Rainbird: Realtime Analytics at Twitter (Strata 2011)Kevin Weil
 
Apache Hudi: The Path Forward
Apache Hudi: The Path ForwardApache Hudi: The Path Forward
Apache Hudi: The Path ForwardAlluxio, Inc.
 
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...Databricks
 
Importance of ML Reproducibility & Applications with MLfLow
Importance of ML Reproducibility & Applications with MLfLowImportance of ML Reproducibility & Applications with MLfLow
Importance of ML Reproducibility & Applications with MLfLowDatabricks
 
Virtual Flink Forward 2020: Autoscaling Flink at Netflix - Timothy Farkas
Virtual Flink Forward 2020: Autoscaling Flink at Netflix - Timothy FarkasVirtual Flink Forward 2020: Autoscaling Flink at Netflix - Timothy Farkas
Virtual Flink Forward 2020: Autoscaling Flink at Netflix - Timothy FarkasFlink Forward
 
ClickHouse Introduction by Alexander Zaitsev, Altinity CTO
ClickHouse Introduction by Alexander Zaitsev, Altinity CTOClickHouse Introduction by Alexander Zaitsev, Altinity CTO
ClickHouse Introduction by Alexander Zaitsev, Altinity CTOAltinity Ltd
 
High Performance, High Reliability Data Loading on ClickHouse
High Performance, High Reliability Data Loading on ClickHouseHigh Performance, High Reliability Data Loading on ClickHouse
High Performance, High Reliability Data Loading on ClickHouseAltinity Ltd
 
The Current State of Table API in 2022
The Current State of Table API in 2022The Current State of Table API in 2022
The Current State of Table API in 2022Flink Forward
 
Top 5 Mistakes When Writing Spark Applications
Top 5 Mistakes When Writing Spark ApplicationsTop 5 Mistakes When Writing Spark Applications
Top 5 Mistakes When Writing Spark ApplicationsSpark Summit
 

La actualidad más candente (20)

ClickHouse Deep Dive, by Aleksei Milovidov
ClickHouse Deep Dive, by Aleksei MilovidovClickHouse Deep Dive, by Aleksei Milovidov
ClickHouse Deep Dive, by Aleksei Milovidov
 
Migration to ClickHouse. Practical guide, by Alexander Zaitsev
Migration to ClickHouse. Practical guide, by Alexander ZaitsevMigration to ClickHouse. Practical guide, by Alexander Zaitsev
Migration to ClickHouse. Practical guide, by Alexander Zaitsev
 
Understanding Query Plans and Spark UIs
Understanding Query Plans and Spark UIsUnderstanding Query Plans and Spark UIs
Understanding Query Plans and Spark UIs
 
Adaptive Query Execution: Speeding Up Spark SQL at Runtime
Adaptive Query Execution: Speeding Up Spark SQL at RuntimeAdaptive Query Execution: Speeding Up Spark SQL at Runtime
Adaptive Query Execution: Speeding Up Spark SQL at Runtime
 
ClickHouse Query Performance Tips and Tricks, by Robert Hodges, Altinity CEO
ClickHouse Query Performance Tips and Tricks, by Robert Hodges, Altinity CEOClickHouse Query Performance Tips and Tricks, by Robert Hodges, Altinity CEO
ClickHouse Query Performance Tips and Tricks, by Robert Hodges, Altinity CEO
 
Scalable Monitoring Using Apache Spark and Friends with Utkarsh Bhatnagar
Scalable Monitoring Using Apache Spark and Friends with Utkarsh BhatnagarScalable Monitoring Using Apache Spark and Friends with Utkarsh Bhatnagar
Scalable Monitoring Using Apache Spark and Friends with Utkarsh Bhatnagar
 
Counting Unique Users in Real-Time: Here's a Challenge for You!
Counting Unique Users in Real-Time: Here's a Challenge for You!Counting Unique Users in Real-Time: Here's a Challenge for You!
Counting Unique Users in Real-Time: Here's a Challenge for You!
 
Building Better Data Pipelines using Apache Airflow
Building Better Data Pipelines using Apache AirflowBuilding Better Data Pipelines using Apache Airflow
Building Better Data Pipelines using Apache Airflow
 
Iceberg + Alluxio for Fast Data Analytics
Iceberg + Alluxio for Fast Data AnalyticsIceberg + Alluxio for Fast Data Analytics
Iceberg + Alluxio for Fast Data Analytics
 
ClickHouse materialized views - a secret weapon for high performance analytic...
ClickHouse materialized views - a secret weapon for high performance analytic...ClickHouse materialized views - a secret weapon for high performance analytic...
ClickHouse materialized views - a secret weapon for high performance analytic...
 
Tame the small files problem and optimize data layout for streaming ingestion...
Tame the small files problem and optimize data layout for streaming ingestion...Tame the small files problem and optimize data layout for streaming ingestion...
Tame the small files problem and optimize data layout for streaming ingestion...
 
Rainbird: Realtime Analytics at Twitter (Strata 2011)
Rainbird: Realtime Analytics at Twitter (Strata 2011)Rainbird: Realtime Analytics at Twitter (Strata 2011)
Rainbird: Realtime Analytics at Twitter (Strata 2011)
 
Apache Hudi: The Path Forward
Apache Hudi: The Path ForwardApache Hudi: The Path Forward
Apache Hudi: The Path Forward
 
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
 
Importance of ML Reproducibility & Applications with MLfLow
Importance of ML Reproducibility & Applications with MLfLowImportance of ML Reproducibility & Applications with MLfLow
Importance of ML Reproducibility & Applications with MLfLow
 
Virtual Flink Forward 2020: Autoscaling Flink at Netflix - Timothy Farkas
Virtual Flink Forward 2020: Autoscaling Flink at Netflix - Timothy FarkasVirtual Flink Forward 2020: Autoscaling Flink at Netflix - Timothy Farkas
Virtual Flink Forward 2020: Autoscaling Flink at Netflix - Timothy Farkas
 
ClickHouse Introduction by Alexander Zaitsev, Altinity CTO
ClickHouse Introduction by Alexander Zaitsev, Altinity CTOClickHouse Introduction by Alexander Zaitsev, Altinity CTO
ClickHouse Introduction by Alexander Zaitsev, Altinity CTO
 
High Performance, High Reliability Data Loading on ClickHouse
High Performance, High Reliability Data Loading on ClickHouseHigh Performance, High Reliability Data Loading on ClickHouse
High Performance, High Reliability Data Loading on ClickHouse
 
The Current State of Table API in 2022
The Current State of Table API in 2022The Current State of Table API in 2022
The Current State of Table API in 2022
 
Top 5 Mistakes When Writing Spark Applications
Top 5 Mistakes When Writing Spark ApplicationsTop 5 Mistakes When Writing Spark Applications
Top 5 Mistakes When Writing Spark Applications
 

Similar a Our journey with druid - from initial research to full production scale

Using druid for interactive count distinct queries at scale
Using druid for interactive count distinct queries at scaleUsing druid for interactive count distinct queries at scale
Using druid for interactive count distinct queries at scaleItai Yaffe
 
Using druid for interactive count distinct queries at scale @ nmc
Using druid  for interactive count distinct queries at scale @ nmcUsing druid  for interactive count distinct queries at scale @ nmc
Using druid for interactive count distinct queries at scale @ nmcIdo Shilon
 
Druid meetup @walkme
Druid meetup @walkmeDruid meetup @walkme
Druid meetup @walkmeDori Waldman
 
OSMC 2018 | Learnings, patterns and Uber’s metrics platform M3, open sourced ...
OSMC 2018 | Learnings, patterns and Uber’s metrics platform M3, open sourced ...OSMC 2018 | Learnings, patterns and Uber’s metrics platform M3, open sourced ...
OSMC 2018 | Learnings, patterns and Uber’s metrics platform M3, open sourced ...NETWAYS
 
TRHUG 2015 - Veloxity Big Data Migration Use Case
TRHUG 2015 - Veloxity Big Data Migration Use CaseTRHUG 2015 - Veloxity Big Data Migration Use Case
TRHUG 2015 - Veloxity Big Data Migration Use CaseHakan Ilter
 
NetflixOSS Meetup season 3 episode 1
NetflixOSS Meetup season 3 episode 1NetflixOSS Meetup season 3 episode 1
NetflixOSS Meetup season 3 episode 1Ruslan Meshenberg
 
MongoDB World 2019: Packing Up Your Data and Moving to MongoDB Atlas
MongoDB World 2019: Packing Up Your Data and Moving to MongoDB AtlasMongoDB World 2019: Packing Up Your Data and Moving to MongoDB Atlas
MongoDB World 2019: Packing Up Your Data and Moving to MongoDB AtlasMongoDB
 
Auditing data and answering the life long question, is it the end of the day ...
Auditing data and answering the life long question, is it the end of the day ...Auditing data and answering the life long question, is it the end of the day ...
Auditing data and answering the life long question, is it the end of the day ...Simona Meriam
 
Elasticsearch as a time series database
Elasticsearch as a time series databaseElasticsearch as a time series database
Elasticsearch as a time series databasefelixbarny
 
AWS Big Data Demystified #1.2 | Big Data architecture lessons learned
AWS Big Data Demystified #1.2 | Big Data architecture lessons learned AWS Big Data Demystified #1.2 | Big Data architecture lessons learned
AWS Big Data Demystified #1.2 | Big Data architecture lessons learned Omid Vahdaty
 
Big Data in 200 km/h | AWS Big Data Demystified #1.3
Big Data in 200 km/h | AWS Big Data Demystified #1.3  Big Data in 200 km/h | AWS Big Data Demystified #1.3
Big Data in 200 km/h | AWS Big Data Demystified #1.3 Omid Vahdaty
 
Piano Media - approach to data gathering and processing
Piano Media - approach to data gathering and processingPiano Media - approach to data gathering and processing
Piano Media - approach to data gathering and processingMartinStrycek
 
AWS Techniques and lessons writing low cost autoscaling GitLab runners
AWS Techniques and lessons writing low cost autoscaling GitLab runnersAWS Techniques and lessons writing low cost autoscaling GitLab runners
AWS Techniques and lessons writing low cost autoscaling GitLab runnersAnthony Scata
 
[Virtual Meetup] Using Elasticsearch as a Time-Series Database in the Endpoin...
[Virtual Meetup] Using Elasticsearch as a Time-Series Database in the Endpoin...[Virtual Meetup] Using Elasticsearch as a Time-Series Database in the Endpoin...
[Virtual Meetup] Using Elasticsearch as a Time-Series Database in the Endpoin...Anna Ossowski
 
AWS big-data-demystified #1.1 | Big Data Architecture Lessons Learned | English
AWS big-data-demystified #1.1  | Big Data Architecture Lessons Learned | EnglishAWS big-data-demystified #1.1  | Big Data Architecture Lessons Learned | English
AWS big-data-demystified #1.1 | Big Data Architecture Lessons Learned | EnglishOmid Vahdaty
 
Tapjoy OpenStack Summit Paris Breakout Session
Tapjoy OpenStack Summit Paris Breakout SessionTapjoy OpenStack Summit Paris Breakout Session
Tapjoy OpenStack Summit Paris Breakout SessionWeston Jossey
 
Netflix SRE perf meetup_slides
Netflix SRE perf meetup_slidesNetflix SRE perf meetup_slides
Netflix SRE perf meetup_slidesEd Hunter
 
Security Monitoring for big Infrastructures without a Million Dollar budget
Security Monitoring for big Infrastructures without a Million Dollar budgetSecurity Monitoring for big Infrastructures without a Million Dollar budget
Security Monitoring for big Infrastructures without a Million Dollar budgetJuan Berner
 

Similar a Our journey with druid - from initial research to full production scale (20)

Using druid for interactive count distinct queries at scale
Using druid for interactive count distinct queries at scaleUsing druid for interactive count distinct queries at scale
Using druid for interactive count distinct queries at scale
 
Using druid for interactive count distinct queries at scale @ nmc
Using druid  for interactive count distinct queries at scale @ nmcUsing druid  for interactive count distinct queries at scale @ nmc
Using druid for interactive count distinct queries at scale @ nmc
 
Druid meetup @walkme
Druid meetup @walkmeDruid meetup @walkme
Druid meetup @walkme
 
OSMC 2018 | Learnings, patterns and Uber’s metrics platform M3, open sourced ...
OSMC 2018 | Learnings, patterns and Uber’s metrics platform M3, open sourced ...OSMC 2018 | Learnings, patterns and Uber’s metrics platform M3, open sourced ...
OSMC 2018 | Learnings, patterns and Uber’s metrics platform M3, open sourced ...
 
TRHUG 2015 - Veloxity Big Data Migration Use Case
TRHUG 2015 - Veloxity Big Data Migration Use CaseTRHUG 2015 - Veloxity Big Data Migration Use Case
TRHUG 2015 - Veloxity Big Data Migration Use Case
 
NetflixOSS Meetup season 3 episode 1
NetflixOSS Meetup season 3 episode 1NetflixOSS Meetup season 3 episode 1
NetflixOSS Meetup season 3 episode 1
 
Cloud arch patterns
Cloud arch patternsCloud arch patterns
Cloud arch patterns
 
MongoDB World 2019: Packing Up Your Data and Moving to MongoDB Atlas
MongoDB World 2019: Packing Up Your Data and Moving to MongoDB AtlasMongoDB World 2019: Packing Up Your Data and Moving to MongoDB Atlas
MongoDB World 2019: Packing Up Your Data and Moving to MongoDB Atlas
 
Auditing data and answering the life long question, is it the end of the day ...
Auditing data and answering the life long question, is it the end of the day ...Auditing data and answering the life long question, is it the end of the day ...
Auditing data and answering the life long question, is it the end of the day ...
 
Elasticsearch as a time series database
Elasticsearch as a time series databaseElasticsearch as a time series database
Elasticsearch as a time series database
 
AWS Big Data Demystified #1.2 | Big Data architecture lessons learned
AWS Big Data Demystified #1.2 | Big Data architecture lessons learned AWS Big Data Demystified #1.2 | Big Data architecture lessons learned
AWS Big Data Demystified #1.2 | Big Data architecture lessons learned
 
Big Data in 200 km/h | AWS Big Data Demystified #1.3
Big Data in 200 km/h | AWS Big Data Demystified #1.3  Big Data in 200 km/h | AWS Big Data Demystified #1.3
Big Data in 200 km/h | AWS Big Data Demystified #1.3
 
Piano Media - approach to data gathering and processing
Piano Media - approach to data gathering and processingPiano Media - approach to data gathering and processing
Piano Media - approach to data gathering and processing
 
AWS Techniques and lessons writing low cost autoscaling GitLab runners
AWS Techniques and lessons writing low cost autoscaling GitLab runnersAWS Techniques and lessons writing low cost autoscaling GitLab runners
AWS Techniques and lessons writing low cost autoscaling GitLab runners
 
[Virtual Meetup] Using Elasticsearch as a Time-Series Database in the Endpoin...
[Virtual Meetup] Using Elasticsearch as a Time-Series Database in the Endpoin...[Virtual Meetup] Using Elasticsearch as a Time-Series Database in the Endpoin...
[Virtual Meetup] Using Elasticsearch as a Time-Series Database in the Endpoin...
 
AWS big-data-demystified #1.1 | Big Data Architecture Lessons Learned | English
AWS big-data-demystified #1.1  | Big Data Architecture Lessons Learned | EnglishAWS big-data-demystified #1.1  | Big Data Architecture Lessons Learned | English
AWS big-data-demystified #1.1 | Big Data Architecture Lessons Learned | English
 
Tapjoy OpenStack Summit Paris Breakout Session
Tapjoy OpenStack Summit Paris Breakout SessionTapjoy OpenStack Summit Paris Breakout Session
Tapjoy OpenStack Summit Paris Breakout Session
 
Druid
DruidDruid
Druid
 
Netflix SRE perf meetup_slides
Netflix SRE perf meetup_slidesNetflix SRE perf meetup_slides
Netflix SRE perf meetup_slides
 
Security Monitoring for big Infrastructures without a Million Dollar budget
Security Monitoring for big Infrastructures without a Million Dollar budgetSecurity Monitoring for big Infrastructures without a Million Dollar budget
Security Monitoring for big Infrastructures without a Million Dollar budget
 

Más de Itai Yaffe

Mastering Partitioning for High-Volume Data Processing
Mastering Partitioning for High-Volume Data ProcessingMastering Partitioning for High-Volume Data Processing
Mastering Partitioning for High-Volume Data ProcessingItai Yaffe
 
Solving Data Engineers Velocity - Wix's Data Warehouse Automation
Solving Data Engineers Velocity - Wix's Data Warehouse AutomationSolving Data Engineers Velocity - Wix's Data Warehouse Automation
Solving Data Engineers Velocity - Wix's Data Warehouse AutomationItai Yaffe
 
Lessons Learnt from Running Thousands of On-demand Spark Applications
Lessons Learnt from Running Thousands of On-demand Spark ApplicationsLessons Learnt from Running Thousands of On-demand Spark Applications
Lessons Learnt from Running Thousands of On-demand Spark ApplicationsItai Yaffe
 
Why do the majority of Data Science projects never make it to production?
Why do the majority of Data Science projects never make it to production?Why do the majority of Data Science projects never make it to production?
Why do the majority of Data Science projects never make it to production?Itai Yaffe
 
Planning a data solution - "By Failing to prepare, you are preparing to fail"
Planning a data solution - "By Failing to prepare, you are preparing to fail"Planning a data solution - "By Failing to prepare, you are preparing to fail"
Planning a data solution - "By Failing to prepare, you are preparing to fail"Itai Yaffe
 
Evaluating Big Data & ML Solutions - Opening Notes
Evaluating Big Data & ML Solutions - Opening NotesEvaluating Big Data & ML Solutions - Opening Notes
Evaluating Big Data & ML Solutions - Opening NotesItai Yaffe
 
Big data serving: Processing and inference at scale in real time
Big data serving: Processing and inference at scale in real timeBig data serving: Processing and inference at scale in real time
Big data serving: Processing and inference at scale in real timeItai Yaffe
 
Data Lakes on Public Cloud: Breaking Data Management Monoliths
Data Lakes on Public Cloud: Breaking Data Management MonolithsData Lakes on Public Cloud: Breaking Data Management Monoliths
Data Lakes on Public Cloud: Breaking Data Management MonolithsItai Yaffe
 
Unleashing the Power of your Data
Unleashing the Power of your DataUnleashing the Power of your Data
Unleashing the Power of your DataItai Yaffe
 
Data Lake on Public Cloud - Opening Notes
Data Lake on Public Cloud - Opening NotesData Lake on Public Cloud - Opening Notes
Data Lake on Public Cloud - Opening NotesItai Yaffe
 
Airflow Summit 2020 - Migrating airflow based spark jobs to kubernetes - the ...
Airflow Summit 2020 - Migrating airflow based spark jobs to kubernetes - the ...Airflow Summit 2020 - Migrating airflow based spark jobs to kubernetes - the ...
Airflow Summit 2020 - Migrating airflow based spark jobs to kubernetes - the ...Itai Yaffe
 
DevTalks Reimagined 2020 - Funnel Analysis with Spark and Druid
DevTalks Reimagined 2020 - Funnel Analysis with Spark and DruidDevTalks Reimagined 2020 - Funnel Analysis with Spark and Druid
DevTalks Reimagined 2020 - Funnel Analysis with Spark and DruidItai Yaffe
 
Virtual Apache Druid Meetup: AIADA (Ask Itai and David Anything)
Virtual Apache Druid Meetup: AIADA (Ask Itai and David Anything)Virtual Apache Druid Meetup: AIADA (Ask Itai and David Anything)
Virtual Apache Druid Meetup: AIADA (Ask Itai and David Anything)Itai Yaffe
 
Introducing Kafka Connect and Implementing Custom Connectors
Introducing Kafka Connect and Implementing Custom ConnectorsIntroducing Kafka Connect and Implementing Custom Connectors
Introducing Kafka Connect and Implementing Custom ConnectorsItai Yaffe
 
A Day in the Life of a Druid Implementor and Druid's Roadmap
A Day in the Life of a Druid Implementor and Druid's RoadmapA Day in the Life of a Druid Implementor and Druid's Roadmap
A Day in the Life of a Druid Implementor and Druid's RoadmapItai Yaffe
 
Scalable Incremental Index for Druid
Scalable Incremental Index for DruidScalable Incremental Index for Druid
Scalable Incremental Index for DruidItai Yaffe
 
Funnel Analysis with Spark and Druid
Funnel Analysis with Spark and DruidFunnel Analysis with Spark and Druid
Funnel Analysis with Spark and DruidItai Yaffe
 
The benefits of running Spark on your own Docker
The benefits of running Spark on your own DockerThe benefits of running Spark on your own Docker
The benefits of running Spark on your own DockerItai Yaffe
 
Optimizing Spark-based data pipelines - are you up for it?
Optimizing Spark-based data pipelines - are you up for it?Optimizing Spark-based data pipelines - are you up for it?
Optimizing Spark-based data pipelines - are you up for it?Itai Yaffe
 
Scheduling big data workloads on serverless infrastructure
Scheduling big data workloads on serverless infrastructureScheduling big data workloads on serverless infrastructure
Scheduling big data workloads on serverless infrastructureItai Yaffe
 

Más de Itai Yaffe (20)

Mastering Partitioning for High-Volume Data Processing
Mastering Partitioning for High-Volume Data ProcessingMastering Partitioning for High-Volume Data Processing
Mastering Partitioning for High-Volume Data Processing
 
Solving Data Engineers Velocity - Wix's Data Warehouse Automation
Solving Data Engineers Velocity - Wix's Data Warehouse AutomationSolving Data Engineers Velocity - Wix's Data Warehouse Automation
Solving Data Engineers Velocity - Wix's Data Warehouse Automation
 
Lessons Learnt from Running Thousands of On-demand Spark Applications
Lessons Learnt from Running Thousands of On-demand Spark ApplicationsLessons Learnt from Running Thousands of On-demand Spark Applications
Lessons Learnt from Running Thousands of On-demand Spark Applications
 
Why do the majority of Data Science projects never make it to production?
Why do the majority of Data Science projects never make it to production?Why do the majority of Data Science projects never make it to production?
Why do the majority of Data Science projects never make it to production?
 
Planning a data solution - "By Failing to prepare, you are preparing to fail"
Planning a data solution - "By Failing to prepare, you are preparing to fail"Planning a data solution - "By Failing to prepare, you are preparing to fail"
Planning a data solution - "By Failing to prepare, you are preparing to fail"
 
Evaluating Big Data & ML Solutions - Opening Notes
Evaluating Big Data & ML Solutions - Opening NotesEvaluating Big Data & ML Solutions - Opening Notes
Evaluating Big Data & ML Solutions - Opening Notes
 
Big data serving: Processing and inference at scale in real time
Big data serving: Processing and inference at scale in real timeBig data serving: Processing and inference at scale in real time
Big data serving: Processing and inference at scale in real time
 
Data Lakes on Public Cloud: Breaking Data Management Monoliths
Data Lakes on Public Cloud: Breaking Data Management MonolithsData Lakes on Public Cloud: Breaking Data Management Monoliths
Data Lakes on Public Cloud: Breaking Data Management Monoliths
 
Unleashing the Power of your Data
Unleashing the Power of your DataUnleashing the Power of your Data
Unleashing the Power of your Data
 
Data Lake on Public Cloud - Opening Notes
Data Lake on Public Cloud - Opening NotesData Lake on Public Cloud - Opening Notes
Data Lake on Public Cloud - Opening Notes
 
Airflow Summit 2020 - Migrating airflow based spark jobs to kubernetes - the ...
Airflow Summit 2020 - Migrating airflow based spark jobs to kubernetes - the ...Airflow Summit 2020 - Migrating airflow based spark jobs to kubernetes - the ...
Airflow Summit 2020 - Migrating airflow based spark jobs to kubernetes - the ...
 
DevTalks Reimagined 2020 - Funnel Analysis with Spark and Druid
DevTalks Reimagined 2020 - Funnel Analysis with Spark and DruidDevTalks Reimagined 2020 - Funnel Analysis with Spark and Druid
DevTalks Reimagined 2020 - Funnel Analysis with Spark and Druid
 
Virtual Apache Druid Meetup: AIADA (Ask Itai and David Anything)
Virtual Apache Druid Meetup: AIADA (Ask Itai and David Anything)Virtual Apache Druid Meetup: AIADA (Ask Itai and David Anything)
Virtual Apache Druid Meetup: AIADA (Ask Itai and David Anything)
 
Introducing Kafka Connect and Implementing Custom Connectors
Introducing Kafka Connect and Implementing Custom ConnectorsIntroducing Kafka Connect and Implementing Custom Connectors
Introducing Kafka Connect and Implementing Custom Connectors
 
A Day in the Life of a Druid Implementor and Druid's Roadmap
A Day in the Life of a Druid Implementor and Druid's RoadmapA Day in the Life of a Druid Implementor and Druid's Roadmap
A Day in the Life of a Druid Implementor and Druid's Roadmap
 
Scalable Incremental Index for Druid
Scalable Incremental Index for DruidScalable Incremental Index for Druid
Scalable Incremental Index for Druid
 
Funnel Analysis with Spark and Druid
Funnel Analysis with Spark and DruidFunnel Analysis with Spark and Druid
Funnel Analysis with Spark and Druid
 
The benefits of running Spark on your own Docker
The benefits of running Spark on your own DockerThe benefits of running Spark on your own Docker
The benefits of running Spark on your own Docker
 
Optimizing Spark-based data pipelines - are you up for it?
Optimizing Spark-based data pipelines - are you up for it?Optimizing Spark-based data pipelines - are you up for it?
Optimizing Spark-based data pipelines - are you up for it?
 
Scheduling big data workloads on serverless infrastructure
Scheduling big data workloads on serverless infrastructureScheduling big data workloads on serverless infrastructure
Scheduling big data workloads on serverless infrastructure
 

Último

BabyOno dropshipping via API with DroFx.pptx
BabyOno dropshipping via API with DroFx.pptxBabyOno dropshipping via API with DroFx.pptx
BabyOno dropshipping via API with DroFx.pptxolyaivanovalion
 
Invezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signalsInvezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signalsInvezz1
 
Carero dropshipping via API with DroFx.pptx
Carero dropshipping via API with DroFx.pptxCarero dropshipping via API with DroFx.pptx
Carero dropshipping via API with DroFx.pptxolyaivanovalion
 
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...amitlee9823
 
Midocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFxMidocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFxolyaivanovalion
 
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdfMarket Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdfRachmat Ramadhan H
 
Determinants of health, dimensions of health, positive health and spectrum of...
Determinants of health, dimensions of health, positive health and spectrum of...Determinants of health, dimensions of health, positive health and spectrum of...
Determinants of health, dimensions of health, positive health and spectrum of...shambhavirathore45
 
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779Best VIP Call Girls Noida Sector 39 Call Me: 8448380779
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779Delhi Call girls
 
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...SUHANI PANDEY
 
April 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's AnalysisApril 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's Analysismanisha194592
 
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al BarshaAl Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al BarshaAroojKhan71
 
BigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptxBigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptxolyaivanovalion
 
Vip Model Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...
Vip Model  Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...Vip Model  Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...
Vip Model Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...shivangimorya083
 
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...Delhi Call girls
 
Generative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and MilvusGenerative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and MilvusTimothy Spann
 
Log Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptxLog Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptxJohnnyPlasten
 
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...amitlee9823
 

Último (20)

BabyOno dropshipping via API with DroFx.pptx
BabyOno dropshipping via API with DroFx.pptxBabyOno dropshipping via API with DroFx.pptx
BabyOno dropshipping via API with DroFx.pptx
 
Invezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signalsInvezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signals
 
Carero dropshipping via API with DroFx.pptx
Carero dropshipping via API with DroFx.pptxCarero dropshipping via API with DroFx.pptx
Carero dropshipping via API with DroFx.pptx
 
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
 
Midocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFxMidocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFx
 
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdfMarket Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
 
Determinants of health, dimensions of health, positive health and spectrum of...
Determinants of health, dimensions of health, positive health and spectrum of...Determinants of health, dimensions of health, positive health and spectrum of...
Determinants of health, dimensions of health, positive health and spectrum of...
 
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779Best VIP Call Girls Noida Sector 39 Call Me: 8448380779
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779
 
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
 
April 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's AnalysisApril 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's Analysis
 
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al BarshaAl Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
 
BigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptxBigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptx
 
Delhi 99530 vip 56974 Genuine Escort Service Call Girls in Kishangarh
Delhi 99530 vip 56974 Genuine Escort Service Call Girls in  KishangarhDelhi 99530 vip 56974 Genuine Escort Service Call Girls in  Kishangarh
Delhi 99530 vip 56974 Genuine Escort Service Call Girls in Kishangarh
 
(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7
(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7
(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7
 
Vip Model Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...
Vip Model  Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...Vip Model  Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...
Vip Model Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...
 
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
 
Generative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and MilvusGenerative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and Milvus
 
Log Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptxLog Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptx
 
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICECHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
 
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
 

Our journey with druid - from initial research to full production scale

  • 1. Our journey with Druid From initial research to full production scale Danny Ruchman + Itai Yaffe Nielsen
  • 2. Introduction Danny Ruchman Itai Yaffe ● Software Engineer and team manager ● Focused on big data processing solutions ● Big Data Infrastructure Developer ● Dealing with Big Data challenges for the last 5 years
  • 3. Nielsen Marketing Cloud (NMC) ● eXelate was acquired by Nielsen 3 years ago ● A Data company ● Machine learning models for insights ● Business decisions ● Targeting
  • 4. Nielsen Marketing Cloud - questions we try to answer ● How many users of a certain profile can we reach Campaign for fancy women sneakers - ● How many hits for a specific web page in a date range
  • 8. The need ● Nielsen Marketing Cloud business question ○ How many unique devices we have encountered: ■ over a given date range ■ for a given set of attributes (segments, regions, etc.) ● Find the number of distinct elements in a data stream which may contain repeated elements in real time
  • 10. ● Store everything ● Store only 1 bit per device ○ 10B Devices-1.25 GB/day ○ 10B Devices*80K attributes - 100 TB/day ● Approximate Possible solutions Naive Bit VectorApprox.
  • 11. Our journey ● Elasticsearch ○ Indexing data ■ 250 GB of daily data, 10 hours ■ Affect query time ○ Querying ■ Low concurrency ■ Scans on all the shards of the corresponding index
  • 12. What we tried ● Preprocessing ● Statistical algorithms (e.g HyperLogLog)
  • 13. ● K Minimum Values (KMV) ● Estimate set cardinality ● Supports set-theoretic operations X Y ● ThetaSketch mathematical framework - generalization of KMV X Y ThetaSketch
  • 15. Number of Std Dev 1 2 Confidence Interval 68.27% 95.45% 16,384 0.78% 1.56% 32,768 0.55% 1.10% 65,536 0.39% 0.78% ThetaSketch error K Value Error %ERROR
  • 16. “Very fast highly scalable columnar data-store” DRUID
  • 18. Why is it cool? ● Store trillions of events, petabytes of data ● Sub-second analytic queries ● Highly scalable ● Cost effective
  • 19. LongSumAggregator 2016-11-15 Timestamp Attribute Device ID 11111 3a4c1f2d84a5c179435c1fea86e6ae02 2016-11-15 11111 3a4c1f2d84a5c179435c1fea86e6ae02 2016-11-15 11111 5dd59f9bd068f802a7c6dd832bf60d02 2016-11-15 22222 5dd59f9bd068f802a7c6dd832bf60d02 2016-11-15 333333 5dd59f9bd068f802a7c6dd832bf60d02 Timestamp Attribute Simple Count 2016-11-15 2016-11-15 2016-11-15 11111 22222 33333 3 1 1 Roll-up - Simple Count
  • 20. Roll-up - Count Distinct ThetaSketchAggregator 2016-11-15 Timestamp Attribute Device ID 11111 3a4c1f2d84a5c179435c1fea86e6ae02 2016-11-15 11111 3a4c1f2d84a5c179435c1fea86e6ae02 2016-11-15 11111 5dd59f9bd068f802a7c6dd832bf60d02 2016-11-15 22222 5dd59f9bd068f802a7c6dd832bf60d02 2016-11-15 333333 5dd59f9bd068f802a7c6dd832bf60d02 Timestamp Attribute Count Distinct 2016-11-15 2016-11-15 2016-11-15 11111 22222 33333 2 1 1
  • 22. How do we use Druid
  • 23. Query performance benchmark Concurrent Queries Avg.ResponseTime Druid . Elasticsearch
  • 24. Guidelines and pitfalls ● Setup is not easy
  • 25. Guidelines and pitfalls ● Monitoring your system
  • 26. Guidelines and pitfalls ● Monitoring your system - important metrics (incomplete list) : ○ Broker query time ○ Historical query time ○ Historical query wait time ○ Pending segments ○ Broker query TTFB ○ ...
  • 27. Guidelines and pitfalls ● Monitoring your system
  • 28. Guidelines and pitfalls ● Data modeling ○ Reduce the number of intersections ○ Different datasources for different use cases 2016-11-15 2016-11-15 2016-11-15 Timestamp Attribute Count Distinct Timestamp Attribute Region Count Distinct US XXXXXX US Porsche Intent XXXXXX Porsche Intent ... ...... XXXXXX ...
  • 29. Guidelines and pitfalls ● Query optimization ○ Combine multiple queries into single query ○ Use filters ○ Use groupBy v2 engine (default since 0.10.0) ○ Use timeseries rather than groupBy queries (where applicable)
  • 30. Guidelines and pitfalls ● Batch Ingestion ○ EMR Tuning ■ 140-nodes cluster ● 85% spot instances => ~80% cost reduction ○ Druid input file format - Parquet vs CSV ■ Reduced indexing time by X4 ■ Reduced used storage by X10 ○ Concurrent ingestion tasks (one per EMR cluster and datasource) ■ Set worker select strategy to fillCapacityWithAffinity
  • 31. Guidelines and pitfalls ● Batch Ingestion (WIP) ○ Action - pre-aggregating the data in Spark Streaming app ■ Aggregating the data by key ● groupBy().agg() for simple counts ● combineByKey() for distinct count (using the DataSketches packages) Requires setting isInputThetaSketch=true on ingestion task ■ Increased micro-batch interval from 30 minutes to 1 hour ○ Result : ■ # of output records is ~2000X smaller and total size of output files is less than 1%, compared to the previous version ■ 10X less nodes in the EMR cluster running the MapReduce ingestion job
  • 33. Future work ● Improving accuracy for small set <-> big set intersections ● Improving query performance ○ groupBy V2 ○ NVMe SSDs ○ Switching to timeseries query type where applicable ○ Apply less granular aggregation where applicable (e.g 1 month rather than 1 day) ● Upgrading Druid to 0.11.0 ● Exploring option of tiering of query processing nodes ○ Reporting vs interactive queries ○ Hot vs cold data ● Using SQL interface (experimental) ● Using Lookups (experimental)
  • 34. DRUID ES What have we learned? ● Druid is a columnar, time series data-store ● Can store trillions of events and serve analytic queries in sub-second ● Highly-scalable, cost-effective ● Widely used among Big Data companies ● Can be used for : ○ Distinct count (via ThetaSketch) ○ Simple counts ● Setup is not easy ● Ingestion has little effect on query performance (deep storage usage) ● Provides very good visibility ● Improve query performance by carefully designing your data model and building your queries
  • 35. QUESTIONS? Join us - https://www.comeet.co/jobs/nielsen/33.000 Big Data Architect Java & Machine Learning Developer Junior Big Data Developer And more...
  • 37. Druid vs ES 10TB/day 4 Hours/day 15GB/day 280ms-350ms $55K/month DRUID 250GB/day 10 Hours/day 2.5TB (total) 500ms-6000ms $80K/month ES

Notas del editor

  1. Welcome to nielsen marketing cloud offices Thank you for coming to hear about our journey with druid We will try to make it interesting and valuable for you
  2. Me, 5th year @ NMC Today I lead one of the 2 data teams. I am doing it in the last 2 years and I am loving every minute and I am going to talk about NMC data pipeline loads all our data into our data sources put you in context of druid path Itai, 5th year @ NMC Senior big data engineer One of the engineers who lead the druid research and implementation effort will talk about the druid from idea to production, will give super cool tips for beginners Question - pop in - know -> answer, don’t know - will say it’s a good question with complicated solution, let’s talk after the session
  3. Nielsen marketing cloud or NMC in short A group inside Nielsen, Born from exelate company that was acquired by Nielsen 3 years ago Nielsen is a data company and so are we and we had strong business relationship until at some point they decided to go for it and acquired exelate Data company meaning Buying and onboarding data into NMC from data providers, customers and Nielsen data We have huge high quality dataset enrich the data using machine learning models in order to create more relevant quality insights categorize and sell according to a need Helping brands to take intelligence business decisions E.g. Targeting in the digital marketing world Meaning help fit ads to viewers For example street sign can fit to a very small % of people who see it vs Online ads that can fit the profile of the individual that sees it More interesting to the user More chances he will click the ad Better ROI for the marketer
  4. What are the questions we try to answer in NMC that helps our customers to take business decisions ? A lot of questions but to lead to what druid is coming to solveֿ Translating from human problem to technical problem: Uu (distinct) count Simple count
  5. Few words on NMC data pipeline architecture: Frontend layer: Receives all the online and offline data traffic Bare metal on different data centers 3 us,2 eu ,2 pacific near real time - high throughput/low latency challenges Backend layer Aws Cloud based process all the frontend layer outputs ETL’s - load data to data sources aggregated and raw Applications layer Also in the cloud Variety of apps above all our data sources Web - NMC data configurations (segments, audiences etc) campaign analysis , campaign management tools etc. visualized profile graphs reports
  6. In each data center the frontend layer is build out of few systems: Data serving - in house gets the external traffic online and offline analyse the event read/write user repository Run algorithms to identify relevant buyers output to kafka Scale of 5B events/day with 200ms SLA while most events are done in few tens of ms’s Modeling and scoring system - in house Scoring and learning Online machines learning algorithms look alike models Cross device models 1.7T models a day Avarage 1100 model execution in a single event Less than 20ms User repository Aerospike 8B active users in us space 3B active users in eu space Everything goes to kafke in each DC and replicated using uReplicator to the centralized kafka in the cloud uReplicator - uber open source that we modified a little to fit our needs, like mirror maker
  7. Centralizes Kafka cluster in the cloud for all the data from all kafka clusters in all DC’s 50 big brokers 10 topics Deals with 11B events per day Kafka data processing is done mainly with spark streaming Loading sources such as DRUID - real time analytics from the web applications 10B events per day Clustrix - relational DB, aggregated data RDR - data lake 15 TB per day - compressed parquet Snowflake - managed db for DS and analytics purposes Now after presenting the data pipeline itay will go deeper into druid use case and implementation details
  8. Danny talked about the 2 main use-cases - counting unique users and counting hits (or “simple counts”). The first one is somewhat harder, so this is going to be at the focus of my part of the presentation Past… Mention “cardinality” and “real-time dashboard” Explain the need to union and intersect
  9. Demo
  10. Who’s familiar with the count-distinct problem? For the 2 first solutions, we need to store data per device per attribute per day Bit vector - Elastic search /Redis is an example of such system Approximation has a certain error rate
  11. We tried to introduce new cluster dedicated for indexing only and then use backup and restore to the second cluster This method was very expensive and was partially helpful Tuning for better performance also didn’t help too much The story about the demo when we were at the bar (December 20th, 2016)...
  12. Preprocessing - Too many combinations - The formula length is not bounded (show some numbers) HyperLogLog -Implementation in ElasticSearch was too slow (done on query time) - Set operations increase the error dramatically
  13. ThetaSketch is based on KMV Unions and Intersections increase the error The problematic case is intersection of very small set with very big set
  14. The larger the K the smaller the Error However larger K means more memory & storage needed Demo - http://content.research.neustar.biz/blog/kmv.html
  15. So we talked about statistical algorithms, which is nice, but we needed a practical solution… OOTB supports ThetaSketch algorithm Open source, written in Java (works for us, as we know Java…) Who’s familiar with Druid?
  16. Just to give you a sense of where Druid is used in production... http://druid.io/druid-powered.html
  17. I’ll try to cover all these reasons in the next slides
  18. Timeseries database - first thing you need to know about Druid Column types : Timestamp Dimensions Metrics Together they comprise a Datasource Agg is done on ingestion time (outcome is much smaller in size) In query time, it’s closer to a key-value search
  19. We’re just using a different type of aggregator (the ThetaSketch aggregator) to get count distinct, but everything else is essentially the same. The one big difference is multiple ingestions of the same data : For ThetaSketch - not a problem due to the fact is “samples” the data (chooses the K minimal values); Whereas for Sum - we’re going to get wrong numbers (e.g 2X as big if we ingest the data twice) To mitigate it, we’ve added a simple meta-data store to prevent ingesting the same data twice (I’ll discuss it later)
  20. We have 3 types of processes - ingestion, querying, management All processes are decoupled and scalable Ingestion (real time - e.g from Kafka, batch - talk about deep storage, how data is aggregated in ingestion time) Querying (brokers, historicals, query performance during ingestion vs ES) Lambda architecture (for those who don’t know - it’s “a data-processing architecture designed to handle massive quantities of data by taking advantage of both batch- and stream-processing methods.”)
  21. Explain the tuple and what is happening during the aggregation Mention it says “ThetaSketchAggregator”, but again - we also use LongSumAggregator for simple counts We ingest a lot more data today than we did in ES (10B events/day in TBs of data vs 250GB in ES) I mentioned our meta-data store earlier - each component in the flow updates that meta-data store, to prevent ingestion of the same data twice
  22. We can see that while ES response time is exponentially increasing, Druid response time is relatively stable Benchmark using : Druid Cluster : 1x Broker (r3.8xlarge) , 8x Historical (r3.8xlarge) Elasticsearch Cluster : 20 nodes (r3.8xlarge) This is how we use it, now switching to how we got there and the pains...
  23. Setup is not easy Separate config/servers/tuning Caused the deployment to take a few months Use the Druid recommendation for Production configuration
  24. Monitoring Your System Druid has built in support for Graphite ( exports many metrics ), so does Spark. We also export metrics to Graphite from our ingestion tasks (written in Python) and from the NMC backend (aesreporter) to provide a complete, end-to-end view of the system. In this example, query time is very high due to a high number of pending segments (i.e segments that are queued to be scanned in order to answer the query)
  25. http://druid.io/docs/latest/operations/metrics.html Note: Druid is very verbose, can emit a lot of metrics, beware of overwhelming your Graphite server
  26. Monitoring Your System To avoid having to maintain our own Graphite backend, we use Hosted Graphite (https://www.hostedgraphite.com/), which allows us not only to collect metrics from all the components mentioned earlier, but also to build Grafana dashboards and to get Slack notifications when something unexpected happens (e.g ingestion task failure).
  27. Data Modeling If using Theta sketch - reduce the number of intersections (show a slide of the old and new data model). In this example, US is a very big set, Porsche intent is (probably) a small set It didn’t solve all use-cases, but it gives you an idea of how you can approach the problem Different datasources - e.g lower accuracy (i.e lower K) for faster queries VS higher accuracy with a bit slower queries
  28. Combine multiple queries over the REST API (explain why?) There can be billions of rows, so filter the data as part of the query groupBy v2 offers better performance and memory management (e.g generates per-segment results using a fully off-heap map) Switching from groupBy to timeseries query seems to have solved the “io.druid.java.util.common.IAE: Not enough capacity for even one row! Need[1,509,995,528] but have[0].” we had
  29. EMR tuning (spot instances (80% cost reduction, but it comes with a risk of out-biding and nodes are lost), druid MR prod config) Use Parquet Affinity - use fillCapacityWithAffinity to ingest data from multiple EMR clusters to the same Druid cluster (but different datasources) concurrently, see http://druid.io/docs/latest/configuration/indexing-service.html#affinity
  30. Why? Ingestion still takes a lot of time and resources There was almost no “penalty” on the Spark Streaming app (with the new version of the app)
  31. A slide mainly for reference “Breaking” large sets to smaller sets, then use all those sets as part of the query SQL - added in 0.10.0, can be used from the REST API or from JDBC, not ANSI-SQL Lookups - Druid has limited support for joins through query-time lookups. The common use case of query-time lookups is to replace one dimension value (e.g. a String ID) with another value (e.g. a human-readable String value).
  32. Ingestion has little effect on query + sub-second response for even 100s or 1000s of concurrent queries With Druid and ThetaSketch, we’ve improved our ingestion volume and query performance and concurrency by an order of magnitude with a lesser cost, compared to our old solution (We’ve achieved a more performant, scalable, cost-effective solution) Nice comparison of open-source OLAP systems for Big Data here - https://medium.com/@leventov/comparison-of-the-open-source-olap-systems-for-big-data-clickhouse-druid-and-pinot-8e042a5ed1c7
  33. Ingestion has little effect on query + sub-second response for even 100s or 1000s of concurrent queries Cost is for the entire solution (Druid cluster, EMR, etc.) With Druid and ThetaSketch, we’ve improved our ingestion volume and query performance and concurrency by an order of magnitude with a lesser cost, compared to our old solution (We’ve achieved a more performant, scalable, cost-effective solution)