Here at the Nielsen Marketing Cloud we use Druid (http://druid.io/) as one of our main data stores, both for simple counts and for approximate count-distinct (DataSketches).
It's been more than a year since we started using it, ingesting billions of events each day into multiple Druid clusters for different use cases.
In this meet-up, we will share our journey, the challenges we faced, the way we overcame them (at least most of them), and the steps we took to optimize the process around Druid to keep the solution cost-effective.
Before diving into Druid, we will briefly present our data pipeline architecture, starting from the front-end serving system, deployed in a number of geo-locations, to a centralized Kafka cluster in the cloud, and give some examples of the different processes that consume from Kafka and feed our different data sources.
1. Our journey with Druid
From initial research to full production scale
Danny Ruchman + Itai Yaffe
Nielsen
2. Introduction
Danny Ruchman
● Software Engineer and team manager
● Focused on big data processing solutions

Itai Yaffe
● Big Data Infrastructure Developer
● Dealing with Big Data challenges for the last 5 years
3. Nielsen Marketing Cloud (NMC)
● eXelate was acquired by Nielsen 3 years ago
● A Data company
● Machine learning models for insights
● Business decisions
● Targeting
4. Nielsen Marketing Cloud - questions we try to answer
● How many users of a certain profile can we reach?
○ E.g. a campaign for fancy women's sneakers
● How many hits for a specific web page in a date range?
8. The need
● Nielsen Marketing Cloud business question
○ How many unique devices have we encountered:
■ over a given date range
■ for a given set of attributes (segments, regions, etc.)
● Find the number of distinct elements in a data stream which may contain repeated elements, in real time
10. Possible solutions
● Naive - store everything
● Bit vector - store only 1 bit per device
○ 10B devices - 1.25 GB/day
○ 10B devices * 80K attributes - 100 TB/day
● Approximate
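A quick back-of-the-envelope check of the bit-vector numbers (a minimal sketch; the 10B-device and 80K-attribute figures are from the slide):

    devices = 10_000_000_000                 # 10B devices
    attributes = 80_000                      # 80K attributes

    # 1 bit per device per day
    print(devices / 8 / 1e9)                 # ~1.25 GB/day

    # 1 bit per device per attribute per day
    print(devices * attributes / 8 / 1e12)   # ~100 TB/day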
11. Our journey
● Elasticsearch
○ Indexing data
■ 250 GB of daily data took ~10 hours
■ Indexing affected query time
○ Querying
■ Low concurrency
■ Scans all the shards of the corresponding index
12. What we tried
● Preprocessing
● Statistical algorithms (e.g. HyperLogLog)
13. ThetaSketch
● K Minimum Values (KMV)
● Estimates set cardinality
● Supports set-theoretic operations (e.g. union and intersection of sets X and Y)
● ThetaSketch mathematical framework - a generalization of KMV
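A minimal illustration of the idea, using the Apache DataSketches Python bindings (our pipeline uses the Java library; the device IDs here are made up):

    from datasketches import update_theta_sketch, theta_union, theta_intersection

    x = update_theta_sketch()                  # set X
    y = update_theta_sketch()                  # set Y
    for i in range(1_000_000):
        x.update("device-%d" % i)              # devices 0..999,999
    for i in range(500_000, 2_000_000):
        y.update("device-%d" % i)              # devices 500,000..1,999,999

    print(x.get_estimate())                    # ~1,000,000 (approximate)

    union = theta_union()
    union.update(x)
    union.update(y)
    print(union.get_result().get_estimate())   # |X ∪ Y| - ~2,000,000

    inter = theta_intersection()
    inter.update(x)
    inter.update(y)
    print(inter.get_result().get_estimate())   # |X ∩ Y| - ~500,000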
28. Guidelines and pitfalls
● Data modeling
○ Reduce the number of intersections
○ Different datasources for different use cases

Old model - region is just another attribute, so "US" must be intersected with "Porsche Intent" at query time:

Timestamp  | Attribute      | Count Distinct
2016-11-15 | US             | XXXXXX
2016-11-15 | Porsche Intent | XXXXXX
2016-11-15 | ...            | ...

New model - Region is its own dimension, so a filter replaces the intersection:

Timestamp  | Attribute      | Region | Count Distinct
2016-11-15 | Porsche Intent | US     | XXXXXX
2016-11-15 | ...            | ...    | ...
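To make the difference concrete, here is a hedged sketch of the two approaches as Druid native queries (datasource and column names are illustrative, not our actual schema). The old model needs a thetaSketchSetOp post-aggregation; the new model needs only a filter:

    # Old model: intersect the "US" sketch with the "Porsche Intent" sketch
    # at query time (set operations increase the estimation error)
    old_query = {
        "queryType": "groupBy",
        "dataSource": "devices_old",          # illustrative name
        "granularity": "all",
        "intervals": ["2016-11-15/2016-11-16"],
        "aggregations": [
            {"type": "filtered",
             "filter": {"type": "selector", "dimension": "attribute", "value": "US"},
             "aggregator": {"type": "thetaSketch", "name": "us", "fieldName": "device_sketch"}},
            {"type": "filtered",
             "filter": {"type": "selector", "dimension": "attribute", "value": "Porsche Intent"},
             "aggregator": {"type": "thetaSketch", "name": "porsche", "fieldName": "device_sketch"}}
        ],
        "postAggregations": [
            {"type": "thetaSketchEstimate", "name": "us_porsche_intent",
             "field": {"type": "thetaSketchSetOp", "name": "inter", "func": "INTERSECT",
                       "fields": [{"type": "fieldAccess", "fieldName": "us"},
                                  {"type": "fieldAccess", "fieldName": "porsche"}]}}
        ]
    }

    # New model: Region is a dimension, so a plain filter does the job
    new_query = {
        "queryType": "timeseries",
        "dataSource": "devices_new",          # illustrative name
        "granularity": "all",
        "intervals": ["2016-11-15/2016-11-16"],
        "filter": {"type": "and", "fields": [
            {"type": "selector", "dimension": "attribute", "value": "Porsche Intent"},
            {"type": "selector", "dimension": "region", "value": "US"}
        ]},
        "aggregations": [
            {"type": "thetaSketch", "name": "devices", "fieldName": "device_sketch"}
        ]
    }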
29. Guidelines and pitfalls
● Query optimization
○ Combine multiple queries into a single query
○ Use filters
○ Use groupBy v2 engine (default since 0.10.0)
○ Use timeseries rather than groupBy queries
(where applicable)
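For instance, a hedged sketch of sending such a query to the Broker's REST endpoint; before 0.10.0 the v2 groupBy engine had to be requested explicitly via the query context (the host name and port are illustrative):

    import json
    import requests

    # new_query is a Druid native query like the ones sketched above;
    # groupByStrategy only matters for groupBy queries before 0.10.0
    new_query["context"] = {"groupByStrategy": "v2"}

    resp = requests.post("http://broker-host:8082/druid/v2/",
                         headers={"Content-Type": "application/json"},
                         data=json.dumps(new_query))
    print(resp.json())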
30. Guidelines and pitfalls
● Batch Ingestion
○ EMR tuning
■ 140-node cluster
● 85% spot instances => ~80% cost reduction
○ Druid input file format - Parquet vs CSV
■ Reduced indexing time by 4X
■ Reduced storage used by 10X
○ Concurrent ingestion tasks (one per EMR cluster and datasource)
■ Set the worker select strategy to fillCapacityWithAffinity (see the sketch below)
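A hedged sketch of setting that strategy through the Overlord's dynamic worker configuration (the host names and the datasource-to-worker mapping are illustrative):

    import json
    import requests

    # fillCapacityWithAffinity pins each datasource's ingestion tasks to
    # specific middle managers, so concurrent EMR ingestion jobs don't contend
    worker_config = {
        "selectStrategy": {
            "type": "fillCapacityWithAffinity",
            "affinityConfig": {
                "affinity": {
                    "datasource_a": ["middlemanager-1:8091"],   # illustrative
                    "datasource_b": ["middlemanager-2:8091"]
                }
            }
        }
    }
    requests.post("http://overlord-host:8090/druid/indexer/v1/worker",
                  headers={"Content-Type": "application/json"},
                  data=json.dumps(worker_config))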
31. Guidelines and pitfalls
● Batch Ingestion (WIP)
○ Action - pre-aggregating the data in the Spark Streaming app
■ Aggregating the data by key
● groupBy().agg() for simple counts
● combineByKey() for distinct counts (using the DataSketches packages) - see the sketch after this slide
● Requires setting isInputThetaSketch=true on the ingestion task
■ Increased micro-batch interval from 30 minutes to 1 hour
○ Result :
■ # of output records is ~2000X smaller and total size of output files is less
than 1%, compared to the previous version
■ 10X less nodes in the EMR cluster running the MapReduce ingestion job
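A minimal PySpark sketch of the pre-aggregation idea, assuming an RDD of (key, device_id) pairs and the datasketches package on the executors (the production app and its helper names may differ):

    from datasketches import update_theta_sketch, theta_union, compact_theta_sketch

    LG_K = 16  # nominal 2^16 entries; larger K = lower error, more memory

    def sketch_partition(records):
        # build one Theta sketch per key within a partition
        sketches = {}
        for key, device_id in records:
            if key not in sketches:
                sketches[key] = update_theta_sketch(LG_K)
            sketches[key].update(device_id)
        for key, sk in sketches.items():
            yield key, bytes(sk.compact().serialize())  # bytes shuffle safely

    def merge_sketches(a, b):
        union = theta_union(LG_K)
        union.update(compact_theta_sketch.deserialize(a))
        union.update(compact_theta_sketch.deserialize(b))
        return bytes(union.get_result().serialize())

    # one pre-aggregated sketch per key instead of one row per event; the
    # output is then ingested into Druid with isInputThetaSketch=true
    pre_aggregated = events.mapPartitions(sketch_partition) \
                           .reduceByKey(merge_sketches)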
33. Future work
● Improving accuracy for small set <-> big set intersections
● Improving query performance
○ groupBy v2
○ NVMe SSDs
○ Switching to timeseries query type where applicable
○ Apply less granular aggregation where applicable
(e.g. 1 month rather than 1 day)
● Upgrading Druid to 0.11.0
● Exploring the option of tiering query processing nodes
○ Reporting vs interactive queries
○ Hot vs cold data
● Using SQL interface (experimental)
● Using Lookups (experimental)
34. What have we learned?
● Druid is a columnar, time-series data store
● Can store trillions of events and serve analytic queries with sub-second latency
● Highly scalable, cost-effective
● Widely used among Big Data companies
● Can be used for :
○ Distinct count (via ThetaSketch)
○ Simple counts
● Setup is not easy
● Ingestion has little effect on query performance (thanks to deep storage usage)
● Provides very good visibility
● Improve query performance by carefully designing your data model and building your queries
35. QUESTIONS?
Join us - https://www.comeet.co/jobs/nielsen/33.000
Big Data Architect
Java & Machine Learning Developer
Junior Big Data Developer
And more...
37. Druid vs ES

                 DRUID         | ES
Ingested data    10 TB/day     | 250 GB/day
Ingestion time   4 hours/day   | 10 hours/day
Storage          15 GB/day     | 2.5 TB (total)
Query latency    280ms-350ms   | 500ms-6,000ms
Cost             $55K/month    | $80K/month
Editor's Notes
Welcome to the Nielsen Marketing Cloud offices
Thank you for coming to hear about our journey with Druid
We will try to make it interesting and valuable for you
Me, 5th year @ NMC
Today I lead one of the 2 data teams. I've been doing it for the last 2 years and I'm loving every minute
I am going to talk about the NMC data pipeline, which loads all our data into our data sources, to put you in the context of the Druid path
Itai, 5th year @ NMC
Senior big data engineer
One of the engineers who led the Druid research and implementation effort
Will talk about Druid from idea to production, and will give super cool tips for beginners
Questions - feel free to pop in; if we know the answer, we'll answer, and if not, we'll say it's a good question with a complicated solution - let's talk after the session
Nielsen Marketing Cloud, or NMC in short
A group inside Nielsen
Born from eXelate, a company that was acquired by Nielsen 3 years ago
Nielsen is a data company and so are we; we had a strong business relationship, until at some point they decided to go for it and acquired eXelate
Data company meaning
Buying and onboarding data into NMC from data providers, customers and Nielsen data
We have a huge, high-quality dataset
Enrich the data using machine learning models in order to create more relevant, quality insights
Categorize and sell according to a need
Helping brands make intelligent business decisions
E.g. Targeting in the digital marketing world
Meaning help fit ads to viewers
For example, a street sign fits only a very small % of the people who see it, vs
online ads that can fit the profile of the individual who sees them
More interesting to the user
More chances they will click the ad
Better ROI for the marketer
What are the questions we try to answer in NMC that help our customers make business decisions?
A lot of questions, but let's focus on the ones Druid comes to solve
Translating from a human problem to a technical problem:
Unique user (distinct) count
Simple count
Few words on NMC data pipeline architecture:
Frontend layer:
Receives all the online and offline data traffic
Bare metal, in multiple data centers
3 US, 2 EU, 2 Pacific
Near real time - high-throughput/low-latency challenges
Backend layer
AWS cloud based
Processes all the frontend layer outputs
ETLs - load aggregated and raw data into the data sources
Applications layer
Also in the cloud
A variety of apps on top of all our data sources
Web - NMC
Data configuration (segments, audiences, etc.)
Campaign analysis, campaign management tools, etc.
visualized profile graphs
reports
In each data center the frontend layer is built out of a few systems:
Data serving - in house
Gets the external traffic, online and offline
Analyzes the event
Reads/writes the user repository
Runs algorithms to identify relevant buyers
Outputs to Kafka
Scale of 5B events/day with a 200ms SLA, while most events are handled in a few tens of ms
Modeling and scoring system - in house
Scoring and learning
Online machine learning algorithms
Look-alike models
Cross-device models
1.7T models a day
An average of 1,100 model executions per single event
Less than 20ms
User repository
Aerospike
8B active users in the US space
3B active users in the EU space
Everything goes to Kafka in each DC and is replicated, using uReplicator, to the centralized Kafka in the cloud
uReplicator - an Uber open-source project (similar to MirrorMaker) that we modified a little to fit our needs
Centralized Kafka cluster in the cloud for all the data from all Kafka clusters in all DCs
50 big brokers
10 topics
Deals with 11B events per day
Kafka data processing is done mainly with Spark Streaming
Loading data sources such as:
Druid - real-time analytics for the web applications
10B events per day
Clustrix - relational DB, aggregated data
RDR - data lake
15 TB per day - compressed Parquet
Snowflake - managed DB for data science and analytics purposes
Now, after presenting the data pipeline, Itai will go deeper into the Druid use case and implementation details
Danny talked about the 2 main use cases - counting unique users and counting hits (or "simple counts"). The first one is somewhat harder, so it is going to be the focus of my part of the presentation
Past…
Mention “cardinality” and “real-time dashboard”
Explain the need to union and intersect
Demo
Who’s familiar with the count-distinct problem?
For the first 2 solutions, we need to store data per device per attribute per day
Bit vector - Elasticsearch/Redis are examples of such systems
Approximation has a certain error rate
We tried to introduce a new cluster dedicated to indexing only, and then use backup and restore to the second cluster
This method was very expensive and only partially helpful
Tuning for better performance also didn't help much
The story about the demo when we were at the bar (December 20th, 2016)...
Preprocessing - too many combinations - the formula length is not bounded (show some numbers)
HyperLogLog
- Implementation in Elasticsearch was too slow (done at query time)
- Set operations increase the error dramatically
ThetaSketch is based on KMV
Unions and intersections increase the error
The problematic case is the intersection of a very small set with a very big set
The larger the K, the smaller the error
However, a larger K means more memory & storage needed
Demo - http://content.research.neustar.biz/blog/kmv.html
So we talked about statistical algorithms, which is nice, but we needed a practical solution…
Supports the ThetaSketch algorithm out of the box
Open source, written in Java (works for us, as we know Java…)
Who’s familiar with Druid?
Just to give you a sense of where Druid is used in production...
http://druid.io/druid-powered.html
I’ll try to cover all these reasons in the next slides
Timeseries database - the first thing you need to know about Druid
Column types:
Timestamp
Dimensions
Metrics
Together they comprise a Datasource
Aggregation is done at ingestion time (the outcome is much smaller in size)
At query time, it's closer to a key-value search
We’re just using a different type of aggregator (the ThetaSketch aggregator) to get count distinct, but everything else is essentially the same.
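As an illustration, a hedged sketch of what the metricsSpec of an ingestion spec might look like with both aggregator types (column and metric names are illustrative; "size" is the sketch's nominal entries):

    # fragment of a Druid ingestion spec, shown as a Python dict
    metrics_spec = [
        # ThetaSketch aggregator for approximate count-distinct devices
        {"type": "thetaSketch", "name": "device_sketch",
         "fieldName": "device_id", "size": 16384},
        # plain sum for simple counts
        {"type": "longSum", "name": "hits", "fieldName": "hit_count"},
    ]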
The one big difference is multiple ingestions of the same data:
For ThetaSketch - not a problem, due to the fact it "samples" the data (chooses the K minimal values);
Whereas for Sum - we're going to get wrong numbers (e.g. 2X as big if we ingest the data twice)
To mitigate this, we've added a simple meta-data store to prevent ingesting the same data twice (I'll discuss it later)
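A minimal, hypothetical sketch of such an ingestion guard (the talk doesn't describe the actual meta-data store or its schema; sqlite3 and the table layout here are purely illustrative):

    import sqlite3

    con = sqlite3.connect("ingestion_metadata.db")
    con.execute("""CREATE TABLE IF NOT EXISTS ingested
                   (datasource TEXT, interval TEXT,
                    PRIMARY KEY (datasource, interval))""")

    def try_mark_ingested(datasource, interval):
        # returns False if this (datasource, interval) was already ingested
        try:
            with con:
                con.execute("INSERT INTO ingested VALUES (?, ?)",
                            (datasource, interval))
            return True
        except sqlite3.IntegrityError:
            return False

    if try_mark_ingested("devices", "2016-11-15/2016-11-16"):
        pass  # safe to submit the indexing task for this interval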
We have 3 types of processes - ingestion, querying, management. All processes are decoupled and scalable
Ingestion (real time - e.g. from Kafka; batch - talk about deep storage and how data is aggregated at ingestion time). Querying (brokers, historicals, query performance during ingestion vs ES)
Lambda architecture (for those who don’t know - it’s “a data-processing architecture designed to handle massive quantities of data by taking advantage of both batch- and stream-processing methods.”)
Explain the tuple and what is happening during the aggregation
Mention it says “ThetaSketchAggregator”, but again - we also use LongSumAggregator for simple counts
We ingest a lot more data today than we did in ES (10B events/day in TBs of data vs 250GB in ES)
I mentioned our meta-data store earlier - each component in the flow updates that meta-data store, to prevent ingestion of the same data twice
We can see that while ES response time is exponentially increasing, Druid response time is relatively stable
Benchmark using :
Druid Cluster : 1x Broker (r3.8xlarge) , 8x Historical (r3.8xlarge)
Elasticsearch Cluster : 20 nodes (r3.8xlarge)
This is how we use it, now switching to how we got there and the pains...
Setup is not easy
Separate config/servers/tuning
Caused the deployment to take a few months
Use Druid's recommended production configuration
Monitoring Your System
Druid has built-in support for Graphite (exports many metrics), and so does Spark. We also export metrics to Graphite from our ingestion tasks (written in Python) and from the NMC backend (aesreporter), to provide a complete, end-to-end view of the system.
In this example, query time is very high due to a high number of pending segments (i.e. segments that are queued to be scanned in order to answer the query)
http://druid.io/docs/latest/operations/metrics.html
Note: Druid is very verbose, can emit a lot of metrics, beware of overwhelming your Graphite server
Monitoring Your System
To avoid having to maintain our own Graphite backend, we use Hosted Graphite (https://www.hostedgraphite.com/), which allows us not only to collect metrics from all the components mentioned earlier, but also to build Grafana dashboards and to get Slack notifications when something unexpected happens (e.g ingestion task failure).
Data Modeling
If using ThetaSketch - reduce the number of intersections (show a slide of the old and new data model). In this example, US is a very big set, while Porsche Intent is (probably) a small set. It didn't solve all use cases, but it gives you an idea of how you can approach the problem
Different datasources - e.g. lower accuracy (i.e. lower K) for faster queries vs higher accuracy with somewhat slower queries
Combine multiple queries over the REST API (explain why?)
There can be billions of rows, so filter the data as part of the query
groupBy v2 offers better performance and memory management (e.g. generates per-segment results using a fully off-heap map)
Switching from groupBy to timeseries queries seems to have solved the "io.druid.java.util.common.IAE: Not enough capacity for even one row! Need[1,509,995,528] but have[0]." error we had
EMR tuning (spot instances (80% cost reduction, but they come with a risk of being outbid and losing nodes), Druid MR prod config)
Use Parquet
Affinity - use fillCapacityWithAffinity to ingest data from multiple EMR clusters to the same Druid cluster (but different datasources) concurrently, see http://druid.io/docs/latest/configuration/indexing-service.html#affinity
Why? Ingestion still takes a lot of time and resources
There was almost no “penalty” on the Spark Streaming app (with the new version of the app)
A slide mainly for reference
“Breaking” large sets to smaller sets, then use all those sets as part of the query
SQL - added in 0.10.0, can be used from the REST API or from JDBC, not ANSI-SQL
Lookups - Druid has limited support for joins through query-time lookups. The common use case of query-time lookups is to replace one dimension value (e.g. a String ID) with another value (e.g. a human-readable String value).
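For example, a hedged sketch of a query-time lookup in a query's dimension spec (the lookup name "segment_names" and the column names are illustrative):

    # replaces the segment_id dimension value with a human-readable name
    # taken from a lookup registered in Druid
    dimension_spec = {
        "type": "lookup",
        "dimension": "segment_id",
        "outputName": "segment_name",
        "name": "segment_names",
    }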
Ingestion has little effect on query + sub-second response for even 100s or 1000s of concurrent queries
With Druid and ThetaSketch, we've improved our ingestion volume, query performance and concurrency by an order of magnitude, at a lower cost compared to our old solution
(We’ve achieved a more performant, scalable, cost-effective solution)
Nice comparison of open-source OLAP systems for Big Data here - https://medium.com/@leventov/comparison-of-the-open-source-olap-systems-for-big-data-clickhouse-druid-and-pinot-8e042a5ed1c7
Cost is for the entire solution (Druid cluster, EMR, etc.)