Apache Pinot
Meetup
Enabling Real-time OLAP At OLTP Scale
Wednesday, Sep 2, 2020 at 6:00 PM PST
6:00 - 6:10 Introduction - New & Upcoming
6:10 - 7:10 Tech Talks
7:10 - 7:30 Q&A
https://www.sli.do (code: Pinot_Meetup)
Agenda
What have we been up to?
Released 0.4.0
● S3 Deep Storage Support
● Range & Text Indexing Enhancements
● Theta-Sketches & Complex
Aggregation Functions
● Transforms at Ingestion Time
Pinot Video Tutorials
● Pinot on Kubernetes
● Setting up Pinot Cluster
Pinot Talk @Kafka Summit
Community Growth
2x increase in last 3 months!
(Join us on Apache Pinot Slack)
More than 100K Docker pulls
Community Growth
20+ Companies · 127 Contributors · 7283 Commits
Upcoming 0.5.0 Release
Query
● Geo-Spatial Indexing
● Post Aggregation
● Having Clause
● JDBC Client
Ingestion
● Filtering during ingestion
● ProtoBuf format
Operations
● Revamped Cluster Manager UI
New Cluster-Manager UI
What’s Next?
Spark Pinot Connector
Presto-SQL DDLs
Tiered Storage
Realtime-only ingestion
● Eliminate need for offline ingestion pipeline
● Auto compaction and relocation
Upsert
● Support for mutability
Complex Data Types
● List, Map, Struct, JSON
The Speakers
Questions can be added at https://app.sli.do (code: Pinot_Meetup)
Bill Kuang
Staff Software Engineer
LinkedIn
Seunghyun Lee
Senior Software Engineer
LinkedIn
Srisudha Garimella
Manager - Technology
Publicis Sapient
Large Multi-Set Count Distinct Analytics using ThetaSketches in Pinot
Scaling Pinot at LinkedIn for member impacting use cases
Application & Tuning Apache Pinot for Personalization use-case
Approximating Large Multiset Cardinalities @ LinkedIn’s Scale
Apache Pinot
Mayank Shrivastava, Staff Software Engineer
Bill Kuang, Staff Software Engineer
Agenda
1. Introduction
2. Case Study
3. Theta Sketch + Pinot
4. Performance Tuning
Scenario
● I am an advertiser - I want to know how many people I am targeting
● I need analytics on the number of unique viewers who:
○ Live in the US or Canada AND
○ Work at LinkedIn AND
○ Know Java/C++
● But how many people is that?
LinkedIn Campaign Manager
Scenario
US/Canada
Java/C++
LNKD
???
Question we want to answer:
How many people satisfy all of
the following search criteria?
● Live in the US or Canada AND
● Work at LinkedIn AND
● Know Java/C++
Naive Approach 1
●Take all possible combinations of dimensions
○Skills
○Company
○Location
●Count the number of viewers in each combination of dimensions
●GROUP BY + COUNT
Skills Company Location Member
Python, Java Slack US 123
Java, C++ LinkedIn US 234
C++, Go Google Canada 345
Eat, Sleep MyHouse, Inc. US 456
Why Naive Approach 1 Doesn’t Work
●Extremely large data size
●Real product has ~50 columns
●Each column is multi-value
○A member can have multiple skills, etc.
●Data size grows linearly with the number of members
Naive Approach 2
●Hash sets!!!
●Perform set union/intersect/diff operations
●Works great on small datasets
○Maybe 100s, 1000s, or even 10,000s
US [ 1, 2, 3, 4, 5, … ]
Canada [ 2, 3, 4, 5, 6, … ]
Java [ 3, 4, 5, 6, 7, … ]
LinkedIn [ 4, 5, 6, 7, 8, … ]
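To make the hash-set idea concrete, here is a minimal Java sketch of Naive Approach 2 on the toy sets above (the member ids are made up): union the location sets, then intersect with the skill and company sets.

import java.util.HashSet;
import java.util.Set;

// Naive Approach 2: one exact member-id set per dimension value,
// combined with plain set operations.
public class NaiveSetCounting {
  public static void main(String[] args) {
    Set<Long> us = Set.of(1L, 2L, 3L, 4L, 5L);
    Set<Long> canada = Set.of(2L, 3L, 4L, 5L, 6L);
    Set<Long> javaSkill = Set.of(3L, 4L, 5L, 6L, 7L);
    Set<Long> linkedin = Set.of(4L, 5L, 6L, 7L, 8L);

    // (US OR Canada) AND Java AND LinkedIn
    Set<Long> result = new HashSet<>(us);
    result.addAll(canada);        // union
    result.retainAll(javaSkill);  // intersection
    result.retainAll(linkedin);   // intersection

    System.out.println(result.size()); // exact count distinct: 3 (members 4, 5, 6)
  }
}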
Why Naive Approach 2 Doesn’t Work
●Extremely large data size
●Each row grows linearly with the number of members it contains…
●High query times on large HashSets
What if…
Theta Sketch Overview
•Approximation data structure (Similar to HyperLogLogs)
•Used for COUNT DISTINCT queries
•Theta Sketch supports Union, Intersection, and Diff operations
•HyperLogLogs only support Union operations
•Reference
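As a concrete illustration (not part of Pinot itself), here is a minimal Java sketch of these operations using the Apache DataSketches theta library; the merge methods are union(...)/intersect(...) in recent releases (older releases use update(Sketch)), and the member ids and set sizes below are made up.

import org.apache.datasketches.theta.Intersection;
import org.apache.datasketches.theta.SetOperation;
import org.apache.datasketches.theta.Sketch;
import org.apache.datasketches.theta.Union;
import org.apache.datasketches.theta.UpdateSketch;

public class ThetaSketchDemo {
  // Build one theta sketch per dimension value from a range of member ids.
  static UpdateSketch sketchOf(long from, long to) {
    UpdateSketch s = UpdateSketch.builder().setNominalEntries(1024).build(); // logK = 10
    for (long id = from; id <= to; id++) {
      s.update(id); // add one member id
    }
    return s;
  }

  public static void main(String[] args) {
    Sketch us = sketchOf(1, 500_000);
    Sketch canada = sketchOf(400_000, 600_000);
    Sketch javaSkill = sketchOf(300_000, 550_000);

    // (US OR Canada) ...
    Union union = SetOperation.builder().buildUnion();
    union.union(us);
    union.union(canada);

    // ... AND knows Java
    Intersection intersection = SetOperation.builder().buildIntersection();
    intersection.intersect(union.getResult());
    intersection.intersect(javaSkill);

    // approximate count distinct; the exact answer here is 250,001
    System.out.printf("estimate=%.0f%n", intersection.getResult().getEstimate());
  }
}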
Theta Sketch Error Estimation
●The sketch computes a probability distribution over the true cardinality
●Given a number of standard deviations, it returns a confidence range
Theta Sketch Error Estimation (2)
●Error is data/query dependent
●Example
○Approximate set A [1..1 billion] intersect set B [1]
○Approximated Cardinality of Intersection: 0
○Error: 100%
●Generally larger errors with intersections than unions
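A minimal, self-contained Java sketch of how those ranges are read off a sketch, again assuming the Apache DataSketches theta API: getLowerBound(2)/getUpperBound(2) bracket the true cardinality at roughly 95% confidence (2 standard deviations), which is how the error margins later in this talk are reported.

import org.apache.datasketches.theta.UpdateSketch;

public class ThetaSketchBoundsDemo {
  public static void main(String[] args) {
    UpdateSketch sketch = UpdateSketch.builder().setNominalEntries(1024).build();
    for (long id = 1; id <= 1_000_000; id++) {
      sketch.update(id); // one million distinct member ids
    }
    // estimate plus a ~95% confidence interval (2 standard deviations)
    System.out.printf("estimate=%.0f, ~95%% interval=[%.0f, %.0f]%n",
        sketch.getEstimate(), sketch.getLowerBound(2), sketch.getUpperBound(2));
  }
}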
Theta Sketch on Pinot - Example Schema
●Key: set identifier (dimensions)
●Value: <binary> serialized theta sketch
Theta Sketch on Pinot - Query
SELECT
  DISTINCTCOUNTTHETASKETCH(
    sketch,
    'nominalEntries=1024',  -- logK = 10
    'dimValue=US',
    'dimValue=Canada',
    'dimValue=Java',
    'dimValue=LNKD',
    'SET_INTERSECT(SET_UNION($1, $2), $3, $4)')
FROM table
WHERE (dimName = 'Location' AND dimValue IN ('US', 'Canada'))
   OR (dimName = 'Skill' AND dimValue = 'Java')
   OR (dimName = 'Company' AND dimValue = 'LNKD')
Theta Sketch on Pinot - Query (2)
SELECT
  DISTINCTCOUNTTHETASKETCH(
    sketch,
    'nominalEntries=1024',
    'dimValue=US',      -- server returns sketch to broker
    'dimValue=Canada',  -- server returns sketch to broker
    'dimValue=Java',    -- server returns sketch to broker
    'dimValue=LNKD',    -- server returns sketch to broker
    'SET_INTERSECT(SET_UNION($1, $2), $3, $4)')  -- broker evaluates
FROM table
WHERE (dimName = 'Location' AND dimValue IN ('US', 'Canada'))
   OR (dimName = 'Skill' AND dimValue = 'Java')
   OR (dimName = 'Company' AND dimValue = 'LNKD')
(Diagram: each server returns per-filter sketches to the broker, which evaluates the set expression)
Theta Sketch on Pinot - Query (3)
Slightly Better Query - broker doing less work, servers doing more work
Latency reduction - ~70% from real production use case
SELECT
  DISTINCTCOUNTTHETASKETCH(
    sketch,
    'nominalEntries=1024',
    'dimValue IN (''US'', ''Canada'')',
    'dimValue IN (''Java'', ''C++'')',
    'dimValue IN (''LNKD'')',
    'SET_INTERSECT($1, $2, $3)')
FROM table
WHERE (dimName = 'Location' AND dimValue IN ('US', 'Canada'))
   OR (dimName = 'Skill' AND dimValue = 'Java')
   OR (dimName = 'Company' AND dimValue = 'LNKD')
Theta Sketch on Pinot - Query (4)
By distributing more work to servers (less aggregation work on the broker):
● Baseline: no optimizations, single-threaded queries, ~20 QPS
● Filter out empty sketches
● Avoid redundant merges of empty sketches
● Lazy creation of union/intersection/diff operations
● Distribute more tasks to servers
Theta Sketch on Pinot
●90% Reduction in data size
●95% Reduction in Pinot Push execution time
Theta Sketch on Pinot - Preliminary Results
Sketch size logK = 20: latency (95th pct) 500 ms, error margin (95th pct) < 20%
Sketch size logK = 12: latency (95th pct) 50 ms
Theta Sketch Performance
●Error
○Intersection/Diff has higher errors than Union
○Intersections on sets with large cardinality differences tend to have higher error
■E.g. Set(1 billion items) intersect Set(1 item)
●Latency
○The more union/intersection/diff operations, the higher the latency
○The larger the sketch, the higher the latency
Conclusion
High-dimensionality reporting data
Scaling Pinot at LinkedIn for Member Impacting Use Cases
Seunghyun Lee
Senior Software Engineer
Impression Discounting Technique
●Do not recommend the same items if the user has already seen them multiple times
●Apply a discounting factor computed as f(itemImpressionCount) in real time (see the sketch after this list)
●Prevent the recommended items from becoming stale
●Used by 10+ relevance use cases
■Feed item recommendation
■Jobs recommendation
■Potential connection recommendation
■Ads recommendation
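A hedged Java illustration of the idea; the exponential decay and the 0.3 constant are made up for this example and are not LinkedIn's actual formula.

public class ImpressionDiscountingDemo {
  // f(itemImpressionCount): the more often a member has seen an item,
  // the more its relevance score is discounted.
  static double discountedScore(double relevanceScore, long itemImpressionCount) {
    double discount = Math.exp(-0.3 * itemImpressionCount); // illustrative decay
    return relevanceScore * discount;
  }

  public static void main(String[] args) {
    // A feed item scored 0.80 that member X has already seen 5 times
    // drops to roughly 0.18, so it ranks lower in the feed.
    System.out.println(discountedScore(0.80, 5));
  }
}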
Impression Discounting Use Case Architecture
Data Lake
Stream
Processing
Raw Tracking
Data
Data Extraction
& Transformation
Application
Server
event: member X viewed item i1
Q: How many times member X
has seen items (i1, i2, i3…) ?
Processed Data
A: (i1 -> 5, i2 -> 0….)
Let’s apply the discounting factor to the score for i1 since the user has already seen this item many times!
Feed Impression Counting Use Case
Requirements
SELECT sum(count) FROM T
WHERE memberId = $memberId
AND item IN (...) // 1500 items
AND action = ‘VIEW’
AND time > (now - n days)
...
GROUP BY item
memberId item action time count
11111 articlexxx LIKE 2020/09/18 1
22222 articleyyy VIEW 2020/09/18 2
... ... ... ... ....
Schema
Query
Pattern
●3k QPS at the peak
●< 100 milliseconds 99th-percentile latency
●Ingesting at 100k
messages/sec
●100s of billions of records
SLA Requirements
Starting Point
SELECT sum(count) FROM T
WHERE memberId = $memberId
AND itemId IN (1500 items)
AND action = ‘VIEW’
AND time > (now - n days)
...
GROUP BY item
●Sorted index on memberId
●No inverted index - scanning was faster after memberId filtering
●Pre-aggregated data based on days-since-epoch timestamp.
●Using low-level consumer (LLC) solves the scalability issue for real-time
ingestion by allowing each server to consume from a subset of partitions.
"tableIndexConfig": {
"invertedIndexColumns": [],
"sortedColumn": ["memberId"]
...
}
Performance Improvements
Feature/Performance Improvement | QPS | P99 Latency
Baseline | 3000 | -
25 nodes (15 offline + 10 real-time): not able to run 3k QPS
Single node: 50 QPS within SLA
Can we do better?
Stage 1. Optimizing Single-Server Query Performance
(Architecture diagram: broker fans queries out to real-time and offline servers; streaming data is ingested by the real-time servers)
Bottleneck: Dictionary Encoding for Item Column
(Dictionary-based forward index for the item column: dictionary = [aa, b, ccc, dddd]; forward index by docId = [1, 2, 0, 3, 3])
docId memberId item
0 1 b
1 2 ccc
2 2 aa
3 2 dddd
4 3 dddd
●70% of the size wasted due to padding (caused by a few long item strings)
●Item is a high cardinality column → low compression rate
●Worse performance due to random IO for dictionary look-up
(Raw forward index for the item column: values stored back-to-back as "b ccc aa dddd dddd", with a chunk offset header of [1, 4, 6, 10, 14])
docId memberId item
0 1 b
1 2 ccc
2 2 aa
3 2 dddd
4 3 dddd
●Raw forward index reduced the item column size by 70% (no padding)
●Benefits from locality because itemIds are laid out in memberId-sorted order!
●Chunks can be compressed with Snappy (optional)
"tableIndexConfig": {
"noDictionaryColumns": [
"itemId”
]
}
Bottleneck: Dictionary Encoding for Item Column
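A back-of-the-envelope Java sketch of why fixed-width dictionary entries waste space on this column; the toy dictionary matches the example above, and the byte counts ignore offsets and headers.

public class DictionaryPaddingDemo {
  public static void main(String[] args) {
    String[] dictionary = {"aa", "b", "ccc", "dddd"};
    int maxLen = 0, rawBytes = 0;
    for (String s : dictionary) {
      maxLen = Math.max(maxLen, s.length());
      rawBytes += s.length();
    }
    int paddedBytes = maxLen * dictionary.length; // fixed-width dictionary entries
    System.out.println("raw = " + rawBytes + " bytes, padded = " + paddedBytes + " bytes");
    // raw = 10 bytes, padded = 16 bytes; with one very long item among millions
    // of short ones, the padded dictionary can be several times the raw size,
    // which is where the ~70% waste above comes from.
  }
}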
Bottleneck: Processing All Segments
(Diagram: a query "select … where mId = 101 …" computes partition = 101 % 3 = 2 and is routed only to segments holding partition 2, skipping segments for partitions 0 and 1)
"tableIndexConfig": {
"segmentPartitionConfig": {
"columnPartitionMap": {
"memberId": {
"functionName": "murmur",
"numPartitions": 32
}
}
}
}
●Partitioning data on memberId & server side segment pruning
●Processing ~1000 segments → 30 segments per query
SELECT sum(count) FROM T
WHERE memberId = $memberId
...
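A simplified Java sketch of the pruning step described above; Pinot's murmur partition function is replaced by plain modulo so the arithmetic matches the 101 % 3 = 2 example, and Segment is a hypothetical stand-in for Pinot's segment metadata.

import java.util.ArrayList;
import java.util.List;

public class SegmentPruningDemo {
  record Segment(String name, int partitionId) {}

  public static void main(String[] args) {
    int numPartitions = 3;
    List<Segment> allSegments = List.of(
        new Segment("seg_0", 0), new Segment("seg_1", 1), new Segment("seg_2", 2));

    long memberId = 101;
    int queryPartition = (int) (memberId % numPartitions); // = 2

    List<Segment> toQuery = new ArrayList<>();
    for (Segment s : allSegments) {
      if (s.partitionId() == queryPartition) {
        toQuery.add(s); // only segments holding this member's partition are scanned
      }
    }
    System.out.println(toQuery); // [Segment[name=seg_2, partitionId=2]]
  }
}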
Performance Improvements (25 nodes: 15 offline + 10 real-time)
Feature/Performance Improvement | QPS | P99 Latency
Baseline (single machine) | 50 | 100 ms (cluster could not sustain 3000 QPS)
Raw forward index, data partitioning & pruning | 3000 | 270 ms
Stage 2. Optimizing Query Routing
(Architecture diagram: broker fans queries out to real-time and offline servers; streaming data is ingested by the real-time servers)
Bottleneck: Querying All Servers
(Diagram: without replica groups, each query fans out to all servers; with two replica groups RG1 and RG2, query 1 is routed only to RG1 and query 2 only to RG2)
●Adding more servers doesn’t scale beyond a certain point because P99 latency is dominated by slow servers (e.g. garbage collection)
●Replica Group: a set of servers that serves a complete set of segments for a table
●Replica-group-aware segment assignment & routing gives Pinot horizontal scalability! (see the sketch after the config below)
"segmentsConfig": {
"replicaGroupStrategyConfig": {
"numInstancesPerPartition": 2
},
"replication": 3
...
}
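A hedged Java sketch of the routing idea (server names are illustrative): each replica group holds a complete copy of the table's segments, and the broker sends any single query to just one group, so adding groups scales QPS roughly linearly and isolates slow servers.

import java.util.List;

public class ReplicaGroupRoutingDemo {
  public static void main(String[] args) {
    List<List<String>> replicaGroups = List.of(
        List.of("server1", "server2"),   // RG1: complete set of segments
        List.of("server3", "server4"));  // RG2: complete set of segments

    for (int queryId = 0; queryId < 4; queryId++) {
      // round-robin across groups; only the chosen group is queried
      List<String> group = replicaGroups.get(queryId % replicaGroups.size());
      System.out.println("query " + queryId + " -> " + group);
    }
  }
}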
Performance Improvements (25 nodes: 15 offline + 10 real-time)
Feature/Performance Improvement | QPS | P99 Latency
Baseline (single machine) | 50 | 100 ms
Raw forward index, data partitioning & pruning | 3000 | 270 ms
Replica group segment assignment & routing | 3000 | 220 ms
Stage 3. Performance Profiling
(Architecture diagram: broker fans queries out to real-time and offline servers; streaming data is ingested by the real-time servers)
Bottleneck: Inefficient Code
●Iterations of profiling to identify the hotspots and optimize the code
●Improved the inefficient TOP N algorithm on the broker (see the sketch after this list)
○Original: push N, then pop N
○Better: push until size N; then, if x > min_value, pop min_value and push x
●Removed unnecessary JSON serialization & deserialization
●Removed unnecessary String operations
○String.format(), String.split(), String.join()... are very expensive!
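A small Java sketch of the improved TOP N described above, using a bounded min-heap so only candidates larger than the current minimum ever enter the queue.

import java.util.PriorityQueue;

public class TopNDemo {
  static PriorityQueue<Long> topN(long[] values, int n) {
    PriorityQueue<Long> heap = new PriorityQueue<>(n); // min-heap of size <= n
    for (long x : values) {
      if (heap.size() < n) {
        heap.offer(x);                 // push until size N
      } else if (x > heap.peek()) {    // x > min_value
        heap.poll();                   // pop min_value
        heap.offer(x);                 // push x
      }
    }
    return heap; // holds the N largest values seen
  }

  public static void main(String[] args) {
    System.out.println(topN(new long[] {5, 1, 9, 3, 7, 8}, 3)); // contains 7, 8, 9 (heap order)
  }
}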
Performance Improvements
Feature/Performance Improvement | QPS | P99 Latency
Baseline (single machine) | 50 | 100 ms
Raw forward index, data partitioning & pruning | 3000 | 270 ms
Replica group segment assignment & routing | 3000 | 220 ms
Priority queue fix & remove JSON conversions | 3000 | 170 ms
Avoid String.format, String.split, String.join | 3000 | 100 ms
Stage 4. Optimizing Real-time Server Performance
(Architecture diagram: broker fans queries out to real-time and offline servers; streaming data is ingested by the real-time servers)
Bottleneck: Frequent GCs on Real-time Servers
●Pinot has been using off-heap for loading immutable segments
(MMAP, Direct ByteBuffer)
●Consuming segments used to store consumed data on JVM heap.
●Use Off-heap for consumed data to avoid GCs.
●Performed well at the ingestion rate of ~100k messages/sec
●Default setting for all use cases @ LinkedIn
pinot.server.instance.realtime.alloc.offheap = true
(server-side config)
Bottleneck: Large Sized Real-time Segments
Raw consumed rows: (1111, a, 2020/09/18, 1), (2222, a, 2020/09/18, 1), (1111, a, 2020/09/18, 1)
Aggregated in the consuming segment:
memberId | itemId | time | count
1111 | a | 2020/09/18 | 2
2222 | b | 2020/09/18 | 1
●While offline segments are pre-aggregated, real-time segments
contain too many rows due to high message throughput
●Aggregate metrics feature aggregates data on-the-fly for
consuming segments
"tableIndexConfig": {
"aggregateMetrics": true
...
}
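A simplified Java sketch of what aggregateMetrics does conceptually for a consuming segment: rows with the same dimension key are collapsed by summing the metric instead of being appended. The row values mirror the slide; everything else is illustrative.

import java.util.HashMap;
import java.util.Map;

public class AggregateMetricsDemo {
  record Key(long memberId, String itemId, String day) {}

  static void ingest(Map<Key, Long> segment, long memberId, String itemId, String day, long count) {
    // merge on the dimension key, summing the metric column
    segment.merge(new Key(memberId, itemId, day), count, Long::sum);
  }

  public static void main(String[] args) {
    Map<Key, Long> consumingSegment = new HashMap<>();
    ingest(consumingSegment, 1111, "a", "2020/09/18", 1);
    ingest(consumingSegment, 2222, "b", "2020/09/18", 1);
    ingest(consumingSegment, 1111, "a", "2020/09/18", 1); // same key: merged, not appended
    // two rows remain: (1111, a, 2020/09/18) -> 2 and (2222, b, 2020/09/18) -> 1
    System.out.println(consumingSegment);
  }
}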
Performance Improvements
Feature/Performance Improvement | QPS | P99 Latency
Baseline (single machine) | 50 | 100 ms
Raw forward index, data partitioning & pruning | 3000 | 270 ms
Replica group segment assignment & routing | 3000 | 220 ms
Priority queue fix & remove JSON conversions | 3000 | 170 ms
Avoid String.format, String.split, String.join | 3000 | 100 ms
Off-heap, aggregate metrics on real-time server | 3000 | 80 ms
Impression discounting use cases today
10+ impression discounting use cases | 50K+ queries per second (50% of entire traffic) | <100 ms 99th-percentile latency
Takeaways
●Supporting the impression discounting use case pushed the limits of Pinot to the next level
○Proved that a high-throughput, low-latency use case can be served by a columnar store!
●Profiling is important
○A small code change can make a huge performance improvement
●Working on Pinot is fun!
○Low-level systems: data storage format, query engine, garbage collection
○Distributed systems: segment assignment & routing, partitioning, replication
Application & Tuning Apache Pinot for
Personalization use-case
Personalized Platform - Real-time, Contextual and Personalized feeds
Personalization Levers in Auto Industry
(Diagram: recently viewed vehicle · last customized build · most frequently visited models · most popular vehicles · derive insights · recommend features/specs)
We have to start gathering information in
order to build profiles.
(Customer journey diagram: visit website, view inventory, find model specs, schedule a test drive, "like" a vehicle, explore a specific trim, compare models, visit vehicle page, visit vehicle details page, customize models, customize builds of vehicles, visit compare site, view incentives, find a dealer, get a quote, request a brochure, sign up for updates)
Apache Pinot – Key Component of the Architecture
● Real-time OLAP data store
● Distributed system
● Highly scalable
● Supports low-latency analytics
High Level Architecture with Apache Pinot
(Architecture diagram)
Performance Stats
Write TPS: 5000 (5x) | Read QPS: 5000 (10x) | Data Retention: 3 months | Data Volume: 250–300 million records
Resource Availability: 40 cores, 160 GB RAM
Accepted Throughput: 95th percentile < 5-10 ms at resource utilization < 70%
Issues, Lessons Learnt & Tuning Pinot
Ingestion @ 5K TPS
Querying @ 5K QPS
Data Volume: <= 15 million
Effect of Number of Partitions
(Diagram: Kafka partitions feeding a consuming segment with a flush threshold)
SCALE: expected traffic = 10,000 records/day; consuming-segment threshold = 10,000 records
● Kafka partitions are a means of achieving parallelism.
● For instance, having 10 partitions in this case means each consuming segment would stay in memory for 10 days and we would end up with 10 segments (see the worked arithmetic below).
● The underlying Kafka topic retention has to be adjusted to ensure there is no data loss in any situation.
● Use the real-time provisioning tool to choose the segment size.
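The arithmetic from this slide, worked through in a small Java snippet; the 10,000 records/day traffic and the 10,000-row threshold are from the slide, everything else is illustrative.

public class ConsumingSegmentMath {
  public static void main(String[] args) {
    long recordsPerDay = 10_000;
    long flushThresholdRows = 10_000;
    int partitions = 10;

    long rowsPerPartitionPerDay = recordsPerDay / partitions;                // 1,000
    long daysToFillOneSegment = flushThresholdRows / rowsPerPartitionPerDay; // 10 days in memory
    long segmentsPerFlushRound = partitions;                                 // 10 small segments

    System.out.println(daysToFillOneSegment + " days in memory, " + segmentsPerFlushRound + " segments");
  }
}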
Problems Faced: Tuning Memory Config Parameters & Applying QPS
● OUT OF MEMORY: as the segment size increased, OOM errors started appearing
● RESPONSE TIMES SHOOTING UP: 1. as QPS increases 2. as data volume increases
Best Practices Setting Up a Pinot Cluster
Observability:
● Prometheus + Grafana used to capture and analyze the JMX metrics to get insights
● Heap size, CPU utilization, RAM utilized, off-heap usage, etc. helped
Traffic & Thresholds:
● Time, size, and number of records are the flush thresholds
● As a recommended practice, time and size can be used hand in hand
Memory Management:
● Two kinds of memory modes are supported for consuming and completed segments – MMAP and heap
● Based on the recommendation from the runtime provisioning tool, this can be configured as off-heap if memory resources are available
Runtime Provisioning Tool:
● Experiment and derive the best segment size based on a sample segment, retention period, etc.
Three Steps to Tune the P95 Value
01 Partition-aware routing: reduced segments queried by n-fold (n = no. of partitions on the topic)
02 Applied sorted index: sorted index vs. inverted index
03 Replica groups: query routed to a subset of servers, improving scatter-and-gather; traffic on each server = (total traffic) / (no. of replica groups)
Math Behind the Pinot Strategies Applied
(Diagram: 2 Kafka partitions (P1, P2) with 3 replicas per partition spread across 6 Pinot servers; the brokers receive X QPS in total and each server handles roughly X/6 QPS)
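The X/6 figure worked through in Java, assuming each query targets a single memberId (so partition-aware routing sends it to one partition) and replica-group routing picks one of the three replicas; the total QPS value is illustrative.

public class PerServerTrafficMath {
  public static void main(String[] args) {
    double totalQps = 6000;   // X, illustrative
    int replicaGroups = 3;    // replicas per partition = 3
    int partitions = 2;       // P1, P2

    double qpsPerReplicaGroup = totalQps / replicaGroups;  // X/3 per replica group
    double qpsPerServer = qpsPerReplicaGroup / partitions; // X/6 per server = 1000 here
    System.out.println(qpsPerServer);
  }
}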
Pinot Cluster Setup
Number of Partitions in Kafka = 3
Number of Replicas per Partition = 3
Segment Size = 100 MB = 10 mil records
Data volume of 250–300+ million records
Throughput of 10k TPS
Latency: 30 ms → 9 ms (3x better)
Official website – https://pinot.apache.org/
OLAP DBs comparison – https://medium.com/@leventov/comparison-of-the-open-source-olap-systems-for-big-data-clickhouse-druid-and-pinot-8e042a5ed1c7
Q&A
Ask questions at https://www.sli.do
(code: Pinot_Meetup)