3. What have we been up to?
Released 0.4.0
● S3 Deep Storage Support
● Range & Text Indexing Enhancements
● Theta-Sketches & Complex Aggregation Functions
● Transforms at Ingestion Time
Pinot Video Tutorials
● Pinot on Kubernetes
● Setting up Pinot Cluster
Pinot Talk @Kafka Summit
8. What’s Next?
Spark Pinot Connector
Presto-SQL DDLs
Tiered Storage
Realtime-only ingestion
● Eliminate need for offline ingestion pipeline
● Auto compaction and relocation
Upsert
● Support for mutability
Complex Data Types
● List, Map, Struct, JSON
9. The Speakers
Questions can be added at https://app.sli.do (code: Pinot_Meetup)
Bill Kuang
Staff Software Engineer
LinkedIn
Seunghyun Lee
Senior Software Engineer
LinkedIn
Srisudha Garimella
Manager - Technology
Publicis Sapient
Large Multi-Set Count Distinct Analytics using ThetaSketches in Pinot
Scaling Pinot at LinkedIn for member-impacting use cases
Application & Tuning Apache Pinot for a Personalization use-case
10. Approximating Large Multiset Cardinalities @ LinkedIn’s Scale
Mayank Shrivastava
Staff Software Engineer
Bill Kuang
Staff Software Engineer
Apache Pinot
12. Scenario
●I am an advertiser - I want to know how many people I am targeting
●I need analytics on the number of unique viewers who:
○Live in the US or Canada AND
○Work at LinkedIn AND
○Know Java/C++
●But how many people is that?
15. Naive Approach 1
●Take all possible combinations of dimensions
○Skills
○Company
○Location
●Count the number of viewers in each combination of dimensions
●GROUP BY + COUNT
Skills | Company | Location | Member
Python, Java | Slack | US | 123
Java, C++ | LinkedIn | US | 234
C++, Go | Google | Canada | 345
Eat, Sleep | MyHouse, Inc. | US | 456
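A minimal Java sketch of the GROUP BY + COUNT idea over the table above (illustrative only; the real pipeline would express this as a Pinot query). The inner loop shows how a multi-value column fans a single member out into several dimension combinations:

import java.util.*;

public class NaiveGroupByCount {
    public static void main(String[] args) {
        // Hypothetical in-memory rows mirroring the table above:
        // member, skills (multi-value), company, location
        Object[][] rows = {
            {123L, List.of("Python", "Java"), "Slack", "US"},
            {234L, List.of("Java", "C++"), "LinkedIn", "US"},
            {345L, List.of("C++", "Go"), "Google", "Canada"},
        };
        // GROUP BY (skill, company, location) + COUNT(DISTINCT member)
        Map<List<String>, Set<Long>> groups = new HashMap<>();
        for (Object[] r : rows) {
            long member = (Long) r[0];
            for (String skill : (List<String>) r[1]) {  // multi-value fan-out
                List<String> key = List.of(skill, (String) r[2], (String) r[3]);
                groups.computeIfAbsent(key, k -> new HashSet<>()).add(member);
            }
        }
        groups.forEach((k, v) -> System.out.println(k + " -> " + v.size()));
    }
}

With ~50 columns, many of them multi-value, the number of pre-computed combinations explodes, which is what the next slide calls out.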
16. Why Naive Approach 1 Doesn’t Work
●Extremely large data size
●Real product has ~50 columns
●Each column is multi-value
○A member can have multiple skills, etc.
●Grows linearly with the number of members
17. Naive Approach 2
●Hash sets!!!
●Perform set union/intersect/diff operations
●Works great on small datasets
○Maybe 100s, 1000s, or even 10,000s
US [ 1, 2, 3, 4, 5, … ]
Canada [ 2, 3, 4, 5, 6, … ]
Java [ 3, 4, 5, 6, 7, … ]
LinkedIn [ 4, 5, 6, 7, 8, … ]
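A minimal Java sketch of this approach, mirroring the example sets above (illustrative only):

import java.util.*;

public class HashSetApproach {
    public static void main(String[] args) {
        Set<Long> us = new HashSet<>(List.of(1L, 2L, 3L, 4L, 5L));
        Set<Long> canada = new HashSet<>(List.of(2L, 3L, 4L, 5L, 6L));
        Set<Long> java = new HashSet<>(List.of(3L, 4L, 5L, 6L, 7L));
        Set<Long> linkedin = new HashSet<>(List.of(4L, 5L, 6L, 7L, 8L));

        // (US ∪ Canada) ∩ Java ∩ LinkedIn
        Set<Long> result = new HashSet<>(us);
        result.addAll(canada);       // union
        result.retainAll(java);      // intersection
        result.retainAll(linkedin);  // intersection
        System.out.println(result.size());  // exact COUNT DISTINCT
    }
}

The answer is exact, but both memory and set-operation time grow with the raw member lists, which is exactly what the next slide calls out.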
18. Why Naive Approach 2 Doesn’t Work
●Extremely large data size
●Grows linearly with the number of members per row…
●High query times on large HashSets
21. Theta Sketch Overview
•Approximation data structure (similar to HyperLogLog)
•Used for COUNT DISTINCT queries
•Theta Sketches support union, intersection, and diff operations
•HyperLogLog supports only union operations
22. Theta Sketch Error Estimation
●The sketch computes a probability distribution over the true count
●Given a number of standard deviations, it returns a confidence range
23. Theta Sketch Error Estimation (2)
●Error is data/query dependent
●Example
○Approximate set A [1..1 billion] intersect set B [1]
○Approximated Cardinality of Intersection: 0
○Error: 100%
●Generally larger errors with intersections than unions
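The following is a minimal Java sketch of this kind of computation with the Apache DataSketches library, which Pinot's theta sketch support builds on; it assumes the org.apache.datasketches.theta API and the nominalEntries=1024 setting used on the later slides. The last line shows the error estimation described above: an estimate plus a range at a chosen number of standard deviations.

import org.apache.datasketches.theta.*;

public class ThetaSketchDemo {
    public static void main(String[] args) {
        UpdateSketch us = UpdateSketch.builder().setNominalEntries(1024).build();
        UpdateSketch canada = UpdateSketch.builder().setNominalEntries(1024).build();
        UpdateSketch java = UpdateSketch.builder().setNominalEntries(1024).build();
        for (long member = 0; member < 1_000_000; member++) {
            if (member % 2 == 0) us.update(member);      // members in the US
            if (member % 5 == 0) canada.update(member);  // members in Canada
            if (member % 3 == 0) java.update(member);    // members who know Java
        }
        // Mirrors SET_INTERSECT(SET_UNION($1, $2), $3)
        Union union = SetOperation.builder().buildUnion();
        union.union(us);
        union.union(canada);
        Intersection inter = SetOperation.builder().buildIntersection();
        inter.intersect(union.getResult());
        inter.intersect(java);
        CompactSketch result = inter.getResult();
        // Estimate plus a ~95% confidence range (2 standard deviations)
        System.out.printf("estimate=%.0f in [%.0f, %.0f]%n",
            result.getEstimate(), result.getLowerBound(2), result.getUpperBound(2));
    }
}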
24. Theta Sketch on Pinot - Example Schema
●Key: set identifier (dimensions)
●Value: <binary> serialized theta sketch
25. Theta Sketch on Pinot - Query
SELECT
  DISTINCTCOUNTTHETASKETCH(
    sketch,
    'nominalEntries=1024', -- logK = 10
    'dimValue=US',
    'dimValue=Canada',
    'dimValue=Java',
    'dimValue=LNKD',
    'SET_INTERSECT(SET_UNION($1, $2), $3, $4)')
FROM table
WHERE (dimName = 'Location' AND dimValue IN ('US', 'Canada'))
   OR (dimName = 'Skill' AND dimValue = 'Java')
   OR (dimName = 'Company' AND dimValue = 'LNKD')
26. Theta Sketch on Pinot - Query (2)
SELECT
  DISTINCTCOUNTTHETASKETCH(
    sketch,
    'nominalEntries=1024',
    'dimValue=US',     -- server returns sketch to broker
    'dimValue=Canada', -- server returns sketch to broker
    'dimValue=Java',   -- server returns sketch to broker
    'dimValue=LNKD',   -- server returns sketch to broker
    'SET_INTERSECT(SET_UNION($1, $2), $3, $4)') -- broker evaluates
FROM table
WHERE (dimName = 'Location' AND dimValue IN ('US', 'Canada'))
   OR (dimName = 'Skill' AND dimValue = 'Java')
   OR (dimName = 'Company' AND dimValue = 'LNKD')
27. Theta Sketch on Pinot - Query (3)
A slightly better query - the broker does less work and the servers do more.
Latency reduction: ~70% in a real production use case.
SELECT
  DISTINCTCOUNTTHETASKETCH(
    sketch,
    'nominalEntries=1024',
    'dimValue IN (''US'', ''Canada'')',
    'dimValue IN (''Java'', ''C++'')',
    'dimValue IN (''LNKD'')',
    'SET_INTERSECT($1, $2, $3)')
FROM table
WHERE (dimName = 'Location' AND dimValue IN ('US', 'Canada'))
   OR (dimName = 'Skill' AND dimValue = 'Java')
   OR (dimName = 'Company' AND dimValue = 'LNKD')
28. Theta Sketch on Pinot - Query (4)
By distributing more work to the servers (less aggregation work on the broker), throughput improved across three stages:
●No optimizations; single-threaded queries
●Filter out empty sketches; lazy creation of unions/intersections/diffs; single-threaded queries
●Avoid redundant merges of empty sketches; distribute more tasks to the servers; single-threaded queries → 20 QPS
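A sketch of what "filter out empty sketches" and "lazy creation" can look like (illustrative Java using the DataSketches types from the earlier example; not Pinot's actual implementation):

import java.util.List;
import org.apache.datasketches.theta.*;

public class LazyMerge {
    // Merge per-segment sketches, skipping empties and only allocating
    // the Union once there is actually something to merge.
    static Sketch mergeAll(List<Sketch> perSegmentSketches) {
        Union union = null;
        for (Sketch s : perSegmentSketches) {
            if (s == null || s.isEmpty()) continue;          // filter out empty sketches
            if (union == null) {
                union = SetOperation.builder().buildUnion(); // lazy creation
            }
            union.union(s);
        }
        return union == null ? null : union.getResult();     // caller treats null as empty
    }
}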
29. Theta Sketch on Pinot
●90% Reduction in data size
●95% Reduction in Pinot Push execution time
31. Theta Sketch Performance
●Error
○Intersections/diffs have higher error than unions
○Intersections of sets with large cardinality differences tend to have higher error
■E.g., Set(1 billion items) intersect Set(1 item)
●Latency
○The more union/intersection/diff operations, the higher the latency
○The larger the sketch, the higher the latency
33. Scaling Pinot at LinkedIn for Member Impacting Use Cases
Seunghyun Lee
Senior Software Engineer
34. Impression Discounting Technique
●Do not recommend the same items if the user has already seen them multiple times.
●Apply a discounting factor computed as f(itemImpressionCount) in a real-time fashion.
●Prevents recommended items from becoming stale.
●Used by 10+ relevance use cases
■Feed item recommendation
■Jobs recommendation
■Potential connection recommendation
■Ads recommendation
35. Impression Discounting Use Case Architecture
[Diagram: raw tracking data flows into the data lake and stream processing; data extraction & transformation produces processed data that the application server queries]
Event: member X viewed item i1
Q: How many times has member X seen items (i1, i2, i3, …)?
A: (i1 → 5, i2 → 0, …)
Let's apply the discounting factor to the score for i1, since the user has already seen this item many times!
36. Feed Impression Counting Use Case
Requirements

Schema:
memberId | item | action | time | count
11111 | articlexxx | LIKE | 2020/09/18 | 1
22222 | articleyyy | VIEW | 2020/09/18 | 2
... | ... | ... | ... | ...

Query pattern:
SELECT sum(count) FROM T
WHERE memberId = $memberId
AND item IN (...) -- 1500 items
AND action = 'VIEW'
AND time > (now - n days)
...
GROUP BY item

SLA requirements:
●3k QPS at the peak
●< 100 milliseconds 99th-percentile latency
●Ingesting at 100k messages/sec
●100s of billions of records
37. Starting Point
SELECT sum(count) FROM T
WHERE memberId = $memberId
AND itemId IN (1500 items)
AND action = 'VIEW'
AND time > (now - n days)
...
GROUP BY item
●Sorted index on memberId
●No inverted index - scanning was faster after memberId filtering
●Pre-aggregated data based on days-since-epoch timestamp.
●Using low-level consumer (LLC) solves the scalability issue for real-time
ingestion by allowing each server to consume from a subset of partitions.
"tableIndexConfig": {
"invertedIndexColumns": [],
"sortedColumn": ["memberId"]
...
}
39. Stage 1. Optimizing Single-Server Query Performance
[Diagram: streaming data feeds realtime servers, offline data feeds offline servers; the broker fans queries out to both]
40. Bottleneck: Dictionary Encoding for Item Column
Dictionary-based forward index for the item column:
dictionary: 0 → aa, 1 → b, 2 → ccc, 3 → dddd (entries padded to the longest value)
forward index (dictionary id per docId): [1, 2, 0, 3, 3]

docId | memberId | item
0 | 1 | b
1 | 2 | ccc
2 | 2 | aa
3 | 2 | dddd
4 | 3 | dddd

●70% of the size was wasted on padding (due to a few long item strings)
●Item is a high-cardinality column → low compression rate
●Worse performance due to random IO for dictionary lookups
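To make the padding and random-IO points concrete, here is a toy Java version of the dictionary-encoded layout above (illustrative only, not Pinot's storage code):

import java.util.*;

public class DictForwardIndex {
    public static void main(String[] args) {
        String[] items = {"b", "ccc", "aa", "dddd", "dddd"};  // item per docId
        // Build a sorted dictionary: value -> id. In the on-disk format every
        // entry is padded to the longest value ("dddd"), wasting space.
        String[] dict = new TreeSet<>(Arrays.asList(items)).toArray(new String[0]);
        int[] fwd = new int[items.length];
        for (int docId = 0; docId < items.length; docId++) {
            fwd[docId] = Arrays.binarySearch(dict, items[docId]);
        }
        System.out.println(Arrays.toString(fwd));  // [1, 2, 0, 3, 3]
        // Reading docId 3 needs a second, random-access dictionary lookup:
        System.out.println(dict[fwd[3]]);          // dddd
    }
}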
41. Bottleneck: Dictionary Encoding for Item Column
Raw forward index for the item column:
values stored contiguously: b ccc aa dddd dddd
chunk offset header (cumulative end offsets): [1, 4, 6, 10, 14]

docId | memberId | item
0 | 1 | b
1 | 2 | ccc
2 | 2 | aa
3 | 2 | dddd
4 | 3 | dddd

●The raw forward index reduced the item column size by 70% (no padding)
●Benefits from locality because item values are laid out in memberId-sorted order!
●Chunks can be compressed with Snappy (optional)
"tableIndexConfig": {
  "noDictionaryColumns": [
    "itemId"
  ]
}
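And a toy version of the raw forward index layout, matching the offsets above (again illustrative only): values are concatenated, and a header of cumulative end offsets allows random access without a dictionary hop.

import java.util.*;

public class RawForwardIndex {
    public static void main(String[] args) {
        String[] items = {"b", "ccc", "aa", "dddd", "dddd"};
        // Concatenate values and record cumulative end offsets
        StringBuilder chunk = new StringBuilder();
        int[] endOffsets = new int[items.length];
        for (int docId = 0; docId < items.length; docId++) {
            chunk.append(items[docId]);
            endOffsets[docId] = chunk.length();
        }
        System.out.println(Arrays.toString(endOffsets));  // [1, 4, 6, 10, 14]
        // Random access by docId: slice between the previous and current offset
        int docId = 3;
        int start = docId == 0 ? 0 : endOffsets[docId - 1];
        System.out.println(chunk.substring(start, endOffsets[docId]));  // dddd
    }
}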
42. Bottleneck: Processing All Segments
[Diagram: a query "select … where memberId = 101" maps to partition 101 % 3 = 2 (3-partition example); only segments holding partition 2 are queried]
"tableIndexConfig": {
"segmentPartitionConfig": {
"columnPartitionMap": {
"memberId": {
"functionName": "murmur",
"numPartitions": 32
}
}
}
}
●Partitioning data on memberId & server-side segment pruning
●Went from processing ~1000 segments → 30 segments per query
SELECT sum(count) FROM T
WHERE memberId = $memberId
...
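A sketch of the pruning logic (illustrative; per the config above Pinot uses a murmur hash, for which the plain modulo here is a stand-in):

public class SegmentPruning {
    // Map a memberId to its partition; segments whose partition metadata
    // does not match can be skipped entirely for this query.
    static int partitionOf(long memberId, int numPartitions) {
        // Stand-in for murmur(memberId) % numPartitions
        return (int) Math.floorMod(memberId, (long) numPartitions);
    }

    public static void main(String[] args) {
        System.out.println(partitionOf(101, 3));  // 2, as in the diagram above
    }
}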
43. Performance Improvements
Cluster: 25 nodes (15 offline + 10 realtime)

Feature/Performance Improvement | QPS | P99 Latency
Baseline (single machine) | 50 / 3000 | 100ms / does not scale
Raw forward index, data partitioning & pruning | 3000 | 270ms
44. Stage 2. Optimizing Query Routing
45. Bottleneck: Querying All Servers
[Diagram: without replica groups, queries 1 and 2 each fan out to all four servers; with two replica groups (RG1, RG2), each serving a complete copy of the segments, query 1 is routed only to RG1 and query 2 only to RG2]
●Adding more servers doesn't scale past a certain point, because P99 latency is dominated by slow servers (e.g., ones in garbage collection)
●Replica group: a set of servers that serves a complete set of segments for a table
●Replica-group-aware segment assignment & routing gives Pinot horizontal scalability!
"segmentsConfig": {
"replicaGroupStrategyConfig": {
"numInstancesPerPartition": 2
},
"replication": 3
...
}
47. Stage 3. Performance Profiling
48. Bottleneck: Inefficient Code
●Iterations of profiling to identify the hotspots and optimize the code
●Improved an inefficient top-N algorithm on the broker
○Original: push N, pop N
○Better: push until size N; then, if x > min_value, pop min_value and push x
●Removed unnecessary JSON serialization & deserialization
●Removed unnecessary String operations
○String.format(), String.split(), String.join(), … are very expensive!
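A sketch of the improved top-N in Java, using a min-heap so most elements cost only a peek (illustrative, not the broker's actual code):

import java.util.PriorityQueue;

public class TopN {
    // Keep the N largest values seen so far.
    static PriorityQueue<Long> topN(long[] values, int n) {
        PriorityQueue<Long> heap = new PriorityQueue<>(n);  // min-heap
        for (long x : values) {
            if (heap.size() < n) {
                heap.offer(x);              // push until size N
            } else if (x > heap.peek()) {   // only displace the current minimum
                heap.poll();                // pop min_value
                heap.offer(x);              // push x
            }
        }
        return heap;
    }
}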
50. Stage 4. Optimizing Real-time Server Performance
51. Bottleneck: Frequent GCs on Real-time Servers
●Pinot has long used off-heap memory for loading immutable segments (MMAP, direct ByteBuffer)
●Consuming segments used to store consumed data on the JVM heap
●Use off-heap for consumed data to avoid GC pressure
●Performed well at an ingestion rate of ~100k messages/sec
●Now the default setting for all use cases @ LinkedIn
pinot.server.instance.realtime.alloc.offheap = true
(server-side config)
52. Bottleneck: Large Sized Real-time Segments
Raw consumed rows (memberId, itemId, time, count):
1111, a, 2020/09/18, 1
2222, a, 2020/09/18, 1
1111, a, 2020/09/18, 1

Aggregated on the fly:
memberId | itemId | time | count
1111 | a | 2020/09/18 | 2
2222 | b | 2020/09/18 | 1

●While offline segments are pre-aggregated, real-time segments contain too many rows due to high message throughput
●The aggregate metrics feature aggregates data on the fly within consuming segments
"tableIndexConfig": {
"aggregateMetrics": true
...
}
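Conceptually, the aggregate metrics feature does the following on ingest (illustrative Java; Pinot performs this inside the consuming segment):

import java.util.*;

public class AggregateMetrics {
    record Key(long memberId, String itemId, String time) {}

    public static void main(String[] args) {
        // Raw consumed rows from the slide above
        Object[][] consumed = {
            {1111L, "a", "2020/09/18", 1L},
            {2222L, "a", "2020/09/18", 1L},
            {1111L, "a", "2020/09/18", 1L},
        };
        // Rows with identical dimension values collapse into one row,
        // with the metric column summed at ingestion time.
        Map<Key, Long> rows = new LinkedHashMap<>();
        for (Object[] r : consumed) {
            Key k = new Key((Long) r[0], (String) r[1], (String) r[2]);
            rows.merge(k, (Long) r[3], Long::sum);
        }
        rows.forEach((k, count) -> System.out.println(k + " count=" + count));
    }
}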
54. Impression Discounting Use Cases Today
●10+ impression discounting use cases
●50K+ queries per second (50% of entire traffic)
●<100ms 99th-percentile latency
55. Takeaways
●Supporting the impression discounting use cases pushed Pinot to the next level.
○Proved that a high-throughput, low-latency use case can be served by a columnar store!
●Profiling is important
○A small code change can make a huge performance improvement
●Working on Pinot is fun!
○Low-level systems: data storage format, query engine, garbage collection
○Distributed systems: segment assignment & routing, partitioning, replication
60. We have to start gathering information in order to build profiles.
[Diagram: customer journey signals on an automotive site - visit website, view inventory, find model specs, schedule a test drive, "like" a vehicle, explore a specific trim, compare models, visit vehicle/vehicle details pages, customize models and builds, visit compare site, view incentives, find a dealer, get a quote, request a brochure, sign up for updates]
61. Apache Pinot – Key Component of the Architecture
●Real-time OLAP data store
●Distributed system
●Highly scalable
●Supports low-latency analytics
66. Effect of Number of Partitions
[Diagram: Kafka partitions feeding a consuming segment, with a threshold on the consuming segment]
SCALE: expected traffic = 10,000 records/day
Consuming segment threshold = 10,000 records
●Kafka partitions are a means of achieving parallelism.
●For instance, with 10 partitions each partition receives ~1,000 records/day, so a consuming segment needs 10,000 / 1,000 = 10 days to hit its threshold: it stays in memory for 10 days, and we end up with 10 segments.
●The underlying Kafka topic retention has to be adjusted to ensure there is no data loss in any situation.
Real-time Provisioning tool – used to choose the segment size
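A back-of-envelope Java version of the sizing math above (illustrative; the Real-time Provisioning tool automates this analysis):

public class SegmentSizing {
    public static void main(String[] args) {
        long recordsPerDay = 10_000;      // expected traffic across the topic
        int numPartitions = 10;           // Kafka partitions
        long segmentThreshold = 10_000;   // records per consuming segment

        long perPartitionPerDay = recordsPerDay / numPartitions;   // 1,000 records/day
        long daysInMemory = segmentThreshold / perPartitionPerDay; // 10 days to fill
        System.out.println(daysInMemory + " days in memory; "
            + numPartitions + " segments produced per threshold cycle");
    }
}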
67. Tuning Memory Config Parameters & Applying QPS
Problems faced:
●OUT OF MEMORY: as the segment size increased, OOM errors started to appear
●RESPONSE TIMES SHOOTING UP: (1) as QPS increases, (2) as data volume increases
68. Best Practices for Setting Up a Pinot Cluster
●Observability: Prometheus + Grafana used to capture and analyze the JMX metrics for insights; heap size, CPU utilization, RAM used, off-heap usage, etc. all helped.
●Runtime Provisioning Tool: experiment and derive the best segment size based on a sample segment, retention period, etc.
●Memory management: two memory modes are supported for consuming and completed segments - MMAP and heap; based on the Runtime Provisioning Tool's recommendation, this can be configured as off-heap if memory resources are available.
●Traffic & thresholds: time, size, and number of records are the available thresholds; as a recommended practice, time and size can be used hand in hand.
69. Three Steps to Tune the P95 Value
01 PARTITION-AWARE ROUTING: reduced segments queried by n-fold (n = number of partitions on the topic)
02 APPLIED SORTED INDEX: sorted index vs. inverted index
03 REPLICA GROUPS: query routed to a subset of servers, improving scatter-gather; traffic on each server = (total traffic) / (number of replica groups)
70. Math Behind the Pinot Strategies Applied
[Diagram: Pinot brokers receive X QPS; with 2 Kafka partitions (P1, P2) and 3 replicas per partition, segments are spread over 6 Pinot servers, so each server sees X/6 QPS]
71. Pinot Cluster Setup
●Number of partitions in Kafka = 3
●Number of replicas per partition = 3
●Segment size = 100 MB ≈ 10 million records
72. Data volume of 250-300+ million records at a throughput of ~10k TPS
LATENCY: 30ms → 9ms (3x better)