Video and slides synchronized; mp3 and slide download available at http://bit.ly/1yvBv0T.
Neha Narkhede of Kafka fame shares her experience building LinkedIn's powerful and efficient data pipeline infrastructure around Apache Kafka and Apache Samza to process billions of events every day. Filmed at qconsf.com.
Neha Narkhede is a key contributor to Apache Kafka and Apache Samza.
Samza in LinkedIn: How LinkedIn Processes Billions of Events Every Day in Real Time
1. STREAM PROCESSING AT LINKEDIN: APACHE KAFKA & APACHE SAMZA
Processing billions of events every day
2. Watch the video with slide synchronization on InfoQ.com!
http://www.infoq.com/presentations/samza-linkedin-2014
InfoQ.com: News & Community Site
• 750,000 unique visitors/month
• Published in 4 languages (English, Chinese, Japanese, and Brazilian Portuguese)
• Post content from our QCon conferences
• News 15-20 / week
• Articles 3-4 / week
• Presentations (videos) 12-15 / week
• Interviews 2-3 / week
• Books 1 / month
3. Presented at QCon San Francisco
www.qconsf.com
Purpose of QCon
- to empower software development by facilitating the spread of knowledge and innovation
Strategy
- practitioner-driven conference designed for YOU: influencers of change and innovation in your teams
- speakers and topics driving the evolution and innovation
- connecting and catalyzing the influencers and innovators
Highlights
- attended by more than 12,000 delegates since 2007
- held in 9 cities worldwide
4. Neha Narkhede
• Co-founder and Head of Engineering @ Stealth Startup
• Prior to this…
  - Lead, Streams Infrastructure @ LinkedIn (Kafka & Samza)
  - One of the initial authors of Apache Kafka, committer and PMC member
• Reach out at @nehanarkhede
6. The Data Needs Pyramid
[Diagram: two pyramids side by side. Maslow's hierarchy of needs, bottom to top: Physiological → Safety → Love/Belonging → Esteem → Self-actualization. Data needs, by analogy: Data collection → Data processing → Understanding → Automation.]
8. Increase in diversity of data
1980+: Database data (users, products, orders, etc.)
2000+: Events (clicks, impressions, pageviews); application logs (errors, service calls); application metrics (CPU usage, requests/sec)
2010+: IoT sensors
Each source arrives as its own siloed data feed.
9. Explosion in diversity of systems
• Live systems
  - Voldemort
  - Espresso
  - GraphDB
  - Search
  - Samza
• Batch
  - Hadoop
  - Teradata
10. Data integration disaster
[Diagram: a tangle of point-to-point pipelines. Sources: production services, Oracle, Espresso, and Voldemort databases, user tracking events, logs, and operational metrics. Destinations: Hadoop, log search, monitoring, data warehouse, social graph, recommendation engine, security, search, and email. Every source connects directly to every destination.]
11. Centralized service
[Diagram: the same sources and destinations, now decoupled by a single central data pipeline. Production services, Oracle, Espresso, Voldemort, logs, and operational metrics feed the pipeline; Hadoop, log search, monitoring, data warehouse, social graph, recommendation engine, security, search, and email consume from it.]
13. Kafka at 10,000 ft
• Distributed from the ground up
• Persistent
• Multi-subscriber
[Diagram: many producers write to a cluster of brokers, and many consumers read from it.]
14. Key design principles
• Scalability of a file system
  - Hundreds of MB/sec/server throughput
  - Many TBs per server
• Guarantees of a database
  - Messages strictly ordered
  - All data persistent
• Distributed by default
  - Replication model
  - Partitioning model
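These guarantees surface directly in the client API. As a minimal sketch (using the standard Apache Kafka Java producer; the broker address and the "page-views" topic are placeholders for this example), publishing messages with a key routes all of that key's messages to the same partition, where they are strictly ordered and persisted:

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class ProducerSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        // Placeholder broker address; in production, list several brokers.
        props.put("bootstrap.servers", "broker1:9092");
        props.put("key.serializer",
                  "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                  "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Messages with the same key hash to the same partition,
            // so they are strictly ordered relative to each other.
            producer.send(new ProducerRecord<>("page-views", "user-567", "viewed /jobs"));
            producer.send(new ProducerRecord<>("page-views", "user-567", "viewed /feed"));
        }
    }
}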
16. Apache Kafka @ LinkedIn
• 175 TB of in-flight log data per colo
• Low latency: ~1.5 ms
• Replicated to each datacenter
• Tens of thousands of data producers
• Thousands of consumers
• 7 million messages written/sec
• 35 million messages read/sec
• Hadoop integration
17. Logs
The data structure every systems engineer should know
18. The Log
• Ordered
• Append only
• Immutable
[Diagram: a log as a sequence of records at offsets 0 through 12; the 1st record sits at offset 0, and the next record is written at the end.]
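In code, a log is nothing more than an append-only sequence addressed by offset. A toy in-memory sketch in Java (a hypothetical class for illustration, not Kafka's on-disk implementation):

import java.util.ArrayList;
import java.util.List;

/** A toy log: ordered, append-only, records are never modified. */
public class Log<T> {
    private final List<T> records = new ArrayList<>();

    /** Appends a record at the end and returns its offset (0, 1, 2, ...). */
    public synchronized long append(T record) {
        records.add(record);
        return records.size() - 1;
    }

    /** Reads the record at a given offset; existing records never change. */
    public synchronized T read(long offset) {
        return records.get((int) offset);
    }
}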
35. M/R Operation Primitives
• Filter: keep records matching some condition
• Map: record = f(record)
• Join: two or more datasets by key
• Group: records with the same key
• Aggregate: f(records within the same group)
• Pipe: job 1's output => job 2's input
36. M/R Operation Primitives on streams
• Filter: keep records matching some condition
• Map: record = f(record)
• Join: two or more datasets by key
• Group: records with the same key
• Aggregate: f(records within the same group)
• Pipe: job 1's output => job 2's input
Join, Group, and Aggregate require state maintenance.
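To make the state requirement concrete, here is a hedged sketch of a grouped count using Samza's low-level StreamTask API with a local key-value store (the store name "counts" and the use of the message key as the grouping key are assumptions for this example; the store itself is declared in the job's configuration):

import org.apache.samza.config.Config;
import org.apache.samza.storage.kv.KeyValueStore;
import org.apache.samza.system.IncomingMessageEnvelope;
import org.apache.samza.task.InitableTask;
import org.apache.samza.task.MessageCollector;
import org.apache.samza.task.StreamTask;
import org.apache.samza.task.TaskContext;
import org.apache.samza.task.TaskCoordinator;

public class GroupCountTask implements StreamTask, InitableTask {
    private KeyValueStore<String, Integer> counts;

    @Override
    @SuppressWarnings("unchecked")
    public void init(Config config, TaskContext context) {
        // "counts" is a local LevelDB/RocksDB store owned by this task.
        counts = (KeyValueStore<String, Integer>) context.getStore("counts");
    }

    @Override
    public void process(IncomingMessageEnvelope envelope,
                        MessageCollector collector,
                        TaskCoordinator coordinator) {
        // Group by the message key and maintain a running count per group.
        String key = (String) envelope.getKey();
        Integer current = counts.get(key);
        counts.put(key, current == null ? 1 : current + 1);
    }
}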
38. Example: Newsfeed
Status update log (input): User 567 posted "Hello World"; User 989 posted "Blah Blah"; User ... posted "..."
Fan out messages to followers, looking up follower lists (567 -> [123, 679, 789, ...]; 999 -> [156, 343, ...]) in an external connection DB.
Push notification log (output): Refresh user 123's newsfeed; Refresh user 679's newsfeed; Refresh user ...'s newsfeed
39. Local state vs Remote state: Remote
[Diagram: two Samza tasks (partitions 0 and 1), each processing 100-500K msg/sec/node, query a remote store on disk (ex: Cassandra, MongoDB, etc.) that serves only 1-5K queries/sec.]
❌ Performance
❌ Isolation
❌ Limited APIs
40. Local state: Bring data closer to computation
[Diagram: each Samza task (partitions 0 and 1) keeps its state in its own local LevelDB/RocksDB store.]
41. Local state: Bring data closer to computation
[Diagram: the same per-task local LevelDB/RocksDB stores, with every write also recorded to a change log stream on disk.]
42. Example Revisited: Newsfeed
Status update log (input): User 567 posted "Hello World"; User 989 posted "Blah Blah"; User ... posted "..."
New connection log (input): User 123 followed 567; User 890 followed 234; User ... followed ...
Fan out messages to followers, using follower lists (567 -> [123, 679, 789, ...]; 999 -> [156, 343, ...]) maintained in local state from the connection log.
Push notification log (output): Refresh user 123's newsfeed; Refresh user 679's newsfeed; Refresh user ...'s newsfeed
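A hedged sketch of this fan-out job with the same Samza APIs (the stream names "new-connections" and "push-notifications", the "followers" store, the message layouts, and the comma-separated follower encoding are all assumptions for illustration):

import org.apache.samza.config.Config;
import org.apache.samza.storage.kv.KeyValueStore;
import org.apache.samza.system.IncomingMessageEnvelope;
import org.apache.samza.system.OutgoingMessageEnvelope;
import org.apache.samza.system.SystemStream;
import org.apache.samza.task.InitableTask;
import org.apache.samza.task.MessageCollector;
import org.apache.samza.task.StreamTask;
import org.apache.samza.task.TaskContext;
import org.apache.samza.task.TaskCoordinator;

public class NewsfeedFanOutTask implements StreamTask, InitableTask {
    private static final SystemStream PUSH =
        new SystemStream("kafka", "push-notifications");

    // Followed user id -> comma-separated follower ids (toy encoding).
    private KeyValueStore<String, String> followers;

    @Override
    @SuppressWarnings("unchecked")
    public void init(Config config, TaskContext context) {
        followers = (KeyValueStore<String, String>) context.getStore("followers");
    }

    @Override
    public void process(IncomingMessageEnvelope envelope,
                        MessageCollector collector,
                        TaskCoordinator coordinator) {
        String stream = envelope.getSystemStreamPartition().getStream();
        String key = (String) envelope.getKey();

        if ("new-connections".equals(stream)) {
            // Assumed layout: key = followed user, message = new follower id.
            String follower = (String) envelope.getMessage();
            String existing = followers.get(key);
            followers.put(key, existing == null ? follower : existing + "," + follower);
        } else {
            // Status update: key = posting user, message = post text.
            // Followers are looked up locally; no remote DB round trip.
            String list = followers.get(key);
            if (list != null) {
                for (String follower : list.split(",")) {
                    collector.send(new OutgoingMessageEnvelope(
                        PUSH, follower, envelope.getMessage()));
                }
            }
        }
    }
}

This only works because both input streams are partitioned by the followed user's id, so each task owns exactly the follower state needed for the status updates it sees.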
44. Fault tolerance in Samza
[Diagram: each task's local LevelDB/RocksDB store is continuously backed up to a durable change log; after a failure, the store is rebuilt by replaying that log.]
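A hedged sketch of the job configuration that turns this on (Samza's properties-based configuration; the "followers" store name, the serde choices, and the changelog topic name are assumptions carried over from the example above):

# Local RocksDB store, rebuilt from the changelog after a failure.
stores.followers.factory=org.apache.samza.storage.kv.RocksDbKeyValueStorageEngineFactory
stores.followers.key.serde=string
stores.followers.msg.serde=string
# Every store write is also sent to this Kafka topic (the durable change log).
stores.followers.changelog=kafka.followers-changelog

serializers.registry.string.class=org.apache.samza.serializers.StringSerdeFactory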
45. Slow jobs
[Diagram: Logs A and B feed Job 1, which writes Log C; Log C feeds Job 2, which writes Logs D and E. What happens when a downstream job falls behind?]
❌ Drop data
❌ Backpressure
❌ Queue in memory
✅ Queue on disk (Kafka)
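Because Kafka queues on disk, a slow consumer simply falls behind at its own offset; producers and upstream jobs are never blocked. A minimal sketch with the standard Kafka Java consumer (topic and group names are placeholders; handleRecord is a hypothetical stand-in for the slow processing step):

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class SlowJob {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker1:9092");
        props.put("group.id", "job-2");
        props.put("key.deserializer",
                  "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer",
                  "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("log-c"));
            while (true) {
                // Read at the job's own pace; unread messages stay on the
                // brokers' disks instead of back-pressuring producers.
                ConsumerRecords<String, String> records =
                        consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> record : records) {
                    handleRecord(record);
                }
            }
        }
    }

    static void handleRecord(ConsumerRecord<String, String> record) {
        // Hypothetical placeholder for Job 2's actual (slow) work.
    }
}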
46. Summary
• Real-time data integration is crucial for the success and adoption of stream processing
• Logs form the basis for real-time data integration
• Stream processing = f(logs)
• Samza is designed from the ground up for scalability and provides fault-tolerant, persistent state