SlideShare una empresa de Scribd logo
1 de 48
STREAM PROCESSING AT 
LINKEDIN: APACHE KAFKA & 
APACHE SAMZA 
Processing billions of events every day
Watch the video with slide 
synchronization on InfoQ.com! 
http://www.infoq.com/presentations 
/samza-linkedin-2014 
InfoQ.com: News & Community Site 
• 750,000 unique visitors/month 
• Published in 4 languages (English, Chinese, Japanese and Brazilian 
Portuguese) 
• Post content from our QCon conferences 
• News 15-20 / week 
• Articles 3-4 / week 
• Presentations (videos) 12-15 / week 
• Interviews 2-3 / week 
• Books 1 / month
Presented at QCon San Francisco 
www.qconsf.com 
Purpose of QCon 
- to empower software development by facilitating the spread of 
knowledge and innovation 
Strategy 
- practitioner-driven conference designed for YOU: influencers of 
change and innovation in your teams 
- speakers and topics driving the evolution and innovation 
- connecting and catalyzing the influencers and innovators 
Highlights 
- attended by more than 12,000 delegates since 2007 
- held in 9 cities worldwide
Neha Narkhede 
¨ Co-founder and Head of Engineering @ Stealth 
Startup 
¨ Prior to this… 
¤ Lead, Streams Infrastructure @ LinkedIn (Kafka & 
Samza) 
¤ One of the initial authors of Apache Kafka, committer 
and PMC member 
¨ Reach out at @nehanarkhede
Agenda 
¨ Real-time Data Integration 
¨ Introduction to Logs & Apache Kafka 
¨ Logs & Stream processing 
¨ Apache Samza 
¨ Stateful stream processing
The Data Needs Pyramid 
Self 
actualization 
Esteem 
Love/Belonging 
Safety 
Physiological 
Maslow's hierarchy of needs 
Automation 
Understanding 
Data processing 
Data collection 
Data needs
Agenda 
¨ Real-time Data Integration 
¨ Introduction to Logs & Apache Kafka 
¨ Logs & Stream processing 
¨ Apache Samza 
¨ Stateful stream processing
Increase in diversity of data 
1980+ 
2000+ 
Events (clicks, impressions, pageviews) 
Application logs (errors, service calls) 
Application metrics (CPU usage, requests/sec) 
2010+ 
Siloed 
data 
feeds 
Database data (users, products, orders etc) 
IoT sensors
Explosion in diversity of systems 
¨ Live Systems 
¤ Voldemort 
¤ Espresso 
¤ GraphDB 
¤ Search 
¤ Samza 
¨ Batch 
¤ Hadoop 
¤ Teradata
Data integration disaster 
Oracle User Tracking 
Oracle 
Oracle 
Espresso 
Espresso 
Hadoop 
Log 
Search 
Monitoring 
Data 
Warehous 
e 
Social 
Graph 
Rec. 
Engine 
Security ... 
Search Email 
Voldemort 
Voldemort 
Voldemort 
Espresso 
Logs 
Operational 
Metrics 
Production Services
Centralized service 
Oracle User Tracking 
Oracle 
Oracle 
Espresso 
Espresso 
Hadoop 
Voldemort 
Voldemort 
Log 
Search 
Monitorin 
g 
Data 
Warehous 
e 
Social 
Graph 
Rec 
Engine & 
Life 
Security ... 
Search Email 
Voldemort 
Espresso 
Logs 
Operational 
Metrics 
Production Services 
Data Pipeline
Agenda 
¨ Real-time Data Integration 
¨ Introduction to Logs & 
Apache Kafka 
¨ Logs & Stream processing 
¨ Apache Samza 
¨ Stateful stream processing
Kafka at 10,000 ft 
¨ Distributed from 
ground up 
¨ Persistent 
¨ Multi-subscriber 
Cluster of brokers 
Producer 
Producer 
Producer 
Producer 
Producer 
Producer 
Producer 
Consumer 
Producer 
Consumer 
Producer 
Consumer
Key design principles 
¨ Scalability of a file system 
¤ Hundreds of MB/sec/server throughput 
¤ Many TBs per server 
¨ Guarantees of a database 
¤ Messages strictly ordered 
¤ All data persistent 
¨ Distributed by default 
¤ Replication model 
¤ Partitioning model
Kafka adoption
Apache Kafka @ LinkedIn 
¨ 175 TB of in-flight log data per colo 
¨ Low-latency: ~1.5ms 
¨ Replicated to each datacenter 
¨ Tens of thousands of data producers 
¨ Thousands of consumers 
¨ 7 million messages written/sec 
¨ 35 million messages read/sec 
¨ Hadoop integration
Logs 
The data structure every systems engineer should 
know
The Log 
¨ Ordered 
¨ Append only 
¨ Immutable 
1st record next record written 
0 1 2 3 4 5 6 7 8 9 10 11 12
The Log: Partitioning 
Partition 0 0 1 2 3 4 5 6 7 8 9 10 11 12 
0 1 2 3 4 5 6 7 8 9 
0 1 2 3 4 5 6 7 8 9 10 11 12 
Partition 1 
Partition 2 13 14 15 16
Logs: pub/sub done right 
Data source 
writes 
0 1 2 3 4 5 6 7 8 9 10 11 12 
Destination 
system A 
(time = 7) 
reads reads 
Destination 
system B 
(time = 11)
Logs for data integration 
User updates 
profile with 
new job 
Newsfeed 
KAFKA 
Search Hadoop Standardization 
engine
Agenda 
¨ Real-time Data Integration 
¨ Introduction to Logs & Apache Kafka 
¨ Logs & Stream processing 
¨ Apache Samza 
¨ Stateful stream processing
Stream processing = f(log) 
Log A Job 1 Log B
Stream processing = f(log) 
Log A Job 1 
Log B Log C 
Job 2 
Log D Log E
Apache Samza at LinkedIn 
User updates 
profile with 
new job 
Newsfeed 
KAFKA 
Search Hadoop Standardization 
engine
Latency spectrum of data systems 
RPC 
Synchronous (milliseconds) 
Batch (Hours) 
Latency 
Asynchronous processing 
(seconds to minutes)
Agenda 
¨ Real-time Data Integration 
¨ Introduction to Logs & Apache Kafka 
¨ Logs & Stream processing 
¨ Apache Samza 
¨ Stateful stream processing
Samza API 
public interface StreamTask { 
void process (IncomingMessageEnvelope envelope, 
MessageCollector collector, 
TaskCoordinator coordinator); 
} 
getKey(), getMsg() 
sendMsg(topic, key, value) 
commit(), shutdown()
Samza Architecture (Logical view) 
Log A 
partition 0 partition 1 partition 2 
Task 1 Task 2 Task 3 
partition 0 partition 1 
Log B
Samza Architecture (Logical view) 
Log A 
partition 0 partition 1 partition 2 
Samza container 1 Samza container 2 
Task 1 Task 2 Task 3 
partition 0 partition 1 
Log B
Samza Architecture (Physical view) 
Samza container 1 Samza container 2 
Host 1 Host 2
Samza Architecture (Physical view) 
Node manager Node manager 
Samza container 1 Samza container 2 
Host 1 Host 2 
Samza 
YARN AM
Samza Architecture (Physical view) 
Node manager Node manager 
Samza container 1 Samza container 2 
Host 1 Host 2 
Samza 
YARN AM 
Kafka Kafka
Samza Architecture: Equivalence to 
Map Reduce 
Node manager Node manager 
Map Reduce Map Reduce YARN AM 
HDFS HDFS 
Host 1 Host 2
M/R Operation Primitives 
¨ Filter records matching some condition 
¨ Map record = f(record) 
¨ Join Two/more datasets by key 
¨ Group records with same key 
¨ Aggregate f(records within the same group) 
¨ Pipe job 1’s output => job 2’s input
M/R Operation Primitives on streams 
¨ Filter records matching some condition 
¨ Map record = f(record) 
¨ Join Two/more datasets by key 
¨ Group records with same key 
¨ Aggregate f(records within the same group) 
¨ Pipe job 1’s output => job 2’s input 
Requires state 
maintenance
Agenda 
¨ Real-time Data Integration 
¨ Introduction to Logs & Apache Kafka 
¨ Logs & Stream processing 
¨ Apache Samza 
¨ Stateful stream processing
Example: Newsfeed 
User ... posted "..." 
User 567 posted "Hello World" 
Status update log 
Fan out 
messages to 
followers 
Push notification log 
567 -> [123, 679, 789, ...] 
999 -> [156, 343, ... ] 
User 989 posted "Blah Blah" 
External connection DB 
Refresh user 123's newsfeed 
Refresh user 679's newsfeed 
Refresh user ...'s newsfeed
Local state vs Remote state: Remote 
100-500K msg/sec/node 100-500K msg/sec/node 
Disk 
Remote state 
1-5K queries/sec ?? 
ex: Cassandra, MongoDB, etc 
Samza task 
partition 0 
Samza task 
partition 1 
❌ Performance 
❌ Isolation 
❌ Limited APIs
Local state: Bring data closer to 
computation 
Local 
Samza task 
partition 0 
LevelDB/RocksDB 
Samza task 
partition 1 
Local 
LevelDB/RocksDB
Local state: Bring data closer to 
computation 
Local 
Samza task 
partition 0 
LevelDB/RocksDB 
Samza task 
partition 1 
Local 
LevelDB/RocksDB 
Disk 
Change log stream
Example Revisited: Newsfeed 
User ... posted "..." 
User 567 posted "Hello World" 
User ... followed ... 
User 123 followed 567 
User 890 followed 234 
Status update log New connection log 
Fan out 
messages to 
followers 
Push notification log 
567 -> [123, 679, 789, ...] 
999 -> [156, 343, ... ] 
User 989 posted "Blah Blah" 
Refresh user 123's newsfeed 
Refresh user 679's newsfeed 
Refresh user ...'s newsfeed
Fault tolerance? 
Node manager Node manager 
Samza container 1 Samza container 2 
Host 1 Host 2 
Samza 
YARN AM 
Kafka Kafka
Fault tolerance in Samza 
Local 
Samza task 
partition 0 
LevelDB/RocksDB 
Samza task 
partition 1 
Local 
LevelDB/RocksDB 
Durable change log
Slow jobs 
Log A Job 1 
Log B Log C 
Job 2 
Log D Log E 
❌ Drop data 
❌ Backpressure 
❌ Queue 
❌ In memory 
✅ On disk (KAFKA)
Summary 
¨ Real time data integration is crucial for the success 
and adoption of stream processing 
¨ Logs form the basis for real time data integration 
¨ Stream processing = f(logs) 
¨ Samza is designed from ground-up for scalability 
and provides fault-tolerant, persistent state
Thank you! 
¨ The Log 
¤ http://bit.ly/the_log 
¨ Apache Kafka 
¤ http://kafka.apache.org 
¨ Apache Samza 
¤ http://samza.incubator.apache.org 
¨ Me 
¤ @nehanarkhede 
¤ http://www.linkedin.com/in/nehanarkhede
Watch the video with slide synchronization on 
InfoQ.com! 
http://www.infoq.com/presentations/samza-linkedin- 
2014

Más contenido relacionado

Destacado

A Random Walk Through Search Research
A Random Walk Through Search ResearchA Random Walk Through Search Research
A Random Walk Through Search ResearchNick Watkins
 
Samza: Real-time Stream Processing at LinkedIn
Samza: Real-time Stream Processing at LinkedInSamza: Real-time Stream Processing at LinkedIn
Samza: Real-time Stream Processing at LinkedInC4Media
 
LinkedIn Data Products
LinkedIn Data ProductsLinkedIn Data Products
LinkedIn Data ProductsVitaly Gordon
 
Developing Data Products
Developing Data ProductsDeveloping Data Products
Developing Data ProductsPeter Skomoroch
 
Introduction to Streaming Analytics
Introduction to Streaming AnalyticsIntroduction to Streaming Analytics
Introduction to Streaming AnalyticsGuido Schmutz
 
Scalable complex event processing on samza @UBER
Scalable complex event processing on samza @UBERScalable complex event processing on samza @UBER
Scalable complex event processing on samza @UBERShuyi Chen
 
Experience economy
Experience economyExperience economy
Experience economyTrieu Nguyen
 
ML and Data Science at Uber - GITPro talk 2017
ML and Data Science at Uber - GITPro talk 2017ML and Data Science at Uber - GITPro talk 2017
ML and Data Science at Uber - GITPro talk 2017Sudhir Tonse
 
Recomendation system: Community Detection Based Recomendation System using Hy...
Recomendation system: Community Detection Based Recomendation System using Hy...Recomendation system: Community Detection Based Recomendation System using Hy...
Recomendation system: Community Detection Based Recomendation System using Hy...Rajul Kukreja
 
Customer Event Hub - the modern Customer 360° view
Customer Event Hub - the modern Customer 360° viewCustomer Event Hub - the modern Customer 360° view
Customer Event Hub - the modern Customer 360° viewGuido Schmutz
 
Real-time Streaming Analytics: Business Value, Use Cases and Architectural Co...
Real-time Streaming Analytics: Business Value, Use Cases and Architectural Co...Real-time Streaming Analytics: Business Value, Use Cases and Architectural Co...
Real-time Streaming Analytics: Business Value, Use Cases and Architectural Co...Impetus Technologies
 
Graph database super star
Graph database super starGraph database super star
Graph database super starandres_taylor
 
Stream Computing & Analytics at Uber
Stream Computing & Analytics at UberStream Computing & Analytics at Uber
Stream Computing & Analytics at UberSudhir Tonse
 

Destacado (13)

A Random Walk Through Search Research
A Random Walk Through Search ResearchA Random Walk Through Search Research
A Random Walk Through Search Research
 
Samza: Real-time Stream Processing at LinkedIn
Samza: Real-time Stream Processing at LinkedInSamza: Real-time Stream Processing at LinkedIn
Samza: Real-time Stream Processing at LinkedIn
 
LinkedIn Data Products
LinkedIn Data ProductsLinkedIn Data Products
LinkedIn Data Products
 
Developing Data Products
Developing Data ProductsDeveloping Data Products
Developing Data Products
 
Introduction to Streaming Analytics
Introduction to Streaming AnalyticsIntroduction to Streaming Analytics
Introduction to Streaming Analytics
 
Scalable complex event processing on samza @UBER
Scalable complex event processing on samza @UBERScalable complex event processing on samza @UBER
Scalable complex event processing on samza @UBER
 
Experience economy
Experience economyExperience economy
Experience economy
 
ML and Data Science at Uber - GITPro talk 2017
ML and Data Science at Uber - GITPro talk 2017ML and Data Science at Uber - GITPro talk 2017
ML and Data Science at Uber - GITPro talk 2017
 
Recomendation system: Community Detection Based Recomendation System using Hy...
Recomendation system: Community Detection Based Recomendation System using Hy...Recomendation system: Community Detection Based Recomendation System using Hy...
Recomendation system: Community Detection Based Recomendation System using Hy...
 
Customer Event Hub - the modern Customer 360° view
Customer Event Hub - the modern Customer 360° viewCustomer Event Hub - the modern Customer 360° view
Customer Event Hub - the modern Customer 360° view
 
Real-time Streaming Analytics: Business Value, Use Cases and Architectural Co...
Real-time Streaming Analytics: Business Value, Use Cases and Architectural Co...Real-time Streaming Analytics: Business Value, Use Cases and Architectural Co...
Real-time Streaming Analytics: Business Value, Use Cases and Architectural Co...
 
Graph database super star
Graph database super starGraph database super star
Graph database super star
 
Stream Computing & Analytics at Uber
Stream Computing & Analytics at UberStream Computing & Analytics at Uber
Stream Computing & Analytics at Uber
 

Más de C4Media

Streaming a Million Likes/Second: Real-Time Interactions on Live Video
Streaming a Million Likes/Second: Real-Time Interactions on Live VideoStreaming a Million Likes/Second: Real-Time Interactions on Live Video
Streaming a Million Likes/Second: Real-Time Interactions on Live VideoC4Media
 
Next Generation Client APIs in Envoy Mobile
Next Generation Client APIs in Envoy MobileNext Generation Client APIs in Envoy Mobile
Next Generation Client APIs in Envoy MobileC4Media
 
Software Teams and Teamwork Trends Report Q1 2020
Software Teams and Teamwork Trends Report Q1 2020Software Teams and Teamwork Trends Report Q1 2020
Software Teams and Teamwork Trends Report Q1 2020C4Media
 
Understand the Trade-offs Using Compilers for Java Applications
Understand the Trade-offs Using Compilers for Java ApplicationsUnderstand the Trade-offs Using Compilers for Java Applications
Understand the Trade-offs Using Compilers for Java ApplicationsC4Media
 
Kafka Needs No Keeper
Kafka Needs No KeeperKafka Needs No Keeper
Kafka Needs No KeeperC4Media
 
High Performing Teams Act Like Owners
High Performing Teams Act Like OwnersHigh Performing Teams Act Like Owners
High Performing Teams Act Like OwnersC4Media
 
Does Java Need Inline Types? What Project Valhalla Can Bring to Java
Does Java Need Inline Types? What Project Valhalla Can Bring to JavaDoes Java Need Inline Types? What Project Valhalla Can Bring to Java
Does Java Need Inline Types? What Project Valhalla Can Bring to JavaC4Media
 
Service Meshes- The Ultimate Guide
Service Meshes- The Ultimate GuideService Meshes- The Ultimate Guide
Service Meshes- The Ultimate GuideC4Media
 
Shifting Left with Cloud Native CI/CD
Shifting Left with Cloud Native CI/CDShifting Left with Cloud Native CI/CD
Shifting Left with Cloud Native CI/CDC4Media
 
CI/CD for Machine Learning
CI/CD for Machine LearningCI/CD for Machine Learning
CI/CD for Machine LearningC4Media
 
Fault Tolerance at Speed
Fault Tolerance at SpeedFault Tolerance at Speed
Fault Tolerance at SpeedC4Media
 
Architectures That Scale Deep - Regaining Control in Deep Systems
Architectures That Scale Deep - Regaining Control in Deep SystemsArchitectures That Scale Deep - Regaining Control in Deep Systems
Architectures That Scale Deep - Regaining Control in Deep SystemsC4Media
 
ML in the Browser: Interactive Experiences with Tensorflow.js
ML in the Browser: Interactive Experiences with Tensorflow.jsML in the Browser: Interactive Experiences with Tensorflow.js
ML in the Browser: Interactive Experiences with Tensorflow.jsC4Media
 
Build Your Own WebAssembly Compiler
Build Your Own WebAssembly CompilerBuild Your Own WebAssembly Compiler
Build Your Own WebAssembly CompilerC4Media
 
User & Device Identity for Microservices @ Netflix Scale
User & Device Identity for Microservices @ Netflix ScaleUser & Device Identity for Microservices @ Netflix Scale
User & Device Identity for Microservices @ Netflix ScaleC4Media
 
Scaling Patterns for Netflix's Edge
Scaling Patterns for Netflix's EdgeScaling Patterns for Netflix's Edge
Scaling Patterns for Netflix's EdgeC4Media
 
Make Your Electron App Feel at Home Everywhere
Make Your Electron App Feel at Home EverywhereMake Your Electron App Feel at Home Everywhere
Make Your Electron App Feel at Home EverywhereC4Media
 
The Talk You've Been Await-ing For
The Talk You've Been Await-ing ForThe Talk You've Been Await-ing For
The Talk You've Been Await-ing ForC4Media
 
Future of Data Engineering
Future of Data EngineeringFuture of Data Engineering
Future of Data EngineeringC4Media
 
Automated Testing for Terraform, Docker, Packer, Kubernetes, and More
Automated Testing for Terraform, Docker, Packer, Kubernetes, and MoreAutomated Testing for Terraform, Docker, Packer, Kubernetes, and More
Automated Testing for Terraform, Docker, Packer, Kubernetes, and MoreC4Media
 

Más de C4Media (20)

Streaming a Million Likes/Second: Real-Time Interactions on Live Video
Streaming a Million Likes/Second: Real-Time Interactions on Live VideoStreaming a Million Likes/Second: Real-Time Interactions on Live Video
Streaming a Million Likes/Second: Real-Time Interactions on Live Video
 
Next Generation Client APIs in Envoy Mobile
Next Generation Client APIs in Envoy MobileNext Generation Client APIs in Envoy Mobile
Next Generation Client APIs in Envoy Mobile
 
Software Teams and Teamwork Trends Report Q1 2020
Software Teams and Teamwork Trends Report Q1 2020Software Teams and Teamwork Trends Report Q1 2020
Software Teams and Teamwork Trends Report Q1 2020
 
Understand the Trade-offs Using Compilers for Java Applications
Understand the Trade-offs Using Compilers for Java ApplicationsUnderstand the Trade-offs Using Compilers for Java Applications
Understand the Trade-offs Using Compilers for Java Applications
 
Kafka Needs No Keeper
Kafka Needs No KeeperKafka Needs No Keeper
Kafka Needs No Keeper
 
High Performing Teams Act Like Owners
High Performing Teams Act Like OwnersHigh Performing Teams Act Like Owners
High Performing Teams Act Like Owners
 
Does Java Need Inline Types? What Project Valhalla Can Bring to Java
Does Java Need Inline Types? What Project Valhalla Can Bring to JavaDoes Java Need Inline Types? What Project Valhalla Can Bring to Java
Does Java Need Inline Types? What Project Valhalla Can Bring to Java
 
Service Meshes- The Ultimate Guide
Service Meshes- The Ultimate GuideService Meshes- The Ultimate Guide
Service Meshes- The Ultimate Guide
 
Shifting Left with Cloud Native CI/CD
Shifting Left with Cloud Native CI/CDShifting Left with Cloud Native CI/CD
Shifting Left with Cloud Native CI/CD
 
CI/CD for Machine Learning
CI/CD for Machine LearningCI/CD for Machine Learning
CI/CD for Machine Learning
 
Fault Tolerance at Speed
Fault Tolerance at SpeedFault Tolerance at Speed
Fault Tolerance at Speed
 
Architectures That Scale Deep - Regaining Control in Deep Systems
Architectures That Scale Deep - Regaining Control in Deep SystemsArchitectures That Scale Deep - Regaining Control in Deep Systems
Architectures That Scale Deep - Regaining Control in Deep Systems
 
ML in the Browser: Interactive Experiences with Tensorflow.js
ML in the Browser: Interactive Experiences with Tensorflow.jsML in the Browser: Interactive Experiences with Tensorflow.js
ML in the Browser: Interactive Experiences with Tensorflow.js
 
Build Your Own WebAssembly Compiler
Build Your Own WebAssembly CompilerBuild Your Own WebAssembly Compiler
Build Your Own WebAssembly Compiler
 
User & Device Identity for Microservices @ Netflix Scale
User & Device Identity for Microservices @ Netflix ScaleUser & Device Identity for Microservices @ Netflix Scale
User & Device Identity for Microservices @ Netflix Scale
 
Scaling Patterns for Netflix's Edge
Scaling Patterns for Netflix's EdgeScaling Patterns for Netflix's Edge
Scaling Patterns for Netflix's Edge
 
Make Your Electron App Feel at Home Everywhere
Make Your Electron App Feel at Home EverywhereMake Your Electron App Feel at Home Everywhere
Make Your Electron App Feel at Home Everywhere
 
The Talk You've Been Await-ing For
The Talk You've Been Await-ing ForThe Talk You've Been Await-ing For
The Talk You've Been Await-ing For
 
Future of Data Engineering
Future of Data EngineeringFuture of Data Engineering
Future of Data Engineering
 
Automated Testing for Terraform, Docker, Packer, Kubernetes, and More
Automated Testing for Terraform, Docker, Packer, Kubernetes, and MoreAutomated Testing for Terraform, Docker, Packer, Kubernetes, and More
Automated Testing for Terraform, Docker, Packer, Kubernetes, and More
 

Último

Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfsudhanshuwaghmare1
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Scriptwesley chun
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)wesley chun
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024Rafal Los
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...apidays
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Enterprise Knowledge
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking MenDelhi Call girls
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationRadu Cotescu
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsMaria Levchenko
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonAnna Loughnan Colquhoun
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024The Digital Insurer
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024The Digital Insurer
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsJoaquim Jorge
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...Martijn de Jong
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slidevu2urc
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityPrincipled Technologies
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationSafe Software
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdfhans926745
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking MenDelhi Call girls
 

Último (20)

Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 

Samza in LinkedIn: How LinkedIn Processes Billions of Events Everyday in Real-time

  • 1. STREAM PROCESSING AT LINKEDIN: APACHE KAFKA & APACHE SAMZA Processing billions of events every day
  • 2. Watch the video with slide synchronization on InfoQ.com! http://www.infoq.com/presentations /samza-linkedin-2014 InfoQ.com: News & Community Site • 750,000 unique visitors/month • Published in 4 languages (English, Chinese, Japanese and Brazilian Portuguese) • Post content from our QCon conferences • News 15-20 / week • Articles 3-4 / week • Presentations (videos) 12-15 / week • Interviews 2-3 / week • Books 1 / month
  • 3. Presented at QCon San Francisco www.qconsf.com Purpose of QCon - to empower software development by facilitating the spread of knowledge and innovation Strategy - practitioner-driven conference designed for YOU: influencers of change and innovation in your teams - speakers and topics driving the evolution and innovation - connecting and catalyzing the influencers and innovators Highlights - attended by more than 12,000 delegates since 2007 - held in 9 cities worldwide
  • 4. Neha Narkhede ¨ Co-founder and Head of Engineering @ Stealth Startup ¨ Prior to this… ¤ Lead, Streams Infrastructure @ LinkedIn (Kafka & Samza) ¤ One of the initial authors of Apache Kafka, committer and PMC member ¨ Reach out at @nehanarkhede
  • 5. Agenda ¨ Real-time Data Integration ¨ Introduction to Logs & Apache Kafka ¨ Logs & Stream processing ¨ Apache Samza ¨ Stateful stream processing
  • 6. The Data Needs Pyramid Self actualization Esteem Love/Belonging Safety Physiological Maslow's hierarchy of needs Automation Understanding Data processing Data collection Data needs
  • 7. Agenda ¨ Real-time Data Integration ¨ Introduction to Logs & Apache Kafka ¨ Logs & Stream processing ¨ Apache Samza ¨ Stateful stream processing
  • 8. Increase in diversity of data 1980+ 2000+ Events (clicks, impressions, pageviews) Application logs (errors, service calls) Application metrics (CPU usage, requests/sec) 2010+ Siloed data feeds Database data (users, products, orders etc) IoT sensors
  • 9. Explosion in diversity of systems ¨ Live Systems ¤ Voldemort ¤ Espresso ¤ GraphDB ¤ Search ¤ Samza ¨ Batch ¤ Hadoop ¤ Teradata
  • 10. Data integration disaster Oracle User Tracking Oracle Oracle Espresso Espresso Hadoop Log Search Monitoring Data Warehous e Social Graph Rec. Engine Security ... Search Email Voldemort Voldemort Voldemort Espresso Logs Operational Metrics Production Services
  • 11. Centralized service Oracle User Tracking Oracle Oracle Espresso Espresso Hadoop Voldemort Voldemort Log Search Monitorin g Data Warehous e Social Graph Rec Engine & Life Security ... Search Email Voldemort Espresso Logs Operational Metrics Production Services Data Pipeline
  • 12. Agenda ¨ Real-time Data Integration ¨ Introduction to Logs & Apache Kafka ¨ Logs & Stream processing ¨ Apache Samza ¨ Stateful stream processing
  • 13. Kafka at 10,000 ft ¨ Distributed from ground up ¨ Persistent ¨ Multi-subscriber Cluster of brokers Producer Producer Producer Producer Producer Producer Producer Consumer Producer Consumer Producer Consumer
  • 14. Key design principles ¨ Scalability of a file system ¤ Hundreds of MB/sec/server throughput ¤ Many TBs per server ¨ Guarantees of a database ¤ Messages strictly ordered ¤ All data persistent ¨ Distributed by default ¤ Replication model ¤ Partitioning model
  • 16. Apache Kafka @ LinkedIn ¨ 175 TB of in-flight log data per colo ¨ Low-latency: ~1.5ms ¨ Replicated to each datacenter ¨ Tens of thousands of data producers ¨ Thousands of consumers ¨ 7 million messages written/sec ¨ 35 million messages read/sec ¨ Hadoop integration
  • 17. Logs The data structure every systems engineer should know
  • 18. The Log ¨ Ordered ¨ Append only ¨ Immutable 1st record next record written 0 1 2 3 4 5 6 7 8 9 10 11 12
  • 19. The Log: Partitioning Partition 0 0 1 2 3 4 5 6 7 8 9 10 11 12 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 10 11 12 Partition 1 Partition 2 13 14 15 16
  • 20. Logs: pub/sub done right Data source writes 0 1 2 3 4 5 6 7 8 9 10 11 12 Destination system A (time = 7) reads reads Destination system B (time = 11)
  • 21. Logs for data integration User updates profile with new job Newsfeed KAFKA Search Hadoop Standardization engine
  • 22. Agenda ¨ Real-time Data Integration ¨ Introduction to Logs & Apache Kafka ¨ Logs & Stream processing ¨ Apache Samza ¨ Stateful stream processing
  • 23. Stream processing = f(log) Log A Job 1 Log B
  • 24. Stream processing = f(log) Log A Job 1 Log B Log C Job 2 Log D Log E
  • 25. Apache Samza at LinkedIn User updates profile with new job Newsfeed KAFKA Search Hadoop Standardization engine
  • 26. Latency spectrum of data systems RPC Synchronous (milliseconds) Batch (Hours) Latency Asynchronous processing (seconds to minutes)
  • 27. Agenda ¨ Real-time Data Integration ¨ Introduction to Logs & Apache Kafka ¨ Logs & Stream processing ¨ Apache Samza ¨ Stateful stream processing
  • 28. Samza API public interface StreamTask { void process (IncomingMessageEnvelope envelope, MessageCollector collector, TaskCoordinator coordinator); } getKey(), getMsg() sendMsg(topic, key, value) commit(), shutdown()
  • 29. Samza Architecture (Logical view) Log A partition 0 partition 1 partition 2 Task 1 Task 2 Task 3 partition 0 partition 1 Log B
  • 30. Samza Architecture (Logical view) Log A partition 0 partition 1 partition 2 Samza container 1 Samza container 2 Task 1 Task 2 Task 3 partition 0 partition 1 Log B
  • 31. Samza Architecture (Physical view) Samza container 1 Samza container 2 Host 1 Host 2
  • 32. Samza Architecture (Physical view) Node manager Node manager Samza container 1 Samza container 2 Host 1 Host 2 Samza YARN AM
  • 33. Samza Architecture (Physical view) Node manager Node manager Samza container 1 Samza container 2 Host 1 Host 2 Samza YARN AM Kafka Kafka
  • 34. Samza Architecture: Equivalence to Map Reduce Node manager Node manager Map Reduce Map Reduce YARN AM HDFS HDFS Host 1 Host 2
  • 35. M/R Operation Primitives ¨ Filter records matching some condition ¨ Map record = f(record) ¨ Join Two/more datasets by key ¨ Group records with same key ¨ Aggregate f(records within the same group) ¨ Pipe job 1’s output => job 2’s input
  • 36. M/R Operation Primitives on streams ¨ Filter records matching some condition ¨ Map record = f(record) ¨ Join Two/more datasets by key ¨ Group records with same key ¨ Aggregate f(records within the same group) ¨ Pipe job 1’s output => job 2’s input Requires state maintenance
  • 37. Agenda ¨ Real-time Data Integration ¨ Introduction to Logs & Apache Kafka ¨ Logs & Stream processing ¨ Apache Samza ¨ Stateful stream processing
  • 38. Example: Newsfeed User ... posted "..." User 567 posted "Hello World" Status update log Fan out messages to followers Push notification log 567 -> [123, 679, 789, ...] 999 -> [156, 343, ... ] User 989 posted "Blah Blah" External connection DB Refresh user 123's newsfeed Refresh user 679's newsfeed Refresh user ...'s newsfeed
  • 39. Local state vs Remote state: Remote 100-500K msg/sec/node 100-500K msg/sec/node Disk Remote state 1-5K queries/sec ?? ex: Cassandra, MongoDB, etc Samza task partition 0 Samza task partition 1 ❌ Performance ❌ Isolation ❌ Limited APIs
  • 40. Local state: Bring data closer to computation Local Samza task partition 0 LevelDB/RocksDB Samza task partition 1 Local LevelDB/RocksDB
  • 41. Local state: Bring data closer to computation Local Samza task partition 0 LevelDB/RocksDB Samza task partition 1 Local LevelDB/RocksDB Disk Change log stream
  • 42. Example Revisited: Newsfeed User ... posted "..." User 567 posted "Hello World" User ... followed ... User 123 followed 567 User 890 followed 234 Status update log New connection log Fan out messages to followers Push notification log 567 -> [123, 679, 789, ...] 999 -> [156, 343, ... ] User 989 posted "Blah Blah" Refresh user 123's newsfeed Refresh user 679's newsfeed Refresh user ...'s newsfeed
  • 43. Fault tolerance? Node manager Node manager Samza container 1 Samza container 2 Host 1 Host 2 Samza YARN AM Kafka Kafka
  • 44. Fault tolerance in Samza Local Samza task partition 0 LevelDB/RocksDB Samza task partition 1 Local LevelDB/RocksDB Durable change log
  • 45. Slow jobs Log A Job 1 Log B Log C Job 2 Log D Log E ❌ Drop data ❌ Backpressure ❌ Queue ❌ In memory ✅ On disk (KAFKA)
  • 46. Summary ¨ Real time data integration is crucial for the success and adoption of stream processing ¨ Logs form the basis for real time data integration ¨ Stream processing = f(logs) ¨ Samza is designed from ground-up for scalability and provides fault-tolerant, persistent state
  • 47. Thank you! ¨ The Log ¤ http://bit.ly/the_log ¨ Apache Kafka ¤ http://kafka.apache.org ¨ Apache Samza ¤ http://samza.incubator.apache.org ¨ Me ¤ @nehanarkhede ¤ http://www.linkedin.com/in/nehanarkhede
  • 48. Watch the video with slide synchronization on InfoQ.com! http://www.infoq.com/presentations/samza-linkedin- 2014