Big Data and NoSQL in REAL TIME
Facebook and Twitter Examples
Ron Zavner
Agenda
• Our real time world…
• Flavors of Big Data
• Facebook messaging and real time analytics system
• Twitter analytics system
• Winning architecture?
® Copyright 2011 Gigaspaces Ltd. All Rights Reserved
What is Real Time?
We’re Living in a Real Time World…
Homeland Security
Real Time Search
Social
eCommerce
User Tracking & Engagement
Financial Services
Big Data Predictions
“Over the next few years we'll see the adoption of scalable frameworks and platforms for handling streaming, or near real-time, analysis and processing. In the same way that Hadoop has been borne out of large-scale web applications, these platforms will be driven by the needs of large-scale location-aware mobile, social and sensor use.”
Edd Dumbill, O’Reilly
The Two Vs of Big Data
• Velocity
• Volume
The Flavors of Big Data Analytics
• Counting
• Correlating
• Research
Analytics – Counting
• How many signups, tweets, retweets for a topic?
• What’s the average latency?
• Demographics
  • Countries and cities
  • Gender
  • Age groups
  • Device types
  • …
Analytics – Correlating
• What devices fail at the same time?
• What features get users hooked?
• What places on the globe are “happening”?
Analytics – Research
• Sentiment analysis
  • “Obama is popular”
• Trends
  • “People like to tweet after watching American Idol”
• Spam patterns
  • How can you tell when a user spams?
It’s All about Timing
• Event driven / stream processing
  • High resolution – every tweet gets counted
• Ad-hoc querying
  • Medium resolution
• Long running batch jobs (ETL, map/reduce)
  • Low resolution (trends & patterns)
This is what we’re here to discuss
FACEBOOK REAL-TIME ANALYTICS SYSTEM
Store 135+ Billion Messages A Month
The actual analytics..
• Like button analytics
• Comments box analytics
Goals
• Show why plugins are valuable
• Make the data more actionable
• Make the data more timely
• Remove points of failure
• Handle massive load - 200K events per second
Technology Evaluation
• MySQL DB Counters
• In-Memory Counters
• MapReduce
• Cassandra
• HBase
The solution..
[Diagram components: FACEBOOK Log (×3), Scribe, PTail, Puma, HBase, HDFS. Labels: Real Time; Long Term; Batch, 1.5 Sec; 10,000 writes/sec per server]
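The "Batch 1.5 Sec" label suggests that events are aggregated in memory for a short window and written out in one go, so a hot key costs one write per window rather than one write per event. A minimal Python sketch of that idea follows; the `BatchingCounter` class, its `flush` callback and the sample events are hypothetical and are not taken from Facebook's code.

```python
from collections import Counter


class BatchingCounter:
    """Aggregate events in memory and flush once per window (hypothetical helper)."""

    def __init__(self, flush, window_sec=1.5):
        self.flush = flush            # callback that persists a {key: delta} dict
        self.window_sec = window_sec  # 1.5 s, the figure shown on the slide
        self.pending = Counter()
        self.window_start = None

    def record(self, key, now):
        """Count one event; flush first if the current window has expired."""
        if self.window_start is None:
            self.window_start = now
        elif now - self.window_start >= self.window_sec:
            self.flush_now()
            self.window_start = now
        self.pending[key] += 1

    def flush_now(self):
        """Write one increment per key instead of one write per event."""
        if self.pending:
            self.flush(dict(self.pending))
            self.pending.clear()


writes = []  # stands in for the backing store
counter = BatchingCounter(flush=writes.append)
for t, url in [(0.0, "a"), (0.2, "a"), (0.9, "b"), (1.6, "a"), (1.7, "b")]:
    counter.record(url, now=t)
counter.flush_now()
print(writes)
```

Timestamps are passed in explicitly so the sketch stays deterministic; a real collector would presumably read a clock and also flush on a timer when no events arrive.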
Keep Things In Memory
Facebook keeps 80% of its data in memory (Stanford research)
RAM is 100-1000x faster than disk (random seek)
• Disk: 5-10 ms
• RAM: ~0.001 ms
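Taking the two latency figures above at face value, the implied speed-up can be worked out directly. The short Python calculation below does that and also shows what the gap means for 1,000 sequential random lookups (a made-up workload). The ratio it yields is higher than the 100-1000x range quoted on the slide, so all of these numbers are best read as orders of magnitude.

```python
# Latency figures quoted on the slide, converted to microseconds.
DISK_SEEK_US = (5_000, 10_000)  # 5-10 ms per random seek
RAM_ACCESS_US = 1               # ~0.001 ms

# Speed-up implied by the quoted figures (low and high end of the disk range).
speedup = tuple(disk // RAM_ACCESS_US for disk in DISK_SEEK_US)
print(speedup)

# Time to serve 1,000 random lookups one after another, in milliseconds.
lookups = 1_000
disk_ms = tuple(lookups * d / 1_000 for d in DISK_SEEK_US)
ram_ms = lookups * RAM_ACCESS_US / 1_000
print(disk_ms, ram_ms)
```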
TWITTER REAL-TIME ANALYTICS SYSTEM
Twitter Reach – Here’s One Use Case
Let’s start with some statistics…
Twitter in Numbers (March 2011)
Source: http://blog.twitter.com/2011/03/numbers.html
• It takes a week for users to send 1 billion Tweets.
• On average, 140 million tweets get sent every day.
• The highest throughput to date is 6,939 tweets/sec.
• 460,000 new accounts are created daily.
Twitter in Numbers
Source: http://www.sysomos.com/insidetwitter/
• 5% of the users generate 75% of the content.
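To size a system from the March 2011 figures it helps to turn them into per-second rates. The Python arithmetic below is only a back-of-the-envelope check on the quoted numbers; the variable names are illustrative.

```python
SECONDS_PER_DAY = 24 * 60 * 60

tweets_per_day = 140_000_000     # average, March 2011
peak_per_sec = 6_939             # highest throughput to date
tweets_per_week = 1_000_000_000  # "a week to send 1 billion Tweets"

avg_per_sec = tweets_per_day / SECONDS_PER_DAY
peak_to_avg = peak_per_sec / avg_per_sec
per_day_from_week = tweets_per_week / 7

print(f"average rate: {avg_per_sec:,.0f} tweets/sec")
print(f"peak is {peak_to_avg:.1f}x the average")
print(f"1 billion/week is {per_day_from_week:,.0f} tweets/day")
```

The average works out to roughly 1,600 tweets per second and the quoted peak is a little over four times that, so capacity has to be planned for bursts rather than for the mean. The weekly and daily figures agree with each other to within a few percent.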
Challenge – Word Count
[Diagram: Tweets → Count → Word:Count]
• Hottest topics
• URL mentions
• etc.
Word Count - Analyze the Problem
• (Tens of) thousands of tweets per second to process
• Assumption: Need to process in near real time
• Aggregate counters for each word
  • A few 10s of thousands of words (or hundreds of thousands if we include URLs)
• System needs to linearly scale
• System needs to be fault tolerant
Use EDA (Event Driven Architecture)
Raw → Tokenizer → Tokenized → Filterer → Filtered → Counter
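The three stages can be sketched as a chain of Python generators, each consuming the previous stage's output the way an event handler would. This is an illustration only: the tokenizing regex, the stop-word list and the function names are assumptions, not the code behind the slide.

```python
import re
from collections import Counter

STOP_WORDS = {"a", "an", "the", "is", "to", "of", "and"}  # illustrative only


def tokenize(raw_tweets):
    """Raw -> Tokenized: split each tweet into lower-case words."""
    for tweet in raw_tweets:
        yield re.findall(r"[a-z0-9#@']+", tweet.lower())


def filter_tokens(tokenized):
    """Tokenized -> Filtered: drop words that carry no signal."""
    for words in tokenized:
        yield [w for w in words if w not in STOP_WORDS]


def count(filtered):
    """Filtered -> Counter: aggregate one counter per word."""
    counts = Counter()
    for words in filtered:
        counts.update(words)
    return counts


tweets = ["Storm is the new Hadoop", "The storm is coming"]
word_counts = count(filter_tokens(tokenize(tweets)))
print(word_counts["storm"])
```

Because every stage only sees events from the stage before it, each one can be scaled or replaced on its own, which is the point of the event-driven layout.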
Sharding (Partitioning)
Tokenizer 1   Filterer 1   Counter Updater 1
Tokenizer 2   Filterer 2   Counter Updater 2
Tokenizer 3   Filterer 3   Counter Updater 3
Tokenizer n   Filterer n   Counter Updater n
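One common way to shard the counters is to route each word by a stable hash, so that a given word is always updated by the same partition and no cross-partition locking is needed. The Python sketch below assumes that scheme; the slide does not say how routing is done, and `partition_for` and the partition count are hypothetical.

```python
import zlib
from collections import Counter

NUM_PARTITIONS = 4  # "n" on the slide; 4 is an arbitrary choice


def partition_for(word, n=NUM_PARTITIONS):
    """Stable hash routing: the same word always maps to the same partition."""
    return zlib.crc32(word.encode("utf-8")) % n


# One independent counter per partition (one "Counter Updater" each).
counters = [Counter() for _ in range(NUM_PARTITIONS)]

words = ["storm", "hbase", "storm", "puma", "hbase", "storm"]
for word in words:
    counters[partition_for(word)][word] += 1

# Each word lives in exactly one partition, so a global view is a plain merge.
total = sum(counters, Counter())
print(total["storm"])
```

`zlib.crc32` is used instead of the built-in `hash()` because Python salts string hashes per process, which would send the same word to different partitions on different machines.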
Computing Reach with Event Streams
Twitter Storm
Storm Overview
Storm Cluster
Streaming word count with Storm
Storm Limitation
[Diagram labels: Spouts, Bolt, Topologies]
• Storage
• Data Persistency
• Querying
Winner is… Storm & in-memory data grids
• Event driven / flow
• Reliable
• Storage
• Data Persistency
• Querying
References
• Facebook messages
  http://highscalability.com/blog/2010/11/16/facebooks-new-real-time-messaging-system-hbase-to-store-135.html
• Facebook Real time analytics
  http://highscalability.com/blog/2011/3/22/facebooks-new-realtime-analytics-system-hbase-to-process-20.html
• Learn and fork the code on github: https://github.com/Gigaspaces/rt-analytics
• Detailed blog post: http://bit.ly/gs-bigdata-analytics
• Twitter in numbers: http://blog.twitter.com/2011/03/numbers.html
• Twitter Storm: http://bit.ly/twitter-storm
Q&A
RonZ@gigaspaces.com


Editor's notes

  1. Real time is ideally less than a second, not 30 seconds, not 5 seconds
  2. We live almost every aspect of our lives in a real-time world. Think about our social communications; we update our friends online via social networks and micro-blogging, we text from our mobiles, or message from our laptops. But it's not just our social lives; we shop online whenever we want, we search the web for immediate answers to our questions, we trade stocks online, we pay our bills, and do our banking. All online and all in real time.Real time doesn't just affect our personal lives. Enterprises and government agencies need real-time insights to be successful, whether they are investment firms that need fast access to market views and risk analysis, or retailers that need to adjust their online campaigns and recommendations. Even homeland security has come to increasingly rely on real-time monitoring.The amount of data that flows in these systems is huge.Major social networking platforms like Facebook and Twitter have developed their own architectures for handling the need for real-time analytics on huge amounts of data. However, not every company has the need or resources to build their own Twitter-like solution.
  3. Big data is expected to keep growing: the amount of data is increasing, and so is the demand for it. Analytics in real time is becoming a must.
  4. The two Vs of Big Data are velocity and volume. As noted earlier, the volume of data we need to handle is huge, and at the same time we need to handle it fast. We have to run very complex calculations in real time, over a very large amount of data. The data is usually distributed across many servers; each server performs its own calculation and the results are then aggregated (map reduce). This is a very common pattern for real-time analytics. That said, the latency requirement is sometimes the harder challenge, and we need to reduce the time these calculations take. You can't go straight to a relational DB, which is not designed to handle the speed and volumes we're talking about; that's why we look at NoSQL or a cache. NoSQL can go further: I don't have the constraints of a relational DB and I can store the data as it is (in JSON, the format used by Twitter), but processing the sheer amount of data in the timeframes we need is incredibly challenging.
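The scatter-and-aggregate pattern described in this note can be sketched in a few lines. This is a minimal illustration, not any particular product's API: the `PARTITIONS` data and the function names are made up for the example, and threads stand in for what would be separate servers.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

# Hypothetical data: each inner list stands in for the events held by one server.
PARTITIONS = [
    ["signup", "tweet", "tweet"],
    ["tweet", "retweet"],
    ["signup", "retweet", "retweet", "tweet"],
]


def map_partition(events):
    """Map step: a server counts only its own local events."""
    return Counter(events)


def reduce_counts(partials):
    """Reduce step: merge the per-server partial counts into one result."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total


def count_events(partitions):
    # Run the map step for every partition in parallel, then aggregate.
    with ThreadPoolExecutor() as pool:
        partials = list(pool.map(map_partition, partitions))
    return reduce_counts(partials)


if __name__ == "__main__":
    print(count_events(PARTITIONS))
```

The point of the split is that only small partial results cross the network; the raw events stay where they are stored.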
  5. I think analytics, when we're talking about Big Data and something like Twitter, can be split into three categories, or buckets. The first bucket is "Counting": how many signups, tweets or retweets are there for a topic? I might also be interested in counting in relation to demographic information: for example, how many people are tweeting right now at this event, and on what types of devices? The "Correlating" bucket might contain questions like how many Twitter users are using desktop vs. mobile, and what's the trend within the week or within the month? Our third bucket, "Research", is similar to the second but looks in more depth at the past; here we require a lot of processing of historic data.
  6. Counting calculations: we expect to see results in real time. The challenge is reliability. It's not that we lose money, but if we lose events the accuracy of the system is damaged and the report loses its value. Counting requires very high resolution: every tweet counts, because we don't know which one will be important.
  7. Correlating: we expect to see most results in real time as well. These are the interactive queries where I expect a result that I can lay out in my browser or a BI tool.
  8. Research calculations are historical, and Hadoop (for example) is a very popular framework for batch analytics. We don't expect a real-time response here, but you never know what's next.
  9. It's all about timing. We expect to see real-time results for lots of our calculations. We also need to make sure that our architecture allows us to scale: today we might need to handle 100K TPS, and that can easily grow to 200K TPS. We need to be highly available as well, with zero downtime. For these we can use event-driven and stream-processing architectures. Correlation and research calculations are very interesting topics, and there we can accept longer response times; here, however, we are going to examine the real-time challenge.
  10. We are going to talk about Facebook's real-time analytics system, and also about how they chose to store 135+ billion messages a month.
  11. http://highscalability.com/blog/2010/11/16/facebooks-new-real-time-messaging-system-hbase-to-store-135.html You may have read somewhere that Facebook has introduced a new Social Inbox integrating email, IM, SMS, text messages, and on-site Facebook messages. All in all, they need to store over 135 billion messages a month. Where do they store all that stuff? One of the posts gave the surprise answer: HBase beat out MySQL, Cassandra, and a few others. Why a surprise? Facebook created Cassandra, and it was purpose-built for an inbox-type application, but they found Cassandra's eventual consistency model wasn't a good match for their new real-time Messages product. Facebook also has an extensive MySQL infrastructure, but they found performance suffered as data sets and indexes grew larger. And they could have built their own, but they chose HBase. HBase is a scale-out table store supporting very high rates of row-level updates over massive amounts of data, which is exactly what a messaging system needs. HBase is also a column-based key-value store built on the BigTable model. It's good at fetching rows by key or scanning ranges of rows and filtering, which is also what a messaging system needs. Complex queries are not supported, however. Queries are generally given over to an analytics tool like Hive, which Facebook created to make sense of their multi-petabyte data warehouse; Hive is based on Hadoop's file system, HDFS, which is also used by HBase.
  12. Over the past year, social plugins have become an important and growing source of traffic for millions of websites. Today we're releasing a new version of Insights for Websites to give you better analytics on how people interact with your content and to help you optimize your website in real time. Like button analytics: for the first time, you can now access real-time analytics to optimize Like buttons across both your site and on Facebook. We use anonymized data to show you the number of times people saw Like buttons, clicked Like buttons, saw Like stories on Facebook, and clicked Like stories to visit your website.
  13. Plugins are valuable: social plugins have become an important and growing source of traffic for millions of websites over the past year. We released a new version of Insights for Websites last week to give site owners better analytics on how people interact with their content and to help them optimize their websites in real time. To accomplish this, we had to engineer a system that could process over 20 billion events per day (200,000 events per second) with a lag of less than 30 seconds. Make the data actionable: help users take action to make their content more valuable. How many people see a plugin, how many people take action on it, and how many are converted to traffic back on your site? Make the data more timely: went from a 48-hour turnaround to 30 seconds. Multiple points of failure were removed to reach this goal.
  14. http://highscalability.com/blog/2011/3/22/facebooks-new-realtime-analytics-system-hbase-to-process-20.html
      MySQL DB counters: Have a row with a key and a counter. This results in lots of database activity. Stats are kept at day-bucket granularity, and every day at midnight the stats roll over. Reaching the rollover period resulted in a lot of writes to the database, which caused a lot of lock contention. They tried to spread the work by taking time zones into account, and tried to shard things differently. The high write rate led to lock contention, it was easy to overload the databases, the databases had to be monitored constantly, and the sharding strategy had to be rethought. The solution was not well tailored to the problem.
      In-memory counters: If you are worried about bottlenecks in IO, then throw it all in memory. No scale issues. Counters are stored in memory, so writes are fast and the counters are easy to shard. They felt in-memory counters, for reasons not explained, weren't as accurate as other approaches. Even a 1% failure rate would be unacceptable. Analytics drive money, so the counters have to be highly accurate. They didn't implement this system. It was a thought experiment, and the accuracy issue caused them to move on.
      MapReduce: Used Hadoop/Hive for the previous solution. Flexible. Easy to get running. Can handle IO, both massive writes and reads. You don't have to know ahead of time how you will query; the data can be stored and then queried. Not real time. Many dependencies. Lots of points of failure. Complicated system. Not dependable enough to hit real-time goals.
      Cassandra: HBase seemed a better solution based on availability and the write rate. Write rate was the huge bottleneck being solved.
  15. http://highscalability.com/blog/2011/3/22/facebooks-new-realtime-analytics-system-hbase-to-process-20.html
      The winner: HBase + Scribe + Ptail + Puma. At a high level: HBase stores data across distributed machines. A tailing architecture is used: new events are stored in log files, and the logs are tailed. A system rolls the events up and writes them into storage. A UI pulls the data out and displays it to users.
      Data flow: The user clicks Like on a web page. This fires an AJAX request to Facebook. The request is written to a log file using Scribe. Scribe handles issues like file rollover. Scribe is built on the same HDFS file store Hadoop is built on. Write extremely lean log lines: the more compact the log lines, the more can be stored in memory.
      Ptail: Data is read from the log files using Ptail. Ptail is an internal tool built to aggregate data from multiple Scribe stores. It tails the log files and pulls data out. Ptail data is separated into three streams so they can eventually be sent to their own clusters in different datacenters: plugin impressions, news feed impressions, and actions (plugin + news feed).
      Puma: Batch data to lessen the impact of hot keys. Even though HBase can handle a lot of writes per second, they still want to batch data. A hot article will generate a lot of impressions and news feed impressions, which will cause huge data skews, which will cause IO issues. The more batching the better. Batch for 1.5 seconds on average. They would like to batch longer, but they have so many URLs that they run out of memory when creating a hashtable. Wait for the last flush to complete before starting a new batch, to avoid lock contention issues.
      UI renders data: Frontends are all written in PHP. The backend is written in Java, and Thrift is used as the messaging format so PHP programs can query Java services. Caching solutions are used to make the web pages display more quickly. Performance varies by the statistic: a counter can come back quickly, while finding the top URL in a domain can take longer.
      Response times range from 0.5 to a few seconds. The more data is cached, and the longer it is cached, the less real-time it is. Different caching TTLs are set in memcache.
      MapReduce: The data is then sent to MapReduce servers so it can be queried via Hive. This also serves as a backup plan, as the data can be recovered from Hive. Raw logs are removed after a period of time.
      HBase: HBase is a distributed column store, a database interface to Hadoop. Facebook has people working internally on HBase. Unlike a relational database, you don't create mappings between tables. You don't create indexes; the only index you have is the primary row key. From the row key you can have millions of sparse columns of storage. It's very flexible. You don't have to specify the schema: you define column families, to which you can add keys at any time. A key feature for scalability and reliability is the WAL (write-ahead log), which is a log of the operations that are supposed to occur. Based on the key, data is sharded to a region server. It is written to the WAL first. Then the data is put into memory. At some point in time, or once enough data has accumulated, the data is flushed to disk. If the machine goes down you can recreate the data from the WAL, so there's no permanent data loss. Using a combination of the log and in-memory storage, they can handle an extremely high rate of IO reliably. HBase handles failure detection and automatically routes across failures. Currently HBase resharding is done manually. Automatic hot spot detection and resharding is on the roadmap for HBase, but it's not there yet. Every Tuesday someone looks at the keys and decides what changes to make in the sharding plan.
      Schema: Store a bunch of counters on a per-URL basis. The row key, which is the only lookup key, is the MD5 hash of the reverse domain. Selecting the proper key structure helps with scanning and sharding. A problem they have is sharding data properly onto different machines. Using an MD5 hash makes it easier to say this range goes here and that range goes there.
      For URLs they do something similar, plus they add an ID on top of that. Every URL in Facebook is represented by a unique ID, which is used to help with sharding. A reverse domain, com.facebook/ for example, is used so that the data is clustered together. HBase is really good at scanning clustered data, so if they store the data so that it's clustered together they can efficiently calculate stats across domains. Think of every row as a URL and every cell as a counter; you can set different TTLs (time to live) for each cell. If you keep an hourly count, there's no reason to keep it around for every URL forever, so they set a TTL of two weeks. TTLs are typically set on a per-column-family basis. Per server they can handle 10,000 writes per second. Checkpointing is used to prevent data loss when reading data from log files: tailers save log stream checkpoints in HBase, and these are replayed on startup so no data is lost. Useful for detecting click fraud, but it doesn't have fraud detection built in.
      Tailer hot spots: In a distributed system there's a chance one part of the system can be hotter than another. One example is region servers that can be hot because more keys are being directed that way. One tailer can lag behind another, too. If one tailer is an hour behind and the others are up to date, what numbers do you display in the UI? For example, impressions have a way higher volume than actions, so CTR rates were way higher in the last hour. The solution is to figure out the least up-to-date tailer and use that when querying metrics.
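The batching idea attributed to Puma in this note (collect increments in an in-memory hashtable, then write one aggregated update per key) can be sketched as follows. This is an illustration under assumptions, not Facebook's code: the `BatchingCounter` class, its `max_keys` limit, and the plain dict used as the store are all hypothetical stand-ins. In the note the flush is driven by time (about 1.5 seconds on average); here the caller decides when to flush.

```python
from collections import Counter


class BatchingCounter:
    """Aggregates increments in memory and writes them to a store in batches."""

    def __init__(self, store, max_keys=1000):
        self.store = store        # stand-in for the backing store (a dict here)
        self.max_keys = max_keys  # flush early if the batch holds this many keys
        self.pending = Counter()  # in-memory hashtable of not-yet-written deltas
        self.flushes = 0

    def record(self, url, metric):
        """Counts one event; nothing is written to the store yet."""
        self.pending[(url, metric)] += 1
        if len(self.pending) >= self.max_keys:
            self.flush()

    def flush(self):
        """Writes one aggregated increment per distinct key, then resets the batch."""
        for key, delta in self.pending.items():
            self.store[key] = self.store.get(key, 0) + delta
        self.pending.clear()
        self.flushes += 1


if __name__ == "__main__":
    store = {}
    counter = BatchingCounter(store)
    for _ in range(1000):  # a "hot" article receiving many impressions
        counter.record("com.example/hot-article", "impression")
    counter.flush()
    print(store)
```

A hot key that receives a thousand events inside one batch window costs a single write at flush time, which is why longer batches reduce skew, at the price of memory for the hashtable.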
  16. In Twitter, the primary relationship between entities is many-to-many. Every post is sent to numerous followers of the user who sent the post; at the same time, each user can follow many other users. This causes Twitter to behave like a living organism, growing unexpectedly in many different directions. Let me give you an example. One analytic where I need to process tweets is Twitter Reach: how many unique Twitter accounts received tweets about my topic. So, how do I compute my reach? There are several stages in the processing: (1) first, I need to record every tweet; (2) then I can count how many followers got that tweet; (3) then I need to work out the distinct reach, meaning that for each follower I need to look at each of their followers and remove the duplicates. Try to imagine what it takes to produce that number. If my tweet is retweeted by 100 users, each of whom has 100 followers, it starts to take a fair bit of number crunching.
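The duplicate-removal step in the reach calculation is a set union. The sketch below uses a tiny made-up follower graph (`FOLLOWERS` and both function names are assumptions for illustration) to show why simply adding up follower counts overstates reach.

```python
# Hypothetical follower graph: user -> set of that user's followers.
FOLLOWERS = {
    "alice": {"bob", "carol", "dave"},
    "bob": {"carol", "erin"},
    "carol": {"dave", "frank"},
}


def naive_reach(tweeters, followers):
    """Adds up follower counts; accounts that follow several tweeters are counted twice."""
    return sum(len(followers.get(user, set())) for user in tweeters)


def distinct_reach(tweeters, followers):
    """Counts each receiving account once, however many tweeters it follows."""
    reached = set()
    for user in tweeters:
        reached |= followers.get(user, set())
    return len(reached)


if __name__ == "__main__":
    tweeters = ["alice", "bob", "carol"]  # the author plus two retweeters
    print(naive_reach(tweeters, FOLLOWERS), distinct_reach(tweeters, FOLLOWERS))
```

At Twitter scale the follower sets do not fit on one machine, so the union itself has to be partitioned and merged, which is where the number crunching comes from.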
  17. Read mostly – duplicate the data so you can optimize the read.
  18. Let's analyze the problems that a simple Twitter word count presents. The challenge here seems straightforward: tens of thousands of tweets need to be stored and parsed every second, and word counters need to be aggregated continuously. Even though tweets are limited to 140 characters, we are dealing with hundreds of thousands of words per second. This is big.
  19. In many ways this is the benchmark for other systems, because it does stretch the limits. There is a huge amount of activity to analyze (the scale is enormous), and we want to grab a lot of information out of it. This is the challenge: How do we grab the stream in real time without affecting latency? How do we deal with that stream in real time? How do we handle the write scalability in real time? How do we make the system bullet-proof and easily scalable? How do we begin to do analytics on this?
  20. Storm is a real-time, open source data streaming framework that functions entirely in memory. Storm is designed to run on several machines to provide parallelism. Real-time processing is becoming very popular, and Storm is a popular framework and runtime used by Twitter for processing real-time data streams. Storm addresses the complexity of running real-time streams through a compute cluster by providing an elegant set of abstractions that make it easier to reason about your problem domain, letting you focus on data flows rather than on implementation details.
  21. It constructs a processing graph that feeds data from an input source through processing nodes.  The processing graph is called a "topology".  The input data sources are called "spouts", and the processing nodes are called "bolts".  The data model consists of tuples. Tuples flow from Spouts to the bolts, which execute user code. Besides simply being locations where data is transformed or accumulated, bolts can also join streams and branch streams. Storm topologies are deployed in a manner somewhat similar to a webapp; a jar file is presented to a deployer which distributes it around the cluster where it is loaded and executed.  A topology runs until it is killed.
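To make the spout/bolt vocabulary concrete, here is a toy single-process model of a topology, using the word count from the earlier note as the workload. It does not use Storm's API: every class and function name below is hypothetical, the "topology" is a simple linear chain rather than a graph, and it stops when the spout is exhausted instead of running until killed.

```python
from collections import Counter


class SentenceSpout:
    """Input source: emits one tuple per sentence."""

    def __init__(self, sentences):
        self.sentences = sentences

    def tuples(self):
        for sentence in self.sentences:
            yield (sentence,)


class SplitBolt:
    """Processing node: turns one sentence tuple into one tuple per word."""

    def execute(self, tup):
        return [(word.lower(),) for word in tup[0].split()]


class CountBolt:
    """Processing node: accumulates a running count per word."""

    def __init__(self):
        self.counts = Counter()

    def execute(self, tup):
        self.counts[tup[0]] += 1
        return []  # terminal bolt: emits nothing downstream


def run_topology(spout, bolts):
    """Feeds every spout tuple through the chain of bolts, in order."""
    for tup in spout.tuples():
        pending = [tup]
        for bolt in bolts:
            pending = [out for t in pending for out in bolt.execute(t)]


if __name__ == "__main__":
    counter = CountBolt()
    run_topology(SentenceSpout(["storm is fast", "Storm scales"]), [SplitBolt(), counter])
    print(counter.counts.most_common(3))
```

In a real cluster each bolt would run as many parallel instances on different machines, with tuples routed between them; the sketch only shows how data flows from the source through the processing nodes.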
  22. ZooKeeper: Storm uses ZooKeeper to communicate between the "Nimbus" (master) and the "Supervisors" (workers), as well as to store its current state. ZooKeeper coordinates activity in the cluster and provides operational state storage. storm-nimbus: the topology execution coordinator for the cluster. The Nimbus is a singleton in the cluster (i.e. not elastic). It is stateless, however (because its state is stored in ZooKeeper), and therefore can fail and be restarted without consequence, even to running jobs. storm-supervisor: the supervisors actually run the topology code. There can, and should, be many of these (i.e. elastic). The parallelism attributes of a given topology are specified in the topology itself.
  23. Data grids are more event driven, while Storm is used for flow/streaming. Storm has more capabilities in this area: it is very specifically directed at the streaming problem, and is optimized for that use case. In order to produce extremely high throughput, it pushes responsibility for reliability outside of its own framework. Also because of its streaming focus, it provides higher-level abstractions that make reasoning about streaming easier than in XAP.
      Reliable: XAP's architecture is oriented to making data in memory nearly as reliable as data on disk. Thus, writing into XAP involves some level of serialization and perhaps a network hop as well. Storm doesn't aspire to this level of reliability; instead, it provides the means for the suppliers and consumers of data to provide it. Storm is "optimistic" in roughly the same sense that an optimistic lock in a database is optimistic: it assumes success is far more likely than failure, and so is willing to take big hits to performance when failures occur, because they are so rare. XAP is more pessimistic in this sense. XAP is designed to be a source of truth for the data it holds, and goes to great lengths to achieve that. For the reasons cited above, there is no way, even in principle, for XAP to have throughput comparable to Storm's, at least when there is no persistence. This caveat is critical, however, since real-world systems almost always need persistence, and ultra-fast in-memory persistence is one of XAP's main strengths. I also mentioned that Storm has higher-level abstractions for streaming, which make programming it more straightforward for streaming applications. Whereas in XAP you could implement streaming as a series of event-driven processing stages, there is no concept of a "stream" or any kind of "flow" at the API level.
      Storm with XAP: Basically, spouts provide the source of tuples for Storm processing.
For spouts to be maximally performant and reliable, they need to provide tuples in batches, and be able to replay failed batches when necessary. Of course, in order to have batches, you need storage, and to be able to replay batches, you need reliable storage. XAP is about the highest performing, reliable source of data out there, so a spout that serves tuples from XAP is a natural combination. Recall that Storm is a stream processing framework and runtime, and this presupposes the existence of a stream for it to read from. So there are really two artifacts needed for XAP to provide a spout to Storm: a "stream" in XAP, and of course the spout that reads from it. Realizing this, I wrote a simple service for XAP that leverages XAP's FIFO capabilities called XAPStream. It is a standalone (Storm independent) service that lets clients dynamically create, destroy, and of course read and write from streams in both batch and non-batch modes.
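The requirement stated above, a FIFO stream that hands out batches and can replay a failed batch, can be sketched in memory as follows. This is not the XAPStream API or Storm's spout interface: `ReplayableStream` and its methods are hypothetical names, and a real implementation would keep the items in reliable, replicated storage rather than in a local deque.

```python
from collections import deque
from itertools import count


class ReplayableStream:
    """In-memory FIFO stream whose batches stay pending until acknowledged."""

    def __init__(self):
        self._queue = deque()  # items not yet handed out
        self._pending = {}     # batch_id -> items awaiting ack or fail
        self._ids = count(1)

    def write(self, item):
        self._queue.append(item)

    def read_batch(self, max_items):
        """Hands out up to max_items in FIFO order; returns (batch_id, items)."""
        size = min(max_items, len(self._queue))
        items = [self._queue.popleft() for _ in range(size)]
        if not items:
            return None, []
        batch_id = next(self._ids)
        self._pending[batch_id] = items
        return batch_id, items

    def ack(self, batch_id):
        """The consumer processed the batch; it can be forgotten."""
        self._pending.pop(batch_id, None)

    def fail(self, batch_id):
        """Puts a failed batch back at the head of the queue so it is replayed."""
        items = self._pending.pop(batch_id, [])
        self._queue.extendleft(reversed(items))


if __name__ == "__main__":
    stream = ReplayableStream()
    for i in range(5):
        stream.write(i)
    batch_id, items = stream.read_batch(3)
    stream.fail(batch_id)          # simulate a processing failure
    print(stream.read_batch(3))    # the same three items come back
```

Keeping unacknowledged batches around is what makes replay possible, and it is also why the storage behind the stream has to be reliable: a batch lost before its ack cannot be replayed.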