Real Time Analytics for Big Data
A Twitter Inspired Case Study



                                   @natishalom
Big Data Predictions




         ® Copyright 2011 Gigaspaces Ltd. All Rights Reserved
The Two Vs of Big Data

• Velocity
• Volume




We’re Living in a Real Time World…
• Social
• User Tracking & Engagement
• Homeland Security
• eCommerce
• Financial Services
• Real Time Search




The Flavors of Big Data Analytics




• Counting
• Correlating
• Research




Analytics @ Twitter – Counting

• How many signups, tweets, retweets for a topic?
• What’s the average latency?
• Demographics
    • Countries and cities
    • Gender
    • Age groups
    • Device types
    • …



Analytics @ Twitter – Correlating

• What devices fail at the same time?
• What features get users hooked?
• What places on the globe are “happening”?




Analytics @ Twitter – Research

• Sentiment analysis
    • “Obama is popular”
• Trends
    • “People like to tweet after watching American Idol”
• Spam patterns
    • How can you tell when a user spams?




It’s All about Timing




• “Real time” (< few seconds)
• Reasonably quick (seconds - minutes)
• Batch (hours/days)




It’s All about Timing
“Real time”:
• Event driven / stream processing
• High resolution – every tweet gets counted

Reasonably quick:
• Ad-hoc querying
• Medium resolution (aggregations)

Batch:
• Long running batch jobs (ETL, map/reduce)
• Low resolution (trends & patterns)

[Callout on the slide: “This is what we’re here to discuss”]

Challenge – Word Count
[Diagram: Tweets → ? → Count (Word:Count)]
• Hottest topics
• URL mentions
• etc.
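
The word-count challenge above can be sketched on a single machine before any scaling concerns come in. This is only an illustration, not the presenter's implementation: the token pattern, the function names `tokenize` and `count_words`, and the sample tweets are all assumptions made for the example.

```python
import re
from collections import Counter

# Hypothetical token pattern (an assumption): a URL is kept whole;
# otherwise a run of word characters, optionally prefixed by '#' or '@'.
TOKEN = re.compile(r"https?://\S+|[#@]?\w+")

def tokenize(tweet):
    """Yield the tokens of one tweet."""
    for token in TOKEN.findall(tweet):
        # URLs are case-sensitive, so only plain words are lowercased.
        if token.startswith(("http://", "https://")):
            yield token
        else:
            yield token.lower()

def count_words(tweets):
    """Aggregate one counter per word (or URL) across all tweets."""
    counts = Counter()
    for tweet in tweets:
        counts.update(tokenize(tweet))
    return counts

# Made-up sample input, for illustration only.
sample = [
    "Big data is big http://bit.ly/x",
    "Real time big data #BigData",
]
counts = count_words(sample)
hottest = counts.most_common(1)  # the single most frequent token
```

The same `Word:Count` mapping serves both use cases on the slide: the most frequent words approximate the hottest topics, and the tokens that are URLs give URL mentions.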
URL Mentions – Here’s One Use Case




Twitter in Numbers (March 2011)



It takes a week for users to send 1 billion Tweets.
Source: http://blog.twitter.com/2011/03/numbers.html

Twitter in Numbers (March 2011)



On average, 140 million tweets get sent every day.
Source: http://blog.twitter.com/2011/03/numbers.html

Twitter in Numbers (March 2011)



The highest throughput to date is 6,939 tweets/sec.
Source: http://blog.twitter.com/2011/03/numbers.html
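
To relate the figures quoted so far, the short calculation below converts the weekly and daily volumes into average tweets per second and compares them with the peak. It uses only the numbers from these slides; the variable names are made up for the example, and the results are rough averages rather than anything Twitter published.

```python
SECONDS_PER_DAY = 24 * 60 * 60

# Figures quoted on the slides (March 2011).
TWEETS_PER_WEEK = 1_000_000_000
TWEETS_PER_DAY = 140_000_000
PEAK_TWEETS_PER_SEC = 6_939

# Average rate implied by "1 billion Tweets in a week".
avg_from_weekly = TWEETS_PER_WEEK / (7 * SECONDS_PER_DAY)

# Average rate implied by "140 million tweets every day".
avg_from_daily = TWEETS_PER_DAY / SECONDS_PER_DAY

# How far the recorded peak sits above the daily average.
peak_to_average = PEAK_TWEETS_PER_SEC / avg_from_daily

print(f"average from weekly figure: {avg_from_weekly:.0f} tweets/sec")
print(f"average from daily figure:  {avg_from_daily:.0f} tweets/sec")
print(f"peak / average:             {peak_to_average:.1f}x")
```

Both averages come out at roughly 1,600 tweets per second, and the peak is a little over four times that, so a system sized only for the average would fall behind during bursts.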

Twitter in Numbers (March 2011)



460,000 new accounts are created daily.
Source: http://blog.twitter.com/2011/03/numbers.html

Twitter in Numbers




5% of the users generate 75% of the content.
Source: http://www.sysomos.com/insidetwitter/

Analyze the Problem

• (Tens of) thousands of tweets per second to process
    • Assumption: Need to process in near real time
• Aggregate counters for each word
    • A few tens of thousands of words (or hundreds of thousands if we include URLs)
• System needs to scale linearly
• System needs to be fault tolerant


Key Elements in Real Time Big Data Analytics




Sharding (Partitioning)

[Diagram: n parallel pipelines, each Tokenizer i → Filterer i → Counter Updater i, for i = 1 … n]
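
One way to read the sharding diagram above is sketched below: each tweet is tokenized, filtered, and then each surviving word is routed to exactly one counter partition. The routing rule (a stable CRC32 hash of the word), the stop-word list, and every name in the code are assumptions made for this illustration; the slide does not say how tweets or words are assigned to partitions.

```python
import zlib
from collections import Counter

# Hypothetical filter list; the slide does not say what the Filterer drops.
STOP_WORDS = {"a", "the", "is", "of"}

def tokenize(tweet):
    """Tokenizer stage: split a tweet into lowercase words."""
    return tweet.lower().split()

def keep(word):
    """Filterer stage: drop words we do not want to count."""
    return word not in STOP_WORDS

def shard_for(word, num_shards):
    """Pick the partition that owns a word's counter.

    zlib.crc32 is used because it is stable across processes, unlike
    Python's built-in hash() for strings.
    """
    return zlib.crc32(word.encode("utf-8")) % num_shards

class ShardedCounter:
    """Counter Updater stage: one independent Counter per partition."""

    def __init__(self, num_shards):
        self.shards = [Counter() for _ in range(num_shards)]

    def process(self, tweet):
        for word in filter(keep, tokenize(tweet)):
            self.shards[shard_for(word, len(self.shards))][word] += 1

    def count(self, word):
        return self.shards[shard_for(word, len(self.shards))][word]

    def total(self):
        return sum(sum(shard.values()) for shard in self.shards)

counter = ShardedCounter(num_shards=4)
counter.process("the data is the data")
```

Because a given word always maps to the same partition, no two partitions ever update the same counter, which is what lets partitions be added without coordination between them.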
Keep Things In Memory

• Facebook keeps 80% of its data in memory (Stanford research)
• RAM is thousands of times faster than disk for random access
    • Disk: 5-10 ms
    • RAM: ~0.001 ms
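
A rough back-of-the-envelope calculation shows why these latencies matter for the word-count problem. It combines the peak rate and the latency figures quoted in these slides with two simplifying assumptions that are not from the talk: about 10 counted words per tweet, and one random access per counter update with no caching or batching.

```python
import math

# Figures quoted in the slides.
PEAK_TWEETS_PER_SEC = 6_939
DISK_SEEK_MS = (5.0, 10.0)   # best and worst case random seek
RAM_ACCESS_MS = 0.001

# Assumption for illustration only: words counted per tweet.
WORDS_PER_TWEET = 10

def updates_per_sec(latency_ms):
    """Sequential counter updates one device can serve per second."""
    return 1000.0 / latency_ms

# Counter updates the system must absorb at peak.
updates_needed = PEAK_TWEETS_PER_SEC * WORDS_PER_TWEET

# Disks required if every update costs one random seek.
disks_needed = [
    math.ceil(updates_needed / updates_per_sec(ms)) for ms in DISK_SEEK_MS
]

# Capacity of a single in-memory store under the same model.
ram_capacity = updates_per_sec(RAM_ACCESS_MS)

print(f"updates needed at peak: {updates_needed}/sec")
print(f"disks needed: {disks_needed[0]} to {disks_needed[1]}")
print(f"one RAM-backed store: about {ram_capacity:,.0f} updates/sec")
```

Under these assumptions a disk-bound design needs hundreds of spindles just for seeks, while a single in-memory partition has headroom to spare.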
Use EDA (Event Driven Architecture)




Putting it all together




Know Your Toolset




References
• Writing your own Twitter analytics: http://ht.ly/d8j4I
• Detailed blog post: http://bit.ly/gs-bigdata-analytics
• Twitter in numbers: http://blog.twitter.com/2011/03/numbers.html
• Twitter Storm: http://bit.ly/twitter-storm
• Apache S4: http://incubator.apache.org/s4/




Editor's notes

  1. ActiveInsight