SlideShare a Scribd company logo
1 of 19
Download to read offline
Realtime Analytics
 with Cassandra
   or: How I Learned to
  Stopped Worrying and
       Love Counting
Combining “big” and “real-time” is hard

    Live & historical                    Drill downs
                         Trends...
      aggregates...                      and roll ups




2
                                                        Analytics
What is Realtime Analytics?
    eg “show me the number of mentions of
        ‘Acunu’ per day, between May and
          November 2011, on Twitter”


       Batch (Hadoop) approach would
    require processing ~30 billion tweets,
              or ~4.2 TB of data
                 http://blog.twitter.com/2011/03/numbers.html
Okay, so how are we
            going to do it?
                                counter
                                updates
             tweets
Twitter                   ?

  •   Push processing into ingest phase
  •   Make queries fast
Okay, so how are we
     going to do it?
For each tweet,
increment a bunch of counters,
such that answering a query
is as easy as reading some counters
Preparing the data
                              12:32:15 I like #trafficlights
Step 1: Get a feed of    12:33:43 Nobody expects...
        the tweets     12:33:49 I ate a #bee; woe is...
                      12:34:04 Man, @acunu rocks!

Step 2: Tokenise the
        tweet

Step 3: Increment counters            [1234, man]   +1
        in time buckets for           [1234, acunu] +1
        each token                    [1234, rock] +1
Querying
                            start: [01/05/11, acunu]
Step 1: Do a range query    end:   [30/05/11, acunu]

                                       Key            #Mentions
                              [01/05/11 00:01, acunu]    3
Step 2: Result table          [01/05/11 00:02, acunu]    5
                                        ...              ...


                              90

Step 3: Plot pretty graph     45
                               0
                                   May Jun Jul Aug Sept Oct Nov
Except it’s not that easy...
• Cassandra best practice is to use RandomPartitioner,
  so not possible to range queries on rows
• Could manually work out each row in range, do lots of
  point gets
  • This would suck - each query would be 100’s of random
    IOs on disk
• Need to use wide rows, range query is a column slice,
  each query ~1 IO - Denormalisation
So instead of this...
                              Key            #Mentions
                     [01/05/11 00:01, acunu]    3
                     [01/05/11 00:02, acunu]    5
                               ...              ...




                     We do this
                  Key           00:01       00:02        ...
            [01/05/11, acunu]     3           5          ...
            [02/05/11, acunu]    12           4          ...
                    ...           ...                    ...

Row key is ‘big’                     Column key is ‘small’
 time bucket                             time bucket
Demo
./painbird.py -u tom_wilkie

    http://ec2-176-34-212-226.eu-
 west-1.compute.amazonaws.com:8000
Now its your
  turn.....
1. Get a twitter account - http://twitter.com

2. Get some Cassandra VMs - http://goo.gl/Ruqlt

3. Cluster them up

4. Get the code - http://goo.gl/VxXKB

5. Implement the missing bits!

6. (Prizes for the ones that spot bugs!)
Get some Cassandra
       VMs


http://goo.gl/O9hkv
Cluster them up
• SSH in, set password (on both!)
• Check you can connect to the UI
• Use UI (click add host)
Get the code
SSH into one of the VMs:
# curl https://acunu-
oss.s3.amazonaws.com/
painbird-2.tar.gz | tar zxf -
# cd release
# ./painbird.py -u tom_wilkie
Implement the “core”

• In core.py
• def insert_tweet(cassandra, tweet):
• def do_query(cassandra, term, start, finish):
Check you data
-bash-3.2$ cassandra-cli
Connected to: "Test Cluster" on localhost/9160
Welcome to Cassandra CLI version 1.0.8.acunu2
Type 'help;' or '?' for help.
Type 'quit;' or 'exit;' to quit.

[default@unknown] use painbird;
Authenticated to keyspace: painbird
[default@painbird] list keywords;
Using default limit of 100
-------------------
RowKey: m-5-"woe
=> (counter=11, value=1)
Extensions
Extensions
UI
• Pretty graphs
• Automatically periodically update?
• Search multiple terms
Painbird
•   mentions of multiple terms
•   sentiment analysis - http://www.nltk.org/
•   filtering by multiple fields (geo + keyword)

More Related Content

Similar to Realtime Analytics on the Twitter Firehose with Apache Cassandra - Denormalized London 2012

Earthquake shakes twitter users
Earthquake shakes twitter usersEarthquake shakes twitter users
Earthquake shakes twitter usersEshan Mudwel
 
It's all about the timing
It's all about the timingIt's all about the timing
It's all about the timingSensePost
 
Reflex - How Does It Work? (extended dance remix)
Reflex - How Does It Work? (extended dance remix)Reflex - How Does It Work? (extended dance remix)
Reflex - How Does It Work? (extended dance remix)Rocco Caputo
 
Dataflow - A Unified Model for Batch and Streaming Data Processing
Dataflow - A Unified Model for Batch and Streaming Data ProcessingDataflow - A Unified Model for Batch and Streaming Data Processing
Dataflow - A Unified Model for Batch and Streaming Data ProcessingDoiT International
 
Psychtoolbox (PTB) practical course by Volodymyr B. Bogdanov, Lyon/Kyiv 2018...
Psychtoolbox (PTB) practical course  by Volodymyr B. Bogdanov, Lyon/Kyiv 2018...Psychtoolbox (PTB) practical course  by Volodymyr B. Bogdanov, Lyon/Kyiv 2018...
Psychtoolbox (PTB) practical course by Volodymyr B. Bogdanov, Lyon/Kyiv 2018...Volodymyr Bogdanov
 
Pointers lesson 3 (data types and pointer arithmetics)
Pointers lesson 3 (data types and pointer arithmetics)Pointers lesson 3 (data types and pointer arithmetics)
Pointers lesson 3 (data types and pointer arithmetics)SetuMaheshwari1
 
200603ash.pdf Performance Tuning Oracle DB
200603ash.pdf Performance Tuning Oracle DB200603ash.pdf Performance Tuning Oracle DB
200603ash.pdf Performance Tuning Oracle DBcookie1969
 
Analyze database system using a 3 d method
Analyze database system using a 3 d methodAnalyze database system using a 3 d method
Analyze database system using a 3 d methodAjith Narayanan
 
Asynchronous Awesome
Asynchronous AwesomeAsynchronous Awesome
Asynchronous AwesomeFlip Sasser
 
Upgrade and Self Help Tools for Atlassian Admins
Upgrade and Self Help Tools for Atlassian AdminsUpgrade and Self Help Tools for Atlassian Admins
Upgrade and Self Help Tools for Atlassian AdminsAtlassian
 
SKB Kontur: Digging Cassandra cluster
SKB Kontur: Digging Cassandra clusterSKB Kontur: Digging Cassandra cluster
SKB Kontur: Digging Cassandra clusterDataStax Academy
 
Digging Cassandra Cluster
Digging Cassandra ClusterDigging Cassandra Cluster
Digging Cassandra ClusterIvan Burmistrov
 
Beyond PHP - it's not (just) about the code
Beyond PHP - it's not (just) about the codeBeyond PHP - it's not (just) about the code
Beyond PHP - it's not (just) about the codeWim Godden
 

Similar to Realtime Analytics on the Twitter Firehose with Apache Cassandra - Denormalized London 2012 (15)

More Mining
More MiningMore Mining
More Mining
 
Earthquake shakes twitter users
Earthquake shakes twitter usersEarthquake shakes twitter users
Earthquake shakes twitter users
 
It's all about the timing
It's all about the timingIt's all about the timing
It's all about the timing
 
Reflex - How Does It Work? (extended dance remix)
Reflex - How Does It Work? (extended dance remix)Reflex - How Does It Work? (extended dance remix)
Reflex - How Does It Work? (extended dance remix)
 
Dataflow - A Unified Model for Batch and Streaming Data Processing
Dataflow - A Unified Model for Batch and Streaming Data ProcessingDataflow - A Unified Model for Batch and Streaming Data Processing
Dataflow - A Unified Model for Batch and Streaming Data Processing
 
Psychtoolbox (PTB) practical course by Volodymyr B. Bogdanov, Lyon/Kyiv 2018...
Psychtoolbox (PTB) practical course  by Volodymyr B. Bogdanov, Lyon/Kyiv 2018...Psychtoolbox (PTB) practical course  by Volodymyr B. Bogdanov, Lyon/Kyiv 2018...
Psychtoolbox (PTB) practical course by Volodymyr B. Bogdanov, Lyon/Kyiv 2018...
 
Pointers lesson 3 (data types and pointer arithmetics)
Pointers lesson 3 (data types and pointer arithmetics)Pointers lesson 3 (data types and pointer arithmetics)
Pointers lesson 3 (data types and pointer arithmetics)
 
200603ash.pdf Performance Tuning Oracle DB
200603ash.pdf Performance Tuning Oracle DB200603ash.pdf Performance Tuning Oracle DB
200603ash.pdf Performance Tuning Oracle DB
 
Analyze database system using a 3 d method
Analyze database system using a 3 d methodAnalyze database system using a 3 d method
Analyze database system using a 3 d method
 
Asynchronous Awesome
Asynchronous AwesomeAsynchronous Awesome
Asynchronous Awesome
 
Upgrade and Self Help Tools for Atlassian Admins
Upgrade and Self Help Tools for Atlassian AdminsUpgrade and Self Help Tools for Atlassian Admins
Upgrade and Self Help Tools for Atlassian Admins
 
SKB Kontur: Digging Cassandra cluster
SKB Kontur: Digging Cassandra clusterSKB Kontur: Digging Cassandra cluster
SKB Kontur: Digging Cassandra cluster
 
Digging Cassandra Cluster
Digging Cassandra ClusterDigging Cassandra Cluster
Digging Cassandra Cluster
 
Storm 2012 03-29
Storm 2012 03-29Storm 2012 03-29
Storm 2012 03-29
 
Beyond PHP - it's not (just) about the code
Beyond PHP - it's not (just) about the codeBeyond PHP - it's not (just) about the code
Beyond PHP - it's not (just) about the code
 

More from Acunu

Acunu and Hailo: a realtime analytics case study on Cassandra
Acunu and Hailo: a realtime analytics case study on CassandraAcunu and Hailo: a realtime analytics case study on Cassandra
Acunu and Hailo: a realtime analytics case study on CassandraAcunu
 
Virtual nodes: Operational Aspirin
Virtual nodes: Operational AspirinVirtual nodes: Operational Aspirin
Virtual nodes: Operational AspirinAcunu
 
Acunu Analytics and Cassandra at Hailo All Your Base 2013
Acunu Analytics and Cassandra at Hailo All Your Base 2013 Acunu Analytics and Cassandra at Hailo All Your Base 2013
Acunu Analytics and Cassandra at Hailo All Your Base 2013 Acunu
 
Understanding Cassandra internals to solve real-world problems
Understanding Cassandra internals to solve real-world problemsUnderstanding Cassandra internals to solve real-world problems
Understanding Cassandra internals to solve real-world problemsAcunu
 
Acunu Analytics: Simpler Real-Time Cassandra Apps
Acunu Analytics: Simpler Real-Time Cassandra AppsAcunu Analytics: Simpler Real-Time Cassandra Apps
Acunu Analytics: Simpler Real-Time Cassandra AppsAcunu
 
Real-time Cassandra
Real-time CassandraReal-time Cassandra
Real-time CassandraAcunu
 
Realtime Analytics with Cassandra
Realtime Analytics with CassandraRealtime Analytics with Cassandra
Realtime Analytics with CassandraAcunu
 
Acunu Analytics @ Cassandra London
Acunu Analytics @ Cassandra LondonAcunu Analytics @ Cassandra London
Acunu Analytics @ Cassandra LondonAcunu
 
Exploring Big Data value for your business
Exploring Big Data value for your businessExploring Big Data value for your business
Exploring Big Data value for your businessAcunu
 
Progressive NOSQL: Cassandra
Progressive NOSQL: CassandraProgressive NOSQL: Cassandra
Progressive NOSQL: CassandraAcunu
 
Cassandra EU 2012 - Overview of Case Studies and State of the Market by 451 R...
Cassandra EU 2012 - Overview of Case Studies and State of the Market by 451 R...Cassandra EU 2012 - Overview of Case Studies and State of the Market by 451 R...
Cassandra EU 2012 - Overview of Case Studies and State of the Market by 451 R...Acunu
 
Cassandra EU 2012 - Putting the X Factor into Cassandra
Cassandra EU 2012 - Putting the X Factor into CassandraCassandra EU 2012 - Putting the X Factor into Cassandra
Cassandra EU 2012 - Putting the X Factor into CassandraAcunu
 
Cassandra EU 2012 - Netflix's Cassandra Architecture and Open Source Efforts
Cassandra EU 2012 - Netflix's Cassandra Architecture and Open Source EffortsCassandra EU 2012 - Netflix's Cassandra Architecture and Open Source Efforts
Cassandra EU 2012 - Netflix's Cassandra Architecture and Open Source EffortsAcunu
 
Next Generation Cassandra
Next Generation CassandraNext Generation Cassandra
Next Generation CassandraAcunu
 
Cassandra EU 2012 - CQL: Then, Now and When by Eric Evans
Cassandra EU 2012 - CQL: Then, Now and When by Eric Evans Cassandra EU 2012 - CQL: Then, Now and When by Eric Evans
Cassandra EU 2012 - CQL: Then, Now and When by Eric Evans Acunu
 
Cassandra EU 2012 - Storage Internals by Nicolas Favre-Felix
Cassandra EU 2012 - Storage Internals by Nicolas Favre-FelixCassandra EU 2012 - Storage Internals by Nicolas Favre-Felix
Cassandra EU 2012 - Storage Internals by Nicolas Favre-FelixAcunu
 
Cassandra EU 2012 - Highly Available: The Cassandra Distribution Model by Sam...
Cassandra EU 2012 - Highly Available: The Cassandra Distribution Model by Sam...Cassandra EU 2012 - Highly Available: The Cassandra Distribution Model by Sam...
Cassandra EU 2012 - Highly Available: The Cassandra Distribution Model by Sam...Acunu
 
Cassandra EU 2012 - Data modelling workshop by Richard Low
Cassandra EU 2012 - Data modelling workshop by Richard LowCassandra EU 2012 - Data modelling workshop by Richard Low
Cassandra EU 2012 - Data modelling workshop by Richard LowAcunu
 
Acunu Analytics
Acunu AnalyticsAcunu Analytics
Acunu AnalyticsAcunu
 
Cassandra Performance: Past, present & future
Cassandra Performance: Past, present & futureCassandra Performance: Past, present & future
Cassandra Performance: Past, present & futureAcunu
 

More from Acunu (20)

Acunu and Hailo: a realtime analytics case study on Cassandra
Acunu and Hailo: a realtime analytics case study on CassandraAcunu and Hailo: a realtime analytics case study on Cassandra
Acunu and Hailo: a realtime analytics case study on Cassandra
 
Virtual nodes: Operational Aspirin
Virtual nodes: Operational AspirinVirtual nodes: Operational Aspirin
Virtual nodes: Operational Aspirin
 
Acunu Analytics and Cassandra at Hailo All Your Base 2013
Acunu Analytics and Cassandra at Hailo All Your Base 2013 Acunu Analytics and Cassandra at Hailo All Your Base 2013
Acunu Analytics and Cassandra at Hailo All Your Base 2013
 
Understanding Cassandra internals to solve real-world problems
Understanding Cassandra internals to solve real-world problemsUnderstanding Cassandra internals to solve real-world problems
Understanding Cassandra internals to solve real-world problems
 
Acunu Analytics: Simpler Real-Time Cassandra Apps
Acunu Analytics: Simpler Real-Time Cassandra AppsAcunu Analytics: Simpler Real-Time Cassandra Apps
Acunu Analytics: Simpler Real-Time Cassandra Apps
 
Real-time Cassandra
Real-time CassandraReal-time Cassandra
Real-time Cassandra
 
Realtime Analytics with Cassandra
Realtime Analytics with CassandraRealtime Analytics with Cassandra
Realtime Analytics with Cassandra
 
Acunu Analytics @ Cassandra London
Acunu Analytics @ Cassandra LondonAcunu Analytics @ Cassandra London
Acunu Analytics @ Cassandra London
 
Exploring Big Data value for your business
Exploring Big Data value for your businessExploring Big Data value for your business
Exploring Big Data value for your business
 
Progressive NOSQL: Cassandra
Progressive NOSQL: CassandraProgressive NOSQL: Cassandra
Progressive NOSQL: Cassandra
 
Cassandra EU 2012 - Overview of Case Studies and State of the Market by 451 R...
Cassandra EU 2012 - Overview of Case Studies and State of the Market by 451 R...Cassandra EU 2012 - Overview of Case Studies and State of the Market by 451 R...
Cassandra EU 2012 - Overview of Case Studies and State of the Market by 451 R...
 
Cassandra EU 2012 - Putting the X Factor into Cassandra
Cassandra EU 2012 - Putting the X Factor into CassandraCassandra EU 2012 - Putting the X Factor into Cassandra
Cassandra EU 2012 - Putting the X Factor into Cassandra
 
Cassandra EU 2012 - Netflix's Cassandra Architecture and Open Source Efforts
Cassandra EU 2012 - Netflix's Cassandra Architecture and Open Source EffortsCassandra EU 2012 - Netflix's Cassandra Architecture and Open Source Efforts
Cassandra EU 2012 - Netflix's Cassandra Architecture and Open Source Efforts
 
Next Generation Cassandra
Next Generation CassandraNext Generation Cassandra
Next Generation Cassandra
 
Cassandra EU 2012 - CQL: Then, Now and When by Eric Evans
Cassandra EU 2012 - CQL: Then, Now and When by Eric Evans Cassandra EU 2012 - CQL: Then, Now and When by Eric Evans
Cassandra EU 2012 - CQL: Then, Now and When by Eric Evans
 
Cassandra EU 2012 - Storage Internals by Nicolas Favre-Felix
Cassandra EU 2012 - Storage Internals by Nicolas Favre-FelixCassandra EU 2012 - Storage Internals by Nicolas Favre-Felix
Cassandra EU 2012 - Storage Internals by Nicolas Favre-Felix
 
Cassandra EU 2012 - Highly Available: The Cassandra Distribution Model by Sam...
Cassandra EU 2012 - Highly Available: The Cassandra Distribution Model by Sam...Cassandra EU 2012 - Highly Available: The Cassandra Distribution Model by Sam...
Cassandra EU 2012 - Highly Available: The Cassandra Distribution Model by Sam...
 
Cassandra EU 2012 - Data modelling workshop by Richard Low
Cassandra EU 2012 - Data modelling workshop by Richard LowCassandra EU 2012 - Data modelling workshop by Richard Low
Cassandra EU 2012 - Data modelling workshop by Richard Low
 
Acunu Analytics
Acunu AnalyticsAcunu Analytics
Acunu Analytics
 
Cassandra Performance: Past, present & future
Cassandra Performance: Past, present & futureCassandra Performance: Past, present & future
Cassandra Performance: Past, present & future
 

Recently uploaded

Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersEnhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersThousandEyes
 
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024BookNet Canada
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationSafe Software
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machinePadma Pradeep
 
Understanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitectureUnderstanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitecturePixlogix Infotech
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking MenDelhi Call girls
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)Gabriella Davis
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesSinan KOZAK
 
Pigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationMichael W. Hawkins
 
How to Remove Document Management Hurdles with X-Docs?
How to Remove Document Management Hurdles with X-Docs?How to Remove Document Management Hurdles with X-Docs?
How to Remove Document Management Hurdles with X-Docs?XfilesPro
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024Rafal Los
 
Maximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxMaximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxOnBoard
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Allon Mureinik
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure servicePooja Nehwal
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationRidwan Fadjar
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticscarlostorres15106
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Patryk Bandurski
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountPuma Security, LLC
 
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...HostedbyConfluent
 

Recently uploaded (20)

Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersEnhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
 
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machine
 
Understanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitectureUnderstanding the Laravel MVC Architecture
Understanding the Laravel MVC Architecture
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen Frames
 
Pigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping Elbows
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
How to Remove Document Management Hurdles with X-Docs?
How to Remove Document Management Hurdles with X-Docs?How to Remove Document Management Hurdles with X-Docs?
How to Remove Document Management Hurdles with X-Docs?
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
Maximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxMaximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptx
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 Presentation
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path Mount
 
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
 

Realtime Analytics on the Twitter Firehose with Apache Cassandra - Denormalized London 2012

  • 1. Realtime Analytics with Cassandra or: How I Learned to Stopped Worrying and Love Counting
  • 2. Combining “big” and “real-time” is hard Live & historical Drill downs Trends... aggregates... and roll ups 2 Analytics
  • 3. What is Realtime Analytics? eg “show me the number of mentions of ‘Acunu’ per day, between May and November 2011, on Twitter” Batch (Hadoop) approach would require processing ~30 billion tweets, or ~4.2 TB of data http://blog.twitter.com/2011/03/numbers.html
  • 4. Okay, so how are we going to do it? counter updates tweets Twitter ? • Push processing into ingest phase • Make queries fast
  • 5. Okay, so how are we going to do it? For each tweet, increment a bunch of counters, such that answering a query is as easy as reading some counters
  • 6. Preparing the data 12:32:15 I like #trafficlights Step 1: Get a feed of 12:33:43 Nobody expects... the tweets 12:33:49 I ate a #bee; woe is... 12:34:04 Man, @acunu rocks! Step 2: Tokenise the tweet Step 3: Increment counters [1234, man] +1 in time buckets for [1234, acunu] +1 each token [1234, rock] +1
  • 7. Querying start: [01/05/11, acunu] Step 1: Do a range query end: [30/05/11, acunu] Key #Mentions [01/05/11 00:01, acunu] 3 Step 2: Result table [01/05/11 00:02, acunu] 5 ... ... 90 Step 3: Plot pretty graph 45 0 May Jun Jul Aug Sept Oct Nov
  • 8. Except it’s not that easy... • Cassandra best practice is to use RandomPartitioner, so not possible to range queries on rows • Could manually work out each row in range, do lots of point gets • This would suck - each query would be 100’s of random IOs on disk • Need to use wide rows, range query is a column slice, each query ~1 IO - Denormalisation
  • 9. So instead of this... Key #Mentions [01/05/11 00:01, acunu] 3 [01/05/11 00:02, acunu] 5 ... ... We do this Key 00:01 00:02 ... [01/05/11, acunu] 3 5 ... [02/05/11, acunu] 12 4 ... ... ... ... Row key is ‘big’ Column key is ‘small’ time bucket time bucket
  • 10. Demo ./painbird.py -u tom_wilkie http://ec2-176-34-212-226.eu- west-1.compute.amazonaws.com:8000
  • 11. Now its your turn.....
  • 12. 1. Get a twitter account - http://twitter.com 2. Get some Cassandra VMs - http://goo.gl/Ruqlt 3. Cluster them up 4. Get the code - http://goo.gl/VxXKB 5. Implement the missing bits! 6. (Prizes for the ones that spot bugs!)
  • 13. Get some Cassandra VMs http://goo.gl/O9hkv
  • 14. Cluster them up • SSH in, set password (on both!) • Check you can connect to the UI • Use UI (click add host)
  • 15. Get the code SSH into one of the VMs: # curl https://acunu- oss.s3.amazonaws.com/ painbird-2.tar.gz | tar zxf - # cd release # ./painbird.py -u tom_wilkie
  • 16. Implement the “core” • In core.py • def insert_tweet(cassandra, tweet): • def do_query(cassandra, term, start, finish):
  • 17. Check you data -bash-3.2$ cassandra-cli Connected to: "Test Cluster" on localhost/9160 Welcome to Cassandra CLI version 1.0.8.acunu2 Type 'help;' or '?' for help. Type 'quit;' or 'exit;' to quit. [default@unknown] use painbird; Authenticated to keyspace: painbird [default@painbird] list keywords; Using default limit of 100 ------------------- RowKey: m-5-"woe => (counter=11, value=1)
  • 19. Extensions UI • Pretty graphs • Automatically periodically update? • Search multiple terms Painbird • mentions of multiple terms • sentiment analysis - http://www.nltk.org/ • filtering by multiple fields (geo + keyword)