SlideShare una empresa de Scribd logo
1 de 32
Descargar para leer sin conexión
Realtime Analytics
  with Apache
   Cassandra
        Tom Wilkie
 Founder & CTO, Acunu Ltd
       @tom_wilkie
101
• BigTable-style datamodel combined with
    Dynamo-style consistency

• Simple queries - put, get, range queries
• Multi-master architecture: no SPOF
• Tunable consistency, multi-DC aware
• Optimised for random writes & random
    range queries

• Atomic counters, wide rows, composite keys
2
                                               Analytics
Combining “big” and “real-time” is hard

    Live & historical                    Drill downs
                         Trends...
      aggregates...                      and roll ups




3
                                                        Analytics
Solution              Con

                       Scalability
                         $$$


                      Not realtime


               Spartan query semantics =>
                 complex, DIY solutions

4
                                            Analytics
Example I
    eg “show me the number of mentions of
        ‘Acunu’ per day, between May and
          November 2011, on Twitter”


    Batch (Hadoop) approach would require
    processing ~30 billion tweets, or ~4.2
                 TB of data
                  http://blog.twitter.com/2011/03/numbers.html


5
                                                                 Analytics
Okay, so how are we going to
                   do it?

    For each tweet,
    increment a bunch of counters,
    such that answering a query
    is as easy as reading some counters


6
                                          Analytics
Preparing the data
                              12:32:15 I like #trafficlights
Step 1: Get a feed of    12:33:43 Nobody expects...
        the tweets     12:33:49 I ate a #bee; woe is...
                      12:34:04 Man, @acunu rocks!

Step 2: Tokenise the
        tweet

Step 3: Increment counters            [1234, man]   +1
        in time buckets for           [1234, acunu] +1
        each token                    [1234, rock] +1
7
                                                              Analytics
Querying
                            start: [01/05/11, acunu]
Step 1: Do a range query    end:   [30/05/11, acunu]

                                       Key            #Mentions
                              [01/05/11 00:01, acunu]    3
Step 2: Result table          [01/05/11 00:02, acunu]    5
                                        ...              ...


                              90

Step 3: Plot pretty graph     45
                               0
                                   May Jun Jul Aug Sept Oct Nov
8
                                                              Analytics
k2                  k4




          k3                  k1

    Cassandra keys distributed based on
       hash or row key, ie randomly
9
                                          Analytics
Instead of this...
                                   Key            #Mentions
                          [01/05/11 00:01, acunu]    3
                          [01/05/11 00:02, acunu]    5
                                    ...              ...




                          We do this
                       Key           00:01       00:02        ...
                 [01/05/11, acunu]     3           5          ...
                 [02/05/11, acunu]    12           4          ...
                         ...           ...                    ...

     Row key is ‘big’                     Column key is ‘small’
      time bucket                             time bucket
10
                                                                    Analytics
Towards a more
     general solution...
       (Example II)



11
                           Analytics
count
                grouped by ...
                    day
  count
 distinct
(session)
     count       ... geography

avg(duration)
                  ... browser


12
                          Analytics
21:00      all→1345    :00→45      :01→62      :02→87       ...

                         22:00      all→3221    :00→22      :00→19     :02→104       ...
{
     cust_id: user01,      ...                                                       ...

     session_id: 102,      UK        all→228    user01→1   user14→12   user99→7      ...
     geography: UK,
                           US        all→354    user01→4   user04→8    user56→17     ...
     browser: IE,
     time: 22:02,          ...

}                       UK, 22:00   all→1904       ...

                           ∅        all→87314   UK→238     US→354         ...




13
                                                                                 Analytics
21:00      all→1345     :00→45     :01→62      :02→87       ...

                         22:00      all→3222     :00→22     :00→19     :02→105       ...
{
     cust_id: user01,      ...                                                       ...

     session_id: 102,      UK        all→229    user01→2   user14→12   user99→7      ...
     geography: UK,
                           US        all→354    user01→4   user04→8    user56→17     ...
     browser: IE,
     time: 22:02,          ...

}                       UK, 22:00   all→1905       ...

                           ∅        all→87315   UK→239     US→354         ...




14
                                                                                 Analytics
21:00      all→1345    :00→45      :01→62      :02→87       ...

      22:00      all→3221    :00→22      :00→19     :02→104       ...

        ...                                                       ...

        UK        all→228    user01→1   user14→12   user99→7      ...

        US        all→354    user01→4   user04→8    user56→17     ...

        ...

     UK, 22:00   all→1904       ...

        ∅        all→87314   UK→238     US→354         ...




15
                                                              Analytics
where time 21:00-22:00
 count(*)
                          21:00      all→1345    :00→45      :01→62      :02→87       ...

                          22:00      all→3222    :00→22      :01→19     :02→105       ...

                            ...                                                       ...

                            UK        all→229    user01→2   user14→12   user99→7      ...

                            US        all→354    user01→4   user04→8    user56→17     ...

                            ...

                         UK, 22:00   all→1905       ...

                            ∅        all→87315   UK→239     US→354         ...




16
                                                                                  Analytics
where time 21:00-22:00
 count(*)
                           21:00      all→1345    :00→45      :01→62      :02→87       ...


where time 22:00-23:00,    22:00      all→3222    :00→22      :01→19     :02→105       ...


 group by minute             ...                                                       ...

                             UK        all→229    user01→2   user14→12   user99→7      ...

                             US        all→354    user01→4   user04→8    user56→17     ...

                             ...

                          UK, 22:00   all→1905       ...

                             ∅        all→87315   UK→239     US→354         ...




17
                                                                                   Analytics
where time 21:00-22:00
 count(*)
                           21:00      all→1345     :00→45     :01→62      :02→87       ...


where time 22:00-23:00,    22:00      all→3222    :00→22      :01→19     :02→105       ...


 group by minute             ...                                                       ...

                             UK        all→229    user01→2   user14→12   user99→7      ...


where geography=UK           US        all→354    user01→4   user04→8    user56→17     ...


 group all by user,          ...

                          UK, 22:00   all→1905       ...

                             ∅        all→87315   UK→239      US→354        ...




18
                                                                                   Analytics
where time 21:00-22:00
 count(*)
                           21:00      all→1345     :00→45     :01→62      :02→87       ...


where time 22:00-23:00,    22:00      all→3222    :00→22      :01→19     :02→105       ...


 group by minute             ...                                                       ...

                             UK        all→229    user01→2   user14→12   user99→7      ...


where geography=UK           US        all→354    user01→4   user04→8    user56→17     ...


 group all by user,          ...

                          UK, 22:00   all→1905       ...

count all                    ∅        all→87315   UK→239      US→354        ...




19
                                                                                   Analytics
where time 21:00-22:00
 count(*)
                           21:00      all→1345     :00→45     :01→62      :02→87       ...


where time 22:00-23:00,    22:00      all→3222    :00→22      :01→19     :02→105       ...


 group by minute             ...                                                       ...

                             UK        all→229    user01→2   user14→12   user99→7      ...


where geography=UK           US        all→354    user01→4   user04→8    user56→17     ...


 group all by user,          ...

                          UK, 22:00   all→1905       ...

count all                    ∅        all→87315   UK→239      US→354        ...




group all by geo
20
                                                                                   Analytics
What about more than
       just aggregates?



21
                            Analytics
Approximate Analytics
                 Exact




     Real-time           Large Scale


22
                                       Analytics
Count Distinct

     Plan A: keep a list of all the things you’ve seen
               count them at query time


                Quick to update
                  ... but at scale ...
                Takes lots of space
                Takes a long time to query
23
                                                         Analytics
Approximate Distinct

     max # leading zeroes seen so far
         item          hash        leading zeroes   max so far

         x        00101001110...          2            2
         y        11010100111...          0            2
         z        00011101011...          3            3
                       ...
     ... to see a max of M takes about        2M    items

24
                                                                 Analytics
Approximate Distinct

            to reduce var, average over m=2k sub-streams

     item          hash          index, zeroes   max so far

     x       00101001110...          0, 0        0,0,0,0
     y       11010100111...          3, 1        0,0,0,1
     z       00011101011...          0, 1        1,0,0,1
                   ...
            take the harmonic mean
25
                                                              Analytics
Okay... now what?




                    Analytics
Analytics

                                     counter
                                     updates
Click stream    events
                          Acunu
Sensor data
                         Analytics
     etc




     •   Aggregate incrementally, on the fly
     •   Store live + historical aggregates
10x vs MySQL...




                  Analytics
Dashboard UI




29
                    Analytics
“Up and running in about 4 hours”
“We found out a competitor
  was scraping our data”

                       “We keep discovering use cases
                          we hadn’t thought of ”




http://vimeo.com/54026096

                                                  Analytics
"Quick, efficient and easy to
        get started"



                          "We're still finding new and
                           interesting use cases, which
                          just aren't possible with our
                               current datastores." Analytics
Thanks!

              Questions?

     http://www.acunu.com/download
            contact@acunu.com

32
                                     Analytics

Más contenido relacionado

Similar a Realtime Analytics with Apache Cassandra

Realtime analytics with Apache Cassandra - Tom Wilkie
Realtime analytics with Apache Cassandra - Tom WilkieRealtime analytics with Apache Cassandra - Tom Wilkie
Realtime analytics with Apache Cassandra - Tom WilkieJAX London
 
Realtime Analytics on the Twitter Firehose with Apache Cassandra - Denormaliz...
Realtime Analytics on the Twitter Firehose with Apache Cassandra - Denormaliz...Realtime Analytics on the Twitter Firehose with Apache Cassandra - Denormaliz...
Realtime Analytics on the Twitter Firehose with Apache Cassandra - Denormaliz...Acunu
 
Realtime Analytics on the Twitter Firehose with Cassandra
Realtime Analytics on the Twitter Firehose with CassandraRealtime Analytics on the Twitter Firehose with Cassandra
Realtime Analytics on the Twitter Firehose with CassandraAcunu
 
All Your Base
All Your BaseAll Your Base
All Your BaseAcunu
 
Harmony intune final
Harmony intune finalHarmony intune final
Harmony intune finalMongoDB
 
Playing in Tune: How We Refactored Cube to Terabyte Scale
Playing in Tune: How We Refactored Cube to Terabyte ScalePlaying in Tune: How We Refactored Cube to Terabyte Scale
Playing in Tune: How We Refactored Cube to Terabyte ScaleMongoDB
 
Brace yourselves, leap second is coming
Brace yourselves, leap second is comingBrace yourselves, leap second is coming
Brace yourselves, leap second is comingNati Cohen
 
Applying your Convolutional Neural Networks
Applying your Convolutional Neural NetworksApplying your Convolutional Neural Networks
Applying your Convolutional Neural NetworksDatabricks
 
"Practical Machine Learning With Ruby" by Iqbal Farabi (ID Ruby Community)
"Practical Machine Learning With Ruby" by Iqbal Farabi (ID Ruby Community)"Practical Machine Learning With Ruby" by Iqbal Farabi (ID Ruby Community)
"Practical Machine Learning With Ruby" by Iqbal Farabi (ID Ruby Community)Tech in Asia ID
 
Geo exploration simplified with Elastic Maps
Geo exploration simplified with Elastic MapsGeo exploration simplified with Elastic Maps
Geo exploration simplified with Elastic MapsElasticsearch
 
Columnar processing for SQL-on-Hadoop: The best is yet to come
Columnar processing for SQL-on-Hadoop: The best is yet to comeColumnar processing for SQL-on-Hadoop: The best is yet to come
Columnar processing for SQL-on-Hadoop: The best is yet to comeWang Zuo
 
Beyond PHP - it's not (just) about the code
Beyond PHP - it's not (just) about the codeBeyond PHP - it's not (just) about the code
Beyond PHP - it's not (just) about the codeWim Godden
 
MongoDB World 2018: Overnight to 60 Seconds: An IOT ETL Performance Case Study
MongoDB World 2018: Overnight to 60 Seconds: An IOT ETL Performance Case StudyMongoDB World 2018: Overnight to 60 Seconds: An IOT ETL Performance Case Study
MongoDB World 2018: Overnight to 60 Seconds: An IOT ETL Performance Case StudyMongoDB
 
Counting (Using Computer)
Counting (Using Computer)Counting (Using Computer)
Counting (Using Computer)roshmat
 
Introduction to Neural Networks
Introduction to Neural NetworksIntroduction to Neural Networks
Introduction to Neural NetworksDatabricks
 
IEEE ICDM 2018 Tutorial on Blockchain Data Analytics
IEEE ICDM 2018 Tutorial on Blockchain Data AnalyticsIEEE ICDM 2018 Tutorial on Blockchain Data Analytics
IEEE ICDM 2018 Tutorial on Blockchain Data AnalyticsCuneyt Gurcan Akcora
 
Mining the Geo Needles in the Social Haystack
Mining the Geo Needles in the Social HaystackMining the Geo Needles in the Social Haystack
Mining the Geo Needles in the Social HaystackMatthew Russell
 
FluentMigrator - Dayton .NET - July 2023
FluentMigrator - Dayton .NET - July 2023FluentMigrator - Dayton .NET - July 2023
FluentMigrator - Dayton .NET - July 2023Matthew Groves
 

Similar a Realtime Analytics with Apache Cassandra (20)

Realtime analytics with Apache Cassandra - Tom Wilkie
Realtime analytics with Apache Cassandra - Tom WilkieRealtime analytics with Apache Cassandra - Tom Wilkie
Realtime analytics with Apache Cassandra - Tom Wilkie
 
Realtime Analytics on the Twitter Firehose with Apache Cassandra - Denormaliz...
Realtime Analytics on the Twitter Firehose with Apache Cassandra - Denormaliz...Realtime Analytics on the Twitter Firehose with Apache Cassandra - Denormaliz...
Realtime Analytics on the Twitter Firehose with Apache Cassandra - Denormaliz...
 
Realtime Analytics on the Twitter Firehose with Cassandra
Realtime Analytics on the Twitter Firehose with CassandraRealtime Analytics on the Twitter Firehose with Cassandra
Realtime Analytics on the Twitter Firehose with Cassandra
 
All Your Base
All Your BaseAll Your Base
All Your Base
 
Harmony intune final
Harmony intune finalHarmony intune final
Harmony intune final
 
Playing in Tune: How We Refactored Cube to Terabyte Scale
Playing in Tune: How We Refactored Cube to Terabyte ScalePlaying in Tune: How We Refactored Cube to Terabyte Scale
Playing in Tune: How We Refactored Cube to Terabyte Scale
 
Brace yourselves, leap second is coming
Brace yourselves, leap second is comingBrace yourselves, leap second is coming
Brace yourselves, leap second is coming
 
Applying your Convolutional Neural Networks
Applying your Convolutional Neural NetworksApplying your Convolutional Neural Networks
Applying your Convolutional Neural Networks
 
"Practical Machine Learning With Ruby" by Iqbal Farabi (ID Ruby Community)
"Practical Machine Learning With Ruby" by Iqbal Farabi (ID Ruby Community)"Practical Machine Learning With Ruby" by Iqbal Farabi (ID Ruby Community)
"Practical Machine Learning With Ruby" by Iqbal Farabi (ID Ruby Community)
 
Geo exploration simplified with Elastic Maps
Geo exploration simplified with Elastic MapsGeo exploration simplified with Elastic Maps
Geo exploration simplified with Elastic Maps
 
Columnar processing for SQL-on-Hadoop: The best is yet to come
Columnar processing for SQL-on-Hadoop: The best is yet to comeColumnar processing for SQL-on-Hadoop: The best is yet to come
Columnar processing for SQL-on-Hadoop: The best is yet to come
 
Beyond PHP - it's not (just) about the code
Beyond PHP - it's not (just) about the codeBeyond PHP - it's not (just) about the code
Beyond PHP - it's not (just) about the code
 
Quick sort
Quick sortQuick sort
Quick sort
 
MongoDB World 2018: Overnight to 60 Seconds: An IOT ETL Performance Case Study
MongoDB World 2018: Overnight to 60 Seconds: An IOT ETL Performance Case StudyMongoDB World 2018: Overnight to 60 Seconds: An IOT ETL Performance Case Study
MongoDB World 2018: Overnight to 60 Seconds: An IOT ETL Performance Case Study
 
Counting (Using Computer)
Counting (Using Computer)Counting (Using Computer)
Counting (Using Computer)
 
Introduction to Neural Networks
Introduction to Neural NetworksIntroduction to Neural Networks
Introduction to Neural Networks
 
IEEE ICDM 2018 Tutorial on Blockchain Data Analytics
IEEE ICDM 2018 Tutorial on Blockchain Data AnalyticsIEEE ICDM 2018 Tutorial on Blockchain Data Analytics
IEEE ICDM 2018 Tutorial on Blockchain Data Analytics
 
Mining the Geo Needles in the Social Haystack
Mining the Geo Needles in the Social HaystackMining the Geo Needles in the Social Haystack
Mining the Geo Needles in the Social Haystack
 
FluentMigrator - Dayton .NET - July 2023
FluentMigrator - Dayton .NET - July 2023FluentMigrator - Dayton .NET - July 2023
FluentMigrator - Dayton .NET - July 2023
 
Se2017 query-optimizer
Se2017 query-optimizerSe2017 query-optimizer
Se2017 query-optimizer
 

Más de Acunu

Virtual nodes: Operational Aspirin
Virtual nodes: Operational AspirinVirtual nodes: Operational Aspirin
Virtual nodes: Operational AspirinAcunu
 
Acunu Analytics and Cassandra at Hailo All Your Base 2013
Acunu Analytics and Cassandra at Hailo All Your Base 2013 Acunu Analytics and Cassandra at Hailo All Your Base 2013
Acunu Analytics and Cassandra at Hailo All Your Base 2013 Acunu
 
Understanding Cassandra internals to solve real-world problems
Understanding Cassandra internals to solve real-world problemsUnderstanding Cassandra internals to solve real-world problems
Understanding Cassandra internals to solve real-world problemsAcunu
 
Acunu Analytics: Simpler Real-Time Cassandra Apps
Acunu Analytics: Simpler Real-Time Cassandra AppsAcunu Analytics: Simpler Real-Time Cassandra Apps
Acunu Analytics: Simpler Real-Time Cassandra AppsAcunu
 
Real-time Cassandra
Real-time CassandraReal-time Cassandra
Real-time CassandraAcunu
 
Realtime Analytics with Cassandra
Realtime Analytics with CassandraRealtime Analytics with Cassandra
Realtime Analytics with CassandraAcunu
 
Acunu Analytics @ Cassandra London
Acunu Analytics @ Cassandra LondonAcunu Analytics @ Cassandra London
Acunu Analytics @ Cassandra LondonAcunu
 
Exploring Big Data value for your business
Exploring Big Data value for your businessExploring Big Data value for your business
Exploring Big Data value for your businessAcunu
 
Progressive NOSQL: Cassandra
Progressive NOSQL: CassandraProgressive NOSQL: Cassandra
Progressive NOSQL: CassandraAcunu
 
Cassandra EU 2012 - Overview of Case Studies and State of the Market by 451 R...
Cassandra EU 2012 - Overview of Case Studies and State of the Market by 451 R...Cassandra EU 2012 - Overview of Case Studies and State of the Market by 451 R...
Cassandra EU 2012 - Overview of Case Studies and State of the Market by 451 R...Acunu
 
Cassandra EU 2012 - Putting the X Factor into Cassandra
Cassandra EU 2012 - Putting the X Factor into CassandraCassandra EU 2012 - Putting the X Factor into Cassandra
Cassandra EU 2012 - Putting the X Factor into CassandraAcunu
 
Cassandra EU 2012 - Netflix's Cassandra Architecture and Open Source Efforts
Cassandra EU 2012 - Netflix's Cassandra Architecture and Open Source EffortsCassandra EU 2012 - Netflix's Cassandra Architecture and Open Source Efforts
Cassandra EU 2012 - Netflix's Cassandra Architecture and Open Source EffortsAcunu
 
Next Generation Cassandra
Next Generation CassandraNext Generation Cassandra
Next Generation CassandraAcunu
 
Cassandra EU 2012 - CQL: Then, Now and When by Eric Evans
Cassandra EU 2012 - CQL: Then, Now and When by Eric Evans Cassandra EU 2012 - CQL: Then, Now and When by Eric Evans
Cassandra EU 2012 - CQL: Then, Now and When by Eric Evans Acunu
 
Cassandra EU 2012 - Storage Internals by Nicolas Favre-Felix
Cassandra EU 2012 - Storage Internals by Nicolas Favre-FelixCassandra EU 2012 - Storage Internals by Nicolas Favre-Felix
Cassandra EU 2012 - Storage Internals by Nicolas Favre-FelixAcunu
 
Cassandra EU 2012 - Highly Available: The Cassandra Distribution Model by Sam...
Cassandra EU 2012 - Highly Available: The Cassandra Distribution Model by Sam...Cassandra EU 2012 - Highly Available: The Cassandra Distribution Model by Sam...
Cassandra EU 2012 - Highly Available: The Cassandra Distribution Model by Sam...Acunu
 
Cassandra EU 2012 - Data modelling workshop by Richard Low
Cassandra EU 2012 - Data modelling workshop by Richard LowCassandra EU 2012 - Data modelling workshop by Richard Low
Cassandra EU 2012 - Data modelling workshop by Richard LowAcunu
 
Cassandra Performance: Past, present & future
Cassandra Performance: Past, present & futureCassandra Performance: Past, present & future
Cassandra Performance: Past, present & futureAcunu
 
Supercharging Cassandra - GOTO Amsterdam
Supercharging Cassandra - GOTO AmsterdamSupercharging Cassandra - GOTO Amsterdam
Supercharging Cassandra - GOTO AmsterdamAcunu
 
Cassandra & the Acunu Data Platform
Cassandra & the Acunu Data PlatformCassandra & the Acunu Data Platform
Cassandra & the Acunu Data PlatformAcunu
 

Más de Acunu (20)

Virtual nodes: Operational Aspirin
Virtual nodes: Operational AspirinVirtual nodes: Operational Aspirin
Virtual nodes: Operational Aspirin
 
Acunu Analytics and Cassandra at Hailo All Your Base 2013
Acunu Analytics and Cassandra at Hailo All Your Base 2013 Acunu Analytics and Cassandra at Hailo All Your Base 2013
Acunu Analytics and Cassandra at Hailo All Your Base 2013
 
Understanding Cassandra internals to solve real-world problems
Understanding Cassandra internals to solve real-world problemsUnderstanding Cassandra internals to solve real-world problems
Understanding Cassandra internals to solve real-world problems
 
Acunu Analytics: Simpler Real-Time Cassandra Apps
Acunu Analytics: Simpler Real-Time Cassandra AppsAcunu Analytics: Simpler Real-Time Cassandra Apps
Acunu Analytics: Simpler Real-Time Cassandra Apps
 
Real-time Cassandra
Real-time CassandraReal-time Cassandra
Real-time Cassandra
 
Realtime Analytics with Cassandra
Realtime Analytics with CassandraRealtime Analytics with Cassandra
Realtime Analytics with Cassandra
 
Acunu Analytics @ Cassandra London
Acunu Analytics @ Cassandra LondonAcunu Analytics @ Cassandra London
Acunu Analytics @ Cassandra London
 
Exploring Big Data value for your business
Exploring Big Data value for your businessExploring Big Data value for your business
Exploring Big Data value for your business
 
Progressive NOSQL: Cassandra
Progressive NOSQL: CassandraProgressive NOSQL: Cassandra
Progressive NOSQL: Cassandra
 
Cassandra EU 2012 - Overview of Case Studies and State of the Market by 451 R...
Cassandra EU 2012 - Overview of Case Studies and State of the Market by 451 R...Cassandra EU 2012 - Overview of Case Studies and State of the Market by 451 R...
Cassandra EU 2012 - Overview of Case Studies and State of the Market by 451 R...
 
Cassandra EU 2012 - Putting the X Factor into Cassandra
Cassandra EU 2012 - Putting the X Factor into CassandraCassandra EU 2012 - Putting the X Factor into Cassandra
Cassandra EU 2012 - Putting the X Factor into Cassandra
 
Cassandra EU 2012 - Netflix's Cassandra Architecture and Open Source Efforts
Cassandra EU 2012 - Netflix's Cassandra Architecture and Open Source EffortsCassandra EU 2012 - Netflix's Cassandra Architecture and Open Source Efforts
Cassandra EU 2012 - Netflix's Cassandra Architecture and Open Source Efforts
 
Next Generation Cassandra
Next Generation CassandraNext Generation Cassandra
Next Generation Cassandra
 
Cassandra EU 2012 - CQL: Then, Now and When by Eric Evans
Cassandra EU 2012 - CQL: Then, Now and When by Eric Evans Cassandra EU 2012 - CQL: Then, Now and When by Eric Evans
Cassandra EU 2012 - CQL: Then, Now and When by Eric Evans
 
Cassandra EU 2012 - Storage Internals by Nicolas Favre-Felix
Cassandra EU 2012 - Storage Internals by Nicolas Favre-FelixCassandra EU 2012 - Storage Internals by Nicolas Favre-Felix
Cassandra EU 2012 - Storage Internals by Nicolas Favre-Felix
 
Cassandra EU 2012 - Highly Available: The Cassandra Distribution Model by Sam...
Cassandra EU 2012 - Highly Available: The Cassandra Distribution Model by Sam...Cassandra EU 2012 - Highly Available: The Cassandra Distribution Model by Sam...
Cassandra EU 2012 - Highly Available: The Cassandra Distribution Model by Sam...
 
Cassandra EU 2012 - Data modelling workshop by Richard Low
Cassandra EU 2012 - Data modelling workshop by Richard LowCassandra EU 2012 - Data modelling workshop by Richard Low
Cassandra EU 2012 - Data modelling workshop by Richard Low
 
Cassandra Performance: Past, present & future
Cassandra Performance: Past, present & futureCassandra Performance: Past, present & future
Cassandra Performance: Past, present & future
 
Supercharging Cassandra - GOTO Amsterdam
Supercharging Cassandra - GOTO AmsterdamSupercharging Cassandra - GOTO Amsterdam
Supercharging Cassandra - GOTO Amsterdam
 
Cassandra & the Acunu Data Platform
Cassandra & the Acunu Data PlatformCassandra & the Acunu Data Platform
Cassandra & the Acunu Data Platform
 

Realtime Analytics with Apache Cassandra

  • 1. Realtime Analytics with Apache Cassandra Tom Wilkie Founder & CTO, Acunu Ltd @tom_wilkie
  • 2. 101 • BigTable-style datamodel combined with Dynamo-style consistency • Simple queries - put, get, range queries • Multi-master architecture: no SPOF • Tunable consistency, multi-DC aware • Optimised for random writes & random range queries • Atomic counters, wide rows, composite keys 2 Analytics
  • 3. Combining “big” and “real-time” is hard Live & historical Drill downs Trends... aggregates... and roll ups 3 Analytics
  • 4. Solution Con Scalability $$$ Not realtime Spartan query semantics => complex, DIY solutions 4 Analytics
  • 5. Example I eg “show me the number of mentions of ‘Acunu’ per day, between May and November 2011, on Twitter” Batch (Hadoop) approach would require processing ~30 billion tweets, or ~4.2 TB of data http://blog.twitter.com/2011/03/numbers.html 5 Analytics
  • 6. Okay, so how are we going to do it? For each tweet, increment a bunch of counters, such that answering a query is as easy as reading some counters 6 Analytics
  • 7. Preparing the data 12:32:15 I like #trafficlights Step 1: Get a feed of 12:33:43 Nobody expects... the tweets 12:33:49 I ate a #bee; woe is... 12:34:04 Man, @acunu rocks! Step 2: Tokenise the tweet Step 3: Increment counters [1234, man] +1 in time buckets for [1234, acunu] +1 each token [1234, rock] +1 7 Analytics
  • 8. Querying start: [01/05/11, acunu] Step 1: Do a range query end: [30/05/11, acunu] Key #Mentions [01/05/11 00:01, acunu] 3 Step 2: Result table [01/05/11 00:02, acunu] 5 ... ... 90 Step 3: Plot pretty graph 45 0 May Jun Jul Aug Sept Oct Nov 8 Analytics
  • 9. k2 k4 k3 k1 Cassandra keys distributed based on hash or row key, ie randomly 9 Analytics
  • 10. Instead of this... Key #Mentions [01/05/11 00:01, acunu] 3 [01/05/11 00:02, acunu] 5 ... ... We do this Key 00:01 00:02 ... [01/05/11, acunu] 3 5 ... [02/05/11, acunu] 12 4 ... ... ... ... Row key is ‘big’ Column key is ‘small’ time bucket time bucket 10 Analytics
  • 11. Towards a more general solution... (Example II) 11 Analytics
  • 12. count grouped by ... day count distinct (session) count ... geography avg(duration) ... browser 12 Analytics
  • 13. 21:00 all→1345 :00→45 :01→62 :02→87 ... 22:00 all→3221 :00→22 :00→19 :02→104 ... { cust_id: user01, ... ... session_id: 102, UK all→228 user01→1 user14→12 user99→7 ... geography: UK, US all→354 user01→4 user04→8 user56→17 ... browser: IE, time: 22:02, ... } UK, 22:00 all→1904 ... ∅ all→87314 UK→238 US→354 ... 13 Analytics
  • 14. 21:00 all→1345 :00→45 :01→62 :02→87 ... 22:00 all→3222 :00→22 :00→19 :02→105 ... { cust_id: user01, ... ... session_id: 102, UK all→229 user01→2 user14→12 user99→7 ... geography: UK, US all→354 user01→4 user04→8 user56→17 ... browser: IE, time: 22:02, ... } UK, 22:00 all→1905 ... ∅ all→87315 UK→239 US→354 ... 14 Analytics
  • 15. 21:00 all→1345 :00→45 :01→62 :02→87 ... 22:00 all→3221 :00→22 :00→19 :02→104 ... ... ... UK all→228 user01→1 user14→12 user99→7 ... US all→354 user01→4 user04→8 user56→17 ... ... UK, 22:00 all→1904 ... ∅ all→87314 UK→238 US→354 ... 15 Analytics
  • 16. where time 21:00-22:00 count(*) 21:00 all→1345 :00→45 :01→62 :02→87 ... 22:00 all→3222 :00→22 :01→19 :02→105 ... ... ... UK all→229 user01→2 user14→12 user99→7 ... US all→354 user01→4 user04→8 user56→17 ... ... UK, 22:00 all→1905 ... ∅ all→87315 UK→239 US→354 ... 16 Analytics
  • 17. where time 21:00-22:00 count(*) 21:00 all→1345 :00→45 :01→62 :02→87 ... where time 22:00-23:00, 22:00 all→3222 :00→22 :01→19 :02→105 ... group by minute ... ... UK all→229 user01→2 user14→12 user99→7 ... US all→354 user01→4 user04→8 user56→17 ... ... UK, 22:00 all→1905 ... ∅ all→87315 UK→239 US→354 ... 17 Analytics
  • 18. where time 21:00-22:00 count(*) 21:00 all→1345 :00→45 :01→62 :02→87 ... where time 22:00-23:00, 22:00 all→3222 :00→22 :01→19 :02→105 ... group by minute ... ... UK all→229 user01→2 user14→12 user99→7 ... where geography=UK US all→354 user01→4 user04→8 user56→17 ... group all by user, ... UK, 22:00 all→1905 ... ∅ all→87315 UK→239 US→354 ... 18 Analytics
  • 19. where time 21:00-22:00 count(*) 21:00 all→1345 :00→45 :01→62 :02→87 ... where time 22:00-23:00, 22:00 all→3222 :00→22 :01→19 :02→105 ... group by minute ... ... UK all→229 user01→2 user14→12 user99→7 ... where geography=UK US all→354 user01→4 user04→8 user56→17 ... group all by user, ... UK, 22:00 all→1905 ... count all ∅ all→87315 UK→239 US→354 ... 19 Analytics
  • 20. where time 21:00-22:00 count(*) 21:00 all→1345 :00→45 :01→62 :02→87 ... where time 22:00-23:00, 22:00 all→3222 :00→22 :01→19 :02→105 ... group by minute ... ... UK all→229 user01→2 user14→12 user99→7 ... where geography=UK US all→354 user01→4 user04→8 user56→17 ... group all by user, ... UK, 22:00 all→1905 ... count all ∅ all→87315 UK→239 US→354 ... group all by geo 20 Analytics
  • 21. What about more than just aggregates? 21 Analytics
  • 22. Approximate Analytics Exact Real-time Large Scale 22 Analytics
  • 23. Count Distinct Plan A: keep a list of all the things you’ve seen count them at query time Quick to update ... but at scale ... Takes lots of space Takes a long time to query 23 Analytics
  • 24. Approximate Distinct max # leading zeroes seen so far item hash leading zeroes max so far x 00101001110... 2 2 y 11010100111... 0 2 z 00011101011... 3 3 ... ... to see a max of M takes about 2M items 24 Analytics
  • 25. Approximate Distinct to reduce var, average over m=2k sub-streams item hash index, zeroes max so far x 00101001110... 0, 0 0,0,0,0 y 11010100111... 3, 1 0,0,0,1 z 00011101011... 0, 1 1,0,0,1 ... take the harmonic mean 25 Analytics
  • 26. Okay... now what? Analytics
  • 27. Analytics counter updates Click stream events Acunu Sensor data Analytics etc • Aggregate incrementally, on the fly • Store live + historical aggregates
  • 28. 10x vs MySQL... Analytics
  • 29. Dashboard UI 29 Analytics
  • 30. “Up and running in about 4 hours” “We found out a competitor was scraping our data” “We keep discovering use cases we hadn’t thought of ” http://vimeo.com/54026096 Analytics
  • 31. "Quick, efficient and easy to get started" "We're still finding new and interesting use cases, which just aren't possible with our current datastores." Analytics
  • 32. Thanks! Questions? http://www.acunu.com/download contact@acunu.com 32 Analytics