C* Summit EU 2013: Analytics On Top of Cassandra and Hadoop
1. Analytics on top of Cassandra and Hadoop
Dmitry Mezhensky | Mirantis Inc
#CASSANDRAEU
2. What we will discuss today
● Analytics on Cassandra using Hadoop
● Various types of statistics & implementation
● Scalability of approach
3. Problems
● Too many statistics (more than 100)
● Various types
○ Top N
○ Time series
○ Min/max/average/median
○ Extremum values on time interval
○ Fraud analysis
● Huge amount of data
● Scalability of approach
5. Top N
● Map phase generates <Key, Value> pairs; the top N is built by Value
● Reduce phase accumulates values; results are persisted to Cassandra via a custom output format
● A suitable comparator was used for the top N entities in Cassandra
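The map and reduce steps above can be sketched in plain Python, without Hadoop. This is a minimal illustration, not the implementation from the talk: the function names `map_phase`, `reduce_phase` and `top_n`, the choice of counting one hit per record, and the sample data are all assumptions.

```python
from collections import defaultdict


def map_phase(records):
    """Emit one <key, value> pair per input record (assumed: one hit per entity)."""
    for entity in records:
        yield entity, 1


def reduce_phase(pairs):
    """Accumulate the values emitted for each key."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)


def top_n(totals, n):
    """Order by accumulated value, largest first (ties broken by key), keep n."""
    ranked = sorted(totals.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[:n]


# Hypothetical input: each record names the entity it belongs to.
records = ["a", "b", "a", "c", "a", "b"]
totals = reduce_phase(map_phase(records))
result = top_n(totals, 2)
print(result)
```

In the setup described on the slide, the final ordering would be left to Cassandra's comparator rather than to an explicit `sorted` call.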
6. Top N
● On the write stage to Cassandra, sorting is done by value
● On the read stage, the first N records are the top N values
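The write-sorted, read-first-N idea can be imitated with an in-memory sorted list. The sketch below is only a stand-in for a Cassandra row: the class `SortedRow`, its methods, and the trick of negating the value to get descending order are assumptions made for illustration, not Cassandra API calls.

```python
import bisect


class SortedRow:
    """Stand-in for a wide row whose columns are kept in comparator order."""

    def __init__(self):
        # Sorted list of (-value, entity) tuples acting as column names.
        self._columns = []

    def write(self, entity, value):
        # Negating the value makes ascending order place the largest value
        # first, which plays the role of a descending comparator.
        bisect.insort(self._columns, (-value, entity))

    def read_first(self, n):
        # The first n columns are already the top n; no sorting on read.
        return [(entity, -negated) for negated, entity in self._columns[:n]]


row = SortedRow()
for entity, value in [("a", 3), ("b", 7), ("c", 5), ("d", 1)]:
    row.write(entity, value)

top2 = row.read_first(2)
print(top2)
```

The cost of ordering is paid once per write, so a read is a simple slice of the first N columns.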
7. Time series
● Map phase generates pairs <Time, Value>
● Reduce phase accumulates values (behaviour varies between statistics)
● Results are persisted to Cassandra via a custom output format, using one row key per statistic and one column per date
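A rough Python sketch of this layout follows. It assumes daily buckets, a pluggable accumulator (`sum` by default), and a nested dictionary standing in for the Cassandra row; the names `map_phase`, `reduce_phase`, `to_row` and the statistic name `daily_total` are hypothetical.

```python
from collections import defaultdict
from datetime import datetime


def map_phase(events):
    """Emit <time, value> pairs, with the time truncated to its date."""
    for timestamp, value in events:
        yield timestamp.date().isoformat(), value


def reduce_phase(pairs, accumulate=sum):
    """Group values by date and fold each group with the given accumulator."""
    buckets = defaultdict(list)
    for day, value in pairs:
        buckets[day].append(value)
    return {day: accumulate(values) for day, values in sorted(buckets.items())}


def to_row(statistic, columns):
    """One row key per statistic, one column per date."""
    return {statistic: columns}


# Hypothetical events: (timestamp, measured value).
events = [
    (datetime(2013, 10, 17, 9, 0), 4),
    (datetime(2013, 10, 17, 15, 30), 6),
    (datetime(2013, 10, 18, 11, 0), 5),
]

row = to_row("daily_total", reduce_phase(map_phase(events)))
peaks = reduce_phase(map_phase(events), accumulate=max)
print(row)
```

Swapping the accumulator (for example `max` instead of `sum`) is one way to model the different reduce behaviour per statistic.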
8. Maximum, minimum, extremum on interval
● Max/min values are simple to calculate
● Extremum on an interval is calculated similarly to time series
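One possible reading of "extremum on an interval" is the largest or smallest per-date value inside a date range. The sketch below works on a date-keyed series like the one a time-series job would produce; the function name `extremum_on_interval`, the inclusive bounds and the sample values are assumptions.

```python
def extremum_on_interval(series, start, end, pick=max):
    """Return the (date, value) column with the extreme value in [start, end].

    Dates are ISO strings, so plain string comparison orders them correctly.
    Returns None when no column falls inside the interval.
    """
    window = [(day, value) for day, value in series.items() if start <= day <= end]
    if not window:
        return None
    return pick(window, key=lambda column: column[1])


# Hypothetical per-date values for one statistic.
series = {
    "2013-10-15": 3,
    "2013-10-16": 9,
    "2013-10-17": 4,
    "2013-10-18": 12,
}

highest = extremum_on_interval(series, "2013-10-15", "2013-10-17")
lowest = extremum_on_interval(series, "2013-10-15", "2013-10-17", pick=min)
print(highest, lowest)
```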
9. Fraud analysis
● Fraud analysis runs after all statistics are calculated
● Processed data is filtered by fraud filters
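The slides do not say what the fraud filters look like, so the following is purely illustrative: it treats each filter as a predicate over already computed statistics and flags any entity that trips at least one. The filter names, thresholds and statistic fields (`count`, `max`, `average`) are all invented for the example.

```python
def exceeds_threshold(limit):
    """Hypothetical filter: flag entities with an unusually high count."""
    def check(entity, stats):
        return stats["count"] > limit
    return check


def spike_over_average(ratio):
    """Hypothetical filter: flag entities whose max dwarfs their average."""
    def check(entity, stats):
        return stats["max"] > ratio * stats["average"]
    return check


def apply_fraud_filters(statistics, filters):
    """Return entities flagged by at least one filter, in sorted order."""
    return sorted(
        entity
        for entity, stats in statistics.items()
        if any(fraud_filter(entity, stats) for fraud_filter in filters)
    )


# Hypothetical output of the earlier statistics jobs.
statistics = {
    "user-1": {"count": 12, "max": 5, "average": 4.0},
    "user-2": {"count": 900, "max": 40, "average": 35.0},
    "user-3": {"count": 20, "max": 90, "average": 6.0},
}

flagged = apply_fraud_filters(
    statistics, [exceeds_threshold(500), spike_over_average(10)]
)
print(flagged)
```

Running the filters as a separate pass matches the ordering on the slide: they only need the finished statistics, not the raw input.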
10. Scalability approach
● Data is read from and written to Cassandra only
● Hadoop is elastically scalable
● Cassandra is elastically scalable
● No bottleneck