This presentation by Tim Moreton at NoSQL NOW! 2013 traces the history of analytics in NoSQL databases. It compares the relative strengths of normalized and denormalized approaches and describes how Twitter and Facebook have built custom denormalized systems over NoSQL to support real-time analytics. It then covers the lambda architecture and shows how Acunu Analytics provides OLAP cubes over NoSQL, combining denormalization with expressive SQL-like queries.
You can see the full talk here:
http://www.slideshare.net/Dataversity/nosql-and-big-data-analytics
3. Google Personalized Search, 2006
Collect user queries, clickstream (write only, high throughput)
BigTable: user_id, searches, clicks, profiles
MapReduce via GFS: out-of-band batch analysis to produce user profiles
Serve customised search results using user profiles (read only, low latency)
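The split on this slide (append-only collection, out-of-band batch analysis, read-only serving) can be sketched in a few lines. This is only an illustration under stated assumptions: the in-memory `event_log`, the `record` and `build_profiles` helpers, and the "most frequent search terms" profile are all hypothetical and are not taken from the talk or from Google's system.

```python
from collections import Counter, defaultdict

# Hypothetical stand-in for the write-only event table (searches and clicks per user).
event_log = []

def record(user_id, kind, value):
    """Write path: append only, never read (high throughput)."""
    event_log.append((user_id, kind, value))

def build_profiles(events, top_n=3):
    """Batch path: scan the whole log and derive one profile per user.
    Assumption: a profile is simply the user's most frequent search terms."""
    terms = defaultdict(Counter)
    for user_id, kind, value in events:
        if kind == "search":
            terms[user_id].update(value.lower().split())
    return {u: [t for t, _ in c.most_common(top_n)] for u, c in terms.items()}

record("u1", "search", "cassandra counters")
record("u1", "search", "cassandra compaction")
record("u2", "click", "http://example.com")

# Run out of band (for example nightly); the serving path only reads the result.
profiles = build_profiles(event_log)
print(profiles.get("u1"))
```

The serving path never touches `event_log`, which is what keeps reads low latency; the cost is that profiles are only as fresh as the last batch run.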
7. Building block: Distributed counters
Each tweet adds +1 to several counters:
Total tweets
By date: 2013-08-12
By user: @timmoreton
Counter value shown: 752
CASSANDRA: UPDATE table SET col = col + 1 WHERE id = 2;
RIAK: curl -i http://host:8098/buckets/x/counters/count2 -X POST -d "1"
HBASE: table.incrementColumnValue(row, cf, col, 1);
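The fan-out on this slide, where one event increments several pre-defined counters, can be sketched as follows. The in-memory `Counter` only stands in for a counter table in Cassandra, HBase or Riak, and the key layout and the `record_tweet` helper are assumptions made for illustration.

```python
from collections import Counter

# Stand-in for a distributed counter table; keys are hypothetical.
counters = Counter()

def record_tweet(user, date):
    """One incoming event becomes one +1 per pre-defined aggregate."""
    counters[("total",)] += 1
    counters[("by_date", date)] += 1
    counters[("by_user", user)] += 1

record_tweet("@timmoreton", "2013-08-12")
record_tweet("@timmoreton", "2013-08-13")
record_tweet("@other", "2013-08-12")

# Reads are single key lookups; no scan over raw tweets is needed.
print(counters[("by_user", "@timmoreton")])
```

The trade-off is that only the aggregates chosen up front can be answered this way; a new question needs a new counter.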
10. "I believe firmly that ... you should "denormalize" only as a last resort. That is, you should back off from a fully normalized design only if all other strategies for improving performance have somehow failed to meet requirements."
C J Date, 2005
14. Acunu Analytics
1 Define aggregate cubes
CREATE CUBE APPROX TOP(hashtag) WHERE browser, time GROUP BY time
Cubes: count by day, count by hour of day, uniques by hashtag
2 New events update cubes
3 Rich instant queries over cubes
SELECT TOP(x) FROM t WHERE .. GROUP BY d1, d2, ... JOIN ... HAVING.. ORDER BY ..
4 Drilldown to raw events
5 Backfill new cubes using historic data
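The cube lifecycle listed above (define, update on ingest, query, backfill) can be sketched with a toy count cube. This is not Acunu's implementation or API: the `CountCube` class, the `ingest` helper and the event fields are all hypothetical, and only exact counts are shown, not approximate TOP or uniques.

```python
from collections import defaultdict

class CountCube:
    """Toy roll-up cube: one counter per combination of dimension values."""

    def __init__(self, dimensions):
        self.dimensions = tuple(dimensions)
        self.cells = defaultdict(int)

    def update(self, event):
        # Step 2: each new event increments exactly one cell.
        self.cells[tuple(event[d] for d in self.dimensions)] += 1

    def query(self, group_by, **where):
        # Step 3: answer from cells only, never from raw events.
        out = defaultdict(int)
        for key, n in self.cells.items():
            row = dict(zip(self.dimensions, key))
            if all(row[d] == v for d, v in where.items()):
                out[row[group_by]] += n
        return dict(out)

raw_events = []                                # kept for drilldown and backfill
cubes = [CountCube(("day", "browser"))]        # step 1: define a cube

def ingest(event):
    raw_events.append(event)
    for cube in cubes:
        cube.update(event)

ingest({"day": "2013-08-12", "browser": "firefox", "hashtag": "#nosql"})
ingest({"day": "2013-08-12", "browser": "chrome", "hashtag": "#nosql"})
ingest({"day": "2013-08-13", "browser": "firefox", "hashtag": "#cassandra"})

firefox_by_day = cubes[0].query(group_by="day", browser="firefox")

# Step 5: a cube defined later is backfilled by replaying the stored raw events.
by_hashtag = CountCube(("hashtag",))
for e in raw_events:
    by_hashtag.update(e)
cubes.append(by_hashtag)
hashtag_counts = by_hashtag.query(group_by="hashtag")
```

Keeping the raw events is what makes steps 4 and 5 possible; with counters alone, a cube added later could only count events from that point on.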
15. API
event stream
event store
roll-up cubes
Ingest
Processing
dashboard queries programmatic interface
Cassandra stores raw events and aggregates
Acunu Analytics manages cubes and maps inserts and SQL-like queries to Cassandra reads and writes
PROCESSING AT INGEST
JSON, CSV, log ingest via RESTful HTTP API, Flume, Storm, AMQP
Storm, MQ HTTP
Acunu Dashboards provides rich, real-time, embeddable visualizations
SELECT AVG(r) FROM metrics GROUP BY host;
AQL Alerting
Cubes
MILLISECOND QUERIES
API for rich queries, threshold alerting
Acunu Analytics
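A query such as SELECT AVG(r) FROM metrics GROUP BY host can be served from pre-aggregated state if a running sum and count are kept per host at ingest, and a threshold alert is then a check over those averages. The sketch below assumes this sum-and-count approach; the `ingest`, `avg_by_host` and `alerts` helpers are hypothetical and do not represent AQL or the Acunu API.

```python
from collections import defaultdict

# Per-host running totals, updated at ingest (hypothetical in-memory cube).
sums = defaultdict(float)
counts = defaultdict(int)

def ingest(host, r):
    """Processing at ingest: two increments per metric event."""
    sums[host] += r
    counts[host] += 1

def avg_by_host():
    """Equivalent of AVG(r) GROUP BY host, computed from totals only."""
    return {h: sums[h] / counts[h] for h in counts}

def alerts(threshold):
    """Threshold alerting: hosts whose average exceeds the threshold."""
    return sorted(h for h, a in avg_by_host().items() if a > threshold)

ingest("web1", 20)
ingest("web1", 40)
ingest("web2", 90)

print(avg_by_host())
print(alerts(50))
```

An average cannot be maintained as a single counter, which is why the sketch stores the sum and the count separately and divides at query time.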
16. Conclusions
NoSQL is a great fit for collecting or serving datasets with some structure at high scale, performance, availability
Real-time Big Data apps can’t use unplanned rich queries
Use atomic counters to pre-materialize quantitative results in real-time, but think carefully about flexibility
Do analytics out-of-band if timeliness is unimportant
A lambda architecture combines real-time with richer processing, but adds complexity
Acunu Analytics offers real-time OLAP-style queries
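The lambda architecture mentioned in the conclusions can be sketched as a batch view plus a real-time view that are merged at query time. This is a minimal single-process illustration: the names `master_log`, `batch_view`, `speed_view`, `ingest`, `run_batch` and `query` are hypothetical, and a real deployment would also have to handle events that arrive while a batch run is in progress.

```python
from collections import Counter

master_log = []          # append-only record of every event
batch_view = Counter()   # recomputed from the full log: complete but stale
speed_view = Counter()   # incremented per event since the last batch run

def ingest(key):
    master_log.append(key)
    speed_view[key] += 1

def run_batch():
    """Batch layer: rebuild the view from scratch, then reset the speed layer."""
    batch_view.clear()
    batch_view.update(master_log)
    speed_view.clear()

def query(key):
    """Serving layer: merge the stale batch result with the recent delta."""
    return batch_view[key] + speed_view[key]

ingest("a")
ingest("a")
run_batch()
ingest("a")
ingest("b")
print(query("a"), query("b"))
```

The added complexity noted in the conclusions is visible even here: the same count is computed by two code paths that must agree.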