This talk focuses on building a system from scratch, showing how to perform analytical queries in near real time while still getting the benefits of Cassandra's high-performance database engine. The key subjects of the talk are:
● The splendors and miseries of NoSQL
● Apache Cassandra use-cases
● Difficulties of using MapReduce directly in Cassandra
● Amazon cloud solutions: Elastic MapReduce and S3
● “Real-enough” time analysis
The talk dives into ways of handling different kinds of semi-ad-hoc queries in Cassandra and the pitfalls of designing a schema around a specific analytics use case. Particular attention is paid to dealing with time-series data, which can present a real problem for column-family and key-value store databases.
2. › Software Engineer at Thumbtack Technology
› an active user of various NoSQL solutions
› consulting with a focus on scalability
› a significant part of my work is advising people on which solutions to use and why
› big fan of Big Data and clouds
3. › NoSQL – not a silver bullet
› Choices that we make
› Cassandra: operational workload
› Cassandra: analytical workload
› The best of both worlds
› Some benchmarks
› Conclusions
4. • well-known ways to scale:
• scale in/out, scale by function, data denormalization
• really works
• each has disadvantages
• mostly a manual process (NewSQL)
http://qsec.deviantart.com
5. › solves exactly this kind of problem
› rapid application development
› aggregate data model
› schema flexibility
› auto-scale-out
› auto-failover
› amount of data it is able to handle
› shared-nothing architecture, no SPOF
› performance
6. › splendors and miseries of the aggregate model
› the CAP theorem dilemma: Consistency, Availability, Partition Tolerance
9. Apache Cassandra (released by Facebook in 2008)
› elastic scalability & linear performance *
› dynamic schema
› very high write throughput
› tunable per-request consistency (see the sketch below)
› fault-tolerant design
› multiple-datacenter and cloud readiness
› CAS transaction support *
* http://www.datastax.com/what-we-offer/products-services/datastax-enterprise/apache-cassandra
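The per-request consistency bullet is easy to show in code. A minimal sketch with the Thrift-era pycassa client; the host, keyspace, and column family ("demo", "events") are hypothetical:

import pycassa

pool = pycassa.ConnectionPool("demo", ["127.0.0.1:9160"])
events = pycassa.ColumnFamily(pool, "events")

# critical write: wait for a quorum of replicas to acknowledge
events.insert("events::User_123", {"1380400000": "login"},
              write_consistency_level=pycassa.ConsistencyLevel.QUORUM)

# latency-sensitive read: one replica is enough
row = events.get("events::User_123",
                 read_consistency_level=pycassa.ConsistencyLevel.ONE)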
10. › Large data set on commodity hardware
› Tradeoff between speed and reliability
› Heavy-write workload
› Time-series data
http://www.datastax.com/what-we-offer/products-services/datastax-enterprise/apache-cassandra
12. [diagram: rows keyed by TIMESTAMP (12326346 … 13627236), each with its DATA fields, spread across SERVER 1 and SERVER 2]

select * from table
where timestamp > 12344567
and timestamp < 13237457

› expensive range queries across the cluster
› unless you shard by timestamp…
› …which becomes a bottleneck for a heavy-write workload
13. › all columns are sorted by name
› row – the aggregate item (never sharded)

[diagram: a Column Family with "row key 1" holding columns 1..N (values 1.1 … 1.N) and "row key 2" holding columns 1..M (values 2.1 … 2.M)]

› queries: get key, get slice, get range (sketched below)
+ combinations of these queries
+ composite columns
Super columns are discouraged and omitted here
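A hedged sketch of these three read primitives with the Thrift-era pycassa client; host, keyspace, and column-family names are assumptions:

import pycassa

pool = pycassa.ConnectionPool("demo", ["127.0.0.1:9160"])
cf = pycassa.ColumnFamily(pool, "events")

whole_row = cf.get("row key 1")                 # get key: the full row
window = cf.get("row key 1",                    # get slice: a column window
                column_start="column 1",
                column_finish="column 3")
for key, columns in cf.get_range(row_count=100):  # get range: scan over rows
    pass  # token-order scan; of limited use with RandomPartitioner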
14. › all columns are sorted by name
› row – the aggregate item (never sharded)
› get_slice(row_key, from, to, count)

[diagram: "row key 1" … "row key 5" with their timestamp columns distributed across SERVER 1 and SERVER 2]

get_slice(“row key 1”, from:“timestamp 1”, null, 11)
15. › all columns are sorted by name
› row – the aggregate item (never sharded)
› get_slice(row_key, from, to, count)

[diagram: same layout as the previous slide]

get_slice(“row key 1”, from:“timestamp 1”, null, 11)
Next page:
get_slice(“row key 1”, from:“timestamp 11”, null, 11)
Prev. page:
get_slice(“row key 1”, null, to:“timestamp 11”, 11)
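The paging pattern above maps directly onto a client call. A sketch with pycassa, mirroring the slide's count of 11 (page size 10 plus one overlapping column that anchors the next slice); all names are assumptions:

import pycassa

pool = pycassa.ConnectionPool("demo", ["127.0.0.1:9160"])
cf = pycassa.ColumnFamily(pool, "events")
PAGE = 11  # 10 items per page + 1 overlapping column

# first page: slice from "timestamp 1" onwards
page = cf.get("row key 1", column_start="timestamp 1", column_count=PAGE)

# next page: restart the slice at the last column seen
# (it is returned again, hence the +1 in PAGE)
last = list(page.keys())[-1]
next_page = cf.get("row key 1", column_start=last, column_count=PAGE)

# previous page: slice backwards from the first column of the current page
first = list(page.keys())[0]
prev_page = cf.get("row key 1", column_start=first,
                   column_count=PAGE, column_reversed=True)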
16. › Time-range with filter:
› “get all events for User J from N to M”
› “get all success events for User J from N to M”
› “get all events for all users from N to M”
17. › Time-range with filter:
› “get all events for User J from N to M”
› “get all success events for User J from N to M”
› “get all events for all users from N to M”

[diagram: one wide row per query pattern, each holding (timestamp 1 → value 1, …) columns]
events::User_123
events::success::User_123
events::success
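The three rows above imply a denormalized write path: each event is inserted once per query pattern it must serve. A sketch (client and names hypothetical):

import time
import pycassa

pool = pycassa.ConnectionPool("demo", ["127.0.0.1:9160"])
events = pycassa.ColumnFamily(pool, "events")

def record_event(user_id, status, payload):
    # column name = event timestamp, so columns stay sorted by time
    col = {str(int(time.time() * 1000)): payload}
    events.insert("events::%s" % user_id, col)                 # per-user row
    events.insert("events::%s" % status, col)                  # per-status row
    events.insert("events::%s::%s" % (status, user_id), col)   # combined row

record_event("User_123", "success", "login")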
18. › Counters:
› “get # of events for User J grouped by hour”
› “get # of events for User J grouped by day”

row key                   | 1380400000 | 1380403600
events::success::User_123 | 14         | 42
events::User_123          | 842        | 1024

(group by day – the same, but in a different column family for TTL support)
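A sketch of the hourly counter update behind this table; "events_by_hour" is a hypothetical counter column family, and pycassa's add() increments a counter column:

import time
import pycassa

pool = pycassa.ConnectionPool("demo", ["127.0.0.1:9160"])
# assumed: a CF created with default_validation_class=CounterColumnType
hourly = pycassa.ColumnFamily(pool, "events_by_hour")

def count_event(user_id, status):
    hour = int(time.time()) // 3600 * 3600      # hour bucket, e.g. 1380400000
    hourly.add("events::%s" % user_id, str(hour))                # all events
    hourly.add("events::%s::%s" % (status, user_id), str(hour))  # filtered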
19. › row key should consist of a combination of fields with a high cardinality of values:
› name, id, etc.
› boolean values are a bad option
› composite columns are a good option for that
› a timestamp may help to spread historical data
› otherwise, scalability will not be linear
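One way to read the row-key advice: compose the key from high-cardinality fields and, for historical data, a time bucket, so rows spread evenly across the cluster. A tiny illustrative sketch (the key layout is an assumption):

import time

def row_key(user_id, day=None):
    # a daily bucket keeps any single row from growing forever
    # and spreads historical data across the cluster
    day = day or time.strftime("%Y%m%d")
    return "events::%s::%s" % (user_id, day)

print(row_key("User_123"))   # e.g. events::User_123::20130928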
20. In theory, all of this is possible in real time:
› averages, 3-dimensional filters, group by, etc.
But:
› the data model is hard to tune
› aggregation options are lacking
› aggregation over historical data
21. “I want interactive reports”

[diagram: Cassandra → “auto-update somehow” → reports]

“Reports could be a little bit out of date, but I want to control this delay value”
22. › Impact on the production system
or
› Higher total cost of ownership
› Difficulties with scalability
› hard to support with multiple clusters
http://www.datastax.com/docs/0.7/map_reduce/hadoop_mr
25. › Hadoop tech stack
› Automatic deployment
› Management API
› Transient clusters
› Amazon S3 as data storage *
* copy from S3 to EMR HDFS and back
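For illustration, here is what launching a transient EMR cluster with S3 input and output looks like with today's boto3 (the deck predates boto3; bucket names, paths, instance types, and scripts are placeholders):

import boto3

emr = boto3.client("emr", region_name="us-east-1")

response = emr.run_job_flow(
    Name="nightly-report",
    ReleaseLabel="emr-5.36.0",
    Instances={
        "MasterInstanceType": "m5.xlarge",
        "SlaveInstanceType": "m5.xlarge",
        "InstanceCount": 3,
        "KeepJobFlowAliveWhenNoSteps": False,  # transient: dies after the steps
    },
    Steps=[{
        "Name": "aggregate-events",
        "ActionOnFailure": "TERMINATE_CLUSTER",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": ["hadoop-streaming",
                     "-input", "s3://my-bucket/raw-events/",
                     "-output", "s3://my-bucket/reports/",
                     "-mapper", "mapper.py",
                     "-reducer", "reducer.py"],
        },
    }],
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)
print(response["JobFlowId"])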
28. › cluster lifecycle: Long-Running or Transient
› cold start = ~20 min
› tradeoff: cluster cost vs. availability
› compression and Combiner tuning may speed up jobs very much (see the sketch after this list)
› problems common to all big-data processing tools: monitoring, testability and debugging (MRUnit, local Hadoop, a smaller data set)
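The compression/Combiner point, sketched as Hadoop Streaming arguments (flag names are from stock Hadoop; paths and scripts are placeholders):

# the knobs the slide refers to, as Hadoop Streaming arguments
streaming_args = [
    "hadoop-streaming",
    "-D", "mapreduce.map.output.compress=true",               # compress shuffle
    "-D", "mapreduce.output.fileoutputformat.compress=true",  # compress output
    "-input", "s3://my-bucket/raw-events/",
    "-output", "s3://my-bucket/reports/",
    "-mapper", "mapper.py",
    "-combiner", "reducer.py",  # pre-aggregate on mappers to cut shuffle volume
    "-reducer", "reducer.py",
]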
34. real-time metrics update (sync):
› average latency – 60 msec
› processes > 2,000 events per second
› generates > 1,000 reports per second
real-time metrics update (async):
› processes > 15,000 events per second
uploading to AWS S3: slow, but multi-threading helps (see the sketch below) *
it is more than enough, but what if …
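The "multi-threading helps" remark corresponds to parallel multipart uploads; a sketch with boto3's managed transfer (bucket and key are placeholders):

import boto3
from boto3.s3.transfer import TransferConfig

s3 = boto3.client("s3")
config = TransferConfig(multipart_threshold=8 * 1024 * 1024,  # split at 8 MB
                        max_concurrency=16)                   # 16 parallel parts

s3.upload_file("events.csv", "my-bucket", "raw-events/events.csv",
               Config=config)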
35. › distributed systems force you to make decisions
› systems like Cassandra trade consistency for speed
› the CAP theorem is oversimplified
› you have many more options
› polyglot persistence can make this world a better place
› do not try to hammer every nail with the same hammer
36. › Cassandra – great for time-series data and heavy-write workloads…
› …but use cases should be clearly defined
37. › Amazon S3 is great
› simple, slow, but predictable storage
› Amazon EMR
› integration with S3 – great
› very good API, but…
› …it isn't a magic trick and requires knowledge of Hadoop and skills for effective usage