This document describes using Cassandra for a high-volume data ingestion and real-time analysis system. It outlines the deficiencies of the previous solution and how Cassandra improves it. The new solution uses Cassandra to capture messages from an e-commerce site at over 5,000 messages per second. It stores the data in Cassandra for real-time queries and analysis without lag, providing a single consolidated view across data centers. This enables low-latency troubleshooting and real-time dashboard updates.
C* Summit 2013: Large Scale Data Ingestion, Processing and Analysis: Then, Now & Future by Ameet Chaubal and Fausto Inestroza
1. Cassandra &
Next Generation Analysis
Cassandra for a high-velocity data
ingestion and real-time analysis system.
Ameet Chaubal & Fausto Inestroza
2. Presentation Route
• Describe conventional technology solution
• Highlight deficiencies
• Showcase new solution implemented using Cassandra
• Lay out architecture with improvements
3. Business Case
• Capture messages from a high-volume e-commerce site
• Store them into a database
• Perform near real-time queries for troubleshooting
• Perform deeper analysis à la BI
4. Olden Days…
[Diagram] eCommerce Website → JMS Queue → Transient Storage (RDBMS) → Data Warehouse → Analysis
5. Business Case, Details…
Messages: 5,000 msg/sec (~250 million/day)
Message size: 1 KB
[Diagram] eCommerce Website → JMS Queue → Transient Storage (RDBMS) → Data Warehouse
• Decouple UI from storage
• Multiple sinks
• Dedicated storage for triage
• Data analysis
• Business intelligence
6. What’s the problem?
[Diagram] SITE I and SITE II each: JMS Queue → Transient Storage → Batch Load → Data Warehouse
• Queue replication problems
• Message loss
• Other applications affected in case of failover
• Triage data isolated
• No universal view
• Data consolidation adds delay
• Inability to keep up with increasing messages
• Analysis always lagging the action
• No low-latency queries
7. Problems Recap
• Over 5,000 msg/sec: high write speed required
• Extraction & load very slow: ETL from transient storage to the data warehouse takes over 4 hours
• Analysis always lags events by hours: ETL performed in batches 4 hours apart
• No high availability: no geo-redundancy for transient storage
• Data stored in disparate buckets: no universal view of data for “triage” applications/troubleshooting
• No dashboard: no low-latency queries
• No immediate alert or pattern detection: no real-time analysis
9. Role of Data Model
Before we get there: what features are missing from Cassandra in comparison to a traditional RDBMS?
10. Shortcomings… Opportunities
• No joins across Column Families
• No analytical functions such as sum, count, …
• Difficulty constructing “WHERE” clause predicates across composite columns
• Inability to order a range of keys under the Random Partitioner
11. Importance of Data model - Cassandra
• In lieu of JOINS, “smart” de-normalization techniques
are crucial.
• Need to use “FEATURES” of Cassandra to effectively
model the business rules and business data
• “Client” or “Application” code becomes extremely
important.
• “APPLICATION” + “DATABASE” => Full Package
12. Features of Cassandra Modeling
• “WIDE” Column Family
– Organize data in “horizontal” as opposed to “vertical” fashion as in RDBMS
• Automatic Sorting of Columns
– Important to “MODEL” the data in “COLUMNS” as opposed to rows.
• Faster Access to ALL COLUMNS of a Row Key
– All columns of a row key are stored on ONE server => fast iteration/aggregation
• Useful info in “COLUMN NAME”
– Groundbreaking from an RDBMS perspective
– Enables “MORE” “INFORMATION” to be PACKED
– “COLUMN” as an entity becomes “MORE POWERFUL”
• COMPOSITE Column NAMES:
– Column names can be COMPOSITES, made up of multiple components
– Auto-sorting still works
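The auto-sorting of composite column names can be sketched as follows (an illustrative Python analogy, not the talk's code; Cassandra compares composite names component by component, much like Python compares tuples):

```python
# Assumption: composite column names (DC, timestamp, user, message) are
# kept sorted by Cassandra component-by-component, like tuple ordering.
columns = [
    ("dc1", "2013-06-11T08:13:07", "user42", "msg-a"),
    ("dc2", "2013-06-11T08:13:02", "user07", "msg-b"),
    ("dc1", "2013-06-11T08:13:02", "user99", "msg-c"),
]

# Sorting tuples mimics Cassandra's on-disk order: by DC, then time, then user.
ordered = sorted(columns)
```

Because the ordering is by leading components first, a slice of columns for one DC comes back already time-ordered.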
13. Data Model
Wide rows with sharding. Row key = “<min>|<part#>”.
Role of partition #:
• Each row is stored by a single server; at 5,000 × 60 = 300,000 events per minute, that would put a large load for a minute on a single server.
• A “partition” contraption aims to “break” this huge row, remove hotspots and spread the load to possibly all servers.
• The # of partitions is some multiple of the # of servers.
• A finite # of partitions still keeps the row key meaningful, i.e. we can construct the keys for a certain minute and fetch records for them.
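The “<min>|<part#>” scheme above can be sketched in a few lines (helper names and the partition count are illustrative assumptions, not from the talk):

```python
# Assumption: 12 partitions, some multiple of the server count.
NUM_PARTITIONS = 12

def row_key(minute_bucket, seq):
    # Spread one minute's events across a finite set of rows so no
    # single server absorbs all 300,000 writes for that minute.
    return f"{minute_bucket}|{seq % NUM_PARTITIONS}"

# Writes for one minute land on NUM_PARTITIONS distinct row keys:
keys = {row_key("2012-07-18-08-13", i) for i in range(300_000)}

# Because the partition count is finite, a reader can reconstruct every
# key for a given minute without any index:
read_keys = [f"2012-07-18-08-13|{p}" for p in range(NUM_PARTITIONS)]
```

This is the point of the last bullet: the row key stays meaningful, so fetching a minute's data is a fixed, enumerable set of row reads.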
14. Composite Columns
• Composite columns:
– Actual message stored as part of the composite column
• Variable granularity grouping
– Minute: row key based on minute

Min_partition (TEXT) | DC:TimeUUID:UserID:Message (Composite) | …
2012-07-18-08-13-p-1 | Status | …
2012-07-19-11-21-p-3 | Status | …
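A sketch of how one such column name could be assembled under a minute-sharded row key (the helper and field values are hypothetical; `uuid.uuid1()` stands in for Cassandra's time-based TimeUUID):

```python
import uuid

def composite_name(dc, user_id, message):
    # Pack DC, TimeUUID, user id and the message itself into one
    # composite column name, as in the table above.
    return (dc, str(uuid.uuid1()), user_id, message)

# "<minute>-p-<partition#>" row key, per the sharding scheme.
row = "2012-07-18-08-13-p-1"
col = composite_name("dc1", "user42", "checkout failed")
```

Storing the message in the column name itself means one wide row holds a minute's worth of ordered events with no separate value lookup.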
16. Geo-Redundancy
[Diagram] Data Center 1 (RW), Data Center 2 (RW), Data Center 3 (RO), Data Center 4 (RO)
17. Data Consolidation and Extraction
• Single view of data across multiple locations
• Data extraction can be performed in parallel
• Data extraction performed in a dedicated cluster of machines
18. Low-Latency & Batch Applications
• Triaging
– Troubleshooting customer issues within 10 minutes of occurrence
– Feeding a dashboard of live feed data through aggregations performed in Counter CFs
• Analysis
– Analytical and ad hoc queries to eventually replace the need for a remote data warehouse
– Map/Reduce via Hive without ETL
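The Counter-CF dashboard aggregation can be sketched like this (an illustrative stand-in, assuming per-minute counter columns; a Python dict plays the role of a Cassandra Counter column family):

```python
from collections import defaultdict

# Stand-in for a Counter CF row: (minute, status) -> count.
counters = defaultdict(int)

def record_event(minute_bucket, status):
    # One cheap increment per event; the dashboard then reads back
    # the counters for the last N minutes instead of scanning raw messages.
    counters[(minute_bucket, status)] += 1

for _ in range(3):
    record_event("2012-07-18-08-13", "error")
record_event("2012-07-18-08-13", "ok")
```

Pre-aggregating at write time is what makes the live dashboard a low-latency read rather than a batch query.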
19. Opportunities Remaining
• Near real-time pattern detection and response
• Message loss in the JMS queue
• JMS queue replication
• Reducing the impact of queue failover on other applications
25. What do we need?
• Scalability
• Reliability
• Fault-tolerance
• Data types, size, velocity
• Mission-critical data
• Processing, computation, etc.
• Time series / pattern analysis
• Multiple use cases
26. How do we get this from Storm?
• Processing guarantees
• Low-level primitives
• Parallelization
• Robust fail-over strategies
• Scalability
• Reliability
• Fault-tolerance
• Processing, computation, etc.
29. Integration with Cassandra
• Cassandra
– Optimal for time series data
– Near-linearly scalable
– Low read/write latency
– Scales in conjunction with Storm
• Custom Bolt
– Uses the Hector API to access Cassandra
– Creates dynamic columns per request
– Stores relevant network data
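The custom bolt's per-tuple behavior can be sketched as follows (a hedged Python sketch with hypothetical names; the talk's bolt is written in Java against the Hector API, which is abstracted here as a `cassandra_insert` callback):

```python
def execute(tup, cassandra_insert):
    # Assumed tuple shape from upstream bolts: (ip, location, speed).
    ip, location, speed = tup
    row_key = location            # e.g. group network data by location
    column = (ip, speed)          # dynamic column created per request
    cassandra_insert(row_key, column)  # one low-latency write per tuple

# Capture writes with a fake sink in place of a real Cassandra client.
written = []
execute(("10.0.0.1", "NY", 512.0), lambda rk, col: written.append((rk, col)))
```

The key design point from the slide is that columns are created dynamically per tuple, so no schema change is needed as new IPs appear.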
33. Branching and Joins
[Topology diagram] Kafka Spout → Pre-process → Sessionize → Calculate N/W Speed per Session → Update Speed per IP → Join → Compare Speed → Store in Cassandra
Stream 1: Tuple (ip 1) → Tuple (ip 1/NY)
Stream 2 (Speed by Location): Tuple (NY) → Tuple (ip 1/NY)
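The join step in the diagram can be sketched as below (an illustrative simulation with made-up sample values, not Storm code: stream 1 carries per-IP speeds tagged with a location, stream 2 carries per-location reference speeds, and the join keys both on location before comparing):

```python
# Stream 1: (ip, location, speed) tuples after sessionizing.
per_ip = [("10.0.0.1", "NY", 512.0)]
# Stream 2: reference speed by location.
by_location = {"NY": 480.0}

def join_and_compare(ip_tuples, location_speeds):
    for ip, loc, speed in ip_tuples:
        baseline = location_speeds[loc]      # join on the shared location key
        yield (ip, loc, speed > baseline)    # compare speed vs. baseline
```

In the real topology the joined result is what gets written to Cassandra by the custom bolt.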
34. Lessons Learned
• Rebalance Topology
• Tweak parallelism in bolt
• Isolation of Topologies
• Use TimeUUIDUtils
• Log4j level set to INFO by default