Spark with Cassandra by Christopher Batey

  1. 1. Spark with Cassandra @chbatey
  2. 2. @chbatey Christopher Batey - @chbatey Freelance Software engineer/Devops/Architect Loves: Breaking software Distributed systems Hates: Fragile software Untested code :( Introduction
  3. 3. Audience
  4. 4. Assumption is…
  5. 5. Overview Cassandra architecture Modelling time series - weather data for many stations What can be done with pure C* When to introduce Spark
  6. 6. What do we use Spark for? Batch processing Machine Learning Ad-hoc querying of large datasets Streaming processing
  7. 7. What do we use Cassandra for? Operational Database OLTP
  8. 8. Cassandra overview
  9. 9. @chbatey Master slave Master Async replication Slave
  10. 10. @chbatey Sharding
  11. 11. @chbatey The other way
  12. 12. @chbatey Consistent hashing. Rows: jim (age: 36, car: ford, gender: M), carol (age: 37, car: bmw, gender: F), johnny (age: 12, gender: M), suzy (age: 10, gender: F). Partition key hash values: jim 350, carol 998, johnny 50, suzy 600.
  13. 13. [Hash ring diagram: token space 0–999 split into the four ranges owned by nodes A, B, C and D]
  14. 14. Example: node A owns range 0–249 (primary key johnny, hash value 50); node B owns 250–499 (jim, 350); node C owns 500–749 (suzy, 600); node D owns 750–999 (carol, 998).
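The mechanics behind slides 12–14 can be sketched in a few lines of Scala. This is purely illustrative: the 0–999 token space and node ranges are taken from the example above, and the hash function is a stand-in, whereas a real cluster uses the Murmur3 partitioner over a much larger token range.

    // Illustrative consistent hashing: map a partition key to a token in 0..999
    // and pick the node whose range contains that token.
    object TokenRingSketch {
      // Node ranges copied from the example table (both ends inclusive)
      val ranges = Seq(("A", 0, 249), ("B", 250, 499), ("C", 500, 749), ("D", 750, 999))

      // Toy hash for the sketch only; Cassandra uses Murmur3, not String.hashCode
      def token(partitionKey: String): Int =
        math.abs(partitionKey.hashCode) % 1000

      def nodeFor(partitionKey: String): String = {
        val t = token(partitionKey)
        ranges.find { case (_, start, end) => t >= start && t <= end }.get._1
      }
    }

    // e.g. TokenRingSketch.nodeFor("carol") picks the node whose range holds carol's token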
  15. 15. @chbatey Fault tolerance Replicate each piece of data on multiple nodes Keep replicas on different racks Datacenter aware
  16. 16. [Multi-datacenter diagram: a client WRITE at CL = 1 against DC1 (RF 3) is asynchronously replicated to DC2 (RF 3). We have replication!]
  17. 17. Storing weather data CREATE TABLE raw_weather_data ( weather_station text, year int, month int, day int, hour int, temp double, PRIMARY KEY ((weather_station), year, month, day, hour) ) WITH CLUSTERING ORDER BY (year DESC, month DESC, day DESC, hour DESC);
  18. 18. @chbatey Primary key relationship PRIMARY KEY ((weatherstation_id),year,month,day,hour)
  19. 19. Data Locality weatherstation_id=‘10010:99999’ ? 1000 Node Cluster You are here!
  20. 20. Primary key relationship PRIMARY KEY ((weatherstation_id),year,month,day,hour) Partition Key Clustering Columns 10010:99999
  21. 21. Primary key relationship PRIMARY KEY ((weatherstation_id),year,month,day,hour) Partition Key Clustering Columns [wide-row diagram: partition 10010:99999 holds one temp cell per hour, e.g. 2005:12:1:7:temp through 2005:12:1:10:temp with values -5.6, -5.1, -5.3 and -4.9]
  22. 22. I have a question!! What happens if I want to do an ad-hoc query??
  23. 23. I’ve stored the data partitioned by weather id… … now I want a report for all stations
  24. 24. I’ve stored the raw weather data… … now I want rollups/ aggregates
  25. 25. Analytics Workload Isolation
  26. 26. Deployment - Spark worker on each of the Cassandra nodes - Partitions made up of LOCAL Cassandra data [diagram: a Spark worker (S) co-located with each Cassandra node (C)]
  27. 27. Cassandra Data is Distributed By Token Range [ring diagram: token values 0 and 500 marked, data spread across Node 1, Node 2, Node 3 and Node 4]
  28. 28. Cassandra Data is Distributed By Token Range [same ring diagram, ranges assigned without vnodes]
  29. 29. Cassandra Data is Distributed By Token Range [same ring diagram, ranges assigned with vnodes]
  30. 30. Cassandra RDD
  31. 31. Each Spark partition is made up of token ranges that live on the same node
  32. 32. Each Spark partition is made up of Cassandra partitions that are on the same node
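Slides 26–32 describe how the DataStax spark-cassandra-connector exposes a table as an RDD whose partitions are built from node-local token ranges. A minimal sketch of reading the table follows; the keyspace name isd_weather_data and the contact point are assumptions for illustration.

    // Sketch: expose raw_weather_data as a Spark RDD via the spark-cassandra-connector.
    // The connector groups token ranges that live on the same node into one Spark
    // partition, so tasks read local Cassandra data.
    import com.datastax.spark.connector._
    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("weather-analytics")
      .set("spark.cassandra.connection.host", "127.0.0.1") // assumed contact point
    val sc = new SparkContext(conf)

    val weather = sc.cassandraTable("isd_weather_data", "raw_weather_data") // keyspace name assumed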
  33. 33. Storing weather data CREATE TABLE raw_weather_data ( weather_station text, year int, month int, day int, hour int, temp double, PRIMARY KEY ((weather_station), year, month, day, hour) ) WITH CLUSTERING ORDER BY (year DESC, month DESC, day DESC, hour DESC);
  34. 34. (count: 24, mean: 14.428150, stdev: 7.092196, max: 28.034969, min: 0.675863) Partition key = Single node
  35. 35. (count: 11242, mean: 8.921956, stdev: 7.428311, max: 29.997986, min: -2.200000) No partition key = Every node
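Figures like the two StatCounter lines on slides 34–35 can be produced with the connector's select/where pushdowns plus Spark's stats(). A sketch, continuing from the SparkContext above; the station id and keyspace name are assumptions.

    // Restricted to one partition (one station, one day): only the replicas
    // owning that partition are read.
    val oneDay = sc.cassandraTable("isd_weather_data", "raw_weather_data")
      .select("temp")
      .where("weather_station = ? AND year = ? AND month = ? AND day = ?",
             "010010:99999", 2005, 12, 1)
      .map(_.getDouble("temp"))
    println(oneDay.stats()) // (count: ..., mean: ..., stdev: ..., max: ..., min: ...)

    // No partition key restriction: a full table scan that touches every node.
    val allTemps = sc.cassandraTable("isd_weather_data", "raw_weather_data")
      .select("temp")
      .map(_.getDouble("temp"))
    println(allTemps.stats())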
  36. 36. Not quick enough?
  37. 37. daily_aggregate_precip CREATE TABLE daily_aggregate_precip ( weather_station text, year int, month int, day int, precipitation counter, PRIMARY KEY ((weather_station), year, month, day) ) WITH CLUSTERING ORDER BY (year DESC, month DESC, day DESC); SELECT precipitation FROM daily_aggregate_precip WHERE weather_station='010010:99999' AND year=2005 AND month=12 AND day>=1 AND day <= 7;
  38. 38. Weather station info
  39. 39. 725030:14732,2008,01,01,00,5.0,-3.9,1020.4,270,4.6,2,0.0
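Before streaming, each raw line has to be parsed. The field layout below (temperature, dew point, pressure, wind direction, wind speed, sky condition, one-hour precipitation) is an assumption inferred from the sample line on slide 39.

    // Sketch: parse one comma-separated input line into a case class.
    case class RawWeather(
      weatherStation: String, year: Int, month: Int, day: Int, hour: Int,
      temp: Double, dewPoint: Double, pressure: Double,
      windDirection: Int, windSpeed: Double, skyCondition: Int, precipitation: Double)

    def parse(line: String): RawWeather = {
      val f = line.split(",")
      RawWeather(f(0), f(1).toInt, f(2).toInt, f(3).toInt, f(4).toInt,
        f(5).toDouble, f(6).toDouble, f(7).toDouble,
        f(8).toInt, f(9).toDouble, f(10).toInt, f(11).toDouble)
    }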
  40. 40. Creating a Stream
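The code for this slide is not in the transcript, so here is a minimal sketch: a StreamingContext with a one-second batch interval reading raw lines from a socket. The socket source is an assumption purely for illustration; the original pipeline may equally well have consumed from Kafka.

    import org.apache.spark.streaming.{Seconds, StreamingContext}

    // Reuse the SparkContext from the earlier sketch
    val ssc = new StreamingContext(sc, Seconds(1))
    val rawLines = ssc.socketTextStream("localhost", 9999) // assumed source
    val rawWeather = rawLines.map(parse)                   // parse() from the sketch above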
  41. 41. Saving the raw data
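A sketch of persisting each parsed record into raw_weather_data with the connector's DStream support. The connector's default mapper matches snake_case columns to camelCase fields (weather_station to weatherStation); only the columns in the table from slide 17 are written.

    import com.datastax.spark.connector._
    import com.datastax.spark.connector.streaming._

    rawWeather.saveToCassandra("isd_weather_data", "raw_weather_data",
      SomeColumns("weather_station", "year", "month", "day", "hour", "temp"))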
  42. 42. Building an aggregate CREATE TABLE daily_aggregate_precip ( weather_station text, year int, month int, day int, precipitation counter, PRIMARY KEY ((weather_station), year, month, day) ) WITH CLUSTERING ORDER BY (year DESC, month DESC, day DESC); CQL Counter
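A sketch of keeping daily_aggregate_precip up to date from the same stream. The connector applies writes to a CQL counter column as increments, so each micro-batch adds its precipitation to the running daily total; the table and keyspace names follow the assumptions above.

    rawWeather
      .map(w => (w.weatherStation, w.year, w.month, w.day, w.precipitation))
      .saveToCassandra("isd_weather_data", "daily_aggregate_precip",
        SomeColumns("weather_station", "year", "month", "day", "precipitation"))

    ssc.start()
    ssc.awaitTermination()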
  43. 43. Want more Spark/C* goodness? @helenaedelson
  44. 44. Conclusion Cassandra = OLTP database at large scale. Spark can be used to do complex queries within a partition, or analytical queries over an entire table. Spark Streaming keeps tables up to date.
  45. 45. Thanks for listening Questions later? @chbatey
