SlideShare una empresa de Scribd logo
1 de 57
Descargar para leer sin conexión
Cassandra and Spark: Love at First Sight
Nick Bailey
@nickmbailey
What’s a “DataStax”?
• DataStax Enterprise
• Production Certification
• Spark, Solr, Hadoop integration
• Monitoring and Dev Tooling
• OpsCenter
• DevCenter
• Cassandra Drivers
• Java, Python, C#, C++, Ruby, …
2
What’s a “@nickmbailey”
• Joined DataStax (formerly Riptano) in 2010
• OpsCenter Architect
• Austin Cassandra Users Meetup Organizer
3
Cassandra Summit!
• Santa Clara, September 22 - 24, 2015
• http://cassandrasummit-datastax.com/
• Free! (Priority passes available)
4
1 Cassandra
2 Spark
3 Cassandra and Spark
4 Demo
Cassandra
6
7
Cassandra
• A Linearly Scaling and Fault Tolerant Distributed
Database
!
• Fully Distributed
– Data spread over many nodes
– All nodes participate in a cluster
– All nodes are equal
– No SPOF (shared nothing)
8
Cassandra
• Linearly Scaling
– Have More Data? Add more nodes.
– Need More Throughput? Add more nodes.
9http://techblog.netflix.com/2011/11/benchmarking-cassandra-scalability-on.html
Cassandra
• Fault Tolerant
– Nodes Down != Database Down
– Datacenter Down != Database Down
10
Cassandra
• Fully Replicated
• Clients write local
• Data syncs across WAN
• Replication Factor per DC
11
US Europe
Client
Cassandra and the CAP Theorem
• The CAP Theorem limits what distributed systems can
do
!
• Consistency
• Availability
• Partition Tolerance
!
• Limits? “Pick 2 out of 3”
12
Cassandra: Distributed Architecture
13
Two knobs control Cassandra fault tolerance
• Replication Factor (server side)
– How many copies of the data should exist?
14
Client
B
AD
C
AB
A
CD
D
BC
Write A
RF=3
Two knobs control Cassandra fault tolerance
• Consistency Level (client side)
– How many replicas do we need to hear from before we
acknowledge?
15
Client
B
AD
C
AB
A
CD
D
BC
Write A
CL=QUORUM
Client
B
AD
C
AB
A
CD
D
BC
Write A
CL=ONE
Consistency Levels
• Applies to both Reads and Writes (i.e. is set on each
query)
!
• ONE – one replica from any DC
• LOCAL_ONE – one replica from local DC
• QUORUM – 51% of replicas from any DC
• LOCAL_QUORUM – 51% of replicas from local DC
• ALL – all replicas
• TWO
16
Consistency Level and Speed
• How many replicas we need to hear from can affect how
quickly we can read and write data in Cassandra
17
Client
B
AD
C
AB
A
CD
D
BC
5 µs ack
300 µs ack
12 µs ack
12 µs ack
Read A
(CL=QUORUM)
Consistency Level and Availability
• Consistency Level choice affects availability
• For example, QUORUM can tolerate one replica being
down and still be available (in RF=3)
18
Client
B
AD
C
AB
A
CD
D
BC
A=2
A=2
A=2
Read A
(CL=QUORUM)
Writes in the cluster
• Fully distributed, no SPOF
• Node that receives a request is the Coordinator for
request
• Any node can act as Coordinator
19
Client
B
AD
C
AB
A
CD
D
BC
Write A
(CL=ONE)
Coordinator Node
Reads in the cluster
• Same as writes in the cluster, reads are coordinated
• Any node can be the Coordinator Node
20
Client
B
AD
C
AB
A
CD
D
BC
Read A
(CL=QUORUM)
Coordinator Node
Reads and Eventual Consistency
• Cassandra is an AP system that is Eventually Consistent
so replicas may disagree
• Column values are timestamped
• In Cassandra, Last Write Wins (LWW)
21
Client
B
AD
C
AB
A
CD
D
BC
A=2!
Newer
A=1!
Older
A=2
Read A
(CL=QUORUM)
Christos from Netflix: “Eventual Consistency != Hopeful Consistency”
https://www.youtube.com/watch?v=lwIA8tsDXXE
Data Distribution
• Partition Key determines node placement
22
Partition Key
id='pmcfadin' lastname='McFadin'
id='jhaddad' firstname='Jon' lastname='Haddad'
id='ltillman' firstname='Luke' lastname='Tillman'
CREATE TABLE users (!
id text,!
firstname text,!
lastname text,!
PRIMARY KEY (id)!
);
Data Distribution
• The Partition Key is hashed using a consistent hashing
function (Murmur 3) and the output is used to place the data
on a node
!
!
!
!
!
!
• The data is also replicated to RF-1 other nodes
23
Partition Key
id='ltillman' firstname='Luke' lastname='Tillman'
Murmur3
id: ltillman Murmur3: A
B
AD
C
AB
A
CD
D
BC
RF=3
Cassandra Query Language (CQL)
24
Data Structures
• Keyspace is like RDBMS Database or
Schema
!
• Like RDBMS, Cassandra uses Tables to
store data
!
• Partitions can have multiple rows
25
Keyspace
Tables
Partitions
Rows
Schema Definition (DDL)
• Easy to define tables for storing data
• First part of Primary Key is the Partition Key
CREATE TABLE videos (!
videoid uuid,!
userid uuid,!
name text,!
description text,!
tags set<text>,!
added_date timestamp,!
PRIMARY KEY (videoid)!
);
Schema Definition (DDL)
CREATE TABLE videos (!
videoid uuid,!
userid uuid,!
name text,!
description text,!
tags set<text>,!
added_date timestamp,!
PRIMARY KEY (videoid)!
);
name ...
Keyboard Cat ...
Nyan Cat ...
Original Grumpy
Cat
...
videoid
689d56e5-
…
93357d73-
…
d978b136-
…
Clustering Columns
• Second part of Primary Key is Clustering Columns!
!
!
!
!
!
!
• Clustering columns affect ordering of data (on disk)
• Multiple rows per partition
28
CREATE TABLE comments_by_video (!
videoid uuid,!
commentid timeuuid,!
userid uuid,!
comment text,!
PRIMARY KEY (videoid, commentid)!
) WITH CLUSTERING ORDER BY (commentid DESC);
Clustering Columns
29
commentid ...
8982d56e5… ...
93822df62… ...
22dt62f69… ...
8319af913...
videoid
689d56e5-
…
689d56e5-
…
689d56e5-
…
93357d73-
…
Inserts and Updates
• Use INSERT or UPDATE to add and modify data
30
INSERT INTO comments_by_video (!
videoid, commentid, userid, comment)!
VALUES (!
'0fe6a...', '82be1...', 'ac346...', 'Awesome!');
UPDATE comments_by_video!
SET userid = 'ac346...', comment = 'Awesome!'!
WHERE videoid = '0fe6a...' AND commentid =
'82be1...';
Deletes
• Can specify a Time to Live (TTL) in seconds when doing
an INSERT or UPDATE
!
!
!
• Use DELETE statement to remove data
31
INSERT INTO comments_by_video
( ... )!
VALUES ( ... )!
USING TTL 86400;
DELETE FROM comments_by_video!
WHERE videoid = '0fe6a...' AND commentid =
'82be1...';
Querying
• Use SELECT to get data from your tables
!
!
!
!
• Always include Partition Key and optionally Clustering
Columns!
• Can use ORDER BY and LIMIT
!
• Use range queries (for example, by date) to slice partitions
32
SELECT * FROM comments_by_video !
WHERE videoid = 'a67cd...'!
LIMIT 10;
Cassandra Data Modeling
• Hmmm, looks like SQL, I know that…
33
Cassandra Data Modeling
• Hmmm, looks like SQL, I know that…
34
Cassandra Data Modeling
• Requires a different mindset than RDBMS modeling
• Know your data and your queries up front
• Queries drive a lot of the modeling decisions (i.e. “table
per query” pattern)
• Denormalize/Duplicate data at write time to do as few
queries as possible come read time
• Remember, disk is cheap and writes in Cassandra are
FAST
35
Other Data Modeling Concepts
• Lightweight Transactions
• JSON
• User Defined Types
• User Defined Functions
36
Spark
37
Spark
• Distributed Computing Framework
• Similar to the Hadoop map reduce engine
• Databricks
• Company by the creators of Spark
38
Spark Stack
39
Shark!
or

Spark SQL
Streaming ML
Spark (General execution engine)
Graph
Cassandra
General Purpose API
40
map reduce
General Purpose API
41
map reduce sample
filter count take
groupby fold first
sort reduceByKey partitionBy
union cogroup mapWith
join cross save
… … …
Spark Architecture
42
Worker
Master
Worker
Worker
Worker
Worker
Worker
Worker
Worker
Spark: General Concepts
43
RDD
• Resilient Distributed Dataset
• Basically, a collection of elements to work on
• Building block for Spark
44
Operation Types
• Transformations
• Lazily computed
• Actions
• Force computation
45
Example
join
filter
groupBy
Stage 3
Stage 1
Stage 2
A: B:
C: D: E:
F:
map
Cassandra and Spark
47
Cassandra and Spark: The Why
• Easy analytics on your data
• Easy transformation of your data
48
Cassandra and Spark: The Why
• Simple Deployment
49
Worker
Master
Worker
Worker Worker
Cassandra and Spark: The Why
• Workload Isolation
50
50
Cassandra Cassandra + Spark
Cassandra and Spark: The How
• Cassandra Spark Driver
• https://github.com/datastax/spark-cassandra-connector
• Compatible With
• Spark 0.9+
• Cassandra 2.0+
• DataStax Enterprise 4.5+
51
Cassandra and Spark: The How
• Cassandra tables exposed as RDDS
• Read/write from/to Cassandra
• Automatic type conversion
52
Code
53
54
// Import Cassandra-specific functions on SparkContext and RDD objects!
import com.datastax.driver.spark._!
!
!
// Spark connection options!
val conf = new SparkConf(true)!
! ! .setMaster("spark://192.168.123.10:7077")!
! ! .setAppName("cassandra-demo")!
.set(“cassandra.connection.host", "192.168.123.10") // initial contact!
.set("cassandra.username", "cassandra")!
.set("cassandra.password", "cassandra") !
!
val sc = new SparkContext(conf)
Connecting
55
Accessing
CREATE TABLE test.words (word text PRIMARY KEY, count int);!
!
INSERT INTO test.words (word, count) VALUES ('bar', 30);!
INSERT INTO test.words (word, count) VALUES ('foo', 20);
// Use table as RDD!
val rdd = sc.cassandraTable("test", "words")!
// rdd: CassandraRDD[CassandraRow] = CassandraRDD[0]!
!
rdd.toArray.foreach(println)!
// CassandraRow[word: bar, count: 30]!
// CassandraRow[word: foo, count: 20]!
!
rdd.columnNames // Stream(word, count) !
rdd.size // 2!
!
val firstRow = rdd.first // firstRow: CassandraRow = CassandraRow[word: bar, count:
30]!
firstRow.getInt("count") // Int = 30
56
Saving
val newRdd = sc.parallelize(Seq(("cat", 40), ("fox", 50)))!
// newRdd: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[2]!
!
newRdd.saveToCassandra("test", "words", Seq("word", "count"))
SELECT * FROM test.words;!
!
word | count!
------+-------!
bar | 30!
foo | 20!
cat | 40!
fox | 50!
!
(4 rows)
Questions?
57

Más contenido relacionado

La actualidad más candente

Cassandra background-and-architecture
Cassandra background-and-architectureCassandra background-and-architecture
Cassandra background-and-architecture
Markus Klems
 
Apache Cassandra Multi-Datacenter Essentials (Julien Anguenot, iLand Internet...
Apache Cassandra Multi-Datacenter Essentials (Julien Anguenot, iLand Internet...Apache Cassandra Multi-Datacenter Essentials (Julien Anguenot, iLand Internet...
Apache Cassandra Multi-Datacenter Essentials (Julien Anguenot, iLand Internet...
DataStax
 
Distribute Key Value Store
Distribute Key Value StoreDistribute Key Value Store
Distribute Key Value Store
Santal Li
 

La actualidad más candente (20)

An Overview of Apache Cassandra
An Overview of Apache CassandraAn Overview of Apache Cassandra
An Overview of Apache Cassandra
 
Cassandra internals
Cassandra internalsCassandra internals
Cassandra internals
 
Cassandra background-and-architecture
Cassandra background-and-architectureCassandra background-and-architecture
Cassandra background-and-architecture
 
Introduction to Cassandra
Introduction to CassandraIntroduction to Cassandra
Introduction to Cassandra
 
Introduction to Apache Cassandra
Introduction to Apache CassandraIntroduction to Apache Cassandra
Introduction to Apache Cassandra
 
Apache Cassandra Multi-Datacenter Essentials (Julien Anguenot, iLand Internet...
Apache Cassandra Multi-Datacenter Essentials (Julien Anguenot, iLand Internet...Apache Cassandra Multi-Datacenter Essentials (Julien Anguenot, iLand Internet...
Apache Cassandra Multi-Datacenter Essentials (Julien Anguenot, iLand Internet...
 
Apache Cassandra @Geneva JUG 2013.02.26
Apache Cassandra @Geneva JUG 2013.02.26Apache Cassandra @Geneva JUG 2013.02.26
Apache Cassandra @Geneva JUG 2013.02.26
 
Cassandra: Open Source Bigtable + Dynamo
Cassandra: Open Source Bigtable + DynamoCassandra: Open Source Bigtable + Dynamo
Cassandra: Open Source Bigtable + Dynamo
 
Introduction to Cassandra: Replication and Consistency
Introduction to Cassandra: Replication and ConsistencyIntroduction to Cassandra: Replication and Consistency
Introduction to Cassandra: Replication and Consistency
 
How to size up an Apache Cassandra cluster (Training)
How to size up an Apache Cassandra cluster (Training)How to size up an Apache Cassandra cluster (Training)
How to size up an Apache Cassandra cluster (Training)
 
Distribute Key Value Store
Distribute Key Value StoreDistribute Key Value Store
Distribute Key Value Store
 
Everyday I’m scaling... Cassandra
Everyday I’m scaling... CassandraEveryday I’m scaling... Cassandra
Everyday I’m scaling... Cassandra
 
Cassandra at Instagram 2016 (Dikang Gu, Facebook) | Cassandra Summit 2016
Cassandra at Instagram 2016 (Dikang Gu, Facebook) | Cassandra Summit 2016Cassandra at Instagram 2016 (Dikang Gu, Facebook) | Cassandra Summit 2016
Cassandra at Instagram 2016 (Dikang Gu, Facebook) | Cassandra Summit 2016
 
Cassandra EU 2012 - Netflix's Cassandra Architecture and Open Source Efforts
Cassandra EU 2012 - Netflix's Cassandra Architecture and Open Source EffortsCassandra EU 2012 - Netflix's Cassandra Architecture and Open Source Efforts
Cassandra EU 2012 - Netflix's Cassandra Architecture and Open Source Efforts
 
Building a Multi-Region Cluster at Target (Aaron Ploetz, Target) | Cassandra ...
Building a Multi-Region Cluster at Target (Aaron Ploetz, Target) | Cassandra ...Building a Multi-Region Cluster at Target (Aaron Ploetz, Target) | Cassandra ...
Building a Multi-Region Cluster at Target (Aaron Ploetz, Target) | Cassandra ...
 
BigData Developers MeetUp
BigData Developers MeetUpBigData Developers MeetUp
BigData Developers MeetUp
 
Apache Cassandra multi-datacenter essentials
Apache Cassandra multi-datacenter essentialsApache Cassandra multi-datacenter essentials
Apache Cassandra multi-datacenter essentials
 
Scylla Summit 2016: Outbrain Case Study - Lowering Latency While Doing 20X IO...
Scylla Summit 2016: Outbrain Case Study - Lowering Latency While Doing 20X IO...Scylla Summit 2016: Outbrain Case Study - Lowering Latency While Doing 20X IO...
Scylla Summit 2016: Outbrain Case Study - Lowering Latency While Doing 20X IO...
 
iland Internet Solutions: Leveraging Cassandra for real-time multi-datacenter...
iland Internet Solutions: Leveraging Cassandra for real-time multi-datacenter...iland Internet Solutions: Leveraging Cassandra for real-time multi-datacenter...
iland Internet Solutions: Leveraging Cassandra for real-time multi-datacenter...
 
Cassandra concepts, patterns and anti-patterns
Cassandra concepts, patterns and anti-patternsCassandra concepts, patterns and anti-patterns
Cassandra concepts, patterns and anti-patterns
 

Similar a Cassandra and Spark

Spark Summit EU talk by Miklos Christine paddling up the stream
Spark Summit EU talk by Miklos Christine paddling up the streamSpark Summit EU talk by Miklos Christine paddling up the stream
Spark Summit EU talk by Miklos Christine paddling up the stream
Spark Summit
 
Intro to Spark - for Denver Big Data Meetup
Intro to Spark - for Denver Big Data MeetupIntro to Spark - for Denver Big Data Meetup
Intro to Spark - for Denver Big Data Meetup
Gwen (Chen) Shapira
 

Similar a Cassandra and Spark (20)

Leveraging Cassandra for real-time multi-datacenter public cloud analytics
Leveraging Cassandra for real-time multi-datacenter public cloud analyticsLeveraging Cassandra for real-time multi-datacenter public cloud analytics
Leveraging Cassandra for real-time multi-datacenter public cloud analytics
 
Spark and cassandra (Hulu Talk)
Spark and cassandra (Hulu Talk)Spark and cassandra (Hulu Talk)
Spark and cassandra (Hulu Talk)
 
What's New in Apache Spark 2.3 & Why Should You Care
What's New in Apache Spark 2.3 & Why Should You CareWhat's New in Apache Spark 2.3 & Why Should You Care
What's New in Apache Spark 2.3 & Why Should You Care
 
Analytics with Cassandra & Spark
Analytics with Cassandra & SparkAnalytics with Cassandra & Spark
Analytics with Cassandra & Spark
 
Devops kc
Devops kcDevops kc
Devops kc
 
Jump Start with Apache Spark 2.0 on Databricks
Jump Start with Apache Spark 2.0 on DatabricksJump Start with Apache Spark 2.0 on Databricks
Jump Start with Apache Spark 2.0 on Databricks
 
A Deep Dive into Apache Cassandra for .NET Developers
A Deep Dive into Apache Cassandra for .NET DevelopersA Deep Dive into Apache Cassandra for .NET Developers
A Deep Dive into Apache Cassandra for .NET Developers
 
Cassandra & Spark for IoT
Cassandra & Spark for IoTCassandra & Spark for IoT
Cassandra & Spark for IoT
 
Streaming Analytics with Spark, Kafka, Cassandra and Akka by Helena Edelson
Streaming Analytics with Spark, Kafka, Cassandra and Akka by Helena EdelsonStreaming Analytics with Spark, Kafka, Cassandra and Akka by Helena Edelson
Streaming Analytics with Spark, Kafka, Cassandra and Akka by Helena Edelson
 
Spark Summit EU talk by Miklos Christine paddling up the stream
Spark Summit EU talk by Miklos Christine paddling up the streamSpark Summit EU talk by Miklos Christine paddling up the stream
Spark Summit EU talk by Miklos Christine paddling up the stream
 
Intro to Spark - for Denver Big Data Meetup
Intro to Spark - for Denver Big Data MeetupIntro to Spark - for Denver Big Data Meetup
Intro to Spark - for Denver Big Data Meetup
 
Oscon 2019 - Optimizing analytical queries on Cassandra by 100x
Oscon 2019 - Optimizing analytical queries on Cassandra by 100xOscon 2019 - Optimizing analytical queries on Cassandra by 100x
Oscon 2019 - Optimizing analytical queries on Cassandra by 100x
 
Streaming Analytics with Spark, Kafka, Cassandra and Akka
Streaming Analytics with Spark, Kafka, Cassandra and AkkaStreaming Analytics with Spark, Kafka, Cassandra and Akka
Streaming Analytics with Spark, Kafka, Cassandra and Akka
 
Understanding Query Plans and Spark UIs
Understanding Query Plans and Spark UIsUnderstanding Query Plans and Spark UIs
Understanding Query Plans and Spark UIs
 
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
 
Bloomreach - BloomStore Compute Cloud Infrastructure
Bloomreach - BloomStore Compute Cloud Infrastructure Bloomreach - BloomStore Compute Cloud Infrastructure
Bloomreach - BloomStore Compute Cloud Infrastructure
 
Real-Time Analytics with Apache Cassandra and Apache Spark,
Real-Time Analytics with Apache Cassandra and Apache Spark,Real-Time Analytics with Apache Cassandra and Apache Spark,
Real-Time Analytics with Apache Cassandra and Apache Spark,
 
Real-Time Analytics with Apache Cassandra and Apache Spark
Real-Time Analytics with Apache Cassandra and Apache SparkReal-Time Analytics with Apache Cassandra and Apache Spark
Real-Time Analytics with Apache Cassandra and Apache Spark
 
How Opera Syncs Tens of Millions of Browsers and Sleeps Well at Night
How Opera Syncs Tens of Millions of Browsers and Sleeps Well at NightHow Opera Syncs Tens of Millions of Browsers and Sleeps Well at Night
How Opera Syncs Tens of Millions of Browsers and Sleeps Well at Night
 
Spark Introduction
Spark IntroductionSpark Introduction
Spark Introduction
 

Más de nickmbailey

An Introduction to Cassandra on Linux
An Introduction to Cassandra on LinuxAn Introduction to Cassandra on Linux
An Introduction to Cassandra on Linux
nickmbailey
 
Introduction to Cassandra and Data Modeling
Introduction to Cassandra and Data ModelingIntroduction to Cassandra and Data Modeling
Introduction to Cassandra and Data Modeling
nickmbailey
 
CFS: Cassandra backed storage for Hadoop
CFS: Cassandra backed storage for HadoopCFS: Cassandra backed storage for Hadoop
CFS: Cassandra backed storage for Hadoop
nickmbailey
 

Más de nickmbailey (8)

Clojure at DataStax: The Long Road From Python to Clojure
Clojure at DataStax: The Long Road From Python to ClojureClojure at DataStax: The Long Road From Python to Clojure
Clojure at DataStax: The Long Road From Python to Clojure
 
Lightning fast analytics with Spark and Cassandra
Lightning fast analytics with Spark and CassandraLightning fast analytics with Spark and Cassandra
Lightning fast analytics with Spark and Cassandra
 
Cassandra and Clojure
Cassandra and ClojureCassandra and Clojure
Cassandra and Clojure
 
Introduction to Cassandra Basics
Introduction to Cassandra BasicsIntroduction to Cassandra Basics
Introduction to Cassandra Basics
 
An Introduction to Cassandra on Linux
An Introduction to Cassandra on LinuxAn Introduction to Cassandra on Linux
An Introduction to Cassandra on Linux
 
Introduction to Cassandra and Data Modeling
Introduction to Cassandra and Data ModelingIntroduction to Cassandra and Data Modeling
Introduction to Cassandra and Data Modeling
 
CFS: Cassandra backed storage for Hadoop
CFS: Cassandra backed storage for HadoopCFS: Cassandra backed storage for Hadoop
CFS: Cassandra backed storage for Hadoop
 
Clojure and the Web
Clojure and the WebClojure and the Web
Clojure and the Web
 

Último

%+27788225528 love spells in new york Psychic Readings, Attraction spells,Bri...
%+27788225528 love spells in new york Psychic Readings, Attraction spells,Bri...%+27788225528 love spells in new york Psychic Readings, Attraction spells,Bri...
%+27788225528 love spells in new york Psychic Readings, Attraction spells,Bri...
masabamasaba
 
%+27788225528 love spells in Knoxville Psychic Readings, Attraction spells,Br...
%+27788225528 love spells in Knoxville Psychic Readings, Attraction spells,Br...%+27788225528 love spells in Knoxville Psychic Readings, Attraction spells,Br...
%+27788225528 love spells in Knoxville Psychic Readings, Attraction spells,Br...
masabamasaba
 
Abortion Pills In Pretoria ](+27832195400*)[ 🏥 Women's Abortion Clinic In Pre...
Abortion Pills In Pretoria ](+27832195400*)[ 🏥 Women's Abortion Clinic In Pre...Abortion Pills In Pretoria ](+27832195400*)[ 🏥 Women's Abortion Clinic In Pre...
Abortion Pills In Pretoria ](+27832195400*)[ 🏥 Women's Abortion Clinic In Pre...
Medical / Health Care (+971588192166) Mifepristone and Misoprostol tablets 200mg
 
%+27788225528 love spells in Colorado Springs Psychic Readings, Attraction sp...
%+27788225528 love spells in Colorado Springs Psychic Readings, Attraction sp...%+27788225528 love spells in Colorado Springs Psychic Readings, Attraction sp...
%+27788225528 love spells in Colorado Springs Psychic Readings, Attraction sp...
masabamasaba
 
Abortion Pill Prices Tembisa [(+27832195400*)] 🏥 Women's Abortion Clinic in T...
Abortion Pill Prices Tembisa [(+27832195400*)] 🏥 Women's Abortion Clinic in T...Abortion Pill Prices Tembisa [(+27832195400*)] 🏥 Women's Abortion Clinic in T...
Abortion Pill Prices Tembisa [(+27832195400*)] 🏥 Women's Abortion Clinic in T...
Medical / Health Care (+971588192166) Mifepristone and Misoprostol tablets 200mg
 

Último (20)

VTU technical seminar 8Th Sem on Scikit-learn
VTU technical seminar 8Th Sem on Scikit-learnVTU technical seminar 8Th Sem on Scikit-learn
VTU technical seminar 8Th Sem on Scikit-learn
 
%+27788225528 love spells in new york Psychic Readings, Attraction spells,Bri...
%+27788225528 love spells in new york Psychic Readings, Attraction spells,Bri...%+27788225528 love spells in new york Psychic Readings, Attraction spells,Bri...
%+27788225528 love spells in new york Psychic Readings, Attraction spells,Bri...
 
%in Hazyview+277-882-255-28 abortion pills for sale in Hazyview
%in Hazyview+277-882-255-28 abortion pills for sale in Hazyview%in Hazyview+277-882-255-28 abortion pills for sale in Hazyview
%in Hazyview+277-882-255-28 abortion pills for sale in Hazyview
 
%in kempton park+277-882-255-28 abortion pills for sale in kempton park
%in kempton park+277-882-255-28 abortion pills for sale in kempton park %in kempton park+277-882-255-28 abortion pills for sale in kempton park
%in kempton park+277-882-255-28 abortion pills for sale in kempton park
 
%in Soweto+277-882-255-28 abortion pills for sale in soweto
%in Soweto+277-882-255-28 abortion pills for sale in soweto%in Soweto+277-882-255-28 abortion pills for sale in soweto
%in Soweto+277-882-255-28 abortion pills for sale in soweto
 
Artyushina_Guest lecture_YorkU CS May 2024.pptx
Artyushina_Guest lecture_YorkU CS May 2024.pptxArtyushina_Guest lecture_YorkU CS May 2024.pptx
Artyushina_Guest lecture_YorkU CS May 2024.pptx
 
8257 interfacing 2 in microprocessor for btech students
8257 interfacing 2 in microprocessor for btech students8257 interfacing 2 in microprocessor for btech students
8257 interfacing 2 in microprocessor for btech students
 
OpenChain - The Ramifications of ISO/IEC 5230 and ISO/IEC 18974 for Legal Pro...
OpenChain - The Ramifications of ISO/IEC 5230 and ISO/IEC 18974 for Legal Pro...OpenChain - The Ramifications of ISO/IEC 5230 and ISO/IEC 18974 for Legal Pro...
OpenChain - The Ramifications of ISO/IEC 5230 and ISO/IEC 18974 for Legal Pro...
 
WSO2Con2024 - From Code To Cloud: Fast Track Your Cloud Native Journey with C...
WSO2Con2024 - From Code To Cloud: Fast Track Your Cloud Native Journey with C...WSO2Con2024 - From Code To Cloud: Fast Track Your Cloud Native Journey with C...
WSO2Con2024 - From Code To Cloud: Fast Track Your Cloud Native Journey with C...
 
Direct Style Effect Systems - The Print[A] Example - A Comprehension Aid
Direct Style Effect Systems -The Print[A] Example- A Comprehension AidDirect Style Effect Systems -The Print[A] Example- A Comprehension Aid
Direct Style Effect Systems - The Print[A] Example - A Comprehension Aid
 
%in Stilfontein+277-882-255-28 abortion pills for sale in Stilfontein
%in Stilfontein+277-882-255-28 abortion pills for sale in Stilfontein%in Stilfontein+277-882-255-28 abortion pills for sale in Stilfontein
%in Stilfontein+277-882-255-28 abortion pills for sale in Stilfontein
 
%in Harare+277-882-255-28 abortion pills for sale in Harare
%in Harare+277-882-255-28 abortion pills for sale in Harare%in Harare+277-882-255-28 abortion pills for sale in Harare
%in Harare+277-882-255-28 abortion pills for sale in Harare
 
AI & Machine Learning Presentation Template
AI & Machine Learning Presentation TemplateAI & Machine Learning Presentation Template
AI & Machine Learning Presentation Template
 
Devoxx UK 2024 - Going serverless with Quarkus, GraalVM native images and AWS...
Devoxx UK 2024 - Going serverless with Quarkus, GraalVM native images and AWS...Devoxx UK 2024 - Going serverless with Quarkus, GraalVM native images and AWS...
Devoxx UK 2024 - Going serverless with Quarkus, GraalVM native images and AWS...
 
%+27788225528 love spells in Knoxville Psychic Readings, Attraction spells,Br...
%+27788225528 love spells in Knoxville Psychic Readings, Attraction spells,Br...%+27788225528 love spells in Knoxville Psychic Readings, Attraction spells,Br...
%+27788225528 love spells in Knoxville Psychic Readings, Attraction spells,Br...
 
%in Rustenburg+277-882-255-28 abortion pills for sale in Rustenburg
%in Rustenburg+277-882-255-28 abortion pills for sale in Rustenburg%in Rustenburg+277-882-255-28 abortion pills for sale in Rustenburg
%in Rustenburg+277-882-255-28 abortion pills for sale in Rustenburg
 
Abortion Pills In Pretoria ](+27832195400*)[ 🏥 Women's Abortion Clinic In Pre...
Abortion Pills In Pretoria ](+27832195400*)[ 🏥 Women's Abortion Clinic In Pre...Abortion Pills In Pretoria ](+27832195400*)[ 🏥 Women's Abortion Clinic In Pre...
Abortion Pills In Pretoria ](+27832195400*)[ 🏥 Women's Abortion Clinic In Pre...
 
Architecture decision records - How not to get lost in the past
Architecture decision records - How not to get lost in the pastArchitecture decision records - How not to get lost in the past
Architecture decision records - How not to get lost in the past
 
%+27788225528 love spells in Colorado Springs Psychic Readings, Attraction sp...
%+27788225528 love spells in Colorado Springs Psychic Readings, Attraction sp...%+27788225528 love spells in Colorado Springs Psychic Readings, Attraction sp...
%+27788225528 love spells in Colorado Springs Psychic Readings, Attraction sp...
 
Abortion Pill Prices Tembisa [(+27832195400*)] 🏥 Women's Abortion Clinic in T...
Abortion Pill Prices Tembisa [(+27832195400*)] 🏥 Women's Abortion Clinic in T...Abortion Pill Prices Tembisa [(+27832195400*)] 🏥 Women's Abortion Clinic in T...
Abortion Pill Prices Tembisa [(+27832195400*)] 🏥 Women's Abortion Clinic in T...
 

Cassandra and Spark

  • 1. Cassandra and Spark: Love at First Sight Nick Bailey @nickmbailey
  • 2. What’s a “DataStax”? • DataStax Enterprise • Production Certification • Spark, Solr, Hadoop integration • Monitoring and Dev Tooling • OpsCenter • DevCenter • Cassandra Drivers • Java, Python, C#, C++, Ruby, … 2
  • 3. What’s a “@nickmbailey” • Joined DataStax (formerly Riptano) in 2010 • OpsCenter Architect • Austin Cassandra Users Meetup Organizer 3
  • 4. Cassandra Summit! • Santa Clara, September 22 - 24, 2015 • http://cassandrasummit-datastax.com/ • Free! (Priority passes available) 4
  • 5. 1 Cassandra 2 Spark 3 Cassandra and Spark 4 Demo
  • 7. 7
  • 8. Cassandra • A Linearly Scaling and Fault Tolerant Distributed Database ! • Fully Distributed – Data spread over many nodes – All nodes participate in a cluster – All nodes are equal – No SPOF (shared nothing) 8
  • 9. Cassandra • Linearly Scaling – Have More Data? Add more nodes. – Need More Throughput? Add more nodes. 9http://techblog.netflix.com/2011/11/benchmarking-cassandra-scalability-on.html
  • 10. Cassandra • Fault Tolerant – Nodes Down != Database Down – Datacenter Down != Database Down 10
  • 11. Cassandra • Fully Replicated • Clients write local • Data syncs across WAN • Replication Factor per DC 11 US Europe Client
  • 12. Cassandra and the CAP Theorem • The CAP Theorem limits what distributed systems can do ! • Consistency • Availability • Partition Tolerance ! • Limits? “Pick 2 out of 3” 12
  • 14. Two knobs control Cassandra fault tolerance • Replication Factor (server side) – How many copies of the data should exist? 14 Client B AD C AB A CD D BC Write A RF=3
  • 15. Two knobs control Cassandra fault tolerance • Consistency Level (client side) – How many replicas do we need to hear from before we acknowledge? 15 Client B AD C AB A CD D BC Write A CL=QUORUM Client B AD C AB A CD D BC Write A CL=ONE
  • 16. Consistency Levels • Applies to both Reads and Writes (i.e. is set on each query) ! • ONE – one replica from any DC • LOCAL_ONE – one replica from local DC • QUORUM – 51% of replicas from any DC • LOCAL_QUORUM – 51% of replicas from local DC • ALL – all replicas • TWO 16
  • 17. Consistency Level and Speed • How many replicas we need to hear from can affect how quickly we can read and write data in Cassandra 17 Client B AD C AB A CD D BC 5 µs ack 300 µs ack 12 µs ack 12 µs ack Read A (CL=QUORUM)
  • 18. Consistency Level and Availability • Consistency Level choice affects availability • For example, QUORUM can tolerate one replica being down and still be available (in RF=3) 18 Client B AD C AB A CD D BC A=2 A=2 A=2 Read A (CL=QUORUM)
  • 19. Writes in the cluster • Fully distributed, no SPOF • Node that receives a request is the Coordinator for request • Any node can act as Coordinator 19 Client B AD C AB A CD D BC Write A (CL=ONE) Coordinator Node
  • 20. Reads in the cluster • Same as writes in the cluster, reads are coordinated • Any node can be the Coordinator Node 20 Client B AD C AB A CD D BC Read A (CL=QUORUM) Coordinator Node
  • 21. Reads and Eventual Consistency • Cassandra is an AP system that is Eventually Consistent so replicas may disagree • Column values are timestamped • In Cassandra, Last Write Wins (LWW) 21 Client B AD C AB A CD D BC A=2! Newer A=1! Older A=2 Read A (CL=QUORUM) Christos from Netflix: “Eventual Consistency != Hopeful Consistency” https://www.youtube.com/watch?v=lwIA8tsDXXE
  • 22. Data Distribution • Partition Key determines node placement 22 Partition Key id='pmcfadin' lastname='McFadin' id='jhaddad' firstname='Jon' lastname='Haddad' id='ltillman' firstname='Luke' lastname='Tillman' CREATE TABLE users (! id text,! firstname text,! lastname text,! PRIMARY KEY (id)! );
  • 23. Data Distribution • The Partition Key is hashed using a consistent hashing function (Murmur 3) and the output is used to place the data on a node ! ! ! ! ! ! • The data is also replicated to RF-1 other nodes 23 Partition Key id='ltillman' firstname='Luke' lastname='Tillman' Murmur3 id: ltillman Murmur3: A B AD C AB A CD D BC RF=3
  • 25. Data Structures • Keyspace is like RDBMS Database or Schema ! • Like RDBMS, Cassandra uses Tables to store data ! • Partitions can have multiple rows 25 Keyspace Tables Partitions Rows
  • 26. Schema Definition (DDL) • Easy to define tables for storing data • First part of Primary Key is the Partition Key CREATE TABLE videos (! videoid uuid,! userid uuid,! name text,! description text,! tags set<text>,! added_date timestamp,! PRIMARY KEY (videoid)! );
  • 27. Schema Definition (DDL) CREATE TABLE videos (! videoid uuid,! userid uuid,! name text,! description text,! tags set<text>,! added_date timestamp,! PRIMARY KEY (videoid)! ); name ... Keyboard Cat ... Nyan Cat ... Original Grumpy Cat ... videoid 689d56e5- … 93357d73- … d978b136- …
  • 28. Clustering Columns • Second part of Primary Key is Clustering Columns! ! ! ! ! ! ! • Clustering columns affect ordering of data (on disk) • Multiple rows per partition 28 CREATE TABLE comments_by_video (! videoid uuid,! commentid timeuuid,! userid uuid,! comment text,! PRIMARY KEY (videoid, commentid)! ) WITH CLUSTERING ORDER BY (commentid DESC);
  • 29. Clustering Columns 29 commentid ... 8982d56e5… ... 93822df62… ... 22dt62f69… ... 8319af913... videoid 689d56e5- … 689d56e5- … 689d56e5- … 93357d73- …
  • 30. Inserts and Updates • Use INSERT or UPDATE to add and modify data 30 INSERT INTO comments_by_video (! videoid, commentid, userid, comment)! VALUES (! '0fe6a...', '82be1...', 'ac346...', 'Awesome!'); UPDATE comments_by_video! SET userid = 'ac346...', comment = 'Awesome!'! WHERE videoid = '0fe6a...' AND commentid = '82be1...';
  • 31. Deletes • Can specify a Time to Live (TTL) in seconds when doing an INSERT or UPDATE ! ! ! • Use DELETE statement to remove data 31 INSERT INTO comments_by_video ( ... )! VALUES ( ... )! USING TTL 86400; DELETE FROM comments_by_video! WHERE videoid = '0fe6a...' AND commentid = '82be1...';
  • 32. Querying • Use SELECT to get data from your tables ! ! ! ! • Always include Partition Key and optionally Clustering Columns! • Can use ORDER BY and LIMIT ! • Use range queries (for example, by date) to slice partitions 32 SELECT * FROM comments_by_video ! WHERE videoid = 'a67cd...'! LIMIT 10;
  • 33. Cassandra Data Modeling • Hmmm, looks like SQL, I know that… 33
  • 34. Cassandra Data Modeling • Hmmm, looks like SQL, I know that… 34
  • 35. Cassandra Data Modeling • Requires a different mindset than RDBMS modeling • Know your data and your queries up front • Queries drive a lot of the modeling decisions (i.e. “table per query” pattern) • Denormalize/Duplicate data at write time to do as few queries as possible come read time • Remember, disk is cheap and writes in Cassandra are FAST 35
  • 36. Other Data Modeling Concepts • Lightweight Transactions • JSON • User Defined Types • User Defined Functions 36
  • 38. Spark • Distributed Computing Framework • Similar to the Hadoop map reduce engine • Databricks • Company by the creators of Spark 38
  • 39. Spark Stack 39 Shark! or
 Spark SQL Streaming ML Spark (General execution engine) Graph Cassandra
  • 41. General Purpose API 41 map reduce sample filter count take groupby fold first sort reduceByKey partitionBy union cogroup mapWith join cross save … … …
  • 44. RDD • Resilient Distributed Dataset • Basically, a collection of elements to work on • Building block for Spark 44
  • 45. Operation Types • Transformations • Lazily computed • Actions • Force computation 45
  • 48. Cassandra and Spark: The Why • Easy analytics on your data • Easy transformation of your data 48
  • 49. Cassandra and Spark: The Why • Simple Deployment 49 Worker Master Worker Worker Worker
  • 50. Cassandra and Spark: The Why • Workload Isolation 50 50 Cassandra Cassandra + Spark
  • 51. Cassandra and Spark: The How • Cassandra Spark Driver • https://github.com/datastax/spark-cassandra-connector • Compatible With • Spark 0.9+ • Cassandra 2.0+ • DataStax Enterprise 4.5+ 51
  • 52. Cassandra and Spark: The How • Cassandra tables exposed as RDDS • Read/write from/to Cassandra • Automatic type conversion 52
  • 54. 54 // Import Cassandra-specific functions on SparkContext and RDD objects! import com.datastax.driver.spark._! ! ! // Spark connection options! val conf = new SparkConf(true)! ! ! .setMaster("spark://192.168.123.10:7077")! ! ! .setAppName("cassandra-demo")! .set(“cassandra.connection.host", "192.168.123.10") // initial contact! .set("cassandra.username", "cassandra")! .set("cassandra.password", "cassandra") ! ! val sc = new SparkContext(conf) Connecting
  • 55. 55 Accessing CREATE TABLE test.words (word text PRIMARY KEY, count int);! ! INSERT INTO test.words (word, count) VALUES ('bar', 30);! INSERT INTO test.words (word, count) VALUES ('foo', 20); // Use table as RDD! val rdd = sc.cassandraTable("test", "words")! // rdd: CassandraRDD[CassandraRow] = CassandraRDD[0]! ! rdd.toArray.foreach(println)! // CassandraRow[word: bar, count: 30]! // CassandraRow[word: foo, count: 20]! ! rdd.columnNames // Stream(word, count) ! rdd.size // 2! ! val firstRow = rdd.first // firstRow: CassandraRow = CassandraRow[word: bar, count: 30]! firstRow.getInt("count") // Int = 30
  • 56. 56 Saving val newRdd = sc.parallelize(Seq(("cat", 40), ("fox", 50)))! // newRdd: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[2]! ! newRdd.saveToCassandra("test", "words", Seq("word", "count")) SELECT * FROM test.words;! ! word | count! ------+-------! bar | 30! foo | 20! cat | 40! fox | 50! ! (4 rows)