After a brief technical introduction to Apache Cassandra, we'll move into the exciting world of Apache Spark integration and learn how to turn your transactional datastore into an analytics platform. Apache Spark has taken the Hadoop world by storm (no pun intended!) and is widely seen as the replacement for Hadoop MapReduce. Spark and Cassandra are perfect allies: Cassandra handles the distributed data storage, Spark handles the distributed computation.
17. Attaching to Spark and Cassandra
// Import Cassandra-specific functions on SparkContext and RDD objects
import org.apache.spark.{SparkContext, SparkConf}
import com.datastax.spark.connector._
/** setMaster("local[*]") lets us run & test the job right in our IDE */
val conf = new SparkConf(true)
  .set("spark.cassandra.connection.host", "127.0.0.1")
  .setMaster("local[*]")
  .setAppName(getClass.getName)
  // Optionally, if authentication is enabled
  .set("spark.cassandra.auth.username", "cassandra")
  .set("spark.cassandra.auth.password", "cassandra")
val sc = new SparkContext(conf)
18. Comment table example
CREATE TABLE comments_by_video (
    videoid uuid,
    commentid timeuuid,
    userid uuid,
    comment text,
    PRIMARY KEY (videoid, commentid)
) WITH CLUSTERING ORDER BY (commentid DESC);
19. Simple example
/** keyspace & table */
val tableRDD = sc.cassandraTable("killrvideo", "comments_by_video")

/** get a simple count of all the rows in the comments_by_video table */
val rowCount = tableRDD.count()
println(s"Total Rows in Comments Table: $rowCount")
sc.stop()
20. Simple example
[Diagram: the Spark Connector runs `SELECT * FROM killrvideo.comments_by_video` on each executor; the results become the partitions of a Spark RDD]
21. Using CQL
SELECT userid
FROM comments_by_video
WHERE videoid = 01860584-de45-018f-12be-5f81704e8033
// The where clause is pushed down to Cassandra; bind the uuid as a UUID value
val cqlRDD = sc.cassandraTable("killrvideo", "comments_by_video")
  .select("userid")
  .where("videoid = ?",
    java.util.UUID.fromString("01860584-de45-018f-12be-5f81704e8033"))
22. spark-sql> SELECT cast(videoid AS string) videoid, count(*) c
FROM comments_by_video
GROUP BY cast(videoid AS string)
ORDER BY c DESC
LIMIT 10;
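The GROUP BY / ORDER BY / LIMIT logic of that query can be sketched with plain Scala collections, no cluster required. The videoid values and counts below are made up purely for illustration:

```scala
// Hypothetical in-memory stand-in for comments_by_video rows: (videoid, userid)
val comments = Seq(
  ("vid-a", "u1"), ("vid-a", "u2"), ("vid-a", "u3"),
  ("vid-b", "u4"), ("vid-b", "u5"),
  ("vid-c", "u6")
)

// GROUP BY videoid, count(*) c, ORDER BY c DESC, LIMIT 10
val top10 = comments
  .groupBy { case (videoid, _) => videoid }       // GROUP BY videoid
  .map { case (videoid, rows) => (videoid, rows.size) } // count(*)
  .toSeq
  .sortBy { case (_, c) => -c }                   // ORDER BY c DESC
  .take(10)                                       // LIMIT 10

top10.foreach { case (videoid, c) => println(s"$videoid -> $c") }
```

On a real cluster the same shape of computation is distributed across Spark partitions, but the per-key counting and final top-N ordering are identical.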
23. Saving back to Cassandra
// Create insert data. The table's primary key is (videoid, commentid),
// so every row must include a commentid (a timeuuid) as well.
import com.datastax.driver.core.utils.UUIDs

val collection = sc.parallelize(Seq(
  ("01860584-de45-018f-12be-5f81704e8033", UUIDs.timeBased(),
    "Great video", "cdaf6bd5-8914-29e0-f0b6-8b0bc6156777"),
  ("01860584-de45-018f-12be-5f81704e8033", UUIDs.timeBased(),
    "Hated it", "cdaf6bd5-8914-29e0-f0b6-8b0bc6156777")))

// Insert data into the table
collection.saveToCassandra("killrvideo", "comments_by_video",
  SomeColumns("videoid", "commentid", "comment", "userid"))
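Since comments_by_video's primary key is (videoid, commentid), every saved row needs both values. Instead of bare tuples, rows can also be modeled as a case class whose field names match the table's columns; the Comment type below is a hypothetical sketch, not part of the connector API:

```scala
import java.util.UUID

// Hypothetical row type mirroring comments_by_video. In the real table,
// commentid is a timeuuid (version 1 UUID), so use a time-based generator
// such as the Java driver's UUIDs.timeBased() when actually writing.
case class Comment(videoid: UUID, commentid: UUID, userid: UUID, comment: String)

val row = Comment(
  videoid   = UUID.fromString("01860584-de45-018f-12be-5f81704e8033"),
  commentid = UUID.randomUUID(), // stand-in for a timeuuid
  userid    = UUID.fromString("cdaf6bd5-8914-29e0-f0b6-8b0bc6156777"),
  comment   = "Great video"
)

// An RDD of case classes maps to columns by field name, e.g.:
// sc.parallelize(Seq(row)).saveToCassandra("killrvideo", "comments_by_video")
println(row.videoid)
```

Field-name mapping avoids the positional SomeColumns bookkeeping of the tuple version, at the cost of defining a small row type.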
24. Searching with solr_query (DSE Search)
val solrQueryRDD = sc.cassandraTable("killrvideo", "videos")
  .select("name")
  .where("solr_query = 'tags:crime*'")
solrQueryRDD.collect().foreach(row => println(row.getString("name")))