Spark Cassandra Connector: Past, Present and Future
Spark Cassandra
Connector
Past, Present and Future
Russell Spitzer
@RussSpitzer
Software Engineer - Datastax
The Past:
Hadoop and C*
3
Hadoop integration with C* required a bit of knowledge and was generally not very easy.
Map Reduce Code

public static class ReducerToCassandra extends
        Reducer<Text, IntWritable, Map<String, ByteBuffer>, List<ByteBuffer>>
{
    private Map<String, ByteBuffer> keys;
    private ByteBuffer key;

    protected void setup(org.apache.hadoop.mapreduce.Reducer.Context context)
            throws IOException, InterruptedException
    {
        keys = new LinkedHashMap<String, ByteBuffer>();
    }

    public void reduce(Text word, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException
    {
        int sum = 0;
        for (IntWritable val : values)
            sum += val.get();
        keys.put("word", ByteBufferUtil.bytes(word.toString()));
        context.write(keys, getBindVariables(word, sum));
    }

    private List<ByteBuffer> getBindVariables(Text word, int sum)
    {
        List<ByteBuffer> variables = new ArrayList<ByteBuffer>();
        variables.add(ByteBufferUtil.bytes(String.valueOf(sum)));
        return variables;
    }
}
Hadoop Interfaces are … difficult
4© 2015. All Rights Reserved.
https://github.com/apache/cassandra/blob/trunk/examples/hadoop_cql3_word_count/src/WordCount.java
Even simple integration with a Hadoop cluster took a lot of
experience to get right.
Hadoop Interfaces are … difficult
5© 2015. All Rights Reserved.
https://github.com/apache/cassandra/blob/trunk/examples/hadoop_cql3_word_count/src/WordCount.java
Well at least you have Pig built in right?
moredata = load 'cql://cql3ks/compmore' USING CqlNativeStorage;

insertformat = FOREACH moredata GENERATE TOTUPLE (TOTUPLE('a',x),TOTUPLE('b',y),
TOTUPLE('c',z)),TOTUPLE(data);

STORE insertformat INTO 'cql://cql3ks/compotable?output_query=UPDATE%20cql3ks.compotable%20SET%20d%20%3D%20%3F' USING CqlNativeStorage;
Even simple integration with a Hadoop cluster took a lot of
experience to get right.
Spark Offers a New Path
6© 2015. All Rights Reserved.
Core Libraries for ML/Streaming
No need for HDFS/Hadoop
Easy integration with other Data Sources
val lines = sc.textFile("data.txt")
val pairs = lines.map(s => (s, 1))
val counts = pairs.reduceByKey((a, b) => a + b)
RDD Api
df.groupBy("age").count().show()
Dataframes Api
head(filter(df, df$waiting < 50))
R Api
SELECT name FROM people
SQL API
Driver
Executor
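The three-line RDD word count above can be sketched in plain Python to show what map and reduceByKey actually do. Everything here is a local stand-in, not a Spark object; the sample lines are hypothetical.

```python
from collections import defaultdict
from functools import reduce

# Stand-in for sc.textFile("data.txt"): a small local dataset.
lines = ["great scott", "88 mph", "great scott"]

# lines.map(s => (s, 1))
pairs = [(s, 1) for s in lines]

# pairs.reduceByKey((a, b) => a + b): group values by key, then fold each group.
groups = defaultdict(list)
for key, value in pairs:
    groups[key].append(value)
counts = {key: reduce(lambda a, b: a + b, values) for key, values in groups.items()}

print(counts)  # {'great scott': 2, '88 mph': 1}
```

On a real cluster the grouping step is a shuffle across executors; the fold itself is the same idea.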
Enter The Spark Cassandra Connector
7© 2015. All Rights Reserved.
First Public Release at the Spark Summit in June 2014
If you write a Spark
application that
needs access to Cassandra,
this library is for you
-Piotr Kołaczkowski
https://github.com/datastax/spark-cassandra-connector
Open Source Software
1394 Commits
28 Contributors
Why do we even want a Distributed Analytics tool?
8© 2015. All Rights Reserved.
Why do we even want a Distributed Analytics tool?
9© 2015. All Rights Reserved.
•Generating Reports
•Direct Analytics on our data
•Cassandra Maintenance
•Making new views
•Changing partition keys
•Streaming
•Machine Learning
•ETL Data between different sources
We have small questions and big questions and
they need to work in different ways
10© 2015. All Rights Reserved.
How many shoes
did Marty buy?
How many shoes were
sold last year
compared to this year
grouped by demographic?
BIG DATA
We have small questions and big questions and
they need to work in different ways
11© 2015. All Rights Reserved.
How many shoes
did Marty buy?
How many shoes were
sold last year
compared to this year
grouped by demographic?
BIG DATA
Marty Purchase History
BIG DATA
We have small questions and big questions and
they need to work in different ways
12© 2015. All Rights Reserved.
How many shoes
did Marty buy?
All Shoe Data
How many shoes were
sold last year
compared to this year
grouped by demographic?
Part of Shoe Data
When we actually want to work with large amounts
of data we break it into parts
13© 2015. All Rights Reserved.
Distributed FS/databases
already do this for us
Node1 Node2 Node3 Node4
Part of Shoe Data Part of Shoe Data Part of Shoe Data
Spark describes underlying large multi-machine sets of
data using
The RDD (Resilient Distributed Dataset)
14© 2015. All Rights Reserved.
RDD
Part of Shoe Data
Node1 Node2 Node3 Node4
Part of Shoe Data Part of Shoe Data Part of Shoe Data
Spark Partitions
In Cassandra this distribution is mapped out by
token ranges
15© 2015. All Rights Reserved.
1 - 10000 10001-20000 20001-30000 30001 - 40000
Tokens
Part of Shoe Data
Node1 Node2 Node3 Node4
Part of Shoe Data Part of Shoe Data Part of Shoe Data
This distribution is key to how Cassandra handles
OLTP Requests
16© 2015. All Rights Reserved.
SELECT amount from orders where customer = martyID
1 - 10000 10001-20000 20001-30000 30001 - 40000
Tokens
Part of Shoe Data
Node1 Node2 Node3 Node4
Part of Shoe Data Part of Shoe Data Part of Shoe Data
How many shoes
did Marty buy?
martyId -> Token -> 3470
Lookup Data for marty
The Connector Maps Cassandra Tokens
to Spark Partitions
17© 2015. All Rights Reserved.
sc.cassandraTable("keyspace","tablename")
Tokens: 1 - 10000   10001 - 20000   20001 - 30000   30001 - 40000
Node1 Node2 Node3 Node4, each holding Part of Shoe Data
CassandraRDD Spark Partitions:
00001 - 02500   02501 - 05000   05001 - 07500   07501 - 10000
10001 - 12500   12501 - 15000   15001 - 17500   17501 - 20000
20001 - 22500   22501 - 25000   25001 - 27500   27501 - 30000
30001 - 32500   32501 - 35000   35001 - 37500   37501 - 40000
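The mapping above can be sketched as a toy Python function that cuts each node's token range into smaller Spark partitions. This is not the connector's actual code; split_token_ranges and the node names are hypothetical, and the real connector groups token ranges by size and replica rather than splitting evenly.

```python
def split_token_ranges(node_ranges, splits_per_node):
    """Cut each node's token range into equal sub-ranges, roughly how a
    CassandraRDD ends up with several Spark partitions per node."""
    partitions = []
    for node, (start, end) in node_ranges.items():
        width = (end - start + 1) // splits_per_node
        for i in range(splits_per_node):
            lo = start + i * width
            hi = end if i == splits_per_node - 1 else lo + width - 1
            partitions.append((node, lo, hi))
    return partitions

# The four node ranges from the diagram above.
node_ranges = {"Node1": (1, 10000), "Node2": (10001, 20000),
               "Node3": (20001, 30000), "Node4": (30001, 40000)}
parts = split_token_ranges(node_ranges, 4)
# 16 Spark partitions, e.g. ("Node1", 1, 2500) ... ("Node4", 37501, 40000)
```

Because each partition knows which node owns its token range, Spark can schedule the task on that node.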
This allows for Node Local operations!
18© 2015. All Rights Reserved.
sc.cassandraTable("keyspace","tablename")
Under the Hood the Spark Cassandra Connector
Uses the Java Driver to pull Information from C*
19© 2015. All Rights Reserved.
Check out my videos on
Datastax Academy
For a Deep Dive!
Check out
Robert's Talk!
5:10 PM - 5:50 PM
B1 - B3
https://academy.datastax.com/tutorials
https://academy.datastax.com/demos/how-spark-cassandra-connector-reads-data
https://academy.datastax.com/demos/how-spark-cassandra-connector-writes-data
https://academy.datastax.com/demos/how-spark-works-dsestandalone-mode
The Present:

Capabilities and Features
20© 2015. All Rights Reserved.
Official Releases for Spark 1.0 - 1.4

Milestone Release for 1.5
Read Cassandra Data into RDDs
Write RDDs into Cassandra
21© 2015. All Rights Reserved.

case class Letter(mailbox: Int, body: String, fromuser: String, touser: String)

CREATE TABLE important.letters
     ( mailbox int,
       touser text,
       fromuser text,
       body text,
       PRIMARY KEY ((mailbox), touser, fromuser));

RDD[Letter]
sc.cassandraTable[Letter]("important","letters")
rdd.saveToCassandra("important","letters")

https://github.com/datastax/spark-cassandra-connector/blob/master/doc/2_loading.md
https://github.com/datastax/spark-cassandra-connector/blob/master/doc/5_saving.md
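The connector fills the Letter case class by matching result columns to fields of the same name. A minimal Python sketch of that name-based mapping, using a namedtuple as a stand-in for the case class; row_to_letter and the sample row are hypothetical.

```python
from collections import namedtuple

# Stand-in for the Scala case class Letter.
Letter = namedtuple("Letter", ["mailbox", "body", "fromuser", "touser"])

def row_to_letter(row):
    """Match each column in the result row to the field of the same name,
    roughly how the connector maps rows onto a case class."""
    return Letter(**{field: row[field] for field in Letter._fields})

row = {"mailbox": 1, "touser": "doc", "fromuser": "marty",
       "body": "What happens to us in the future?"}
letter = row_to_letter(row)
print(letter.touser)  # doc
```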
Ability to push down relevant filters to the C* Server
24© 2015. All Rights Reserved.

CREATE TABLE important.letters
     ( mailbox int,
       touser text,
       fromuser text,
       body text,
       PRIMARY KEY ((mailbox), touser, fromuser));

Each mailbox is one C* partition, ordered by touser:

Partition for Mailbox 1
mailbox: 1, touser: doc, fromuser: marty, body: What happens to us in the future?
mailbox: 1, touser: lorraine, fromuser: marty, body: Calvin? Wh… Why do you keep calling me calvin

Partition for Mailbox 2
mailbox: 2, touser: marty, fromuser: doc, body: It's your kids, Marty. Something gotta be done about your kids!

sc.cassandraTable("important", "letters")
  .select("body")
  .where("touser > ?", "einstein")
  .collect

Select lets us request only certain columns from C*.
Where lets us put in CQL predicates that are allowed.
Only the data we specifically request is pulled from C*.

https://github.com/datastax/spark-cassandra-connector/blob/master/doc/3_selection.md
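Which where clauses are "allowed" follows CQL's restrictions on the table's clustering columns. A rough, simplified Python sketch of those rules for a table shaped like important.letters; cql_pushable is hypothetical, and it ignores partition-key and token-range details the connector also handles.

```python
# Schema of important.letters, as created above.
CLUSTERING_COLS = ["touser", "fromuser"]

def cql_pushable(predicates):
    """Simplified sketch of CQL's clustering-column rules (not connector code).
    predicates: dict of column -> operator, e.g. {"touser": ">"}.
    Clustering predicates must restrict a prefix of the clustering columns,
    and only the last restricted column may use a range operator."""
    restricted = [c for c in CLUSTERING_COLS if c in predicates]
    if restricted != CLUSTERING_COLS[:len(restricted)]:
        return False  # skipped a clustering column, e.g. fromuser without touser
    for col in restricted[:-1]:
        if predicates[col] != "=":
            return False  # range allowed only on the last restricted column
    return True

print(cql_pushable({"touser": ">"}))                   # True
print(cql_pushable({"fromuser": ">"}))                 # False: skips touser
print(cql_pushable({"touser": "=", "fromuser": ">"}))  # True
```

Predicates that fail these checks are not pushed down; Spark applies them after the rows arrive.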
Java API Support
31© 2015. All Rights Reserved.
All functionality introduced in the Scala API is also available in the Java API.

Reading
JavaRDD<Letter> lettersRDD = javaFunctions(sc)
    .cassandraTable("important", "letters", mapColumnTo(Letter.class))
    .select("body");

Writing
javaFunctions(rdd).writerBuilder(
    "important",
    "letters",
    mapToRow(Letter.class)
).saveToCassandra();

https://github.com/datastax/spark-cassandra-connector/blob/master/doc/7_java_api.md
32© 2015. All Rights Reserved.
But what if you want to work with brand new
Dataframes?
Full Dataframes Support:
org.apache.spark.sql.cassandra
33© 2015. All Rights Reserved.

Dataframes (aka SchemaRDDs) provide a new and more generic API for working with RDDs.

Reading
val df = sqlContext
  .read
  .format("org.apache.spark.sql.cassandra")
  .options(Map(
    "keyspace" -> "important",
    "table" -> "letters"
  ))
  .load()

CREATE TABLE letters
     USING org.apache.spark.sql.cassandra
     OPTIONS (
          keyspace "important",
          table "letters"
     )

Writing
df.write
  .format("org.apache.spark.sql.cassandra")
  .options(Map(
    "keyspace" -> "important",
    "table" -> "letters"
  ))
  .save()

CREATE TABLE letters_copy
     USING org.apache.spark.sql.cassandra
     OPTIONS (
          keyspace "important",
          table "letters_copy"
     )
INSERT INTO TABLE letters_copy SELECT * FROM letters;

https://github.com/datastax/spark-cassandra-connector/blob/master/doc/14_data_frames.md
Full Dataframes Support
37© 2015. All Rights Reserved.
https://github.com/datastax/spark-cassandra-connector/blob/master/doc/14_data_frames.md
Backed By CassandraRDD
So we can prune
and pushdown predicates!
Integrated Pushdown of Predicates to C* in
Dataframes
38© 2015. All Rights Reserved.
There is no need for special functions when using Dataframes
since the pushdown is done by the Catalyst optimizer
CREATE TABLE important.letters
     ( mailbox int,
       touser text,
       fromuser text,
       body text,
       PRIMARY KEY ((mailbox), touser, fromuser));

https://github.com/datastax/spark-cassandra-connector/blob/master/doc/14_data_frames.md

scala> df.filter("touser > 'einstein'").explain
== Physical Plan ==
Filter (touser#1 > einstein)
 PhysicalRDD [mailbox#0,touser#1,fromuser#2,body#3], MapPartitionsRDD[6] at explain at <console>:59
Automatically Checked Against C* rules for pushing down
predicates. Valid predicates will be applied as if you did a
.where on CassandraRDD.
Pyspark and Dataframes Also Supported
39© 2015. All Rights Reserved.

Dataframes in PySpark run Native Code, no need for Python <-> Java Serialization.

sqlContext.read \
    .format("org.apache.spark.sql.cassandra") \
    .options(table="kv", keyspace="test") \
    .load().show()

Pure Python in PySpark Dataframes! You can tell it's python because of my need to escape line ends.

SparkR Also Works with Cassandra Dataframes!

https://github.com/datastax/spark-cassandra-connector/blob/master/doc/15_python.md
Repartition by Cassandra Replica
41© 2015. All Rights Reserved.

Repartition any RDD to get Data Locality to C*!

Spark Partitions Located on Different Nodes (1955, 1985, 2015) than Their Respective C* Data

mailboxesToCheck
  .repartitionByCassandraReplica("important", "letters", 10)

https://github.com/datastax/spark-cassandra-connector/blob/master/doc/2_loading.md
JoinWithCassandraTable pulls specific
Partition Keys From Cassandra
44© 2015. All Rights Reserved.

mailboxesToCheck
  .repartitionByCassandraReplica("important", "letters", 10)
  .joinWithCassandraTable("important","letters")

CREATE TABLE important.letters
     ( mailbox int,
       touser text,
       fromuser text,
       body text,
       PRIMARY KEY ((mailbox), touser, fromuser));

Several thousand mailbox keys (Mailbox13234, Mailbox8765, Mailbox3, …) start out scattered across Node1 - Node4.
Repartition places our keys local to the data they will retrieve.
The Join then retrieves the rows in parallel.

https://github.com/datastax/spark-cassandra-connector/blob/master/doc/2_loading.md
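The idea behind repartitionByCassandraReplica can be sketched as: group lookup keys under a node that replicates them before fetching. This is a toy model, not the connector's code; the ring placement is a stand-in for Cassandra's token-based placement, and all names are hypothetical.

```python
def replicas_for(key, nodes, rf=2):
    """Toy placement: hash the partition key onto a ring of nodes and take
    rf consecutive nodes (a stand-in for token-based replica placement)."""
    start = hash(key) % len(nodes)
    return [nodes[(start + i) % len(nodes)] for i in range(rf)]

def repartition_by_replica(keys, nodes):
    """Group each lookup key under a node holding a replica of it, which is
    what repartitionByCassandraReplica achieves before joinWithCassandraTable."""
    by_node = {n: [] for n in nodes}
    for k in keys:
        by_node[replicas_for(k, nodes)[0]].append(k)
    return by_node

nodes = ["Node1", "Node2", "Node3", "Node4"]
mailboxes = [13234, 8765, 3, 2341, 43211, 754567, 13452, 52352]
placement = repartition_by_replica(mailboxes, nodes)
```

After this grouping, each Spark task only asks its local node for its own batch of partition keys, instead of every task querying every node.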
Manual Driver Sessions are available!
47© 2015. All Rights Reserved. https://github.com/datastax/spark-­‐cassandra-­‐connector/blob/master/doc/1_connecting.md
import com.datastax.spark.connector.cql.CassandraConnector

CassandraConnector(conf).withSessionDo { session =>
  session.execute("CREATE KEYSPACE test2 WITH REPLICATION = {'class': 'SimpleStrategy', 'replication_factor': 1 }")
  session.execute("CREATE TABLE test2.words (word text PRIMARY KEY, count int)")
}
Any Connections Made through CassandraConnector
will use a Connection pool (even remotely!)
48© 2015. All Rights Reserved. https://github.com/datastax/spark-cassandra-connector/blob/master/doc/1_connecting.md
CassandraConnector(conf).withSessionDo { }
Gains a handle on a running Cluster object made with configuration conf.
[Diagram: Executor Threads 1-3 in one Executor JVM sharing Cassandra connection pools]
49© 2015. All Rights Reserved. https://github.com/datastax/spark-cassandra-connector/blob/master/doc/1_connecting.md
Multiple threads/executor cores will end up using the same Connection.
[Diagram: Executor Threads 1-3 in one Executor JVM sharing a single Cluster connection]
CassandraConnector(conf).withSessionDo { }
Cassandra Connector can be used in Closures
and Prepared Statements will be Cached as well
50© 2015. All Rights Reserved. https://github.com/datastax/spark-cassandra-connector/blob/master/doc/1_connecting.md
rdd.mapPartitions { it =>
  CassandraConnector.withSessionDo( session => ps = session.prepare(query) )
}
Reference to already created prepared statement will be used if available.
[Diagram: Executor Threads 1-3 in one Executor JVM sharing a Cassandra connection pool, a Cluster object, and a Prepared Statement Cache]
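A fleshed-out version of that fragment might look like the sketch below (not from the slides; it assumes the test2.words table from earlier, an RDD of word strings, and a connector built from the Spark conf):

import com.datastax.spark.connector.cql.CassandraConnector

val connector = CassandraConnector(sc.getConf)
val query = "SELECT count FROM test2.words WHERE word = ?"

val counts = rdd.mapPartitions { words =>
  connector.withSessionDo { session =>
    // prepare() consults the connector's prepared-statement cache, so repeated
    // calls across partitions on the same executor reuse one statement
    val ps = session.prepare(query)
    // materialize eagerly so queries run while the session is still checked out
    words.map(w => session.execute(ps.bind(w)).one().getInt("count")).toList.iterator
  }
}

withSessionDo checks a session out of the shared pool and returns it when the block finishes, which is why the results are materialized inside the block rather than returned as a lazy iterator.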
What is the Future of the Spark Cassandra Connector?
51© 2015. All Rights Reserved.
You!
52© 2015. All Rights Reserved.
The more people that contribute to the project the better it will become!
We welcome any contributions or just send us a letter on the mailing list!
https://github.com/datastax/spark-cassandra-connector/blob/master/doc/FAQ.md#can-i-contribute-to-the-spark-cassandra-connector
Spark Packages!
53© 2015. All Rights Reserved.
http://spark-packages.org/package/datastax/spark-cassandra-connector
Update Even Faster to New Spark Versions
54© 2015. All Rights Reserved.
We'll be testing against Spark Release Candidates in the future so that we can have a compatible
Spark Cassandra Connector out the moment an official Spark release is ready!
Even better Dataframes
55© 2015. All Rights Reserved.
Automatic integration of repartitionByCassandraReplica and joinWithCassandraTable
Any joins against Cassandra tables would be automatically detected and, where possible, converted to joinWithCassandraTable calls, so there is no need to manually determine when you should or shouldn't use the method.
Create Cassandra Tables from Dataframes Automatically
Currently all tables need to be created in C* prior to saving. We'd like users to be able to
specify what kind of key they want on their C* table and have the table
automatically generated on DataFrame writes.
Improve
Spark-Cassandra-Stress
56© 2015. All Rights Reserved.
https://github.com/datastax/spark-cassandra-stress
Open-source tool that lets you test the maximum throughput
of your cluster with Spark and C*
Includes:
• Write Tests
• Read Tests
• Streaming Tests
Thank you

Crypto Cloud Review - How To Earn Up To $500 Per DAY Of Bitcoin 100% On AutoP...
 
%in ivory park+277-882-255-28 abortion pills for sale in ivory park
%in ivory park+277-882-255-28 abortion pills for sale in ivory park %in ivory park+277-882-255-28 abortion pills for sale in ivory park
%in ivory park+277-882-255-28 abortion pills for sale in ivory park
 
W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...
W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...
W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...
 
%in tembisa+277-882-255-28 abortion pills for sale in tembisa
%in tembisa+277-882-255-28 abortion pills for sale in tembisa%in tembisa+277-882-255-28 abortion pills for sale in tembisa
%in tembisa+277-882-255-28 abortion pills for sale in tembisa
 
The title is not connected to what is inside
The title is not connected to what is insideThe title is not connected to what is inside
The title is not connected to what is inside
 

Spark Cassandra Connector: Past, Present, and Future

  • 1. Spark Cassandra Connector: Past, Present and Future
  • 2. Spark Cassandra Connector: Past, Present and Future. Russell Spitzer, @RussSpitzer, Software Engineer - DataStax
  • 3. The Past: Hadoop and C* 3 You Hadoop integration with C* required a bit of knowledge and was generally not very easy. Map Reduce Code
  • 4.        public  static  class  ReducerToCassandra  extends  Reducer<Text,  IntWritable,  Map<String,  ByteBuffer>,  List<ByteBuffer>>          {                  private  Map<String,  ByteBuffer>  keys;                  private  ByteBuffer  key;                  protected  void  setup(org.apache.hadoop.mapreduce.Reducer.Context  context)                  throws  IOException,  InterruptedException                  {                          keys  =  new  LinkedHashMap<String,  ByteBuffer>();                  }                  public  void  reduce(Text  word,  Iterable<IntWritable>  values,  Context  context)  throws  IOException,  InterruptedException                  {                          int  sum  =  0;                          for  (IntWritable  val  :  values)                                  sum  +=  val.get();                          keys.put("word",  ByteBufferUtil.bytes(word.toString()));                          context.write(keys,  getBindVariables(word,  sum));                  }                  private  List<ByteBuffer>  getBindVariables(Text  word,  int  sum)                  {                          List<ByteBuffer>  variables  =  new  ArrayList<ByteBuffer>();                          variables.add(ByteBufferUtil.bytes(String.valueOf(sum)));                                            return  variables;                  }          } Hadoop Interfaces are … difficult 4© 2015. All Rights Reserved. https://github.com/apache/cassandra/blob/trunk/examples/hadoop_cql3_word_count/src/WordCount.java Even simple integration with a Hadoop cluster took a lot of experience to get right.
  • 5.        public  static  class  ReducerToCassandra  extends  Reducer<Text,  IntWritable,  Map<String,  ByteBuffer>,  List<ByteBuffer>>          {                  private  Map<String,  ByteBuffer>  keys;                  private  ByteBuffer  key;                  protected  void  setup(org.apache.hadoop.mapreduce.Reducer.Context  context)                  throws  IOException,  InterruptedException                  {                          keys  =  new  LinkedHashMap<String,  ByteBuffer>();                  }                  public  void  reduce(Text  word,  Iterable<IntWritable>  values,  Context  context)  throws  IOException,  InterruptedException                  {                          int  sum  =  0;                          for  (IntWritable  val  :  values)                                  sum  +=  val.get();                          keys.put("word",  ByteBufferUtil.bytes(word.toString()));                          context.write(keys,  getBindVariables(word,  sum));                  }                  private  List<ByteBuffer>  getBindVariables(Text  word,  int  sum)                  {                          List<ByteBuffer>  variables  =  new  ArrayList<ByteBuffer>();                          variables.add(ByteBufferUtil.bytes(String.valueOf(sum)));                                            return  variables;                  }          } Hadoop Interfaces are … difficult 5© 2015. All Rights Reserved. https://github.com/apache/cassandra/blob/trunk/examples/hadoop_cql3_word_count/src/WordCount.java Well at least you have Pig built in right? 
moredata  =  load  'cql://cql3ks/compmore'  USING  CqlNativeStorage;   insertformat  =  FOREACH  moredata  GENERATE  TOTUPLE  (TOTUPLE('a',x),TOTUPLE('b',y),   TOTUPLE('c',z)),TOTUPLE(data);   STORE  insertformat  INTO  'cql://cql3ks/compotable?output_query=UPDATE %20cql3ks.compotable%20SET%20d%20%3D%20%3F'  USING  CqlNativeStorage;   Even simple integration with a Hadoop cluster took a lot of experience to get right.
  • 6. Spark Offers a New Path 6© 2015. All Rights Reserved. Core Libraries for ML/Streaming No need for HDFS/Hadoop Easy integration with other Data Sources val  lines  =  sc.textFile("data.txt")   val  pairs  =  lines.map(s  =>  (s,  1))   val  counts  =  pairs.reduceByKey((a,  b)  =>  a  +  b) RDD Api df.groupBy("age").count().show() Dataframes Api head(filter(df,  df$waiting  <  50)) R Api SELECT  name  FROM  people SQL API Driver Executor
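The RDD API shown on slide 6 can be mimicked on plain Scala collections, which makes the data flow easy to follow without a cluster. A minimal sketch, assuming a local `Seq` standing in for `sc.textFile("data.txt")`: `reduceByKey` is essentially a `groupBy` followed by a per-key reduce.

```scala
// Plain-Scala sketch of the RDD word count on slide 6, runnable without a
// SparkContext: reduceByKey is essentially groupBy plus a per-key reduce.
// "data" stands in for sc.textFile("data.txt").
object WordCountSketch {
  def reduceByKey[K](pairs: Seq[(K, Int)])(f: (Int, Int) => Int): Map[K, Int] =
    pairs.groupBy(_._1).map { case (k, vs) => k -> vs.map(_._2).reduce(f) }

  val data   = Seq("to be", "or not to be")
  val pairs  = data.flatMap(_.split(" ")).map(w => (w, 1))
  val counts = reduceByKey(pairs)(_ + _)   // to -> 2, be -> 2, or -> 1, not -> 1
}
```

On a real RDD the same three lines run distributed; the per-key merge is what `reduceByKey((a, b) => a + b)` performs shuffle-side.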
  • 7. Enter The Spark Cassandra Connector 7© 2015. All Rights Reserved. First Public Release at the Spark Summit in June 2014 If you write a Spark application that needs access to Cassandra, this library is for you -Piotr Kołaczkowski https://github.com/datastax/spark-cassandra-connector Open Source Software 1394 Commits 28 Contributors
  • 8. Why do we even want a Distributed Analytics tool? 8© 2015. All Rights Reserved.
  • 9. Why do we even want a Distributed Analytics tool? 9© 2015. All Rights Reserved. •Generating Reports •Direct Analytics on our data •Cassandra Maintenance •Making new views •Changing partition keys •Streaming •Machine Learning •ETL Data between different sources
  • 10. We have small questions and big questions and they need to work in different ways 10© 2015. All Rights Reserved. How many shoes did Marty buy? How many shoes were sold last year compared to this year grouped by demographic? BIG DATA
  • 11. We have small questions and big questions and they need to work in different ways 11© 2015. All Rights Reserved. How many shoes did Marty buy? How many shoes were sold last year compared to this year grouped by demographic? BIG DATA Marty Purchase History
  • 12. BIG DATA We have small questions and big questions and they need to work in different ways 12© 2015. All Rights Reserved. How many shoes did Marty buy? All Shoe Data How many shoes were sold last year compared to this year grouped by demographic?
  • 13. Part of Shoe Data When we actually want to work with large amounts of data we break it into parts 13© 2015. All Rights Reserved. Distributed FS/databases already do this for us Node1 Node2 Node3 Node4 Part of Shoe Data Part of Shoe Data Part of Shoe Data
  • 14. Spark describes underlying large multi-machine sets of data using The RDD (Resilient Distributed Dataset) 14© 2015. All Rights Reserved. RDD Part of Shoe Data Node1 Node2 Node3 Node4 Part of Shoe Data Part of Shoe Data Part of Shoe Data Spark Partitions
  • 15. In Cassandra this distribution is mapped out by token ranges 15© 2015. All Rights Reserved. 1 - 10000 10001-20000 20001-30000 30001 - 40000 Tokens Part of Shoe Data Node1 Node2 Node3 Node4 Part of Shoe Data Part of Shoe Data Part of Shoe Data
  • 16. This distribution is key to how Cassandra handles OLTP Requests 16© 2015. All Rights Reserved. SELECT  amount  from  orders  where  customer  =  martyID 1 - 10000 10001-20000 20001-30000 30001 - 40000 Tokens Part of Shoe Data Node1 Node2 Node3 Node4 Part of Shoe Data Part of Shoe Data Part of Shoe Data How many shoes did Marty buy? martyId  -­‐>  Token  -­‐>  3470 Lookup  Data  for  marty
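The OLTP lookup on slide 16 ("martyId -> Token -> 3470") can be sketched in a few lines of plain Scala. Everything here is a toy stand-in: the four ranges match the diagram, but the hash is illustrative (Cassandra's default partitioner is Murmur3) and `TokenRange`/`nodeFor` are hypothetical names.

```scala
// Toy sketch of partition-key routing: hash the key to a token, then find the
// node whose token range owns it. Ranges match the slide's diagram; the hash
// is illustrative, not Murmur3.
object TokenRouting {
  final case class TokenRange(start: Long, end: Long, node: String)
  val ring = Seq(
    TokenRange(1, 10000, "Node1"),     TokenRange(10001, 20000, "Node2"),
    TokenRange(20001, 30000, "Node3"), TokenRange(30001, 40000, "Node4"))

  def token(key: String): Long = (math.abs(key.hashCode.toLong) % 40000L) + 1
  def nodeFor(key: String): String = {
    val t = token(key)
    ring.find(r => t >= r.start && t <= r.end).get.node  // the SELECT goes only here
  }
}
```

This single-node routing is why point lookups like "how many shoes did Marty buy?" stay cheap no matter how big the table gets.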
  • 17. The Connector Maps Cassandra Tokens to Spark Partitions 17© 2015. All Rights Reserved. sc.cassandraTable("keyspace","tablename") 1 - 10000 10001-20000 30001 - 40000 Tokens Part of Shoe Data Node1 Node2 Node3 Node4 Part of Shoe Data Part of Shoe Data Part of Shoe Data 20001-30000 00001 - 02500 02501 - 05000 05001 - 07500 07501 - 10000 CassandraRDD 10001 - 12500 12501 - 15000 15001 - 17500 17501 - 20000 20001 - 22500 22501 - 25000 25001 - 27500 27501 - 30000 30001 - 32500 32501 - 35000 35001 - 37500 37501 - 40000
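The split step on slide 17 (each 10000-token node range becoming four 2500-token Spark partitions) reduces to simple range arithmetic. A sketch under a fixed split count; the real connector sizes splits from Cassandra's data-size estimates rather than a constant:

```scala
// Each node's token range is cut into sub-ranges; each sub-range becomes one
// Spark partition. Fixed split count here; the connector uses size estimates.
object TokenSplitter {
  def split(start: Long, end: Long, n: Int): Seq[(Long, Long)] = {
    val width = (end - start + 1) / n
    (0 until n).map { i =>
      val s = start + i * width
      (s, if (i == n - 1) end else s + width - 1)  // last split absorbs any remainder
    }
  }
}
```

`split(1, 10000, 4)` reproduces the slide's `00001 - 02500 … 07501 - 10000` layout, and because every sub-range lies inside one node's range, each Spark partition can be read node-locally.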
  • 18. This allows for Node Local operations! 18© 2015. All Rights Reserved. sc.cassandraTable("keyspace","tablename") 1 - 10000 10001-20000 30001 - 40000 Tokens Part of Shoe Data Node1 Node2 Node3 Node4 Part of Shoe Data Part of Shoe Data Part of Shoe Data 20001-30000 00001 - 02500 02501 - 05000 05001 - 07500 07501 - 10000 CassandraRDD 10001 - 12500 12501 - 15000 15001 - 17500 17501 - 20000 20001 - 22500 22501 - 25000 25001 - 27500 27501 - 30000 30001 - 32500 32501 - 35000 35001 - 37500 37501 - 40000
  • 19. Under the Hood the Spark Cassandra Connector Uses the Java Driver to pull Information from C* 19© 2015. All Rights Reserved. Check out my videos on Datastax Academy For a Deep Dive! Check out Robert's Talk! 5:10 PM - 5:50 PM B1 - B3 https://academy.datastax.com/tutorials   https://academy.datastax.com/demos/how-­‐spark-­‐cassandra-­‐connector-­‐reads-­‐data   https://academy.datastax.com/demos/how-­‐spark-­‐cassandra-­‐connector-­‐writes-­‐data   https://academy.datastax.com/demos/how-­‐spark-­‐works-­‐dsestandalone-­‐mode
  • 20. The Present: Capabilities and Features 20© 2015. All Rights Reserved. Official Releases for Spark 1.0 - 1.4; Milestone Release for 1.5
  • 21. Read Cassandra Data into RDDs Write RDDs into Cassandra 21© 2015. All Rights Reserved. RDD[Letter] case class Letter(mailbox: Int, body: String, fromuser: String, touser: String) CREATE TABLE important.letters
    (  mailbox  int,            touser  text,            fromuser  text,            body  text,            PRIMARY  KEY  ((mailbox),  touser,  fromuser)); https://github.com/datastax/spark-­‐cassandra-­‐connector/blob/master/doc/2_loading.md https://github.com/datastax/spark-­‐cassandra-­‐connector/blob/master/doc/5_saving.md
  • 22. Read Cassandra Data into RDDs Write RDDs into Cassandra 22© 2015. All Rights Reserved. RDD[Letter] sc.cassandraTable[Letter]("important","letters") case class Letter(mailbox: Int, body: String, fromuser: String, touser: String) CREATE TABLE important.letters
    (  mailbox  int,            touser  text,            fromuser  text,            body  text,            PRIMARY  KEY  ((mailbox),  touser,  fromuser)); https://github.com/datastax/spark-­‐cassandra-­‐connector/blob/master/doc/2_loading.md https://github.com/datastax/spark-­‐cassandra-­‐connector/blob/master/doc/5_saving.md
  • 23. Read Cassandra Data into RDDs Write RDDs into Cassandra 23© 2015. All Rights Reserved. RDD[Letter] sc.cassandraTable[Letter]("important","letters") rdd.saveToCassandra("important","letters") case class Letter(mailbox: Int, body: String, fromuser: String, touser: String) CREATE TABLE important.letters
    (  mailbox  int,            touser  text,            fromuser  text,            body  text,            PRIMARY  KEY  ((mailbox),  touser,  fromuser)); https://github.com/datastax/spark-­‐cassandra-­‐connector/blob/master/doc/2_loading.md https://github.com/datastax/spark-­‐cassandra-­‐connector/blob/master/doc/5_saving.md
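As a concreteness aid for the slides above, here is a hypothetical sketch of the field-to-column binding that `saveToCassandra` relies on: each case-class field is bound to the like-named column of a prepared INSERT. `InsertBuilder` and its hard-coded column list are illustrative only; the real connector derives columns from the table's schema metadata.

```scala
// Toy sketch: bind Letter's fields to like-named columns of a prepared INSERT.
object InsertBuilder {
  final case class Letter(mailbox: Int, body: String, fromuser: String, touser: String)
  // In the real connector this list comes from the table's schema metadata.
  val columns = Seq("mailbox", "body", "fromuser", "touser")

  def insertCql(keyspace: String, table: String): String =
    s"INSERT INTO $keyspace.$table (${columns.mkString(", ")}) " +
      s"VALUES (${columns.map(_ => "?").mkString(", ")})"

  def bindValues(l: Letter): Seq[Any] = Seq(l.mailbox, l.body, l.fromuser, l.touser)
}
```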
  • 24. Ability to push down relevant filters to the C* Server 24© 2015. All Rights Reserved. CREATE  TABLE  important.letters  
    (  mailbox  int,            touser  text,            fromuser  text,            body  text,            PRIMARY  KEY  ((mailbox),  touser,  fromuser)); https://github.com/datastax/spark-­‐cassandra-­‐connector/blob/master/doc/3_selection.md
  • 25. Ability to push down relevant filters to the C* Server 25© 2015. All Rights Reserved. CREATE  TABLE  important.letters  
    (  mailbox  int,            touser  text,            fromuser  text,            body  text,            PRIMARY  KEY  ((mailbox),  touser,  fromuser)); Partition for Mailbox 1 Partition for Mailbox 2 Orderedbytouser https://github.com/datastax/spark-­‐cassandra-­‐connector/blob/master/doc/3_selection.md
  • 26. Ability to push down relevant filters to the C* Server 26© 2015. All Rights Reserved. mailbox:  2   touser:  marty   fromuser:  doc   body:  It's  your  kids,  Marty.                Something  gotta  be  done  about              your  kids! mailbox:  1   touser:  doc   fromuser:  marty   body:  What  happens  to  us  in  the   future?     mailbox:  1   touser:  lorraine   fromuser:  marty   body:  Calvin?  Wh…  Why  do  you  keep                calling  me  calvin   https://github.com/datastax/spark-­‐cassandra-­‐connector/blob/master/doc/3_selection.md Partition for Mailbox 1 Partition for Mailbox 2 Orderedbytouser CREATE  TABLE  important.letters  
    (  mailbox  int,            touser  text,            fromuser  text,            body  text,            PRIMARY  KEY  ((mailbox),  touser,  fromuser));
  • 27. Ability to push down relevant filters to the C* Server 27© 2015. All Rights Reserved. sc.cassandraTable("important", "letters") .select("body") .where("touser = ?", "einstein") .collect https://github.com/datastax/spark-cassandra-connector/blob/master/doc/3_selection.md mailbox: 1 touser: doc fromuser: marty body: What happens to us in the future? mailbox: 1 touser: lorraine fromuser: marty body: Calvin? Wh… Why do you keep calling me calvin Partition for Mailbox 1 Partition for Mailbox 2 Ordered by touser CREATE TABLE important.letters
    (  mailbox  int,            touser  text,            fromuser  text,            body  text,            PRIMARY  KEY  ((mailbox),  touser,  fromuser)); mailbox:  2   touser:  marty   fromuser:  doc   body:  It's  your  kids,  Marty.                Something  gotta  be  done  about              your  kids!
  • 28. Ability to push down relevant filters to the C* Server 28© 2015. All Rights Reserved. Select lets us only request certain columns from C* https://github.com/datastax/spark-cassandra-connector/blob/master/doc/3_selection.md sc.cassandraTable("important", "letters") .select("body") .where("touser = ?", "einstein") .collect mailbox: 1 touser: doc fromuser: marty body: What happens to us in the future? mailbox: 1 touser: lorraine fromuser: marty body: Calvin? Wh… Why do you keep calling me calvin Partition for Mailbox 1 Partition for Mailbox 2 Ordered by touser CREATE TABLE important.letters
    (  mailbox  int,            touser  text,            fromuser  text,            body  text,            PRIMARY  KEY  ((mailbox),  touser,  fromuser)); mailbox:  2   touser:  marty   fromuser:  doc   body:  It's  your  kids,  Marty.                Something  gotta  be  done  about              your  kids!
  • 29. Ability to push down relevant filters to the C* Server 29© 2015. All Rights Reserved. Where lets us put in CQL predicates that are allowed https://github.com/datastax/spark-cassandra-connector/blob/master/doc/3_selection.md sc.cassandraTable("important", "letters") .select("body") .where("touser = ?", "einstein") .collect mailbox: 1 touser: doc fromuser: marty body: What happens to us in the future? mailbox: 1 touser: lorraine fromuser: marty body: Calvin? Wh… Why do you keep calling me calvin Partition for Mailbox 1 Partition for Mailbox 2 Ordered by touser CREATE TABLE important.letters
    (  mailbox  int,            touser  text,            fromuser  text,            body  text,            PRIMARY  KEY  ((mailbox),  touser,  fromuser)); mailbox:  2   touser:  marty   fromuser:  doc   body:  It's  your  kids,  Marty.                Something  gotta  be  done  about              your  kids!
  • 30. Ability to push down relevant filters to the C* Server 30© 2015. All Rights Reserved. https://github.com/datastax/spark-cassandra-connector/blob/master/doc/3_selection.md Only the data we specifically request is pulled from C* sc.cassandraTable("important", "letters") .select("body") .where("touser = ?", "einstein") .collect mailbox: 1 touser: doc fromuser: marty body: What happens to us in the future? mailbox: 1 touser: lorraine fromuser: marty body: Calvin? Wh… Why do you keep calling me calvin Partition for Mailbox 1 Partition for Mailbox 2 Ordered by touser CREATE TABLE important.letters
    (  mailbox  int,            touser  text,            fromuser  text,            body  text,            PRIMARY  KEY  ((mailbox),  touser,  fromuser)); mailbox:  2   touser:  marty   fromuser:  doc   body:  It's  your  kids,  Marty.                Something  gotta  be  done  about              your  kids!
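Which `.where()` clauses Cassandra will accept follows a fixed rule for this table's key layout: partition key columns take equality only, and once a clustering column carries a range restriction, no later clustering column may be restricted. A toy checker illustrating that rule (this is an aid for reading the slides, not the connector's actual validation code; `PushdownCheck` and its operator map are hypothetical):

```scala
// Toy model of CQL's restriction rules for important.letters:
// PRIMARY KEY ((mailbox), touser, fromuser).
object PushdownCheck {
  val partitionKeys  = Seq("mailbox")
  val clusteringCols = Seq("touser", "fromuser")

  // preds maps a column to its operator, e.g. Map("touser" -> ">").
  def pushable(preds: Map[String, String]): Boolean = {
    // Partition key columns may only be restricted by equality here.
    val partOk = partitionKeys.forall(c => preds.get(c).forall(_ == "="))
    // After the first range-restricted clustering column, nothing may follow.
    val idx = clusteringCols.indexWhere(c => preds.get(c).exists(_ != "="))
    val clustOk = idx == -1 || clusteringCols.drop(idx + 1).forall(c => !preds.contains(c))
    partOk && clustOk
  }
}
```

So a range on `touser` alone is fine, `touser = …` followed by a range on `fromuser` is fine, but a range on `touser` plus any restriction on `fromuser` is rejected.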
  • 31. Java API Support 31© 2015. All Rights Reserved. JavaRDD<Double>  pricesRDD  =  javaFunctions(sc)      .cassandraTable("important",  "letters",                                                      mapColumnTo(Letter.class))      .select("body"); All functionality introduced in the Scala API is also available in the Java API javaFunctions(rdd).writerBuilder(      "important",        "letters",        mapToRow(Letters.class)   ).saveToCassandra(); Reading Writing https://github.com/datastax/spark-­‐cassandra-­‐connector/blob/master/doc/7_java_api.md
  • 32. 32© 2015. All Rights Reserved. But what if you want to work with brand new Dataframes?
  • 33. Full Dataframes Support : org.apache.spark.sql.cassandra 33© 2015. All Rights Reserved. Dataframes (aka SchemaRDDs) provide a new and more generic api for working with RDD's val  df  =  sqlContext      .read      .format("org.apache.spark.sql.cassandra")      .options(   Map(      "keyspace"  -­‐>  "important",    "table"  -­‐>  "letters"   ))      .load() Reading https://github.com/datastax/spark-­‐cassandra-­‐connector/blob/master/doc/14_data_frames.md
  • 34. Full Dataframes Support : org.apache.spark.sql.cassandra 34© 2015. All Rights Reserved. Dataframes (aka SchemaRDDs) provide a new and more generic api for working with RDD's val  df  =  sqlContext      .read      .format("org.apache.spark.sql.cassandra")      .options(   Map(      "keyspace"  -­‐>  "important",    "table"  -­‐>  "letters"   ))      .load() CREATE  TABLE  letters            USING  org.apache.spark.sql.cassandra            OPTIONS  (                      keyspace  "important",                      table  "letters"              ) Reading https://github.com/datastax/spark-­‐cassandra-­‐connector/blob/master/doc/14_data_frames.md
  • 35. Full Dataframes Support : org.apache.spark.sql.cassandra 35© 2015. All Rights Reserved. Dataframes (aka SchemaRDDs) provide a new and more generic api for working with RDD's val  df  =  sqlContext      .read      .format("org.apache.spark.sql.cassandra")      .options(   Map(      "keyspace"  -­‐>  "important",    "table"  -­‐>  "letters"   ))      .load() CREATE  TABLE  letters            USING  org.apache.spark.sql.cassandra            OPTIONS  (                      keyspace  "important",                      table  "letters"              ) Reading Writing df.write      .format("org.apache.spark.sql.cassandra")      .options(          Map(            "keyspace"  -­‐>  "important",            "table"  -­‐>  "letters"                  ))      .save() https://github.com/datastax/spark-­‐cassandra-­‐connector/blob/master/doc/14_data_frames.md
  • 36. Full Dataframes Support : org.apache.spark.sql.cassandra 36© 2015. All Rights Reserved. Dataframes (aka SchemaRDDs) provide a new and more generic api for working with RDD's val  df  =  sqlContext      .read      .format("org.apache.spark.sql.cassandra")      .options(   Map(      "keyspace"  -­‐>  "important",    "table"  -­‐>  "letters"   ))      .load() CREATE  TABLE  letters            USING  org.apache.spark.sql.cassandra            OPTIONS  (                      keyspace  "important",                      table  "letters"              ) Reading Writing df.write      .format("org.apache.spark.sql.cassandra")      .options(          Map(            "keyspace"  -­‐>  "important",            "table"  -­‐>  "letters"                  ))      .save() CREATE  TABLE  letters_copy            USING  org.apache.spark.sql.cassandra            OPTIONS  (              keyspace  "important",              table  "letters_copy"              )   INSERT  INTO  TABLE  letters_copy  SELECT  *  FROM  letters; https://github.com/datastax/spark-­‐cassandra-­‐connector/blob/master/doc/14_data_frames.md
  • 37. val  df  =  sqlContext      .read      .format("org.apache.spark.sql.cassandra")      .options(   Map(      "keyspace"  -­‐>  "important",    "table"  -­‐>  "letters"   ))      .load() CREATE  TABLE  letters            USING  org.apache.spark.sql.cassandra            OPTIONS  (                      keyspace  "important",                      table  "letters"              ) Reading Writing df.write      .format("org.apache.spark.sql.cassandra")      .options(          Map(            "keyspace"  -­‐>  "important",            "table"  -­‐>  "letters"                  ))      .save() CREATE  TABLE  letters_copy            USING  org.apache.spark.sql.cassandra            OPTIONS  (              keyspace  "important",              table  "letters_copy"              )   INSERT  INTO  TABLE  letters_copy  SELECT  *  FROM  letters; Full Dataframes Support 37© 2015. All Rights Reserved. https://github.com/datastax/spark-­‐cassandra-­‐connector/blob/master/doc/14_data_frames.md Backed By CassandraRDD So we can prune and pushdown predicates!
  • 38. Integrated Pushdown of Predicates to C* in Dataframes 38© 2015. All Rights Reserved. There is no need for special functions when using Dataframes since the pushdown is done by the Catalyst optimizer CREATE  TABLE  important.letters  
    (  mailbox  int,            touser  text,            fromuser  text,            body  text,            PRIMARY  KEY  ((mailbox),  touser,  fromuser)); https://github.com/datastax/spark-­‐cassandra-­‐connector/blob/master/doc/14_data_frames.md scala>  df.filter(  "touser  >  'einstein'").explain   ==  Physical  Plan  ==   Filter  (touser#1  >  einstein)    PhysicalRDD  [mailbox#0,touser#1,fromuser#2,body#3],   MapPartitionsRDD[6]  at  explain  at  <console>:59 Automatically Checked Against C* rules for pushing down predicates. Valid predicates will be applied as if you did a .where on CassandraRDD.
  • 39. Pyspark and Dataframes Also Supported 39© 2015. All Rights Reserved. Dataframes in PySpark run native code, no need for Python <-> Java serialization. sqlContext.read .format("org.apache.spark.sql.cassandra") .options(table="kv", keyspace="test") .load().show() You can tell it's Python because of my need to escape line ends. Pure Python in PySpark. PySpark Dataframes! https://github.com/datastax/spark-cassandra-connector/blob/master/doc/15_python.md
  • 40. Pyspark and Dataframes Also Supported 40© 2015. All Rights Reserved. https://github.com/datastax/spark-­‐cassandra-­‐connector/blob/master/doc/15_python.md  sqlContext.read          .format("org.apache.spark.sql.cassandra")          .options(table="kv",  keyspace="test")          .load().show() You can tell it's python because of my need to escape line ends Pure Python in Pyspark PySpark Dataframes! SparkR Also Works with Cassandra Dataframes!
  • 41. Repartition by Cassandra Replica 41© 2015. All Rights Reserved. Repartition any RDD to get Data Locality to C*! https://github.com/datastax/spark-­‐cassandra-­‐connector/blob/master/doc/2_loading.md 1955 1985 2015 RDD Spark Partitions Located on Different Nodes than Their Respective C* Data
  • 42. Repartition by Cassandra Replica 42© 2015. All Rights Reserved. Repartition any RDD to get Data Locality to C*! https://github.com/datastax/spark-­‐cassandra-­‐connector/blob/master/doc/2_loading.md 1955 1985 2015
  • 43. Repartition by Cassandra Replica 43© 2015. All Rights Reserved. Repartition any RDD to get Data Locality to C*! https://github.com/datastax/spark-­‐cassandra-­‐connector/blob/master/doc/2_loading.md 1955 1985 2015 mailboxesToCheck      .repartitionByCassandraReplica("important",  "letters",  10)
  • 44. JoinWithCassandraTable pulls specific Partition Keys From Cassandra 44© 2015. All Rights Reserved. mailboxesToCheck .repartitionByCassandraReplica("important", "letters", 10) .joinWithCassandraTable("important", "letters") https://github.com/datastax/spark-cassandra-connector/blob/master/doc/2_loading.md [diagram: several thousand mailbox keys spread across Node1, Node2, Node3, Node4] CREATE TABLE important.letters
    (  mailbox  int,            touser  text,            fromuser  text,            body  text,            PRIMARY  KEY  ((mailbox),  touser,  fromuser));
  • 45. JoinWithCassandraTable pulls specific Partition Keys From Cassandra 45© 2015. All Rights Reserved. mailboxesToCheck .repartitionByCassandraReplica("important", "letters", 10) .joinWithCassandraTable("important", "letters") https://github.com/datastax/spark-cassandra-connector/blob/master/doc/2_loading.md [diagram: mailbox keys grouped on Node1, Node2, Node3, Node4 next to the partitions that own them] Repartition places our keys local to the data they will retrieve CREATE TABLE important.letters
    (  mailbox  int,            touser  text,            fromuser  text,            body  text,            PRIMARY  KEY  ((mailbox),  touser,  fromuser));
  • 46. JoinWithCassandraTable pulls specific Partition Keys From Cassandra 46© 2015. All Rights Reserved. mailboxesToCheck .repartitionByCassandraReplica("important", "letters", 10) .joinWithCassandraTable("important", "letters") https://github.com/datastax/spark-cassandra-connector/blob/master/doc/2_loading.md [diagram: mailbox keys on Node1, Node2, Node3, Node4] The Join then retrieves the rows in parallel CREATE TABLE important.letters
    (  mailbox  int,            touser  text,            fromuser  text,            body  text,            PRIMARY  KEY  ((mailbox),  touser,  fromuser));
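What `repartitionByCassandraReplica` accomplishes in the slides above can be sketched in plain Scala: group the lookup keys by the node that owns each key's token, so the subsequent `joinWithCassandraTable` reads node-locally. The ring and hash are toy stand-ins (`ReplicaGrouping` is hypothetical; the connector uses real replica metadata and Murmur3 tokens, and accounts for replication factor):

```scala
// Toy sketch: bucket mailbox keys by the node owning their token range.
object ReplicaGrouping {
  final case class TokenRange(start: Long, end: Long, node: String)
  val ring = Seq(
    TokenRange(1, 10000, "Node1"),     TokenRange(10001, 20000, "Node2"),
    TokenRange(20001, 30000, "Node3"), TokenRange(30001, 40000, "Node4"))

  def token(mailbox: Int): Long = (math.abs(mailbox.toLong * 2654435761L) % 40000L) + 1
  def nodeFor(mailbox: Int): String = {
    val t = token(mailbox)
    ring.find(r => t >= r.start && t <= r.end).get.node
  }
  // Every key lands in exactly one node's group; the join then runs per group.
  def groupByReplica(mailboxes: Seq[Int]): Map[String, Seq[Int]] =
    mailboxes.groupBy(nodeFor)
}
```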
  • 47. Manual Driver Sessions are available! 47© 2015. All Rights Reserved. https://github.com/datastax/spark-­‐cassandra-­‐connector/blob/master/doc/1_connecting.md import  com.datastax.spark.connector.cql.CassandraConnector   CassandraConnector(conf).withSessionDo  {  session  =>      session.execute("CREATE  KEYSPACE  test2  WITH  REPLICATION  =  {'class':  'SimpleStrategy',  'replication_factor':  1  }")      session.execute("CREATE  TABLE  test2.words  (word  text  PRIMARY  KEY,  count  int)")   }
  • 48. Any Connections Made through CassandraConnector will use a Connection pool (even remotely!) 48© 2015. All Rights Reserved. https://github.com/datastax/spark-­‐cassandra-­‐connector/blob/master/doc/1_connecting.md CassandraConnector(conf).withSessionDo  {} Gains a handle on a running Cluster object made with Configuration conf Executor Thread 2 Executor Thread 3 Executor Thread1 Executor JVM Cassandra Connection Pool
  • 49. Cassandra Connection Pool Any Connections Made through CassandraConnector will use a Connection pool (even remotely!) 49© 2015. All Rights Reserved. https://github.com/datastax/spark-­‐cassandra-­‐connector/blob/master/doc/1_connecting.md Multiple threads/executor cores will end up using the same Connection Executor Thread 2 Executor Thread 3 Executor JVM Cluster CassandraConnector(conf).withSessionDo  {} Executor Thread1
  • 50. Cassandra Connector can be used in Closures and Prepared Statements will be Cached as well 50© 2015. All Rights Reserved. https://github.com/datastax/spark-­‐cassandra-­‐connector/blob/master/doc/1_connecting.md rdd.mapPartitions{  it  =>  CassandraConnector.withSessionDo(  session  =>  ps  =  session.prepare(query)  )  } Reference to already created prepared statement will be used if available Cassandra Connection Pool Executor Thread 2 Executor Thread 3 Executor JVM Cluster Prepared Statement CacheExecutor Thread1
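The pooling behavior on slides 48-50 can be mimicked with a small sketch: one shared session per configuration per JVM, plus a cache so each query string is prepared only once. `Session` and `PreparedStatement` below are stand-ins for the Java driver's classes, not its API:

```scala
import scala.collection.concurrent.TrieMap

// Toy model of CassandraConnector's pooling: shared session per conf,
// prepared-statement cache shared by all threads using that session.
object ConnectionPoolSketch {
  final case class PreparedStatement(query: String)

  final class Session {
    var prepares = 0                                     // counts real prepare round-trips
    private val cache = TrieMap.empty[String, PreparedStatement]
    def prepare(q: String): PreparedStatement =
      cache.getOrElseUpdate(q, { prepares += 1; PreparedStatement(q) })
  }

  private val sessions = TrieMap.empty[String, Session]  // keyed by configuration
  def withSessionDo[T](conf: String)(body: Session => T): T =
    body(sessions.getOrElseUpdate(conf, new Session))
}
```

Two `withSessionDo` calls with the same configuration hand back the same session, and repeated `prepare` calls for the same query hit the cache instead of the server, which is the point of the slide's `mapPartitions` example.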
  • 51. What is the Future of the Spark Cassandra Connector? 51© 2015. All Rights Reserved.
  • 52. You! 52© 2015. All Rights Reserved. The more people that contribute to the project the better it will become! We welcome any contributions or just send us a letter on the mailing list! https://github.com/datastax/spark-­‐cassandra-­‐connector/blob/master/doc/FAQ.md#can-­‐i-­‐contribute-­‐to-­‐the-­‐spark-­‐cassandra-­‐connector
  • 53. Spark Packages! 53© 2015. All Rights Reserved. http://spark-packages.org/package/datastax/spark-cassandra-connector
  • 54. Update Even Faster to New Spark Versions 54© 2015. All Rights Reserved. We'll be testing against Spark Release Candidates in the future so that we can have a compatible Spark Cassandra Connector out the moment an official Spark Release is ready!
  • 55. Even better Dataframes 55© 2015. All Rights Reserved. Automatic integration of repartitionByCassandraReplica and joinWithCassandraTable: make any join against a Cassandra table automatically detected and, if possible, converted to a joinWithCassandraTable call, with no need to manually determine when you should or shouldn't use the method. Create Cassandra tables from Dataframes automatically: currently all tables need to have been created in C* prior to saving; we'd like users to be able to specify what kind of key they would like on their C* table and have it generated automatically on data frame writes.
  • 56. Improve Spark-Cassandra-Stress 56© 2015. All Rights Reserved. https://github.com/datastax/spark-­‐cassandra-­‐stress Open source tool which lets you test maximum throughput of your cluster with Spark and C* • Write Tests • Read Tests • Streaming Tests Includes!