After a brief technical introduction to Apache Cassandra, we'll move into the exciting world of Apache Spark integration and learn how to turn your transactional datastore into an analytics platform. Apache Spark has taken the Hadoop world by storm (no pun intended!) and is widely seen as the replacement for Hadoop MapReduce. Spark and Cassandra are perfect allies: Cassandra handles the distributed data storage, Spark handles the distributed computation.
17. Attaching to Spark and Cassandra
// Import Cassandra-specific functions on SparkContext and RDD objects
import org.apache.spark.{SparkContext, SparkConf}
import com.datastax.spark.connector._
/** setMaster("local[*]") lets us run & test the job right in our IDE */
val conf = new SparkConf(true)
  .set("spark.cassandra.connection.host", "127.0.0.1")
  .setMaster("local[*]")
  .setAppName(getClass.getName)
  // Optionally, if authentication is enabled
  .set("spark.cassandra.auth.username", "cassandra")
  .set("spark.cassandra.auth.password", "cassandra")
val sc = new SparkContext(conf)
18. Comment table example
CREATE TABLE comments_by_video (
    videoid uuid,
    commentid timeuuid,
    userid uuid,
    comment text,
    PRIMARY KEY (videoid, commentid)
) WITH CLUSTERING ORDER BY (commentid DESC);
19. Simple example
/** keyspace & table */
val tableRDD = sc.cassandraTable("killrvideo", "comments_by_video")

/** get a simple count of all the rows in the comments_by_video table */
val rowCount = tableRDD.count()
println(s"Total Rows in Comments Table: $rowCount")
sc.stop()
20. Simple example
[Diagram: the Spark Connector runs `SELECT * FROM killrvideo.comments_by_video` on each executor; the results become the partitions of a Spark RDD]
21. Using CQL
SELECT userid
FROM comments_by_video
WHERE videoid = 01860584-de45-018f-12be-5f81704e8033
// The where clause is pushed down to Cassandra; bind the uuid as a UUID value
val cqlRDD = sc.cassandraTable("killrvideo", "comments_by_video")
  .select("userid")
  .where("videoid = ?",
    java.util.UUID.fromString("01860584-de45-018f-12be-5f81704e8033"))
22. spark-sql> SELECT cast(videoid AS string) videoid, count(*) c
FROM comments_by_video
GROUP BY cast(videoid AS string)
ORDER BY c DESC
LIMIT 10;
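The GROUP BY / ORDER BY / LIMIT logic of that query can be sketched with plain Scala collections, no cluster required. The videoid values and counts below are made up purely for illustration:

```scala
// Hypothetical in-memory stand-in for comments_by_video rows: (videoid, userid)
val comments = Seq(
  ("vid-a", "u1"), ("vid-a", "u2"), ("vid-a", "u3"),
  ("vid-b", "u4"), ("vid-b", "u5"),
  ("vid-c", "u6")
)

// GROUP BY videoid, count(*) c, ORDER BY c DESC, LIMIT 10
val top10 = comments
  .groupBy { case (videoid, _) => videoid }       // GROUP BY videoid
  .map { case (videoid, rows) => (videoid, rows.size) } // count(*)
  .toSeq
  .sortBy { case (_, c) => -c }                   // ORDER BY c DESC
  .take(10)                                       // LIMIT 10

top10.foreach { case (videoid, c) => println(s"$videoid -> $c") }
```

On a real cluster the same shape of computation is distributed across Spark partitions, but the per-key counting and final top-N ordering are identical.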
23. Saving back to Cassandra
// Create insert data. The table's primary key is (videoid, commentid),
// so every row must include a commentid (a timeuuid) as well.
import com.datastax.driver.core.utils.UUIDs

val collection = sc.parallelize(Seq(
  ("01860584-de45-018f-12be-5f81704e8033", UUIDs.timeBased(),
    "Great video", "cdaf6bd5-8914-29e0-f0b6-8b0bc6156777"),
  ("01860584-de45-018f-12be-5f81704e8033", UUIDs.timeBased(),
    "Hated it", "cdaf6bd5-8914-29e0-f0b6-8b0bc6156777")))

// Insert data into the table
collection.saveToCassandra("killrvideo", "comments_by_video",
  SomeColumns("videoid", "commentid", "comment", "userid"))
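Since comments_by_video's primary key is (videoid, commentid), every saved row needs both values. Instead of bare tuples, rows can also be modeled as a case class whose field names match the table's columns; the Comment type below is a hypothetical sketch, not part of the connector API:

```scala
import java.util.UUID

// Hypothetical row type mirroring comments_by_video. In the real table,
// commentid is a timeuuid (version 1 UUID), so use a time-based generator
// such as the Java driver's UUIDs.timeBased() when actually writing.
case class Comment(videoid: UUID, commentid: UUID, userid: UUID, comment: String)

val row = Comment(
  videoid   = UUID.fromString("01860584-de45-018f-12be-5f81704e8033"),
  commentid = UUID.randomUUID(), // stand-in for a timeuuid
  userid    = UUID.fromString("cdaf6bd5-8914-29e0-f0b6-8b0bc6156777"),
  comment   = "Great video"
)

// An RDD of case classes maps to columns by field name, e.g.:
// sc.parallelize(Seq(row)).saveToCassandra("killrvideo", "comments_by_video")
println(row.videoid)
```

Field-name mapping avoids the positional SomeColumns bookkeeping of the tuple version, at the cost of defining a small row type.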
24. Searching with solr_query (DSE Search)
val solrQueryRDD = sc.cassandraTable("killrvideo", "videos")
  .select("name")
  .where("solr_query = 'tags:crime*'")
solrQueryRDD.collect().foreach(row => println(row.getString("name")))