Más contenido relacionado La actualidad más candente (20) Similar a Cassandra + Spark (You’ve got the lighter, let’s start a fire) (20) Cassandra + Spark (You’ve got the lighter, let’s start a fire)1. You’ve got the lighter, let’s start a fire:
Spark and Cassandra
Robert Stupp
Solutions Architect @ DataStax, Committer to Apache Cassandra
@snazy
3. KillrPowr Application
© 2016 DataStax,All Rights Reserved. 3
Show current power
consumption in Germany
in real time
• Power consumption of all
cities on a map
• Graph of data for one city
8. Apache Spark
Distributed & resilient analytics
“Spark lets you run programs up to 100x faster in memory,
or 10x faster on disk, than Hadoop.”
9. Spark : Simplify the…
© 2016 DataStax,All Rights Reserved. 9
challenging and compute-intensive task of processing
high volumes of real-time or archived data,
both structured and unstructured,
seamlessly integrating relevant
complex capabilities
such as machine learning and
graph algorithms.
Source:http://www.toptal.com/spark/introduction-to-apache-spark
10. Spark Use Cases
© 2016 DataStax,All Rights Reserved. 10
• Event detection
• Behavior / anomalies detection
• Fraud & intrusion detection
• Targeted advertising
• E-commerce transaction processing
• Recommendations
11. Spark Use Cases
© 2016 DataStax,All Rights Reserved. 11
• Data migration
• Ad hoc queries
• Business Intelligence
14. Spark RDDs
© 2016 DataStax,All Rights Reserved. 14
Resilient Distributed Datasets
• Fault-tolerant collection of elements
• Can be operated on in parallel
• Immutable
• In-Memory (as much as possible)
18. Spark Actions
© 2016 DataStax,All Rights Reserved. 18
Return a value after running a computation
20. Spark Example (word count)
© 2016 DataStax,All Rights Reserved. 20
Given is a text file with movies and their categories.
We want to know the occurences of each genre.
Pirates of the Carribbean,Adventure,Action,Fantasy
Star Trek I,SciFi,Adventure,Action
Die Hard,Action,Thriller
21. Why Scala
© 2016 DataStax,All Rights Reserved. 21
Scala is a functional language
Analytics is a functional “problem“
So – use a functional language
22. Spark Example (word count)
© 2016 DataStax,All Rights Reserved. 22
sc.textFile("file:///home/videos.csv")
.flatMap(line => line.split(",").drop(1))
.map(genre => (genre,1))
.reduceByKey{case (x,y) => x + y}
.collect()
.foreach(println)
23. Spark Example (word count)
© 2016 DataStax,All Rights Reserved. 23
sc.textFile("file:///home/videos.csv")
.flatMap(line => line.split(",").drop(1))
.map(genre => (genre,1))
.reduceByKey{case (x,y) => x + y}
.collect()
.foreach(println)
Anonymous Scala Function
24. Spark Example (word count)
© 2016 DataStax,All Rights Reserved. 24
sc.textFile("file:///home/videos.csv")
.flatMap(line => line.split(",").drop(1))
.map(genre => (genre,1))
.reduceByKey{case (x,y) => x + y}
.collect()
.foreach(println)
Flat the collection
25. Spark Example (word count)
© 2016 DataStax,All Rights Reserved. 25
sc.textFile("file:///home/videos.csv")
.flatMap(line => line.split(",").drop(1))
.map(genre => (genre,1))
.reduceByKey{case (x,y) => x + y}
.collect()
.foreach(println)Pair RDD
26. Spark Example (word count)
© 2016 DataStax,All Rights Reserved. 26
sc.textFile("file:///home/videos.csv")
.flatMap(line => line.split(",").drop(1))
.map(genre => (genre,1))
.reduceByKey{case (x,y) => x + y}
.collect()
.foreach(println)
Values ofTWO Pair RDDs
27. Spark Example (word count)
© 2016 DataStax,All Rights Reserved. 27
sc.textFile("file:///home/videos.csv")
.flatMap(line => line.split(",").drop(1))
.map(genre => (genre,1))
.reduceByKey{case (x,y) => x + y}
.collect()
.foreach(println)
28. Spark Example (word count)
© 2016 DataStax,All Rights Reserved. 28
sc.textFile("file:///home/videos.csv")
.flatMap(line => line.split(",").drop(1))
.map(genre => (genre,1))
.reduceByKey{case (x,y) => x + y}
.collect()
.foreach(println)
31. Spark Batch Processing
© 2016 DataStax,All Rights Reserved. 31
Other REALLY useful stuff like
• Re-partition data & processing
• Allows efficient use of Cassandra
33. Spark SQL
© 2016 DataStax,All Rights Reserved. 33
csc.sql(" SELECT COUNT(*) AS total " +
" FROM killr_video.movies_by_actor " +
" WHERE actor = 'Johnny Depp' ")
.show
+-----+
|total|
+-----+
| 54|
+-----+
34. Spark Data Frames
© 2016 DataStax,All Rights Reserved. 34
val movies = csc
.read
.format("org.apache.spark.sql.cassandra")
.options(Map( "keyspace" -> "killr_video",
"table" -> "movies_by_actor" ))
.load
movies.filter("actor = 'Johnny Depp'")
.agg(Map("*" -> "count"))
.withColumnRenamed("COUNT(1)", "total")
.show
Same as:
SELECT COUNT(*) AS total
FROM killr_video.movies_by_actor
WHERE actor = 'Johnny Depp'
35. Spark Data Frames
© 2016 DataStax,All Rights Reserved. 35
Data frame is RDD that has named columns
• Has schema
• Has rows
• Has columns
• Has data types
37. Spark Streaming
© 2016 DataStax,All Rights Reserved. 37
Continuous Micro Batching
• Collect data of time slides
• Compute on that data
41. Spark Streaming Windows
© 2016 DataStax,All Rights Reserved. 41
time
Slide Slide Slide Slide Slide Slide Slide Slide Slide Slide Slide
00:00:00 00:00:01 00:00:02 00:00:03 00:00:04 00:00:05 00:00:06 00:00:07 00:00:08 00:00:09 00:00:10
Window
Window
Window
Window
Window
Window
Window
Window
Window
Process data for this slide –
i.e. 1 second in this example
Process data for this slide –
i.e. 1 second in this example
Process data for this slide –
i.e. 1 second in this example
Time window ~=
Grouping data by time
44. Kafka
© 2016 DataStax,All Rights Reserved. 44
Kafka Cluster
Producer
Producer
Producer
Consumer
Consumer
Kafka Node
Kafka Node
Kafka Node
45. Kafka Use-Cases
© 2016 DataStax,All Rights Reserved. 45
• Messaging
• Website activity tracking
• Metrics
• Log aggregation
• Event sourcing
• Stream processing
46. Common Characteristics
© 2016 DataStax,All Rights Reserved. 46
Kafka, Spark & Cassandra are
• Scalable
• Distributed
• Partitioned
• Replicated
47. KillrPowr
© 2016 DataStax,All Rights Reserved. 47
Show current power consumption in Germany
in real time
• Power consumption of all cities on a map
• Graph of data for a selected city
48. KillrPowr – Data modeling 101
© 2016 DataStax,All Rights Reserved. 48
1. Collect UI Requirements
2. Model data (flow) per web UI request
3. Build Spark Streaming code to generate that data
4. Connect sensors and Spark Streaming using Kafka
49. KillrPowr – Tech Stack
© 2016 DataStax,All Rights Reserved. 49
Power Sensor
Power Sensor
Kafka
Spark
Streaming
Cassandra
Web Site
Queue data
Aggregate data
and save
for web UI
50. KillrPowr – Tech Stack
© 2016 DataStax,All Rights Reserved. 50
Techs used in the demo
• Spray (HTTP for Akka Actors)
• AngularJS
• Kafka
• Spark Streaming
• Cassandra
51. Tips
© 2016 DataStax,All Rights Reserved. 51
• Push down predicates to
Cassandra
• Partition your data “right”
• Think distributed
52. Cassandra + Spark in DSE
© 2016 DataStax,All Rights Reserved. 52
DSE integrates
Spark + Cassandra
https://academy.datastax.com/
courses/getting-started-
apache-spark
54. You’ve got the lighter, let’s start a fire:
Spark and Cassandra
Robert Stupp
Solutions Architect @ DataStax, Committer to Apache Cassandra
@snazy
57. BACKUP - DDL
© 2014 DataStax, All Rights Reserved. Company Confidential 57
58. BACKUP - DDL
© 2014 DataStax, All Rights Reserved. Company Confidential 58
59. BACKUP - DDL
© 2014 DataStax, All Rights Reserved. Company Confidential 59
60. BACKUP - DDL
© 2014 DataStax, All Rights Reserved. Company Confidential 60
61. BACKUP – Spark Streaming
© 2014 DataStax, All Rights Reserved. Company Confidential 61
62. BACKUP – Spark Streaming
© 2014 DataStax, All Rights Reserved. Company Confidential 62
63. BACKUP – Spark Streaming
© 2014 DataStax, All Rights Reserved. Company Confidential 63
64. BACKUP – Spark Streaming
© 2014 DataStax, All Rights Reserved. Company Confidential 64
65. BACKUP – Spark Streaming
© 2014 DataStax, All Rights Reserved. Company Confidential 65
66. BACKUP – Spark Streaming
© 2014 DataStax, All Rights Reserved. Company Confidential 66