SlideShare una empresa de Scribd logo
1 de 67
Descargar para leer sin conexión
You’ve got the lighter, let’s start a fire:
Spark and Cassandra
Robert Stupp
Solutions Architect @ DataStax, Committer to Apache Cassandra
@snazy
A journey through
Distributed Computing
© 2016 DataStax,All Rights Reserved. 2
KillrPowr Application
© 2016 DataStax,All Rights Reserved. 3
Show current power
consumption in Germany
in real time
• Power consumption of all
cities on a map
• Graph of data for one city
One word…
© 2016 DataStax,All Rights Reserved. 4
Apache Cassandra
Distributed & resilient database
Cassandra Use-Cases
© 2016 DataStax,All Rights Reserved. 6
• Time Series
• Personalization
• (or a mix of both)
Cassandra?
© 2016 DataStax,All Rights Reserved. 7
• Joining data … hmm
• Analyzing the whole data set … hmm
• Ad-Hoc Queries (SQL) … hmm
Apache Spark
Distributed & resilient analytics
“Spark lets you run programs up to 100x faster in memory,
or 10x faster on disk, than Hadoop.”
Spark : Simplify the…
© 2016 DataStax,All Rights Reserved. 9
challenging and compute-intensive task of processing
high volumes of real-time or archived data,
both structured and unstructured,
seamlessly integrating relevant
complex capabilities
such as machine learning and
graph algorithms.
Source:http://www.toptal.com/spark/introduction-to-apache-spark
Spark Use Cases
© 2016 DataStax,All Rights Reserved. 10
• Event detection
• Behavior / anomalies detection
• Fraud & intrusion detection
• Targeted advertising
• E-commerce transaction processing
• Recommendations
Spark Use Cases
© 2016 DataStax,All Rights Reserved. 11
• Data migration
• Ad hoc queries
• Business Intelligence
Spark
© 2016 DataStax,All Rights Reserved. 12
Spark Architecture
© 2016 DataStax,All Rights Reserved. 13
Spark RDDs
© 2016 DataStax,All Rights Reserved. 14
Resilient Distributed Datasets
• Fault-tolerant collection of elements
• Can be operated on in parallel
• Immutable
• In-Memory (as much as possible)
Spark RDD Partitioning
© 2016 DataStax,All Rights Reserved. 15
Spark RDD Resiliency
© 2016 DataStax,All Rights Reserved. 16
Spark Transformations
© 2016 DataStax,All Rights Reserved. 17
Create a new data set from an existing one
Spark Actions
© 2016 DataStax,All Rights Reserved. 18
Return a value after running a computation
Spark DAG
© 2016 DataStax,All Rights Reserved. 19
Spark Example (word count)
© 2016 DataStax,All Rights Reserved. 20
Given is a text file with movies and their categories.
We want to know the occurences of each genre.
Pirates of the Carribbean,Adventure,Action,Fantasy
Star Trek I,SciFi,Adventure,Action
Die Hard,Action,Thriller
Why Scala
© 2016 DataStax,All Rights Reserved. 21
Scala is a functional language
Analytics is a functional “problem“
So – use a functional language
Spark Example (word count)
© 2016 DataStax,All Rights Reserved. 22
sc.textFile("file:///home/videos.csv")
.flatMap(line => line.split(",").drop(1))
.map(genre => (genre,1))
.reduceByKey{case (x,y) => x + y}
.collect()
.foreach(println)
Spark Example (word count)
© 2016 DataStax,All Rights Reserved. 23
sc.textFile("file:///home/videos.csv")
.flatMap(line => line.split(",").drop(1))
.map(genre => (genre,1))
.reduceByKey{case (x,y) => x + y}
.collect()
.foreach(println)
Anonymous Scala Function
Spark Example (word count)
© 2016 DataStax,All Rights Reserved. 24
sc.textFile("file:///home/videos.csv")
.flatMap(line => line.split(",").drop(1))
.map(genre => (genre,1))
.reduceByKey{case (x,y) => x + y}
.collect()
.foreach(println)
Flat the collection
Spark Example (word count)
© 2016 DataStax,All Rights Reserved. 25
sc.textFile("file:///home/videos.csv")
.flatMap(line => line.split(",").drop(1))
.map(genre => (genre,1))
.reduceByKey{case (x,y) => x + y}
.collect()
.foreach(println)Pair RDD
Spark Example (word count)
© 2016 DataStax,All Rights Reserved. 26
sc.textFile("file:///home/videos.csv")
.flatMap(line => line.split(",").drop(1))
.map(genre => (genre,1))
.reduceByKey{case (x,y) => x + y}
.collect()
.foreach(println)
Values ofTWO Pair RDDs
Spark Example (word count)
© 2016 DataStax,All Rights Reserved. 27
sc.textFile("file:///home/videos.csv")
.flatMap(line => line.split(",").drop(1))
.map(genre => (genre,1))
.reduceByKey{case (x,y) => x + y}
.collect()
.foreach(println)
Spark Example (word count)
© 2016 DataStax,All Rights Reserved. 28
sc.textFile("file:///home/videos.csv")
.flatMap(line => line.split(",").drop(1))
.map(genre => (genre,1))
.reduceByKey{case (x,y) => x + y}
.collect()
.foreach(println)
Cassandra-Spark-Driver
Spark-Cassandra-Connector
© 2016 DataStax,All Rights Reserved. 30
• Scala, Java
• Open Source (ASF2)
• Developed by DataStax and community guys
https://github.com/datastax/
spark-cassandra-connector
Spark Batch Processing
© 2016 DataStax,All Rights Reserved. 31
Other REALLY useful stuff like
• Re-partition data & processing
• Allows efficient use of Cassandra
SQL and Data Frames
Spark SQL
© 2016 DataStax,All Rights Reserved. 33
csc.sql(" SELECT COUNT(*) AS total " +
" FROM killr_video.movies_by_actor " +
" WHERE actor = 'Johnny Depp' ")
.show
+-----+
|total|
+-----+
| 54|
+-----+
Spark Data Frames
© 2016 DataStax,All Rights Reserved. 34
val movies = csc
.read
.format("org.apache.spark.sql.cassandra")
.options(Map( "keyspace" -> "killr_video",
"table" -> "movies_by_actor" ))
.load
movies.filter("actor = 'Johnny Depp'")
.agg(Map("*" -> "count"))
.withColumnRenamed("COUNT(1)", "total")
.show
Same as:
SELECT COUNT(*) AS total
FROM killr_video.movies_by_actor
WHERE actor = 'Johnny Depp'
Spark Data Frames
© 2016 DataStax,All Rights Reserved. 35
Data frame is RDD that has named columns
• Has schema
• Has rows
• Has columns
• Has data types
Spark Straming
Spark Streaming
© 2016 DataStax,All Rights Reserved. 37
Continuous Micro Batching
• Collect data of time slides
• Compute on that data
Spark Streaming
© 2016 DataStax,All Rights Reserved. 38
Spark Streaming Architecture
Confid© 2016 DataStax,All Rights Reserved. 39
Spark Streaming
© 2016 DataStax,All Rights Reserved. 40
Spark Streaming Windows
© 2016 DataStax,All Rights Reserved. 41
time
Slide Slide Slide Slide Slide Slide Slide Slide Slide Slide Slide
00:00:00 00:00:01 00:00:02 00:00:03 00:00:04 00:00:05 00:00:06 00:00:07 00:00:08 00:00:09 00:00:10
Window
Window
Window
Window
Window
Window
Window
Window
Window
Process data for this slide –
i.e. 1 second in this example
Process data for this slide –
i.e. 1 second in this example
Process data for this slide –
i.e. 1 second in this example
Time window ~=
Grouping data by time
Apache Kafka
Distributed & resilient messaging
Kafka
© 2016 DataStax,All Rights Reserved. 43
Publish-subscribe messaging
rethought as a distributed commit log
Kafka
© 2016 DataStax,All Rights Reserved. 44
Kafka Cluster
Producer
Producer
Producer
Consumer
Consumer
Kafka Node
Kafka Node
Kafka Node
Kafka Use-Cases
© 2016 DataStax,All Rights Reserved. 45
• Messaging
• Website activity tracking
• Metrics
• Log aggregation
• Event sourcing
• Stream processing
Common Characteristics
© 2016 DataStax,All Rights Reserved. 46
Kafka, Spark & Cassandra are
• Scalable
• Distributed
• Partitioned
• Replicated
KillrPowr
© 2016 DataStax,All Rights Reserved. 47
Show current power consumption in Germany
in real time
• Power consumption of all cities on a map
• Graph of data for a selected city
KillrPowr – Data modeling 101
© 2016 DataStax,All Rights Reserved. 48
1. Collect UI Requirements
2. Model data (flow) per web UI request
3. Build Spark Streaming code to generate that data
4. Connect sensors and Spark Streaming using Kafka
KillrPowr – Tech Stack
© 2016 DataStax,All Rights Reserved. 49
Power Sensor
Power Sensor
Kafka
Spark
Streaming
Cassandra
Web Site
Queue data
Aggregate data
and save
for web UI
KillrPowr – Tech Stack
© 2016 DataStax,All Rights Reserved. 50
Techs used in the demo
• Spray (HTTP for Akka Actors)
• AngularJS
• Kafka
• Spark Streaming
• Cassandra
Tips
© 2016 DataStax,All Rights Reserved. 51
• Push down predicates to
Cassandra
• Partition your data “right”
• Think distributed
Cassandra + Spark in DSE
© 2016 DataStax,All Rights Reserved. 52
DSE integrates
Spark + Cassandra
https://academy.datastax.com/
courses/getting-started-
apache-spark
KillrPowr Live Demo
© 2016 DataStax,All Rights Reserved. 53
You’ve got the lighter, let’s start a fire:
Spark and Cassandra
Robert Stupp
Solutions Architect @ DataStax, Committer to Apache Cassandra
@snazy
Thank You
© 2016 DataStax,All Rights Reserved. 55
BACKUP SLIDES
Confidential 56
BACKUP - DDL
© 2014 DataStax, All Rights Reserved. Company Confidential 57
BACKUP - DDL
© 2014 DataStax, All Rights Reserved. Company Confidential 58
BACKUP - DDL
© 2014 DataStax, All Rights Reserved. Company Confidential 59
BACKUP - DDL
© 2014 DataStax, All Rights Reserved. Company Confidential 60
BACKUP – Spark Streaming
© 2014 DataStax, All Rights Reserved. Company Confidential 61
BACKUP – Spark Streaming
© 2014 DataStax, All Rights Reserved. Company Confidential 62
BACKUP – Spark Streaming
© 2014 DataStax, All Rights Reserved. Company Confidential 63
BACKUP – Spark Streaming
© 2014 DataStax, All Rights Reserved. Company Confidential 64
BACKUP – Spark Streaming
© 2014 DataStax, All Rights Reserved. Company Confidential 65
BACKUP – Spark Streaming
© 2014 DataStax, All Rights Reserved. Company Confidential 66
© 2014 DataStax, All Rights Reserved. Company Confidential 67

Más contenido relacionado

La actualidad más candente

Anomaly Detection with Apache Spark
Anomaly Detection with Apache SparkAnomaly Detection with Apache Spark
Anomaly Detection with Apache Spark
Cloudera, Inc.
 
Simplifying Big Data Analytics with Apache Spark
Simplifying Big Data Analytics with Apache SparkSimplifying Big Data Analytics with Apache Spark
Simplifying Big Data Analytics with Apache Spark
Databricks
 
Spark Summit EU talk by Miklos Christine paddling up the stream
Spark Summit EU talk by Miklos Christine paddling up the streamSpark Summit EU talk by Miklos Christine paddling up the stream
Spark Summit EU talk by Miklos Christine paddling up the stream
Spark Summit
 

La actualidad más candente (20)

Spark + Cassandra = Real Time Analytics on Operational Data
Spark + Cassandra = Real Time Analytics on Operational DataSpark + Cassandra = Real Time Analytics on Operational Data
Spark + Cassandra = Real Time Analytics on Operational Data
 
Engineering fast indexes
Engineering fast indexesEngineering fast indexes
Engineering fast indexes
 
Anomaly Detection with Apache Spark
Anomaly Detection with Apache SparkAnomaly Detection with Apache Spark
Anomaly Detection with Apache Spark
 
Spark Cassandra Connector Dataframes
Spark Cassandra Connector DataframesSpark Cassandra Connector Dataframes
Spark Cassandra Connector Dataframes
 
Lightning fast analytics with Cassandra and Spark
Lightning fast analytics with Cassandra and SparkLightning fast analytics with Cassandra and Spark
Lightning fast analytics with Cassandra and Spark
 
Combining Machine Learning Frameworks with Apache Spark
Combining Machine Learning Frameworks with Apache SparkCombining Machine Learning Frameworks with Apache Spark
Combining Machine Learning Frameworks with Apache Spark
 
Simplifying Big Data Analytics with Apache Spark
Simplifying Big Data Analytics with Apache SparkSimplifying Big Data Analytics with Apache Spark
Simplifying Big Data Analytics with Apache Spark
 
Making Nested Columns as First Citizen in Apache Spark SQL
Making Nested Columns as First Citizen in Apache Spark SQLMaking Nested Columns as First Citizen in Apache Spark SQL
Making Nested Columns as First Citizen in Apache Spark SQL
 
Spark Summit EU talk by Miklos Christine paddling up the stream
Spark Summit EU talk by Miklos Christine paddling up the streamSpark Summit EU talk by Miklos Christine paddling up the stream
Spark Summit EU talk by Miklos Christine paddling up the stream
 
Escape from Hadoop: Ultra Fast Data Analysis with Spark & Cassandra
Escape from Hadoop: Ultra Fast Data Analysis with Spark & CassandraEscape from Hadoop: Ultra Fast Data Analysis with Spark & Cassandra
Escape from Hadoop: Ultra Fast Data Analysis with Spark & Cassandra
 
Beyond SQL: Speeding up Spark with DataFrames
Beyond SQL: Speeding up Spark with DataFramesBeyond SQL: Speeding up Spark with DataFrames
Beyond SQL: Speeding up Spark with DataFrames
 
Spark Streaming Programming Techniques You Should Know with Gerard Maas
Spark Streaming Programming Techniques You Should Know with Gerard MaasSpark Streaming Programming Techniques You Should Know with Gerard Maas
Spark Streaming Programming Techniques You Should Know with Gerard Maas
 
Analytics with Cassandra & Spark
Analytics with Cassandra & SparkAnalytics with Cassandra & Spark
Analytics with Cassandra & Spark
 
Real time data processing with spark & cassandra @ NoSQLMatters 2015 Paris
Real time data processing with spark & cassandra @ NoSQLMatters 2015 ParisReal time data processing with spark & cassandra @ NoSQLMatters 2015 Paris
Real time data processing with spark & cassandra @ NoSQLMatters 2015 Paris
 
Analytics with Cassandra, Spark & MLLib - Cassandra Essentials Day
Analytics with Cassandra, Spark & MLLib - Cassandra Essentials DayAnalytics with Cassandra, Spark & MLLib - Cassandra Essentials Day
Analytics with Cassandra, Spark & MLLib - Cassandra Essentials Day
 
Hadoop + Cassandra: Fast queries on data lakes, and wikipedia search tutorial.
Hadoop + Cassandra: Fast queries on data lakes, and  wikipedia search tutorial.Hadoop + Cassandra: Fast queries on data lakes, and  wikipedia search tutorial.
Hadoop + Cassandra: Fast queries on data lakes, and wikipedia search tutorial.
 
Lightning fast analytics with Spark and Cassandra
Lightning fast analytics with Spark and CassandraLightning fast analytics with Spark and Cassandra
Lightning fast analytics with Spark and Cassandra
 
Scaling up data science applications
Scaling up data science applicationsScaling up data science applications
Scaling up data science applications
 
Write Graph Algorithms Like a Boss Andrew Ray
Write Graph Algorithms Like a Boss Andrew RayWrite Graph Algorithms Like a Boss Andrew Ray
Write Graph Algorithms Like a Boss Andrew Ray
 
Intro to py spark (and cassandra)
Intro to py spark (and cassandra)Intro to py spark (and cassandra)
Intro to py spark (and cassandra)
 

Similar a Cassandra + Spark (You’ve got the lighter, let’s start a fire)

Similar a Cassandra + Spark (You’ve got the lighter, let’s start a fire) (20)

Jump Start with Apache Spark 2.0 on Databricks
Jump Start with Apache Spark 2.0 on DatabricksJump Start with Apache Spark 2.0 on Databricks
Jump Start with Apache Spark 2.0 on Databricks
 
What’s New in Spark 2.0: Structured Streaming and Datasets - StampedeCon 2016
What’s New in Spark 2.0: Structured Streaming and Datasets - StampedeCon 2016What’s New in Spark 2.0: Structured Streaming and Datasets - StampedeCon 2016
What’s New in Spark 2.0: Structured Streaming and Datasets - StampedeCon 2016
 
Realtime Data Pipeline with Spark Streaming and Cassandra with Mesos (Rahul K...
Realtime Data Pipeline with Spark Streaming and Cassandra with Mesos (Rahul K...Realtime Data Pipeline with Spark Streaming and Cassandra with Mesos (Rahul K...
Realtime Data Pipeline with Spark Streaming and Cassandra with Mesos (Rahul K...
 
Streaming Big Data with Spark, Kafka, Cassandra, Akka & Scala (from webinar)
Streaming Big Data with Spark, Kafka, Cassandra, Akka & Scala (from webinar)Streaming Big Data with Spark, Kafka, Cassandra, Akka & Scala (from webinar)
Streaming Big Data with Spark, Kafka, Cassandra, Akka & Scala (from webinar)
 
Datastax day 2016 introduction to apache cassandra
Datastax day 2016   introduction to apache cassandraDatastax day 2016   introduction to apache cassandra
Datastax day 2016 introduction to apache cassandra
 
A Tale of Two APIs: Using Spark Streaming In Production
A Tale of Two APIs: Using Spark Streaming In ProductionA Tale of Two APIs: Using Spark Streaming In Production
A Tale of Two APIs: Using Spark Streaming In Production
 
Lambda Architecture with Spark, Spark Streaming, Kafka, Cassandra, Akka and S...
Lambda Architecture with Spark, Spark Streaming, Kafka, Cassandra, Akka and S...Lambda Architecture with Spark, Spark Streaming, Kafka, Cassandra, Akka and S...
Lambda Architecture with Spark, Spark Streaming, Kafka, Cassandra, Akka and S...
 
An introduction To Apache Spark
An introduction To Apache SparkAn introduction To Apache Spark
An introduction To Apache Spark
 
Scylla Summit 2016: Analytics Show Time - Spark and Presto Powered by Scylla
Scylla Summit 2016: Analytics Show Time - Spark and Presto Powered by ScyllaScylla Summit 2016: Analytics Show Time - Spark and Presto Powered by Scylla
Scylla Summit 2016: Analytics Show Time - Spark and Presto Powered by Scylla
 
Jump Start into Apache® Spark™ and Databricks
Jump Start into Apache® Spark™ and DatabricksJump Start into Apache® Spark™ and Databricks
Jump Start into Apache® Spark™ and Databricks
 
Intro to Apache Spark
Intro to Apache SparkIntro to Apache Spark
Intro to Apache Spark
 
Big Data Processing with .NET and Spark (SQLBits 2020)
Big Data Processing with .NET and Spark (SQLBits 2020)Big Data Processing with .NET and Spark (SQLBits 2020)
Big Data Processing with .NET and Spark (SQLBits 2020)
 
Lambda Architecture with Spark Streaming, Kafka, Cassandra, Akka, Scala
Lambda Architecture with Spark Streaming, Kafka, Cassandra, Akka, ScalaLambda Architecture with Spark Streaming, Kafka, Cassandra, Akka, Scala
Lambda Architecture with Spark Streaming, Kafka, Cassandra, Akka, Scala
 
Big data analytics with Spark & Cassandra
Big data analytics with Spark & Cassandra Big data analytics with Spark & Cassandra
Big data analytics with Spark & Cassandra
 
Spark Study Notes
Spark Study NotesSpark Study Notes
Spark Study Notes
 
20170126 big data processing
20170126 big data processing20170126 big data processing
20170126 big data processing
 
What's New in Apache Spark 2.3 & Why Should You Care
What's New in Apache Spark 2.3 & Why Should You CareWhat's New in Apache Spark 2.3 & Why Should You Care
What's New in Apache Spark 2.3 & Why Should You Care
 
Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...
Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...
Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...
 
Streaming ETL for All
Streaming ETL for AllStreaming ETL for All
Streaming ETL for All
 
GraphFrames Access Methods in DSE Graph
GraphFrames Access Methods in DSE GraphGraphFrames Access Methods in DSE Graph
GraphFrames Access Methods in DSE Graph
 

Último

+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
Health
 
CHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICECHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
9953056974 Low Rate Call Girls In Saket, Delhi NCR
 
introduction-to-automotive Andoid os-csimmonds-ndctechtown-2021.pdf
introduction-to-automotive Andoid os-csimmonds-ndctechtown-2021.pdfintroduction-to-automotive Andoid os-csimmonds-ndctechtown-2021.pdf
introduction-to-automotive Andoid os-csimmonds-ndctechtown-2021.pdf
VishalKumarJha10
 

Último (20)

W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...
W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...
W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...
 
VTU technical seminar 8Th Sem on Scikit-learn
VTU technical seminar 8Th Sem on Scikit-learnVTU technical seminar 8Th Sem on Scikit-learn
VTU technical seminar 8Th Sem on Scikit-learn
 
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
 
BUS PASS MANGEMENT SYSTEM USING PHP.pptx
BUS PASS MANGEMENT SYSTEM USING PHP.pptxBUS PASS MANGEMENT SYSTEM USING PHP.pptx
BUS PASS MANGEMENT SYSTEM USING PHP.pptx
 
The Top App Development Trends Shaping the Industry in 2024-25 .pdf
The Top App Development Trends Shaping the Industry in 2024-25 .pdfThe Top App Development Trends Shaping the Industry in 2024-25 .pdf
The Top App Development Trends Shaping the Industry in 2024-25 .pdf
 
Unveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
Unveiling the Tech Salsa of LAMs with Janus in Real-Time ApplicationsUnveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
Unveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
 
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdfLearn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
 
Direct Style Effect Systems - The Print[A] Example - A Comprehension Aid
Direct Style Effect Systems -The Print[A] Example- A Comprehension AidDirect Style Effect Systems -The Print[A] Example- A Comprehension Aid
Direct Style Effect Systems - The Print[A] Example - A Comprehension Aid
 
MarTech Trend 2024 Book : Marketing Technology Trends (2024 Edition) How Data...
MarTech Trend 2024 Book : Marketing Technology Trends (2024 Edition) How Data...MarTech Trend 2024 Book : Marketing Technology Trends (2024 Edition) How Data...
MarTech Trend 2024 Book : Marketing Technology Trends (2024 Edition) How Data...
 
Azure_Native_Qumulo_High_Performance_Compute_Benchmarks.pdf
Azure_Native_Qumulo_High_Performance_Compute_Benchmarks.pdfAzure_Native_Qumulo_High_Performance_Compute_Benchmarks.pdf
Azure_Native_Qumulo_High_Performance_Compute_Benchmarks.pdf
 
HR Software Buyers Guide in 2024 - HRSoftware.com
HR Software Buyers Guide in 2024 - HRSoftware.comHR Software Buyers Guide in 2024 - HRSoftware.com
HR Software Buyers Guide in 2024 - HRSoftware.com
 
CHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICECHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
 
Chinsurah Escorts ☎️8617697112 Starting From 5K to 15K High Profile Escorts ...
Chinsurah Escorts ☎️8617697112  Starting From 5K to 15K High Profile Escorts ...Chinsurah Escorts ☎️8617697112  Starting From 5K to 15K High Profile Escorts ...
Chinsurah Escorts ☎️8617697112 Starting From 5K to 15K High Profile Escorts ...
 
Pharm-D Biostatistics and Research methodology
Pharm-D Biostatistics and Research methodologyPharm-D Biostatistics and Research methodology
Pharm-D Biostatistics and Research methodology
 
Software Quality Assurance Interview Questions
Software Quality Assurance Interview QuestionsSoftware Quality Assurance Interview Questions
Software Quality Assurance Interview Questions
 
introduction-to-automotive Andoid os-csimmonds-ndctechtown-2021.pdf
introduction-to-automotive Andoid os-csimmonds-ndctechtown-2021.pdfintroduction-to-automotive Andoid os-csimmonds-ndctechtown-2021.pdf
introduction-to-automotive Andoid os-csimmonds-ndctechtown-2021.pdf
 
10 Trends Likely to Shape Enterprise Technology in 2024
10 Trends Likely to Shape Enterprise Technology in 202410 Trends Likely to Shape Enterprise Technology in 2024
10 Trends Likely to Shape Enterprise Technology in 2024
 
%in kempton park+277-882-255-28 abortion pills for sale in kempton park
%in kempton park+277-882-255-28 abortion pills for sale in kempton park %in kempton park+277-882-255-28 abortion pills for sale in kempton park
%in kempton park+277-882-255-28 abortion pills for sale in kempton park
 
%in Stilfontein+277-882-255-28 abortion pills for sale in Stilfontein
%in Stilfontein+277-882-255-28 abortion pills for sale in Stilfontein%in Stilfontein+277-882-255-28 abortion pills for sale in Stilfontein
%in Stilfontein+277-882-255-28 abortion pills for sale in Stilfontein
 
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
 

Cassandra + Spark (You’ve got the lighter, let’s start a fire)

  • 1. You’ve got the lighter, let’s start a fire: Spark and Cassandra Robert Stupp Solutions Architect @ DataStax, Committer to Apache Cassandra @snazy
  • 2. A journey through Distributed Computing © 2016 DataStax,All Rights Reserved. 2
  • 3. KillrPowr Application © 2016 DataStax,All Rights Reserved. 3 Show current power consumption in Germany in real time • Power consumption of all cities on a map • Graph of data for one city
  • 4. One word… © 2016 DataStax,All Rights Reserved. 4
  • 5. Apache Cassandra Distributed & resilient database
  • 6. Cassandra Use-Cases © 2016 DataStax,All Rights Reserved. 6 • Time Series • Personalization • (or a mix of both)
  • 7. Cassandra? © 2016 DataStax,All Rights Reserved. 7 • Joining data … hmm • Analyzing the whole data set … hmm • Ad-Hoc Queries (SQL) … hmm
  • 8. Apache Spark Distributed & resilient analytics “Spark lets you run programs up to 100x faster in memory, or 10x faster on disk, than Hadoop.”
  • 9. Spark : Simplify the… © 2016 DataStax,All Rights Reserved. 9 challenging and compute-intensive task of processing high volumes of real-time or archived data, both structured and unstructured, seamlessly integrating relevant complex capabilities such as machine learning and graph algorithms. Source:http://www.toptal.com/spark/introduction-to-apache-spark
  • 10. Spark Use Cases © 2016 DataStax,All Rights Reserved. 10 • Event detection • Behavior / anomalies detection • Fraud & intrusion detection • Targeted advertising • E-commerce transaction processing • Recommendations
  • 11. Spark Use Cases © 2016 DataStax,All Rights Reserved. 11 • Data migration • Ad hoc queries • Business Intelligence
  • 12. Spark © 2016 DataStax,All Rights Reserved. 12
  • 13. Spark Architecture © 2016 DataStax,All Rights Reserved. 13
  • 14. Spark RDDs © 2016 DataStax,All Rights Reserved. 14 Resilient Distributed Datasets • Fault-tolerant collection of elements • Can be operated on in parallel • Immutable • In-Memory (as much as possible)
  • 15. Spark RDD Partitioning © 2016 DataStax,All Rights Reserved. 15
  • 16. Spark RDD Resiliency © 2016 DataStax,All Rights Reserved. 16
  • 17. Spark Transformations © 2016 DataStax,All Rights Reserved. 17 Create a new data set from an existing one
  • 18. Spark Actions © 2016 DataStax,All Rights Reserved. 18 Return a value after running a computation
  • 19. Spark DAG © 2016 DataStax,All Rights Reserved. 19
  • 20. Spark Example (word count) © 2016 DataStax,All Rights Reserved. 20 Given is a text file with movies and their categories. We want to know the occurences of each genre. Pirates of the Carribbean,Adventure,Action,Fantasy Star Trek I,SciFi,Adventure,Action Die Hard,Action,Thriller
  • 21. Why Scala © 2016 DataStax,All Rights Reserved. 21 Scala is a functional language Analytics is a functional “problem“ So – use a functional language
  • 22. Spark Example (word count) © 2016 DataStax,All Rights Reserved. 22 sc.textFile("file:///home/videos.csv") .flatMap(line => line.split(",").drop(1)) .map(genre => (genre,1)) .reduceByKey{case (x,y) => x + y} .collect() .foreach(println)
  • 23. Spark Example (word count) © 2016 DataStax,All Rights Reserved. 23 sc.textFile("file:///home/videos.csv") .flatMap(line => line.split(",").drop(1)) .map(genre => (genre,1)) .reduceByKey{case (x,y) => x + y} .collect() .foreach(println) Anonymous Scala Function
  • 24. Spark Example (word count) © 2016 DataStax,All Rights Reserved. 24 sc.textFile("file:///home/videos.csv") .flatMap(line => line.split(",").drop(1)) .map(genre => (genre,1)) .reduceByKey{case (x,y) => x + y} .collect() .foreach(println) Flat the collection
  • 25. Spark Example (word count) © 2016 DataStax,All Rights Reserved. 25 sc.textFile("file:///home/videos.csv") .flatMap(line => line.split(",").drop(1)) .map(genre => (genre,1)) .reduceByKey{case (x,y) => x + y} .collect() .foreach(println)Pair RDD
  • 26. Spark Example (word count) © 2016 DataStax,All Rights Reserved. 26 sc.textFile("file:///home/videos.csv") .flatMap(line => line.split(",").drop(1)) .map(genre => (genre,1)) .reduceByKey{case (x,y) => x + y} .collect() .foreach(println) Values ofTWO Pair RDDs
  • 27. Spark Example (word count) © 2016 DataStax,All Rights Reserved. 27 sc.textFile("file:///home/videos.csv") .flatMap(line => line.split(",").drop(1)) .map(genre => (genre,1)) .reduceByKey{case (x,y) => x + y} .collect() .foreach(println)
  • 28. Spark Example (word count) © 2016 DataStax,All Rights Reserved. 28 sc.textFile("file:///home/videos.csv") .flatMap(line => line.split(",").drop(1)) .map(genre => (genre,1)) .reduceByKey{case (x,y) => x + y} .collect() .foreach(println)
  • 30. Spark-Cassandra-Connector © 2016 DataStax,All Rights Reserved. 30 • Scala, Java • Open Source (ASF2) • Developed by DataStax and community guys https://github.com/datastax/ spark-cassandra-connector
  • 31. Spark Batch Processing © 2016 DataStax,All Rights Reserved. 31 Other REALLY useful stuff like • Re-partition data & processing • Allows efficient use of Cassandra
  • 32. SQL and Data Frames
  • 33. Spark SQL © 2016 DataStax,All Rights Reserved. 33 csc.sql(" SELECT COUNT(*) AS total " + " FROM killr_video.movies_by_actor " + " WHERE actor = 'Johnny Depp' ") .show +-----+ |total| +-----+ | 54| +-----+
  • 34. Spark Data Frames © 2016 DataStax,All Rights Reserved. 34 val movies = csc .read .format("org.apache.spark.sql.cassandra") .options(Map( "keyspace" -> "killr_video", "table" -> "movies_by_actor" )) .load movies.filter("actor = 'Johnny Depp'") .agg(Map("*" -> "count")) .withColumnRenamed("COUNT(1)", "total") .show Same as: SELECT COUNT(*) AS total FROM killr_video.movies_by_actor WHERE actor = 'Johnny Depp'
  • 35. Spark Data Frames © 2016 DataStax,All Rights Reserved. 35 Data frame is RDD that has named columns • Has schema • Has rows • Has columns • Has data types
  • 37. Spark Streaming © 2016 DataStax,All Rights Reserved. 37 Continuous Micro Batching • Collect data of time slides • Compute on that data
  • 38. Spark Streaming © 2016 DataStax,All Rights Reserved. 38
  • 39. Spark Streaming Architecture Confid© 2016 DataStax,All Rights Reserved. 39
  • 40. Spark Streaming © 2016 DataStax,All Rights Reserved. 40
  • 41. Spark Streaming Windows © 2016 DataStax,All Rights Reserved. 41 time Slide Slide Slide Slide Slide Slide Slide Slide Slide Slide Slide 00:00:00 00:00:01 00:00:02 00:00:03 00:00:04 00:00:05 00:00:06 00:00:07 00:00:08 00:00:09 00:00:10 Window Window Window Window Window Window Window Window Window Process data for this slide – i.e. 1 second in this example Process data for this slide – i.e. 1 second in this example Process data for this slide – i.e. 1 second in this example Time window ~= Grouping data by time
  • 42. Apache Kafka Distributed & resilient messaging
  • 43. Kafka © 2016 DataStax,All Rights Reserved. 43 Publish-subscribe messaging rethought as a distributed commit log
  • 44. Kafka © 2016 DataStax,All Rights Reserved. 44 Kafka Cluster Producer Producer Producer Consumer Consumer Kafka Node Kafka Node Kafka Node
  • 45. Kafka Use-Cases © 2016 DataStax,All Rights Reserved. 45 • Messaging • Website activity tracking • Metrics • Log aggregation • Event sourcing • Stream processing
  • 46. Common Characteristics © 2016 DataStax,All Rights Reserved. 46 Kafka, Spark & Cassandra are • Scalable • Distributed • Partitioned • Replicated
  • 47. KillrPowr © 2016 DataStax,All Rights Reserved. 47 Show current power consumption in Germany in real time • Power consumption of all cities on a map • Graph of data for a selected city
  • 48. KillrPowr – Data modeling 101 © 2016 DataStax,All Rights Reserved. 48 1. Collect UI Requirements 2. Model data (flow) per web UI request 3. Build Spark Streaming code to generate that data 4. Connect sensors and Spark Streaming using Kafka
  • 49. KillrPowr – Tech Stack © 2016 DataStax,All Rights Reserved. 49 Power Sensor Power Sensor Kafka Spark Streaming Cassandra Web Site Queue data Aggregate data and save for web UI
  • 50. KillrPowr – Tech Stack © 2016 DataStax,All Rights Reserved. 50 Techs used in the demo • Spray (HTTP for Akka Actors) • AngularJS • Kafka • Spark Streaming • Cassandra
  • 51. Tips © 2016 DataStax,All Rights Reserved. 51 • Push down predicates to Cassandra • Partition your data “right” • Think distributed
  • 52. Cassandra + Spark in DSE © 2016 DataStax,All Rights Reserved. 52 DSE integrates Spark + Cassandra https://academy.datastax.com/ courses/getting-started- apache-spark
  • 53. KillrPowr Live Demo © 2016 DataStax,All Rights Reserved. 53
  • 54. You’ve got the lighter, let’s start a fire: Spark and Cassandra Robert Stupp Solutions Architect @ DataStax, Committer to Apache Cassandra @snazy
  • 55. Thank You © 2016 DataStax,All Rights Reserved. 55
  • 57. BACKUP - DDL © 2014 DataStax, All Rights Reserved. Company Confidential 57
  • 58. BACKUP - DDL © 2014 DataStax, All Rights Reserved. Company Confidential 58
  • 59. BACKUP - DDL © 2014 DataStax, All Rights Reserved. Company Confidential 59
  • 60. BACKUP - DDL © 2014 DataStax, All Rights Reserved. Company Confidential 60
  • 61. BACKUP – Spark Streaming © 2014 DataStax, All Rights Reserved. Company Confidential 61
  • 62. BACKUP – Spark Streaming © 2014 DataStax, All Rights Reserved. Company Confidential 62
  • 63. BACKUP – Spark Streaming © 2014 DataStax, All Rights Reserved. Company Confidential 63
  • 64. BACKUP – Spark Streaming © 2014 DataStax, All Rights Reserved. Company Confidential 64
  • 65. BACKUP – Spark Streaming © 2014 DataStax, All Rights Reserved. Company Confidential 65
  • 66. BACKUP – Spark Streaming © 2014 DataStax, All Rights Reserved. Company Confidential 66
  • 67. © 2014 DataStax, All Rights Reserved. Company Confidential 67