SlideShare una empresa de Scribd logo
1 de 33
Descargar para leer sin conexión
Cassandra and Spark:
Optimizing for Data
Locality
Russell Spitzer

Software Engineer @ DataStax
Lex Luther Was Right:
Location is Important
The value of many things is based upon it's
location. Developed land near the beach is
valuable but desert and farmland is generally
much cheaper. Unfortunately moving land is
generally impossible.
Spark Summit
or lake, or swamp or whatever body of water
is "Data" at the time this slide is viewed.
Lex Luther Was Wrong:
Don't ETL the Data Ocean
Spark Summit
My House
Spark is Our Hero Giving us the
Ability to Do Our Analytics Without the ETL
PARK
Moving Data Between Machines Is Expensive
Do Work Where the Data Lives!
Our Cassandra Nodes are like Cities and our Spark Executors
are like Super Heroes. We'd rather they spend their time locally
rather than flying back and forth all the time.
Moving Data Between Machines Is Expensive
Do Work Where the Data Lives!
Our Cassandra Nodes are like Cities and our Spark Executors
are like Super Heroes. We'd rather they spend their time locally
rather than flying back and forth all the time.
Metropolis
Superman
Gotham
Batman
Spark Executors
DataStax Open Source
Spark Cassandra Connector is Available on Github
https://github.com/datastax/spark-cassandra-connector
• Compatible with Spark 1.3
• Read and Map C* Data Types
• Saves To Cassandra
• Intelligent write batching
• Supports Collections
• Secondary index pushdown
• Arbitrary CQL Execution!
How the Spark Cassandra Connector
Reads Data Node Local
Cassandra Locates a Row Based on
Partition Key and Token Range
All of the rows in a Cassandra Cluster
are stored based based on their
location in the Token Range.
Metropolis
Gotham
Coast City
Each of the Nodes in a 

Cassandra Cluster is primarily
responsible for one set of
Tokens.
0999
500
Cassandra Locates a Row Based on
Partition Key and Token Range
Metropolis
Gotham
Coast City
Each of the Nodes in a 

Cassandra Cluster is primarily
responsible for one set of
Tokens.
0999
500
750 - 99
350 - 749
100 - 349
Cassandra Locates a Row Based on
Partition Key and Token Range
Jacek 514 Red
The CQL Schema designates
at least one column to be the
Partition Key.
Metropolis
Gotham
Coast City
Cassandra Locates a Row Based on
Partition Key and Token Range
Jacek 514 Red
The hash of the Partition Key
tells us where a row
should be stored.
Metropolis
Gotham
Coast City
Cassandra Locates a Row Based on
Partition Key and Token Range
830
With VNodes the ranges are
not contiguous but the same
mechanism controls row
location.
Jacek 514 Red
Metropolis
Gotham
Coast City
Cassandra Locates a Row Based on
Partition Key and Token Range
830
Loading Huge Amounts of Data
Table Scans involve loading most of the data in Cassandra
Cassandra RDD Use the Token Range to Create Node
Local Spark Partitions
sc.cassandraTable or sqlContext.load(org.apache.spark.sql.cassandra)
Token Ranges
spark.cassandra.input.split.size
The (estimated) number of C* Partitions to be placed in a Spark Partition
Cassandra RDD Use the Token Range to Create Node
Local Spark Partitions
sc.cassandraTable or sqlContext.load(org.apache.spark.sql.cassandra)
Token Ranges
spark.cassandra.input.split.size
The (estimated) number of C* Partitions to be placed in a Spark Partition
CassandraRDD
Spark
Partition
Cassandra RDD Use the Token Range to Create Node
Local Spark Partitions
sc.cassandraTable or sqlContext.load(org.apache.spark.sql.cassandra)
Token Ranges
spark.cassandra.input.split.size
The (estimated) number of C* Partitions to be placed in a Spark Partition
CassandraRDD
Cassandra RDD Use the Token Range to Create Node
Local Spark Partitions
sc.cassandraTable or sqlContext.load(org.apache.spark.sql.cassandra)
Token Ranges
spark.cassandra.input.split.size
The (estimated) number of C* Partitions to be placed in a Spark Partition
CassandraRDD
Cassandra RDD Use the Token Range to Create Node
Local Spark Partitions
sc.cassandraTable or sqlContext.load(org.apache.spark.sql.cassandra)
Token Ranges
spark.cassandra.input.split.size
The (estimated) number of C* Partitions to be placed in a Spark Partition
CassandraRDD
Spark Driver
Spark Partitions Are Annotated With the Location For
TokenRanges they Span
Metropolis
Metropolis
Superman
Gotham
Batman
Coast City
Green L.
The Driver waits spark.locality.wait for the
preferred location to have an open executor
Assigns Task
Metropolis
Spark Executor (Superman)
The Spark Executor uses the Java Driver to
Pull Rows from the Local Cassandra Instance
On the Executor the task is transformed into CQL queries which are
executed via the Java Driver.
SELECT * FROM keyspace.table WHERE
token(pk) > 780 and token(pk) <= 830
Tokens 780 - 830
Metropolis
Spark Executor (Superman)
The Spark Executor uses the Java Driver to
Pull Rows from the Local Cassandra Instance
SELECT * FROM keyspace.table WHERE
token(pk) > 780 and token(pk) <= 830
Tokens 780 - 830
The C* Java Driver pages spark.cassandra.input.page.row.size
CQL rows at a time
Metropolis
Spark Executor (Superman)
The Spark Executor uses the Java Driver to
Pull Rows from the Local Cassandra Instance
SELECT * FROM keyspace.table WHERE 

(Pushed Down Clauses) AND
token(pk) > 780 and token(pk) <= 830
Tokens 780 - 830
Because we are utilizing CQL we can also pushdown predicates which
can be handled by C*.
Loading Sizable But Defined Amounts of Data
Retrieving sets of Partition Keys can be done in parallel
joinWithCassandraTable Provides an Interface for
Obtaining a Set of C* Partitions
Generic RDD
Metropolis
Superman
Gotham
Batman
Coast City
Green L.
Generic RDDs can Be Joined But the Spark Tasks will Not be Node Local
Generic RDD
Metropolis
Superman
Gotham
Batman
Coast City
Green L.
Generic RDDs can Be Joined But the Spark Tasks will Not be Node Local
joinWithCassandraTable Provides an Interface for
Obtaining a Set of C* Partitions
repartitionByCassandraReplica Repartitions RDD's to be
C* Local
Generic RDD CassandraPartitionedRDD
This operation requires a shuffle
joinWithCassandraTable on CassandraPartitionedRDDs
(or CassandraTableScanRDDs) will be Node local
CassandraPartitionedRDDs are partitioned to be executed node local
CassandraPartitionedRDD
Metropolis
Superman
Gotham
Batman
Coast City
Green L.
Metropolis
Spark Executor (Superman)
The Spark Executor uses the Java Driver to
Pull Rows from the Local Cassandra Instance
The C* Java Driver pages spark.cassandra.input.page.row.size
CQL rows at a time
SELECT * FROM keyspace.table WHERE
pk =
DataStax Enterprise Comes Bundled
with Spark and the Connector
Apache Spark Apache Solr
DataStax Delivers
Apache Cassandra
In A Database Platform
DataStax Enterprise Enables This Same Machinery 

with Solr Pushdown
Metropolis
Spark Executor (Superman)
DataStax
Enterprise
SELECT * FROM keyspace.table
WHERE solr_query = 'title:b'
AND
token(pk) > 780 and token(pk) <= 830
Tokens 780 - 830
Learn More Online and at Cassandra Summit
https://academy.datastax.com/

Más contenido relacionado

La actualidad más candente

La actualidad más candente (20)

Spark and Cassandra 2 Fast 2 Furious
Spark and Cassandra 2 Fast 2 FuriousSpark and Cassandra 2 Fast 2 Furious
Spark and Cassandra 2 Fast 2 Furious
 
Spark cassandra connector.API, Best Practices and Use-Cases
Spark cassandra connector.API, Best Practices and Use-CasesSpark cassandra connector.API, Best Practices and Use-Cases
Spark cassandra connector.API, Best Practices and Use-Cases
 
Lessons from Cassandra & Spark (Matthias Niehoff & Stephan Kepser, codecentri...
Lessons from Cassandra & Spark (Matthias Niehoff & Stephan Kepser, codecentri...Lessons from Cassandra & Spark (Matthias Niehoff & Stephan Kepser, codecentri...
Lessons from Cassandra & Spark (Matthias Niehoff & Stephan Kepser, codecentri...
 
Spark Cassandra Connector: Past, Present, and Future
Spark Cassandra Connector: Past, Present, and FutureSpark Cassandra Connector: Past, Present, and Future
Spark Cassandra Connector: Past, Present, and Future
 
Maximum Overdrive: Tuning the Spark Cassandra Connector (Russell Spitzer, Dat...
Maximum Overdrive: Tuning the Spark Cassandra Connector (Russell Spitzer, Dat...Maximum Overdrive: Tuning the Spark Cassandra Connector (Russell Spitzer, Dat...
Maximum Overdrive: Tuning the Spark Cassandra Connector (Russell Spitzer, Dat...
 
Real time data pipeline with spark streaming and cassandra with mesos
Real time data pipeline with spark streaming and cassandra with mesosReal time data pipeline with spark streaming and cassandra with mesos
Real time data pipeline with spark streaming and cassandra with mesos
 
Cassandra Fundamentals - C* 2.0
Cassandra Fundamentals - C* 2.0Cassandra Fundamentals - C* 2.0
Cassandra Fundamentals - C* 2.0
 
Escape From Hadoop: Spark One Liners for C* Ops
Escape From Hadoop: Spark One Liners for C* OpsEscape From Hadoop: Spark One Liners for C* Ops
Escape From Hadoop: Spark One Liners for C* Ops
 
Maximum Overdrive: Tuning the Spark Cassandra Connector
Maximum Overdrive: Tuning the Spark Cassandra ConnectorMaximum Overdrive: Tuning the Spark Cassandra Connector
Maximum Overdrive: Tuning the Spark Cassandra Connector
 
Apache cassandra and spark. you got the the lighter, let's start the fire
Apache cassandra and spark. you got the the lighter, let's start the fireApache cassandra and spark. you got the the lighter, let's start the fire
Apache cassandra and spark. you got the the lighter, let's start the fire
 
Spark Streaming with Cassandra
Spark Streaming with CassandraSpark Streaming with Cassandra
Spark Streaming with Cassandra
 
Owning time series with team apache Strata San Jose 2015
Owning time series with team apache   Strata San Jose 2015Owning time series with team apache   Strata San Jose 2015
Owning time series with team apache Strata San Jose 2015
 
Spark with Cassandra by Christopher Batey
Spark with Cassandra by Christopher BateySpark with Cassandra by Christopher Batey
Spark with Cassandra by Christopher Batey
 
Analyzing Time-Series Data with Apache Spark and Cassandra - StampedeCon 2016
Analyzing Time-Series Data with Apache Spark and Cassandra - StampedeCon 2016Analyzing Time-Series Data with Apache Spark and Cassandra - StampedeCon 2016
Analyzing Time-Series Data with Apache Spark and Cassandra - StampedeCon 2016
 
Using Spark to Load Oracle Data into Cassandra
Using Spark to Load Oracle Data into CassandraUsing Spark to Load Oracle Data into Cassandra
Using Spark to Load Oracle Data into Cassandra
 
How We Used Cassandra/Solr to Build Real-Time Analytics Platform
How We Used Cassandra/Solr to Build Real-Time Analytics PlatformHow We Used Cassandra/Solr to Build Real-Time Analytics Platform
How We Used Cassandra/Solr to Build Real-Time Analytics Platform
 
OLAP with Cassandra and Spark
OLAP with Cassandra and SparkOLAP with Cassandra and Spark
OLAP with Cassandra and Spark
 
An Introduction to time series with Team Apache
An Introduction to time series with Team ApacheAn Introduction to time series with Team Apache
An Introduction to time series with Team Apache
 
Escape from Hadoop: Ultra Fast Data Analysis with Spark & Cassandra
Escape from Hadoop: Ultra Fast Data Analysis with Spark & CassandraEscape from Hadoop: Ultra Fast Data Analysis with Spark & Cassandra
Escape from Hadoop: Ultra Fast Data Analysis with Spark & Cassandra
 
Time series with Apache Cassandra - Long version
Time series with Apache Cassandra - Long versionTime series with Apache Cassandra - Long version
Time series with Apache Cassandra - Long version
 

Destacado

Hadoop scalability
Hadoop scalabilityHadoop scalability
Hadoop scalability
WANdisco Plc
 

Destacado (6)

Gluster.community.day.2013
Gluster.community.day.2013Gluster.community.day.2013
Gluster.community.day.2013
 
Supporting Financial Services with a More Flexible Approach to Big Data
Supporting Financial Services with a More Flexible Approach to Big DataSupporting Financial Services with a More Flexible Approach to Big Data
Supporting Financial Services with a More Flexible Approach to Big Data
 
Performance comparison of Distributed File Systems on 1Gbit networks
Performance comparison of Distributed File Systems on 1Gbit networksPerformance comparison of Distributed File Systems on 1Gbit networks
Performance comparison of Distributed File Systems on 1Gbit networks
 
Hadoop scalability
Hadoop scalabilityHadoop scalability
Hadoop scalability
 
Hadoop 3 in a Nutshell
Hadoop 3 in a NutshellHadoop 3 in a Nutshell
Hadoop 3 in a Nutshell
 
Unleashing the Power of Apache Atlas with Apache Ranger
Unleashing the Power of Apache Atlas with Apache RangerUnleashing the Power of Apache Atlas with Apache Ranger
Unleashing the Power of Apache Atlas with Apache Ranger
 

Similar a Cassandra and Spark: Optimizing for Data Locality

Cassandra and Spark, closing the gap between no sql and analytics codemotio...
Cassandra and Spark, closing the gap between no sql and analytics   codemotio...Cassandra and Spark, closing the gap between no sql and analytics   codemotio...
Cassandra and Spark, closing the gap between no sql and analytics codemotio...
Duyhai Doan
 
Spark_tutorial (1).pptx
Spark_tutorial (1).pptxSpark_tutorial (1).pptx
Spark_tutorial (1).pptx
0111002
 
Apache Spark for Library Developers with Erik Erlandson and William Benton
Apache Spark for Library Developers with Erik Erlandson and William BentonApache Spark for Library Developers with Erik Erlandson and William Benton
Apache Spark for Library Developers with Erik Erlandson and William Benton
Databricks
 

Similar a Cassandra and Spark: Optimizing for Data Locality (20)

Spark And Cassandra: 2 Fast, 2 Furious
Spark And Cassandra: 2 Fast, 2 FuriousSpark And Cassandra: 2 Fast, 2 Furious
Spark And Cassandra: 2 Fast, 2 Furious
 
Escape from Hadoop
Escape from HadoopEscape from Hadoop
Escape from Hadoop
 
Dive into Spark Streaming
Dive into Spark StreamingDive into Spark Streaming
Dive into Spark Streaming
 
Cassandra and Spark, closing the gap between no sql and analytics codemotio...
Cassandra and Spark, closing the gap between no sql and analytics   codemotio...Cassandra and Spark, closing the gap between no sql and analytics   codemotio...
Cassandra and Spark, closing the gap between no sql and analytics codemotio...
 
Akka Cluster in Production
Akka Cluster in ProductionAkka Cluster in Production
Akka Cluster in Production
 
Spark_tutorial (1).pptx
Spark_tutorial (1).pptxSpark_tutorial (1).pptx
Spark_tutorial (1).pptx
 
Apache Spark for Library Developers with Erik Erlandson and William Benton
Apache Spark for Library Developers with Erik Erlandson and William BentonApache Spark for Library Developers with Erik Erlandson and William Benton
Apache Spark for Library Developers with Erik Erlandson and William Benton
 
Cassandra spark connector
Cassandra spark connectorCassandra spark connector
Cassandra spark connector
 
Manchester Hadoop Meetup: Cassandra Spark internals
Manchester Hadoop Meetup: Cassandra Spark internalsManchester Hadoop Meetup: Cassandra Spark internals
Manchester Hadoop Meetup: Cassandra Spark internals
 
A Tale of Two APIs: Using Spark Streaming In Production
A Tale of Two APIs: Using Spark Streaming In ProductionA Tale of Two APIs: Using Spark Streaming In Production
A Tale of Two APIs: Using Spark Streaming In Production
 
Cassandra and Spark
Cassandra and Spark Cassandra and Spark
Cassandra and Spark
 
Cassandra for Sysadmins
Cassandra for SysadminsCassandra for Sysadmins
Cassandra for Sysadmins
 
Beyond the Query: A Cassandra + Solr + Spark Love Triangle Using Datastax Ent...
Beyond the Query: A Cassandra + Solr + Spark Love Triangle Using Datastax Ent...Beyond the Query: A Cassandra + Solr + Spark Love Triangle Using Datastax Ent...
Beyond the Query: A Cassandra + Solr + Spark Love Triangle Using Datastax Ent...
 
Spark Streaming Programming Techniques You Should Know with Gerard Maas
Spark Streaming Programming Techniques You Should Know with Gerard MaasSpark Streaming Programming Techniques You Should Know with Gerard Maas
Spark Streaming Programming Techniques You Should Know with Gerard Maas
 
Cassandra Day Denver 2014: Feelin' the Flow: Analyzing Data with Spark and Ca...
Cassandra Day Denver 2014: Feelin' the Flow: Analyzing Data with Spark and Ca...Cassandra Day Denver 2014: Feelin' the Flow: Analyzing Data with Spark and Ca...
Cassandra Day Denver 2014: Feelin' the Flow: Analyzing Data with Spark and Ca...
 
Apache Spark Architecture
Apache Spark ArchitectureApache Spark Architecture
Apache Spark Architecture
 
Cassandra London - C* Spark Connector
Cassandra London - C* Spark ConnectorCassandra London - C* Spark Connector
Cassandra London - C* Spark Connector
 
Virtual Nodes: Rethinking Topology in Cassandra
Virtual Nodes: Rethinking Topology in CassandraVirtual Nodes: Rethinking Topology in Cassandra
Virtual Nodes: Rethinking Topology in Cassandra
 
C* Summit 2013: Virtual Nodes: Rethinking Topology in Cassandra by Eric Evans
C* Summit 2013: Virtual Nodes: Rethinking Topology in Cassandra by Eric EvansC* Summit 2013: Virtual Nodes: Rethinking Topology in Cassandra by Eric Evans
C* Summit 2013: Virtual Nodes: Rethinking Topology in Cassandra by Eric Evans
 
Cassandra + Spark + Elk
Cassandra + Spark + ElkCassandra + Spark + Elk
Cassandra + Spark + Elk
 

Último

Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
amitlee9823
 
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
amitlee9823
 
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
amitlee9823
 
Mg Road Call Girls Service: 🍓 7737669865 🍓 High Profile Model Escorts | Banga...
Mg Road Call Girls Service: 🍓 7737669865 🍓 High Profile Model Escorts | Banga...Mg Road Call Girls Service: 🍓 7737669865 🍓 High Profile Model Escorts | Banga...
Mg Road Call Girls Service: 🍓 7737669865 🍓 High Profile Model Escorts | Banga...
amitlee9823
 
Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...
Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...
Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...
amitlee9823
 
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...
amitlee9823
 
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service BangaloreCall Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
amitlee9823
 
FESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfFESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdf
MarinCaroMartnezBerg
 

Último (20)

BDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort Service
BDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort ServiceBDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort Service
BDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort Service
 
Accredited-Transport-Cooperatives-Jan-2021-Web.pdf
Accredited-Transport-Cooperatives-Jan-2021-Web.pdfAccredited-Transport-Cooperatives-Jan-2021-Web.pdf
Accredited-Transport-Cooperatives-Jan-2021-Web.pdf
 
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
 
Edukaciniai dropshipping via API with DroFx
Edukaciniai dropshipping via API with DroFxEdukaciniai dropshipping via API with DroFx
Edukaciniai dropshipping via API with DroFx
 
Discover Why Less is More in B2B Research
Discover Why Less is More in B2B ResearchDiscover Why Less is More in B2B Research
Discover Why Less is More in B2B Research
 
ELKO dropshipping via API with DroFx.pptx
ELKO dropshipping via API with DroFx.pptxELKO dropshipping via API with DroFx.pptx
ELKO dropshipping via API with DroFx.pptx
 
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
 
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
 
Predicting Loan Approval: A Data Science Project
Predicting Loan Approval: A Data Science ProjectPredicting Loan Approval: A Data Science Project
Predicting Loan Approval: A Data Science Project
 
Mg Road Call Girls Service: 🍓 7737669865 🍓 High Profile Model Escorts | Banga...
Mg Road Call Girls Service: 🍓 7737669865 🍓 High Profile Model Escorts | Banga...Mg Road Call Girls Service: 🍓 7737669865 🍓 High Profile Model Escorts | Banga...
Mg Road Call Girls Service: 🍓 7737669865 🍓 High Profile Model Escorts | Banga...
 
Generative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and MilvusGenerative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and Milvus
 
Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...
Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...
Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...
 
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...
 
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service BangaloreCall Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
 
Midocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFxMidocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFx
 
Week-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interactionWeek-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interaction
 
FESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfFESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdf
 
Mature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptxMature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptx
 
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% SecureCall me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
 
Smarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptxSmarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptx
 

Cassandra and Spark: Optimizing for Data Locality

  • 1. Cassandra and Spark: Optimizing for Data Locality Russell Spitzer
 Software Engineer @ DataStax
  • 2. Lex Luther Was Right: Location is Important The value of many things is based upon it's location. Developed land near the beach is valuable but desert and farmland is generally much cheaper. Unfortunately moving land is generally impossible. Spark Summit
  • 3. or lake, or swamp or whatever body of water is "Data" at the time this slide is viewed. Lex Luther Was Wrong: Don't ETL the Data Ocean Spark Summit My House
  • 4. Spark is Our Hero Giving us the Ability to Do Our Analytics Without the ETL PARK
  • 5. Moving Data Between Machines Is Expensive Do Work Where the Data Lives! Our Cassandra Nodes are like Cities and our Spark Executors are like Super Heroes. We'd rather they spend their time locally rather than flying back and forth all the time.
  • 6. Moving Data Between Machines Is Expensive Do Work Where the Data Lives! Our Cassandra Nodes are like Cities and our Spark Executors are like Super Heroes. We'd rather they spend their time locally rather than flying back and forth all the time. Metropolis Superman Gotham Batman Spark Executors
  • 7. DataStax Open Source Spark Cassandra Connector is Available on Github https://github.com/datastax/spark-cassandra-connector • Compatible with Spark 1.3 • Read and Map C* Data Types • Saves To Cassandra • Intelligent write batching • Supports Collections • Secondary index pushdown • Arbitrary CQL Execution!
  • 8. How the Spark Cassandra Connector Reads Data Node Local
  • 9. Cassandra Locates a Row Based on Partition Key and Token Range All of the rows in a Cassandra Cluster are stored based based on their location in the Token Range.
  • 10. Metropolis Gotham Coast City Each of the Nodes in a 
 Cassandra Cluster is primarily responsible for one set of Tokens. 0999 500 Cassandra Locates a Row Based on Partition Key and Token Range
  • 11. Metropolis Gotham Coast City Each of the Nodes in a 
 Cassandra Cluster is primarily responsible for one set of Tokens. 0999 500 750 - 99 350 - 749 100 - 349 Cassandra Locates a Row Based on Partition Key and Token Range
  • 12. Jacek 514 Red The CQL Schema designates at least one column to be the Partition Key. Metropolis Gotham Coast City Cassandra Locates a Row Based on Partition Key and Token Range
  • 13. Jacek 514 Red The hash of the Partition Key tells us where a row should be stored. Metropolis Gotham Coast City Cassandra Locates a Row Based on Partition Key and Token Range 830
  • 14. With VNodes the ranges are not contiguous but the same mechanism controls row location. Jacek 514 Red Metropolis Gotham Coast City Cassandra Locates a Row Based on Partition Key and Token Range 830
  • 15. Loading Huge Amounts of Data Table Scans involve loading most of the data in Cassandra
  • 16. Cassandra RDD Use the Token Range to Create Node Local Spark Partitions sc.cassandraTable or sqlContext.load(org.apache.spark.sql.cassandra) Token Ranges spark.cassandra.input.split.size The (estimated) number of C* Partitions to be placed in a Spark Partition
  • 17. Cassandra RDD Use the Token Range to Create Node Local Spark Partitions sc.cassandraTable or sqlContext.load(org.apache.spark.sql.cassandra) Token Ranges spark.cassandra.input.split.size The (estimated) number of C* Partitions to be placed in a Spark Partition CassandraRDD Spark Partition
  • 18. Cassandra RDD Use the Token Range to Create Node Local Spark Partitions sc.cassandraTable or sqlContext.load(org.apache.spark.sql.cassandra) Token Ranges spark.cassandra.input.split.size The (estimated) number of C* Partitions to be placed in a Spark Partition CassandraRDD
  • 19. Cassandra RDD Use the Token Range to Create Node Local Spark Partitions sc.cassandraTable or sqlContext.load(org.apache.spark.sql.cassandra) Token Ranges spark.cassandra.input.split.size The (estimated) number of C* Partitions to be placed in a Spark Partition CassandraRDD
  • 20. Cassandra RDD Use the Token Range to Create Node Local Spark Partitions sc.cassandraTable or sqlContext.load(org.apache.spark.sql.cassandra) Token Ranges spark.cassandra.input.split.size The (estimated) number of C* Partitions to be placed in a Spark Partition CassandraRDD
  • 21. Spark Driver Spark Partitions Are Annotated With the Location For TokenRanges they Span Metropolis Metropolis Superman Gotham Batman Coast City Green L. The Driver waits spark.locality.wait for the preferred location to have an open executor Assigns Task
  • 22. Metropolis Spark Executor (Superman) The Spark Executor uses the Java Driver to Pull Rows from the Local Cassandra Instance On the Executor the task is transformed into CQL queries which are executed via the Java Driver. SELECT * FROM keyspace.table WHERE token(pk) > 780 and token(pk) <= 830 Tokens 780 - 830
  • 23. Metropolis Spark Executor (Superman) The Spark Executor uses the Java Driver to Pull Rows from the Local Cassandra Instance SELECT * FROM keyspace.table WHERE token(pk) > 780 and token(pk) <= 830 Tokens 780 - 830 The C* Java Driver pages spark.cassandra.input.page.row.size CQL rows at a time
  • 24. Metropolis Spark Executor (Superman) The Spark Executor uses the Java Driver to Pull Rows from the Local Cassandra Instance SELECT * FROM keyspace.table WHERE 
 (Pushed Down Clauses) AND token(pk) > 780 and token(pk) <= 830 Tokens 780 - 830 Because we are utilizing CQL we can also pushdown predicates which can be handled by C*.
  • 25. Loading Sizable But Defined Amounts of Data Retrieving sets of Partition Keys can be done in parallel
  • 26. joinWithCassandraTable Provides an Interface for Obtaining a Set of C* Partitions Generic RDD Metropolis Superman Gotham Batman Coast City Green L. Generic RDDs can Be Joined But the Spark Tasks will Not be Node Local
  • 27. Generic RDD Metropolis Superman Gotham Batman Coast City Green L. Generic RDDs can Be Joined But the Spark Tasks will Not be Node Local joinWithCassandraTable Provides an Interface for Obtaining a Set of C* Partitions
  • 28. repartitionByCassandraReplica Repartitions RDD's to be C* Local Generic RDD CassandraPartitionedRDD This operation requires a shuffle
  • 29. joinWithCassandraTable on CassandraPartitionedRDDs (or CassandraTableScanRDDs) will be Node local CassandraPartitionedRDDs are partitioned to be executed node local CassandraPartitionedRDD Metropolis Superman Gotham Batman Coast City Green L.
  • 30. Metropolis Spark Executor (Superman) The Spark Executor uses the Java Driver to Pull Rows from the Local Cassandra Instance The C* Java Driver pages spark.cassandra.input.page.row.size CQL rows at a time SELECT * FROM keyspace.table WHERE pk =
  • 31. DataStax Enterprise Comes Bundled with Spark and the Connector Apache Spark Apache Solr DataStax Delivers Apache Cassandra In A Database Platform
  • 32. DataStax Enterprise Enables This Same Machinery 
 with Solr Pushdown Metropolis Spark Executor (Superman) DataStax Enterprise SELECT * FROM keyspace.table WHERE solr_query = 'title:b' AND token(pk) > 780 and token(pk) <= 830 Tokens 780 - 830
  • 33. Learn More Online and at Cassandra Summit https://academy.datastax.com/