INTRO TO PYSPARK
Jon Haddad, Technical Evangelist, DataStax
@rustyrazorblade
WHAT TOOLS ARE YOU ALREADY USING FOR DATA ANALYSIS?
NumPy / SciPy
Pandas
iPython Notebooks
scikit-learn
hdf5
pybrain
WHAT'S THE PROBLEM?
GREAT TOOLS
BUT NOT BUILT FOR BIG DATA SETS
And not real time...
LIMITED TO 1 MACHINE
What if we have a lot of data?
What if we use Cassandra?
We need distributed computing
WHAT IS SPARK?
Fast and general purpose cluster computing system
Use when we have more data than fits on a single machine
LANGUAGES
Scala
Java
R (Spark version >= 1.4)
Python
WHAT CAN I DO WITH IT?
Read and write data in bulk to and from Cassandra
Batch processing
Stream processing
Machine Learning
Distributed SQL
BATCH PROCESSING
Operate on the entire dataset (or at least a big chunk of it)
RDD
Resilient Distributed Dataset (it's a big list)
Use functional concepts like map, filter, reduce
Caveat: You will always pay a penalty going from JVM <> Python
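To make the functional style concrete, here is a minimal sketch in plain Python (no Spark required) that applies map, filter, and reduce to an ordinary list. The sample rows are an assumption modeled on the demo users from this deck. An RDD offers methods with the same names, but runs them across the cluster rather than in one process.

```python
from functools import reduce

# Sample rows, shaped like the demo "user" table from this deck (local stand-in, not an RDD).
users = [
    {"name": "jon", "favorite_food": "bacon"},
    {"name": "luke", "favorite_food": "pie"},
    {"name": "patrick", "favorite_food": "pizza"},
    {"name": "rachel", "favorite_food": "pizza"},
]

# map: transform every element
foods = list(map(lambda u: u["favorite_food"], users))

# filter: keep only the elements that match a predicate
pizza_only = list(filter(lambda f: f == "pizza", foods))

# reduce: fold the remaining elements into a single value
pizza_count = reduce(lambda acc, _: acc + 1, pizza_only, 0)

print(foods)
print(pizza_count)
```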
DATA MIGRATIONS
USERS
name     favorite_food
jon      bacon
luke     pie
patrick  pizza
rachel   pizza
SET UP OUR KEYSPACE
create KEYSPACE demo WITH replication =
{'class': 'SimpleStrategy', 'replication_factor': 1};
use demo ;
CREATE OUR DEMO USER TABLE
create TABLE user ( name text PRIMARY KEY,
favorite_food text );
insert into user (name, favorite_food) values ('jon', 'bacon');
insert into user (name, favorite_food) values ('luke', 'pie');
insert into user (name, favorite_food) values ('patrick', 'pizza');
insert into user (name, favorite_food) values ('rachel', 'pizza');
create table favorite_foods ( food text, name text,
primary key (food, name));
MAPPING FOODS TO USERS
from pyspark_cassandra import CassandraSparkContext, Row
from pyspark import SparkContext, SparkConf
# Parentheses let the chained setters span several lines
conf = (SparkConf()
        .setAppName("User Food Migration")
        .setMaster("spark://127.0.0.1:7077")
        .set("spark.cassandra.connection.host", "127.0.0.1"))
sc = CassandraSparkContext(conf=conf)
users = sc.cassandraTable("demo", "user")
favorite_foods = users.map(lambda x:
                           {"food": x['favorite_food'],
                            "name": x['name']})
favorite_foods.saveToCassandra("demo", "favorite_foods")
MIGRATION RESULTS
cqlsh:demo> select * from favorite_foods ;
 food  | name
-------+---------
 pizza | patrick
 pizza | rachel
 pie   | luke
 bacon | jon
(4 rows)
cqlsh:demo> select * from favorite_foods where food = 'pizza';
 food  | name
-------+---------
 pizza | patrick
 pizza | rachel
AGGREGATIONS
u = sc.cassandraTable("demo", "user")
(u.map(lambda x: (x['favorite_food'], 1))
  .reduceByKey(lambda x, y: x + y)
  .collect())
[(u'bacon', 1), (u'pie', 1), (u'pizza', 2)]
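For readers without a cluster at hand, the following plain-Python sketch shows what the map plus reduceByKey pipeline computes. The helper `reduce_by_key` is hypothetical and only imitates the behavior locally; it is not part of PySpark. The result is sorted for readability, whereas Spark makes no promise about the order of collected pairs.

```python
def reduce_by_key(pairs, func):
    """Hypothetical local stand-in for RDD.reduceByKey: merge values that share a key."""
    merged = {}
    for key, value in pairs:
        # First value for a key is stored as-is; later ones are combined with func.
        merged[key] = func(merged[key], value) if key in merged else value
    return sorted(merged.items())

# Sample rows modeled on the demo "user" table.
rows = [
    {"name": "jon", "favorite_food": "bacon"},
    {"name": "luke", "favorite_food": "pie"},
    {"name": "patrick", "favorite_food": "pizza"},
    {"name": "rachel", "favorite_food": "pizza"},
]

# Same shape as the map step: emit (food, 1) for every row.
pairs = [(row["favorite_food"], 1) for row in rows]

# Same combiner as the reduceByKey step: add the counts.
counts = reduce_by_key(pairs, lambda x, y: x + y)
print(counts)
```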
RDDS ARE COOL
And very powerful
But kind of annoying
DATAFRAMES
From R language
Available in Python via Pandas
DataFrames allow for optimized filters, sorting, grouping
With Spark, all the data stays in the JVM
With Cassandra it's still expensive due to JVM <> Python
But it can be fixed
DATAFRAMES EXAMPLE
from pyspark_cassandra import CassandraSparkContext, Row
from pyspark import SparkContext, SparkConf
from pyspark.sql import SQLContext # needed for toDF()
sql = SQLContext(sc) # in Spark 1.x, toDF() is attached to RDDs once a SQLContext exists
users = sc.cassandraTable("demo", "user").toDF()
food_count = (users.select("favorite_food")
              .groupBy("favorite_food").count())
food_count.collect()
[Row(favorite_food=u'bacon', count=1),
Row(favorite_food=u'pizza', count=2),
Row(favorite_food=u'pie', count=1)]
SPARKSQL
Register dataframes as tables
JOIN, GROUP BY
SPARKSQL IN ACTION
sql = SQLContext(sc)
users = sc.cassandraTable("demo", "user").toDF()
users.registerTempTable("users")
sql.sql("""select favorite_food, count(favorite_food)
from users group by favorite_food """).collect()
[Row(favorite_food=u'bacon', c1=1),
Row(favorite_food=u'pizza', c1=2),
Row(favorite_food=u'pie', c1=1)]
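The query itself is ordinary SQL, so it can be tried locally before pointing it at a cluster. The sketch below is an assumption-laden stand-in: it uses Python's built-in sqlite3 module instead of Spark, loads the four demo users by hand, and adds an explicit alias (`food_count`, not in the original) so the count column gets a readable name instead of a generated one like c1.

```python
import sqlite3

# In-memory database used purely as a local stand-in for the registered "users" table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT PRIMARY KEY, favorite_food TEXT)")
conn.executemany(
    "INSERT INTO users (name, favorite_food) VALUES (?, ?)",
    [("jon", "bacon"), ("luke", "pie"), ("patrick", "pizza"), ("rachel", "pizza")],
)

# Same GROUP BY as the slide, with an explicit alias and a stable ordering.
query = """
    SELECT favorite_food, COUNT(favorite_food) AS food_count
    FROM users
    GROUP BY favorite_food
    ORDER BY favorite_food
"""
result = conn.execute(query).fetchall()
conn.close()

print(result)
```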
STREAMING
Operate on batch windows
Each batch is a small RDD
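The batching idea can be sketched without Spark. In the plain-Python example below, the helper `micro_batches` and the timestamped page-view events are made up for illustration: events are grouped into fixed one-second windows, and each window is then processed as its own small collection, much as each streaming batch is handled as a small RDD.

```python
from collections import defaultdict

def micro_batches(events, batch_seconds=1):
    """Group (timestamp, payload) events into fixed-size time windows (hypothetical helper)."""
    batches = defaultdict(list)
    for timestamp, payload in events:
        # Every event lands in the window that starts at or before its timestamp.
        window_start = int(timestamp // batch_seconds) * batch_seconds
        batches[window_start].append(payload)
    # Return the windows in time order.
    return [batches[start] for start in sorted(batches)]

# Made-up page views: (seconds since start, page).
events = [(0.1, "/home"), (0.7, "/about"), (1.2, "/home"), (2.5, "/pricing"), (2.9, "/home")]

batches = micro_batches(events)

# Each batch is processed independently, here by counting its events.
per_batch_counts = [len(batch) for batch in batches]
print(per_batch_counts)
```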
PRETTY PICTURE
STREAMING
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils
stream = StreamingContext(sc, 1) # 1-second batch interval
kafka_stream = KafkaUtils.createStream(stream,
                                       "localhost:2181",
                                       "raw-event-streaming-consumer",
                                       {"pageviews": 1})
# manipulate kafka_stream as a DStream (a sequence of small RDDs)
stream.start()
stream.awaitTermination()
MACHINE LEARNING
Supervised learning
Unsupervised learning
SUPERVISED LEARNING
When we know the inputs and outputs
Example: Real estate prices
Take existing knowledge about houses and prices
Build a model to predict the future
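As a rough illustration of the real estate example, the sketch below fits a straight line to known house sizes and prices with ordinary least squares, then predicts the price of an unseen house. It is plain Python rather than Spark's machine learning library, the helper `fit_line` is hypothetical, and the numbers are made-up toy values chosen to be easy to check by hand.

```python
def fit_line(xs, ys):
    """Ordinary least squares for one input feature; returns (slope, intercept)."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Toy training data (made up): known inputs and known outputs.
sizes_m2 = [50, 100, 150]
prices_k = [100, 200, 300]

slope, intercept = fit_line(sizes_m2, prices_k)

# Use the fitted model to predict the price of a house we have not seen.
predicted = slope * 120 + intercept
print(predicted)
```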
UNSUPERVISED LEARNING
When we don't know the output
Popular usage: discover groups
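Group discovery can be sketched with a tiny clustering routine. The example below is a simplified one-dimensional k-means with two clusters, written in plain Python rather than with Spark; the helper `two_means`, its seeding from the smallest and largest value, and the sample values are all assumptions made for illustration. No labels are supplied: the groups emerge from the data alone.

```python
def two_means(values, iterations=10):
    """Tiny 1-D k-means with k=2, seeded from the smallest and largest value."""
    centers = [min(values), max(values)]
    for _ in range(iterations):
        groups = [[], []]
        for v in values:
            # Assign each value to the nearer of the two centers.
            nearest = 0 if abs(v - centers[0]) <= abs(v - centers[1]) else 1
            groups[nearest].append(v)
        # Move each center to the mean of its group (keep it if the group is empty).
        centers = [sum(g) / len(g) if g else c for g, c in zip(groups, centers)]
    return groups

# Made-up measurements with two obvious clumps.
values = [1, 2, 3, 10, 11, 12]

groups = two_means(values)
print(groups)
```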
INTERACTIVE IPYTHON NOTEBOOKS
Iterate quickly
Visualize your data
GET STARTED!
Open Source:
Download Cassandra
Download Spark
Cassandra PySpark Repo:
https://github.com/TargetHolding/pyspark-cassandra
Integrated solution
Download DataStax Enterprise
