Intro to PySpark (and Cassandra)

INTRO TO PYSPARK
Jon Haddad, Technical Evangelist, DataStax
@rustyrazorblade

WHAT TOOLS ARE YOU ALREADY USING FOR DATA ANALYSIS?
NumPy / SciPy
Pandas
IPython Notebooks
scikit-learn
hdf5
pybrain

WHAT'S THE PROBLEM?
GREAT TOOLS
BUT NOT BUILT FOR BIG DATA SETS
And not real time...

LIMITED TO 1 MACHINE
What if we have a lot of data?
What if we use Cassandra?
We need distributed computing

Use when we have more data than fits on a single machine
WHAT IS SPARK?
Fast and general purpose cluster computing system

LANGUAGES
Scala
Java
R (Spark version >= 1.4)
Python

WHAT CAN I DO WITH IT?
Read and write data in bulk to and from Cassandra
Batch processing
Stream processing
Machine Learning
Distributed SQL

Operate on entire dataset (or at least a big chunk of it)
BATCH PROCESSING

RDD
Resilient Distributed Dataset (it's a big list)
Use functional concepts like map, filter, reduce
Caveat: Will always pay penalty going from JVM <> Python
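
The functional style mentioned above can be tried on an ordinary Python list before touching a cluster. This is only a local analogy, not Spark code: the `numbers` list is made-up sample data, and an RDD would spread the same operations across many machines.

```python
from functools import reduce

# Hypothetical sample data standing in for the contents of an RDD.
numbers = [1, 2, 3, 4, 5, 6]

# map: transform every element.
squares = list(map(lambda n: n * n, numbers))

# filter: keep only the elements that match a predicate.
even_squares = list(filter(lambda n: n % 2 == 0, squares))

# reduce: fold the remaining elements down to a single value.
total = reduce(lambda a, b: a + b, even_squares)

print(squares)
print(even_squares)
print(total)
```

RDDs expose methods with the same names (map, filter, reduce), so code written in this style carries over with few changes.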

USERS
name     favorite_food
jon      bacon
luke     pie
patrick  pizza
rachel   pizza

SET UP OUR KEYSPACE
CREATE KEYSPACE demo WITH replication =
    {'class': 'SimpleStrategy', 'replication_factor': 1};
USE demo;

CREATE OUR DEMO USER TABLE
CREATE TABLE user ( name text PRIMARY KEY,
                    favorite_food text );
INSERT INTO user (name, favorite_food) VALUES ('jon', 'bacon');
INSERT INTO user (name, favorite_food) VALUES ('luke', 'pie');
INSERT INTO user (name, favorite_food) VALUES ('patrick', 'pizza');
INSERT INTO user (name, favorite_food) VALUES ('rachel', 'pizza');
CREATE TABLE favorite_foods ( food text, name text,
                              PRIMARY KEY (food, name) );

MAPPING FOODS TO USERS
from pyspark_cassandra import CassandraSparkContext, Row
from pyspark import SparkContext, SparkConf
# Parentheses let the chained configuration calls span several lines.
conf = (SparkConf()
        .setAppName("User Food Migration")
        .setMaster("spark://127.0.0.1:7077")
        .set("spark.cassandra.connection.host", "127.0.0.1"))
sc = CassandraSparkContext(conf=conf)
# Read every row of demo.user and reshape it for demo.favorite_foods.
users = sc.cassandraTable("demo", "user")
favorite_foods = users.map(lambda x: {"food": x['favorite_food'],
                                      "name": x['name']})
favorite_foods.saveToCassandra("demo", "favorite_foods")

AGGREGATIONS
u = sc.cassandraTable("demo", "user")
# Emit (food, 1) for every user, then sum the ones per food.
(u.map(lambda x: (x['favorite_food'], 1))
  .reduceByKey(lambda x, y: x + y)
  .collect())
[(u'bacon', 1), (u'pie', 1), (u'pizza', 2)]
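
To see what the map and reduceByKey steps compute, here is a rough single-machine sketch in plain Python. The `reduce_by_key` helper is hypothetical and written only for this illustration; it is not part of Spark, which performs the same combination in parallel across partitions.

```python
def reduce_by_key(pairs, func):
    """Combine the values of all pairs that share a key using func."""
    result = {}
    for key, value in pairs:
        if key in result:
            result[key] = func(result[key], value)
        else:
            result[key] = value
    # Sorted only to make the output order predictable.
    return sorted(result.items())

# The same four rows as the demo user table.
users = [
    {"name": "jon", "favorite_food": "bacon"},
    {"name": "luke", "favorite_food": "pie"},
    {"name": "patrick", "favorite_food": "pizza"},
    {"name": "rachel", "favorite_food": "pizza"},
]

# map step: one (food, 1) pair per user.
pairs = [(u["favorite_food"], 1) for u in users]

# reduceByKey step: add up the ones for each food.
counts = reduce_by_key(pairs, lambda x, y: x + y)
print(counts)
```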

RDDS ARE COOL
And very powerful
But kind of annoying

DATAFRAMES
From R language
Available in Python via Pandas
DataFrames allow for optimized filters, sorting, grouping
With Spark, all the data stays in the JVM
With Cassandra it's still expensive due to JVM <> Python
But it can be fixed

DATAFRAMES EXAMPLE
from pyspark_cassandra import CassandraSparkContext, Row
from pyspark import SparkContext, SparkConf
from pyspark.sql import SQLContext
# sc is the CassandraSparkContext created earlier.
sql = SQLContext(sc) # creating a SQLContext is what adds toDF() to RDDs
users = sc.cassandraTable("demo", "user").toDF()
food_count = (users.select("favorite_food")
              .groupBy("favorite_food").count())
food_count.collect()
[Row(favorite_food=u'bacon', count=1),
Row(favorite_food=u'pizza', count=2),
Row(favorite_food=u'pie', count=1)]

SPARKSQL
Register dataframes as tables
JOIN, GROUP BY

SPARKSQL IN ACTION
sql = SQLContext(sc)
users = sc.cassandraTable("demo", "user").toDF()
users.registerTempTable("users")
sql.sql("""select favorite_food, count(favorite_food)
from users group by favorite_food """).collect()
[Row(favorite_food=u'bacon', c1=1),
Row(favorite_food=u'pizza', c1=2),
Row(favorite_food=u'pie', c1=1)]
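
The query itself is ordinary SQL, so it can be tried without a cluster. The sketch below uses Python's built-in sqlite3 module as a stand-in for Spark SQL, which is an assumption made purely for illustration: the engine is different, and an ORDER BY is added so the row order is predictable.

```python
import sqlite3

# In-memory database standing in for the registered "users" temp table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT PRIMARY KEY, favorite_food TEXT)")
conn.executemany(
    "INSERT INTO users (name, favorite_food) VALUES (?, ?)",
    [("jon", "bacon"), ("luke", "pie"), ("patrick", "pizza"), ("rachel", "pizza")],
)

# Same grouping query as on the slide, plus ORDER BY for stable output.
rows = conn.execute(
    """select favorite_food, count(favorite_food)
       from users group by favorite_food
       order by favorite_food"""
).fetchall()
conn.close()

print(rows)
```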

STREAMING
Operate on batch windows
Each batch is a small RDD

STREAMING
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils
stream = StreamingContext(sc, 1) # 1 second batch interval
kafka_stream = KafkaUtils.createStream(stream,
                                       "localhost:2181",
                                       "raw-event-streaming-consumer",
                                       {"pageviews": 1})
# manipulate kafka_stream as a DStream (a sequence of small RDDs)
stream.start()
stream.awaitTermination()
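
The idea of cutting a stream into small batches can be sketched in plain Python. This is a simplified model, not Spark Streaming code: the `micro_batches` helper and the page-view events are hypothetical, and unlike a real stream the events here are all known up front.

```python
from collections import defaultdict

def micro_batches(events, interval):
    """Group (timestamp, payload) events into batches of `interval` seconds.

    Intervals that received no events are skipped in this sketch.
    """
    batches = defaultdict(list)
    for timestamp, payload in events:
        batches[int(timestamp // interval)].append(payload)
    return [batches[index] for index in sorted(batches)]

# Hypothetical page-view events: (seconds since start, page).
events = [
    (0.1, "/home"),
    (0.7, "/about"),
    (1.2, "/home"),
    (2.5, "/home"),
    (2.9, "/pricing"),
]

batches = micro_batches(events, 1)

# Each batch can now be processed like a small list (or a small RDD).
page_counts = [len(batch) for batch in batches]
print(batches)
print(page_counts)
```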

MACHINE LEARNING
Supervised learning
Unsupervised learning

SUPERVISED LEARNING
When we know the inputs and outputs
Example: Real estate prices
Take existing knowledge about houses and prices
Build a model to predict the future
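
As a minimal sketch of the real estate example, the code below fits a straight line to known house sizes and prices, then predicts the price of an unseen house. The data is invented for illustration, and ordinary least squares is one possible model chosen here, not something the slides prescribe.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical training data: known inputs (size) and outputs (price).
sizes_sqft = [1000.0, 1500.0, 2000.0, 2500.0]
prices = [200000.0, 300000.0, 400000.0, 500000.0]

slope, intercept = fit_line(sizes_sqft, prices)

# Use the model to predict the price of a house it has not seen.
predicted_price = slope * 3000.0 + intercept
print(slope, intercept, predicted_price)
```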

UNSUPERVISED LEARNING
When we don't know the output
Popular usage: discover groups
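
Group discovery can be illustrated with a tiny one-dimensional k-means, sketched below in plain Python. K-means is one common clustering technique picked for this example, not one named in the slides; the points and the starting centroids are made up.

```python
def kmeans_1d(points, centroids, max_iter=100):
    """Cluster numbers around centroids, starting from the given guesses."""
    centroids = list(centroids)
    for _ in range(max_iter):
        # Assignment step: put each point in the cluster of its nearest centroid.
        clusters = [[] for _ in centroids]
        for point in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: abs(point - centroids[i]))
            clusters[nearest].append(point)
        # Update step: move each centroid to the mean of its cluster.
        new_centroids = [
            sum(cluster) / len(cluster) if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
        if new_centroids == centroids:
            break  # Converged: no centroid moved.
        centroids = new_centroids
    return centroids, clusters

# Hypothetical unlabeled data with two obvious groups.
points = [1.0, 1.5, 2.0, 10.0, 10.5, 11.0]
centroids, clusters = kmeans_1d(points, centroids=[1.0, 10.0])
print(centroids)
print(clusters)
```

No labels are supplied anywhere: the two groups emerge from the distances between the points alone.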

INTERACTIVE IPYTHON NOTEBOOKS
Iterate quickly
Visualize your data

GET STARTED!
Open Source:
Download Cassandra
Download Spark
Cassandra PySpark Repo:
https://github.com/TargetHolding/pyspark-cassandra
Integrated solution
Download DataStax Enterprise
