From the original abstract:
If you're already using Cassandra, you're already aware of its strengths: high availability and linear scalability. The downside to this power is less query flexibility. For an OLTP system with an SLA this is an acceptable tradeoff, but for a data scientist it's extremely limiting.
Enter Apache Spark. Apache Spark complements an existing Cassandra cluster by providing a means of executing arbitrary queries, filters, sorting, and aggregation. It's possible to use functional constructs like map, filter, and reduce, as well as SQL and DataFrames.
In this presentation I'll show you how to process Cassandra data in bulk or through a Kafka stream using Python. Then we'll visualize our data using IPython notebooks, leveraging Pandas and matplotlib.
This is an advanced talk. We will assume existing knowledge of Cassandra and CQL.
7. WHAT CAN I DO WITH IT?
Read and write data in bulk to and from Cassandra
Batch processing
Stream processing
Machine Learning
Distributed SQL
8. Operate on entire dataset (or at least a big chunk of it)
BATCH PROCESSING
9. RDD
Resilient Distributed Dataset (it's a big list)
Use functional concepts like map, filter, reduce
Caveat: you will always pay a serialization penalty crossing the JVM <-> Python boundary
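The RDD transformations named above mirror Python's own functional tools. A minimal sketch of the same idea in plain Python (no Spark required), using `map`, `filter`, and `functools.reduce`:

```python
from functools import reduce

# Sample "dataset" -- in Spark this would be an RDD partitioned across the cluster
numbers = [1, 2, 3, 4, 5, 6]

# map: transform every element
squares = list(map(lambda x: x * x, numbers))        # [1, 4, 9, 16, 25, 36]

# filter: keep only elements matching a predicate
evens = list(filter(lambda x: x % 2 == 0, squares))  # [4, 16, 36]

# reduce: combine elements into a single value
total = reduce(lambda a, b: a + b, evens)            # 56
```

With PySpark the chain looks almost identical (`rdd.map(...).filter(...).reduce(...)`), but each closure is shipped to the workers and its data serialized across the JVM <-> Python boundary, which is where the caveat above comes from.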
12. SET UP OUR KEYSPACE
CREATE KEYSPACE demo WITH replication =
{'class': 'SimpleStrategy', 'replication_factor': 1};
USE demo;
13. CREATE OUR DEMO USER TABLE
CREATE TABLE user ( name text PRIMARY KEY,
                    favorite_food text );
INSERT INTO user (name, favorite_food) VALUES ('jon', 'bacon');
INSERT INTO user (name, favorite_food) VALUES ('luke', 'pie');
INSERT INTO user (name, favorite_food) VALUES ('patrick', 'pizza');
INSERT INTO user (name, favorite_food) VALUES ('rachel', 'pizza');
CREATE TABLE favorite_foods ( food text, name text,
                              PRIMARY KEY (food, name));
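The migration step itself (slide 14) is not captured here, but it boils down to inverting each (name, favorite_food) row into a (food, name) row keyed by food. A minimal sketch of that transformation in plain Python; in the talk this would be a `map` over the `user` RDD followed by a write back to Cassandra:

```python
# Rows as they exist in the demo.user table
users = [
    {"name": "jon",     "favorite_food": "bacon"},
    {"name": "luke",    "favorite_food": "pie"},
    {"name": "patrick", "favorite_food": "pizza"},
    {"name": "rachel",  "favorite_food": "pizza"},
]

# Invert each row so food becomes the partition key of favorite_foods
favorite_foods = [{"food": u["favorite_food"], "name": u["name"]} for u in users]
```

Making food the partition key is what enables the `WHERE food = 'pizza'` query shown in the migration results below.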
15. MIGRATION RESULTS
cqlsh:demo> select * from favorite_foods ;
food | name
-------+---------
pizza | patrick
pizza | rachel
pie | luke
bacon | jon
(4 rows)
cqlsh:demo> select * from favorite_foods where food = 'pizza';
food | name
-------+---------
pizza | patrick
pizza | rachel
18. DATAFRAMES
From the R language
Available in Python via Pandas
DataFrames allow for optimized filters, sorting, and grouping
With Spark, all the data stays in the JVM
With Cassandra it's still expensive due to the JVM <-> Python round trip
But it can be fixed
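For comparison, here is the same favorite-food count in plain Pandas: a local, single-machine sketch of the grouping that Spark's DataFrame API performs in a distributed way (the data is the four demo users):

```python
import pandas as pd

# The same four users from the demo keyspace, as a local DataFrame
users = pd.DataFrame({
    "name":          ["jon", "luke", "patrick", "rachel"],
    "favorite_food": ["bacon", "pie", "pizza", "pizza"],
})

# Group and count, analogous to Spark's groupBy().count()
food_count = users.groupby("favorite_food").size()
```

The Spark version of the same query follows on the next slide.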
19. DATAFRAMES EXAMPLE
from pyspark_cassandra import CassandraSparkContext, Row
from pyspark import SparkContext, SparkConf
from pyspark.sql import SQLContext  # needed for toDF()

# assumes sc is a CassandraSparkContext created earlier
users = sc.cassandraTable("demo", "user").toDF()
food_count = (users.select("favorite_food")
                   .groupBy("favorite_food")
                   .count())
food_count.collect()
[Row(favorite_food=u'bacon', count=1),
Row(favorite_food=u'pizza', count=2),
Row(favorite_food=u'pie', count=1)]
26. SUPERVISED LEARNING
When we know the inputs and outputs
Example: Real estate prices
Take existing knowledge about houses and prices
Build a model to predict the future
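The real-estate example above can be sketched as a one-variable linear regression: fit price = a * size + b on known (size, price) pairs, then predict a house the model has never seen. A minimal closed-form least-squares sketch in plain Python, with made-up illustrative data:

```python
# Known houses: (square_feet, price) -- the "existing knowledge"
houses = [(1000, 200_000), (1500, 300_000), (2000, 400_000)]

n = len(houses)
mean_x = sum(x for x, _ in houses) / n
mean_y = sum(y for _, y in houses) / n

# Ordinary least squares for y = a*x + b
a = sum((x - mean_x) * (y - mean_y) for x, y in houses) / \
    sum((x - mean_x) ** 2 for x, _ in houses)
b = mean_y - a * mean_x

# Predict the price of an unseen house
predicted = a * 1750 + b   # 350000.0 for this perfectly linear data
```

At cluster scale, Spark's MLlib provides the distributed equivalent of this fit.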