Apache Spark 
Easy and Fast Big Data Analytics 
Pat McDonough
Founded by the creators of Apache Spark 
out of UC Berkeley’s AMPLab 
Fully committed to 100% open source 
Apache Spark 
Support and Grow the 
Spark Community and Ecosystem 
Building Databricks Cloud
Databricks & Datastax 
Apache Spark is packaged as part of Datastax 
Enterprise Analytics 4.5 
Databricks & Datastax Have Partnered for 
Apache Spark Engineering and Support
Big Data Analytics 
Where We’ve Been 
• 2003 & 2004 - Google 
GFS & MapReduce Papers 
are Precursors to Hadoop 
• 2006 & 2007 - Google 
BigTable and Amazon 
Dynamo Papers are 
Precursors to Cassandra, 
HBase, Others
Big Data Analytics 
A Zoo of Innovation
What's Working? 
Many Excellent Innovations Have Come From Big Data Analytics: 
• Distributed & Data-Parallel Computing Is Disruptive… Because We Needed It 
• We Now Have Massive Throughput… and Solved the ETL Problem 
• The Data Hub/Lake Is Possible
What Needs to Improve? 
Go Beyond MapReduce 
MapReduce is a Very Powerful 
and Flexible Engine 
Processing Throughput 
Previously Unobtainable on 
Commodity Equipment 
But MapReduce Isn’t Enough: 
• Essentially Batch-only 
• Inefficient with respect to 
memory use, latency 
• Too Hard to Program
What Needs to Improve? 
Go Beyond (S)QL 
SQL Support Has Been A 
Welcome Interface on Many 
Platforms 
And in many cases, a faster 
alternative 
But SQL Is Often Not Enough: 
• Sometimes you want to write real programs 
(Loops, variables, functions, existing 
libraries) but don’t want to build UDFs. 
• Machine Learning (see above, plus iterative) 
• Multi-step pipelines 
• Often an Additional System
What Needs to Improve? 
Ease of Use 
Big Data Distributions Provide a 
number of Useful Tools and 
Systems 
Choices are Good to Have 
But This Is Often Unsatisfactory: 
• Each new system has its own configs, 
APIs, and management; coordinating 
multiple systems is challenging 
• A typical solution requires stringing 
together disparate systems - we need 
unification 
• Developers want the full power of their 
programming language
What Needs to Improve? 
Latency 
Big Data systems are 
throughput-oriented 
Some new SQL Systems 
provide interactivity 
But We Need More: 
• Interactivity beyond SQL 
interfaces 
• Repeated access of the same 
datasets (i.e. caching)
Can Spark Solve These 
Problems?
Apache Spark 
Originally developed in 2009 in UC Berkeley’s 
AMPLab 
Fully open sourced in 2010 – now at Apache 
Software Foundation 
http://spark.apache.org
Project Activity 

                          June 2013    June 2014 
total contributors            68          255 
companies contributing        17           50 
total lines of code         63,000      175,000
Compared to Other Projects 
[Bar charts: commits and lines of code changed in the past 6 months, Spark vs. other projects]
Spark is now the most active project in the 
Hadoop ecosystem
Spark on GitHub 
So active on GitHub, sometimes we break it 
Over 1,200 forks (can't display network graphs) 
~80 commits to master each week 
So many PRs that we built our own PR UI
Apache Spark - Easy to 
Use And Very Fast 
Fast and general cluster computing system interoperable with Big Data 
Systems Like Hadoop and Cassandra 
Improved Efficiency: 
• In-memory computing primitives (up to 100× faster; 2-10× on disk) 
• General computation graphs 
Improved Usability: 
• Rich APIs (2-5× less code) 
• Interactive shell
Apache Spark - A 
Robust SDK for Big 
Data Applications 
[Diagram: SQL | Machine Learning | Streaming | Graph libraries on top of Spark Core]
Unified System With Libraries to 
Build a Complete Solution 

Full-featured Programming 
Environment in Scala, Java, Python… 

Very developer-friendly, Functional 
API for working with Data 

Runtimes available on several 
platforms
Spark Is A Part Of Most 
Big Data Platforms 
• All Major Hadoop Distributions Include 
Spark 
• Spark Is Also Integrated With Non-Hadoop 
Big Data Platforms like DSE 
• Spark Applications Can Be Written Once 
and Deployed Anywhere 
[Diagram: SQL | Machine Learning | Streaming | Graph on top of Spark Core] 
Deploy Spark Apps Anywhere
Easy: Get Started 
Immediately 
Interactive Shell with Multi-language Support 

Python 
lines = sc.textFile(...) 
lines.filter(lambda s: "ERROR" in s).count() 

Scala 
val lines = sc.textFile(...) 
lines.filter(x => x.contains("ERROR")).count() 

Java 
JavaRDD<String> lines = sc.textFile(...); 
lines.filter(new Function<String, Boolean>() { 
  public Boolean call(String s) { 
    return s.contains("ERROR"); 
  } 
}).count();
Easy: Clean API 
Write programs in terms of transformations on 
distributed datasets 
Resilient Distributed Datasets 
• Collections of objects spread 
across a cluster, stored in RAM 
or on Disk 
• Built through parallel 
transformations 
• Automatically rebuilt on failure 
Operations 
• Transformations 
(e.g. map, filter, groupBy) 
• Actions 
(e.g. count, collect, save)
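The transformation/action split above can be sketched outside Spark. Below is a minimal toy in plain Python (not the real Spark RDD implementation; `ToyRDD` and its methods are invented for illustration) showing how transformations merely record work while actions trigger it:

```python
# Toy illustration of lazy transformations vs. eager actions
# (plain Python, not the actual Spark API).

class ToyRDD:
    def __init__(self, data, ops=None):
        self.data = data          # source collection
        self.ops = ops or []      # recorded transformations (lazy)

    # Transformations: record the operation, return a new ToyRDD
    def map(self, f):
        return ToyRDD(self.data, self.ops + [("map", f)])

    def filter(self, f):
        return ToyRDD(self.data, self.ops + [("filter", f)])

    # Actions: replay the recorded pipeline and produce a value
    def collect(self):
        out = self.data
        for kind, f in self.ops:
            if kind == "map":
                out = [f(x) for x in out]
            else:
                out = [x for x in out if f(x)]
        return out

    def count(self):
        return len(self.collect())

rdd = ToyRDD(["ERROR a", "INFO b", "ERROR c"])
errors = rdd.filter(lambda s: s.startswith("ERROR"))  # nothing runs yet
print(errors.count())                                 # pipeline runs here → 2
```

Real RDDs work the same way at a high level: the chain of transformations is only evaluated when an action such as `count` or `collect` needs a result.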
Easy: Expressive API 
map, filter, groupBy, sort, union, join, leftOuterJoin, 
rightOuterJoin, reduce, count, fold, reduceByKey, 
groupByKey, cogroup, cross, zip, sample, take, first, 
partitionBy, mapWith, pipe, save, ...
Easy: Example – Word Count 
Hadoop MapReduce 
public static class WordCountMapClass extends MapReduceBase 
  implements Mapper<LongWritable, Text, Text, IntWritable> { 

  private final static IntWritable one = new IntWritable(1); 
  private Text word = new Text(); 

  public void map(LongWritable key, Text value, 
                  OutputCollector<Text, IntWritable> output, 
                  Reporter reporter) throws IOException { 
    String line = value.toString(); 
    StringTokenizer itr = new StringTokenizer(line); 
    while (itr.hasMoreTokens()) { 
      word.set(itr.nextToken()); 
      output.collect(word, one); 
    } 
  } 
} 

public static class WordCountReduce extends MapReduceBase 
  implements Reducer<Text, IntWritable, Text, IntWritable> { 

  public void reduce(Text key, Iterator<IntWritable> values, 
                     OutputCollector<Text, IntWritable> output, 
                     Reporter reporter) throws IOException { 
    int sum = 0; 
    while (values.hasNext()) { 
      sum += values.next().get(); 
    } 
    output.collect(key, new IntWritable(sum)); 
  } 
} 

Spark 
val spark = new SparkContext(master, appName, [sparkHome], [jars]) 
val file = spark.textFile("hdfs://...") 
val counts = file.flatMap(line => line.split(" ")) 
                 .map(word => (word, 1)) 
                 .reduceByKey(_ + _) 
counts.saveAsTextFile("hdfs://...")
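What the flatMap → map → reduceByKey pipeline computes can be mimicked on one machine with a `Counter`; a plain-Python sketch (not distributed, and not the Spark API; the sample lines are made up):

```python
from collections import Counter

lines = ["to be or not to be", "to do or not to do"]

# flatMap: split each line into words, flattening into one list
words = [w for line in lines for w in line.split(" ")]

# map + reduceByKey: pair each word with 1, then sum the 1s per key;
# Counter does both steps at once
counts = Counter(words)

print(counts["to"])   # 4
```

The Spark version does exactly this shape of computation, but partitions the lines across the cluster and merges per-partition counts in the `reduceByKey` shuffle.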
Easy: Works Well With 
Hadoop 
Data Compatibility 
• Access your existing Hadoop 
Data 
• Use the same data formats 
• Adheres to data locality for 
efficient processing 
Deployment Models 
• “Standalone” deployment 
• YARN-based deployment 
• Mesos-based deployment 
• Deploy on existing Hadoop 
cluster or side-by-side
Example: Logistic Regression 
data = spark.textFile(...).map(readPoint).cache() 

w = numpy.random.rand(D) 

for i in range(iterations): 
    gradient = data \
        .map(lambda p: (1 / (1 + exp(-p.y * w.dot(p.x))) - 1) 
             * p.y * p.x) \
        .reduce(lambda x, y: x + y) 
    w -= gradient 

print "Final w: %s" % w
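The per-iteration update can be checked on a tiny in-memory dataset. A pure-Python sketch (no Spark, no NumPy; `Point`, the 1-D data, and the iteration count are invented for illustration):

```python
import math
from collections import namedtuple

# One feature, label y in {-1, +1}; standard logistic-loss gradient step
Point = namedtuple("Point", ["x", "y"])

data = [Point(2.0, 1), Point(-1.5, -1), Point(3.0, 1), Point(-2.5, -1)]
w = 0.0

for i in range(10):
    # Same gradient as the slide, specialized to one dimension:
    # (sigmoid(y * w * x) - 1) * y * x, summed over the dataset
    gradient = sum(
        (1 / (1 + math.exp(-p.y * w * p.x)) - 1) * p.y * p.x
        for p in data
    )
    w -= gradient

print("final w:", w)
```

After a few iterations `w` is positive, so every point is classified correctly (`p.y * w * p.x > 0` for all points). In the Spark version the `sum` becomes a cluster-wide `map`/`reduce`, and `cache()` keeps the points in RAM across iterations.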
Fast: Using RAM, Operator 
Graphs 
In-memory Caching 
• Data Partitions read from RAM 
instead of disk 
Operator Graphs 
• Scheduling Optimizations 
• Fault Tolerance 
[Diagram: DAG of RDDs and cached partitions; operators such as map, filter, groupBy, and join are grouped into stages (Stage 1-3) for scheduling optimizations and fault tolerance]
Fast: Logistic Regression 
Performance 
[Chart: running time (s) over 1-30 iterations. Hadoop: ~110 s per iteration; Spark: ~80 s for the first iteration, ~1 s for further iterations]
Fast: Scales Down Seamlessly 
Execution time (s) by % of working set in cache: 
Cache disabled: 68.8 | 25%: 58.1 | 50%: 40.7 | 75%: 29.7 | Fully cached: 11.5
Easy: Fault Recovery 
RDDs track lineage information that can be used to 
efficiently recompute lost data 
msgs = textFile.filter(lambda s: s.startswith("ERROR")) 
               .map(lambda s: s.split("\t")[2]) 

[Diagram: HDFS File → Filtered RDD via filter(func = startswith(…)) → Mapped RDD via map(func = split(…))]
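Lineage-based recovery can be illustrated with a tiny sketch: instead of replicating the derived data, keep the chain of functions and re-run it over the surviving source partition (plain Python, illustrative only; the log lines and helper names are made up):

```python
# A lost partition is rebuilt by replaying its lineage over the
# source data, rather than being restored from a replica.

source = ["ERROR\tdisk\tfull", "INFO\tok", "ERROR\tnet\tdown"]

# Recorded lineage: the same filter + map as the slide's example
lineage = [
    lambda part: [s for s in part if s.startswith("ERROR")],  # filter
    lambda part: [s.split("\t")[2] for s in part],            # map
]

def recompute(source_partition, lineage):
    out = source_partition
    for step in lineage:
        out = step(out)
    return out

derived = recompute(source, lineage)   # ["full", "down"]
# If `derived` is lost (e.g. a worker dies), running
# recompute(source, lineage) again rebuilds exactly the same partition.
```

This is why RDD fault tolerance is cheap: the lineage is small even when the data is huge, and only the lost partitions need to be recomputed.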
How Spark Works
Working With RDDs 

textFile = sc.textFile("SomeFile.txt")                           # RDD 

linesWithSpark = textFile.filter(lambda line: "Spark" in line)   # Transformation → new RDD 

linesWithSpark.count()    # Action → Value 
74 

linesWithSpark.first() 
# Apache Spark
Example: Log Mining 
Load error messages from a log into memory, then interactively search for 
various patterns
lines = spark.textFile("hdfs://...") 
errors = lines.filter(lambda s: s.startswith("ERROR")) 
messages = errors.map(lambda s: s.split("\t")[2]) 
messages.cache() 

messages.filter(lambda s: "mysql" in s).count()   # Action 
messages.filter(lambda s: "php" in s).count() 

[Animated sequence: the driver sends tasks to three workers; each worker reads its HDFS block, processes and caches its partition, and returns results. The second count() is answered entirely from the cached partitions.] 

Cache your data ➔ Faster Results 
Full-text search of Wikipedia 
• 60GB on 20 EC2 machines 
• 0.5 sec from cache vs. 20s for on-disk
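The effect these slides demonstrate, paying the scan cost once and answering many queries from memory, can be sketched in plain Python by counting how often the expensive step runs (illustrative only, not Spark; `read_log` and its contents are invented stand-ins):

```python
scans = 0

def read_log():
    # Stand-in for the expensive HDFS scan + filter + map
    global scans
    scans += 1
    return ["mysql down", "php warning", "mysql timeout"]

# Without caching: every query re-runs the scan
sum("mysql" in s for s in read_log())
sum("php" in s for s in read_log())
assert scans == 2

# With caching: materialize once (the messages.cache() analogue),
# then answer every query from memory
scans = 0
messages = read_log()
q1 = sum("mysql" in s for s in messages)   # 2 matches
q2 = sum("php" in s for s in messages)     # 1 match
assert scans == 1
```

Spark's `cache()` plays the role of the materialized `messages` list, but per partition and across a cluster, which is where the 20 s → 0.5 s Wikipedia numbers come from.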
Cassandra + Spark: 
A Great Combination 
Both are Easy to Use 
Spark Can Help You Bridge Your Hadoop and 
Cassandra Systems 
Use Spark Libraries, Caching on-top of Cassandra-stored 
Data 
Combine Spark Streaming with Cassandra Storage 

Datastax spark-cassandra-connector: 
https://github.com/datastax/spark-cassandra-connector
Schema RDDs (Spark SQL) 
• Built-in Mechanism for recognizing Structured data in Spark 
• Allow for systems to apply several data access and relational 
optimizations (e.g. predicate push-down, partition pruning, broadcast 
joins) 
• Columnar in-memory representation when cached 
• Native Support for structured formats like Parquet and JSON 
• Great Compatibility with the Rest of the Stack (Python, libraries, etc.)
Thank You! 
Visit http://databricks.com: 
Blogs, Tutorials and more 
Questions?

 
Introduction to Spark (Intern Event Presentation)
Introduction to Spark (Intern Event Presentation)Introduction to Spark (Intern Event Presentation)
Introduction to Spark (Intern Event Presentation)Databricks
 
Johnny Miller – Cassandra + Spark = Awesome- NoSQL matters Barcelona 2014
Johnny Miller – Cassandra + Spark = Awesome- NoSQL matters Barcelona 2014Johnny Miller – Cassandra + Spark = Awesome- NoSQL matters Barcelona 2014
Johnny Miller – Cassandra + Spark = Awesome- NoSQL matters Barcelona 2014NoSQLmatters
 

Similar to Cassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms (20)

20170126 big data processing
20170126 big data processing20170126 big data processing
20170126 big data processing
 
Apache Spark for Everyone - Women Who Code Workshop
Apache Spark for Everyone - Women Who Code WorkshopApache Spark for Everyone - Women Who Code Workshop
Apache Spark for Everyone - Women Who Code Workshop
 
Let Spark Fly: Advantages and Use Cases for Spark on Hadoop
 Let Spark Fly: Advantages and Use Cases for Spark on Hadoop Let Spark Fly: Advantages and Use Cases for Spark on Hadoop
Let Spark Fly: Advantages and Use Cases for Spark on Hadoop
 
Apache Spark & Hadoop
Apache Spark & HadoopApache Spark & Hadoop
Apache Spark & Hadoop
 
Spark meetup TCHUG
Spark meetup TCHUGSpark meetup TCHUG
Spark meetup TCHUG
 
BDM25 - Spark runtime internal
BDM25 - Spark runtime internalBDM25 - Spark runtime internal
BDM25 - Spark runtime internal
 
In Memory Analytics with Apache Spark
In Memory Analytics with Apache SparkIn Memory Analytics with Apache Spark
In Memory Analytics with Apache Spark
 
Meetup ml spark_ppt
Meetup ml spark_pptMeetup ml spark_ppt
Meetup ml spark_ppt
 
Lightening Fast Big Data Analytics using Apache Spark
Lightening Fast Big Data Analytics using Apache SparkLightening Fast Big Data Analytics using Apache Spark
Lightening Fast Big Data Analytics using Apache Spark
 
Spark Study Notes
Spark Study NotesSpark Study Notes
Spark Study Notes
 
Big data week presentation
Big data week presentationBig data week presentation
Big data week presentation
 
Intro to Apache Spark by CTO of Twingo
Intro to Apache Spark by CTO of TwingoIntro to Apache Spark by CTO of Twingo
Intro to Apache Spark by CTO of Twingo
 
ETL to ML: Use Apache Spark as an end to end tool for Advanced Analytics
ETL to ML: Use Apache Spark as an end to end tool for Advanced AnalyticsETL to ML: Use Apache Spark as an end to end tool for Advanced Analytics
ETL to ML: Use Apache Spark as an end to end tool for Advanced Analytics
 
Hadoop Big Data A big picture
Hadoop Big Data A big pictureHadoop Big Data A big picture
Hadoop Big Data A big picture
 
An introduction To Apache Spark
An introduction To Apache SparkAn introduction To Apache Spark
An introduction To Apache Spark
 
Unified Big Data Processing with Apache Spark
Unified Big Data Processing with Apache SparkUnified Big Data Processing with Apache Spark
Unified Big Data Processing with Apache Spark
 
Hadoop basics
Hadoop basicsHadoop basics
Hadoop basics
 
A look under the hood at Apache Spark's API and engine evolutions
A look under the hood at Apache Spark's API and engine evolutionsA look under the hood at Apache Spark's API and engine evolutions
A look under the hood at Apache Spark's API and engine evolutions
 
Introduction to Spark (Intern Event Presentation)
Introduction to Spark (Intern Event Presentation)Introduction to Spark (Intern Event Presentation)
Introduction to Spark (Intern Event Presentation)
 
Johnny Miller – Cassandra + Spark = Awesome- NoSQL matters Barcelona 2014
Johnny Miller – Cassandra + Spark = Awesome- NoSQL matters Barcelona 2014Johnny Miller – Cassandra + Spark = Awesome- NoSQL matters Barcelona 2014
Johnny Miller – Cassandra + Spark = Awesome- NoSQL matters Barcelona 2014
 

More from DataStax Academy

Forrester CXNYC 2017 - Delivering great real-time cx is a true craft
Forrester CXNYC 2017 - Delivering great real-time cx is a true craftForrester CXNYC 2017 - Delivering great real-time cx is a true craft
Forrester CXNYC 2017 - Delivering great real-time cx is a true craftDataStax Academy
 
Introduction to DataStax Enterprise Graph Database
Introduction to DataStax Enterprise Graph DatabaseIntroduction to DataStax Enterprise Graph Database
Introduction to DataStax Enterprise Graph DatabaseDataStax Academy
 
Introduction to DataStax Enterprise Advanced Replication with Apache Cassandra
Introduction to DataStax Enterprise Advanced Replication with Apache CassandraIntroduction to DataStax Enterprise Advanced Replication with Apache Cassandra
Introduction to DataStax Enterprise Advanced Replication with Apache CassandraDataStax Academy
 
Cassandra on Docker @ Walmart Labs
Cassandra on Docker @ Walmart LabsCassandra on Docker @ Walmart Labs
Cassandra on Docker @ Walmart LabsDataStax Academy
 
Cassandra 3.0 Data Modeling
Cassandra 3.0 Data ModelingCassandra 3.0 Data Modeling
Cassandra 3.0 Data ModelingDataStax Academy
 
Cassandra Adoption on Cisco UCS & Open stack
Cassandra Adoption on Cisco UCS & Open stackCassandra Adoption on Cisco UCS & Open stack
Cassandra Adoption on Cisco UCS & Open stackDataStax Academy
 
Data Modeling for Apache Cassandra
Data Modeling for Apache CassandraData Modeling for Apache Cassandra
Data Modeling for Apache CassandraDataStax Academy
 
Production Ready Cassandra
Production Ready CassandraProduction Ready Cassandra
Production Ready CassandraDataStax Academy
 
Cassandra @ Netflix: Monitoring C* at Scale, Gossip and Tickler & Python
Cassandra @ Netflix: Monitoring C* at Scale, Gossip and Tickler & PythonCassandra @ Netflix: Monitoring C* at Scale, Gossip and Tickler & Python
Cassandra @ Netflix: Monitoring C* at Scale, Gossip and Tickler & PythonDataStax Academy
 
Cassandra @ Sony: The good, the bad, and the ugly part 1
Cassandra @ Sony: The good, the bad, and the ugly part 1Cassandra @ Sony: The good, the bad, and the ugly part 1
Cassandra @ Sony: The good, the bad, and the ugly part 1DataStax Academy
 
Cassandra @ Sony: The good, the bad, and the ugly part 2
Cassandra @ Sony: The good, the bad, and the ugly part 2Cassandra @ Sony: The good, the bad, and the ugly part 2
Cassandra @ Sony: The good, the bad, and the ugly part 2DataStax Academy
 
Standing Up Your First Cluster
Standing Up Your First ClusterStanding Up Your First Cluster
Standing Up Your First ClusterDataStax Academy
 
Real Time Analytics with Dse
Real Time Analytics with DseReal Time Analytics with Dse
Real Time Analytics with DseDataStax Academy
 
Introduction to Data Modeling with Apache Cassandra
Introduction to Data Modeling with Apache CassandraIntroduction to Data Modeling with Apache Cassandra
Introduction to Data Modeling with Apache CassandraDataStax Academy
 
Enabling Search in your Cassandra Application with DataStax Enterprise
Enabling Search in your Cassandra Application with DataStax EnterpriseEnabling Search in your Cassandra Application with DataStax Enterprise
Enabling Search in your Cassandra Application with DataStax EnterpriseDataStax Academy
 
Advanced Data Modeling with Apache Cassandra
Advanced Data Modeling with Apache CassandraAdvanced Data Modeling with Apache Cassandra
Advanced Data Modeling with Apache CassandraDataStax Academy
 

More from DataStax Academy (20)

Forrester CXNYC 2017 - Delivering great real-time cx is a true craft
Forrester CXNYC 2017 - Delivering great real-time cx is a true craftForrester CXNYC 2017 - Delivering great real-time cx is a true craft
Forrester CXNYC 2017 - Delivering great real-time cx is a true craft
 
Introduction to DataStax Enterprise Graph Database
Introduction to DataStax Enterprise Graph DatabaseIntroduction to DataStax Enterprise Graph Database
Introduction to DataStax Enterprise Graph Database
 
Introduction to DataStax Enterprise Advanced Replication with Apache Cassandra
Introduction to DataStax Enterprise Advanced Replication with Apache CassandraIntroduction to DataStax Enterprise Advanced Replication with Apache Cassandra
Introduction to DataStax Enterprise Advanced Replication with Apache Cassandra
 
Cassandra on Docker @ Walmart Labs
Cassandra on Docker @ Walmart LabsCassandra on Docker @ Walmart Labs
Cassandra on Docker @ Walmart Labs
 
Cassandra 3.0 Data Modeling
Cassandra 3.0 Data ModelingCassandra 3.0 Data Modeling
Cassandra 3.0 Data Modeling
 
Cassandra Adoption on Cisco UCS & Open stack
Cassandra Adoption on Cisco UCS & Open stackCassandra Adoption on Cisco UCS & Open stack
Cassandra Adoption on Cisco UCS & Open stack
 
Data Modeling for Apache Cassandra
Data Modeling for Apache CassandraData Modeling for Apache Cassandra
Data Modeling for Apache Cassandra
 
Coursera Cassandra Driver
Coursera Cassandra DriverCoursera Cassandra Driver
Coursera Cassandra Driver
 
Production Ready Cassandra
Production Ready CassandraProduction Ready Cassandra
Production Ready Cassandra
 
Cassandra @ Netflix: Monitoring C* at Scale, Gossip and Tickler & Python
Cassandra @ Netflix: Monitoring C* at Scale, Gossip and Tickler & PythonCassandra @ Netflix: Monitoring C* at Scale, Gossip and Tickler & Python
Cassandra @ Netflix: Monitoring C* at Scale, Gossip and Tickler & Python
 
Cassandra @ Sony: The good, the bad, and the ugly part 1
Cassandra @ Sony: The good, the bad, and the ugly part 1Cassandra @ Sony: The good, the bad, and the ugly part 1
Cassandra @ Sony: The good, the bad, and the ugly part 1
 
Cassandra @ Sony: The good, the bad, and the ugly part 2
Cassandra @ Sony: The good, the bad, and the ugly part 2Cassandra @ Sony: The good, the bad, and the ugly part 2
Cassandra @ Sony: The good, the bad, and the ugly part 2
 
Standing Up Your First Cluster
Standing Up Your First ClusterStanding Up Your First Cluster
Standing Up Your First Cluster
 
Real Time Analytics with Dse
Real Time Analytics with DseReal Time Analytics with Dse
Real Time Analytics with Dse
 
Introduction to Data Modeling with Apache Cassandra
Introduction to Data Modeling with Apache CassandraIntroduction to Data Modeling with Apache Cassandra
Introduction to Data Modeling with Apache Cassandra
 
Cassandra Core Concepts
Cassandra Core ConceptsCassandra Core Concepts
Cassandra Core Concepts
 
Enabling Search in your Cassandra Application with DataStax Enterprise
Enabling Search in your Cassandra Application with DataStax EnterpriseEnabling Search in your Cassandra Application with DataStax Enterprise
Enabling Search in your Cassandra Application with DataStax Enterprise
 
Bad Habits Die Hard
Bad Habits Die Hard Bad Habits Die Hard
Bad Habits Die Hard
 
Advanced Data Modeling with Apache Cassandra
Advanced Data Modeling with Apache CassandraAdvanced Data Modeling with Apache Cassandra
Advanced Data Modeling with Apache Cassandra
 
Advanced Cassandra
Advanced CassandraAdvanced Cassandra
Advanced Cassandra
 

Recently uploaded

A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)Gabriella Davis
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking MenDelhi Call girls
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountPuma Security, LLC
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsMaria Levchenko
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsMark Billinghurst
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesSinan KOZAK
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationRadu Cotescu
 
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Alan Dix
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticscarlostorres15106
 
Pigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure servicePooja Nehwal
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsEnterprise Knowledge
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxMalak Abu Hammad
 
Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Paola De la Torre
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationRidwan Fadjar
 
SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024Scott Keck-Warren
 
Maximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxMaximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxOnBoard
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking MenDelhi Call girls
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsMemoori
 
How to Remove Document Management Hurdles with X-Docs?
How to Remove Document Management Hurdles with X-Docs?How to Remove Document Management Hurdles with X-Docs?
How to Remove Document Management Hurdles with X-Docs?XfilesPro
 

Recently uploaded (20)

A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path Mount
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR Systems
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen Frames
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
 
Pigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping Elbows
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptx
 
Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 Presentation
 
SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024
 
Maximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxMaximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptx
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial Buildings
 
How to Remove Document Management Hurdles with X-Docs?
How to Remove Document Management Hurdles with X-Docs?How to Remove Document Management Hurdles with X-Docs?
How to Remove Document Management Hurdles with X-Docs?
 

Cassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms

  • 1. Apache Spark Easy and Fast Big Data Analytics Pat McDonough
  • 2. Founded by the creators of Apache Spark out of UC Berkeley’s AMPLab Fully committed to 100% open source Apache Spark Support and Grow the Spark Community and Ecosystem Building Databricks Cloud
  • 3. Databricks & Datastax Apache Spark is packaged as part of Datastax Enterprise Analytics 4.5 Databricks & Datastax Have Partnered for Apache Spark Engineering and Support
  • 4. Big Data Analytics Where We’ve Been • 2003 & 2004 - Google GFS & MapReduce Papers are Precursors to Hadoop • 2006 & 2007 - Google BigTable and Amazon Dynamo Papers are Precursors to Cassandra, HBase, Others
  • 5. Big Data Analytics A Zoo of Innovation
  • 6. Big Data Analytics A Zoo of Innovation
  • 7. Big Data Analytics A Zoo of Innovation
  • 8. Big Data Analytics A Zoo of Innovation
  • 9. What's Working? Many Excellent Innovations Have Come From Big Data Analytics: • Distributed & Data Parallel is disruptive ... because we needed it • We Now Have Massive throughput… Solved the ETL Problem • The Data Hub/Lake Is Possible
  • 10. What Needs to Improve? Go Beyond MapReduce MapReduce is a Very Powerful and Flexible Engine Processing Throughput Previously Unobtainable on Commodity Equipment But MapReduce Isn’t Enough: • Essentially Batch-only • Inefficient with respect to memory use, latency • Too Hard to Program
  • 11. What Needs to Improve? Go Beyond (S)QL SQL Support Has Been A Welcome Interface on Many Platforms And in many cases, a faster alternative But SQL Is Often Not Enough: • Sometimes you want to write real programs (Loops, variables, functions, existing libraries) but don’t want to build UDFs. • Machine Learning (see above, plus iterative) • Multi-step pipelines • Often an Additional System
  • 12. What Needs to Improve? Ease of Use Big Data Distributions Provide a number of Useful Tools and Systems Choices are Good to Have But This Is Often Unsatisfactory: • Each new system has its own configs, APIs, and management; coordinating multiple systems is challenging • A typical solution requires stringing together disparate systems - we need unification • Developers want the full power of their programming language
  • 13. What Needs to Improve? Latency Big Data systems are throughput-oriented Some new SQL Systems provide interactivity But We Need More: • Interactivity beyond SQL interfaces • Repeated access of the same datasets (i.e. caching)
  • 14. Can Spark Solve These Problems?
  • 15. Apache Spark Originally developed in 2009 in UC Berkeley’s AMPLab Fully open sourced in 2010 – now at Apache Software Foundation http://spark.apache.org
  • 16. Project Activity (June 2013 → June 2014): total contributors 68 → 255; companies contributing 17 → 50; total lines of code 63,000 → 175,000
  • 17. Project Activity (June 2013 → June 2014): total contributors 68 → 255; companies contributing 17 → 50; total lines of code 63,000 → 175,000
  • 18. Compared to Other Projects Charts: commits and lines of code changed, activity in past 6 months
  • 19. Compared to Other Projects Charts: commits and lines of code changed, activity in past 6 months Spark is now the most active project in the Hadoop ecosystem
  • 20. Spark on Github So active on Github, sometimes we break it Over 1200 Forks (can’t display Network Graphs) ~80 commits to master each week So many PRs We Built our own PR UI
  • 21. Apache Spark - Easy to Use And Very Fast Fast and general cluster computing system interoperable with Big Data Systems Like Hadoop and Cassandra Improved Efficiency: • In-memory computing primitives • General computation graphs Improved Usability: • Rich APIs • Interactive shell
  • 22. Apache Spark - Easy to Use And Very Fast Fast and general cluster computing system interoperable with Big Data Systems Like Hadoop and Cassandra Improved Efficiency: • Up to 100× faster In-memory computing primitives • (2-10× on disk) General computation graphs Improved Usability: • Rich APIs 2-5× less code • Interactive shell
  • 23. Apache Spark - A Robust SDK for Big Data Applications SQL Machine Learning Streaming Graph Core Unified System With Libraries to Build a Complete Solution ! Full-featured Programming Environment in Scala, Java, Python… Very developer-friendly, Functional API for working with Data ! Runtimes available on several platforms
  • 24. Spark Is A Part Of Most Big Data Platforms • All Major Hadoop Distributions Include Spark • Spark Is Also Integrated With Non-Hadoop Big Data Platforms like DSE • Spark Applications Can Be Written Once and Deployed Anywhere SQL Machine Learning Streaming Graph Core Deploy Spark Apps Anywhere
  • 25. Easy: Get Started Immediately Interactive Shell Multi-language support Python lines = sc.textFile(...) lines.filter(lambda s: "ERROR" in s).count() Scala val lines = sc.textFile(...) lines.filter(x => x.contains("ERROR")).count() Java JavaRDD<String> lines = sc.textFile(...); lines.filter(new Function<String, Boolean>() { Boolean call(String s) { return s.contains("ERROR"); } }).count();
  • 26. Easy: Clean API Write programs in terms of transformations on distributed datasets Resilient Distributed Datasets • Collections of objects spread across a cluster, stored in RAM or on Disk • Built through parallel transformations • Automatically rebuilt on failure Operations • Transformations (e.g. map, filter, groupBy) • Actions (e.g. count, collect, save)
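The transformation/action split on slide 26 can be made concrete with a toy class in plain Python. This is a hypothetical sketch, not the real Spark API: transformations (`map`, `filter`) only record work, and nothing runs until an action (`count`, `collect`) is called.

```python
# Toy "RDD" illustrating lazy transformations vs. eager actions.
# Serial and in-memory only; real Spark distributes this across a cluster.
class ToyRDD:
    def __init__(self, data, ops=None):
        self._data = data          # source collection
        self._ops = ops or []      # recorded transformations (lazy)

    # --- transformations: return a new ToyRDD, compute nothing yet ---
    def map(self, f):
        return ToyRDD(self._data, self._ops + [("map", f)])

    def filter(self, p):
        return ToyRDD(self._data, self._ops + [("filter", p)])

    # --- actions: run the recorded pipeline and return a value ---
    def collect(self):
        out = self._data
        for kind, f in self._ops:
            out = [f(x) for x in out] if kind == "map" else [x for x in out if f(x)]
        return out

    def count(self):
        return len(self.collect())

lines = ToyRDD(["INFO ok", "ERROR disk", "ERROR net"])
errors = lines.filter(lambda s: s.startswith("ERROR"))  # nothing computed yet
print(errors.count())  # action triggers evaluation -> 2
```

The same deferred-pipeline idea is what lets Spark build operator graphs and optimize scheduling before any data moves.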
  • 27. Easy: Expressive API map reduce
  • 28. Easy: Expressive API map filter groupBy sort union join leftOuterJoin rightOuterJoin reduce count fold reduceByKey groupByKey cogroup cross zip sample take first partitionBy mapWith pipe save ...
  • 29. Easy: Example – Word Count Hadoop MapReduce public static class WordCountMapClass extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable> { ! private final static IntWritable one = new IntWritable(1); private Text word = new Text(); ! public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException { String line = value.toString(); StringTokenizer itr = new StringTokenizer(line); while (itr.hasMoreTokens()) { word.set(itr.nextToken()); output.collect(word, one); } } } ! public static class WordCountReduce extends MapReduceBase implements Reducer<Text, IntWritable, Text, IntWritable> { ! public void reduce(Text key, Iterator<IntWritable> values, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException { int sum = 0; while (values.hasNext()) { sum += values.next().get(); } output.collect(key, new IntWritable(sum)); } } Spark val spark = new SparkContext(master, appName, [sparkHome], [jars]) val file = spark.textFile("hdfs://...") val counts = file.flatMap(line => line.split(" ")) .map(word => (word, 1)) .reduceByKey(_ + _) counts.saveAsTextFile("hdfs://...")
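The Spark word-count pipeline above (flatMap → map → reduceByKey) can be mirrored in plain Python to make the data flow concrete. This is an illustrative serial sketch; real Spark runs the same steps partitioned across a cluster.

```python
# Plain-Python mirror of the Spark word-count pipeline:
# flatMap(split) -> map(word -> (word, 1)) -> reduceByKey(+).
from collections import Counter

def word_count(lines):
    # flatMap: one line becomes many words
    words = [w for line in lines for w in line.split(" ") if w]
    # map + reduceByKey: emit (word, 1) pairs and sum counts per key
    counts = Counter()
    for w in words:
        counts[w] += 1
    return dict(counts)

print(word_count(["to be or", "not to be"]))
# {'to': 2, 'be': 2, 'or': 1, 'not': 1}
```

The contrast with the Hadoop MapReduce version is the point of the slide: the whole job is a handful of composable operations instead of two boilerplate classes.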
  • 31. Easy: Works Well With Hadoop Data Compatibility • Access your existing Hadoop Data • Use the same data formats • Adheres to data locality for efficient processing ! Deployment Models • “Standalone” deployment • YARN-based deployment • Mesos-based deployment • Deploy on existing Hadoop cluster or side-by-side
  • 32. Example: Logistic Regression data = spark.textFile(...).map(readPoint).cache() ! w = numpy.random.rand(D) ! for i in range(iterations): gradient = data .map(lambda p: (1 / (1 + exp(-p.y * w.dot(p.x))) - 1) * p.y * p.x) .reduce(lambda x, y: x + y) w -= gradient ! print "Final w: %s" % w
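The gradient step in the slide can be run serially in pure Python (no Spark or NumPy) to check the math. This is an illustrative sketch under simple assumptions: points are `(x, y)` pairs with `x` a list of floats and `y` in {-1, +1}, and the learning rate is a hypothetical parameter.

```python
# One gradient-descent step for logistic regression, mirroring the
# per-point term (1 / (1 + exp(-y * w.x)) - 1) * y * x from the slide,
# summed over all points, with w updated as w -= lr * gradient.
import math

def gradient_step(points, w, lr=1.0):
    g = [0.0] * len(w)
    for x, y in points:
        margin = y * sum(wi * xi for wi, xi in zip(w, x))
        scale = (1.0 / (1.0 + math.exp(-margin)) - 1.0) * y
        for i, xi in enumerate(x):
            g[i] += scale * xi
    return [wi - lr * gi for wi, gi in zip(w, g)]

# Two trivially separable points: y = +1 when x > 0, y = -1 when x < 0.
points = [([1.0], 1), ([-1.0], -1)]
w = [0.0]
for _ in range(20):
    w = gradient_step(points, w)
print(w[0] > 0)  # weights move toward separating the classes
```

Caching matters here because the same `data` is scanned once per iteration; Spark keeps it in RAM instead of re-reading it from disk each time, which is the source of the performance gap on the next slide.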
  • 33. Fast: Using RAM, Operator Graphs In-memory Caching • Data Partitions read from RAM instead of disk Operator Graphs • Scheduling Optimizations • Fault Tolerance = RDD = cached partition join A: B: groupBy C: D: E: filter Stage 3 Stage 1 Stage 2 F: map
  • 34. Fast: Logistic Regression Performance Chart: running time (s) vs. number of iterations (1, 5, 10, 20, 30). Hadoop: 110 s / iteration; Spark: 80 s first iteration, 1 s further iterations
  • 35. Fast: Scales Down Seamlessly Chart: execution time (s) by % of working set in cache. Cache disabled: 68.8 s; 25%: 58.1 s; 50%: 40.7 s; 75%: 29.7 s; fully cached: 11.5 s
  • 36. Easy: Fault Recovery RDDs track lineage information that can be used to efficiently recompute lost data msgs = textFile.filter(lambda s: s.startswith("ERROR")) .map(lambda s: s.split("\t")[2]) HDFS File Filtered RDD Mapped filter RDD (func = startswith(…)) map (func = split(...))
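The lineage idea above can be sketched in a few lines of plain Python: a derived dataset records how it was built from its source, so a lost result can be recomputed by replaying the recorded steps instead of being replicated. The names here are hypothetical illustrations, not Spark internals.

```python
# Toy lineage record: a source plus an ordered list of transformations.
def build_lineage(source, *steps):
    """Each step is ("filter"|"map", fn); together they form the lineage."""
    return {"source": source, "steps": list(steps)}

def recompute(lineage):
    # Replay the lineage from the source to rebuild a lost result.
    data = list(lineage["source"])
    for kind, fn in lineage["steps"]:
        data = [fn(x) for x in data] if kind == "map" else [x for x in data if fn(x)]
    return data

log = ["INFO\tok\tboot", "ERROR\tdb\ttimeout", "ERROR\tnet\trefused"]
msgs = build_lineage(
    log,
    ("filter", lambda s: s.startswith("ERROR")),
    ("map", lambda s: s.split("\t")[2]),
)
# If the computed partition is lost, replay the lineage from the source:
print(recompute(msgs))  # ['timeout', 'refused']
```

In Spark the replay happens per lost partition, which is why fault recovery is cheap compared to replicating every intermediate dataset.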
  • 39. Working With RDDs RDD textFile = sc.textFile("SomeFile.txt")
  • 40. Working With RDDs RDD RDD Transformations textFile = sc.textFile("SomeFile.txt") linesWithSpark = textFile.filter(lambda line: "Spark" in line)
  • 41. Working With RDDs RDD RDD Transformations textFile = sc.textFile("SomeFile.txt") Action Value linesWithSpark = textFile.filter(lambda line: "Spark" in line) linesWithSpark.count() 74 linesWithSpark.first() # Apache Spark
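The transformation/action split can be sketched with Python generators: building the pipeline costs nothing, and work only happens when an action consumes it. A minimal sketch with made-up lines:

```python
# Hypothetical stand-in for the text file on the slides.
text_file = ["# Apache Spark", "Spark is fast", "unrelated line"]

def lines_with_spark():
    # Transformation: defining the generator runs no work yet (lazy),
    # just like textFile.filter(...) builds an RDD without computing it.
    return (line for line in text_file if "Spark" in line)

count = sum(1 for _ in lines_with_spark())  # action: forces evaluation
first = next(lines_with_spark())            # action: pulls one element
print(count, first)  # 2 # Apache Spark
```

As with RDDs, each action re-evaluates the pipeline from the source unless the result is explicitly cached, which is why `cache()` appears in the log-mining example that follows.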
  • 42. Example: Log Mining Load error messages from a log into memory, then interactively search for various patterns
  • 43. Load error messages from a log into memory, then interactively search for various patterns Worker Example: Log Mining Worker Worker Driver
  • 44. Load error messages from a log into memory, then interactively search for various patterns Worker Example: Log Mining Worker Worker Driver lines = spark.textFile("hdfs://...")
  • 45. Example: Log Mining Load error messages from a log into memory, then interactively search for various patterns lines = spark.textFile("hdfs://...") errors = lines.filter(lambda s: s.startswith("ERROR")) Worker Worker Worker Driver
  • 46. Example: Log Mining Load error messages from a log into memory, then interactively search for various patterns lines = spark.textFile("hdfs://...") errors = lines.filter(lambda s: s.startswith("ERROR")) Worker Worker Worker Driver
  • 47. Example: Log Mining Load error messages from a log into memory, then interactively search for various patterns lines = spark.textFile("hdfs://...") errors = lines.filter(lambda s: s.startswith("ERROR")) messages = errors.map(lambda s: s.split("\t")[2]) messages.cache() Worker Worker Worker Driver messages.filter(lambda s: "mysql" in s).count()
  • 48. Example: Log Mining Load error messages from a log into memory, then interactively search for various patterns lines = spark.textFile("hdfs://...") errors = lines.filter(lambda s: s.startswith("ERROR")) messages = errors.map(lambda s: s.split("\t")[2]) messages.cache() Worker Worker Worker Driver messages.filter(lambda s: "mysql" in s).count() Action
  • 49. Example: Log Mining Load error messages from a log into memory, then interactively search for various patterns lines = spark.textFile("hdfs://...") errors = lines.filter(lambda s: s.startswith("ERROR")) messages = errors.map(lambda s: s.split("\t")[2]) messages.cache() Worker Worker Worker Driver messages.filter(lambda s: "mysql" in s).count() Block 1 Block 2 Block 3
  • 50. Example: Log Mining Load error messages from a log into memory, then interactively search for various patterns lines = spark.textFile("hdfs://...") errors = lines.filter(lambda s: s.startswith("ERROR")) messages = errors.map(lambda s: s.split("\t")[2]) messages.cache() Worker Driver Worker Worker messages.filter(lambda s: "mysql" in s).count() Block 1 Block 2 Block 3 tasks tasks tasks
  • 51. lines = spark.textFile("hdfs://...") errors = lines.filter(lambda s: s.startswith("ERROR")) messages = errors.map(lambda s: s.split("\t")[2]) messages.cache() Worker Driver Worker Read HDFS Block Worker messages.filter(lambda s: "mysql" in s).count() Block 1 Block 2 Block 3 Read HDFS Block Read HDFS Block Example: Log Mining Load error messages from a log into memory, then interactively search for various patterns
  • 52. lines = spark.textFile("hdfs://...") errors = lines.filter(lambda s: s.startswith("ERROR")) messages = errors.map(lambda s: s.split("\t")[2]) messages.cache() Worker Driver Worker Process & Cache Data Worker messages.filter(lambda s: "mysql" in s).count() Block 1 Block 2 Block 3 Cache 1 Cache 2 Cache 3 Process & Cache Data Process & Cache Data Example: Log Mining Load error messages from a log into memory, then interactively search for various patterns
  • 53. Example: Log Mining Load error messages from a log into memory, then interactively search for various patterns lines = spark.textFile("hdfs://...") errors = lines.filter(lambda s: s.startswith("ERROR")) messages = errors.map(lambda s: s.split("\t")[2]) messages.cache() Worker Driver Worker Worker messages.filter(lambda s: "mysql" in s).count() Block 1 Block 2 Block 3 Cache 1 Cache 2 Cache 3 results results results
  • 54. Example: Log Mining Load error messages from a log into memory, then interactively search for various patterns lines = spark.textFile("hdfs://...") errors = lines.filter(lambda s: s.startswith("ERROR")) messages = errors.map(lambda s: s.split("\t")[2]) messages.cache() Worker Driver Worker Worker messages.filter(lambda s: "mysql" in s).count() Block 1 Block 2 Block 3 Cache 1 Cache 2 Cache 3 messages.filter(lambda s: "php" in s).count()
  • 55. Example: Log Mining Load error messages from a log into memory, then interactively search for various patterns lines = spark.textFile("hdfs://...") errors = lines.filter(lambda s: s.startswith("ERROR")) messages = errors.map(lambda s: s.split("\t")[2]) messages.cache() Worker Driver Worker Worker messages.filter(lambda s: "mysql" in s).count() Block 1 Block 2 Block 3 Cache 1 Cache 2 Cache 3 messages.filter(lambda s: "php" in s).count() tasks tasks tasks
  • 56. lines = spark.textFile("hdfs://...") errors = lines.filter(lambda s: s.startswith("ERROR")) messages = errors.map(lambda s: s.split("\t")[2]) messages.cache() Worker Worker Worker messages.filter(lambda s: "mysql" in s).count() Block 1 Block 2 Block 3 Cache 1 Cache 2 Cache 3 messages.filter(lambda s: "php" in s).count() Driver Process from Cache Process from Cache Process from Cache Example: Log Mining Load error messages from a log into memory, then interactively search for various patterns
  • 57. Example: Log Mining Load error messages from a log into memory, then interactively search for various patterns lines = spark.textFile("hdfs://...") errors = lines.filter(lambda s: s.startswith("ERROR")) messages = errors.map(lambda s: s.split("\t")[2]) messages.cache() Worker Worker Worker messages.filter(lambda s: "mysql" in s).count() Block 1 Block 2 Block 3 Cache 1 Cache 2 Cache 3 messages.filter(lambda s: "php" in s).count() Driver results results results
  • 58. Example: Log Mining Load error messages from a log into memory, then interactively search for various patterns lines = spark.textFile("hdfs://...") errors = lines.filter(lambda s: s.startswith("ERROR")) messages = errors.map(lambda s: s.split("\t")[2]) messages.cache() Worker Worker Worker messages.filter(lambda s: "mysql" in s).count() Block 1 Block 2 Block 3 Cache 1 Cache 2 Cache 3 messages.filter(lambda s: "php" in s).count() Driver Cache your data ➔ Faster Results Full-text search of Wikipedia • 60GB on 20 EC2 machines • 0.5 sec from cache vs. 20s for on-disk
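The whole interactive session above can be simulated locally in plain Python: parse the error messages once, hold them in memory as the "cache", then run several counts against that cached list. The tab-separated sample log lines are made up for illustration:

```python
# Hypothetical log lines in the "LEVEL\tTIME\tMESSAGE" format
# implied by split("\t")[2] on the slides.
log_lines = [
    "ERROR\t12:01\tmysql connection refused",
    "INFO\t12:02\tstartup complete",
    "ERROR\t12:03\tphp worker crashed",
    "ERROR\t12:04\tmysql timeout",
]

# First action pays the parsing cost once; the list is the "cache".
errors = [s for s in log_lines if s.startswith("ERROR")]
messages = [s.split("\t")[2] for s in errors]

# Subsequent "actions" only scan the in-memory messages.
mysql_hits = sum(1 for m in messages if "mysql" in m)
php_hits = sum(1 for m in messages if "php" in m)
print(mysql_hits, php_hits)  # 2 1
```

In Spark the same shape holds at cluster scale: the first `count()` reads HDFS blocks and populates the executor caches, and every later query runs against RAM, which is where the 0.5 s vs. 20 s Wikipedia numbers come from.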
  • 59. Cassandra + Spark: A Great Combination Both are Easy to Use Spark Can Help You Bridge Your Hadoop and Cassandra Systems Use Spark Libraries and Caching on top of Cassandra-stored Data Combine Spark Streaming with Cassandra Storage Datastax spark-cassandra-connector: https://github.com/datastax/spark-cassandra-connector
  • 60. Schema RDDs (Spark SQL) • Built-in Mechanism for recognizing Structured data in Spark • Allows systems to apply several data access and relational optimizations (e.g. predicate push-down, partition pruning, broadcast joins) • Columnar in-memory representation when cached • Native Support for structured formats like Parquet, JSON • Great Compatibility with the Rest of the Stack (Python, libraries, etc.)
  • 61. Thank You! Visit http://databricks.com: Blogs, Tutorials and more ! Questions?