This document provides an overview of Apache Spark and a hands-on workshop for using Spark. It begins with a brief history of how Spark grew out of Hadoop MapReduce's limitations around iterative workloads and keeping data in memory. Key Spark concepts are explained, including RDDs, transformations, actions, and Spark's execution model. The newer Spark SQL, DataFrame, and Dataset APIs are also introduced. The workshop agenda includes an overview of Spark followed by a hands-on exercise: ranking Colorado counties by gender ratio using census data, with both the RDD and DataFrame APIs.
3. Andy Grove
Co-Founder & Chief Architect
Co-Founder @ Orbware Technologies (acquired 2000)
Inventor of Firestorm/DAO
andy@agildata.com
• Providers of dbShards
• Relational Database Scaling
• Big Data Consulting
• Data Strategy
• Data Architecture Reviews
• Big Data Training
• Solution Implementation
• Distributed over 6 states!
• Headquartered in Broomfield, CO
www.agildata.com
Dan Lynn
CEO
Co-Founder @ FullContact
15 years building software
Techstars 2011
dan@agildata.com
4. AGENDA
• Part 1 - Overview of Spark
• Motivation, APIs, Ecosystem, Simple Example
• Part 2 - Hands On
• Work through a real data problem
6. A BRIEF HISTORY LESSON
• First there was Hadoop
• Goal: Process petabytes of constantly-growing data
• “Move the processing to the data”
• But MapReduce was difficult to program
• So they made Pig, Hive, Cascading, etc…
7. A BRIEF HISTORY LESSON
• MapReduce was also very reliable
• But it performed poorly on iterative tasks like machine learning.
• So in 2009, UC Berkeley's AMPLab started on a new approach
• The key idea: keep data in memory as much as possible.
8. A BRIEF HISTORY LESSON
• They called it “Spark”
• After lots of community acceptance, it became an Apache project in 2013.
• Since then, it has gained mainstream acceptance.
• “Potentially the Most Significant Open Source Project of the Next Decade” - IBM, June 15, 2015
9. A BRIEF HISTORY LESSON
• Huge ecosystem
• Machine learning: MLlib, Mahout
• Graph processing: GraphX
• Read from / write to anything that Hadoop can
• Tons of community contributions: spark-packages.org
• Zeppelin: interactive notebooks (IPython-style)
11. CONCEPTS - RDD
RDD aka “Resilient Distributed Dataset”
your_data          <— an RDD
f(your_data)       <— also an RDD
g(f(your_data))    <— so is this
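The chain above can be pictured without any Spark at all. Below is a minimal, purely illustrative sketch (the `ToyRDD` class is made up for this example): each "RDD" just remembers its parent and the function to apply, and nothing runs until the data is actually requested, which mirrors how Spark RDDs record lineage lazily.

```python
class ToyRDD:
    """A toy stand-in for an RDD: data OR (parent + function)."""

    def __init__(self, data=None, parent=None, fn=None):
        self.data, self.parent, self.fn = data, parent, fn

    def map(self, fn):
        # Returns a *new* ToyRDD; no work happens yet (lazy, like Spark).
        return ToyRDD(parent=self, fn=fn)

    def collect(self):
        # An "action": walk the lineage back to the source data.
        if self.parent is None:
            return list(self.data)
        return [self.fn(x) for x in self.parent.collect()]

your_data = ToyRDD(data=[1, 2, 3])
f_of_data = your_data.map(lambda x: x * 10)   # an RDD
g_of_f    = f_of_data.map(lambda x: x + 1)    # also an RDD

print(g_of_f.collect())  # [11, 21, 31]
```

Because each step only records its parent, a lost partition can in principle be recomputed from the lineage, which is where the "Resilient" in RDD comes from.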
12. RDD - SECRET INTERNALS!!!11
/**
* Tells the Spark framework *where* the data is.
*/
protected Partition[] getPartitions();
/**
* Iterates through the data for a given partition.
*/
Iterator<T> compute(Partition split, TaskContext context);
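The two methods above are the whole contract. As a rough, Spark-free sketch (the class and slicing scheme here are invented for illustration), a "partition" is just a description of where a slice of the data lives, and `compute` iterates one such slice:

```python
class ToyPartitionedRDD:
    """Toy analogue of the two RDD internals shown above."""

    def __init__(self, data, num_partitions=2):
        self.data = list(data)
        self.num_partitions = num_partitions

    def get_partitions(self):
        # "Where the data is": here, index ranges into one local list.
        n = self.num_partitions
        size = (len(self.data) + n - 1) // n
        return [range(i * size, min((i + 1) * size, len(self.data)))
                for i in range(n)]

    def compute(self, partition):
        # Iterate through the data for a given partition.
        return (self.data[i] for i in partition)

rdd = ToyPartitionedRDD([10, 20, 30, 40, 50], num_partitions=2)
all_values = [x for p in rdd.get_partitions() for x in rdd.compute(p)]
print(all_values)  # [10, 20, 30, 40, 50]
```

In real Spark the partitions live on different machines and `compute` runs inside tasks on the cluster; the shape of the contract is the same.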
13. RDD - PUBLIC API
Two options:
• Transformations
• Make new RDDs by applying transformation functions.
• Actions
• Write to HDFS, write to databases, yield an answer, etc…
14. RDD - PUBLIC API
• Transformations
• .map(func) .filter(func) .flatMap(func) .reduceByKey(func)
• Actions
• .collect() .reduce(func) .saveAsTextFile(path) .sample(…) .take(n)
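The transformation/action split can be felt in plain Python, since the built-in `map` and `filter` are also lazy iterators. This is only an analogy, not Spark's API:

```python
from functools import reduce

data = [1, 2, 3, 4, 5]

mapped   = map(lambda x: x * 2, data)       # like rdd.map(func)
filtered = filter(lambda x: x > 4, mapped)  # like rdd.filter(func)
# Nothing has run yet: both are lazy, as RDD transformations are.

result = list(filtered)                     # forcing it, like rdd.collect()
print(result)  # [6, 8, 10]

total = reduce(lambda a, b: a + b, data)    # like rdd.reduce(func), an action
print(total)  # 15

flat = [y for x in [[1, 2], [3]] for y in x]  # like rdd.flatMap(func)
print(flat)  # [1, 2, 3]
```

The key point carries over: transformations describe a pipeline; only an action makes the pipeline actually execute.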
19. SPARK SQL / DATAFRAME API
• New in Spark 1.3. The core engine behind Spark SQL
• If RDD operations apply to JVM objects…
• Schema (i.e. the class) travels with each datum
• Serialization pain. GC pain.
• …then DataFrame operations apply to the data itself
• Schema is defined once for the entire set
• Data is transmitted independently of its schema; JVM data access incurs much less GC overhead
• DataFrames have more optimized execution logic. i.e. a query planner
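The per-datum vs. per-set schema distinction can be sketched with made-up data (the `Person` class and the rows below are hypothetical, and this is plain Python, not Spark):

```python
# RDD-style: every element is a full object, so the "schema" (the
# class) effectively travels with each datum.
class Person:
    def __init__(self, name, age):
        self.name, self.age = name, age

rdd_style = [Person("Ada", 36), Person("Grace", 45)]

# DataFrame-style: the schema is stated once for the whole set, and
# rows are plain, compact tuples.
schema = ("name", "age")
df_style = [("Ada", 36), ("Grace", 45)]

# The same query is expressible either way, but the second layout lets
# an engine reason about columns without touching per-object metadata.
age_col = schema.index("age")
ages = [row[age_col] for row in df_style]
print(ages)  # [36, 45]
```

Knowing the schema up front is also what makes a query planner possible: the engine can reorder and optimize operations because it knows the column types before it ever looks at a row.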
20. DATASET API
• New in Spark 1.6
• Addressed specific deficiencies in DataFrames
• DataFrames lack compile-time type-checking.
• Datasets look like RDDs, but perform like DataFrames
21. SPARK API CHOICES
            Java                     Scala
RDD
DataFrame   sketchy…
Spark SQL
Dataset     exciting, but very new   exciting, but very new
24. PART 2: HANDS ON
• The problem: Rank Colorado counties by gender ratio.
• The data: US census data from 2010
• The approach:
• RDD API (in both Java 8 and Scala)
• DataFrame API / Spark SQL
• Dataset API
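The shape of the computation can be previewed in plain Python before reaching for Spark. The county rows below are made up for illustration (the real exercise uses the 2010 US census files), and the column names are hypothetical:

```python
rows = [
    {"county": "Denver",  "male": 300_000, "female": 310_000},
    {"county": "Boulder", "male": 150_000, "female": 145_000},
    {"county": "Larimer", "male": 160_000, "female": 162_000},
]

# Gender ratio here means males per female; rank counties by it.
ranked = sorted(rows,
                key=lambda r: r["male"] / r["female"],
                reverse=True)

for r in ranked:
    print(r["county"], round(r["male"] / r["female"], 3))
```

In the workshop, the same logic becomes a `map`/`sortBy` over an RDD, or a `groupBy`/`orderBy` in the DataFrame API; the ranking itself is identical.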