This document provides an overview of Apache Spark and a hands-on workshop for using Spark. It begins with a brief history of how Spark grew out of Hadoop MapReduce's limitations around iterative workloads and keeping data in memory. Key Spark concepts are explained, including RDDs, transformations, actions, and Spark's execution model. The newer Spark SQL, DataFrame, and Dataset APIs are also introduced. The workshop agenda includes an overview of Spark followed by a hands-on exercise: ranking Colorado counties by gender ratio using census data, with both the RDD and DataFrame APIs.
3. Andy Grove
Co-Founder & Chief Architect
Co-Founder @ Orbware Technologies (acquired 2000)
Inventor of Firestorm/DAO
andy@agildata.com
• Providers of dbShards
• Relational Database Scaling
• Big Data Consulting
• Data Strategy
• Data Architecture Reviews
• Big Data Training
• Solution Implementation
• Distributed over 6 states!
• Headquartered in Broomfield, CO
www.agildata.com
Dan Lynn
CEO
Co-Founder @ FullContact
15 years building software
Techstars 2011
dan@agildata.com
4. AGENDA
• Part 1 - Overview of Spark
• Motivation, APIs, Ecosystem, Simple Example
• Part 2 - Hands On
• Work through a real data problem
6. A BRIEF HISTORY LESSON
• First there was Hadoop
• Goal: Process petabytes of constantly-growing data
• “Move the processing to the data”
• But MapReduce was difficult to program
• So they made Pig, Hive, Cascading, etc…
7. A BRIEF HISTORY LESSON
• MapReduce was also very reliable
• But it performed poorly on iterative tasks like machine learning.
• So in 2009, UC Berkeley's AMPLab started on a new approach
• The key idea: keep data in memory as much as possible.
8. A BRIEF HISTORY LESSON
• They called it “Spark”
• After lots of community acceptance, it became an Apache project in 2013.
• Since then, it has gained mainstream acceptance.
• “Potentially the Most Significant Open Source Project of the Next Decade” - IBM, June 15, 2015
9. A BRIEF HISTORY LESSON
• Huge ecosystem
• Machine learning: MLlib, Mahout
• Graph processing: GraphX
• Read from / write to anything that Hadoop can
• Tons of community contributions: spark-packages.org
• Zeppelin: interactive notebooks (IPython-style)
11. CONCEPTS - RDD
RDD aka “Resilient Distributed Dataset”
your_data          <— an RDD
f(your_data)       <— also an RDD
g(f(your_data))    <— so is this
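The chain above can be pictured without any Spark at all. Below is a minimal, purely illustrative sketch (the `ToyRDD` class is made up for this example): each "RDD" just remembers its parent and the function to apply, and nothing runs until the data is actually requested, which mirrors how Spark RDDs record lineage lazily.

```python
class ToyRDD:
    """A toy stand-in for an RDD: data OR (parent + function)."""

    def __init__(self, data=None, parent=None, fn=None):
        self.data, self.parent, self.fn = data, parent, fn

    def map(self, fn):
        # Returns a *new* ToyRDD; no work happens yet (lazy, like Spark).
        return ToyRDD(parent=self, fn=fn)

    def collect(self):
        # An "action": walk the lineage back to the source data.
        if self.parent is None:
            return list(self.data)
        return [self.fn(x) for x in self.parent.collect()]

your_data = ToyRDD(data=[1, 2, 3])
f_of_data = your_data.map(lambda x: x * 10)   # an RDD
g_of_f    = f_of_data.map(lambda x: x + 1)    # also an RDD

print(g_of_f.collect())  # [11, 21, 31]
```

Because each step only records its parent, a lost partition can in principle be recomputed from the lineage, which is where the "Resilient" in RDD comes from.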
12. RDD - SECRET INTERNALS!!!11
/**
* Tells the Spark framework *where* the data is.
*/
protected Partition[] getPartitions();
/**
* Iterates through the data for a given partition.
*/
Iterator<T> compute(Partition split, TaskContext context);
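The two methods above are the whole contract. As a rough, Spark-free sketch (the class and slicing scheme here are invented for illustration), a "partition" is just a description of where a slice of the data lives, and `compute` iterates one such slice:

```python
class ToyPartitionedRDD:
    """Toy analogue of the two RDD internals shown above."""

    def __init__(self, data, num_partitions=2):
        self.data = list(data)
        self.num_partitions = num_partitions

    def get_partitions(self):
        # "Where the data is": here, index ranges into one local list.
        n = self.num_partitions
        size = (len(self.data) + n - 1) // n
        return [range(i * size, min((i + 1) * size, len(self.data)))
                for i in range(n)]

    def compute(self, partition):
        # Iterate through the data for a given partition.
        return (self.data[i] for i in partition)

rdd = ToyPartitionedRDD([10, 20, 30, 40, 50], num_partitions=2)
all_values = [x for p in rdd.get_partitions() for x in rdd.compute(p)]
print(all_values)  # [10, 20, 30, 40, 50]
```

In real Spark the partitions live on different machines and `compute` runs inside tasks on the cluster; the shape of the contract is the same.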
13. RDD - PUBLIC API
Two options:
• Transformations
• Make new RDDs by applying transformation functions.
• Actions
• Write to HDFS, write to databases, yield an answer, etc…
14. RDD - PUBLIC API
• Transformations
• .map(func) .filter(func) .flatMap(func) .reduceByKey(func)
• Actions
• .collect() .reduce(func) .saveAsTextFile(path) .sample(…) .take(n)
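The transformation/action split can be felt in plain Python, since the built-in `map` and `filter` are also lazy iterators. This is only an analogy, not Spark's API:

```python
from functools import reduce

data = [1, 2, 3, 4, 5]

mapped   = map(lambda x: x * 2, data)       # like rdd.map(func)
filtered = filter(lambda x: x > 4, mapped)  # like rdd.filter(func)
# Nothing has run yet: both are lazy, as RDD transformations are.

result = list(filtered)                     # forcing it, like rdd.collect()
print(result)  # [6, 8, 10]

total = reduce(lambda a, b: a + b, data)    # like rdd.reduce(func), an action
print(total)  # 15

flat = [y for x in [[1, 2], [3]] for y in x]  # like rdd.flatMap(func)
print(flat)  # [1, 2, 3]
```

The key point carries over: transformations describe a pipeline; only an action makes the pipeline actually execute.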
19. SPARK SQL / DATAFRAME API
• New in Spark 1.3. The core engine behind Spark SQL
• If RDD operations apply to JVM objects…
• Schema (i.e. the class) travels with each datum
• Serialization pain. GC pain.
• …then DataFrame operations apply to the data itself
• Schema is defined once for the entire set
• Data is transmitted independently of its schema; JVM data access incurs much less GC overhead
• DataFrames have more optimized execution logic. i.e. a query planner
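The per-datum vs. per-set schema distinction can be sketched with made-up data (the `Person` class and the rows below are hypothetical, and this is plain Python, not Spark):

```python
# RDD-style: every element is a full object, so the "schema" (the
# class) effectively travels with each datum.
class Person:
    def __init__(self, name, age):
        self.name, self.age = name, age

rdd_style = [Person("Ada", 36), Person("Grace", 45)]

# DataFrame-style: the schema is stated once for the whole set, and
# rows are plain, compact tuples.
schema = ("name", "age")
df_style = [("Ada", 36), ("Grace", 45)]

# The same query is expressible either way, but the second layout lets
# an engine reason about columns without touching per-object metadata.
age_col = schema.index("age")
ages = [row[age_col] for row in df_style]
print(ages)  # [36, 45]
```

Knowing the schema up front is also what makes a query planner possible: the engine can reorder and optimize operations because it knows the column types before it ever looks at a row.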
20. DATASET API
• New in Spark 1.6
• Addressed specific deficiencies in DataFrames
• DataFrames lack compile-time type-checking.
• Datasets look like RDDs, but perform like DataFrames
21. SPARK API CHOICES
            Java                     Scala
RDD
DataFrame   sketchy…
Spark SQL
Dataset     exciting, but very new   exciting, but very new
24. PART 2: HANDS ON
• The problem: Rank Colorado counties by gender ratio.
• The data: US census data from 2010
• The approach:
• RDD API (in both Java 8 and Scala)
• DataFrame API / Spark SQL
• Dataset API
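The shape of the computation can be previewed in plain Python before reaching for Spark. The county rows below are made up for illustration (the real exercise uses the 2010 US census files), and the column names are hypothetical:

```python
rows = [
    {"county": "Denver",  "male": 300_000, "female": 310_000},
    {"county": "Boulder", "male": 150_000, "female": 145_000},
    {"county": "Larimer", "male": 160_000, "female": 162_000},
]

# Gender ratio here means males per female; rank counties by it.
ranked = sorted(rows,
                key=lambda r: r["male"] / r["female"],
                reverse=True)

for r in ranked:
    print(r["county"], round(r["male"] / r["female"], 3))
```

In the workshop, the same logic becomes a `map`/`sortBy` over an RDD, or a `groupBy`/`orderBy` in the DataFrame API; the ranking itself is identical.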