Spark For Plain Old Java Geeks (June2014 Meetup)

© Copyright 2013 Pivotal. All rights reserved.
Intro to Apache Spark
A primer for POJGs
(Plain Old Java Geeks)
Scott Deeg: Sr. Field Engineer
sdeeg@gopivotal.com

Agenda
•  Intro: Agenda, it’s all about ME!, 10 seconds on Pivotal
•  What is Spark, and what does it have to do with Big Data/Hadoop?
–  Ecosystem (Shark, Streaming, MLlib, GraphX)
•  Spark Programming Model
–  Demo: interactive shell
•  Related Projects
•  Spark 1.0
•  More Tech: WordCount, TicTacToe (dev experience), Java 8
•  Deployment Topologies
–  Simple Cluster Demo

Who Am I?
Just a Plain Old Java Guy
•  Java since 1996, Symantec Visual Café 1.0
•  Random consulting around Silicon Valley
•  Hacker on a Java-based BPM product for 10 years
•  Joined VMware in 2009 when it acquired SpringSource
•  Rolled into Pivotal April 1, 2013

What is Pivotal?
•  Cloud, Big Data, Fast Data, Modern Apps
•  Technology Bets
–  HDFS will be the way we talk to Enterprise data repositories
▪  Consolidate Silos in a “Data Lake”
▪  An ecosystem of services will arise to utilize HDFS data
–  PaaS will manage the Application Life Cycle
–  OSS will be the basis for solutions
–  Cloud Architecture
▪  Distributed / Parallel
▪  CPU, Memory, Network … storage is a distributed service

[Diagram: Pivotal Platform. Labels: Data Sources, Application Platform, Stream Server, IMDG, ASF Services, MPP SQL, HDFS, SQL, Objects, JSON, GemFireXD, ...etc., End Users, Developers, AppOps]

What Is Spark?
Hint: It’s all about the RDD

?
•  Is it “Big Data”?
•  Is it “Hadoop”?
•  It’s one of those “in memory” things, right?
•  JVM, Java, Scala?
•  Is it real, or just another shiny technology with a long but ultimately small tail?

Spark is …
•  A distributed/cluster compute execution engine
–  Came out of an AMPLab project at UCB, now an ASF top-level project
•  Designed to work with data in memory
•  Scalability and fault tolerance similar to Hadoop Map/Reduce
–  Uses lineage to reconstitute data instead of replication
•  A generalization of Map/Reduce
–  Implementation of the Resilient Distributed Dataset (RDD)
•  Programmatic or interactive
•  Written in Scala

Spark is also …
•  An ASF top-level project
•  Backed by ~100 contributors across 25 companies
–  More active than Hadoop MapReduce
•  An ecosystem of domain-specific tools
–  Different models, but mostly interoperable
•  Hadoop compatible

Berkeley Data Analytics Stack (BDAS)
Supports
•  Batch
•  Streaming
•  Interactive
Makes it easy to compose them

Short History
•  2009: Started as a research project at UCB
•  2010: Open sourced
•  January 2011: AMPLab created
•  October 2012: Release 0.6
–  Java, standalone cluster, Maven
•  June 21, 2013: Spark accepted into the ASF Incubator
•  Feb 27, 2014: Spark becomes a top-level ASF project
•  May 30, 2014: Spark 1.0

Spark Philosophy
•  Make life easy and productive for Data Scientists
•  Provide well-documented and expressive APIs
•  Powerful domain-specific libraries
•  Easy integration with storage systems
•  Caching to avoid data movement (performance)
•  Well-defined releases, stable API

Spark is not Hadoop, but is compatible
•  Often better than Hadoop (Eric Baldeschwieler)
–  M/R is fine for “Data Parallel”, but awkward for some workloads
–  Low-latency dispatch, iterative, streaming
•  Natively accesses Hadoop data
•  Spark is just another YARN job
–  Preserves the huge investment in data collection
–  Brings Spark to the data
•  It’s not OR … it’s AND!

Improvements over Map/Reduce
•  Efficiency
–  General execution graphs (not just map->reduce->store)
–  In memory
•  Usability
–  Rich APIs in Scala, Java, Python
–  Interactive
•  Can Spark be the R for Big Data?

Spark Programming Model
RDDs in Detail

Core Concept
Think of a program as a set of transformations on a Distributed Dataset
Model: Resilient Distributed Dataset (RDD)
–  Read-only collection of objects spread across a cluster
–  RDDs are built through parallel transformations (map, filter, etc.)
–  Automatically rebuilt on failure using lineage
–  Controllable persistence (RAM, HDFS, etc.)

Operations
•  Create
–  From stable storage (HDFS)
•  Transform
–  Generate an RDD from another RDD (map, filter, groupBy)
–  Lazy operations that build a DAG
–  Once Spark knows your transformations, it can build an efficient plan
•  Action
–  Return a result or write to storage (count, collect, reduce, save)
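
The split between lazy transformations and eager actions can be sketched without a cluster. What follows is a plain-Java analogy built on `java.util.stream`, not Spark's API: intermediate stream operations stand in for transformations, and the terminal `collect` stands in for an action. The class name, the sample lines, and the `mapCalls` counter are assumptions made for illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.stream.Collectors;
import java.util.stream.Stream;

/** Analogy only: java.util.stream laziness standing in for RDD laziness. */
class LazyPipelineSketch {
    // Counts how many times the mapping function actually runs.
    static final AtomicInteger mapCalls = new AtomicInteger();

    // "Create" + "Transform": this only describes the work; nothing executes yet.
    static Stream<Integer> buildPipeline(List<String> lines) {
        return lines.stream()
                .filter(line -> line.contains("error"))
                .map(line -> {
                    mapCalls.incrementAndGet();
                    return line.length();
                });
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("error: disk", "info: ok", "error: net");
        Stream<Integer> pipeline = buildPipeline(lines);
        System.out.println("map calls before action: " + mapCalls.get()); // 0

        // "Action": the terminal operation triggers the whole pipeline.
        List<Integer> lengths = pipeline.collect(Collectors.toList());
        System.out.println("map calls after action: " + mapCalls.get());  // 2
        System.out.println(lengths);                                        // [11, 10]
    }
}
```

One difference worth keeping in mind: a Java stream can be consumed only once, whereas an RDD can be reused across several actions.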

Demo: Log Mining
•  Scala shell
•  Load file from HDFS
•  Search for patterns

Transformations and Actions
•  Transformations
–  map
–  filter
–  flatMap
–  sample
–  groupByKey
–  reduceByKey
–  union
–  join
–  sort
•  Actions
–  count
–  collect
–  reduce
–  lookup
–  save

RDD Fault Tolerance
•  RDDs maintain lineage information that can be used to reconstruct lost partitions
val cachedMsgs = sc.textFile(...).filter(_.contains("error"))
                   .map(_.split('\t')(2))
                   .cache()
Lineage: HdfsRDD (path: hdfs://…) -> FilteredRDD (func: contains(...)) -> MappedRDD (func: split(…)) -> CachedRDD
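
To make the lineage idea concrete, here is a toy single-JVM model in plain Java. It is a sketch under stated assumptions, not Spark's implementation: the `Node` class, the `persist` flag, and the tab-separated sample lines are all hypothetical. Each node remembers its parent and the function that derives it, so a lost in-memory result can be rebuilt by replaying the chain instead of restoring a replica. Caching is folded into the last node rather than modeled as a separate CachedRDD.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;
import java.util.stream.Collectors;

/** Toy model of lineage-based recovery; not how Spark is implemented. */
class LineageSketch {

    /** A dataset that remembers its parent and the function that derives it. */
    static class Node {
        final String name;
        final Node parent; // null for the source dataset
        final Function<List<String>, List<String>> derive;
        final boolean persist; // keep the computed result in memory?
        List<String> cached;   // null when never computed or lost
        int computeCount = 0;

        Node(String name, Node parent,
             Function<List<String>, List<String>> derive, boolean persist) {
            this.name = name;
            this.parent = parent;
            this.derive = derive;
            this.persist = persist;
        }

        /** Returns the data, recomputing from the parent chain if nothing is cached. */
        List<String> data() {
            if (cached != null) {
                return cached;
            }
            List<String> input = (parent == null) ? null : parent.data();
            List<String> result = derive.apply(input);
            computeCount++;
            if (persist) {
                cached = result;
            }
            return result;
        }

        /** Simulates losing the in-memory copy, for example when a worker dies. */
        void lose() {
            cached = null;
        }

        String lineage() {
            return (parent == null) ? name : parent.lineage() + " -> " + name;
        }
    }

    /** Builds the chain from the slide: source, filter on "error", third tab-separated field. */
    static Node build(List<String> sourceLines) {
        Node source = new Node("HdfsRDD", null, unused -> sourceLines, false);
        Node filtered = new Node("FilteredRDD", source,
                in -> in.stream().filter(l -> l.contains("error")).collect(Collectors.toList()),
                false);
        return new Node("MappedRDD", filtered,
                in -> in.stream().map(l -> l.split("\t")[2]).collect(Collectors.toList()),
                true);
    }

    public static void main(String[] args) {
        Node msgs = build(Arrays.asList(
                "error\tdb\ttimeout", "info\tweb\tok", "error\tweb\trefused"));
        System.out.println(msgs.lineage()); // HdfsRDD -> FilteredRDD -> MappedRDD
        System.out.println(msgs.data());    // [timeout, refused], now cached
        msgs.lose();                        // drop the cached copy
        System.out.println(msgs.data());    // [timeout, refused], rebuilt from lineage
    }
}
```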

RDDs are Foundational
•  General purpose enough to implement other programming models
–  SQL
–  Graph
–  ML
–  MR

Related Projects
Things that run on Spark

Related Projects
•  Shark
•  Spark SQL
•  Spark Streaming
•  GraphX
•  MLbase
•  Others

Shark
•  Hive on Spark
–  HiveQL, UDFs, etc.
•  Turns SQL into RDDs
–  Part of the lineage
•  Based on Hive, but takes advantage of Spark for
–  Fast scheduling
–  Queries are DAGs of jobs, not chained M/R
–  Fast broadcast variables
© Apache Software Foundation

Shark (cont)
•  Optimized columnar storage format
•  Fast/efficient compression
–  From Yahoo!
–  Able to hold 3-20x more data in the same cluster
•  Various other optimizations using partitioning
•  Will ultimately run on Spark SQL
–  No Hive dependencies except for accessing the Hive datastore
–  Long-running process with management tools

Spark SQL
•  Library in Spark Core to treat RDDs as relations
–  SchemaRDD
•  Lighter-weight version of Shark
–  No code from Hive
•  Import/export in different storage formats
–  Parquet, learn schema from an existing Hive warehouse
•  Takes columnar storage from Shark

Spark SQL Code
•  Go take a look

Spark Streaming
•  Extends Spark to do large-scale stream processing
–  100s of nodes and second-scale end-to-end latency
•  Stateful processing
–  Hard to make fault tolerant (FT)
–  Storm: requires idempotent updates
•  Simple, batch-like API with RDDs
•  Single semantics for both real-time and high-latency processing

Streaming (cont)
•  Input is broken up into batches that become RDDs
•  RDDs are composed into DAGs to generate output
•  Raw data is replicated in memory for FT

Streaming (cont)
•  Other features
–  Window-based transformations
–  Arbitrary join of streams
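
The batching and windowing ideas from the Streaming slides can be sketched in plain Java. This is a minimal illustration that assumes non-negative event timestamps in milliseconds and fixed-length batches starting at time 0. The class and method names are hypothetical and this is not the Spark Streaming API. `toBatches` cuts the input into consecutive batches (each of which would become an RDD), and `windowedCounts` applies a sliding window over the most recent batches.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.TreeMap;

/** Toy model of micro-batching and window-based counting; not Spark Streaming. */
class MicroBatchSketch {

    /** Splits event timestamps (ms, non-negative) into consecutive fixed-length batches. */
    static List<List<Long>> toBatches(List<Long> timestampsMs, long batchMs) {
        TreeMap<Long, List<Long>> byBatch = new TreeMap<>();
        for (long ts : timestampsMs) {
            byBatch.computeIfAbsent(ts / batchMs, k -> new ArrayList<>()).add(ts);
        }
        List<List<Long>> batches = new ArrayList<>();
        if (byBatch.isEmpty()) {
            return batches;
        }
        // Emit empty batches too, so every interval is represented.
        for (long i = 0; i <= byBatch.lastKey(); i++) {
            batches.add(byBatch.getOrDefault(i, new ArrayList<>()));
        }
        return batches;
    }

    /** For each batch, counts events in it plus the previous (windowBatches - 1) batches. */
    static List<Integer> windowedCounts(List<List<Long>> batches, int windowBatches) {
        List<Integer> out = new ArrayList<>();
        for (int i = 0; i < batches.size(); i++) {
            int sum = 0;
            for (int j = Math.max(0, i - windowBatches + 1); j <= i; j++) {
                sum += batches.get(j).size();
            }
            out.add(sum);
        }
        return out;
    }

    public static void main(String[] args) {
        List<Long> events = Arrays.asList(100L, 900L, 1200L, 3100L, 3200L, 3900L);
        List<List<Long>> batches = toBatches(events, 1000L);
        System.out.println(batches);                    // [[100, 900], [1200], [], [3100, 3200, 3900]]
        System.out.println(windowedCounts(batches, 2)); // [2, 3, 1, 3]
    }
}
```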

GraphX (Alpha)
•  Graph processing
–  Replaces Spark Bagel
•  Graph parallel, not data parallel
–  Reason in the context of neighbors
–  GraphLab API

GraphX (cont)
•  Predicting things about people (e.g. political bias)
–  Look at posts, apply a classifier, try to predict an attribute
–  Local signal alone is difficult to use
–  Look at the context of the social network to improve the prediction
•  Triangle processing
–  More triangles indicate a stronger community
•  Collaborative Filtering
–  Bipartite graph processing
–  What I like, who rated those things, what they like => what I may like
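
The "what I like, who rated those things, what they like" walk over a bipartite user/item graph can be sketched in a few lines of plain Java. This is an illustrative single-machine version with a deliberately naive scoring rule (one point per like-minded user who likes an item); it is not GraphX or MLlib, and the class name, method name, and sample data are assumptions.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

/** Naive two-hop recommendation on a bipartite user -> items graph. */
class BipartiteRecommendSketch {

    /** Scores items I have not liked by how many like-minded users like them. */
    static Map<String, Integer> recommend(Map<String, Set<String>> likes, String me) {
        // Hop 0: what I like.
        Set<String> mine = likes.getOrDefault(me, Collections.emptySet());
        Map<String, Integer> scores = new TreeMap<>();
        for (Map.Entry<String, Set<String>> other : likes.entrySet()) {
            if (other.getKey().equals(me)) {
                continue;
            }
            // Hop 1: who else rated the things I like?
            if (Collections.disjoint(mine, other.getValue())) {
                continue;
            }
            // Hop 2: what else do they like? Those are things I may like.
            for (String item : other.getValue()) {
                if (!mine.contains(item)) {
                    scores.merge(item, 1, Integer::sum);
                }
            }
        }
        return scores;
    }

    public static void main(String[] args) {
        Map<String, Set<String>> likes = new HashMap<>();
        likes.put("alice", new HashSet<>(Arrays.asList("A", "B")));
        likes.put("bob", new HashSet<>(Arrays.asList("B", "C")));
        likes.put("carol", new HashSet<>(Arrays.asList("C", "D")));
        likes.put("dave", new HashSet<>(Arrays.asList("A", "C", "E")));
        System.out.println(recommend(likes, "alice")); // {C=2, E=1}
    }
}
```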

GraphX (cont)
•  Graph Creation => Algorithm => Post Processing
–  Existing systems mainly deal with the Algorithm step and are not interactive
–  Unify collection and graph models
•  Graphs have
–  Vertices, edges
–  Transformations: reverse, filter, map
–  Joins: graphs and tables
–  Aggregate neighbors

MLbase
•  Machine Learning toolset
–  Library and higher-level abstractions
•  The general tool is MATLAB
–  Difficult for end users to learn, debug, and scale solutions
•  Starting with MLlib
–  Low-level distributed Machine Learning library
•  Many different algorithms
–  Classification, Regression, Collaborative Filtering, etc.

Others
•  Mesos
–  Enables multiple frameworks to share the same cluster resources
–  Twitter is the largest user: over 6,000 servers
•  Tachyon
–  In-memory, fault-tolerant file system that exposes HDFS
•  Catalyst
–  SQL query optimizer

Spark 1.0

Release cycle
•  1.0 came out at the end of May
•  1.X expected to be current for several years
•  Quarterly release cycle
–  2 months dev / 1 month QA
–  Actual release is based on a vote
•  1.1 due at the end of August

1.0
•  API stability in 1.X for all non-Alpha projects
–  Can recompile jobs, but hoping for binary compatibility
–  Internal APIs are marked @DeveloperApi or @Experimental
•  Focus: Core Engine, Streaming, MLlib, Spark SQL
•  History Server for Spark UI
–  Driving development of instrumentation
•  Job Submission Tool
–  Don’t configure the Context in code (e.g. master)

1.0
•  Java 8 Lambdas
–  No more writing closures as classes
–  Functions are interfaces
–  Return-type-sensitive functions
▪  mapToPair
•  Python improvements

1.0
•  Hadoop security
–  Kerberos, ACL for UI
•  Job cancel from UI
•  Distributed GC as things go out of scope
–  Good for long-lived services
•  Spark SQL

More Code and Demos
WordCount, TicTacToe, Java 8

Code Review: WordCount
•  Java API
•  Java Code
•  More usage of RDDs

TicTacToe: a developer’s experience
•  IDE
•  Spring
•  Building/Logging
•  Debugging

Demo: Java 8
Lambda Lambda Lambda

Deployment Topologies

Topologies
•  Local
•  Spark Cluster (master/slaves)
•  Cluster Resource Managers
–  YARN
–  Mesos
•  (PaaS?)

Demo:
•  Start master and slaves
•  Show the UI
•  Run a Job
•  Talk about the History Server

This
And That

How Real is Spark?
•  There is some criticism
–  As expected
–  New project!
•  There are many indicators that Spark is heading for success
–  Solid technology
–  Good buzz
–  Significant community

Next Steps
•  Spark website: http://spark.apache.org
–  Lots’O’Goodstuff
•  Spark Summit June 30/July 01
–  http://spark-summit.org

A NEW PLATFORM FOR A NEW ERA
