© Copyright 2013 Pivotal. All rights reserved.
Intro to Apache Spark
A primer for POJGs
(Plain Old Java Geeks)
Scott Deeg: Sr. Field Engineer
sdeeg@gopivotal.com
Agenda
•  Intro: Agenda, it’s all about ME!, 10 seconds on Pivotal
•  What is Spark, and what does it have to do with Big Data/Hadoop?
    –  Ecosystem (Shark, Streaming, MLlib, GraphX)
•  Spark Programming Model
    –  Demo: interactive shell
•  Related Projects
•  Spark 1.0
•  More Tech: WordCount, TicTacToe (dev experience), Java 8
•  Deployment Topologies
    –  Simple Cluster Demo
Who Am I?
Just a Plain Old Java Guy
•  Java since 1996, Symantec Visual Café 1.0
•  Random consulting around Silicon Valley
•  Hacker on a Java-based BPM product for 10 years
•  Joined VMware in 2009 when they acquired SpringSource
•  Rolled into Pivotal April 1, 2013
What is Pivotal?
•  Cloud, Big Data, Fast Data, Modern Apps
•  Technology Bets
    –  HDFS will be the way we talk to Enterprise data repositories
        ▪  Consolidate Silos in “Data Lake”
        ▪  Ecosystem of services will arise to utilize HDFS data
    –  PaaS will manage the Application Life Cycle
    –  OSS will be the basis for solutions
    –  Cloud Architecture
        ▪  Distributed / Parallel
        ▪  CPU, Memory, Network … storage is a distributed service
[Figure: Pivotal Platform diagram. Labels: Data Sources; Application Platform; Stream Server; IMDG; ASF Services; MPP SQL; HDFS; access via SQL, Objects, JSON, GemFireXD, etc.; audiences: End Users, Developers, AppOps]
What Is Spark?
Hint: It’s all about the RDD
?
•  Is it “Big Data”?
•  Is it “Hadoop”?
•  It’s one of those “in memory” things, right?
•  JVM, Java, Scala?
•  Is it real, or just another shiny technology with a long, but ultimately small, tail?
Spark is …
•  Distributed/Cluster Compute Execution Engine
    –  Came out of the AMPLab project at UC Berkeley (UCB), now an ASF top-level project
•  Designed to work with data in memory
•  Scalability and fault tolerance similar to Hadoop Map/Reduce
    –  Uses lineage to reconstitute data instead of replication
•  Generalization of Map/Reduce
    –  Implementation of Resilient Distributed Dataset (RDD)
•  Programmatic or Interactive
•  Written in Scala
Spark is also …
•  An ASF top-level project
•  ~100 contributors across 25 companies
    –  More active than Hadoop MapReduce
•  An ecosystem of domain-specific tools
    –  Different models, but mostly interoperable
•  Hadoop compatible
Berkeley Data Analytics Stack (BDAS)
Support
•  Batch
•  Streaming
•  Interactive
Make it easy to compose them
Short History
•  2009: Started as a research project at UCB
•  2010: Open sourced
•  January 2011: AMPLab created
•  October 2012: 0.6
    –  Java, standalone cluster, Maven
•  June 21, 2013: Spark accepted into ASF Incubator
•  Feb 27, 2014: Spark becomes top-level ASF project
•  May 30, 2014: Spark 1.0
Spark Philosophy
•  Make life easy and productive for Data Scientists
•  Provide well-documented and expressive APIs
•  Powerful domain-specific libraries
•  Easy integration with storage systems
•  Caching to avoid data movement (performance)
•  Well-defined releases, stable API
Spark is not Hadoop, but is compatible
•  Often better than Hadoop (Eric Baldeschwieler)
    –  M/R is fine for “Data Parallel”, but awkward for some workloads
    –  Low-latency dispatch, iterative, streaming
•  Natively accesses Hadoop data
•  Spark is just another YARN job
    –  Maintains huge investment in data collection
    –  Brings Spark to the data
•  It’s not OR … it’s AND!
Improvements over Map/Reduce
•  Efficiency
    –  General execution graphs (not just map->reduce->store)
    –  In memory
•  Usability
    –  Rich APIs in Scala, Java, Python
    –  Interactive
•  Can Spark be the R for Big Data?
Spark Programming Model
RDDs in Detail
Core Concept
Think of a program as a set of transformations on a Distributed Dataset
Model: Resilient Distributed Dataset (RDD)
    –  Read-only collection of objects spread across a cluster
    –  RDDs are built through parallel transformations (map, filter, etc.)
    –  Automatically rebuilt on failure using lineage
    –  Controllable persistence (RAM, HDFS, etc.)
Operations
•  Create
    –  From stable storage (HDFS)
•  Transform
    –  Generate an RDD from another RDD (map, filter, groupBy)
    –  Lazy operations that build a DAG
    –  Once Spark knows your transformations, it can build an efficient plan
•  Action
    –  Return a result or write to storage (count, collect, reduce, save)
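A minimal sketch of the create / transform / action split, in plain Java for the POJGs. It uses java.util.stream purely as an analogy (streams are also lazy); it is not the Spark API, and the class name, method names, and sample log lines are made up for illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Analogy only: java.util.stream stands in for an RDD pipeline.
final class LazyPipelineSketch {

    // Counts how often the filter function actually runs.
    static final AtomicInteger filterCalls = new AtomicInteger();

    // "Create": hypothetical stand-in for loading lines from stable storage.
    static List<String> source() {
        return Arrays.asList("INFO start", "ERROR disk full", "WARN slow", "ERROR timeout");
    }

    // "Transform": describes the work; nothing is evaluated here.
    static Stream<String> errors(List<String> lines) {
        return lines.stream()
                .filter(line -> {
                    filterCalls.incrementAndGet();
                    return line.startsWith("ERROR ");
                })
                .map(line -> line.substring("ERROR ".length()));
    }

    // "Action": forces evaluation and returns a result to the caller.
    static List<String> collectErrors(List<String> lines) {
        return errors(lines).collect(Collectors.toList());
    }
}
```

Calling errors(source()) on its own leaves the counter at zero; the filter only runs once an action such as collectErrors asks for a result.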
Demo: Log Mining
•  Scala shell
•  Load file from HDFS
•  Search for patterns
Transformations and Actions
•  Transformations
    –  map
    –  filter
    –  flatMap
    –  sample
    –  groupByKey
    –  reduceByKey
    –  union
    –  join
    –  sort
•  Actions
    –  count
    –  collect
    –  reduce
    –  lookup
    –  save
RDD Fault Tolerance
•  RDDs maintain lineage information that can be used to reconstruct lost partitions

cachedMsgs = textFile(...).filter(_.contains("error"))
                          .map(_.split('\t')(2))
                          .cache()

Lineage: HdfsRDD (path: hdfs://…) -> FilteredRDD (func: contains(...)) -> MappedRDD (func: split(…)) -> CachedRDD
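To make the lineage idea concrete, here is a toy sketch in plain Java. ToyDataset is a hypothetical class invented for this example, not Spark's RDD: each dataset keeps a reference to its parent and the function that derives it, so a lost cached copy can be recomputed rather than restored from a replica. Everything runs in one JVM, and unlike real partitions the data is all-or-nothing.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;
import java.util.function.Predicate;

// Hypothetical toy class: remembers how it was derived (its lineage).
abstract class ToyDataset<T> {
    private List<T> cached;            // null means "not cached" or "lost"
    private boolean cacheEnabled = false;
    int computeCount = 0;              // how often this node was (re)computed

    abstract List<T> compute();        // derive the data from the parent

    ToyDataset<T> cache() { cacheEnabled = true; return this; }

    void loseCachedData() { cached = null; }   // simulate a failure

    List<T> collect() {
        if (cached != null) return cached;
        computeCount++;
        List<T> result = compute();
        if (cacheEnabled) cached = result;
        return result;
    }

    ToyDataset<T> filter(Predicate<T> p) {
        ToyDataset<T> parent = this;
        return new ToyDataset<T>() {
            @Override List<T> compute() {
                List<T> out = new ArrayList<>();
                for (T t : parent.collect()) if (p.test(t)) out.add(t);
                return out;
            }
        };
    }

    <R> ToyDataset<R> map(Function<T, R> f) {
        ToyDataset<T> parent = this;
        return new ToyDataset<R>() {
            @Override List<R> compute() {
                List<R> out = new ArrayList<>();
                for (T t : parent.collect()) out.add(f.apply(t));
                return out;
            }
        };
    }

    static <T> ToyDataset<T> fromStableStorage(List<T> data) {
        return new ToyDataset<T>() {
            @Override List<T> compute() { return new ArrayList<>(data); }
        };
    }
}

final class LineageSketch {
    // Mirrors the shape of the slide's pipeline: filter, map, cache.
    static ToyDataset<String> cachedMsgs(List<String> lines) {
        return ToyDataset.fromStableStorage(lines)
                .filter(line -> line.contains("error"))
                .map(line -> line.split("\t")[2])
                .cache();
    }
}
```

After loseCachedData(), the next collect() walks back up the chain (map, then filter, then the source) and rebuilds the same result.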
RDDs are Foundational
•  General purpose enough to implement other programming models
    –  SQL
    –  Graph
    –  ML
    –  MR
Related Projects
Things that run on Spark
Related Projects
•  Shark
•  Spark SQL
•  Spark Streaming
•  GraphX
•  MLbase
•  Others
Shark
•  Hive on Spark
    –  HiveQL, UDFs, etc.
•  Turn SQL into RDD
    –  Part of the lineage
•  Based on Hive, but takes advantage of Spark for
    –  Fast scheduling
    –  Queries are DAGs of jobs, not chained M/R
    –  Fast broadcast variables
© Apache Software Foundation
Shark (cont)
•  Optimized columnar storage format
•  Fast/efficient compression
    –  From Yahoo!
    –  Able to hold 3-20x more data in same cluster
•  Various other optimizations using partitioning
•  Will ultimately run on Spark SQL
    –  No Hive dependencies except for accessing the Hive datastore
    –  Long-running process with management tools
Spark SQL
•  Library in Spark Core to treat RDDs as relations
    –  SchemaRDD
•  Lighter-weight version of Shark
    –  No code from Hive
•  Import/export in different storage formats
    –  Parquet, learn schema from existing Hive warehouse
•  Takes columnar storage from Shark
Spark SQL Code
•  Go take a look
Spark Streaming
•  Extend Spark to do large-scale stream processing
    –  100s of nodes and second-scale end-to-end latency
•  Stateful processing
    –  Hard to make fault tolerant (FT)
    –  Storm: requires idempotent updates
•  Simple, batch-like API with RDDs
•  Single semantics for both real time and high latency
Streaming (cont)
•  Input is broken up into batches that become RDDs
•  RDDs are composed into DAGs to generate output
•  Raw data is replicated in memory for FT
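A rough sketch of the "input is broken up into batches" idea, in plain Java rather than the Spark Streaming API. One simplifying assumption to flag: batches here are cut by element count, whereas a streaming engine would typically cut them by time interval. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

final class MicroBatchSketch {

    // Splits an input sequence into fixed-size batches (the last may be smaller).
    static <T> List<List<T>> toBatches(List<T> input, int batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        List<List<T>> batches = new ArrayList<>();
        for (int start = 0; start < input.size(); start += batchSize) {
            int end = Math.min(start + batchSize, input.size());
            batches.add(new ArrayList<>(input.subList(start, end)));
        }
        return batches;
    }

    // Runs the same batch job on every batch, yielding one output per batch.
    static <T, R> List<R> processStream(List<T> input, int batchSize,
                                        Function<List<T>, R> batchJob) {
        List<R> outputs = new ArrayList<>();
        for (List<T> batch : toBatches(input, batchSize)) {
            outputs.add(batchJob.apply(batch));
        }
        return outputs;
    }
}
```

The point of the design is that batchJob is ordinary batch code: the same function could run over a stored dataset or over each slice of a live stream.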
Streaming (cont)
•  Other features
    –  Window-based transformations
    –  Arbitrary join of streams
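One way to picture a window-based transformation, sketched in plain Java under the assumption that each batch has already been reduced to a single count. The class and method names are hypothetical, and a window is measured here in number of batches rather than in seconds.

```java
import java.util.ArrayList;
import java.util.List;

final class WindowSketch {

    // For each batch i, sums the counts of the last `windowLength` batches ending at i.
    static List<Integer> windowedSums(List<Integer> perBatchCounts, int windowLength) {
        if (windowLength <= 0) {
            throw new IllegalArgumentException("windowLength must be positive");
        }
        List<Integer> result = new ArrayList<>();
        int running = 0;
        for (int i = 0; i < perBatchCounts.size(); i++) {
            running += perBatchCounts.get(i);             // newest batch enters the window
            if (i >= windowLength) {
                running -= perBatchCounts.get(i - windowLength);  // oldest batch leaves
            }
            result.add(running);
        }
        return result;
    }
}
```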
GraphX (Alpha)
•  Graph processing
    –  Replaces Spark Bagel
•  Graph-parallel, not data-parallel
    –  Reason in the context of neighbors
    –  GraphLab API
GraphX (cont)
•  Predicting things about people (e.g., political bias)
    –  Look at posts, apply classifier, try to predict attribute
    –  Local signal alone is difficult
    –  Look at context of social network to improve prediction
•  Triangle processing
    –  More triangles reveal greater community
•  Collaborative Filtering
    –  Bipartite graph processing
    –  What I like, who rated those things, what they like => what I may like
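As an illustration of triangle processing, here is a small single-machine sketch in plain Java. It is not GraphX code: the adjacency-map representation and the class and method names are assumptions made for this example, and a distributed implementation would partition the graph rather than hold it in one map.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

final class TriangleSketch {

    // Builds an undirected adjacency map from edge pairs {a, b}.
    static Map<Integer, Set<Integer>> adjacency(int[][] edges) {
        Map<Integer, Set<Integer>> adj = new HashMap<>();
        for (int[] e : edges) {
            adj.computeIfAbsent(e[0], k -> new HashSet<>()).add(e[1]);
            adj.computeIfAbsent(e[1], k -> new HashSet<>()).add(e[0]);
        }
        return adj;
    }

    // Counts each triangle exactly once by only accepting vertices with a < b < c.
    static int countTriangles(int[][] edges) {
        Map<Integer, Set<Integer>> adj = adjacency(edges);
        int triangles = 0;
        for (Map.Entry<Integer, Set<Integer>> entry : adj.entrySet()) {
            int a = entry.getKey();
            Set<Integer> neighborsOfA = entry.getValue();
            for (int b : neighborsOfA) {
                if (b <= a) continue;
                for (int c : adj.get(b)) {
                    // a-b and b-c are edges; the triangle closes if a-c is one too.
                    if (c > b && neighborsOfA.contains(c)) triangles++;
                }
            }
        }
        return triangles;
    }
}
```

Each vertex reasons only about its own neighbors, which is the "context of neighbors" style of computation the GraphX slides describe.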
GraphX (cont)
•  Graph Creation => Algorithm => Post Processing
    –  Existing systems mainly deal with the Algorithm and are not interactive
    –  Unify collection and graph models
•  Graphs have
    –  Vertices, edges
    –  Transformations: reverse, filter, map
    –  Joins: graphs and tables
    –  Aggregate neighbors
MLbase
•  Machine Learning toolset
    –  Library and higher-level abstractions
•  General tool is MATLAB
    –  Difficult for end users to learn, debug, scale solutions
•  Starting with MLlib
    –  Low-level distributed Machine Learning library
•  Many different algorithms
    –  Classification, Regression, Collaborative Filtering, etc.
Others
•  Mesos
    –  Enable multiple frameworks to share same cluster resources
    –  Twitter is largest user: over 6,000 servers
•  Tachyon
    –  In-memory, fault-tolerant file system that exposes HDFS
•  Catalyst
    –  SQL query optimizer
Spark 1.0
Release cycle
•  1.0 came out at the end of May
•  1.X expected to be current for several years
•  Quarterly release cycle
    –  2 months dev / 1 month QA
    –  Actual release is based on vote
•  1.1 due end of August
1.0
•  API stability in 1.X for all non-Alpha projects
    –  Can recompile jobs, but hoping for binary compatibility
    –  Internal APIs are marked @DeveloperApi or @Experimental
•  Focus: Core Engine, Streaming, MLlib, Spark SQL
•  History Server for Spark UI
    –  Driving development of instrumentation
•  Job Submission Tool
    –  Don’t configure Context in code (e.g., master)
1.0
•  Java 8 lambdas
    –  No more writing closures as classes
    –  Functions are interfaces
    –  Return-type-sensitive functions
        ▪  mapToPair
•  Python improvements
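A small before-and-after sketch of what "no more writing closures as classes" looks like for a Java developer. ToyPairFunction and the static mapToPair helper are hypothetical stand-ins defined here so the example is self-contained; they imitate the shape of a single-method function interface and are not Spark's own types.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

final class LambdaSketch {

    // Hypothetical single-method interface standing in for a function type.
    interface ToyPairFunction<T, K, V> {
        Map.Entry<K, V> call(T input);
    }

    // Hypothetical helper: applies a pair-returning function to every element.
    static <T, K, V> List<Map.Entry<K, V>> mapToPair(List<T> input,
                                                     ToyPairFunction<T, K, V> f) {
        List<Map.Entry<K, V>> out = new ArrayList<>();
        for (T t : input) {
            out.add(f.call(t));
        }
        return out;
    }

    // Pre-Java 8 style: the closure is written out as an anonymous class.
    static List<Map.Entry<String, Integer>> withAnonymousClass(List<String> words) {
        return mapToPair(words, new ToyPairFunction<String, String, Integer>() {
            @Override
            public Map.Entry<String, Integer> call(String word) {
                return new SimpleEntry<>(word, 1);
            }
        });
    }

    // Java 8 style: the same function as a one-line lambda.
    static List<Map.Entry<String, Integer>> withLambda(List<String> words) {
        return mapToPair(words, word -> new SimpleEntry<>(word, 1));
    }
}
```

Because the function type is an interface with one method, the lambda and the anonymous class are interchangeable at the call site; a dedicated method name such as mapToPair tells the compiler which return shape to expect.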
1.0
•  Hadoop security
    –  Kerberos, ACL for UI
•  Job cancel from UI
•  Distributed GC as things go out of scope
    –  Good for long-lived services
•  Spark SQL
More Code and Demos
WordCount, TicTacToe, Java8
Code Review: WordCount
•  Java API
•  Java Code
•  More usage of RDDs
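The code shown in the talk is not part of this transcript. As a stand-in, here is a hedged sketch of the WordCount shape (flatMap to words, pair each word with 1, reduce by key) using only java.util.stream. It runs in a single JVM and is not the Spark Java API; the class and method names are invented for illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

final class WordCountSketch {

    // Counts word occurrences; a TreeMap keeps the result sorted by word.
    static Map<String, Integer> wordCount(List<String> lines) {
        return lines.stream()
                // flatMap: one line becomes many words
                .flatMap(line -> Arrays.stream(line.split("\\s+")))
                // drop the empty token produced by leading whitespace
                .filter(word -> !word.isEmpty())
                // map each word to (word, 1) and merge counts per key
                .collect(Collectors.toMap(word -> word, word -> 1, Integer::sum, TreeMap::new));
    }
}
```

The merge function passed to toMap plays the role that a reduce-by-key step plays in a distributed word count: values for the same key are combined pairwise.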
TicTacToe: a developer’s experience
•  IDE
•  Spring
•  Building/Logging
•  Debugging
Demo: Java 8
Lambda Lambda Lambda
Deployment Topologies
Topologies
•  Local
•  Spark Cluster (master/slaves)
•  Cluster Resource Managers
    –  YARN
    –  Mesos
•  (PaaS?)
Demo:
•  Start master and slaves
•  Show the UI
•  Run a Job
•  Talk about the History Server
This
And That
How Real is Spark?
•  There is some criticism
    –  As expected
    –  New project!
•  There are many indicators that Spark is heading to success
    –  Solid technology
    –  Good buzz
    –  Significant community
Next Steps
•  Spark website: http://spark.apache.org
    –  Lots’O’Goodstuff
•  Spark Summit June 30/July 01
    –  http://spark-summit.org
A NEW PLATFORM FOR A NEW ERA

Odoo 14 - eLearning Module In Odoo 14 Enterprise
 
Ahmed Motair CV April 2024 (Senior SW Developer)
Ahmed Motair CV April 2024 (Senior SW Developer)Ahmed Motair CV April 2024 (Senior SW Developer)
Ahmed Motair CV April 2024 (Senior SW Developer)
 
KnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptx
KnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptxKnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptx
KnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptx
 
What is Fashion PLM and Why Do You Need It
What is Fashion PLM and Why Do You Need ItWhat is Fashion PLM and Why Do You Need It
What is Fashion PLM and Why Do You Need It
 
Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...
Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...
Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...
 
Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...
Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...
Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...
 
Buds n Tech IT Solutions: Top-Notch Web Services in Noida
Buds n Tech IT Solutions: Top-Notch Web Services in NoidaBuds n Tech IT Solutions: Top-Notch Web Services in Noida
Buds n Tech IT Solutions: Top-Notch Web Services in Noida
 
Introduction Computer Science - Software Design.pdf
Introduction Computer Science - Software Design.pdfIntroduction Computer Science - Software Design.pdf
Introduction Computer Science - Software Design.pdf
 
Call Us🔝>༒+91-9711147426⇛Call In girls karol bagh (Delhi)
Call Us🔝>༒+91-9711147426⇛Call In girls karol bagh (Delhi)Call Us🔝>༒+91-9711147426⇛Call In girls karol bagh (Delhi)
Call Us🔝>༒+91-9711147426⇛Call In girls karol bagh (Delhi)
 
Recruitment Management Software Benefits (Infographic)
Recruitment Management Software Benefits (Infographic)Recruitment Management Software Benefits (Infographic)
Recruitment Management Software Benefits (Infographic)
 
Automate your Kamailio Test Calls - Kamailio World 2024
Automate your Kamailio Test Calls - Kamailio World 2024Automate your Kamailio Test Calls - Kamailio World 2024
Automate your Kamailio Test Calls - Kamailio World 2024
 
Cloud Data Center Network Construction - IEEE
Cloud Data Center Network Construction - IEEECloud Data Center Network Construction - IEEE
Cloud Data Center Network Construction - IEEE
 

Spark For Plain Old Java Geeks (June2014 Meetup)

  • 1. Intro to Apache Spark
      A primer for POJGs (Plain Old Java Geeks)
      Scott Deeg: Sr. Field Engineer, sdeeg@gopivotal.com
      © Copyright 2013 Pivotal. All rights reserved.
  • 2. Agenda
      • Intro: Agenda, it’s all about ME!, 10 seconds on Pivotal
      • What is Spark, and what does it have to do with BigData/Hadoop?
          – Ecosystem (Shark, Streaming, MLlib, GraphX)
      • Spark Programming Model
          – Demo: interactive shell
      • Related Projects
      • Spark 1.0
      • More Tech: WordCount, TicTacToe – dev experience, Java8
      • Deployment Topologies
          – Simple Cluster Demo
  • 3. Who Am I? Just a Plain Old Java Guy
      • Java since 1996, Symantec Visual Café 1.0
      • Random consulting around Si Valley
      • Hacker on Java based BPM product for 10 years
      • Joined VMW 2009 when they acquired SpringSource
      • Rolled into Pivotal April 1 2013
  • 4. What is Pivotal?
      • Cloud, Big Data, Fast Data, Modern Apps
      • Technology Bets
          – HDFS will be the way we talk to Enterprise data repositories
              ▪ Consolidate Silos in “Data Lake”
              ▪ Eco-system of services will arise to utilize HDFS data
          – PaaS will manage the Application Life Cycle
          – OSS will be the basis for solutions
          – Cloud Architecture
              ▪ Distributed / Parallel
              ▪ CPU, Memory, Network … storage is a distributed service
  • 5. [Diagram: Pivotal Platform. Labels: Data Sources, Application Platform, Stream Server, IMDG, ASF Services, MPP SQL, HDFS, SQL, Objects, JSON, GemFireXD, ...ETC, End Users, Developers, AppOps]
  • 6. What Is Spark? Hint: It’s all about the RDD
  • 7. ?
      • Is it “Big Data”
      • Is it “Hadoop”
      • It’s one of those “in memory” things, right
      • JVM, Java, Scala
      • Is it Real or just another shiny technology with a long, but ultimately small tail
  • 8. Spark is …
      • Distributed/Cluster Compute Execution Engine
          – Came out of AMPLab project at UCB, now ASF top level project
      • Designed to work with data in memory
      • Similar scalability and fault tolerance as Hadoop Map/Reduce
          – Utilizes Lineage to reconstitute data instead of replication
      • Generalization of Map/Reduce
          – Implementation of Resilient Distributed Dataset (RDD)
      • Programmatic or Interactive
      • Written in Scala
  • 9. Spark is also …
      • An ASF Top Level project
      • Has ~100 contributors across 25 companies
          – More active than Hadoop MapReduce
      • An eco-system of domain specific tools
          – Different models, but mostly interoperable
      • Hadoop Compatible
  • 10. Berkeley Data Analytics Stack (BDAS)
      Support
      • Batch
      • Streaming
      • Interactive
      Make it easy to compose them
  • 11. Short History
      • 2009 Started as research project at UCB
      • 2010 Open Sourced
      • January 2011 AMPLab Created
      • October 2012 0.6
          – Java, Standalone cluster, Maven
      • June 21 2013 Spark accepted into ASF Incubator
      • Feb 27 2014 Spark becomes top level ASF project
      • May 30 2014 Spark 1.0
  • 12. Spark Philosophy
      • Make life easy and productive for Data Scientists
      • Provide well documented and expressive APIs
      • Powerful Domain Specific Libraries
      • Easy integration with storage systems
      • Caching to avoid data movement (performance)
      • Well defined releases, stable API
  • 13. Spark is not Hadoop, but is compatible
      • Often better than Hadoop (Eric Baldeschwieler)
          – M/R fine for “Data Parallel”, but awkward for some workloads
          – Low latency dispatch, Iterative, Streaming
      • Natively accesses Hadoop data
      • Spark is just another YARN job
          – Maintains huge investment in data collection
          – Brings Spark to the Data
      • It’s not OR … it’s AND!
  • 14. Improvements over Map/Reduce
      • Efficiency
          – General Execution Graphs (not just map->reduce->store)
          – In memory
      • Usability
          – Rich APIs in Scala, Java, Python
          – Interactive
      • Can Spark be the R for Big Data?
  • 15. Spark Programming Model: RDDs in Detail
  • 16. Core Concept
      Think of a program as a set of transformations on a Distributed Dataset
      Model: Resilient Distributed Dataset (RDD)
          – Read Only Collection of Objects spread across a cluster
          – RDDs are built through parallel transformations (map, filter, etc.)
          – Automatically rebuilt on failure using lineage
          – Controllable persistence (RAM, HDFS, etc.)
  • 17. Operations
      • Create
          – From stable storage (hdfs)
      • Transform
          – Generate RDD from other RDD (map, filter, groupBy)
          – Lazy Operations that build a DAG
          – Once Spark knows your transformations it can build an efficient plan
      • Action
          – Return a result or write to storage (count, collect, reduce, save)
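The transform/action split on slide 17 can be illustrated without a cluster. The following is a minimal sketch that uses `java.util.stream` from the plain JDK as a stand-in; it is an analogy, not Spark's API. The class name `LazyPipelineSketch`, the method names, and the tab-separated log format are all assumptions made for illustration. Intermediate stream operations only describe work, much as transformations only extend the DAG, and the terminal operation plays the role of an action.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Analogy only: JDK streams standing in for lazy RDD transformations.
final class LazyPipelineSketch {

    // "Transform": describe filter + map. Nothing is evaluated here.
    // The trace list records each line the filter touches, to make laziness visible.
    static Stream<String> errorFields(List<String> lines, List<String> trace) {
        return lines.stream()
                .filter(line -> {
                    trace.add(line);
                    return line.contains("error");
                })
                // Assumes tab-separated lines with at least three fields.
                .map(line -> line.split("\t")[2]);
    }

    // "Action": the terminal operation forces the whole pipeline to run.
    static List<String> collect(Stream<String> pipeline) {
        return pipeline.collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                "2014-06-01\terror\tdisk full",
                "2014-06-01\tinfo\tstarted",
                "2014-06-02\terror\ttimeout");
        List<String> trace = new ArrayList<>();

        Stream<String> pipeline = errorFields(lines, trace);
        System.out.println("lines touched before action: " + trace.size());

        List<String> messages = collect(pipeline);
        System.out.println("lines touched after action: " + trace.size());
        System.out.println(messages);
    }
}
```

Unlike an RDD, a JDK stream can be consumed only once and has no notion of partitions or a cluster, so the sketch covers the laziness point and nothing more.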
  • 18. Demo: Log Mining
      • Scala shell
      • Load file from HDFS
      • Search for patterns
  • 19. Transformations and Actions
      • Transformations
          – map
          – filter
          – flatMap
          – sample
          – groupByKey
          – reduceByKey
          – union
          – join
          – sort
      • Actions
          – count
          – collect
          – reduce
          – lookup
          – save
  • 20. RDD Fault Tolerance
      • RDDs maintain lineage information that can be used to reconstruct lost partitions
          cachedMsgs = textFile(...).filter(_.contains("error")).map(_.split('\t')(2)).cache()
      Lineage: HdfsRDD (path: hdfs://…) -> FilteredRDD (func: contains(...)) -> MappedRDD (func: split(…)) -> CachedRDD
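The lineage idea on slide 20 (keep the recipe, not a replica) can be sketched in plain Java. This is a toy model under stated assumptions, not how Spark implements RDDs: the class `LineageSketch`, its methods, and the `lose()` call that simulates losing cached data are all hypothetical names introduced here, and it models a single in-memory dataset rather than partitions.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;
import java.util.function.Predicate;
import java.util.function.Supplier;
import java.util.stream.Collectors;

// Toy model: each dataset remembers how to rebuild itself from its parent.
final class LineageSketch<T> {
    private final Supplier<List<T>> recipe; // the lineage: how to recompute the data
    private List<T> cached;                 // materialized data, which may be lost
    int computeCount = 0;                   // how many times the recipe has run

    private LineageSketch(Supplier<List<T>> recipe) {
        this.recipe = recipe;
    }

    // Root of the lineage: data that lives in stable storage.
    static <T> LineageSketch<T> fromSource(List<T> stable) {
        return new LineageSketch<T>(() -> stable);
    }

    LineageSketch<T> filter(Predicate<T> p) {
        return new LineageSketch<T>(
                () -> this.get().stream().filter(p).collect(Collectors.toList()));
    }

    <R> LineageSketch<R> map(Function<T, R> f) {
        return new LineageSketch<R>(
                () -> this.get().stream().map(f).collect(Collectors.toList()));
    }

    // Returns the cached data, recomputing it from the parent if it is missing.
    List<T> get() {
        if (cached == null) {
            cached = recipe.get();
            computeCount++;
        }
        return cached;
    }

    // Simulates losing the in-memory copy (for example, a failed node).
    void lose() {
        cached = null;
    }

    public static void main(String[] args) {
        LineageSketch<String> msgs = LineageSketch
                .fromSource(Arrays.asList("a\terror\tdisk full", "b\tinfo\tstarted"))
                .filter(line -> line.contains("error"))
                .map(line -> line.split("\t")[2]);

        System.out.println(msgs.get()); // computed on first use
        msgs.lose();                    // drop the cached copy
        System.out.println(msgs.get()); // rebuilt by replaying the recipe
        System.out.println("recomputations: " + msgs.computeCount);
    }
}
```

The design point is that nothing is replicated: after `lose()`, the same filter and map are replayed against the parent, which is the trade the slide describes between lineage and replication.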
  • 21. RDDs are Foundational
      • General purpose enough to implement other programming models
          – SQL
          – Graph
          – ML
          – MR
  • 22. Related Projects: Things that run on Spark
  • 23. Related Projects
      • Shark
      • Spark SQL
      • Spark Streaming
      • GraphX
      • MLbase
      • Others
  • 24. Shark
      • Hive on Spark
          – HiveQL, UDFs, etc.
      • Turn SQL into RDD
          – Part of the lineage
      • Based on Hive, but takes advantage of Spark for
          – Fast Scheduling
          – Queries are DAGs of jobs, not chained M/R
          – Fast broadcast variables
      (Image © Apache Software Foundation)
  • 25. Shark (cont)
      • Optimized Columnar Storage format
      • Fast/Efficient Compression
          – From Yahoo!
          – Able to hold 3-20x more data in same cluster
      • Various other optimizations using partitioning
      • Will ultimately run on Spark SQL
          – No Hive dependencies except for accessing the Hive datastore
          – Long running process with management tools
  • 26. Spark SQL
      • Lib in Spark Core to treat RDDs as relations
          – SchemaRDD
      • Lighter weight version of Shark
          – No code from Hive
      • Import/Export in different Storage formats
          – Parquet, learn schema from existing Hive warehouse
      • Takes columnar storage from Shark
  • 27. Spark SQL Code
      • Go take a look
  • 28. Spark Streaming
      • Extend Spark to do large scale stream processing
          – 100s of nodes and second-scale end-to-end latency
      • Stateful Processing
          – Hard to make fault tolerant (FT)
          – Storm: requires idempotent updates
      • Simple, batch like API with RDDs
      • Single semantics for both real time and high latency
  • 29. Streaming (cont)
      • Input is broken up into Batches that become RDDs
      • RDDs are composed into DAGs to generate output
      • Raw data is replicated in-memory for FT
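The "input is broken up into batches" idea on slide 29 can be sketched in plain Java as follows. This is an illustration of micro-batching in general, not Spark Streaming's API: the `Event` type, the method names, and the 1000 ms interval used in `main` are assumptions, and timestamps are assumed to be non-negative.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Illustration only: chop a stream of events into fixed-interval batches,
// then run the same batch-style function over every batch.
final class MicroBatchSketch {

    static final class Event {
        final long timestampMs;
        final String value;

        Event(long timestampMs, String value) {
            this.timestampMs = timestampMs;
            this.value = value;
        }
    }

    // Groups events by the start time of the interval they fall into.
    static TreeMap<Long, List<String>> toBatches(List<Event> events, long batchMs) {
        TreeMap<Long, List<String>> batches = new TreeMap<>();
        for (Event e : events) {
            long batchStart = (e.timestampMs / batchMs) * batchMs;
            batches.computeIfAbsent(batchStart, k -> new ArrayList<>()).add(e.value);
        }
        return batches;
    }

    // The per-batch computation: the same code a batch job would run.
    static TreeMap<Long, Long> countErrors(Map<Long, List<String>> batches) {
        TreeMap<Long, Long> counts = new TreeMap<>();
        for (Map.Entry<Long, List<String>> batch : batches.entrySet()) {
            long errors = batch.getValue().stream()
                    .filter(v -> v.contains("error"))
                    .count();
            counts.put(batch.getKey(), errors);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<Event> events = Arrays.asList(
                new Event(100, "error a"),
                new Event(900, "ok"),
                new Event(1200, "error b"),
                new Event(1900, "error c"));
        System.out.println(countErrors(toBatches(events, 1000)));
    }
}
```

Because each batch is an ordinary collection, the per-batch function is the same code a batch job would use, which is the "single semantics" point made on the preceding slide.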
  • 30. Streaming (cont)
      • Other features
          – Window-based Transformations
          – Arbitrary join of streams
  • 31. GraphX (Alpha)
      • Graph processing
          – Replaces Spark Bagel
      • Graph Parallel not Data Parallel
          – Reason in the context of neighbors
          – GraphLab API
  • 32. GraphX (cont)
      • Predicting things about people (e.g. political bias)
          – Look at posts, apply classifier, try to predict attribute
          – Local signal is difficult alone
          – Look at context of social network to improve prediction
      • Triangle processing
          – More triangles reveal greater community
      • Collaborative Filtering
          – Bipartite graph processing
          – What I like, who rated those things, what they like => what I may like
  • 33. GraphX (cont)
      • Graph Creation => Algorithm => Post Processing
          – Existing systems mainly deal with the Algorithm step and are not interactive
          – Unify collection and graph models
      • Graphs have
          – Vertices, edges
          – Transformation: reverse, filter, map
          – Joins: graphs and tables
          – Aggregate Neighbors
  • 34. MLbase
      • Machine Learning toolset
          – Library and higher level abstractions
      • General tool is MATLAB
          – Difficult for end users to learn, debug, scale solutions
      • Starting with MLlib
          – Low level Distributed Machine Learning Library
      • Many different Algorithms
          – Classification, Regression, Collaborative Filtering, etc.
  • 35. Others
      • Mesos
          – Enable multiple frameworks to share same cluster resources
          – Twitter is largest user: Over 6,000 servers
      • Tachyon
          – In-memory, fault tolerant file system that exposes HDFS
      • Catalyst
          – SQL Query Optimizer
  • 36. Spark 1.0
  • 37. Release cycle
      • 1.0 came out at end of May
      • 1.X expected to be current for several years
      • Quarterly release cycle
          – 2 mo dev / 1 mo QA
          – Actual release is based on vote
      • 1.1 due end of August
  • 38. 1.0
      • API Stability in 1.X for all non-Alpha projects
          – Can recompile jobs, but hoping for binary compatibility
          – Internal APIs are marked @DeveloperApi or @Experimental
      • Focus: Core Engine, Streaming, MLlib, Spark SQL
      • History Server for Spark UI
          – Driving development of instrumentation
      • Job Submission Tool
          – Don’t configure Context in code (e.g. master)
  • 39. 1.0
      • Java8 Lambdas
          – No more writing closures as Classes
          – Functions are interfaces
          – Return type sensitive functions
              ▪ mapToPair
      • Python improvements
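The Java 8 points on slide 39 can be sketched with plain JDK code. The interfaces `Fn` and `PairFn` and the helpers `map` and `mapToPair` below are hypothetical stand-ins defined only for this example; they mimic the shape of single-method function interfaces rather than reproduce Spark's own classes. The sketch shows the same closure written as an anonymous class and as a lambda, and why a pair-returning variant is plausibly given its own method name: a lambda carries no declared return type, so overloads that differ only in the function's return type are hard for the compiler to tell apart.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;

// Stand-in types for illustration; not Spark's API.
final class LambdaSketch {

    // "Functions are interfaces": one abstract method, so a lambda can implement it.
    interface Fn<T, R> {
        R call(T t);
    }

    interface PairFn<T, K, V> {
        Map.Entry<K, V> call(T t);
    }

    static <T, R> List<R> map(List<T> in, Fn<T, R> f) {
        List<R> out = new ArrayList<>();
        for (T t : in) {
            out.add(f.call(t));
        }
        return out;
    }

    // A separate name for the pair-returning variant, instead of an overload of map.
    static <T, K, V> List<Map.Entry<K, V>> mapToPair(List<T> in, PairFn<T, K, V> f) {
        List<Map.Entry<K, V>> out = new ArrayList<>();
        for (T t : in) {
            out.add(f.call(t));
        }
        return out;
    }

    // Before Java 8: the closure is written as an anonymous class.
    static final Fn<String, Integer> LENGTH_AS_CLASS = new Fn<String, Integer>() {
        @Override
        public Integer call(String s) {
            return s.length();
        }
    };

    // With Java 8: the same closure as a lambda.
    static final Fn<String, Integer> LENGTH_AS_LAMBDA = s -> s.length();

    public static void main(String[] args) {
        List<String> words = Arrays.asList("spark", "rdd");
        System.out.println(map(words, LENGTH_AS_CLASS));
        System.out.println(map(words, LENGTH_AS_LAMBDA));

        List<Map.Entry<String, Integer>> pairs =
                mapToPair(words, w -> new SimpleEntry<String, Integer>(w, 1));
        System.out.println(pairs);
    }
}
```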
  • 40. 1.0
      • Hadoop security
          – Kerberos, ACL for UI
      • Job cancel from UI
      • Distributed GC as things go out of scope
          – Good for long-lived services
      • Spark SQL
  • 41. More Code and Demos: WordCount, TicTacToe, Java8
  • 42. Code Review: WordCount
      • Java API
      • Java Code
      • More usage of RDDs
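The WordCount code reviewed on slide 42 is not included in the transcript. As a rough idea of the shape of the computation, here is a minimal sketch using only JDK streams; it is not the code from the talk and not Spark's Java API. The class and method names are made up for this example. The steps loosely mirror the flatMap, pair, and reduce-by-key stages of a typical word count.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

// Single-JVM analogy of word count; names are illustrative only.
final class WordCountSketch {

    static Map<String, Long> wordCount(List<String> lines) {
        return lines.stream()
                // One line becomes many words (the flatMap step).
                .flatMap(line -> Arrays.stream(line.trim().split("\\s+")))
                // Blank lines produce an empty token; drop it.
                .filter(word -> !word.isEmpty())
                // Group equal words and count them (the reduce-by-key step).
                // TreeMap keeps the keys sorted so the output is stable.
                .collect(Collectors.groupingBy(
                        word -> word, TreeMap::new, Collectors.counting()));
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("to be or", "not to be");
        System.out.println(wordCount(lines));
    }
}
```

In a distributed setting the grouping step is where data would be shuffled between nodes; the in-process collector hides that cost entirely.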
  • 43. TicTacToe: a developer’s experience
      • IDE
      • Spring
      • Building/Logging
      • Debugging
  • 44. Demo: Java 8
      Lambda Lambda Lambda
  • 45. Deployment Topologies
  • 46. Topologies
      • Local
      • Spark Cluster (master/slaves)
      • Cluster Resource Managers
          – YARN
          – Mesos
      • (PaaS?)
  • 47. Demo:
      • Start master and slaves
      • Show the UI
      • Run a Job
      • Talk about the History Server
  • 48. This And That
  • 49. How Real is Spark?
      • There is some criticism
          – As expected
          – New project!
      • There are many indicators that Spark is heading to success
          – Solid technology
          – Good buzz
          – Significant community
  • 50. Next Steps
      • Spark website: http://spark.apache.org
          – Lots’O’Goodstuff
      • Spark Summit June 30/July 01
          – http://spark-summit.org
  • 51. A NEW PLATFORM FOR A NEW ERA