Introduction to Apache Spark 
Scott Deeg – Sr. Field Engineer, Pivotal 
Unless otherwise indicated, these slides are © 2013-2014 Pivotal Software, Inc. and licensed under a 
Creative Commons Attribution-NonCommercial license: http://creativecommons.org/licenses/by-nc/3.0/
Who Am I? 
A Plain Old Java Geek 
• Came to Si Valley seeking fame and fortune in 1995 (still looking) 
• Started working in Java Jan 1996, Symantec Visual Café 1.0 
• Hacker on J2EE based BPM product for 10 years 
• Joined VMware 2009 / Rolled into Pivotal April 1 2013 
• Primarily pre-sales consulting for large/medium enterprises 
sdeeg@pivotal.io 
Random Facts: CalPoly SLO, Physics, Guitar/Lutherie, Arduino, 3yr old boy, 100 yr old house 
(aka: Lots’O’work), spaces not tabs 
Agenda 
• What is Spark? 
• Programming Model 
• Product ecosystem
• Spark and Spring 
• A bit on Internals 
(with demos along the way)
What people have been asking me about Spark 
• It’s one of those in-memory things, right? (yes)
• Is it “Big Data”? (yes)
• Is it Hadoop? (no)
• JVM, Java, Scala (yes) 
• Is it “Real” or just another shiny technology with a long, but 
ultimately small tail (?) 
What is Spark? 
Official Definition 
Apache Spark is a fast and general 
engine for large scale data processing 
Spark is … 
• Distributed/Cluster Compute Engine 
• A toolset for Data Scientists / Analysts 
• Runs “batch” workloads in memory 
• Hadoop Compatible 
• Implementation of Resilient Distributed Dataset (RDD) in Scala 
• Programmatic interface via API or Interactive 
• Scala, Java7/8, Python 
Spark is also … 
• An ASF Top Level project http://spark.apache.org 
• Came out of AMPLab project at UCB 
• An active community 
• ~100-200 contributors across 25-35 companies 
• More active than Hadoop MapReduce 
• 1000 people (max) attended Spark Summit 2014 in SF 
• An eco-system of domain specific tools 
• Different models, but interoperable 
• Backed by a commercial entity: Databricks 
Spark is not … 
• An OLTP data store 
• A permanent or stable data store 
• An app cache 
It’s also not Mature 
• Lots of room to grow. 
Short History 
• 2009 Started as research project at UCB 
• 2010 Open Sourced 
• January 2011 AMPLab Created 
• October 2012 version 0.6 
• Java, Stand alone cluster, maven 
• June 21 2013 Spark accepted into ASF Incubator 
• Feb 27 2014 Spark becomes top level ASF project 
• May 30 2014 Spark 1.0 
• August 2014 1.0.2 
Spark Team Goals 
• Make life easy and productive for Data Scientists 
• Provide well documented and expressive APIs 
• Powerful Domain Specific Libraries 
• Easy integration with common Big Data storage systems 
• High Performance 
• Well defined releases, stable API 
Spark is not Hadoop, but is compatible 
• Often better than Hadoop 
• M/R fine for “Data Parallel”, but awkward for some workloads 
• Low latency, Iterative, Streaming 
• Natively accesses Hadoop data 
• Spark is YAYJ (Yet Another YARN Job) 
• Utilize current investments in Hadoop 
• Brings Spark (closer) to the Data 
• Similar scalability and fault tolerance characteristics as Hadoop 
It’s not OR … it’s AND 
Improvements over Map/Reduce 
• Efficiency 
• General Execution Graphs (not just map->reduce->store) 
• In memory 
• Useful for iterative processing 
• Usability 
• Rich APIs in Scala, Java, Python 
• Interactive REPL 
Can Spark be the R for Big Data? 
Topologies 
• Local in JVM or through REPL 
• Great for dev 
• Spark Cluster (master/slaves) 
• Improving rapidly 
• Cluster Resource Managers 
• YARN 
• Mesos
• (PaaS?) 
Spark Programming Model 
Core Spark Concept 
In the Spark model a program is a set of transformations and 
actions on a dataset with the following properties: 
Resilient Distributed Dataset (RDD) 
• Read Only Collection of Objects spread across a cluster 
• RDDs are built through parallel transformations (map, filter, …) 
• Results are generated by actions (reduce, collect, …) 
• Automatically rebuilt on failure using lineage 
• Controllable persistence (RAM, HDFS, etc.) 
Two Categories of Operations 
• Transform 
• Create from stable storage (hdfs, tachyon, etc.) 
• Generate new RDDs from other RDD 
• Lazy Operations that build a DAG 
• Once Spark knows your transformations it can build a plan 
• Action 
• Return a result or write to storage 
• Actions cause the DAG to execute 
Ø map 
Ø filter 
Ø flatMap 
Ø sample 
Ø groupByKey 
Ø reduceByKey 
Ø union 
Ø join 
Ø sort 
Ø count 
Ø collect 
Ø reduce 
Ø lookup 
Ø save 
Demo 
WordCount (of course) 
val file = sc.textFile("hdfs://bfm1/…") 
val words = file.flatMap(line => line.split(" ")) 
val wordOneMap = words.map(word => (word, 1)) 
val counts = wordOneMap.reduceByKey(_ + _) 
counts.collect() 
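For Java developers, the same three steps (flatMap, map to pairs, reduce by key) can be tried locally with plain Java 8 streams. This sketch uses no Spark at all; the class name LocalWordCount and the sample input are made up, and a sorted map is used only so the printed order is predictable.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

// Local, single-JVM analogue of the WordCount demo; illustration only, not Spark.
public class LocalWordCount {

    public static Map<String, Integer> countWords(List<String> lines) {
        return lines.stream()
            // like file.flatMap(line => line.split(" "))
            .flatMap(line -> Arrays.stream(line.split(" ")))
            // like map(word => (word, 1)) followed by reduceByKey(_ + _):
            // each word contributes 1, and values for the same key are summed
            .collect(Collectors.toMap(word -> word, word -> 1, Integer::sum, TreeMap::new));
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("to be or", "not to be"); // made-up input
        System.out.println(countWords(lines)); // {be=2, not=1, or=1, to=2}
    }
}
```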
RDD Fault Tolerance 
• RDDs maintain lineage information that can be used to 
reconstruct lost partitions 
cachedMsgs = textFile(...).filter(_.contains("error"))
                          .map(_.split('\t')(2))
                          .cache()
Lineage: HdfsRDD (path: hdfs://…) → FilteredRDD (func: contains(...)) → MappedRDD (func: split(…)) → CachedRDD
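A rough plain-Java sketch of the idea, assuming the slide's chain is "keep lines containing error, then take the third tab-separated field". The class LineageSketch, its fields, and the log lines are all hypothetical; the point is only that a lost cached partition can be rebuilt by replaying the recorded functions against the source, without recomputing the other partitions.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Illustration only, not Spark: a cache of partitions that can be rebuilt from lineage.
public class LineageSketch {
    // Stand-in for stable storage: two partitions of made-up, tab-separated log lines.
    static final List<List<String>> SOURCE = Arrays.asList(
        Arrays.asList("t1\terror\tdisk full", "t2\tinfo\tok"),
        Arrays.asList("t3\terror\ttimeout"));

    // Cached results, keyed by partition index.
    public final Map<Integer, List<String>> cache = new HashMap<>();
    public int computations = 0;

    // The lineage: source partition -> filter(contains "error") -> map(third field).
    private List<String> compute(int partition) {
        computations++;
        return SOURCE.get(partition).stream()
            .filter(line -> line.contains("error"))
            .map(line -> line.split("\t")[2])
            .collect(Collectors.toList());
    }

    // Serve from cache when possible; otherwise replay the lineage for that partition only.
    public List<String> getPartition(int partition) {
        return cache.computeIfAbsent(partition, this::compute);
    }

    public static void main(String[] args) {
        LineageSketch msgs = new LineageSketch();
        System.out.println(msgs.getPartition(0)); // [disk full]
        System.out.println(msgs.getPartition(1)); // [timeout]
        msgs.cache.remove(1);                     // simulate losing the node holding partition 1
        System.out.println(msgs.getPartition(1)); // [timeout], rebuilt from lineage
        System.out.println(msgs.computations);    // 3: partition 0 was never recomputed
    }
}
```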
Source: http://spark.apache.org/
Optimizing Dataflow 
Source: Aaron Davidson of Databricks
RDDs are Foundational 
• General purpose enough to implement other programming models
• SQL 
• Streaming 
• Machine Learning 
• Graph 
Spark Ecosystem 
Spark SQL 
• Models RDDs as relations 
• SchemaRDD 
• Replaces Shark 
• Lighter weight version with no code from Hive 
• Import/Export in different Storage formats 
• Parquet, learn schema from existing Hive warehouse 
JavaRDD<Person> people = ctx.textFile("people.txt").map(…);
JavaSchemaRDD schemaPeople = sqlCtx.applySchema(people, Person.class);
schemaPeople.registerAsTable("people");
JavaSchemaRDD teens = sqlCtx.sql("SELECT name FROM people WHERE age >= 13 AND age <= 19");
Streaming 
• Extend Spark to do large scale stream processing 
• 100s of nodes with second scale end to end latency 
• Simple, batch like API with RDDs 
• Input is broken up into micro-batches that become RDDs 
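The micro-batch idea can be sketched without Spark: incoming events are grouped by fixed time interval, and each group is then processed like a small batch. The class MicroBatchSketch, the one-second interval, and the event data are assumptions made for illustration, not Spark Streaming's implementation.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Illustration only: slice a stream of timestamped events into fixed-interval batches.
public class MicroBatchSketch {

    // Returns batches keyed by the start time (in ms) of the interval each event falls into.
    public static Map<Long, List<String>> toBatches(long[] timestamps, String[] values,
                                                    long batchMillis) {
        Map<Long, List<String>> batches = new TreeMap<>();
        for (int i = 0; i < timestamps.length; i++) {
            // Round the timestamp down to the start of its interval.
            long batchStart = (timestamps[i] / batchMillis) * batchMillis;
            batches.computeIfAbsent(batchStart, k -> new ArrayList<>()).add(values[i]);
        }
        return batches;
    }

    public static void main(String[] args) {
        long[] timestamps = {100, 900, 1200, 2500};   // made-up arrival times in ms
        String[] values = {"a", "b", "c", "d"};
        // Each entry below plays the role of one small batch handed to the batch engine.
        System.out.println(toBatches(timestamps, values, 1000L));
        // {0=[a, b], 1000=[c], 2000=[d]}
    }
}
```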
Image from http://spark.apache.org/
Streaming 
• DStream is the primary construct 
• Sources: HDFS, Flume, Kafka, Twitter, ZeroMQ, Custom 
• Raw data needs to be replicated in-memory for FT 
• Other features 
• Window-based Transformations 
• Arbitrary join of streams 
JavaStreamingContext ssc = new JavaStreamingContext(sparkConf, …); 
JavaReceiverInputDStream<String> lines = ssc.socketTextStream(…) 
JavaDStream<String> words = lines.flatMap(…) 
JavaPairDStream<String, Integer> wordCounts = words.mapToPair(…) 
MLbase (“Young Project”) 
• Machine Learning toolset 
• Library and higher level abstractions 
• General tool in this space is MATLAB
• Difficult for end users to learn, debug, scale solutions 
• Starting with MLlib 
• Low level Distributed Machine Learning Library 
• Many different Algorithms 
• Classification, Regression, Collaborative Filtering, etc. 
GraphX (alpha) 
• Graph processing library 
• Replaces Spark Bagel 
• Graph Parallel not Data Parallel 
• Reason in the context of neighbors 
• GraphLab API 
• Graph Creation => Algorithm => Post Processing 
• Existing systems mainly deal with the Algorithm and not interactive 
• Unify collection and graph models 
Image from http://spark.apache.org/
Others 
• Mesos 
• Enable multiple frameworks to share same cluster resources 
• Twitter is largest user: Over 6,000 servers 
• Tachyon 
• In-memory, fault tolerant file system that exposes an HDFS-compatible interface
• Catalyst 
• SQL Query Optimizer 
Spark and Spring 
Sample App: Rocket Telemetry 
• Rockets generate data, and we want to understand it 
• Batch processing to look for patterns across flights 
• Streaming for watching it happen and alerting 
• Boot, Java Config, MVC, etc. 
WHY? 
• Similar to Telematics 
• Very important to Auto Insurance industry 
• It’s my friend’s project
• It’s Real (model) rocket data! 
Basics 
• Spark’s a library, so just include it 
• Some lib conflicts, but not much 
• Logging loop 
• Packaging not fun 
• Have to exclude the Spark and Hadoop clients IF the app runs on a cluster, as they’re provided by the runtime
• mvn “shade” plugin, gradle being a pain 
• Executable Boot jars don’t just run on the Spark cluster 
Demo 
Show us some code already! 
Spark and Spring XD 
• Two different problems in Enterprise data 
• Primary data pipeline(s) 
• 24/7/365 rock solid 
• Operations oriented 
• Well defined transformations and routing rules with long term deployment 
• Data analysis 
• Batch and realtime aspects 
• Transformation and processing exploration 
• Frequently short term deployment 
• Should not impact stability or operations of primary pipeline 
Pretty Picture 
[Diagram. Labels: Source (x3), Primary Stream Processing, Transform / Filter, Sink, Stable Storage (HDFS), Batch Analysis, Stream Analysis, Operational Data (Redis, Gem), Application]
A bit on Internals 
About this Sample 
I can’t come up with a better example, so I use this one from Aaron 
Davidson of Databricks. This is a summary from his slides, and my 
notes from his talk at Spark Summit. All the images are from his 
deck. For more detail I highly recommend: 
http://spark-summit.org/2014/talk/a-deeper-understanding-of-spark-internals 
Sample 
What happens 
• Create RDDs 
• Pipeline operations as much as possible
• When a result doesn’t depend on other results, we can pipeline
• But, when data needs to be reorganized, no longer pipeline 
• Stage is a merged operation 
• Each stage gets a set of tasks 
• Task is data and computation 
RDDs and Stages 
Tasks 
Stages running 
• The number of partitions matters for concurrency
• Rule of thumb is at least 2x the number of cores
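A small arithmetic sketch of why the partition count matters, assuming one task per partition so that a stage can never keep more cores busy than it has partitions. The class PartitionRuleOfThumb and the 16-core cluster are hypothetical.

```java
// Illustration only: back-of-the-envelope numbers for the partition rule of thumb.
public class PartitionRuleOfThumb {

    // With one task per partition, at most min(partitions, cores) cores work at once.
    public static int busyCores(int partitions, int cores) {
        return Math.min(partitions, cores);
    }

    // The rule of thumb from the slide: at least twice as many partitions as cores.
    public static int suggestedMinPartitions(int cores) {
        return 2 * cores;
    }

    // How many rounds of tasks a stage needs (ceiling division).
    public static int waves(int partitions, int cores) {
        return (partitions + cores - 1) / cores;
    }

    public static void main(String[] args) {
        int cores = 16; // hypothetical cluster size
        System.out.println(busyCores(8, cores));                         // 8, so half the cores sit idle
        System.out.println(suggestedMinPartitions(cores));               // 32
        System.out.println(waves(suggestedMinPartitions(cores), cores)); // 2
    }
}
```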
The Shuffle 
• Redistributes data among partitions 
• Hash keys into buckets 
• Pull not push 
• Writes intermediate files to disk
• Becoming pluggable
• Optimizations:
– Avoided when possible, if data is already properly partitioned
– Partial aggregation reduces data movement
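The two ideas on this slide, hashing keys into buckets and partially aggregating before data moves, can be sketched in plain Java. This is not Spark's shuffle code: the class ShuffleSketch, the modulo-hash bucketing, and the sample words are assumptions chosen for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Illustration only: word counts shuffled into hash buckets with map-side partial aggregation.
public class ShuffleSketch {

    // Pick the bucket (output partition) for a key: non-negative hash modulo bucket count.
    public static int bucketFor(String key, int numBuckets) {
        return Math.floorMod(key.hashCode(), numBuckets);
    }

    // Map side: combine values per key inside one input partition before anything is moved.
    public static Map<String, Integer> partialCounts(List<String> words) {
        Map<String, Integer> counts = new HashMap<>();
        for (String word : words) {
            counts.merge(word, 1, Integer::sum);
        }
        return counts;
    }

    // Reduce side: each bucket merges the partial counts for the keys that hash to it.
    public static List<Map<String, Integer>> shuffle(List<List<String>> inputPartitions,
                                                     int numBuckets) {
        List<Map<String, Integer>> buckets = new ArrayList<>();
        for (int i = 0; i < numBuckets; i++) {
            buckets.add(new TreeMap<String, Integer>());
        }
        for (List<String> partition : inputPartitions) {
            for (Map.Entry<String, Integer> e : partialCounts(partition).entrySet()) {
                buckets.get(bucketFor(e.getKey(), numBuckets))
                       .merge(e.getKey(), e.getValue(), Integer::sum);
            }
        }
        return buckets;
    }

    public static void main(String[] args) {
        List<List<String>> input = Arrays.asList(
            Arrays.asList("a", "b", "a"),   // sends 2 records (a=2, b=1) instead of 3
            Arrays.asList("b", "c", "a"));
        System.out.println(shuffle(input, 2)); // [{b=2}, {a=3, c=1}]
    }
}
```

All records for a given key land in the same bucket, which is what lets the final per-key merge happen without further data movement.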
Other thoughts on Memory
• By default Spark (assumes it) owns 90% of the memory 
• Partitions don’t have to fit in memory, but some things do 
• E.g., values for large sets in groupBys must fit in memory
• Shuffle memory is 20% 
• If it goes over that, it’ll spill the data to disk 
• Shuffle always writes to disk 
• Turn on compression to keep objects serialized 
• Saves space, but takes compute to serialize/de-serialize 
This and That 
Release cycle 
• 1.0 Came out at end of May 
• 1.X expected to be current for several years 
• API Stability in 1.X for all non-Alpha projects 
• Can recompile jobs, but hoping for binary compatibility 
• Internal APIs are marked @DeveloperApi or @Experimental
• Plan (was?) for quarterly .X release cycle 
• 2 mo dev / 1 mo QA 
• 1.0.1 July, 1.0.2 August 
Resources 
Main spark page 
• http://spark.apache.org/ 
An initial paper on Spark 
• https://www.usenix.org/system/files/conference/nsdi12/nsdi12-final138.pdf 
Demo code for this session 
• https://github.com/SpringOne2GX-2014/SparkForSpring 
Upcoming 
• Blog post on executing Spring based Spark apps on clusters 
(Spark native, YARN, and Mesos) 
• Sample app with SpringXD as a source and Spark Streaming as 
a processor 
Thanks! :)
Misc 
Abstract 
Apache Spark is one of the most exciting, active, and talked about 
ASF projects today, but how should Spring developers and 
enterprise architects view it? Is it the second coming of the Bean 
spec, or just another shiny distraction? This talk will introduce Spark 
and its core concepts, the ecosystem of services on top of it, types 
of problems it can solve, similarities and differences from Hadoop, 
integration with Spring XD, deployment topologies, and an 
exploration of uses in enterprise. Concepts will be illustrated with 
several demos covering: the programming model with Spring/Java8, 
development experience, “realistic” infrastructure simulation with 
local virtual deployments, and Spark cluster monitoring tools. 
Bio 
A self-described Plain Old Java Geek, Scott Deeg began his journey with
Java in 1996 as a member of the Visual Café team at Symantec. From 
there he worked primarily as a consultant and solution architect dealing 
with enterprise Java applications. He joined VMware in 2009 and is now a
part of the EMC/VMware spin-out Pivotal where he continues to work with
large enterprises on their application platform and data needs. A big fan of 
open source software and technology, he tries to occasionally get out of 
the corporate world to talk about interesting things happening in the Java/ 
OSS community. 
Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...
Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...
gajnagarg
 
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
ZurliaSoop
 

Último (20)

SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...
SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...
SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...
 
Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...
 
RESEARCH-FINAL-DEFENSE-PPT-TEMPLATE.pptx
RESEARCH-FINAL-DEFENSE-PPT-TEMPLATE.pptxRESEARCH-FINAL-DEFENSE-PPT-TEMPLATE.pptx
RESEARCH-FINAL-DEFENSE-PPT-TEMPLATE.pptx
 
Top profile Call Girls In Indore [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Indore [ 7014168258 ] Call Me For Genuine Models We...Top profile Call Girls In Indore [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Indore [ 7014168258 ] Call Me For Genuine Models We...
 
Jodhpur Park | Call Girls in Kolkata Phone No 8005736733 Elite Escort Service...
Jodhpur Park | Call Girls in Kolkata Phone No 8005736733 Elite Escort Service...Jodhpur Park | Call Girls in Kolkata Phone No 8005736733 Elite Escort Service...
Jodhpur Park | Call Girls in Kolkata Phone No 8005736733 Elite Escort Service...
 
TrafficWave Generator Will Instantly drive targeted and engaging traffic back...
TrafficWave Generator Will Instantly drive targeted and engaging traffic back...TrafficWave Generator Will Instantly drive targeted and engaging traffic back...
TrafficWave Generator Will Instantly drive targeted and engaging traffic back...
 
Discover Why Less is More in B2B Research
Discover Why Less is More in B2B ResearchDiscover Why Less is More in B2B Research
Discover Why Less is More in B2B Research
 
High Profile Call Girls Service in Jalore { 9332606886 } VVIP NISHA Call Girl...
High Profile Call Girls Service in Jalore { 9332606886 } VVIP NISHA Call Girl...High Profile Call Girls Service in Jalore { 9332606886 } VVIP NISHA Call Girl...
High Profile Call Girls Service in Jalore { 9332606886 } VVIP NISHA Call Girl...
 
Top profile Call Girls In Vadodara [ 7014168258 ] Call Me For Genuine Models ...
Top profile Call Girls In Vadodara [ 7014168258 ] Call Me For Genuine Models ...Top profile Call Girls In Vadodara [ 7014168258 ] Call Me For Genuine Models ...
Top profile Call Girls In Vadodara [ 7014168258 ] Call Me For Genuine Models ...
 
Top profile Call Girls In Bihar Sharif [ 7014168258 ] Call Me For Genuine Mod...
Top profile Call Girls In Bihar Sharif [ 7014168258 ] Call Me For Genuine Mod...Top profile Call Girls In Bihar Sharif [ 7014168258 ] Call Me For Genuine Mod...
Top profile Call Girls In Bihar Sharif [ 7014168258 ] Call Me For Genuine Mod...
 
Gartner's Data Analytics Maturity Model.pptx
Gartner's Data Analytics Maturity Model.pptxGartner's Data Analytics Maturity Model.pptx
Gartner's Data Analytics Maturity Model.pptx
 
如何办理英国诺森比亚大学毕业证(NU毕业证书)成绩单原件一模一样
如何办理英国诺森比亚大学毕业证(NU毕业证书)成绩单原件一模一样如何办理英国诺森比亚大学毕业证(NU毕业证书)成绩单原件一模一样
如何办理英国诺森比亚大学毕业证(NU毕业证书)成绩单原件一模一样
 
+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...
+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...
+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...
 
Lecture_2_Deep_Learning_Overview-newone1
Lecture_2_Deep_Learning_Overview-newone1Lecture_2_Deep_Learning_Overview-newone1
Lecture_2_Deep_Learning_Overview-newone1
 
Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...
Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...
Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...
 
Fun all Day Call Girls in Jaipur 9332606886 High Profile Call Girls You Ca...
Fun all Day Call Girls in Jaipur   9332606886  High Profile Call Girls You Ca...Fun all Day Call Girls in Jaipur   9332606886  High Profile Call Girls You Ca...
Fun all Day Call Girls in Jaipur 9332606886 High Profile Call Girls You Ca...
 
DATA SUMMIT 24 Building Real-Time Pipelines With FLaNK
DATA SUMMIT 24  Building Real-Time Pipelines With FLaNKDATA SUMMIT 24  Building Real-Time Pipelines With FLaNK
DATA SUMMIT 24 Building Real-Time Pipelines With FLaNK
 
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
 
Kings of Saudi Arabia, information about them
Kings of Saudi Arabia, information about themKings of Saudi Arabia, information about them
Kings of Saudi Arabia, information about them
 
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
 

Spark forspringdevs springone_final

  • 1. Introduction to Apache Spark Scott Deeg – Sr. Field Engineer, Pivotal Unless otherwise indicated, these slides are © 2013-2014 Pivotal Software, Inc. and licensed under a Creative Commons Attribution-NonCommercial license: http://creativecommons.org/licenses/by-nc/3.0/
  • 2. Who Am I? A Plain Old Java Geek • Came to Si Valley seeking fame and fortune in 1995 (still looking) • Started working in Java Jan 1996, Symantec Visual Café 1.0 • Hacker on J2EE based BPM product for 10 years • Joined VMware 2009 / Rolled into Pivotal April 1 2013 • Primarily pre-sales consulting for large/medium enterprises sdeeg@pivotal.io Random Facts: CalPoly SLO, Physics, Guitar/Lutherie, Arduino, 3yr old boy, 100 yr old house (aka: Lots’O’work), spaces not tabs
  • 3. Agenda • What is Spark? • Programming Model • Product ecosystem • Spark and Spring • A bit on Internals (with demos along the way)
  • 4. What people have been asking me about Spark • It’s one of those in memory things, right (yes) • Is it “Big Data” (yes) • Is it Hadoop (no) • JVM, Java, Scala (yes) • Is it “Real” or just another shiny technology with a long, but ultimately small tail (?)
  • 5. What is Spark?
  • 6. Official Definition Apache Spark is a fast and general engine for large scale data processing
  • 7. Spark is … • Distributed/Cluster Compute Engine • A toolset for Data Scientists / Analysts • Runs “batch” workloads in memory • Hadoop Compatible • Implementation of Resilient Distributed Dataset (RDD) in Scala • Programmatic interface via API or Interactive • Scala, Java7/8, Python
  • 8. Spark is also … • An ASF Top Level project http://spark.apache.org • Came out of AMPLab project at UCB • An active community • ~100-200 contributors across 25-35 companies • More active than Hadoop MapReduce • 1000 people (max) attended Spark Summit 2014 in SF • An eco-system of domain specific tools • Different models, but interoperable • Backed by a commercial entity: Databricks
  • 9. Spark is not … • An OLTP data store • A permanent or stable data store • An app cache It’s also not Mature • Lots of room to grow.
  • 10. Short History • 2009 Started as research project at UCB • 2010 Open Sourced • January 2011 AMPLab Created • October 2012 version 0.6 • Java, Stand alone cluster, maven • June 21 2013 Spark accepted into ASF Incubator • Feb 27 2014 Spark becomes top level ASF project • May 30 2014 Spark 1.0 • August 2014 1.0.2
  • 11. Spark Team Goals • Make life easy and productive for Data Scientists • Provide well documented and expressive APIs • Powerful Domain Specific Libraries • Easy integration with common Big Data storage systems • High Performance • Well defined releases, stable API
  • 12. Spark is not Hadoop, but is compatible • Often better than Hadoop • M/R fine for “Data Parallel”, but awkward for some workloads • Low latency, Iterative, Streaming • Natively accesses Hadoop data • Spark is YAYJ (Yet Another YARN Job) • Utilize current investments in Hadoop • Brings Spark (closer) to the Data • Similar scalability and fault tolerance characteristics as Hadoop It’s not OR … it’s AND
  • 13. Improvements over Map/Reduce • Efficiency • General Execution Graphs (not just map->reduce->store) • In memory • Useful for iterative processing • Usability • Rich APIs in Scala, Java, Python • Interactive REPL Can Spark be the R for Big Data?
  • 14. Topologies • Local in JVM or through REPL • Great for dev • Spark Cluster (master/slaves) • Improving rapidly • Cluster Resource Managers • YARN • MESOS • (PaaS?)
  • 15. Spark Programming Model
  • 16. Core Spark Concept In the Spark model a program is a set of transformations and actions on a dataset with the following properties: Resilient Distributed Dataset (RDD) • Read Only Collection of Objects spread across a cluster • RDDs are built through parallel transformations (map, filter, …) • Results are generated by actions (reduce, collect, …) • Automatically rebuilt on failure using lineage • Controllable persistence (RAM, HDFS, etc.)
  • 17. Two Categories of Operations • Transform • Create from stable storage (hdfs, tachyon, etc.) • Generate new RDDs from other RDDs • Lazy Operations that build a DAG • Once Spark knows your transformations it can build a plan • Action • Return a result or write to storage • Actions cause the DAG to execute Operations listed on the slide: map, filter, flatMap, sample, groupByKey, reduceByKey, union, join, sort (transformations); count, collect, reduce, lookup, save (actions)
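The lazy-transformation / eager-action split described on slides 16 and 17 can be illustrated without a cluster. The sketch below is only an analogy: it uses the JDK's `java.util.stream`, not Spark, and the class name `LazyPipelineSketch` and its methods are hypothetical. Intermediate stream operations play the role of transformations (they only record the plan), and the terminal `collect` plays the role of an action (it triggers execution).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Analogy only: plain JDK streams standing in for RDD transformations and actions.
class LazyPipelineSketch {
    // Records which steps have actually executed, and in what order.
    static final List<String> log = new ArrayList<>();

    // "Transformations": building the pipeline runs nothing, it only describes the plan.
    static Stream<Integer> buildPlan(List<Integer> data) {
        return data.stream()
                .filter(x -> { log.add("filter " + x); return x % 2 == 0; })
                .map(x -> { log.add("map " + x); return x * 10; });
    }

    // "Action": the terminal operation causes the whole plan to execute.
    static List<Integer> runAction(Stream<Integer> plan) {
        return plan.collect(Collectors.toList());
    }

    public static void main(String[] args) {
        Stream<Integer> plan = buildPlan(Arrays.asList(1, 2, 3, 4));
        System.out.println("steps executed before the action: " + log.size());
        System.out.println("result: " + runAction(plan));
        System.out.println("steps executed after the action: " + log.size());
    }
}
```

Because nothing runs until the action, the engine sees the whole chain before executing it, which is what lets Spark build a plan from the DAG.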
  • 18. Demo WordCount (of course) val file = sc.textFile("hdfs://bfm1/…"); val words = file.flatMap(line => line.split(" ")); val wordOneMap = words.map(word => (word, 1)); val counts = wordOneMap.reduceByKey(_ + _); counts.collect()
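For readers coming from Java, the same three steps (split lines into words, pair each word with 1, sum per key) can be tried on a local list with nothing but the JDK. This is a rough single-machine analogue, not the Spark API: `WordCountSketch` is a hypothetical name, and `Collectors.toMap` with a merge function stands in for `map` followed by `reduceByKey`.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

// Single-JVM analogue of the WordCount demo; no Spark involved.
class WordCountSketch {
    static Map<String, Integer> wordCount(List<String> lines) {
        return lines.stream()
                // flatMap: one line becomes many words
                .flatMap(line -> Arrays.stream(line.split(" ")))
                // map to (word, 1) and sum the ones per word, like reduceByKey(_ + _)
                .collect(Collectors.toMap(word -> word, word -> 1, Integer::sum, TreeMap::new));
    }

    public static void main(String[] args) {
        System.out.println(wordCount(Arrays.asList("to be or", "not to be")));
    }
}
```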
  • 19. RDD Fault Tolerance • RDDs maintain lineage information that can be used to reconstruct lost partitions cachedMsgs = textFile(...).filter(_.contains("error")).map(_.split('\t')(2)).cache() Lineage: HdfsRDD (path: hdfs://…) → FilteredRDD (func: contains(...)) → MappedRDD (func: split(…)) → CachedRDD Source: http://spark.apache.org/
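The idea of rebuilding a lost partition from lineage can be sketched with a toy class. Everything here is hypothetical and greatly simplified (`ToyRdd`, `losePartition`, and `computeCalls` are invented for illustration, and every dataset caches its partitions); it is not how Spark implements RDDs. Each dataset remembers only how to compute partition `i` from its parent, so a dropped cache entry is recomputed on the next access.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;
import java.util.function.Predicate;
import java.util.stream.Collectors;

// Toy model of lineage-based recovery; names and structure are illustrative only.
class LineageSketch {
    static class ToyRdd<T> {
        private final int numPartitions;
        private final Function<Integer, List<T>> compute; // lineage: how to build partition i
        private final List<List<T>> cache = new ArrayList<>(); // null entry = not computed or lost
        int computeCalls = 0; // how many times a partition was (re)built

        ToyRdd(int numPartitions, Function<Integer, List<T>> compute) {
            this.numPartitions = numPartitions;
            this.compute = compute;
            for (int i = 0; i < numPartitions; i++) cache.add(null);
        }

        // A source dataset simply reads its partitions from "stable storage".
        static <T> ToyRdd<T> fromPartitions(List<List<T>> parts) {
            return new ToyRdd<>(parts.size(), parts::get);
        }

        ToyRdd<T> filter(Predicate<T> p) {
            return new ToyRdd<>(numPartitions,
                    i -> partition(i).stream().filter(p).collect(Collectors.toList()));
        }

        <R> ToyRdd<R> map(Function<T, R> f) {
            return new ToyRdd<>(numPartitions,
                    i -> partition(i).stream().map(f).collect(Collectors.toList()));
        }

        // Returns the cached partition, rebuilding it from lineage if it is missing.
        List<T> partition(int i) {
            if (cache.get(i) == null) {
                computeCalls++;
                cache.set(i, compute.apply(i));
            }
            return cache.get(i);
        }

        // Simulates losing a cached partition, for example when a node fails.
        void losePartition(int i) {
            cache.set(i, null);
        }
    }

    // Mirrors the slide's chain: filter lines containing "error", keep the third tab-separated field.
    static ToyRdd<String> errorMessages(List<List<String>> logPartitions) {
        return ToyRdd.fromPartitions(logPartitions)
                .filter(line -> line.contains("error"))
                .map(line -> line.split("\t")[2]);
    }

    public static void main(String[] args) {
        ToyRdd<String> msgs = errorMessages(Arrays.asList(
                Arrays.asList("t1\terror\tdisk full", "t2\tinfo\tok"),
                Arrays.asList("t3\terror\ttimeout")));
        System.out.println(msgs.partition(1));
        msgs.losePartition(1);
        System.out.println(msgs.partition(1)); // rebuilt from lineage, same contents
    }
}
```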
  • 20. Optimizing Dataflow Source: Aaron Davidson of Databricks
  • 21. RDDs are Foundational • General purpose enough to use to implement other programming models • SQL • Streaming • Machine Learning • Graph
  • 22. Spark Ecosystem
  • 23. Spark SQL • Models RDDs as relations • SchemaRDD • Replaces Shark • Lighter weight version with no code from Hive • Import/Export in different Storage formats • Parquet, learn schema from existing Hive warehouse JavaRDD<Person> people = ctx.textFile("people.txt").map(…); JavaSchemaRDD schemaPeople = sqlCtx.applySchema(people, Person.class); schemaPeople.registerAsTable("people"); JavaSchemaRDD teens = sqlCtx.sql("SELECT name FROM people WHERE age >= 13");
  • 24. Streaming • Extend Spark to do large scale stream processing • 100s of nodes with second scale end to end latency • Simple, batch like API with RDDs • Input is broken up into micro-batches that become RDDs Image from http://spark.apache.org/
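The micro-batch idea on slide 24 amounts to cutting a continuous input into fixed time windows and handing each window to the batch engine. A minimal sketch of that cutting step, assuming non-negative millisecond timestamps; the `Event` type, `toBatches`, and the one-second interval are hypothetical and are not Spark Streaming classes:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Illustrative only: groups timestamped events into fixed-size time windows.
class MicroBatchSketch {
    static class Event {
        final long timestampMillis;
        final String payload;

        Event(long timestampMillis, String payload) {
            this.timestampMillis = timestampMillis;
            this.payload = payload;
        }
    }

    // Key = start time of the batch window, value = payloads that arrived inside it.
    static Map<Long, List<String>> toBatches(List<Event> events, long batchMillis) {
        Map<Long, List<String>> batches = new TreeMap<>();
        for (Event e : events) {
            long batchStart = (e.timestampMillis / batchMillis) * batchMillis;
            batches.computeIfAbsent(batchStart, k -> new ArrayList<>()).add(e.payload);
        }
        return batches;
    }

    public static void main(String[] args) {
        List<Event> events = Arrays.asList(
                new Event(100, "a"), new Event(900, "b"),
                new Event(1200, "c"), new Event(2500, "d"));
        // Each map entry would become one small batch dataset to process.
        System.out.println(toBatches(events, 1000));
    }
}
```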
  • 25. Streaming • DStream is the primary construct • Sources: HDFS, Flume, Kafka, Twitter, ZeroMQ, Custom • Raw data needs to be replicated in-memory for FT • Other features • Window-based Transformations • Arbitrary join of streams JavaStreamingContext ssc = new JavaStreamingContext(sparkConf, …); JavaReceiverInputDStream<String> lines = ssc.socketTextStream(…); JavaDStream<String> words = lines.flatMap(…); JavaPairDStream<String, Integer> wordCounts = words.mapToPair(…);
  • 26. MLbase (“Young Project”) • Machine Learning toolset • Library and higher level abstractions • General tool in space is MATLAB • Difficult for end users to learn, debug, scale solutions • Starting with MLlib • Low level Distributed Machine Learning Library • Many different Algorithms • Classification, Regression, Collaborative Filtering, etc.
  • 27. GraphX (alpha) • Graph processing library • Replaces Spark Bagel • Graph Parallel not Data Parallel • Reason in the context of neighbors • GraphLab API • Graph Creation => Algorithm => Post Processing • Existing systems mainly deal with the Algorithm and not interactive • Unify collection and graph models Image from http://spark.apache.org/
  • 28. Others • Mesos • Enable multiple frameworks to share same cluster resources • Twitter is largest user: Over 6,000 servers • Tachyon • In-memory, fault tolerant file system that exposes HDFS • Catalyst • SQL Query Optimizer
  • 29. Spark and Spring
  • 30. Sample App: Rocket Telemetry • Rockets generate data, and we want to understand it • Batch processing to look for patterns across flights • Streaming for watching it happen and alerting • Boot, Java Config, MVC, etc. WHY? • Similar to Telematics • Very important to Auto Insurance industry • It’s my friend’s project • It’s Real (model) rocket data!
  • 31. Basics • Spark’s a library, so just include it • Some lib conflicts, but not much • Logging loop • Packaging not fun • Have to exclude spark and hadoop clients IF they’re running on a cluster, as they’re provided by the runtime • mvn “shade” plugin, gradle being a pain • Executable Boot jars don’t just run on the Spark cluster
  • 32. Demo Show us some code already!
  • 33. Spark and Spring XD • Two different problems in Enterprise data • Primary data pipeline(s) • 24/7/365 rock solid • Operations oriented • Well defined transformations and routing rules with long term deployment • Data analysis • Batch and realtime aspects • Transformation and processing exploration • Frequently short term deployment • Should not impact stability or operations of primary pipeline
  • 34. Pretty Picture [Diagram. Labels: Source (three inputs), Primary Stream Processing, Transform / Filter, Sink, Application, Stable Storage (HDFS), Batch Analysis, Stream Analysis, Operational Data (Redis, Gem)]
  • 35. A bit on Internals
  • 36. About this Sample I can’t come up with a better example, so I use this one from Aaron Davidson of Databricks. This is a summary from his slides, and my notes from his talk at Spark Summit. All the images are from his deck. For more detail I highly recommend: http://spark-summit.org/2014/talk/a-deeper-understanding-of-spark-internals
  • 37. Sample
  • 38. What happens • Create RDDs • Pipeline operations as much as possible • When a result doesn’t depend on other results, we can pipeline • But, when data needs to be reorganized, no longer pipeline • Stage is a merged operation • Each stage gets a set of tasks • Task is data and computation
  • 39. RDDs and Stages
  • 40. Tasks
  • 41. Stages running • Number of partitions matters for concurrency • Rule of thumb is at least 2x number of cores
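Why the partition count matters for concurrency can be seen with a little arithmetic. The sketch below assumes a simplified model in which each task processes one partition and occupies one core at a time; the class and method names are hypothetical. With fewer partitions than cores, some cores sit idle; with more, the stage runs in several waves.

```java
// Back-of-the-envelope model: one task per partition, one core per running task.
class PartitionSizingSketch {
    // The slide's rule of thumb: at least twice as many partitions as cores.
    static int minSuggestedPartitions(int totalCores) {
        return 2 * totalCores;
    }

    // Cores that can be busy at once during a stage.
    static int busyCores(int partitions, int totalCores) {
        return Math.min(partitions, totalCores);
    }

    // Number of rounds of tasks needed to finish the stage (ceiling division).
    static int waves(int partitions, int totalCores) {
        return (partitions + totalCores - 1) / totalCores;
    }

    public static void main(String[] args) {
        int cores = 8;
        System.out.println("4 partitions keep " + busyCores(4, cores) + " of " + cores + " cores busy");
        System.out.println("suggested minimum partitions: " + minSuggestedPartitions(cores));
        System.out.println("waves with 16 partitions: " + waves(16, cores));
    }
}
```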
  • 42. The Shuffle • Redistributes data among partitions • Hash keys into buckets • Pull not push • Writes intermediate files to disk • Becoming pluggable • Optimizations: • Avoided when possible, if data is already properly partitioned • Partial aggregation reduces data movement
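The two shuffle ideas above, hashing keys into buckets and aggregating partially before data moves, can be sketched in plain Java. This is an in-memory illustration under simplifying assumptions (no disk files, no network, hypothetical names `ShuffleSketch`, `mapSide`, `reduceSide`). The bucket is chosen as a non-negative modulus of the key's `hashCode`, which is the general approach a hash partitioner takes.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// In-memory illustration of hash bucketing plus map-side partial aggregation.
class ShuffleSketch {
    // Non-negative modulus, so keys with negative hash codes still land in a valid bucket.
    static int bucketFor(String key, int numBuckets) {
        return Math.floorMod(key.hashCode(), numBuckets);
    }

    // Map side: count words within one input partition, already split per target bucket.
    // Partial aggregation means "a, a" travels as one (a, 2) record rather than two (a, 1) records.
    static List<Map<String, Integer>> mapSide(List<String> words, int numBuckets) {
        List<Map<String, Integer>> buckets = new ArrayList<>();
        for (int b = 0; b < numBuckets; b++) buckets.add(new HashMap<>());
        for (String w : words) {
            buckets.get(bucketFor(w, numBuckets)).merge(w, 1, Integer::sum);
        }
        return buckets;
    }

    // Reduce side: one bucket pulls its slice from every map output and merges the partial counts.
    static Map<String, Integer> reduceSide(List<List<Map<String, Integer>>> mapOutputs, int bucket) {
        Map<String, Integer> merged = new HashMap<>();
        for (List<Map<String, Integer>> output : mapOutputs) {
            output.get(bucket).forEach((k, v) -> merged.merge(k, v, Integer::sum));
        }
        return merged;
    }

    public static void main(String[] args) {
        int numBuckets = 2;
        List<List<Map<String, Integer>>> mapOutputs = Arrays.asList(
                mapSide(Arrays.asList("a", "b", "a"), numBuckets),
                mapSide(Arrays.asList("b", "a"), numBuckets));
        for (int b = 0; b < numBuckets; b++) {
            System.out.println("bucket " + b + ": " + reduceSide(mapOutputs, b));
        }
    }
}
```

Because every occurrence of a key hashes to the same bucket, each reducer ends up with the complete count for its keys.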
  • 43. Other thoughts on Memory • By default Spark (assumes it) owns 90% of the memory • Partitions don’t have to fit in memory, but some things do • EG: values for large sets in groupBy’s must fit in memory • Shuffle memory is 20% • If it goes over that, it’ll spill the data to disk • Shuffle always writes to disk • Turn on compression to keep objects serialized • Saves space, but takes compute to serialize/de-serialize
  • 44. This and That
  • 45. Release cycle • 1.0 Came out at end of May • 1.X expected to be current for several years • API Stability in 1.X for all non-Alpha projects • Can recompile jobs, but hoping for binary compatibility • Internal APIs are marked @DeveloperApi or @Experimental • Plan (was?) for quarterly .X release cycle • 2 mo dev / 1 mo QA • 1.0.1 July, 1.0.2 August
  • 46. Resources Main spark page • http://spark.apache.org/ An initial paper on Spark • https://www.usenix.org/system/files/conference/nsdi12/nsdi12-final138.pdf Demo code for this session • https://github.com/SpringOne2GX-2014/SparkForSpring
  • 47. Upcoming • Blog post on executing Spring based Spark apps on clusters (Spark native, YARN, and Mesos) • Sample app with SpringXD as a source and Spark Streaming as a processor
  • 48. Thanks! :)
  • 49. Misc
  • 50. Abstract Apache Spark is one of the most exciting, active, and talked about ASF projects today, but how should Spring developers and enterprise architects view it? Is it the second coming of the Bean spec, or just another shiny distraction? This talk will introduce Spark and its core concepts, the ecosystem of services on top of it, types of problems it can solve, similarities and differences from Hadoop, integration with Spring XD, deployment topologies, and an exploration of uses in enterprise. Concepts will be illustrated with several demos covering: the programming model with Spring/Java8, development experience, “realistic” infrastructure simulation with local virtual deployments, and Spark cluster monitoring tools.
  • 51. Bio A self described Plain Old Java Geek, Scott Deeg began his journey with Java in 1996 as a member of the Visual Café team at Symantec. From there he worked primarily as a consultant and solution architect dealing with enterprise Java applications. He joined VMware in 2009 and is now a part of the EMC/VMware spin out Pivotal where he continues to work with large enterprises on their application platform and data needs. A big fan of open source software and technology, he tries to occasionally get out of the corporate world to talk about interesting things happening in the Java/OSS community.