The Big Boss(tm) has just OKed the first Hadoop cluster in the company. You are the guy in charge of analyzing petabytes of your company's valuable data using a combination of custom MapReduce jobs and SQL-on-Hadoop solutions. All of a sudden the web is full of articles telling you that Hadoop is dead, Spark has won and you should quit while you're still ahead. But should you?
Apache Spark: killer or savior of Apache Hadoop?
1. Apache Spark: killer or savior of Apache Hadoop?
Roman Shaposhnik
Director of Open Source @Pivotal
(Twitter: @rhatr)
2. Who’s this guy?
• Director of Open Source (building a team of OS contributors)
• Apache Software Foundation guy (Member, VP of Apache Incubator, committer on Hadoop, Giraph, Sqoop, etc.)
• Used to be root@Cloudera
• Used to be PHB@Yahoo! (original Hadoop team)
• Used to be a hacker at Sun Microsystems (Sun Studio compilers and tools)
26. Spark philosophy
• Make life easy for Data Scientists
• Provide well documented and expressive APIs
• Powerful Domain Specific Libraries
• Easy integration with storage systems
• Caching to avoid data movement
• Well defined releases, stable API
27. Spark innovations
• Resilient Distributed Datasets (RDDs)
• Distributed on a cluster
• Manipulated via parallel operators (map, etc.)
• Automatically rebuilt on failure
• A parallel ecosystem
• A solution to iterative and multi-stage apps
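The "automatically rebuilt on failure" bullet can be illustrated with a small single-machine sketch in plain Scala. It is not Spark's implementation: the names `Dataset`, `Source`, `Mapped`, `Cached`, and `evict` are hypothetical, and a "failure" is simulated by dropping one cached partition. The point is only that a derived dataset records how it was computed (its lineage), so a lost partition can be recomputed from its parent.

```scala
// Hypothetical sketch of lineage-based recovery; not Spark code.
object LineageSketch {
  // A dataset is described by how to compute each partition.
  sealed trait Dataset[A] {
    def numPartitions: Int
    def compute(partition: Int): Vector[A]
    def map[B](f: A => B): Dataset[B] = Mapped(this, f)
    def collect(): Vector[A] =
      (0 until numPartitions).toVector.flatMap(p => compute(p))
  }

  // Source data in fixed partitions (stands in for blocks on stable storage).
  final case class Source[A](partitions: Vector[Vector[A]]) extends Dataset[A] {
    def numPartitions: Int = partitions.length
    def compute(partition: Int): Vector[A] = partitions(partition)
  }

  // A derived dataset remembers its parent and the function applied to it.
  final case class Mapped[A, B](parent: Dataset[A], f: A => B) extends Dataset[B] {
    def numPartitions: Int = parent.numPartitions
    def compute(partition: Int): Vector[B] = parent.compute(partition).map(f)
  }

  // In-memory cache; a missing partition is recomputed from the parent.
  final class Cached[A](parent: Dataset[A]) extends Dataset[A] {
    private val cache = scala.collection.mutable.Map.empty[Int, Vector[A]]
    def numPartitions: Int = parent.numPartitions
    def compute(partition: Int): Vector[A] =
      cache.getOrElseUpdate(partition, parent.compute(partition))
    // Simulate losing a cached partition (for example, a failed node).
    def evict(partition: Int): Unit = { cache.remove(partition); () }
  }
}
```

Calling `evict(1)` on a `Cached` dataset and then `collect()` again returns the same elements, because the missing partition is rebuilt from the source through the recorded `map`.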
30. How do I use it?
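// Run in spark-shell; `spark` is assumed to be a SparkContext
val file = spark.textFile("hdfs://...")
// Split lines into words, pair each word with 1, then sum the counts per word
val counts = file.flatMap(line => line.split(" "))
                 .map(word => (word, 1))
                 .reduceByKey(_ + _)
counts.saveAsTextFile("hdfs://...")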
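To see what the word count on this slide computes without a cluster, here is a rough local analogue using only standard Scala collections. It assumes nothing about Spark: `WordCountSketch` and `wordCount` are hypothetical names, and `groupBy` followed by a per-key `reduce` stands in for `reduceByKey`.

```scala
// Local, single-machine analogue of the slide's pipeline; not Spark code.
object WordCountSketch {
  def wordCount(lines: Seq[String]): Map[String, Int] =
    lines
      .flatMap(line => line.split(" "))   // lines -> words
      .map(word => (word, 1))             // word -> (word, 1)
      .groupBy(_._1)                      // gather the pairs for each word
      .map { case (word, pairs) =>        // sum the 1s, like reduceByKey(_ + _)
        (word, pairs.map(_._2).reduce(_ + _))
      }
}
```

For example, `WordCountSketch.wordCount(Seq("a b a", "b c"))` gives a map with `a` and `b` counted twice and `c` once. The chain of operators has the same shape as on the slide; Spark's version runs those operators in parallel across partitions.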
32. RDDs are the foundation
• SQL
• Graph
• ML
• Streaming
33. Spark SQL
• Library in Spark Core that models RDDs as relations
• SchemaRDD
• Replaces Shark
• Lightweight with no code from Hive
• Import/Export into different storage formats
• Columnar storage (as in Shark)
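As a loose illustration of "models RDDs as relations", the sketch below treats a plain Scala collection of typed records as a table and expresses a simple query with collection operators. It does not use the Spark SQL API: `RelationSketch`, `Person`, and `teenagerNames` are hypothetical names chosen for the example.

```scala
// Collection of typed rows treated as a relation; not Spark SQL code.
object RelationSketch {
  // The case class plays the role of the table schema.
  final case class Person(name: String, age: Int)

  // Roughly: SELECT name FROM people WHERE age >= 13 AND age <= 19
  def teenagerNames(people: Seq[Person]): Seq[String] =
    people
      .filter(p => p.age >= 13 && p.age <= 19) // WHERE clause
      .map(_.name)                             // SELECT name
}
```

The design idea this is meant to convey is that once rows carry a schema, the same data can be queried either with SQL text or with ordinary operators.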
34. Spark Streaming
• Extend Spark to do large scale stream processing
• Simple, batch like API with RDDs
• Single semantics for both real time and high latency
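The "batch like API" idea can be sketched in plain Scala: cut an incoming stream into small batches and run the same batch logic on each one. This is a simplification and not the Spark Streaming API. `MicroBatchSketch`, `countWords`, and `processStream` are hypothetical names, and batches here are cut by element count, whereas Spark Streaming cuts them by time interval.

```scala
// Micro-batching sketch with standard collections; not Spark Streaming code.
object MicroBatchSketch {
  // Batch logic: count the words in a collection of lines.
  def countWords(lines: Seq[String]): Map[String, Int] =
    lines
      .flatMap(_.split(" "))
      .groupBy(identity)
      .map { case (word, occurrences) => (word, occurrences.size) }

  // "Streaming": group incoming lines into batches and reuse countWords.
  def processStream(stream: Iterator[String], batchSize: Int): Iterator[Map[String, Int]] =
    stream.grouped(batchSize).map(batch => countWords(batch))
}
```

Because `processStream` only calls `countWords`, the code written for a large offline dataset is the same code that runs on each small batch, which is the "single semantics" the slide refers to.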
44. What is *really* going on?
• 2009 Research at UCB, written in Scala
• 2010 Open Sourced
• 2013 Accepted into Apache Incubator
• 2013 Databricks formed ($14M funding)
• 2014 Becomes TLP with ASF
• 2014 Spark 1.0 is out
• 2014 Databricks gets an extra $33M
45. Bigdata: brought to U by ASF
• 50% ML traffic
• 100-200 contributors across 25-35 companies
• More active than Hadoop
• Cross-pollination with other TLPs
54. Hadoop Maturity
• ETL Offload: Accommodate massive data growth with existing EDW investments
• Data Lakes: Unify unstructured and structured data access
• Big Data Apps: Build analytic-led applications impacting top line revenue
• Data-Driven Enterprise: App Dev and Operational Management on HDFS
• Data Architecture
55. Pivotal HD on Pivotal CF
• Enterprise PaaS Management System
• Flexible multi-language ‘buildpack’ architecture
• Deployed applications enjoy built-in services
• On-Premise Hadoop as a Service
• Single cluster deployment of Pivotal HD
• Developers instantly bind to shared Hadoop Clusters
• Speeds up time-to-value
56. Pivotal’s view
[Architecture diagram. Labels: Data Science Platform; Tachyon/Gem; Cluster Manager; MR; Application; Stream; Server; MPP SQL; SparkSQL; MLbase; Streaming; Data Lake / HDFS / Virtual Storage; GemFireXD; Hadoop HDFS; Isilon; ...ETC; Legacy Systems; Data Sources; App Dev / Ops; Data Scientists; End Users]
58. It will be called Hadoop
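[Stack diagram. Legend: ASF Projects, FLOSS Projects, Pivotal Products]
• Storage and resource management: HDFS, YARN, GemFire with Tachyon
• Processing: MapReduce, Tez, Spark (Shark, Streaming, MLlib, GraphX), Giraph, Hamster
• Data access and SQL: Pig, Hive, HBase, Phoenix, Impala, HAWQ, SolrCloud
• Libraries and tools: Crunch, Mahout, MADlib, PivotalR, SpringXD
• Ingest: Sqoop, Flume
• Coordination and workflow management: Zookeeper, Oozie
• Hadoop UI: Hue, Command Center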
59. Spark recap
• Is it “Big Data”? (Yes)
• Is it “Hadoop”? (No)
• It’s one of those “in memory” things, right? (Yes)
• JVM, Java, Scala? (All)
• Is it real, or just another shiny technology with a long but ultimately small tail? (Yes and ?)