Why Apache Spark is the Heir to MapReduce in the Hadoop Ecosystem

Why Apache Spark is the Heir to MapReduce in the Apache Hadoop Ecosystem
© Cloudera, Inc. All rights reserved.

MapReduce: Hadoop’s Original Data Processing Engine
Key Advances by MapReduce:
• Data Locality: Input splits are computed automatically and mappers are launched close to the data they read
• Fault-Tolerance: Writing out intermediate results and making mappers restartable meant jobs could run on commodity hardware
• Linear Scalability: Locality combined with a programming model that forces developers to write generally scalable solutions to problems
[Diagram: a row of twelve Map tasks feeding a row of four Reduce tasks]
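The map, shuffle, and reduce steps behind these points can be sketched in plain Python. This is a toy, single-process illustration of the programming model only, not Hadoop code: the function names (`map_phase`, `shuffle`, `reduce_phase`, `run_job`) and the sample log lines are assumptions made for the example.

```python
from collections import defaultdict
from itertools import chain


def map_phase(split):
    """Mapper: emit a (level, 1) pair for every log line in one input split."""
    for line in split:
        level = line.split(" ", 1)[0]
        yield level, 1


def shuffle(pairs):
    """Shuffle: group emitted values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reducer: combine every value seen for one key."""
    return key, sum(values)


def run_job(splits):
    # Each split is mapped independently of the others; on a real cluster
    # these mappers would run in parallel, close to the data they read.
    mapped = chain.from_iterable(map_phase(split) for split in splits)
    return dict(reduce_phase(key, values) for key, values in shuffle(mapped).items())


# Hypothetical input: two splits of a log file.
splits = [
    ["ERROR disk full", "INFO started"],
    ["INFO heartbeat", "ERROR timeout", "WARN slow"],
]
counts = run_job(splits)
print(counts)
```

Because each mapper sees only its own split and each reducer sees only one key, any step can be rerun in isolation after a failure, which is the property the fault-tolerance bullet relies on.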

MapReduce Did Its Original Job Well, But…
MapReduce was sufficient for many use cases, but a bit like haiku in its expressiveness:
A very rigid framework;
Diverse, powerful.
[Diagram: Hive, Pig, Mahout, Crunch, and Solr, each running on MapReduce]

We Can Do Better with Apache Spark
Better Developer Productivity
• Rich APIs for Scala, Java, and Python
• Interactive shell
Better Performance
• General execution graphs
• In-memory storage

High-Productivity Language Support
• Native support for multiple languages with identical APIs
• Use of closures, iterations, and other common language constructs to minimize code
• Unified API for batch and streaming
Python
    lines = sc.textFile(...)
    lines.filter(lambda s: "ERROR" in s).count()
Scala
    val lines = sc.textFile(...)
    lines.filter(s => s.contains("ERROR")).count()
Java
    JavaRDD<String> lines = sc.textFile(...);
    lines.filter(new Function<String, Boolean>() {
        public Boolean call(String s) {
            return s.contains("ERROR");
        }
    }).count();

Automatic Parallelization of Complex Flows
In Spark, individual execution tasks are expressed as a single program flow that Spark parallelizes automatically, which is a big time saver for developers.
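    # split_line is assumed to turn each line into a (key, value) pair
    errors = rdd1.map(split_line).filter(lambda kv: "ERROR" in kv[1])
    grouped = rdd2.map(split_line).groupByKey()
    grouped.join(errors).take(10)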
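To make the idea of one program flow being parallelized for the developer more concrete, here is a minimal sketch in plain Python. It is not Spark and does not use the Spark API: `PartitionedDataset`, `split_line`, and the sample data are hypothetical names invented for this illustration. The developer chains `map` and `filter` once, and the toy runtime applies the whole chain to every partition on a thread pool.

```python
from concurrent.futures import ThreadPoolExecutor


class PartitionedDataset:
    """Toy stand-in for a distributed dataset: steps are recorded, then run per partition."""

    def __init__(self, partitions, steps=()):
        self._partitions = partitions
        self._steps = tuple(steps)

    def map(self, fn):
        # Record the step instead of running it; a new dataset carries the longer plan.
        return PartitionedDataset(self._partitions, self._steps + (("map", fn),))

    def filter(self, predicate):
        return PartitionedDataset(self._partitions, self._steps + (("filter", predicate),))

    def _run_partition(self, partition):
        # Stream one partition's records through every recorded step.
        records = iter(partition)
        for kind, fn in self._steps:
            records = map(fn, records) if kind == "map" else filter(fn, records)
        return list(records)

    def collect(self):
        # The runtime, not the developer, decides how partitions are scheduled.
        with ThreadPoolExecutor() as pool:
            results = list(pool.map(self._run_partition, self._partitions))
        return [record for part in results for record in part]


def split_line(line):
    """Hypothetical parser: 'key rest of line' -> (key, 'rest of line')."""
    key, rest = line.split(" ", 1)
    return key, rest


partitions = [["a ERROR x", "b INFO y"], ["c ERROR z"]]
errors = PartitionedDataset(partitions).map(split_line).filter(lambda kv: "ERROR" in kv[1])
result = errors.collect()
print(result)
```

The point of the design is that the calling code never mentions threads or partitions; it reads like a sequential pipeline, and the scheduling is left to the runtime.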

Integrated Streaming
Run continuous processing of data using Spark’s core API.
Example use cases:
• “On-the-fly” ETL as data is ingested into Hadoop/HDFS
• Detecting anomalous behavior and triggering alerts
• Continuous reporting of summary metrics for incoming data
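Two of the use cases above, anomaly detection and summary metrics, can be sketched in plain Python under the assumption that the stream arrives as a sequence of small batches, which is how Spark Streaming handles incoming data. This is an illustration of the pattern rather than Spark code; `summarize`, `detect_anomalies`, `process_stream`, the threshold, and the sample readings are all hypothetical.

```python
from statistics import mean


def summarize(batch):
    """Batch-style logic: summary metrics for one collection of readings."""
    return {"count": len(batch), "mean": mean(batch), "max": max(batch)}


def detect_anomalies(batch, threshold):
    """Return the readings in this batch that exceed the alert threshold."""
    return [reading for reading in batch if reading > threshold]


def process_stream(micro_batches, threshold):
    """Apply the same batch functions to each micro-batch as it arrives."""
    for batch in micro_batches:
        if not batch:
            continue  # nothing arrived during this interval
        yield summarize(batch), detect_anomalies(batch, threshold)


# Hypothetical incoming data: three intervals, the last one empty.
incoming = [[10, 12, 11], [9, 95, 10], []]
reports = list(process_stream(incoming, threshold=50))
for summary, alerts in reports:
    print(summary, alerts)
```

The functions that process a micro-batch are ordinary batch functions, which is the sense in which batch and streaming code can share one API.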

Spark and Hadoop Belong Together (via YARN)
[Diagram: platform stack. Labels shown: YARN; HDFS, HBase; Spark; Spark Streaming; GraphX; MLlib; Spark SQL; Hive; Pig; Impala; Search; “Spark or MR”. Legend: Core Hadoop; Spark components.]

Cloudera Is a Leader in the Spark Movement
Timeline, 2013 to 2016 (milestones in the order shown on the slide):
• Identified Spark’s early potential
• Ships and supports Spark with CDH 4.4
• Significant contributions to Spark-on-YARN integration
• Announces initiative to make Spark the standard execution engine
• Launches first Spark training
• Added security integration
• Cloudera engineers publish O’Reilly Spark book
• Leading effort to further performance, usability, and enterprise-readiness

Spark is Replacing MapReduce as the Open Standard
With help from Cloudera’s Apache committers, ecosystem communities are either adding Spark as an execution engine alongside MapReduce or making Spark the default:
Hive, Pig, Mahout, Crunch, Solr

Cloudera & Intel: Joint Roadmap for Spark
Cloudera and Intel engineers are major contributors to Spark, working alongside those of Databricks and the rest of the global Apache community to help build the platform.
• 23 total engineers working on Spark (including 5 committers)
    • Cloudera: 8 (4 committers)
    • Intel: 15 (1 committer)
• 900+ patches contributed to date

Developers are Sparking Up
• 82% of users chose Spark to replace MapReduce
• 78% of users need faster processing for large data sets
• 67% of users plan to introduce event stream processing
• 22% of users run Spark on Cloudera, twice as many as any other platform option
Source: Typesafe Apache Spark Adoption Survey, Jan. 2015

Focus Areas for Contributions
Enterprise Readiness
• Comprehensive Security
• Comprehensive Governance
• Improved Monitoring and Dashboards
Performance
• Core shuffle and sort improvements
• Improved leverage of HDFS data locality
• Automatic performance tuning
• Leverage HDFS Caching
• Scale testing
• HDFS discardable distributed memory integration
• Spark-on-YARN improvements: dynamic container resizing
SQL
• Spark SQL stability
• SQL on Spark Streaming
• Column-level security
Growing the Ecosystem
• Hive on Spark
• Remote Spark Context
• Sqoop on Spark
Data Science
• MLlib Pipelines
• Interactive IPython-style notebooks
• Intel MKL integration for performance improvements

Get Educated About Spark at cloudera.com/spark
• Read the Spark book by Cloudera’s committers
• Get Spark training
• Get hands-on with Spark and Hadoop on AWS

Thank You
cloudera.com/spark
