Big Data Processing with Apache Spark 2014
Martin Ahchiev
Content is available under a Creative Commons 3.0 License unless otherwise noted.
05.12.2015
Apache Spark
• What is Big Data?
• Internet of Things
• What is Apache Spark?
• History of Apache Spark
• Why Spark?
• Spark Execution Flow
• Spark Context
• Resilient Distributed Dataset (RDD)
• RDD Examples
• MapReduce Algorithm
• MapReduce Example: Word Count
• Let’s try some examples
What is Big Data?
• Saying that “big data” is like “small data”, only bigger in size, is not incorrect
• But larger data volumes require different approaches
• Big Data is a set of technologies and methods for handling large volumes of data that arrive at high speed and in various formats.
Big Data and the Internet of Things
• Connected Intelligence
• Every day, we create 2.5 quintillion bytes of data — so much that
90% of the data in the world today has been created in the last two
years alone. IBM, “Bringing big data to the enterprise”
In Next 60 seconds…
Big Data – Trends and Opportunities
“Welcome to the Internet of Customers. Behind every app, every device,
and every connection, is a customer. Billions of them. And each and every
one is speeding toward the future.” Salesforce.com
What is Apache Spark?
• Emerging big data framework
• Open source framework for fast distributed in-memory
data processing and data analytics
• Extension/alternative to the MapReduce model
• Currently an Apache high-priority “top-level” project
• Written in Scala
  ◦ A functional programming language that runs on the JVM
• Key concepts
  ◦ Avoid the data bottleneck by distributing data when it is stored
  ◦ Bring the processing to the data
  ◦ Keep data in memory
History of Apache Spark
• Started at the UC Berkeley AMPLab as a research project by Matei Zaharia in 2009
  ◦ AMP = Algorithms, Machines, People
  ◦ AMPLab integrates algorithms, machines, and people to make sense of Big Data
• Spark became open source in March 2010
• Spark was donated to the Apache Software Foundation in June 2013
• Spark became a top-level Apache project in February 2014
Why Spark?
Speed
• Run programs up to 100x faster than Hadoop MapReduce in
memory, or 10x faster on disk.
• Last year (2014), Spark overtook Hadoop MapReduce by completing the 100 TB Daytona GraySort contest 3x faster on one tenth the number of machines. It also became the fastest open source engine for sorting a petabyte.
Ease of Use
• Write applications quickly in Java, Scala, Python, R.
• Spark offers over 80 high-level operators that make it
easy to build parallel apps. And you can use it
interactively from the Scala, Python and R shells.
Generality
• Combine SQL, streaming, and complex analytics
• Spark powers a stack of libraries including SQL and
DataFrames, MLlib for machine learning, GraphX, and Spark
Streaming. You can combine these libraries seamlessly in the
same application
Runs Everywhere
• Spark runs on Hadoop, Mesos, standalone, or in the cloud. It
can access diverse data sources including HDFS, Cassandra,
HBase, and S3.
• You can run Spark using its standalone cluster mode, on EC2,
on Hadoop YARN, or on Apache Mesos. Access data in
HDFS, Cassandra, HBase, Hive, Tachyon, and any Hadoop
data source.
Execution Flow
• Cluster Manager: an external service that manages resources on the cluster (standalone manager, YARN, Apache Mesos)
• Worker Node: a node that runs the application program in the cluster
• Executor
1. A process launched on a worker node that runs the tasks
2. Keeps data in memory or on disk
• Task: a unit of work that is sent to an executor
• Job
1. Consists of multiple tasks
2. Created in response to an action
• Stage: each job is divided into smaller sets of tasks called stages, which run sequentially and depend on each other
• SparkContext: represents the connection to a Spark cluster and can be used to create RDDs, accumulators, and broadcast variables on that cluster
• Driver Program: the process that starts the execution (it runs the main() function)
Spark Context
• Every Spark application requires a Spark Context
  ◦ The main entry point to the Spark API
• The Spark shell provides a preconfigured Spark Context called sc
Resilient Distributed Dataset (RDD)
• The RDD is the basic abstraction in Spark
• A distributed collection of objects
• RDD (Resilient Distributed Dataset)
1. Resilient – if data in memory is lost, it can be recreated
2. Distributed – stored in memory across the cluster
3. Dataset – initial data can come from a file or be created programmatically
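The "resilient" property can be sketched in plain Python. This is a toy model only, not how Spark is implemented: the class name `LineageDataset` and its methods are hypothetical. The idea it illustrates is that a dataset which remembers its source and the steps applied to it can rebuild lost in-memory results.

```python
class LineageDataset:
    """Toy dataset that remembers how it was derived (hypothetical, not a Spark class)."""

    def __init__(self, source, steps=()):
        self._source = source        # callable that re-reads the original data
        self._steps = tuple(steps)   # functions applied to each element, in order
        self._cache = None           # in-memory result; may be lost at any time

    def map(self, func):
        # Deriving a new dataset only extends the recorded lineage.
        return LineageDataset(self._source, self._steps + (func,))

    def drop_cache(self):
        # Simulates losing the in-memory copy.
        self._cache = None

    def collect(self):
        # Recompute from the source and the recorded steps if nothing is cached.
        if self._cache is None:
            data = list(self._source())
            for step in self._steps:
                data = [step(item) for item in data]
            self._cache = data
        return list(self._cache)


dataset = LineageDataset(lambda: [1, 2, 3]).map(lambda x: x * 10)
first = dataset.collect()
dataset.drop_cache()          # the in-memory data is "lost"
second = dataset.collect()    # rebuilt from the recorded lineage
print(first, second)
```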
Example: A File-based RDD
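The original slide content here was an image, so the following is only a plain-Python stand-in for the idea of a dataset backed by a text file. In the Spark shell the corresponding calls would be along the lines of `sc.textFile(path)` followed by `count()`. The helper `text_file` and the sample lines below are hypothetical and exist only for illustration.

```python
import os
import tempfile


def text_file(path):
    """Yield the lines of a text file one at a time, without trailing newlines."""
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            yield line.rstrip("\n")


# Create a small sample file (hypothetical content, for illustration only).
sample_lines = ["I have never seen a purple cow", "I never hope to see one"]
with tempfile.NamedTemporaryFile(
    "w", suffix=".txt", delete=False, encoding="utf-8"
) as tmp:
    tmp.write("\n".join(sample_lines) + "\n")
    sample_path = tmp.name

# Counting the elements is analogous to calling an action such as count.
line_count = sum(1 for _ in text_file(sample_path))
os.remove(sample_path)
print(line_count)
```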
RDD Operations
• Two types of RDD operations
  ◦ Actions – return values
    ▪ count
    ▪ take(n)
    ▪ reduce
  ◦ Transformations – define new RDDs based on the current one
    ▪ filter
    ▪ map
Example map and filter Transformations
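The original slide was an image. As a rough illustration, the difference between transformations and actions can be mimicked with Python's built-in lazy `map` and `filter`. This is an analogy, not the Spark API: the transformations only describe the work, and nothing runs until a result is requested. The sample data and the `to_upper` helper are assumptions.

```python
calls = []  # records when the mapping function actually runs


def to_upper(line):
    calls.append(line)
    return line.upper()


lines = ["spark", "hadoop", "scala"]

# "Transformations": these only describe new datasets; nothing is computed yet.
upper = map(to_upper, lines)
starts_with_s = filter(lambda s: s.startswith("S"), upper)
calls_before_action = len(calls)

# "Action": asking for the values forces the whole chain to run.
result = list(starts_with_s)
calls_after_action = len(calls)

print(calls_before_action, calls_after_action, result)
```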
RDDs
• RDDs can hold any type of element
  ◦ Primitive types: integers, characters, Booleans, strings, etc.
  ◦ Sequence types: lists, arrays, tuples, dicts, etc.
  ◦ Scala/Java objects (if serializable)
• Some types of RDDs have additional functionality
  ◦ Double RDDs – RDDs consisting of numeric data
  ◦ Pair RDDs – RDDs consisting of key-value pairs
Pair RDDs
• Pair RDDs are a special form of RDD
  ◦ Each element must be a key-value pair
  ◦ Keys and values can be of any type
• Why?
  ◦ Use with MapReduce algorithms
  ◦ Many additional functions are available for common data processing needs, e.g. sorting, joining, grouping, counting, etc.
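The operations named above (sorting, grouping, counting, joining) can be sketched on ordinary lists of key-value tuples. This is plain Python rather than Spark's pair RDD functions, and all sample data is hypothetical. The join shown is a simple inner join that assumes keys on the right-hand side are unique.

```python
from collections import defaultdict

# Hypothetical sample data: each element is a (key, value) tuple.
ages = [("carol", 41), ("alice", 30), ("bob", 25)]
cities = [("alice", "Sofia"), ("bob", "Plovdiv")]
visits = [("alice", "home"), ("bob", "cart"), ("alice", "checkout")]

# Sorting: order the pairs by key.
sorted_by_key = sorted(ages, key=lambda pair: pair[0])

# Grouping: collect every value that shares a key.
grouped = defaultdict(list)
for user, page in visits:
    grouped[user].append(page)

# Counting: number of elements per key.
counts = {user: len(pages) for user, pages in grouped.items()}

# Joining: inner join on the key (assumes keys in `cities` are unique).
city_by_name = dict(cities)
joined = [
    (name, (age, city_by_name[name]))
    for name, age in ages
    if name in city_by_name
]

print(sorted_by_key, dict(grouped), counts, joined)
```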
MapReduce
• MapReduce is a common programming model
1. Two phases
  ◦ Map – process each element in a data set
  ◦ Reduce – aggregate or consolidate the data
2. Easily applicable to distributed processing of large data sets
• Hadoop MapReduce is the major implementation
1. Limited
  ◦ Each job has one Map phase and one Reduce phase
  ◦ Job output is saved to files
• Spark implements MapReduce with much greater flexibility
1. Map and Reduce functions can be interspersed
2. Results are stored in memory
  ◦ Operations can be chained easily
Figures: Spark execution flow; Hadoop execution flow
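A minimal single-machine sketch of interspersing map and reduce steps, using Python's built-in `map` and `functools.reduce` rather than Spark. The result of the first reduce stays in memory and feeds the next map, which is the kind of chaining described above. The readings are made-up sample values.

```python
from functools import reduce

readings = [3, 8, 5, 12, 7]  # hypothetical sample data

# Reduce: aggregate all readings into one total, then derive the mean.
total = reduce(lambda a, b: a + b, readings)
mean = total / len(readings)

# Map: uses the result of the previous reduce, which is still held in memory.
deviations = list(map(lambda x: abs(x - mean), readings))

# Reduce again: consolidate the mapped values into a single number.
largest_deviation = reduce(max, deviations)

print(mean, deviations, largest_deviation)
```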
MapReduce Example: Word Count
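The slides for this example were images, so what follows is a plain-Python sketch of the usual word count shape: split lines into words, map each word to a (word, 1) pair, then combine the values per key. The helpers `flat_map` and `reduce_by_key` are hypothetical stand-ins written for this sketch, not Spark functions, and the input lines are invented.

```python
def flat_map(func, items):
    """Apply func to every item and flatten the resulting lists into one list."""
    return [out for item in items for out in func(item)]


def reduce_by_key(func, pairs):
    """Combine the values of pairs that share a key using a two-argument function."""
    merged = {}
    for key, value in pairs:
        merged[key] = func(merged[key], value) if key in merged else value
    return merged


lines = ["the cat sat on the mat", "the aardvark sat on the sofa"]

words = flat_map(str.split, lines)                  # one element per word
pairs = [(word, 1) for word in words]               # map: word -> (word, 1)
counts = reduce_by_key(lambda a, b: a + b, pairs)   # reduce: sum the ones per word

print(counts)
```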
ReduceByKey
• ReduceByKey functions must be
  ◦ Binary: combine two values that share the same key
  ◦ Commutative: x+y = y+x
  ◦ Associative: (x+y)+z = x+(y+z)
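Why these properties matter can be sketched in plain Python. When values are split across partitions, each partition is reduced on its own and the partial results are then combined, so the answer should not depend on where the split falls. The helper `reduce_partitioned` is hypothetical and models only a two-way split. The commutativity requirement concerns the order in which partial results arrive, which this sketch does not vary.

```python
from functools import reduce


def reduce_partitioned(func, values, split):
    """Reduce two partitions separately, then reduce the partial results."""
    left, right = values[:split], values[split:]
    partials = [reduce(func, part) for part in (left, right) if part]
    return reduce(func, partials)


def add(x, y):
    return x + y


def subtract(x, y):
    return x - y


values = [10, 4, 3, 2]
splits = range(1, len(values))

# Addition is commutative and associative: every split gives the same answer.
add_results = {reduce_partitioned(add, values, s) for s in splits}

# Subtraction is neither: the answer changes with the partitioning.
subtract_results = {reduce_partitioned(subtract, values, s) for s in splits}

print(add_results, subtract_results)
```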
Let’s try some examples
martin.ahchiev@musala.com

Big Data Processing with Apache Spark 2014

  • 1. Martin Ahchiev Content is available under a Creative Commons 3.0 License unless otherwise noted.
  • 2. Apache Spark, 05.12.2015
      • What is Big Data?
      • Internet of Things
      • What is Apache Spark?
      • History of Apache Spark
      • Why Spark?
      • Spark Execution Flow
      • Spark Context
      • Resilient Distributed Dataset (RDD)
      • RDD Examples
      • MapReduce Algorithm
      • MapReduce Example: Word Count
      • Let’s try some examples
  • 3. What is Big Data?
      • Saying that “big data” is like “small data”, only bigger in size, is not incorrect.
      • But bigger data requires different approaches.
      • Big Data is a set of technologies and methods for handling large volumes of data at rapid speeds and in various formats.
  • 4. Big Data and the Internet of Things
      • Connected Intelligence
      • Every day, we create 2.5 quintillion bytes of data, so much that 90% of the data in the world today has been created in the last two years alone. (Source: IBM, “Bringing big data to the enterprise”)
  • 5. In the Next 60 Seconds…
  • 6. Big Data: Trends and Opportunities
      • “Welcome to the Internet of Customers. Behind every app, every device, and every connection, is a customer. Billions of them. And each and every one is speeding toward the future.” (Salesforce.com)
  • 7. What is Apache Spark?
      • Emerging big data framework
      • Open source framework for fast, distributed, in-memory data processing and data analytics
      • Extension of, and alternative to, the MapReduce model
      • Currently an Apache high-priority “top-level” project
  • 8. What is Apache Spark? (continued)
      • Written in Scala
          ◦ A functional programming language that runs on the JVM
      • Key Concepts
          ◦ Avoid the data bottleneck by distributing data when it is stored
          ◦ Bring the processing to the data
          ◦ Data is stored in memory
  • 9. History of Apache Spark
      • Started at the UC Berkeley AMPLab as a research project by Matei Zaharia, 2009
          ◦ AMP = Algorithms, Machines, People
          ◦ AMPLab integrates Algorithms, Machines, and People to make sense of Big Data
      • Spark became open source, March 2010
      • Spark was donated to the Apache Software Foundation, June 2013
      • Spark became a top-level Apache project, February 2014
  • 10. Why Spark? Speed
      • Run programs up to 100x faster than Hadoop MapReduce in memory, or 10x faster on disk.
      • Last year, Spark outperformed Hadoop MapReduce by completing the 100 TB Daytona GraySort contest 3x faster on one tenth the number of machines. It also became the fastest open source engine for sorting a petabyte.
  • 11. Why Spark? Ease of Use
      • Write applications quickly in Java, Scala, Python, or R.
      • Spark offers over 80 high-level operators that make it easy to build parallel apps, and you can use it interactively from the Scala, Python, and R shells.
  • 12. Why Spark? Generality
      • Combine SQL, streaming, and complex analytics.
      • Spark powers a stack of libraries including SQL and DataFrames, MLlib for machine learning, GraphX, and Spark Streaming. You can combine these libraries seamlessly in the same application.
  • 13. Why Spark? Runs Everywhere
      • Spark runs on Hadoop, Mesos, standalone, or in the cloud. It can access diverse data sources including HDFS, Cassandra, HBase, and S3.
      • You can run Spark using its standalone cluster mode, on EC2, on Hadoop YARN, or on Apache Mesos. It can access data in HDFS, Cassandra, HBase, Hive, Tachyon, and any Hadoop data source.
  • 14. Execution Flow
      • Cluster Manager: an external service that manages resources on the cluster (standalone manager, YARN, Apache Mesos)
      • Worker Node: a node that runs the application program in the cluster
      • Executor
          1. A process launched on a worker node that runs the tasks
          2. Keeps data in memory or disk storage
      • Task: a unit of work that is sent to an executor
      • Job
          1. Consists of multiple tasks
          2. Created in response to an action
      • Stage: each job is divided into smaller sets of tasks called stages, which run sequentially and depend on each other
      • SparkContext: represents the connection to a Spark cluster and can be used to create RDDs, accumulators, and broadcast variables on that cluster
      • Driver Program: the process that starts the execution (runs the main() function)
  • 15. Spark Context
      • Every Spark application requires a Spark Context
          ◦ The main entry point to the Spark API
      • The Spark Shell provides a preconfigured Spark Context called sc
  • 16. Resilient Distributed Dataset (RDD)
      • The RDD is the basic abstraction in Spark
      • A distributed collection of objects
      • RDD (Resilient Distributed Dataset)
          1. Resilient: if data in memory is lost, it can be recreated
          2. Distributed: stored in memory across the cluster
          3. Dataset: initial data can come from a file or be created programmatically
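The "resilient" property on this slide can be illustrated with a small plain-Python sketch. It is not Spark code: `TinyDataset`, `lose_memory`, and the sample numbers are hypothetical names invented here. The only idea it models is that a dataset which remembers how it was derived can be rebuilt after its in-memory copy is lost.

```python
from typing import Callable, Iterable, List, Optional


class TinyDataset:
    """Hypothetical stand-in for an RDD: it remembers how to rebuild itself."""

    def __init__(self, source: Callable[[], Iterable[int]]):
        self._source = source                    # recipe that produces the data
        self._cache: Optional[List[int]] = None  # in-memory copy, may be lost

    def map(self, fn: Callable[[int], int]) -> "TinyDataset":
        # The child's recipe is "recompute the parent, then apply fn".
        return TinyDataset(lambda: (fn(x) for x in self.collect()))

    def collect(self) -> List[int]:
        # Rebuild from the recipe whenever the in-memory copy is missing.
        if self._cache is None:
            self._cache = list(self._source())
        return list(self._cache)

    def lose_memory(self) -> None:
        # Simulate losing the in-memory data (for example, a failed node).
        self._cache = None


base = TinyDataset(lambda: range(1, 6))
squares = base.map(lambda x: x * x)

first = squares.collect()
squares.lose_memory()
base.lose_memory()
second = squares.collect()  # recomputed from the recipe, not from memory
```

Both `first` and `second` hold the same values, because the second call replays the derivation instead of relying on the lost copy.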
  • 19. RDD Operations
      • Two types of RDD operations
          ◦ Actions: return values
              count, take(n), reduce
          ◦ Transformations: define new RDDs based on the current one
              filter, map
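The split between transformations and actions can be sketched in plain Python. This is a simplified model, not the Spark API: `LazyDataset` and the helper `double` are hypothetical, and the sketch assumes only that transformations record work while actions trigger it.

```python
import itertools


class LazyDataset:
    """Hypothetical model: transformations are recorded, actions run them."""

    def __init__(self, items, steps=()):
        self._items = list(items)
        self._steps = tuple(steps)

    # Transformations: return a new dataset, do no work yet.
    def map(self, fn):
        return LazyDataset(self._items, self._steps + (("map", fn),))

    def filter(self, pred):
        return LazyDataset(self._items, self._steps + (("filter", pred),))

    def _run(self):
        # Apply the recorded steps to each element, one element at a time.
        for item in self._items:
            keep = True
            for kind, fn in self._steps:
                if kind == "map":
                    item = fn(item)
                elif not fn(item):
                    keep = False
                    break
            if keep:
                yield item

    # Actions: execute the recorded steps and return a value.
    def count(self):
        return sum(1 for _ in self._run())

    def take(self, n):
        return list(itertools.islice(self._run(), n))


calls = []


def double(x):
    calls.append(x)  # record every real invocation
    return x * 2


ds = LazyDataset(range(1, 6)).map(double).filter(lambda x: x > 4)
calls_before = len(calls)  # still zero: nothing has executed yet
first_two = ds.take(2)     # action: runs only as far as needed
total = ds.count()         # action: runs the full pipeline
```

Until `take` or `count` is called, `double` is never invoked, which mirrors the way a transformation only defines a new dataset.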
  • 20. Example: map and filter Transformations
  • 21. RDDs
      • RDDs can hold any type of element
          ◦ Primitive types: integers, characters, Booleans, strings, etc.
          ◦ Sequence types: lists, arrays, tuples, dicts, etc.
          ◦ Scala/Java objects (if serializable)
      • Some types of RDDs have additional functionality
          ◦ Double RDDs: RDDs consisting of numeric data
          ◦ Pair RDDs: RDDs consisting of key-value pairs
  • 22. Pair RDDs
      • Pair RDDs are a special form of RDD
          ◦ Each element must be a key-value pair
          ◦ Keys and values can be of any type
      • Why?
          ◦ Use with MapReduce algorithms
          ◦ Many additional functions are available for common data processing needs, e.g. sorting, joining, grouping, counting, etc.
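The key-value operations named on this slide (grouping, sorting, joining) can be sketched on ordinary Python lists of tuples. The data and the helpers `group_by_key` and `join` are hypothetical and written for this illustration only; they model the idea of keyed operations, not Spark's distributed implementation.

```python
from collections import defaultdict

# Hypothetical sample data: each element is a (key, value) pair.
orders = [("alice", 30), ("bob", 12), ("alice", 5)]
cities = [("alice", "Sofia"), ("bob", "Plovdiv")]


def group_by_key(pairs):
    """Collect all values that share a key into one list."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(grouped)


def join(left, right):
    """Inner join: pair up values from both sides that share a key."""
    right_index = group_by_key(right)
    return [
        (key, (left_value, right_value))
        for key, left_value in left
        for right_value in right_index.get(key, [])
    ]


grouped = group_by_key(orders)
sorted_pairs = sorted(orders, key=lambda kv: kv[0])  # sort by key only
joined = join(orders, cities)
```

Because every element carries its key, each of these operations can be expressed without knowing anything else about the values.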
  • 23. MapReduce
      • MapReduce is a common programming model
          1. Two phases
              ◦ Map: process each element in a data set
              ◦ Reduce: aggregate or consolidate the data
          2. Easily applicable to distributed processing of large data sets
      • Hadoop MapReduce is the major implementation
          1. Limited
              ◦ Each job has one Map phase and one Reduce phase
              ◦ Job output is saved to files
      • Spark implements MapReduce with much greater flexibility
          1. Map and Reduce functions can be interspersed
          2. Results are stored in memory
              ◦ Operations can be chained easily
      • [Figure: Spark execution flow vs. Hadoop execution flow]
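The two phases, and the idea of chaining several steps in memory, can be sketched with Python's built-in `map`, `filter`, and `functools.reduce`. This runs on a single machine and uses made-up sample lines; it illustrates the programming model only, not how Hadoop or Spark distribute the work.

```python
from functools import reduce

# Hypothetical input: a tiny in-memory "data set" of text lines.
lines = [
    "spark is fast",
    "hadoop writes to disk",
    "spark keeps data in memory",
]

# Map phase: process each element independently.
words_per_line = map(lambda line: len(line.split()), lines)

# Reduce phase: consolidate the mapped values into one result.
total_words = reduce(lambda a, b: a + b, words_per_line)

# Chained steps: filter, then map, then reduce, with no intermediate files.
spark_lines = filter(lambda line: "spark" in line, lines)
longest_spark_line = reduce(max, map(len, spark_lines))
```

The second pipeline never materializes its intermediate results on disk, which is the flexibility the slide attributes to chained operations.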
  • 37. ReduceByKey
      • ReduceByKey functions must be
          ◦ Binary: combine two values that share the same key
          ◦ Commutative: x+y = y+x
          ◦ Associative: (x+y)+z = x+(y+z)
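Why these properties matter can be seen in a plain-Python sketch that reduces values per key inside each partition first and then merges the partial results. The function `reduce_by_key` below is a hypothetical single-machine model, and the partition layout is an assumption made for illustration; it is not Spark's implementation.

```python
def reduce_by_key(partitions, fn):
    """Reduce within each partition, then merge the partial results."""
    partials = []
    for partition in partitions:
        local = {}
        for key, value in partition:
            local[key] = fn(local[key], value) if key in local else value
        partials.append(local)

    merged = {}
    for local in partials:
        for key, value in local.items():
            merged[key] = fn(merged[key], value) if key in merged else value
    return merged


pairs = [("a", 1), ("a", 2), ("a", 3), ("a", 4)]
one_partition = [pairs]
two_partitions = [pairs[:2], pairs[2:]]


def add(x, y):
    return x + y  # commutative and associative


def sub(x, y):
    return x - y  # neither commutative nor associative


add_results = (reduce_by_key(one_partition, add), reduce_by_key(two_partitions, add))
sub_results = (reduce_by_key(one_partition, sub), reduce_by_key(two_partitions, sub))
```

With `add`, both layouts give the same total. With `sub`, the answer depends on how the data happened to be split, which is why such a function is unsuitable for a per-key reduction.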
  • 39. MapReduce Example: Word Count
      • Let’s try some examples
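As one such example, word count can be sketched in plain Python using the same three steps usually associated with it: split lines into words, map each word to a (word, 1) pair, and sum the ones per key. The input lines are made up, and the code is a single-machine illustration rather than the Spark program from the slides.

```python
from collections import defaultdict
from itertools import chain

# Hypothetical input lines.
lines = ["to be or not to be", "to see or not to see"]

# Step 1: flatten every line into a single stream of words.
words = chain.from_iterable(line.split() for line in lines)

# Step 2 (map): emit a (word, 1) pair for each word.
pairs = ((word, 1) for word in words)

# Step 3 (reduce by key): add up the ones for each distinct word.
counts = defaultdict(int)
for word, one in pairs:
    counts[word] += one

word_counts = dict(counts)
```

Addition is commutative and associative, so the per-key sums would be the same no matter how the pairs were split across machines.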