SlideShare a Scribd company logo
1 of 41
Download to read offline
Alexis Seigneurin
@aseigneurin @ippontech
Spark
● Processing of large volumes of data
● Distributed processing on commodity
hardware
● Written in Scala, Java and Python bindings
History
● 2009: AMPLab, Berkeley University
● June 2013 : "Top-level project" of the
Apache foundation
● May 2014: version 1.0.0
● Currently: version 1.2.0
Use cases
● Logs analysis
● Processing of text files
● Analytics
● Distributed search (Google, before)
● Fraud detection
● Product recommendation
● Same use cases
● Same development
model: MapReduce
● Integration with the
ecosystem
Proximity with Hadoop
Simpler than Hadoop
● API simpler to learn
● “Relaxed” MapReduce
● Spark Shell: interactive processing
Faster than Hadoop
Spark officially sets a new record in large-scale
sorting (5th November 2014)
● Sorting 100 To of data
● Hadoop MR: 72 minutes
○ With 2100 noeuds (50400 cores)
● Spark: 23 minutes
○ With 206 noeuds (6592 cores)
Spark ecosystem
● Spark
● Spark Shell
● Spark Streaming
● Spark SQL
● Spark ML
● GraphX
Integration
● Yarn, Zookeeper, Mesos
● HDFS
● Cassandra
● Elasticsearch
● MongoDB
Spark
Operating principle
● Resilient Distributed Dataset
● Abstraction of a collection processed in
parallel
● Fault tolerant
● Can work with tuples:
○ Key - Value
○ Tuples must be independent from each other
RDD
Sources
● Files on HDFS
● Local files
● Collection in memory
● Amazon S3
● NoSQL database
● ...
● Or a custom implementation of
InputFormat
Transformations
● Processes an RDD, returns another RDD
● Lazy!
● Examples :
○ map(): one value → another value
○ mapToPair(): one value → a tuple
○ filter(): filters values/tuples given a condition
○ groupByKey(): groups values by key
○ reduceByKey(): aggregates values by key
○ join(), cogroup()...: joins two RDDs
Actions
● Does not return an RDD
● Examples:
○ count(): counts values/tuples
○ saveAsHadoopFile(): saves results in Hadoop’s
format
○ foreach(): applies a function on each item
○ collect(): retrieves values in a list (List<T>)
Example
● Trees of Paris: CSV file, Open Data
● Count of trees by specie
Spark - Example
geom_x_y;circonfere;adresse;hauteurenm;espece;varieteouc;dateplanta
48.8648454814, 2.3094155344;140.0;COURS ALBERT 1ER;10.0;Aesculus hippocastanum;;
48.8782668139, 2.29806967519;100.0;PLACE DES TERNES;15.0;Tilia platyphyllos;;
48.889306184, 2.30400164126;38.0;BOULEVARD MALESHERBES;0.0;Platanus x hispanica;;
48.8599934405, 2.29504883623;65.0;QUAI BRANLY;10.0;Paulownia tomentosa;;1996-02-29
...
Spark - Example
JavaSparkContext sc = new JavaSparkContext("local", "arbres");
sc.textFile("data/arbresalignementparis2010.csv")
.filter(line -> !line.startsWith("geom"))
.map(line -> line.split(";"))
.mapToPair(fields -> new Tuple2<String, Integer>(fields[4], 1))
.reduceByKey((x, y) -> x + y)
.sortByKey()
.foreach(t -> System.out.println(t._1 + " : " + t._2));
[... ; … ; …]
[... ; … ; …]
[... ; … ; …]
[... ; … ; …]
[... ; … ; …]
[... ; … ; …]
u
m
k
m
a
a
textFile mapToPairmap
reduceByKey
foreach
1
1
1
1
1
u
m
k
1
2
1
2a
...
...
...
...
filter
...
...
sortByKey
a
m
2
1
2
1u
...
...
...
...
...
...
geom;...
1 k
Spark - Example
Acacia dealbata : 2
Acer acerifolius : 39
Acer buergerianum : 14
Acer campestre : 452
...
Spark clusters
Topology & Terminology
● One master / several workers
○ (+ one standby master)
● Submit an application to the cluster
● Execution managed by a driver
Spark in a cluster
Several options
● YARN
● Mesos
● Standalone
○ Workers started manually
○ Workers started by the master
MapReduce
● Spark (API)
● Distributed processing
● Fault tolerant
Storage
● HDFS, base NoSQL...
● Distributed storage
● Fault tolerant
Storage & Processing
Data locality
● Process the data where it is stored
● Avoid network I/Os
Data locality
Spark
Worker
HDFS
Datanode
Spark
Worker
HDFS
Datanode
Spark
Worker
HDFS
Datanode
Spark Master
HDFS
Namenode
HDFS
Namenode
(Standby)
Spark
Master
(Standby)
Demo
Spark in a cluster
Demo
$ $SPARK_HOME/sbin/start-master.sh
$ $SPARK_HOME/bin/spark-class
org.apache.spark.deploy.worker.Worker
spark://MBP-de-Alexis:7077
--cores 2 --memory 2G
$ mvn clean package
$ $SPARK_HOME/bin/spark-submit
--master spark://MBP-de-Alexis:7077
--class com.seigneurin.spark.WikipediaMapReduceByKey
--deploy-mode cluster
target/pres-spark-0.0.1-SNAPSHOT.jar
Spark SQL
● Usage of an RDD in SQL
● SQL engine: converts SQL instructions to
low-level instructions
Spark SQL
Spark SQL
Prerequisites:
● Use tabular data
● Describe the schema → SchemaRDD
Describing the schema :
● Programmatic description of the data
● Schema inference through reflection (POJO)
JavaRDD<Row> rdd = trees.map(fields -> Row.create(
Float.parseFloat(fields[3]), fields[4]));
● Creating tabular data (type Row)
Spark SQL - Example
---------------------------------------
| 10.0 | Aesculus hippocastanum |
| 15.0 | Tilia platyphyllos |
| 0.0 | Platanus x hispanica |
| 10.0 | Paulownia tomentosa |
| ... | ... |
Spark SQL - Example
List<StructField> fields = new ArrayList<StructField>();
fields.add(DataType.createStructField("hauteurenm", DataType.FloatType, false));
fields.add(DataType.createStructField("espece", DataType.StringType, false));
StructType schema = DataType.createStructType(fields);
JavaSchemaRDD schemaRDD = sqlContext.applySchema(rdd, schema);
schemaRDD.registerTempTable("tree");
---------------------------------------
| hauteurenm | espece |
---------------------------------------
| 10.0 | Aesculus hippocastanum |
| 15.0 | Tilia platyphyllos |
| 0.0 | Platanus x hispanica |
| 10.0 | Paulownia tomentosa |
| ... | ... |
● Describing the schema
● Counting trees by specie
Spark SQL - Example
sqlContext.sql("SELECT espece, COUNT(*)
FROM tree
WHERE espece <> ''
GROUP BY espece
ORDER BY espece")
.foreach(row -> System.out.println(row.getString(0)+" : "+row.getLong(1)));
Acacia dealbata : 2
Acer acerifolius : 39
Acer buergerianum : 14
Acer campestre : 452
...
Spark Streaming
Micro-batches
● Slices a continuous flow of data into batches
● Same API
● ≠ Apache Storm
DStream
● Discretized Streams
● Sequence of RDDs
● Initialized with a Duration
Window operations
● Sliding window
● Reuses data from other windows
● Initialized with a window length and a slide
interval
Sources
● Socket
● Kafka
● Flume
● HDFS
● MQ (ZeroMQ...)
● Twitter
● ...
● Or a custom implementation of Receiver
Demo
Spark Streaming
Spark Streaming Demo
● Receive Tweets with hashtag #Android
○ Twitter4J
● Detection of the language of the Tweet
○ Language Detection
● Indexing with Elasticsearch
● Reporting with Kibana 4
$ curl -X DELETE localhost:9200
$ curl -X PUT localhost:9200/spark/_mapping/tweets '{
"tweets": {
"properties": {
"user": {"type": "string","index": "not_analyzed"},
"text": {"type": "string"},
"createdAt": {"type": "date","format": "date_time"},
"language": {"type": "string","index": "not_analyzed"}
}
}
}'
● Launch ElasticSearch
Demo
● Launch Kibana -> http://localhost:5601
● Launch the Spark Streaming process
@aseigneurin
aseigneurin.github.io
@ippontech
blog.ippon.fr

More Related Content

What's hot

Apache cassandra in 2016
Apache cassandra in 2016Apache cassandra in 2016
Apache cassandra in 2016Duyhai Doan
 
Intro To Cascading
Intro To CascadingIntro To Cascading
Intro To CascadingNate Murray
 
Practical Hadoop using Pig
Practical Hadoop using PigPractical Hadoop using Pig
Practical Hadoop using PigDavid Wellman
 
Gruter_TECHDAY_2014_04_TajoCloudHandsOn (in Korean)
Gruter_TECHDAY_2014_04_TajoCloudHandsOn (in Korean)Gruter_TECHDAY_2014_04_TajoCloudHandsOn (in Korean)
Gruter_TECHDAY_2014_04_TajoCloudHandsOn (in Korean)Gruter
 
Apache avro and overview hadoop tools
Apache avro and overview hadoop toolsApache avro and overview hadoop tools
Apache avro and overview hadoop toolsalireza alikhani
 
How to use Parquet as a basis for ETL and analytics
How to use Parquet as a basis for ETL and analyticsHow to use Parquet as a basis for ETL and analytics
How to use Parquet as a basis for ETL and analyticsJulien Le Dem
 
Xadoop - new approaches to data analytics
Xadoop - new approaches to data analyticsXadoop - new approaches to data analytics
Xadoop - new approaches to data analyticsMaxim Grinev
 
DMM.com ラボはなぜSparkを採用したのか? レコメンドエンジン開発の裏側をお話します
DMM.com ラボはなぜSparkを採用したのか? レコメンドエンジン開発の裏側をお話しますDMM.com ラボはなぜSparkを採用したのか? レコメンドエンジン開発の裏側をお話します
DMM.com ラボはなぜSparkを採用したのか? レコメンドエンジン開発の裏側をお話しますWataru Shinohara
 
Hadoop Pig: MapReduce the easy way!
Hadoop Pig: MapReduce the easy way!Hadoop Pig: MapReduce the easy way!
Hadoop Pig: MapReduce the easy way!Nathan Bijnens
 
Anomaly Detection with Apache Spark
Anomaly Detection with Apache SparkAnomaly Detection with Apache Spark
Anomaly Detection with Apache SparkCloudera, Inc.
 
Bucket your partitions wisely - Cassandra summit 2016
Bucket your partitions wisely - Cassandra summit 2016Bucket your partitions wisely - Cassandra summit 2016
Bucket your partitions wisely - Cassandra summit 2016Markus Höfer
 
Introduction to Apache Hive | Big Data Hadoop Spark Tutorial | CloudxLab
Introduction to Apache Hive | Big Data Hadoop Spark Tutorial | CloudxLabIntroduction to Apache Hive | Big Data Hadoop Spark Tutorial | CloudxLab
Introduction to Apache Hive | Big Data Hadoop Spark Tutorial | CloudxLabCloudxLab
 
HadoopThe Hadoop Java Software Framework
HadoopThe Hadoop Java Software FrameworkHadoopThe Hadoop Java Software Framework
HadoopThe Hadoop Java Software FrameworkThoughtWorks
 
Hive - SerDe and LazySerde
Hive - SerDe and LazySerdeHive - SerDe and LazySerde
Hive - SerDe and LazySerdeZheng Shao
 
Hive vs Pig for HadoopSourceCodeReading
Hive vs Pig for HadoopSourceCodeReadingHive vs Pig for HadoopSourceCodeReading
Hive vs Pig for HadoopSourceCodeReadingMitsuharu Hamba
 

What's hot (20)

Apache cassandra in 2016
Apache cassandra in 2016Apache cassandra in 2016
Apache cassandra in 2016
 
Intro To Cascading
Intro To CascadingIntro To Cascading
Intro To Cascading
 
Practical Hadoop using Pig
Practical Hadoop using PigPractical Hadoop using Pig
Practical Hadoop using Pig
 
Gruter_TECHDAY_2014_04_TajoCloudHandsOn (in Korean)
Gruter_TECHDAY_2014_04_TajoCloudHandsOn (in Korean)Gruter_TECHDAY_2014_04_TajoCloudHandsOn (in Korean)
Gruter_TECHDAY_2014_04_TajoCloudHandsOn (in Korean)
 
Apache avro and overview hadoop tools
Apache avro and overview hadoop toolsApache avro and overview hadoop tools
Apache avro and overview hadoop tools
 
How to use Parquet as a basis for ETL and analytics
How to use Parquet as a basis for ETL and analyticsHow to use Parquet as a basis for ETL and analytics
How to use Parquet as a basis for ETL and analytics
 
Xadoop - new approaches to data analytics
Xadoop - new approaches to data analyticsXadoop - new approaches to data analytics
Xadoop - new approaches to data analytics
 
Operations on rdd
Operations on rddOperations on rdd
Operations on rdd
 
Sql user group
Sql user groupSql user group
Sql user group
 
DMM.com ラボはなぜSparkを採用したのか? レコメンドエンジン開発の裏側をお話します
DMM.com ラボはなぜSparkを採用したのか? レコメンドエンジン開発の裏側をお話しますDMM.com ラボはなぜSparkを採用したのか? レコメンドエンジン開発の裏側をお話します
DMM.com ラボはなぜSparkを採用したのか? レコメンドエンジン開発の裏側をお話します
 
Hadoop Pig: MapReduce the easy way!
Hadoop Pig: MapReduce the easy way!Hadoop Pig: MapReduce the easy way!
Hadoop Pig: MapReduce the easy way!
 
Avro introduction
Avro introductionAvro introduction
Avro introduction
 
Anomaly Detection with Apache Spark
Anomaly Detection with Apache SparkAnomaly Detection with Apache Spark
Anomaly Detection with Apache Spark
 
Ruby on hadoop
Ruby on hadoopRuby on hadoop
Ruby on hadoop
 
Hadoop
HadoopHadoop
Hadoop
 
Bucket your partitions wisely - Cassandra summit 2016
Bucket your partitions wisely - Cassandra summit 2016Bucket your partitions wisely - Cassandra summit 2016
Bucket your partitions wisely - Cassandra summit 2016
 
Introduction to Apache Hive | Big Data Hadoop Spark Tutorial | CloudxLab
Introduction to Apache Hive | Big Data Hadoop Spark Tutorial | CloudxLabIntroduction to Apache Hive | Big Data Hadoop Spark Tutorial | CloudxLab
Introduction to Apache Hive | Big Data Hadoop Spark Tutorial | CloudxLab
 
HadoopThe Hadoop Java Software Framework
HadoopThe Hadoop Java Software FrameworkHadoopThe Hadoop Java Software Framework
HadoopThe Hadoop Java Software Framework
 
Hive - SerDe and LazySerde
Hive - SerDe and LazySerdeHive - SerDe and LazySerde
Hive - SerDe and LazySerde
 
Hive vs Pig for HadoopSourceCodeReading
Hive vs Pig for HadoopSourceCodeReadingHive vs Pig for HadoopSourceCodeReading
Hive vs Pig for HadoopSourceCodeReading
 

Viewers also liked

Spark - Alexis Seigneurin (Français)
Spark - Alexis Seigneurin (Français)Spark - Alexis Seigneurin (Français)
Spark - Alexis Seigneurin (Français)Alexis Seigneurin
 
Lessons Learned: Using Spark and Microservices
Lessons Learned: Using Spark and MicroservicesLessons Learned: Using Spark and Microservices
Lessons Learned: Using Spark and MicroservicesAlexis Seigneurin
 
Building Realtim Data Pipelines with Kafka Connect and Spark Streaming
Building Realtim Data Pipelines with Kafka Connect and Spark StreamingBuilding Realtim Data Pipelines with Kafka Connect and Spark Streaming
Building Realtim Data Pipelines with Kafka Connect and Spark StreamingGuozhang Wang
 
Spark (v1.3) - Présentation (Français)
Spark (v1.3) - Présentation (Français)Spark (v1.3) - Présentation (Français)
Spark (v1.3) - Présentation (Français)Alexis Seigneurin
 
I Heart Log: Real-time Data and Apache Kafka
I Heart Log: Real-time Data and Apache KafkaI Heart Log: Real-time Data and Apache Kafka
I Heart Log: Real-time Data and Apache KafkaJay Kreps
 
Spark, ou comment traiter des données à la vitesse de l'éclair
Spark, ou comment traiter des données à la vitesse de l'éclairSpark, ou comment traiter des données à la vitesse de l'éclair
Spark, ou comment traiter des données à la vitesse de l'éclairAlexis Seigneurin
 
Reactive Databases for Big Data applications
Reactive Databases for Big Data applicationsReactive Databases for Big Data applications
Reactive Databases for Big Data applicationsGraph-TA
 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache SparkRahul Jain
 

Viewers also liked (10)

0712_Seigneurin
0712_Seigneurin0712_Seigneurin
0712_Seigneurin
 
Spark - Alexis Seigneurin (Français)
Spark - Alexis Seigneurin (Français)Spark - Alexis Seigneurin (Français)
Spark - Alexis Seigneurin (Français)
 
Lessons Learned: Using Spark and Microservices
Lessons Learned: Using Spark and MicroservicesLessons Learned: Using Spark and Microservices
Lessons Learned: Using Spark and Microservices
 
Spark - Ippevent 19-02-2015
Spark - Ippevent 19-02-2015Spark - Ippevent 19-02-2015
Spark - Ippevent 19-02-2015
 
Building Realtim Data Pipelines with Kafka Connect and Spark Streaming
Building Realtim Data Pipelines with Kafka Connect and Spark StreamingBuilding Realtim Data Pipelines with Kafka Connect and Spark Streaming
Building Realtim Data Pipelines with Kafka Connect and Spark Streaming
 
Spark (v1.3) - Présentation (Français)
Spark (v1.3) - Présentation (Français)Spark (v1.3) - Présentation (Français)
Spark (v1.3) - Présentation (Français)
 
I Heart Log: Real-time Data and Apache Kafka
I Heart Log: Real-time Data and Apache KafkaI Heart Log: Real-time Data and Apache Kafka
I Heart Log: Real-time Data and Apache Kafka
 
Spark, ou comment traiter des données à la vitesse de l'éclair
Spark, ou comment traiter des données à la vitesse de l'éclairSpark, ou comment traiter des données à la vitesse de l'éclair
Spark, ou comment traiter des données à la vitesse de l'éclair
 
Reactive Databases for Big Data applications
Reactive Databases for Big Data applicationsReactive Databases for Big Data applications
Reactive Databases for Big Data applications
 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache Spark
 

Similar to Spark Data Processing Framework Overview

Enabling exploratory data science with Spark and R
Enabling exploratory data science with Spark and REnabling exploratory data science with Spark and R
Enabling exploratory data science with Spark and RDatabricks
 
Intro to Spark and Spark SQL
Intro to Spark and Spark SQLIntro to Spark and Spark SQL
Intro to Spark and Spark SQLjeykottalam
 
Scylla Summit 2016: Analytics Show Time - Spark and Presto Powered by Scylla
Scylla Summit 2016: Analytics Show Time - Spark and Presto Powered by ScyllaScylla Summit 2016: Analytics Show Time - Spark and Presto Powered by Scylla
Scylla Summit 2016: Analytics Show Time - Spark and Presto Powered by ScyllaScyllaDB
 
Big data week presentation
Big data week presentationBig data week presentation
Big data week presentationJoseph Adler
 
Leveraging Hadoop in your PostgreSQL Environment
Leveraging Hadoop in your PostgreSQL EnvironmentLeveraging Hadoop in your PostgreSQL Environment
Leveraging Hadoop in your PostgreSQL EnvironmentJim Mlodgenski
 
Artigo 81 - spark_tutorial.pdf
Artigo 81 - spark_tutorial.pdfArtigo 81 - spark_tutorial.pdf
Artigo 81 - spark_tutorial.pdfWalmirCouto3
 
Data processing platforms with SMACK: Spark and Mesos internals
Data processing platforms with SMACK:  Spark and Mesos internalsData processing platforms with SMACK:  Spark and Mesos internals
Data processing platforms with SMACK: Spark and Mesos internalsAnton Kirillov
 
TensorFrames: Google Tensorflow on Apache Spark
TensorFrames: Google Tensorflow on Apache SparkTensorFrames: Google Tensorflow on Apache Spark
TensorFrames: Google Tensorflow on Apache SparkDatabricks
 
scalable machine learning
scalable machine learningscalable machine learning
scalable machine learningSamir Bessalah
 
Writing Continuous Applications with Structured Streaming PySpark API
Writing Continuous Applications with Structured Streaming PySpark APIWriting Continuous Applications with Structured Streaming PySpark API
Writing Continuous Applications with Structured Streaming PySpark APIDatabricks
 
Spark overview
Spark overviewSpark overview
Spark overviewLisa Hua
 
ETL with SPARK - First Spark London meetup
ETL with SPARK - First Spark London meetupETL with SPARK - First Spark London meetup
ETL with SPARK - First Spark London meetupRafal Kwasny
 
Spark Application Carousel: Highlights of Several Applications Built with Spark
Spark Application Carousel: Highlights of Several Applications Built with SparkSpark Application Carousel: Highlights of Several Applications Built with Spark
Spark Application Carousel: Highlights of Several Applications Built with SparkDatabricks
 
Keeping Spark on Track: Productionizing Spark for ETL
Keeping Spark on Track: Productionizing Spark for ETLKeeping Spark on Track: Productionizing Spark for ETL
Keeping Spark on Track: Productionizing Spark for ETLDatabricks
 
Real-Time Spark: From Interactive Queries to Streaming
Real-Time Spark: From Interactive Queries to StreamingReal-Time Spark: From Interactive Queries to Streaming
Real-Time Spark: From Interactive Queries to StreamingDatabricks
 
Apache spark-melbourne-april-2015-meetup
Apache spark-melbourne-april-2015-meetupApache spark-melbourne-april-2015-meetup
Apache spark-melbourne-april-2015-meetupNed Shawa
 
Spark Meetup TensorFrames
Spark Meetup TensorFramesSpark Meetup TensorFrames
Spark Meetup TensorFramesJen Aman
 

Similar to Spark Data Processing Framework Overview (20)

Enabling exploratory data science with Spark and R
Enabling exploratory data science with Spark and REnabling exploratory data science with Spark and R
Enabling exploratory data science with Spark and R
 
Meetup ml spark_ppt
Meetup ml spark_pptMeetup ml spark_ppt
Meetup ml spark_ppt
 
Slides
SlidesSlides
Slides
 
Intro to Spark and Spark SQL
Intro to Spark and Spark SQLIntro to Spark and Spark SQL
Intro to Spark and Spark SQL
 
Scylla Summit 2016: Analytics Show Time - Spark and Presto Powered by Scylla
Scylla Summit 2016: Analytics Show Time - Spark and Presto Powered by ScyllaScylla Summit 2016: Analytics Show Time - Spark and Presto Powered by Scylla
Scylla Summit 2016: Analytics Show Time - Spark and Presto Powered by Scylla
 
Big data week presentation
Big data week presentationBig data week presentation
Big data week presentation
 
Leveraging Hadoop in your PostgreSQL Environment
Leveraging Hadoop in your PostgreSQL EnvironmentLeveraging Hadoop in your PostgreSQL Environment
Leveraging Hadoop in your PostgreSQL Environment
 
Apache Spark & Hadoop
Apache Spark & HadoopApache Spark & Hadoop
Apache Spark & Hadoop
 
Artigo 81 - spark_tutorial.pdf
Artigo 81 - spark_tutorial.pdfArtigo 81 - spark_tutorial.pdf
Artigo 81 - spark_tutorial.pdf
 
Data processing platforms with SMACK: Spark and Mesos internals
Data processing platforms with SMACK:  Spark and Mesos internalsData processing platforms with SMACK:  Spark and Mesos internals
Data processing platforms with SMACK: Spark and Mesos internals
 
TensorFrames: Google Tensorflow on Apache Spark
TensorFrames: Google Tensorflow on Apache SparkTensorFrames: Google Tensorflow on Apache Spark
TensorFrames: Google Tensorflow on Apache Spark
 
scalable machine learning
scalable machine learningscalable machine learning
scalable machine learning
 
Writing Continuous Applications with Structured Streaming PySpark API
Writing Continuous Applications with Structured Streaming PySpark APIWriting Continuous Applications with Structured Streaming PySpark API
Writing Continuous Applications with Structured Streaming PySpark API
 
Spark overview
Spark overviewSpark overview
Spark overview
 
ETL with SPARK - First Spark London meetup
ETL with SPARK - First Spark London meetupETL with SPARK - First Spark London meetup
ETL with SPARK - First Spark London meetup
 
Spark Application Carousel: Highlights of Several Applications Built with Spark
Spark Application Carousel: Highlights of Several Applications Built with SparkSpark Application Carousel: Highlights of Several Applications Built with Spark
Spark Application Carousel: Highlights of Several Applications Built with Spark
 
Keeping Spark on Track: Productionizing Spark for ETL
Keeping Spark on Track: Productionizing Spark for ETLKeeping Spark on Track: Productionizing Spark for ETL
Keeping Spark on Track: Productionizing Spark for ETL
 
Real-Time Spark: From Interactive Queries to Streaming
Real-Time Spark: From Interactive Queries to StreamingReal-Time Spark: From Interactive Queries to Streaming
Real-Time Spark: From Interactive Queries to Streaming
 
Apache spark-melbourne-april-2015-meetup
Apache spark-melbourne-april-2015-meetupApache spark-melbourne-april-2015-meetup
Apache spark-melbourne-april-2015-meetup
 
Spark Meetup TensorFrames
Spark Meetup TensorFramesSpark Meetup TensorFrames
Spark Meetup TensorFrames
 

Recently uploaded

Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Igalia
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Scriptwesley chun
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024Rafal Los
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)Gabriella Davis
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024The Digital Insurer
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking MenDelhi Call girls
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slidevu2urc
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsMaria Levchenko
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking MenDelhi Call girls
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsEnterprise Knowledge
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024Results
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreternaman860154
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure servicePooja Nehwal
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxMalak Abu Hammad
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationRadu Cotescu
 
Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Paola De la Torre
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Allon Mureinik
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024The Digital Insurer
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdfhans926745
 

Recently uploaded (20)

Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptx
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 

Spark Data Processing Framework Overview

  • 2. Spark ● Processing of large volumes of data ● Distributed processing on commodity hardware ● Written in Scala, Java and Python bindings
  • 3. History ● 2009: AMPLab, Berkeley University ● June 2013 : "Top-level project" of the Apache foundation ● May 2014: version 1.0.0 ● Currently: version 1.2.0
  • 4. Use cases ● Logs analysis ● Processing of text files ● Analytics ● Distributed search (Google, before) ● Fraud detection ● Product recommendation
  • 5. ● Same use cases ● Same development model: MapReduce ● Integration with the ecosystem Proximity with Hadoop
  • 6. Simpler than Hadoop ● API simpler to learn ● “Relaxed” MapReduce ● Spark Shell: interactive processing
  • 7. Faster than Hadoop Spark officially sets a new record in large-scale sorting (5th November 2014) ● Sorting 100 To of data ● Hadoop MR: 72 minutes ○ With 2100 noeuds (50400 cores) ● Spark: 23 minutes ○ With 206 noeuds (6592 cores)
  • 8. Spark ecosystem ● Spark ● Spark Shell ● Spark Streaming ● Spark SQL ● Spark ML ● GraphX
  • 9. Integration ● Yarn, Zookeeper, Mesos ● HDFS ● Cassandra ● Elasticsearch ● MongoDB
  • 11. ● Resilient Distributed Dataset ● Abstraction of a collection processed in parallel ● Fault tolerant ● Can work with tuples: ○ Key - Value ○ Tuples must be independent from each other RDD
  • 12. Sources ● Files on HDFS ● Local files ● Collection in memory ● Amazon S3 ● NoSQL database ● ... ● Or a custom implementation of InputFormat
  • 13. Transformations ● Processes an RDD, returns another RDD ● Lazy! ● Examples : ○ map(): one value → another value ○ mapToPair(): one value → a tuple ○ filter(): filters values/tuples given a condition ○ groupByKey(): groups values by key ○ reduceByKey(): aggregates values by key ○ join(), cogroup()...: joins two RDDs
  • 14. Actions ● Does not return an RDD ● Examples: ○ count(): counts values/tuples ○ saveAsHadoopFile(): saves results in Hadoop’s format ○ foreach(): applies a function on each item ○ collect(): retrieves values in a list (List<T>)
  • 16. ● Trees of Paris: CSV file, Open Data ● Count of trees by specie Spark - Example geom_x_y;circonfere;adresse;hauteurenm;espece;varieteouc;dateplanta 48.8648454814, 2.3094155344;140.0;COURS ALBERT 1ER;10.0;Aesculus hippocastanum;; 48.8782668139, 2.29806967519;100.0;PLACE DES TERNES;15.0;Tilia platyphyllos;; 48.889306184, 2.30400164126;38.0;BOULEVARD MALESHERBES;0.0;Platanus x hispanica;; 48.8599934405, 2.29504883623;65.0;QUAI BRANLY;10.0;Paulownia tomentosa;;1996-02-29 ...
  • 17. Spark - Example JavaSparkContext sc = new JavaSparkContext("local", "arbres"); sc.textFile("data/arbresalignementparis2010.csv") .filter(line -> !line.startsWith("geom")) .map(line -> line.split(";")) .mapToPair(fields -> new Tuple2<String, Integer>(fields[4], 1)) .reduceByKey((x, y) -> x + y) .sortByKey() .foreach(t -> System.out.println(t._1 + " : " + t._2)); [... ; … ; …] [... ; … ; …] [... ; … ; …] [... ; … ; …] [... ; … ; …] [... ; … ; …] u m k m a a textFile mapToPairmap reduceByKey foreach 1 1 1 1 1 u m k 1 2 1 2a ... ... ... ... filter ... ... sortByKey a m 2 1 2 1u ... ... ... ... ... ... geom;... 1 k
  • 18. Spark - Example Acacia dealbata : 2 Acer acerifolius : 39 Acer buergerianum : 14 Acer campestre : 452 ...
  • 20. Topology & Terminology ● One master / several workers ○ (+ one standby master) ● Submit an application to the cluster ● Execution managed by a driver
  • 21. Spark in a cluster Several options ● YARN ● Mesos ● Standalone ○ Workers started manually ○ Workers started by the master
  • 22. MapReduce ● Spark (API) ● Distributed processing ● Fault tolerant Storage ● HDFS, base NoSQL... ● Distributed storage ● Fault tolerant Storage & Processing
  • 23. Data locality ● Process the data where it is stored ● Avoid network I/Os
  • 25. Demo Spark in a cluster
  • 26. Demo $ $SPARK_HOME/sbin/start-master.sh $ $SPARK_HOME/bin/spark-class org.apache.spark.deploy.worker.Worker spark://MBP-de-Alexis:7077 --cores 2 --memory 2G $ mvn clean package $ $SPARK_HOME/bin/spark-submit --master spark://MBP-de-Alexis:7077 --class com.seigneurin.spark.WikipediaMapReduceByKey --deploy-mode cluster target/pres-spark-0.0.1-SNAPSHOT.jar
  • 28. ● Usage of an RDD in SQL ● SQL engine: converts SQL instructions to low-level instructions Spark SQL
  • 29. Spark SQL Prerequisites: ● Use tabular data ● Describe the schema → SchemaRDD Describing the schema : ● Programmatic description of the data ● Schema inference through reflection (POJO)
  • 30. JavaRDD<Row> rdd = trees.map(fields -> Row.create( Float.parseFloat(fields[3]), fields[4])); ● Creating tabular data (type Row) Spark SQL - Example --------------------------------------- | 10.0 | Aesculus hippocastanum | | 15.0 | Tilia platyphyllos | | 0.0 | Platanus x hispanica | | 10.0 | Paulownia tomentosa | | ... | ... |
  • 31. Spark SQL - Example List<StructField> fields = new ArrayList<StructField>(); fields.add(DataType.createStructField("hauteurenm", DataType.FloatType, false)); fields.add(DataType.createStructField("espece", DataType.StringType, false)); StructType schema = DataType.createStructType(fields); JavaSchemaRDD schemaRDD = sqlContext.applySchema(rdd, schema); schemaRDD.registerTempTable("tree"); --------------------------------------- | hauteurenm | espece | --------------------------------------- | 10.0 | Aesculus hippocastanum | | 15.0 | Tilia platyphyllos | | 0.0 | Platanus x hispanica | | 10.0 | Paulownia tomentosa | | ... | ... | ● Describing the schema
  • 32. ● Counting trees by specie Spark SQL - Example sqlContext.sql("SELECT espece, COUNT(*) FROM tree WHERE espece <> '' GROUP BY espece ORDER BY espece") .foreach(row -> System.out.println(row.getString(0)+" : "+row.getLong(1))); Acacia dealbata : 2 Acer acerifolius : 39 Acer buergerianum : 14 Acer campestre : 452 ...
  • 34. Micro-batches ● Slices a continuous flow of data into batches ● Same API ● ≠ Apache Storm
  • 35. DStream ● Discretized Streams ● Sequence of RDDs ● Initialized with a Duration
  • 36. Window operations ● Sliding window ● Reuses data from other windows ● Initialized with a window length and a slide interval
  • 37. Sources ● Socket ● Kafka ● Flume ● HDFS ● MQ (ZeroMQ...) ● Twitter ● ... ● Or a custom implementation of Receiver
  • 39. Spark Streaming Demo ● Receive Tweets with hashtag #Android ○ Twitter4J ● Detection of the language of the Tweet ○ Language Detection ● Indexing with Elasticsearch ● Reporting with Kibana 4
  • 40. $ curl -X DELETE localhost:9200 $ curl -X PUT localhost:9200/spark/_mapping/tweets '{ "tweets": { "properties": { "user": {"type": "string","index": "not_analyzed"}, "text": {"type": "string"}, "createdAt": {"type": "date","format": "date_time"}, "language": {"type": "string","index": "not_analyzed"} } } }' ● Launch ElasticSearch Demo ● Launch Kibana -> http://localhost:5601 ● Launch the Spark Streaming process