Ericsson Internal | 2015-08-11 | Page 2
AGENDA
• What?
• Why?
• How?
• Demo
• EDR Analytics
Spark eco-system
Technology landscape
“Fast and general engine for big data processing, with libraries for SQL, streaming, and advanced analytics (machine learning)”
WHAT?
Originally developed in 2009 at UC Berkeley’s AMPLab
Fully open sourced in 2010; now at the Apache Software Foundation
http://spark.apache.org
Spark is the Most Active Open Source Project in Big Data
[Chart: project contributors in the past year, with bars labeled Giraph, Storm, and Tez; axis from 0 to 140]
The Spark Community
[Figure: distributors and applications]
2015 SNAPSHOT
WHY SPARK?
• Speed: Run programs up to 100x faster than Hadoop MapReduce in memory, or 10x faster on disk.
• Ease of Use: Supports different languages for developing applications using Spark.
• Generality: Combine SQL, streaming, and complex analytics into one platform.
• Runs Everywhere: Spark runs on Hadoop, Mesos, standalone, or in the cloud.
Easy: Get Started Immediately
Interactive Shell
Monitoring
FEATURE COMPARISON
Source: Daytona GraySort benchmark, sortbenchmark.org
WORD COUNT
Spark eco-system
• Libraries: Spark Streaming, Spark SQL, GraphX, MLlib
• Spark Core Engine (Scala/Java/Python)
• Cluster Manager: Local, Standalone cluster, YARN, Mesos
• Persistence: …
SPARK ON HDFS
ECOSYSTEM
                      HADOOP          SPARK
SQL query interface   HIVE            SPARK SQL
Machine learning      APACHE MAHOUT   MLLIB
Graph processing      APACHE GIRAPH   GRAPHX
Streaming             APACHE STORM    SPARK STREAMING
HOW?
So, HOW is it BETTER?
THE BIG QUESTION?
Is Spark going to replace Hadoop?
Answer: Yes. Spark will be used on top of Hadoop and will replace MapReduce.
Reasons:
1. Hadoop MapReduce cannot handle real-time processing
2. Hadoop MapReduce is slower than Spark running on Hadoop
3. With the rise of IoT, Spark is a must
RDD & SPARK COMPONENTS
Technology landscape
Spark eco-system
RESILIENT Distributed Dataset
RDDs track lineage information that can be used to efficiently re-compute lost data.
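The lineage idea can be sketched without a cluster. The following is a toy model in plain Scala, not Spark's implementation: the names `ToyRdd`, `Source`, `Mapped`, `Filtered`, `compute`, and `LineageDemo` are hypothetical. Each dataset records only its parent and the function that derives it, so a lost partition can be rebuilt by replaying those steps from the source.

```scala
// Toy model of lineage-based recovery (hypothetical names, not the Spark API).
// A dataset stores how it was derived instead of storing its data.
sealed trait ToyRdd[A] {
  // Recompute the contents of one partition by replaying the lineage.
  def compute(partition: Int): Vector[A]
}

// Leaf of the lineage graph: data that can always be re-read from stable storage.
final case class Source[A](partitions: Vector[Vector[A]]) extends ToyRdd[A] {
  def compute(partition: Int): Vector[A] = partitions(partition)
}

final case class Mapped[A, B](parent: ToyRdd[A], f: A => B) extends ToyRdd[B] {
  def compute(partition: Int): Vector[B] = parent.compute(partition).map(f)
}

final case class Filtered[A](parent: ToyRdd[A], p: A => Boolean) extends ToyRdd[A] {
  def compute(partition: Int): Vector[A] = parent.compute(partition).filter(p)
}

object LineageDemo {
  val source: ToyRdd[Int] = Source(Vector(Vector(1, 2, 3), Vector(4, 5, 6)))
  val derived: ToyRdd[Int] =
    Filtered(Mapped(source, (x: Int) => x * 10), (x: Int) => x > 20)

  // Pretend the in-memory copy of partition 1 was lost with its worker:
  // only that partition is rebuilt, by replaying map and filter on its source data.
  val recovered: Vector[Int] = derived.compute(1)

  def main(args: Array[String]): Unit =
    println(recovered) // Vector(40, 50, 60)
}
```

Only the requested partition is recomputed; partition 0 is never touched during recovery, which is the property the slide calls efficient.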
Partitions in the cluster
[Figure: an RDD split into partitions spread across cluster nodes labeled SparkM and SparkW. Credit: @doanduy]
RDD TRANSFORMATIONS & ACTIONS
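PARTITION TRANSFORMATION
• map(tuple => (tuple._3, tuple)): direct transformation, applied within each partition
• groupByKey(): shuffle, records with the same key are moved to the same partition
• countByKey(): counts the records for each key
[Figure: RDD partitions, with arrows for direct transformations and shuffles]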
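To make the three operations concrete, here is a local sketch using plain Scala collections in place of an RDD. The sample tuples and the object name `KeyingDemo` are made up for illustration; `groupBy` and the per-key count stand in for what `groupByKey()` and `countByKey()` return on a pair RDD, without any partitioning or shuffle actually taking place.

```scala
// Local stand-in for the slide's pipeline (illustrative data, no Spark involved).
object KeyingDemo {
  // (user, action, country): the third field becomes the key.
  val records: Seq[(String, String, String)] = Seq(
    ("alice", "login", "SE"),
    ("bob", "login", "IN"),
    ("carol", "logout", "SE")
  )

  // Same shape as map(tuple => (tuple._3, tuple)): each element is handled on its own.
  val keyed: Seq[(String, (String, String, String))] =
    records.map(tuple => (tuple._3, tuple))

  // Analogue of groupByKey(): all values sharing a key must be brought together,
  // which on a cluster means moving records between partitions (a shuffle).
  val grouped: Map[String, Seq[(String, String, String)]] =
    keyed.groupBy(_._1).map { case (key, pairs) => key -> pairs.map(_._2) }

  // Analogue of countByKey(): number of records per key.
  val counts: Map[String, Long] =
    keyed.groupBy(_._1).map { case (key, pairs) => key -> pairs.size.toLong }

  def main(args: Array[String]): Unit =
    println(counts) // per-key counts: SE -> 2, IN -> 1 (map ordering is not guaranteed)
}
```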
Stages
A shuffle operation delimits the "shuffle" frontier between Stage 1 and Stage 2.
[Figure credit: @doanduy]
SPARK COMPONENTS
SPARK STREAMING
SPARK SQL
Let’s try some examples…
Spark Shell
./bin/spark-shell --master local[2]
The --master option specifies the master URL for a distributed cluster, or local to run locally with one thread, or local[N] to run locally with N threads. You should start by using local for testing.
Basic operations…
scala> val textFile = sc.textFile("../README.md")
textFile: spark.RDD[String] = spark.MappedRDD@2ee9b6e3
scala> textFile.count() // Number of items in this RDD
res0: Long = 126
scala> textFile.first() // First item in this RDD
res1: String = # Apache Spark
scala> val linesWithSpark = textFile.filter(line => line.contains("Spark"))
Simpler:
scala> textFile.filter(line => line.contains("Spark")).count() // How many lines contain "Spark"?
res3: Long = 15
Map - Reduce
scala> textFile.map(line => line.split(" ").size).reduce((a, b) => if (a > b) a else b)
res4: Int = 15
scala> import java.lang.Math
scala> textFile.map(line => line.split(" ").size).reduce((a, b) => Math.max(a, b))
res5: Int = 15
scala> val wordCounts = textFile.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey((a, b) => a + b)
wordCounts: spark.RDD[(String, Int)] = spark.ShuffledAggregatedRDD@71f027b8
scala> wordCounts.collect()
With Caching…
scala> linesWithSpark.cache()
res7: spark.RDD[String] = spark.FilteredRDD@17e51082
scala> linesWithSpark.count()
res8: Long = 15
scala> linesWithSpark.count()
res9: Long = 15
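Both counts return 15; the difference is that the second one can reuse the cached data instead of re-reading and re-filtering the file. The effect can be imitated locally with plain Scala collections. This is an analogy under stated assumptions, not Spark code: a `def` that rebuilds a filtered iterator plays the role of an uncached RDD, `toVector` plays the role of `cache()` followed by the first action, and `CacheDemo`, `containsSpark`, and `evaluations` are hypothetical names.

```scala
// Analogy only: an uncached dataset re-runs its filter for every action,
// while a materialized Vector is computed once and then reused.
object CacheDemo {
  var evaluations = 0 // counts how many times the predicate is evaluated

  val lines: Vector[String] = Vector("# Apache Spark", "some text", "Spark is fast")

  def containsSpark(line: String): Boolean = {
    evaluations += 1
    line.contains("Spark")
  }

  // Uncached: every call rebuilds the filtered data from the source lines.
  def uncached: Iterator[String] = lines.iterator.filter(containsSpark)
  val first: Int = uncached.size  // evaluates the predicate 3 times
  val second: Int = uncached.size // evaluates it 3 more times
  val evaluationsUncached: Int = evaluations

  // "Cached": materialize once, then every later count reads the stored result.
  evaluations = 0
  val cached: Vector[String] = uncached.toVector // 3 evaluations, done once
  val third: Int = cached.size
  val fourth: Int = cached.size
  val evaluationsCached: Int = evaluations

  def main(args: Array[String]): Unit =
    println(s"uncached: $evaluationsUncached, cached: $evaluationsCached") // uncached: 6, cached: 3
}
```

In this toy run the two uncached counts cost six predicate evaluations, while materializing once and counting twice costs three.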
With HDFS…
val lines = sc.textFile("hdfs://...")
val errors = lines.filter(line => line.startsWith("ERROR"))
println("Total errors: " + errors.count())
Job Submission
$SPARK_HOME/bin/spark-submit \
  --class "SimpleApp" \
  --master local[4] \
  target/scala-2.10/simple-project_2.10-1.0.jar
Configuration
val conf = new SparkConf()
.setMaster("local")
.setAppName("CountingSheep")
.set("spark.executor.memory", "1g")
val sc = new SparkContext(conf)
SQL to RDD Translation
Projection & selection
SELECT name, age
FROM people
WHERE age >= 13 AND age <= 19
val people: RDD[Person]
val teenagers: RDD[(String, Int)] = people
  .filter(p => p.age >= 13 && p.age <= 19) // selection (WHERE)
  .map(p => (p.name, p.age)) // projection (SELECT)
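The same projection and selection can be tried locally. The sketch below swaps the RDD for a plain Scala `Seq`, whose `filter` and `map` have the same shape; the `Person` fields, the sample rows, and the name `TeenagersDemo` are assumptions made for illustration. It also shows what changes when the projection runs first: the elements are then `(name, age)` tuples, so the predicate can no longer refer to `p.age`.

```scala
// Local sketch with a Seq standing in for RDD[Person] (sample data is made up).
final case class Person(name: String, age: Int, city: String)

object TeenagersDemo {
  val people: Seq[Person] = Seq(
    Person("Ann", 12, "Lund"),
    Person("Bo", 15, "Kista"),
    Person("Cy", 19, "Pune"),
    Person("Di", 20, "Oslo")
  )

  // WHERE age >= 13 AND age <= 19, then SELECT name, age
  val teenagers: Seq[(String, Int)] =
    people
      .filter(p => p.age >= 13 && p.age <= 19) // selection
      .map(p => (p.name, p.age))               // projection

  // Projection first also works, but the predicate now sees tuples, not Person.
  val teenagersProjectedFirst: Seq[(String, Int)] =
    people
      .map(p => (p.name, p.age))
      .filter { case (_, age) => age >= 13 && age <= 19 }

  def main(args: Array[String]): Unit =
    println(teenagers) // List((Bo,15), (Cy,19))
}
```

Filtering before projecting keeps the predicate readable and avoids building tuples for rows that are discarded anyway.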
THANK YOU

Apache spark linkedin
