11.07.2016 | Spark tutorial | A. Panchenko, G. Hintz, S. Remus
Getting started in Apache Spark
and Flink (with Scala)
Alexander Panchenko, Gerold Hintz, Steffen Remus
Outline
 Scala
 basics of Scala programming language
 Spark
 motivation / what do you get on top of MapReduce
 basics of Spark: RDDs, transformations, actions, shuffling
 “tricks” useful in Spark context
 Spark Hands-on session
 run a Spark notebook and solve easy tasks
 set up a Spark project & submit a job to a cluster
 Flink
 theory
 difference from Spark
Three main benefits of using Spark
1. Spark is easy to use—you can develop applications on your laptop,
using a high-level API
2. Spark is fast, enabling interactive use and complex algorithms
3. Spark is a general engine, letting you combine multiple types of
computations (e.g., SQL queries, text processing, and machine
learning) that might previously have required different engines.
This tutorial is based on the book by the creators of Spark:
Karau H., Konwinski A., Wendell P., Zaharia M. “Learning
Spark. Lightning-fast Data Analysis.” O’Reilly. 2015
Data Science Tasks
Experimentation: development of the model
 Python, MATLAB, R
 IPython notebooks
 Interactive computing
 Easy-to-use
Production: using the model
 Java, Scala, C++/C
 Unit tests
 Fault tolerance
 No interactive computing
 Scalability
Scala + Spark can be used for both!
A Brief History of Spark
 Spark is an open source project
 Spark started in 2009 as a research project in the UC Berkeley RAD Lab
 Research papers about Spark were published at academic conferences
soon after its creation in 2009
 In 2011, the AMPLab started to develop higher-level components on
Spark, such as Shark (Hive on Spark) and Spark Streaming
 Currently one of the most active projects written in the Scala language
What Is Apache Spark?
 Spark Core: resilient distributed dataset (RDD)
 Spark SQL: Hive tables, Parquet, JSON, Datasets
What Is Apache Spark?
 Components for distributed execution in Spark
Spark Runtime Architecture
The components of a distributed Spark application
Spark Runtime Architecture
 Spark uses a master/slave architecture with one central coordinator and
many distributed workers
 The central coordinator is called the driver
 The driver communicates with distributed workers called executors
 The driver is the process where the main() method of your program runs
 The driver is responsible for:
 converting a user program into tasks
 scheduling tasks on executors
Downloading Spark and Getting Started
 Download a version “Pre-built for Hadoop 2.X and later”:
http://spark.apache.org/downloads.html
 Files and directories that come with Spark:
 README.md
 Contains short instructions for getting started with Spark.
 bin
 Contains executable files that can be used to interact with Spark in various
ways (e.g., the Spark shell, which we will cover later in this tutorial).
 core, streaming, python, …
 Contains the source code of major components of the Spark project.
 examples
 Contains some helpful Spark standalone jobs that you can look at and run to
learn about the Spark API.
Introduction to Spark’s Scala Shell
 Run: bin/spark-shell
 Type the Scala line count example into the shell
 We can run parallel operations on the RDD, such as counting the lines of
text in the file or printing the first one
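A minimal sketch of the line count, assuming the shell was started from the Spark directory so that README.md is in the working directory. In spark-shell the context `sc` already exists; the `getOrCreate` line (an addition for this sketch) reuses it and only starts a local context when the code runs outside the shell.

```scala
import org.apache.spark.{SparkConf, SparkContext}
val sc = SparkContext.getOrCreate(new SparkConf().setAppName("demo").setMaster("local[*]"))

val lines = sc.textFile("README.md") // RDD[String] with one element per line
println(lines.count())               // number of lines in the file
println(lines.first())               // first line of the file
```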
Filtering: lambda functions
 Filtering example (Scala):
 Filtering example (Java 7):
 Filtering example (Java 8):
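The Scala variant of the filtering example might look as follows. This is a sketch: the three input lines are made up for illustration, and the local context setup is an assumption that is unnecessary inside spark-shell.

```scala
import org.apache.spark.{SparkConf, SparkContext}
val sc = SparkContext.getOrCreate(new SparkConf().setAppName("demo").setMaster("local[*]"))

val lines = sc.parallelize(Seq("Spark is fast", "Hello world", "Learning Spark"))
// The lambda is shipped to the executors and applied to every element.
val sparkLines = lines.filter(line => line.contains("Spark"))
sparkLines.collect().foreach(println)
```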
Standalone Spark Applications
 Link to Spark (Maven or SBT), e.g.:
 Write a sample class, e.g. word count:
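A word count class could be sketched as below. The object name `WordCount`, the helper `countWords` and the two command-line arguments are illustrative choices, not taken from the slides. The master URL is deliberately not set in code so that spark-submit can supply it.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object WordCount {
  // Split lines into words and sum a 1 for every occurrence of a word.
  def countWords(lines: RDD[String]): RDD[(String, Int)] =
    lines.flatMap(_.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)

  def main(args: Array[String]): Unit = {
    val inputFile = args(0) // hypothetical argument: path of the text to read
    val outputDir = args(1) // hypothetical argument: directory for the counts
    val sc = new SparkContext(new SparkConf().setAppName("wordCount"))
    countWords(sc.textFile(inputFile)).saveAsTextFile(outputDir)
    sc.stop()
  }
}
```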
Standalone Spark Applications
 SBT build file
 Build JAR and run it:
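A possible build file and the matching commands are sketched below. The project name, the version numbers and the main class `WordCount` are assumptions; pick the Spark and Scala versions that match your cluster. The dependency is marked "provided" because spark-submit puts Spark on the classpath at run time.

```scala
// build.sbt (all names and versions here are assumptions)
name := "spark-tutorial"
version := "0.1.0"
scalaVersion := "2.11.8"
libraryDependencies += "org.apache.spark" %% "spark-core" % "2.0.0" % "provided"

// Build the JAR and run it (shell commands, shown as comments):
//   sbt package
//   bin/spark-submit --class WordCount --master "local[*]" \
//     target/scala-2.11/spark-tutorial_2.11-0.1.0.jar input.txt output
```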
Programming with RDDs
RDD -- Resilient Distributed Dataset
 Immutable distributed collection of objects
 Each RDD is split into multiple partitions
 Partitions may be computed on different nodes
Creating an RDD
 Loading an external dataset
 Distributing a collection of objects
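Both ways of creating an RDD, as a sketch. The file name and the collection contents are illustrative, and the local context setup is an assumption that is unnecessary in spark-shell. The second argument of `parallelize` requests a number of partitions.

```scala
import org.apache.spark.{SparkConf, SparkContext}
val sc = SparkContext.getOrCreate(new SparkConf().setAppName("demo").setMaster("local[*]"))

// Loading an external dataset: nothing is read until an action runs.
val fromFile = sc.textFile("README.md")

// Distributing a local collection.
val fromCollection = sc.parallelize(List("pandas", "i like pandas"))

// Asking explicitly for 4 partitions.
val numbers = sc.parallelize(1 to 100, 4)
println(numbers.partitions.length)
```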
Programming with RDDs
 Once created, RDDs offer two types of operations:
 Transformations
 Actions
 Transformations construct a new RDD from a previous one
 Actions compute a result based on an RDD and
 either return it to the driver program
 or save it to an external storage system, e.g. HDFS
 RDDs are recomputed each time you run an action
 To reuse an RDD across several actions, persist it in memory with persist() or cache()
Spark Execution Steps (Shell & Standalone)
1. Create some input RDDs from external data.
2. Transform them to define new RDDs using transformations like filter().
3. Persist any intermediate RDDs that will need to be reused.
4. Launch actions such as count() and first() to kick off a parallel
computation, which is then optimized and executed by Spark.
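The four steps can be sketched on a tiny in-memory log. The log lines are made up and stand in for external data; the local context setup is an assumption that is unnecessary in spark-shell.

```scala
import org.apache.spark.{SparkConf, SparkContext}
val sc = SparkContext.getOrCreate(new SparkConf().setAppName("demo").setMaster("local[*]"))

// 1. Create an input RDD.
val input = sc.parallelize(Seq("INFO start", "ERROR disk full", "WARN slow", "ERROR timeout"))
// 2. Define a new RDD with a transformation (nothing is computed yet).
val errors = input.filter(_.contains("ERROR"))
// 3. Persist the intermediate RDD because two actions will use it.
errors.persist()
// 4. Launch actions; only now does Spark run the computation.
println(errors.count()) // 2
println(errors.first()) // ERROR disk full
```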
RDD Operations: Transformations
 The filter() operation does not mutate the existing inputRDD
 It returns a pointer to an entirely new RDD
 inputRDD can still be reused later in the program, e.g.:
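A sketch of such reuse: the same inputRDD feeds two filters whose results are then combined. The variable names other than inputRDD and the sample lines are illustrative.

```scala
import org.apache.spark.{SparkConf, SparkContext}
val sc = SparkContext.getOrCreate(new SparkConf().setAppName("demo").setMaster("local[*]"))

val inputRDD = sc.parallelize(Seq("INFO start", "ERROR disk full", "WARN slow", "ERROR timeout"))
val errorsRDD = inputRDD.filter(_.contains("ERROR"))
val warningsRDD = inputRDD.filter(_.contains("WARN")) // inputRDD is unchanged and reusable
val badLinesRDD = errorsRDD.union(warningsRDD)
println(badLinesRDD.count()) // 3
println(inputRDD.count())    // still 4
```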
RDD Operations: Actions
Return some result and launch actual computation:
 take() to retrieve a small number of elements
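A small sketch of count() and take() on generated data (the data is an assumption). take(n) brings only n elements to the driver, whereas collect() would bring the whole RDD and should be used only when the result fits in the driver's memory.

```scala
import org.apache.spark.{SparkConf, SparkContext}
val sc = SparkContext.getOrCreate(new SparkConf().setAppName("demo").setMaster("local[*]"))

val badLines = sc.parallelize(1 to 1000).map(i => s"ERROR line $i")
println(s"Input had ${badLines.count()} concerning lines")
println("Here are 3 examples:")
badLines.take(3).foreach(println)
```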
Common Transformations and Actions
Element-wise transformations
 Mapped and filtered RDD from an input RDD:
 Squaring the values in an RDD:
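Both element-wise transformations in a short sketch; the input values are illustrative and the local context setup is an assumption.

```scala
import org.apache.spark.{SparkConf, SparkContext}
val sc = SparkContext.getOrCreate(new SparkConf().setAppName("demo").setMaster("local[*]"))

val input = sc.parallelize(List(1, 2, 3, 4))
val squared = input.map(x => x * x)       // one output element per input element
val notOne = input.filter(x => x != 1)    // keeps elements that pass the predicate
println(squared.collect().mkString(","))  // 1,4,9,16
println(notOne.collect().mkString(","))   // 2,3,4
```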
Common Transformations and Actions
Element-wise transformations
 Splitting lines into multiple words:
 Difference between flatMap() and map() on an RDD:
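The difference can be sketched on two made-up lines: map() yields one array per line, while flatMap() flattens the arrays into a single RDD of words.

```scala
import org.apache.spark.{SparkConf, SparkContext}
val sc = SparkContext.getOrCreate(new SparkConf().setAppName("demo").setMaster("local[*]"))

val lines = sc.parallelize(List("hello world", "hi"))
val arrays = lines.map(_.split(" "))    // RDD[Array[String]] with 2 elements
val words = lines.flatMap(_.split(" ")) // RDD[String] with 3 elements
println(arrays.count())                 // 2
println(words.collect().mkString(","))  // hello,world,hi
```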
Common Transformations and Actions
Some simple set operations:
Common Transformations and Actions
Basic RDD transformations on an RDD containing {1, 2, 3, 3}:
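A sketch of the basic transformations on this RDD. The lambdas are illustrative choices; the output of distinct() is sorted here because its order is not guaranteed.

```scala
import org.apache.spark.{SparkConf, SparkContext}
val sc = SparkContext.getOrCreate(new SparkConf().setAppName("demo").setMaster("local[*]"))

val rdd = sc.parallelize(List(1, 2, 3, 3))
val plusOne = rdd.map(_ + 1)              // {2, 3, 4, 4}
val ranges = rdd.flatMap(x => x.to(3))    // {1, 2, 3, 2, 3, 3, 3}
val notOne = rdd.filter(_ != 1)           // {2, 3, 3}
val unique = rdd.distinct()               // {1, 2, 3}, requires a shuffle
// Random subset without replacement; the result differs from run to run.
val sampled = rdd.sample(withReplacement = false, fraction = 0.5)
println(unique.collect().sorted.mkString(","))
```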
Common Transformations and Actions
Two-RDD transformations on RDDs containing {1, 2, 3} and {3, 4, 5}:
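The set-like operations on these two RDDs, as a sketch. Results are sorted before printing because the element order after a shuffle is not guaranteed.

```scala
import org.apache.spark.{SparkConf, SparkContext}
val sc = SparkContext.getOrCreate(new SparkConf().setAppName("demo").setMaster("local[*]"))

val rdd = sc.parallelize(List(1, 2, 3))
val other = sc.parallelize(List(3, 4, 5))
val all = rdd.union(other)           // {1, 2, 3, 3, 4, 5}, duplicates are kept
val common = rdd.intersection(other) // {3}
val onlyLeft = rdd.subtract(other)   // {1, 2}
val pairs = rdd.cartesian(other)     // all 9 combinations, expensive on large RDDs
println(all.collect().sorted.mkString(","))
```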
Common Transformations and Actions
Basic actions on an RDD containing {1, 2, 3, 3}:
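A sketch of the actions that return elements or counts to the driver, on the RDD named in the slide title. The local context setup is an assumption.

```scala
import org.apache.spark.{SparkConf, SparkContext}
val sc = SparkContext.getOrCreate(new SparkConf().setAppName("demo").setMaster("local[*]"))

val rdd = sc.parallelize(List(1, 2, 3, 3))
val everything = rdd.collect()     // Array(1, 2, 3, 3)
val size = rdd.count()             // 4
val histogram = rdd.countByValue() // Map(1 -> 1, 2 -> 1, 3 -> 2)
val firstTwo = rdd.take(2)         // Array(1, 2)
val largestTwo = rdd.top(2)        // Array(3, 3)
val smallestTwo = rdd.takeOrdered(2) // Array(1, 2)
println(histogram)
```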
Common Transformations and Actions
Basic actions on an RDD containing {1, 2, 3, 3}:
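The aggregating actions can be sketched in the same way. The average via aggregate() is an illustrative use: the accumulator is a (sum, count) pair that is built per partition and then merged.

```scala
import org.apache.spark.{SparkConf, SparkContext}
val sc = SparkContext.getOrCreate(new SparkConf().setAppName("demo").setMaster("local[*]"))

val rdd = sc.parallelize(List(1, 2, 3, 3))
val sum = rdd.reduce(_ + _)         // 9
val sumWithZero = rdd.fold(0)(_ + _) // 9, like reduce but with a zero value
val (total, n) = rdd.aggregate((0, 0))(
  (acc, x) => (acc._1 + x, acc._2 + 1),  // add one element to a partition's accumulator
  (a, b) => (a._1 + b._1, a._2 + b._2))  // merge accumulators of two partitions
val average = total / n.toDouble
println(average) // 2.25
```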
Persistence (Caching)
Double execution (without persistence) vs. reusing the result (with persistence)
 Persistence levels:
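The effect of persistence can be made visible with an accumulator that counts how often the map function runs. This is a sketch that assumes Spark 2.0 or later (for `longAccumulator`) and a local run without task retries; the numbers in the comments hold under that assumption. The levels defined in `StorageLevel` include MEMORY_ONLY, MEMORY_ONLY_SER, MEMORY_AND_DISK, MEMORY_AND_DISK_SER and DISK_ONLY.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel
val sc = SparkContext.getOrCreate(new SparkConf().setAppName("demo").setMaster("local[*]"))

val evaluations = sc.longAccumulator("evaluations")
val squares = sc.parallelize(List(1, 2, 3, 4)).map { x => evaluations.add(1); x * x }

// Double execution: each action recomputes the map over all 4 elements.
squares.count()
squares.collect()
val withoutPersist = evaluations.value.longValue // 8

// Reusing the result: the first action fills the cache, the second reads it.
squares.persist(StorageLevel.MEMORY_ONLY)
squares.count()
squares.collect()
val withPersist = evaluations.value.longValue - withoutPersist // 4
println(s"without persist: $withoutPersist, with persist: $withPersist")
```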
Working with Key/Value Pairs
 Pair RDDs are a useful building block in many programs
 Allow you to act on each key in parallel or regroup data
 For instance:
 reduceByKey() method that can aggregate data for each key
 join() method that can merge two RDDs by grouping elements with the same
key
 Creating pair RDDs means creating RDDs of Scala tuples
 Creating a pair RDD using the first word as the key
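A sketch of keying each line by its first word; the two input lines are made up for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}
val sc = SparkContext.getOrCreate(new SparkConf().setAppName("demo").setMaster("local[*]"))

val lines = sc.parallelize(List("holden likes coffee", "panda likes long strings and coffee"))
// An RDD of 2-tuples is a pair RDD: the first component is the key.
val pairs = lines.map(line => (line.split(" ")(0), line))
println(pairs.keys.collect().mkString(",")) // holden,panda
```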
Transformations on Pair RDDs
 Transformations on one pair RDD (example: {(1, 2), (3, 4), (3, 6)})
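Some common single-RDD pair transformations on the example data, as a sketch. The comments show the contents as sets because the order after a shuffle is not guaranteed.

```scala
import org.apache.spark.{SparkConf, SparkContext}
val sc = SparkContext.getOrCreate(new SparkConf().setAppName("demo").setMaster("local[*]"))

val rdd = sc.parallelize(List((1, 2), (3, 4), (3, 6)))
val summed = rdd.reduceByKey(_ + _)     // {(1, 2), (3, 10)}
val grouped = rdd.groupByKey()          // {(1, [2]), (3, [4, 6])}
val incremented = rdd.mapValues(_ + 1)  // {(1, 3), (3, 5), (3, 7)}
val keys = rdd.keys                     // {1, 3, 3}
val values = rdd.values                 // {2, 4, 6}
val sorted = rdd.sortByKey()            // sorted by key, ascending by default
println(summed.collect().toSet)
```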
Transformations on Pair RDDs
 Transformations on one pair RDD (example: {(1, 2), (3, 4), (3, 6)})
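Two further pair transformations, flatMapValues() and combineByKey(), sketched on the same data. The (sum, count) combiner is an illustrative choice; combineByKey() needs explicit parameter types here.

```scala
import org.apache.spark.{SparkConf, SparkContext}
val sc = SparkContext.getOrCreate(new SparkConf().setAppName("demo").setMaster("local[*]"))

val rdd = sc.parallelize(List((1, 2), (3, 4), (3, 6)))
// Each value expands to a range; the empty range for (3, 6) yields nothing.
val expanded = rdd.flatMapValues(x => x to 5)
// {(1,2), (1,3), (1,4), (1,5), (3,4), (3,5)}

val sumCount = rdd.combineByKey(
  (v: Int) => (v, 1),                                            // first value of a key
  (acc: (Int, Int), v: Int) => (acc._1 + v, acc._2 + 1),          // next value, same partition
  (a: (Int, Int), b: (Int, Int)) => (a._1 + b._1, a._2 + b._2))   // merge across partitions
// {(1, (2, 1)), (3, (10, 2))}
println(sumCount.collect().toSet)
```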
Transformations on Pair RDDs
 Transformations on two pair RDDs (rdd = {(1, 2), (3, 4), (3, 6)},
other = {(3, 9)})
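A sketch of the two-RDD pair transformations on exactly these inputs. Results are shown as sets because their order is not guaranteed.

```scala
import org.apache.spark.{SparkConf, SparkContext}
val sc = SparkContext.getOrCreate(new SparkConf().setAppName("demo").setMaster("local[*]"))

val rdd = sc.parallelize(List((1, 2), (3, 4), (3, 6)))
val other = sc.parallelize(List((3, 9)))
val removed = rdd.subtractByKey(other) // {(1, 2)}
val inner = rdd.join(other)            // {(3, (4, 9)), (3, (6, 9))}
val left = rdd.leftOuterJoin(other)    // {(1, (2, None)), (3, (4, Some(9))), (3, (6, Some(9)))}
val right = rdd.rightOuterJoin(other)  // {(3, (Some(4), 9)), (3, (Some(6), 9))}
val both = rdd.cogroup(other)          // {(1, ([2], [])), (3, ([4, 6], [9]))}
println(left.collect().toSet)
```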
Transformations on Pair RDDs
 Using partial function syntax for Pair RDDs in Scala
 Simple filter on second element:
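A sketch of such a filter: the `case (key, value)` pattern names both tuple components instead of using `_._2`. The data and the length threshold of 20 are illustrative.

```scala
import org.apache.spark.{SparkConf, SparkContext}
val sc = SparkContext.getOrCreate(new SparkConf().setAppName("demo").setMaster("local[*]"))

val pairs = sc.parallelize(List(
  ("holden", "likes coffee"),
  ("panda", "likes long strings and coffee")))
// Keep pairs whose value (the second element) is shorter than 20 characters.
val shortValues = pairs.filter { case (key, value) => value.length < 20 }
shortValues.collect().foreach(println) // (holden,likes coffee)
```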
Transformations on Pair RDDs
Word and document counts:
 Per-key average with reduceByKey() and mapValues():
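The per-key average can be sketched as follows: mapValues() turns each value into a (sum, count) pair, reduceByKey() adds the pairs per key, and a final mapValues() divides. The input data is made up.

```scala
import org.apache.spark.{SparkConf, SparkContext}
val sc = SparkContext.getOrCreate(new SparkConf().setAppName("demo").setMaster("local[*]"))

val rdd = sc.parallelize(List(
  ("panda", 0), ("pink", 3), ("pirate", 3), ("panda", 1), ("pink", 4)))
val sumCount = rdd
  .mapValues(x => (x, 1))
  .reduceByKey((a, b) => (a._1 + b._1, a._2 + b._2))
val averages = sumCount.mapValues { case (sum, n) => sum.toDouble / n }
println(averages.collect().toMap) // panda -> 0.5, pink -> 3.5, pirate -> 3.0
```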
Transformations on Pair RDDs
Word count example revisited:
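Word count in its pair RDD form, as a sketch on two made-up lines. For small vocabularies, countByValue() on the words gives the same counts directly as a local map on the driver.

```scala
import org.apache.spark.{SparkConf, SparkContext}
val sc = SparkContext.getOrCreate(new SparkConf().setAppName("demo").setMaster("local[*]"))

val lines = sc.parallelize(List("to be or not to be", "to do"))
val words = lines.flatMap(_.split(" "))
val counts = words.map(word => (word, 1)).reduceByKey(_ + _) // distributed result
val localCounts = words.countByValue()                       // Map on the driver
println(counts.collect().toMap)
```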
Transformations on Pair RDDs
Example of a join (inner join is the default):
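A sketch of an inner join next to a left outer join. The store names, addresses and ratings are illustrative data. The inner join drops "Starbucks" because it has no rating, while the left outer join keeps it with None.

```scala
import org.apache.spark.{SparkConf, SparkContext}
val sc = SparkContext.getOrCreate(new SparkConf().setAppName("demo").setMaster("local[*]"))

val storeAddress = sc.parallelize(List(
  ("Ritual", "1026 Valencia St"), ("Philz", "748 Van Ness Ave"),
  ("Philz", "3101 24th St"), ("Starbucks", "Seattle")))
val storeRating = sc.parallelize(List(("Ritual", 4.9), ("Philz", 4.8)))

val joined = storeAddress.join(storeRating)             // 3 results, keys present in both
val withMissing = storeAddress.leftOuterJoin(storeRating) // 4 results, rating is an Option
joined.collect().foreach(println)
```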
Actions Available on Pair RDDs
Actions on pair RDDs (example: {(1, 2), (3, 4), (3, 6)})
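A sketch of three pair RDD actions on the example data. collectAsMap() keeps only one value per key, so for key 3 either 4 or 6 survives.

```scala
import org.apache.spark.{SparkConf, SparkContext}
val sc = SparkContext.getOrCreate(new SparkConf().setAppName("demo").setMaster("local[*]"))

val rdd = sc.parallelize(List((1, 2), (3, 4), (3, 6)))
val perKey = rdd.countByKey()   // Map(1 -> 1, 3 -> 2)
val asMap = rdd.collectAsMap()  // one entry per key
val forThree = rdd.lookup(3)    // all values stored under key 3
println(perKey)
```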
Example: PageRank
 links – (pageID, linkList) – a list of neighbors of each page
 ranks – (pageID, rank) – current rank for each page
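A compact PageRank sketch built on these two RDDs. The three-page graph, the 10 iterations, the damping constants 0.15 and 0.85 and the 4 hash partitions are assumptions chosen for illustration. Partitioning and persisting `links` once lets every join reuse the same layout, so only `ranks` is shuffled per iteration.

```scala
import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}
val sc = SparkContext.getOrCreate(new SparkConf().setAppName("demo").setMaster("local[*]"))

// Toy link graph: a -> b, c; b -> c; c -> a
val links = sc.parallelize(List(
    ("a", Seq("b", "c")), ("b", Seq("c")), ("c", Seq("a"))))
  .partitionBy(new HashPartitioner(4))
  .persist()

var ranks = links.mapValues(_ => 1.0) // every page starts with rank 1.0

for (_ <- 1 to 10) {
  // Each page sends rank / numberOfNeighbors to each of its neighbors.
  val contributions = links.join(ranks).flatMap {
    case (_, (neighbors, rank)) => neighbors.map(dest => (dest, rank / neighbors.size))
  }
  // New rank: damped sum of the received contributions.
  ranks = contributions.reduceByKey(_ + _).mapValues(sum => 0.15 + 0.85 * sum)
}
ranks.collect().sortBy(_._1).foreach(println)
```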
Important topics not covered in this intro
MLlib
 Machine Learning in the distributed way
 Basic Linear Algebra in the distributed way: sparse and dense vectors
and matrices
Partitioning
 No free lunch: no algorithm scales automagically
 Making an algorithm efficient = minimizing shuffling of the data
Spark SQL, Spark 2.0, Datasets, DataFrames
 Something like Python’s pandas DataFrame or R’s data.frame
 Great for interactive data mining and for working with CSV files

Más contenido relacionado

La actualidad más candente

La actualidad más candente (20)

Spark Study Notes
Spark Study NotesSpark Study Notes
Spark Study Notes
 
Apache Spark An Overview
Apache Spark An OverviewApache Spark An Overview
Apache Spark An Overview
 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache Spark
 
Apache Spark RDDs
Apache Spark RDDsApache Spark RDDs
Apache Spark RDDs
 
Apache Spark overview
Apache Spark overviewApache Spark overview
Apache Spark overview
 
Apache Spark in Depth: Core Concepts, Architecture & Internals
Apache Spark in Depth: Core Concepts, Architecture & InternalsApache Spark in Depth: Core Concepts, Architecture & Internals
Apache Spark in Depth: Core Concepts, Architecture & Internals
 
Apache Spark Introduction
Apache Spark IntroductionApache Spark Introduction
Apache Spark Introduction
 
Apache spark Intro
Apache spark IntroApache spark Intro
Apache spark Intro
 
Apache Spark
Apache Spark Apache Spark
Apache Spark
 
Apache Spark Internals
Apache Spark InternalsApache Spark Internals
Apache Spark Internals
 
Spark overview
Spark overviewSpark overview
Spark overview
 
Apache Spark RDD 101
Apache Spark RDD 101Apache Spark RDD 101
Apache Spark RDD 101
 
Introduction to Spark
Introduction to SparkIntroduction to Spark
Introduction to Spark
 
BDM25 - Spark runtime internal
BDM25 - Spark runtime internalBDM25 - Spark runtime internal
BDM25 - Spark runtime internal
 
Boston Apache Spark User Group (the Spahk group) - Introduction to Spark - 15...
Boston Apache Spark User Group (the Spahk group) - Introduction to Spark - 15...Boston Apache Spark User Group (the Spahk group) - Introduction to Spark - 15...
Boston Apache Spark User Group (the Spahk group) - Introduction to Spark - 15...
 
Tuning and Debugging in Apache Spark
Tuning and Debugging in Apache SparkTuning and Debugging in Apache Spark
Tuning and Debugging in Apache Spark
 
Survey of Spark for Data Pre-Processing and Analytics
Survey of Spark for Data Pre-Processing and AnalyticsSurvey of Spark for Data Pre-Processing and Analytics
Survey of Spark for Data Pre-Processing and Analytics
 
Apache Spark: What's under the hood
Apache Spark: What's under the hoodApache Spark: What's under the hood
Apache Spark: What's under the hood
 
Introduction to Spark Streaming & Apache Kafka | Big Data Hadoop Spark Tutori...
Introduction to Spark Streaming & Apache Kafka | Big Data Hadoop Spark Tutori...Introduction to Spark Streaming & Apache Kafka | Big Data Hadoop Spark Tutori...
Introduction to Spark Streaming & Apache Kafka | Big Data Hadoop Spark Tutori...
 
Apache Spark Introduction and Resilient Distributed Dataset basics and deep dive
Apache Spark Introduction and Resilient Distributed Dataset basics and deep diveApache Spark Introduction and Resilient Distributed Dataset basics and deep dive
Apache Spark Introduction and Resilient Distributed Dataset basics and deep dive
 

Destacado

Destacado (9)

Noun Sense Induction and Disambiguation using Graph-Based Distributional Sema...
Noun Sense Induction and Disambiguation using Graph-Based Distributional Sema...Noun Sense Induction and Disambiguation using Graph-Based Distributional Sema...
Noun Sense Induction and Disambiguation using Graph-Based Distributional Sema...
 
IIT-TUDA at SemEval-2016 Task 5: Beyond Sentiment Lexicon: Combining Domain ...
IIT-TUDA at SemEval-2016 Task 5: Beyond Sentiment Lexicon: Combining Domain ...IIT-TUDA at SemEval-2016 Task 5: Beyond Sentiment Lexicon: Combining Domain ...
IIT-TUDA at SemEval-2016 Task 5: Beyond Sentiment Lexicon: Combining Domain ...
 
21.04.2016 Meetup: Spark vs. Flink
21.04.2016 Meetup: Spark vs. Flink21.04.2016 Meetup: Spark vs. Flink
21.04.2016 Meetup: Spark vs. Flink
 
Making Sense of Word Embeddings
Making Sense of Word EmbeddingsMaking Sense of Word Embeddings
Making Sense of Word Embeddings
 
Why Apache Flink is better than Spark by Rubén Casado
Why Apache Flink is better than Spark by Rubén CasadoWhy Apache Flink is better than Spark by Rubén Casado
Why Apache Flink is better than Spark by Rubén Casado
 
Apache Flink at Strata San Jose 2016
Apache Flink at Strata San Jose 2016Apache Flink at Strata San Jose 2016
Apache Flink at Strata San Jose 2016
 
Continuous Processing with Apache Flink - Strata London 2016
Continuous Processing with Apache Flink - Strata London 2016Continuous Processing with Apache Flink - Strata London 2016
Continuous Processing with Apache Flink - Strata London 2016
 
HBaseCon 2012 | Lessons learned from OpenTSDB - Benoit Sigoure, StumbleUpon
HBaseCon 2012 | Lessons learned from OpenTSDB - Benoit Sigoure, StumbleUponHBaseCon 2012 | Lessons learned from OpenTSDB - Benoit Sigoure, StumbleUpon
HBaseCon 2012 | Lessons learned from OpenTSDB - Benoit Sigoure, StumbleUpon
 
Apache Spark & Hadoop : Train-the-trainer
Apache Spark & Hadoop : Train-the-trainerApache Spark & Hadoop : Train-the-trainer
Apache Spark & Hadoop : Train-the-trainer
 

Similar a Getting started in Apache Spark and Flink (with Scala) - Part II

Apache spark sneha challa- google pittsburgh-aug 25th
Apache spark  sneha challa- google pittsburgh-aug 25thApache spark  sneha challa- google pittsburgh-aug 25th
Apache spark sneha challa- google pittsburgh-aug 25th
Sneha Challa
 

Similar a Getting started in Apache Spark and Flink (with Scala) - Part II (20)

Spark vs Hadoop
Spark vs HadoopSpark vs Hadoop
Spark vs Hadoop
 
Introduction to Spark - DataFactZ
Introduction to Spark - DataFactZIntroduction to Spark - DataFactZ
Introduction to Spark - DataFactZ
 
Learning spark ch06 - Advanced Spark Programming
Learning spark ch06 - Advanced Spark ProgrammingLearning spark ch06 - Advanced Spark Programming
Learning spark ch06 - Advanced Spark Programming
 
Apache Spark PDF
Apache Spark PDFApache Spark PDF
Apache Spark PDF
 
Bds session 13 14
Bds session 13 14Bds session 13 14
Bds session 13 14
 
Apache spark with java 8
Apache spark with java 8Apache spark with java 8
Apache spark with java 8
 
Apache spark with java 8
Apache spark with java 8Apache spark with java 8
Apache spark with java 8
 
Apache spark sneha challa- google pittsburgh-aug 25th
Apache spark  sneha challa- google pittsburgh-aug 25thApache spark  sneha challa- google pittsburgh-aug 25th
Apache spark sneha challa- google pittsburgh-aug 25th
 
Introduction to Apache Spark
Introduction to Apache Spark Introduction to Apache Spark
Introduction to Apache Spark
 
Introduction to Apache Spark Developer Training
Introduction to Apache Spark Developer TrainingIntroduction to Apache Spark Developer Training
Introduction to Apache Spark Developer Training
 
Big Data Analytics and Ubiquitous computing
Big Data Analytics and Ubiquitous computingBig Data Analytics and Ubiquitous computing
Big Data Analytics and Ubiquitous computing
 
Apache Spark Introduction.pdf
Apache Spark Introduction.pdfApache Spark Introduction.pdf
Apache Spark Introduction.pdf
 
Sparkr sigmod
Sparkr sigmodSparkr sigmod
Sparkr sigmod
 
spark interview questions & answers acadgild blogs
 spark interview questions & answers acadgild blogs spark interview questions & answers acadgild blogs
spark interview questions & answers acadgild blogs
 
Spark Worshop
Spark WorshopSpark Worshop
Spark Worshop
 
Strata 2015 Data Preview: Spark, Data Visualization, YARN, and More
Strata 2015 Data Preview: Spark, Data Visualization, YARN, and MoreStrata 2015 Data Preview: Spark, Data Visualization, YARN, and More
Strata 2015 Data Preview: Spark, Data Visualization, YARN, and More
 
Big_data_analytics_NoSql_Module-4_Session
Big_data_analytics_NoSql_Module-4_SessionBig_data_analytics_NoSql_Module-4_Session
Big_data_analytics_NoSql_Module-4_Session
 
Apache spark basics
Apache spark basicsApache spark basics
Apache spark basics
 
Spark Advanced Analytics NJ Data Science Meetup - Princeton University
Spark Advanced Analytics NJ Data Science Meetup - Princeton UniversitySpark Advanced Analytics NJ Data Science Meetup - Princeton University
Spark Advanced Analytics NJ Data Science Meetup - Princeton University
 
Energy analytics with Apache Spark workshop
Energy analytics with Apache Spark workshopEnergy analytics with Apache Spark workshop
Energy analytics with Apache Spark workshop
 

Más de Alexander Panchenko

The 6th Conference on Analysis of Images, Social Networks, and Texts (AIST 2...
The 6th Conference on Analysis of Images, Social Networks, and Texts  (AIST 2...The 6th Conference on Analysis of Images, Social Networks, and Texts  (AIST 2...
The 6th Conference on Analysis of Images, Social Networks, and Texts (AIST 2...
Alexander Panchenko
 
Using Linked Disambiguated Distributional Networks for Word Sense Disambiguation
Using Linked Disambiguated Distributional Networks for Word Sense DisambiguationUsing Linked Disambiguated Distributional Networks for Word Sense Disambiguation
Using Linked Disambiguated Distributional Networks for Word Sense Disambiguation
Alexander Panchenko
 
Text Analysis of Social Networks: Working with FB and VK Data
Text Analysis of Social Networks: Working with FB and VK DataText Analysis of Social Networks: Working with FB and VK Data
Text Analysis of Social Networks: Working with FB and VK Data
Alexander Panchenko
 
Неологизмы в социальной сети Фейсбук
Неологизмы в социальной сети ФейсбукНеологизмы в социальной сети Фейсбук
Неологизмы в социальной сети Фейсбук
Alexander Panchenko
 
Dmitry Gubanov. An Approach to the Study of Formal and Informal Relations of ...
Dmitry Gubanov. An Approach to the Study of Formal and Informal Relations of ...Dmitry Gubanov. An Approach to the Study of Formal and Informal Relations of ...
Dmitry Gubanov. An Approach to the Study of Formal and Informal Relations of ...
Alexander Panchenko
 
Semantic Similarity Measures for Semantic Relation Extraction
Semantic Similarity Measures for Semantic Relation ExtractionSemantic Similarity Measures for Semantic Relation Extraction
Semantic Similarity Measures for Semantic Relation Extraction
Alexander Panchenko
 

Más de Alexander Panchenko (18)

Graph's not dead: from unsupervised induction of linguistic structures from t...
Graph's not dead: from unsupervised induction of linguistic structures from t...Graph's not dead: from unsupervised induction of linguistic structures from t...
Graph's not dead: from unsupervised induction of linguistic structures from t...
 
Building a Web-Scale Dependency-Parsed Corpus from Common Crawl
Building a Web-Scale Dependency-Parsed Corpus from Common CrawlBuilding a Web-Scale Dependency-Parsed Corpus from Common Crawl
Building a Web-Scale Dependency-Parsed Corpus from Common Crawl
 
Improving Hypernymy Extraction with Distributional Semantic Classes
Improving Hypernymy Extraction with Distributional Semantic ClassesImproving Hypernymy Extraction with Distributional Semantic Classes
Improving Hypernymy Extraction with Distributional Semantic Classes
 
Inducing Interpretable Word Senses for WSD and Enrichment of Lexical Resources
Inducing Interpretable Word Senses for WSD and Enrichment of Lexical ResourcesInducing Interpretable Word Senses for WSD and Enrichment of Lexical Resources
Inducing Interpretable Word Senses for WSD and Enrichment of Lexical Resources
 
IIT-UHH at SemEval-2017 Task 3: Exploring Multiple Features for Community Que...
IIT-UHH at SemEval-2017 Task 3: Exploring Multiple Features for Community Que...IIT-UHH at SemEval-2017 Task 3: Exploring Multiple Features for Community Que...
IIT-UHH at SemEval-2017 Task 3: Exploring Multiple Features for Community Que...
 
Fighting with Sparsity of the Synonymy Dictionaries for Automatic Synset Indu...
Fighting with Sparsity of the Synonymy Dictionaries for Automatic Synset Indu...Fighting with Sparsity of the Synonymy Dictionaries for Automatic Synset Indu...
Fighting with Sparsity of the Synonymy Dictionaries for Automatic Synset Indu...
 
The 6th Conference on Analysis of Images, Social Networks, and Texts (AIST 2...
The 6th Conference on Analysis of Images, Social Networks, and Texts  (AIST 2...The 6th Conference on Analysis of Images, Social Networks, and Texts  (AIST 2...
The 6th Conference on Analysis of Images, Social Networks, and Texts (AIST 2...
 
Using Linked Disambiguated Distributional Networks for Word Sense Disambiguation
Using Linked Disambiguated Distributional Networks for Word Sense DisambiguationUsing Linked Disambiguated Distributional Networks for Word Sense Disambiguation
Using Linked Disambiguated Distributional Networks for Word Sense Disambiguation
 
Unsupervised Does Not Mean Uninterpretable: The Case for Word Sense Induction...
Unsupervised Does Not Mean Uninterpretable: The Case for Word Sense Induction...Unsupervised Does Not Mean Uninterpretable: The Case for Word Sense Induction...
Unsupervised Does Not Mean Uninterpretable: The Case for Word Sense Induction...
 
Text Analysis of Social Networks: Working with FB and VK Data
Text Analysis of Social Networks: Working with FB and VK DataText Analysis of Social Networks: Working with FB and VK Data
Text Analysis of Social Networks: Working with FB and VK Data
 
Неологизмы в социальной сети Фейсбук
Неологизмы в социальной сети ФейсбукНеологизмы в социальной сети Фейсбук
Неологизмы в социальной сети Фейсбук
 
Sentiment Index of the Russian Speaking Facebook
Sentiment Index of the Russian Speaking FacebookSentiment Index of the Russian Speaking Facebook
Sentiment Index of the Russian Speaking Facebook
 
Similarity Measures for Semantic Relation Extraction
Similarity Measures for Semantic Relation ExtractionSimilarity Measures for Semantic Relation Extraction
Similarity Measures for Semantic Relation Extraction
 
Dmitry Gubanov. An Approach to the Study of Formal and Informal Relations of ...
Dmitry Gubanov. An Approach to the Study of Formal and Informal Relations of ...Dmitry Gubanov. An Approach to the Study of Formal and Informal Relations of ...
Dmitry Gubanov. An Approach to the Study of Formal and Informal Relations of ...
 
Detecting Gender by Full Name: Experiments with the Russian Language
Detecting Gender by Full Name:  Experiments with the Russian LanguageDetecting Gender by Full Name:  Experiments with the Russian Language
Detecting Gender by Full Name: Experiments with the Russian Language
 
Document
DocumentDocument
Document
 
Вычислительная лексическая семантика: метрики семантической близости и их при...
Вычислительная лексическая семантика: метрики семантической близости и их при...Вычислительная лексическая семантика: метрики семантической близости и их при...
Вычислительная лексическая семантика: метрики семантической близости и их при...
 
Semantic Similarity Measures for Semantic Relation Extraction
Semantic Similarity Measures for Semantic Relation ExtractionSemantic Similarity Measures for Semantic Relation Extraction
Semantic Similarity Measures for Semantic Relation Extraction
 

Último

Último (20)

Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CV
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdf
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
 

Getting started in Apache Spark and Flink (with Scala) - Part II

  • 1. 11.07.2016 | Spark tutorial | A. Panchenko, G. Hintz, S. Remus Getting started in Apache Spark and Flink (with Scala) Alexander Panchenko, Gerold Hintz, Steffen Remus
  • 2. 11.07.2016 | Spark tutorial | A. Panchenko, G. Hintz, S. Remus | 2 Outline  Scala  basics of Scala programming language  Spark  motivation / what do you get on top of MapReduce  basics of Spark: RDDs, transformations, actions, shuffling  “tricks” useful in Spark context  Spark Hands-on session  run Spark notebook and solve easy tasks  setup Spark project & submit job to cluster  Flink  theory  difference from Spark
  • 3. 11.07.2016 | Spark tutorial | A. Panchenko, G. Hintz, S. Remus | 3 Three main benefits to use Spark 1. Spark is easy to use—you can develop applications on your laptop, using a high-level API 2. Spark is fast, enabling interactive use and complex algorithms 3. Spark is a general engine, letting you combine multiple types of computations (e.g., SQL queries, text processing, and machine learning) that might previously have required different engines. This tutorial is based on the book by creators of Spark: Karau H., Konwinski A., Windell P., Zaharia M. “Learning Spark. Lighting-fast Data Analysis.” O’Really. 2015
  • 4. 11.07.2016 | Spark tutorial | A. Panchenko, G. Hintz, S. Remus | 4 Data Science Tasks Experimentation: development of the model  Python, MATLAB, R  iPython notebooks  Interactive computing  Easy-to-use Production: using the model  Java, Scala, C++/C  Unit tests  Fault tolerance  No interactive computing  Scalability Scala + Spark can be used for both!
  • 5. 11.07.2016 | Spark tutorial | A. Panchenko, G. Hintz, S. Remus | 5 A Brief History of Spark  Spark is an open source project  Spark started in 2009 as a research project in the UC Berkeley RAD Lab  Research papers were published about Spark at academic conferences and soon after its creation in 2009  In 2011, the AMPLab started to develop higher-level components on Spark, such as Shark (Hive on Spark) and Spark Streaming  Currently one of the most active project in Scala language:
  • 6. 11.07.2016 | Spark tutorial | A. Panchenko, G. Hintz, S. Remus | 6 What Is Apache Spark?  Spark Core: resilient distributed dataset (RDD)  Spark SQL: Hive tables, Parquet, JSON, Datasets
  • 7. What Is Apache Spark?
      - Components for distributed execution in Spark
  • 8. Spark Runtime Architecture
      - The components of a distributed Spark application
  • 9. Spark Runtime Architecture
      - Spark uses a master/slave architecture with one central coordinator and many distributed workers
      - The central coordinator is called the driver
      - The driver communicates with distributed workers called executors
      - The driver is the process where the main() method of your program runs
      - The driver is responsible for:
          - converting a user program into tasks
          - scheduling tasks on executors
  • 10. Downloading Spark and Getting Started
      - Download a version “Pre-built for Hadoop 2.X and later”: http://spark.apache.org/downloads.html
      - Files and directories that come with Spark:
          - README.md: contains short instructions for getting started with Spark
          - bin: contains executable files that can be used to interact with Spark in various ways (e.g., the Spark shell, which is covered later in this tutorial)
          - core, streaming, python, …: contain the source code of major components of the Spark project
          - examples: contains some helpful Spark standalone jobs that you can look at and run to learn about the Spark API
  • 11. Introduction to Spark’s Scala Shell
      - Run: bin/spark-shell
      - Type the Scala line count into the shell:
      - We can run parallel operations on the RDD, such as counting the lines of text in the file or printing the first one
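The code for this slide did not survive extraction. A minimal sketch of such a line count, assuming the shell was started from the Spark directory (so that README.md resolves) and using `sc`, the SparkContext that spark-shell predefines:

```scala
// Run inside bin/spark-shell, where `sc` (a SparkContext) already exists.
val lines = sc.textFile("README.md") // RDD[String], one element per line
val numLines = lines.count()         // action: number of lines in the file
val firstLine = lines.first()        // action: first line of the file
println(s"$numLines lines, first line: $firstLine")
```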
  • 12. Filtering: lambda functions
      - Filtering example (Scala):
      - Filtering example (Java 7):
      - Filtering example (Java 8):
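The slide's own snippets are missing from the transcript. For the Scala case, a filter with a lambda might look like the sketch below; the sample lines are made up, and `sc` is assumed to be the SparkContext predefined by spark-shell.

```scala
// Run inside bin/spark-shell; `sc` is the predefined SparkContext.
// Hypothetical input lines, used instead of a real file.
val lines = sc.parallelize(Seq(
  "Python is popular",
  "Scala runs on the JVM",
  "Spark has a Python API"))
// The lambda is applied to every element; matching elements are kept.
val pythonLines = lines.filter(line => line.contains("Python"))
// Equivalent shorthand using Scala's placeholder syntax.
val pythonLines2 = lines.filter(_.contains("Python"))
println(pythonLines.count())
```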
  • 13. Standalone Spark Applications
      - Link to Spark (Maven or SBT), e.g.:
      - Write a sample class, e.g. word count:
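As an illustration of what such a sample class could look like, here is a hedged word-count sketch. The object name, the `countWords` helper, and the two command-line arguments are assumptions for this example, not the original slide's code.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object WordCount {
  // Splits lines on spaces and sums a 1 for every occurrence of a word.
  def countWords(lines: RDD[String]): RDD[(String, Int)] =
    lines.flatMap(line => line.split(" "))
      .map(word => (word, 1))
      .reduceByKey(_ + _)

  def main(args: Array[String]): Unit = {
    val inputFile = args(0) // hypothetical: path to a text file
    val outputDir = args(1) // hypothetical: output directory, must not exist yet
    // No master is set here, so it has to be supplied by spark-submit (--master).
    val conf = new SparkConf().setAppName("wordCount")
    val sc = new SparkContext(conf)
    countWords(sc.textFile(inputFile)).saveAsTextFile(outputDir)
    sc.stop()
  }
}
```

Keeping the transformation in a separate function makes it testable against a small in-memory RDD without touching the file system.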
  • 14. Standalone Spark Applications
      - SBT build file
      - Build JAR and run it:
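The build file itself is not in the transcript. A minimal build.sbt could look like the following sketch; the project name and the Scala and Spark versions are assumptions and should be matched to the cluster you submit to.

```scala
// build.sbt -- name and versions are assumptions, adjust them to your setup.
name := "spark-tutorial-example"

version := "0.1.0"

scalaVersion := "2.10.6"

// "provided": spark-submit supplies Spark at runtime, so it stays out of the JAR.
libraryDependencies += "org.apache.spark" %% "spark-core" % "1.6.2" % "provided"
```

With this file, `sbt package` should produce `target/scala-2.10/spark-tutorial-example_2.10-0.1.0.jar`, which can then be run with something like `bin/spark-submit --class WordCount --master local[2] <jar> <input> <output>`, where `WordCount` stands for whatever main class your project defines.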
  • 15. Programming with RDDs
      - RDD: Resilient Distributed Dataset
          - Immutable distributed collection of objects
          - Each RDD is split into multiple partitions
          - Partitions may be computed on different nodes
      - Creating an RDD
          - Loading an external dataset
          - Distributing a collection of objects
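Both ways of creating an RDD can be sketched as follows, assuming a spark-shell session with its predefined `sc`. The file path and the sample strings are placeholders.

```scala
// Run inside bin/spark-shell; `sc` is the predefined SparkContext.
// 1. Loading an external dataset: one element per line (path is a placeholder).
//    textFile is lazy, so the file is only read once an action runs.
val fromFile = sc.textFile("README.md")
// 2. Distributing a local collection, here explicitly over 2 partitions.
val fromCollection = sc.parallelize(Seq("pandas", "i like pandas"), numSlices = 2)
println(fromCollection.partitions.length)
```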
  • 16. Programming with RDDs
      - Once created, RDDs offer two types of operations:
          - Transformations
          - Actions
      - Transformations construct a new RDD from a previous one
      - Actions compute a result based on an RDD and
          - either return it to the driver program
          - or save it to an external storage system, e.g. HDFS
      - By default, an RDD is recomputed each time you run an action on it
      - To reuse an RDD you need to persist it in memory:
  • 17. Spark Execution Steps (Shell & Standalone)
      1. Create some input RDDs from external data.
      2. Transform them to define new RDDs using transformations like filter().
      3. Persist any intermediate RDDs that will need to be reused.
      4. Launch actions such as count() and first() to kick off a parallel computation, which is then optimized and executed by Spark.
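The four steps can be sketched in a few lines. This assumes a spark-shell session (`sc` predefined) and uses made-up log lines in place of external data.

```scala
// Run inside bin/spark-shell; `sc` is the predefined SparkContext.
// 1. Create an input RDD (in-memory lines stand in for external data).
val lines = sc.parallelize(Seq(
  "INFO start", "ERROR disk full", "WARN slow", "ERROR timeout"))
// 2. Transform: nothing is computed yet, transformations are lazy.
val errors = lines.filter(_.startsWith("ERROR"))
// 3. Persist, because the two actions below both use `errors`.
errors.persist()
// 4. Actions trigger the computation; count() also fills the cache.
val numErrors = errors.count()
val firstError = errors.first()
```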
  • 18. RDD Operations: Transformations
      - The filter() operation does not mutate the existing inputRDD
      - It returns a pointer to an entirely new RDD
      - inputRDD can still be reused later in the program, e.g.:
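A sketch of such reuse, with made-up log lines and the names `errorsRDD`, `warningsRDD`, and `badLinesRDD` chosen for illustration (spark-shell with predefined `sc` assumed):

```scala
// Run inside bin/spark-shell; `sc` is the predefined SparkContext.
val inputRDD = sc.parallelize(Seq(
  "INFO ok", "error: disk", "warning: cpu", "error: net"))
val errorsRDD = inputRDD.filter(_.contains("error"))
// inputRDD is untouched by the first filter and can be filtered again.
val warningsRDD = inputRDD.filter(_.contains("warning"))
val badLinesRDD = errorsRDD.union(warningsRDD)
```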
  • 19. RDD Operations: Actions
      - Return some result and launch actual computation:
      - take() to retrieve a small number of elements
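For example, count() and take() could be used like this (a sketch with made-up data, assuming spark-shell's predefined `sc`):

```scala
// Run inside bin/spark-shell; `sc` is the predefined SparkContext.
val badLinesRDD = sc.parallelize(Seq("error: disk", "error: net", "warning: cpu"))
println("Input had " + badLinesRDD.count() + " concerning lines")
// take(n) brings at most n elements to the driver;
// collect() would bring the whole RDD, which may not fit in memory.
val someLines = badLinesRDD.take(2)
someLines.foreach(println)
```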
  • 20. Common Transformations and Actions
      - Element-wise transformations
      - Mapped and filtered RDD from an input RDD:
      - Squaring the values in an RDD:
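Squaring the values with map() can be sketched as follows (spark-shell with predefined `sc` assumed):

```scala
// Run inside bin/spark-shell; `sc` is the predefined SparkContext.
val input = sc.parallelize(List(1, 2, 3, 4))
val squared = input.map(x => x * x) // one output element per input element
println(squared.collect().mkString(",")) // prints 1,4,9,16
```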
  • 21. Common Transformations and Actions
      - Element-wise transformations
      - Splitting lines into multiple words:
      - Difference between flatMap() and map() on an RDD:
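The difference shows up in the element type of the result. A small sketch with made-up lines (spark-shell with predefined `sc` assumed):

```scala
// Run inside bin/spark-shell; `sc` is the predefined SparkContext.
val lines = sc.parallelize(List("hello world", "hi"))
// flatMap flattens the per-line arrays: RDD[String] with hello, world, hi.
val words = lines.flatMap(line => line.split(" "))
// map keeps one array per line: RDD[Array[String]] with 2 elements.
val arrays = lines.map(line => line.split(" "))
println(words.first()) // prints hello
```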
  • 22. Common Transformations and Actions
      - Some simple set operations:
  • 23. Common Transformations and Actions
      - Basic RDD transformations on an RDD containing {1, 2, 3, 3}:
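The table from this slide is not in the transcript. The sketch below shows the standard single-RDD transformations on that input; which ones the slide listed is an assumption. It expects spark-shell's predefined `sc`.

```scala
// Run inside bin/spark-shell; `sc` is the predefined SparkContext.
val rdd = sc.parallelize(List(1, 2, 3, 3))
val plusOne = rdd.map(x => x + 1)       // {2, 3, 4, 4}
val ranges  = rdd.flatMap(x => x.to(3)) // {1, 2, 3, 2, 3, 3, 3}
val notOne  = rdd.filter(x => x != 1)   // {2, 3, 3}
val unique  = rdd.distinct()            // {1, 2, 3}, order not guaranteed, needs a shuffle
// Random subset; the result differs from run to run.
val sampled = rdd.sample(withReplacement = false, fraction = 0.5)
```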
  • 24. Common Transformations and Actions
      - Two-RDD transformations on RDDs containing {1, 2, 3} and {3, 4, 5}:
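A sketch of the two-RDD transformations on exactly these inputs (the slide's table is missing, so the selection of operations is an assumption; spark-shell with predefined `sc` assumed):

```scala
// Run inside bin/spark-shell; `sc` is the predefined SparkContext.
val rdd = sc.parallelize(List(1, 2, 3))
val other = sc.parallelize(List(3, 4, 5))
val all      = rdd.union(other)        // {1, 2, 3, 3, 4, 5}: duplicates are kept
val common   = rdd.intersection(other) // {3}: deduplicates, needs a shuffle
val onlyLeft = rdd.subtract(other)     // {1, 2}: needs a shuffle
val pairs    = rdd.cartesian(other)    // all 9 pairs from (1,3) to (3,5)
```

union() is cheap because it only concatenates partitions, while intersection() and subtract() have to move data between nodes.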
  • 25. Common Transformations and Actions
      - Basic actions on an RDD containing {1, 2, 3, 3}:
  • 26. Common Transformations and Actions
      - Basic actions on an RDD containing {1, 2, 3, 3} (continued):
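The action tables from these two slides are missing. A sketch of common actions on the same input follows; the selection is an assumption, and spark-shell's predefined `sc` is expected.

```scala
// Run inside bin/spark-shell; `sc` is the predefined SparkContext.
val rdd = sc.parallelize(List(1, 2, 3, 3))
val all      = rdd.collect()                // Array(1, 2, 3, 3)
val n        = rdd.count()                  // 4
val freq     = rdd.countByValue()           // Map(1 -> 1, 2 -> 1, 3 -> 2)
val firstTwo = rdd.take(2)                  // Array(1, 2)
val topTwo   = rdd.top(2)                   // Array(3, 3)
val sum      = rdd.reduce((x, y) => x + y)  // 9
val sumFold  = rdd.fold(0)((x, y) => x + y) // 9
// aggregate can return a different type than the elements:
// here a (sum, count) pair from which the average is derived.
val (total, count) = rdd.aggregate((0, 0))(
  (acc, value) => (acc._1 + value, acc._2 + 1),
  (a, b) => (a._1 + b._1, a._2 + b._2))
val avg = total / count.toDouble            // 2.25
```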
  • 27. Persistence (Caching)
      - Double execution:
      - Reusing result:
      - Persistence levels:
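A sketch of reusing a result instead of computing it twice, assuming spark-shell's predefined `sc`. The choice of MEMORY_AND_DISK is only an example; other levels in `StorageLevel` include MEMORY_ONLY (what persist() without arguments uses for RDDs), MEMORY_ONLY_SER, MEMORY_AND_DISK_SER, and DISK_ONLY.

```scala
// Run inside bin/spark-shell; `sc` is the predefined SparkContext.
import org.apache.spark.storage.StorageLevel

val input = sc.parallelize(List(1, 2, 3, 4))
val result = input.map(x => x * x)
// Without persist(), each of the two actions below would recompute the map.
result.persist(StorageLevel.MEMORY_AND_DISK)
println(result.count())                 // first action: computes and caches
println(result.collect().mkString(",")) // second action: served from the cache
result.unpersist()                      // release the cached partitions
```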
  • 28. Working with Key/Value Pairs
      - Pair RDDs are a useful building block in many programs
      - They allow you to act on each key in parallel or regroup data
      - For instance:
          - the reduceByKey() method can aggregate data for each key
          - the join() method can merge two RDDs by grouping elements with the same key
      - Creating pair RDDs = creating Scala tuples:
      - Creating a pair RDD using the first word as the key
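Keying each line by its first word might look like this sketch (made-up lines, spark-shell with predefined `sc` assumed):

```scala
// Run inside bin/spark-shell; `sc` is the predefined SparkContext.
val lines = sc.parallelize(List("holden likes coffee", "panda likes bamboo"))
// Each element becomes a (key, value) tuple: (first word, whole line).
val pairs = lines.map(line => (line.split(" ")(0), line)) // RDD[(String, String)]
pairs.collect().foreach(println)
```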
  • 29. Transformations on Pair RDDs
      - Transformations on one pair RDD (example: {(1, 2), (3, 4), (3, 6)})
  • 30. Transformations on Pair RDDs
      - Transformations on one pair RDD (example: {(1, 2), (3, 4), (3, 6)}), continued
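The tables for these two slides are missing from the transcript. The following sketch applies common single-pair-RDD transformations to the example data; the selection is an assumption, and spark-shell's predefined `sc` is expected. The order of elements after shuffling operations such as reduceByKey() is not guaranteed.

```scala
// Run inside bin/spark-shell; `sc` is the predefined SparkContext.
val rdd = sc.parallelize(List((1, 2), (3, 4), (3, 6)))
val sums    = rdd.reduceByKey((x, y) => x + y) // {(1, 2), (3, 10)}
val grouped = rdd.groupByKey()                 // {(1, [2]), (3, [4, 6])}
val plusOne = rdd.mapValues(x => x + 1)        // {(1, 3), (3, 5), (3, 7)}
val ranges  = rdd.flatMapValues(x => x to 5)   // {(1,2), (1,3), (1,4), (1,5), (3,4), (3,5)}
val ks      = rdd.keys                         // {1, 3, 3}
val vs      = rdd.values                       // {2, 4, 6}
val sorted  = rdd.sortByKey()                  // {(1, 2), (3, 4), (3, 6)}
```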
  • 31. Transformations on Pair RDDs
      - Transformations on two pair RDDs (rdd = {(1, 2), (3, 4), (3, 6)}, other = {(3, 9)})
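With the two example RDDs from the slide, the usual two-pair-RDD transformations behave as sketched below (the selection of operations is an assumption; spark-shell with predefined `sc` assumed):

```scala
// Run inside bin/spark-shell; `sc` is the predefined SparkContext.
val rdd = sc.parallelize(List((1, 2), (3, 4), (3, 6)))
val other = sc.parallelize(List((3, 9)))
val sub   = rdd.subtractByKey(other)  // {(1, 2)}
val inner = rdd.join(other)           // {(3, (4, 9)), (3, (6, 9))}
val right = rdd.rightOuterJoin(other) // {(3, (Some(4), 9)), (3, (Some(6), 9))}
val left  = rdd.leftOuterJoin(other)  // {(1, (2, None)), (3, (4, Some(9))), (3, (6, Some(9)))}
val cg    = rdd.cogroup(other)        // {(1, ([2], [])), (3, ([4, 6], [9]))}
```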
  • 32. Transformations on Pair RDDs
      - Using partial function syntax for pair RDDs in Scala
      - Simple filter on second element:
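A sketch of filtering on the second tuple element, once with positional access and once with a pattern-matching function literal. The data and the length threshold of 20 are made up; spark-shell's predefined `sc` is assumed.

```scala
// Run inside bin/spark-shell; `sc` is the predefined SparkContext.
val pairs = sc.parallelize(List(
  ("holden", "holden likes coffee"),
  ("panda", "a very long line about pandas and bamboo")))
// Positional access works but hides what _2 means.
val short1 = pairs.filter(kv => kv._2.length < 20)
// The { case ... } form destructures the tuple and names its fields.
val short2 = pairs.filter { case (key, value) => value.length < 20 }
```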
  • 33. Transformations on Pair RDDs
      - Word and document counts:
      - Per-key average with reduceByKey() and mapValues():
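One way to compute a per-key average with these two methods is to carry a (sum, count) pair per key. The sketch below uses made-up data and assumes spark-shell's predefined `sc`.

```scala
// Run inside bin/spark-shell; `sc` is the predefined SparkContext.
val rdd = sc.parallelize(List(
  ("panda", 0), ("pink", 3), ("pirate", 3), ("panda", 1), ("pink", 4)))
// Turn every value v into (v, 1) so that sums and counts add up together.
val withOnes = rdd.mapValues(v => (v, 1))
val sumCount = withOnes.reduceByKey((a, b) => (a._1 + b._1, a._2 + b._2))
// Divide sum by count for each key.
val avgByKey = sumCount.mapValues { case (sum, count) => sum.toDouble / count }
avgByKey.collect().foreach(println)
```

For this input the averages are 0.5 for panda, 3.5 for pink, and 3.0 for pirate.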
  • 34. Transformations on Pair RDDs
      - Word count example revisited:
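Expressed with pair RDDs, word count is a map to (word, 1) followed by reduceByKey(). The sketch below uses made-up input lines and assumes spark-shell's predefined `sc`.

```scala
// Run inside bin/spark-shell; `sc` is the predefined SparkContext.
val input = sc.parallelize(List("to be or", "not to be"))
val words = input.flatMap(line => line.split(" "))
// Distributed result: RDD[(String, Int)].
val counts = words.map(word => (word, 1)).reduceByKey((x, y) => x + y)
// Shortcut when the result fits in driver memory: an action returning a local Map.
val localCounts = words.countByValue()
```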
  • 35. Transformations on Pair RDDs
      - Example of a join (inner join is the default):
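A sketch of an inner join on made-up store data (the names and values are illustrative only; spark-shell with predefined `sc` assumed):

```scala
// Run inside bin/spark-shell; `sc` is the predefined SparkContext.
val storeAddress = sc.parallelize(List(
  ("StoreA", "1 Main St"), ("StoreB", "2 Oak Ave"),
  ("StoreB", "3 Elm St"), ("StoreC", "4 Pine Rd")))
val storeRating = sc.parallelize(List(("StoreA", 4.9), ("StoreB", 4.8)))
// Inner join: only keys present in both RDDs survive, so StoreC is dropped
// and StoreB yields one output element per matching address.
val joined = storeAddress.join(storeRating) // RDD[(String, (String, Double))]
joined.collect().foreach(println)
```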
  • 36. Actions Available on Pair RDDs
      - Actions on pair RDDs (example: {(1, 2), (3, 4), (3, 6)})
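The pair-specific actions can be sketched on the example data as follows (the slide's table is missing, so the selection is an assumption; spark-shell with predefined `sc` assumed):

```scala
// Run inside bin/spark-shell; `sc` is the predefined SparkContext.
val rdd = sc.parallelize(List((1, 2), (3, 4), (3, 6)))
val perKey = rdd.countByKey()   // Map(1 -> 1, 3 -> 2)
// A Map holds one value per key, so the two entries for key 3 collapse into one.
val asMap  = rdd.collectAsMap()
val forKey = rdd.lookup(3)      // all values for key 3: 4 and 6
```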
  • 37. Example: PageRank
      - links: (pageID, linkList) pairs, the list of neighbors of each page
      - ranks: (pageID, rank) pairs, the current rank of each page
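A compact PageRank sketch over these two pair RDDs. The toy graph, the 10 iterations, the 2 partitions, and the 0.15/0.85 damping constants are assumptions for illustration; `sc` is the SparkContext predefined by spark-shell.

```scala
// Run inside bin/spark-shell (use :paste); `sc` is the predefined SparkContext.
import org.apache.spark.HashPartitioner

// Hypothetical toy graph: page -> pages it links to.
val links = sc.parallelize(List(
  ("a", Seq("b", "c")),
  ("b", Seq("c")),
  ("c", Seq("a"))
)).partitionBy(new HashPartitioner(2)).persist() // reused in every iteration

var ranks = links.mapValues(_ => 1.0) // initial rank of 1.0 per page

for (_ <- 0 until 10) {
  // Each page splits its current rank evenly among its neighbors.
  val contributions = links.join(ranks).flatMap {
    case (_, (neighbors, rank)) => neighbors.map(dest => (dest, rank / neighbors.size))
  }
  // New rank: damped sum of all contributions received by a page.
  ranks = contributions.reduceByKey(_ + _).mapValues(sum => 0.15 + 0.85 * sum)
}
ranks.collect().foreach(println)
```

Partitioning and persisting `links` once keeps the large, static side of the join in place, so only the small `ranks` RDD is shuffled in each iteration.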
  • 38. Important topics not covered in this intro
      - MLlib
          - Machine learning in a distributed way
          - Basic linear algebra in a distributed way: sparse and dense vectors and matrices
      - Partitioning
          - No free lunch: no algorithm scales automagically
          - Making an algorithm efficient = minimizing shuffling of the data
      - Spark SQL, Spark 2.0, Datasets, DataFrames
          - Something like Python’s pandas or R’s data frames
          - Great for interactive data mining and for working with CSV files