SlideShare una empresa de Scribd logo
1 de 34
Descargar para leer sin conexión
WITH
Gökhan Atıl
GÖKHAN ATIL
➤ Database Administrator
➤ Oracle ACE Director (2016)

ACE (2011)
➤ 10g/11g and R12 Oracle Certified Professional (OCP)
➤ Co-author of Expert Oracle Enterprise Manager 12c
➤ Founding Member and Vice President of TROUG
➤ Blogger (since 2008) gokhanatil.com
➤ Twitter: @gokhanatil
2
APACHE SPARK WITH PYTHON
➤ Introduction to Apache Spark
➤ Why Python (PySpark) instead of Scala?
➤ Spark RDD
➤ SQL and DataFrames
➤ Spark Streaming
➤ Spark Graphx
➤ Spark MLlib (Machine Learning)
3
INTRODUCTION TO APACHE SPARK
➤ A fast and general engine for large-scale data processing
➤ Top-Level Apache Project since 2014.
➤ Response to limitations in the MapReduce
➤ Run programs up to 100x faster than Hadoop MapReduce
in memory, or 10x faster on disk
➤ Implemented in Scala programming language, supports
Java, Scala, Python, R
➤ Runs on Hadoop, Mesos, Kubernetes, standalone, cloud
4
DOWNLOAD AND RUN ON YOUR PC
➤ https://spark.apache.org/downloads.html
➤ Extract and Spark is ready:
tar -xzf spark-2.3.0-bin-hadoop2.7.tgz
spark-2.3.0-bin-hadoop2.7/bin/spark
➤ You can also use PIP:
pip install pyspark
5
PYSPARK AND SPARK-SUBMIT
➤ PySpark is the interface that gives access to Spark using the
Python programming language
➤ The spark-submit script in Spark’s bin directory is used to
launch applications on a cluster
spark-submit example1.py
6
WHY PYTHON INSTEAD OF SCALA?
➤ If you know Scala, then use Scala!
➤ Learning curve: Python is comparatively easier to learn
➤ Easy to use: Code readability, maintainability and familiarity is
far better with Python
➤ Libraries: Python comes with great libraries for data analysis,
statistics and visualization (numpy, pandas, matplotlib etc...)
➤ Performance:  Scala is faster then Python but if your Python
code just calls Spark libraries, the differences in performance is
minimal (*)
7
Reminder: Any new feature added in Spark API will be
available in Scala first
RESILIENT DISTRIBUTED DATASET (RDD)
➤ RDDs are the core data structure in Spark
➤ Distributed, resilient, immutable, can store unstructured and
structured data, lazy evaluated
8
node 1
RDD
partition 1
node 2
RDD
partition 2
node 3
RDD
partition 3
RDD
RDD TRANSFORMATIONS AND ACTIONS
sc.textFile(*)
9
RDD T1 T2 T3 ACTION
LAZY EVALUATATION
SPARK CONTEXT
.collect().reduceByKey(*).filter(*).map(*)
TRANSFORMATIONS ACTIONS
➤ map
➤ filter
➤ flatMap
➤ mapPartitions
➤ reduceByKey
➤ union
➤ intersection
➤ join
10
➤ collect
➤ count
➤ first
➤ take
➤ takeSample
➤ takeOrdered
➤ saveAsTextFile
➤ foreach
HOW TO CREATE RDD IN PYSPARK
➤ Referencing a dataset in an external storage system:
rdd = sc.textFile( ... )
➤ Parallelizing already existing collection:
rdd = sc.parallelize( ... )
➤ Creating RDD from already existing RDDs:
rdd2 = rdd1.map( ... )
11
USERS.CSV (MOVIELENS DATABASE)
id | age | gender | occupation | zip
1|24|M|technician|85711
2|53|F|other|94043
3|23|M|writer|32067
4|24|M|technician|43537
5|33|F|other|15213
6|42|M|executive|98101
7|57|M|administrator|91344
8|36|M|administrator|05201
12
M = 670
F = 273
EXAMPLE #1: USE RDD TO GROUP DATA FROM CSV
from pyspark import SparkContext
sc = SparkContext.getOrCreate()
print sc.textFile( "users.csv" ) 
.map( lambda x: (x.split("|")[2], 1) ) 
.reduceByKey(lambda x,y:x+y).collect()
sc.stop()
13
M, 1
M, 1
F, 1
M, 1
[(u'M', 670), (u'F', 273)]
SPARK SQL AND DATAFRAMES
14
Catalyst
RDD
DataFrames/DataSetsSQL
SPARKSQL
MLlib GraphFrames
Structured
Streaming
➤ Spark SQL is Apache Spark's module for working with
structured data
DATAFRAMES AND DATASETS
➤ DataFrame is a distributed collection of "structured" data, organized
into named columns.
➤ Spark DataSets are statically typed, while Python is a dynamically
typed programming language so Python supports only DataFrames.
15
EXAMPLE #2: USE DATAFRAME TO GROUP DATA FROM CSV
from pyspark import SparkContext
from pyspark.sql import SparkSession
sc = SparkContext.getOrCreate()
spark = SparkSession(sc)
spark.read.load( "users.csv", format="csv", sep="|" ) 
.toDF( "id","age","gender","occupation","zip" ) 
.groupby( "gender" ).count().show()
sc.stop()
16
DATAFRAME VERSUS RDD
17
?
CATALYST OPTIMIZER
➤ Spark SQL uses Catalyst optimizer to optimize query plans.
➤ Supports cost-based optimization since Spark 2.2
18
SQL
DataFrame
DataSet
Query
Plan
Optimized
Query Plan
RDD
Code Generation
CONVERSION BETWEEN RDD AND DATAFRAME
➤ An RDD can be converted to DataFrame using
createDataFrame or toDF method:
rdd = sc.parallelize([("osman",21),("ahmet",25)])
df = rdd.toDF( "name STRING, age INT" )
df.show()
➤ You can access underlying RDD of a DataFrame using rdd
property:
df.rdd.collect()
[Row(name=u'osman',age=21),Row(name=u'ahmet',age=25)]
19
EXAMPLE #3: CREATE TEMPORARY VIEWS FROM DATAFRAMES
spark.read.load( "users.csv", format="csv", sep="|" ) 
.toDF( "id","age","gender","occupation","zip" ) 
.createOrReplaceTempView( "users" )
spark.sql( "select count(*) from users" ).show()
spark.sql( "select case when age < 25 then '-25' 
when age between 25 and 39 then '25-40' 
when age >= 40 then '40+' end age_group, 
count(*) from users group by age_group order by 1" ).show()
20
EXAMPLE #4: READ AND WRITE DATA
df = spark.read.load( "users.csv", format="csv", sep="|" ) 
.toDF( "id","age","gender","occupation","zip" )
df.write.saveAsTable("users")
df .write.save("users.json", format="json", mode="overwrite")
spark.sql("SELECT gender, count(*) FROM 
json.`users.json` GROUP BY gender").show()
21
HIVE
SPARK STREAMING (DSTREAMS)
➤ Scalable, high-throughput, fault-tolerant stream processing of
live data streams
➤ Supports: File, Socket, Kafka, Flume, Kinesis
➤ Spark Streaming receives live input data streams and divides
the data into batches
22
EXAMPLE #5: DISCRETIZED STREAMS (DSTREAMS)
ssc = StreamingContext(sc, 1)
stream_data = ssc.textFileStream("file:///tmp/stream") 
.map( lambda x: x.split(","))
stream_data.pprint()
ssc.start()
ssc.awaitTermination()
23
EXAMPLE #5: OUTPUT
24
STRUCTURED STREAMING
➤ Stream processing engine built on the Spark SQL engine
➤ Supports File and Kafka sources for production; Socket and
Rate sources for testing
25
EXAMPLE #6: STRUCTURED STREAMING
stream_data = spark.readStream 
.load( format="csv",path="/tmp/stream/*.csv",
schema="name string, points int" ) 
.groupBy("name").sum("points").orderBy( "sum(points)",
ascending=0 )
stream_data.writeStream.start( format="console",
outputMode="complete" ).awaitTermination()
26
EXAMPLE #6: OUTPUT
27
GRAPHX (GRAPHFRAMES)
➤ GraphX is a new component in Spark for graphs and graph-
parallel computation.
28
EXAMPLE #7: GRAPHFRAMES
vertex =
spark.createDataFrame([
(1, "Ahmet"),
(2, "Mehmet"),
(3, "Cengiz"),
(4, "Osman")],
["id", "name"])
edges =
spark.createDataFrame([
( 1, 2, "friend" ),
( 2, 1, "friend" ),
( 2, 3, "friend" ),
( 3, 2, "friend" ),
( 2, 4, "friend" ),
( 4, 2, "friend" ),
( 3, 4, "friend" ),
( 4, 3, "friend" )],
["src","dst", "relation"])
29
EXAMPLE #7: GRAPHFRAMES
pyspark --packages graphframes:graphframes:0.5.0-spark2.1-
s_2.11
import graphframes as gf
g = gf.GraphFrame(vertex, edges)
g.shortestPaths([4]).show()
30
1
2
3
4
MLLIB (MACHINE LEARNING)
➤ Supports common ML Algorithms such as classification,
regression, clustering, and collaborative filtering
➤ Featurization:
➤ Feature extraction (TF-IDF, Word2Vec, CountVectorizer ...)
➤ Transformation (Tokenizer, StopWordsRemover ...)
➤ Selection (VectorSlicer, RFormula ... )
➤ Pipelines: combine multiple algorithms into a single pipeline,
or workflow
➤ DataFrame-based API is primary API
31
EXAMPLE #8: ALTERNATING LEAST SQUARES (ALS)
def parseratings( x ):
v = x.split("::")
return (int(v[0]), int(v[1]), float(v[2]))
ratings = sc.textFile("ratings.dat").map(parseratings) 
.toDF( ["user", "id", "rating"] )
als = ALS(userCol="user", itemCol="id", ratingCol="rating")
model = als.fit(ratings)
model.recommendForAllUsers(10).show()
32
EXAMPLE #8 OUTPUT
33
Blog: www.gokhanatil.com Twitter: @gokhanatil

Más contenido relacionado

La actualidad más candente

Simplifying Big Data Analytics with Apache Spark
Simplifying Big Data Analytics with Apache SparkSimplifying Big Data Analytics with Apache Spark
Simplifying Big Data Analytics with Apache Spark
Databricks
 
What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...
What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...
What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...
Simplilearn
 
Apache Spark Architecture | Apache Spark Architecture Explained | Apache Spar...
Apache Spark Architecture | Apache Spark Architecture Explained | Apache Spar...Apache Spark Architecture | Apache Spark Architecture Explained | Apache Spar...
Apache Spark Architecture | Apache Spark Architecture Explained | Apache Spar...
Simplilearn
 

La actualidad más candente (20)

Apache Spark 101
Apache Spark 101Apache Spark 101
Apache Spark 101
 
Spark
SparkSpark
Spark
 
Simplifying Big Data Analytics with Apache Spark
Simplifying Big Data Analytics with Apache SparkSimplifying Big Data Analytics with Apache Spark
Simplifying Big Data Analytics with Apache Spark
 
Apache Spark overview
Apache Spark overviewApache Spark overview
Apache Spark overview
 
Pyspark Tutorial | Introduction to Apache Spark with Python | PySpark Trainin...
Pyspark Tutorial | Introduction to Apache Spark with Python | PySpark Trainin...Pyspark Tutorial | Introduction to Apache Spark with Python | PySpark Trainin...
Pyspark Tutorial | Introduction to Apache Spark with Python | PySpark Trainin...
 
Introducing DataFrames in Spark for Large Scale Data Science
Introducing DataFrames in Spark for Large Scale Data ScienceIntroducing DataFrames in Spark for Large Scale Data Science
Introducing DataFrames in Spark for Large Scale Data Science
 
PySpark in practice slides
PySpark in practice slidesPySpark in practice slides
PySpark in practice slides
 
What is Apache Spark | Apache Spark Tutorial For Beginners | Apache Spark Tra...
What is Apache Spark | Apache Spark Tutorial For Beginners | Apache Spark Tra...What is Apache Spark | Apache Spark Tutorial For Beginners | Apache Spark Tra...
What is Apache Spark | Apache Spark Tutorial For Beginners | Apache Spark Tra...
 
What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...
What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...
What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...
 
Apache Spark in Depth: Core Concepts, Architecture & Internals
Apache Spark in Depth: Core Concepts, Architecture & InternalsApache Spark in Depth: Core Concepts, Architecture & Internals
Apache Spark in Depth: Core Concepts, Architecture & Internals
 
Apache Spark Overview
Apache Spark OverviewApache Spark Overview
Apache Spark Overview
 
Understanding Query Plans and Spark UIs
Understanding Query Plans and Spark UIsUnderstanding Query Plans and Spark UIs
Understanding Query Plans and Spark UIs
 
Spark SQL
Spark SQLSpark SQL
Spark SQL
 
Processing Large Data with Apache Spark -- HasGeek
Processing Large Data with Apache Spark -- HasGeekProcessing Large Data with Apache Spark -- HasGeek
Processing Large Data with Apache Spark -- HasGeek
 
Apache Spark Architecture | Apache Spark Architecture Explained | Apache Spar...
Apache Spark Architecture | Apache Spark Architecture Explained | Apache Spar...Apache Spark Architecture | Apache Spark Architecture Explained | Apache Spar...
Apache Spark Architecture | Apache Spark Architecture Explained | Apache Spar...
 
Data Source API in Spark
Data Source API in SparkData Source API in Spark
Data Source API in Spark
 
What Is RDD In Spark? | Edureka
What Is RDD In Spark? | EdurekaWhat Is RDD In Spark? | Edureka
What Is RDD In Spark? | Edureka
 
Learn Apache Spark: A Comprehensive Guide
Learn Apache Spark: A Comprehensive GuideLearn Apache Spark: A Comprehensive Guide
Learn Apache Spark: A Comprehensive Guide
 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache Spark
 
Introduction to spark
Introduction to sparkIntroduction to spark
Introduction to spark
 

Similar a Introduction to Spark with Python

An Introduction to Spark
An Introduction to SparkAn Introduction to Spark
An Introduction to Spark
jlacefie
 

Similar a Introduction to Spark with Python (20)

Vitalii Bondarenko HDinsight: spark. advanced in memory big-data analytics wi...
Vitalii Bondarenko HDinsight: spark. advanced in memory big-data analytics wi...Vitalii Bondarenko HDinsight: spark. advanced in memory big-data analytics wi...
Vitalii Bondarenko HDinsight: spark. advanced in memory big-data analytics wi...
 
SE2016 BigData Vitalii Bondarenko "HD insight spark. Advanced in-memory Big D...
SE2016 BigData Vitalii Bondarenko "HD insight spark. Advanced in-memory Big D...SE2016 BigData Vitalii Bondarenko "HD insight spark. Advanced in-memory Big D...
SE2016 BigData Vitalii Bondarenko "HD insight spark. Advanced in-memory Big D...
 
Apache Spark and DataStax Enablement
Apache Spark and DataStax EnablementApache Spark and DataStax Enablement
Apache Spark and DataStax Enablement
 
Intro to Apache Spark
Intro to Apache SparkIntro to Apache Spark
Intro to Apache Spark
 
Apache spark - Architecture , Overview & libraries
Apache spark - Architecture , Overview & librariesApache spark - Architecture , Overview & libraries
Apache spark - Architecture , Overview & libraries
 
Memulai Data Processing dengan Spark dan Python
Memulai Data Processing dengan Spark dan PythonMemulai Data Processing dengan Spark dan Python
Memulai Data Processing dengan Spark dan Python
 
An introduction To Apache Spark
An introduction To Apache SparkAn introduction To Apache Spark
An introduction To Apache Spark
 
DataFrame: Spark's new abstraction for data science by Reynold Xin of Databricks
DataFrame: Spark's new abstraction for data science by Reynold Xin of DatabricksDataFrame: Spark's new abstraction for data science by Reynold Xin of Databricks
DataFrame: Spark's new abstraction for data science by Reynold Xin of Databricks
 
Big Data Processing with .NET and Spark (SQLBits 2020)
Big Data Processing with .NET and Spark (SQLBits 2020)Big Data Processing with .NET and Spark (SQLBits 2020)
Big Data Processing with .NET and Spark (SQLBits 2020)
 
Parallelizing Existing R Packages
Parallelizing Existing R PackagesParallelizing Existing R Packages
Parallelizing Existing R Packages
 
Big data processing with Apache Spark and Oracle Database
Big data processing with Apache Spark and Oracle DatabaseBig data processing with Apache Spark and Oracle Database
Big data processing with Apache Spark and Oracle Database
 
Spark core
Spark coreSpark core
Spark core
 
Spark Study Notes
Spark Study NotesSpark Study Notes
Spark Study Notes
 
Building highly scalable data pipelines with Apache Spark
Building highly scalable data pipelines with Apache SparkBuilding highly scalable data pipelines with Apache Spark
Building highly scalable data pipelines with Apache Spark
 
Data Engineer's Lunch #82: Automating Apache Cassandra Operations with Apache...
Data Engineer's Lunch #82: Automating Apache Cassandra Operations with Apache...Data Engineer's Lunch #82: Automating Apache Cassandra Operations with Apache...
Data Engineer's Lunch #82: Automating Apache Cassandra Operations with Apache...
 
Spark: The State of the Art Engine for Big Data Processing
Spark: The State of the Art Engine for Big Data ProcessingSpark: The State of the Art Engine for Big Data Processing
Spark: The State of the Art Engine for Big Data Processing
 
Spark (Structured) Streaming vs. Kafka Streams
Spark (Structured) Streaming vs. Kafka StreamsSpark (Structured) Streaming vs. Kafka Streams
Spark (Structured) Streaming vs. Kafka Streams
 
Spark Streaming Programming Techniques You Should Know with Gerard Maas
Spark Streaming Programming Techniques You Should Know with Gerard MaasSpark Streaming Programming Techniques You Should Know with Gerard Maas
Spark Streaming Programming Techniques You Should Know with Gerard Maas
 
An Introduction to Spark
An Introduction to SparkAn Introduction to Spark
An Introduction to Spark
 
An Introduct to Spark - Atlanta Spark Meetup
An Introduct to Spark - Atlanta Spark MeetupAn Introduct to Spark - Atlanta Spark Meetup
An Introduct to Spark - Atlanta Spark Meetup
 

Más de Gokhan Atil

Más de Gokhan Atil (15)

Introduction to Cassandra
Introduction to CassandraIntroduction to Cassandra
Introduction to Cassandra
 
SQL or noSQL - Oracle Cloud Day Istanbul
SQL or noSQL - Oracle Cloud Day IstanbulSQL or noSQL - Oracle Cloud Day Istanbul
SQL or noSQL - Oracle Cloud Day Istanbul
 
EM13c: Write Powerful Scripts with EMCLI
EM13c: Write Powerful Scripts with EMCLIEM13c: Write Powerful Scripts with EMCLI
EM13c: Write Powerful Scripts with EMCLI
 
Oracle Enterprise Manager Cloud Control 13c for DBAs
Oracle Enterprise Manager Cloud Control 13c for DBAsOracle Enterprise Manager Cloud Control 13c for DBAs
Oracle Enterprise Manager Cloud Control 13c for DBAs
 
Essential Linux Commands for DBAs
Essential Linux Commands for DBAsEssential Linux Commands for DBAs
Essential Linux Commands for DBAs
 
Oracle Enterprise Manager Cloud Control 13c for DBAs
Oracle Enterprise Manager Cloud Control 13c for DBAsOracle Enterprise Manager Cloud Control 13c for DBAs
Oracle Enterprise Manager Cloud Control 13c for DBAs
 
Enterprise Manager: Write powerful scripts with EMCLI
Enterprise Manager: Write powerful scripts with EMCLIEnterprise Manager: Write powerful scripts with EMCLI
Enterprise Manager: Write powerful scripts with EMCLI
 
EMCLI Crash Course - DOAG Germany
EMCLI Crash Course - DOAG GermanyEMCLI Crash Course - DOAG Germany
EMCLI Crash Course - DOAG Germany
 
Oracle Enterprise Manager 12c: EMCLI Crash Course
Oracle Enterprise Manager 12c: EMCLI Crash CourseOracle Enterprise Manager 12c: EMCLI Crash Course
Oracle Enterprise Manager 12c: EMCLI Crash Course
 
TROUG & Turkey JUG Semineri: Veriye erişimin en hızlı yolu
TROUG & Turkey JUG Semineri: Veriye erişimin en hızlı yoluTROUG & Turkey JUG Semineri: Veriye erişimin en hızlı yolu
TROUG & Turkey JUG Semineri: Veriye erişimin en hızlı yolu
 
Oracle 12c Database In Memory DBA SIG
Oracle 12c Database In Memory DBA SIGOracle 12c Database In Memory DBA SIG
Oracle 12c Database In Memory DBA SIG
 
Oracle 12c Database In-Memory
Oracle 12c Database In-MemoryOracle 12c Database In-Memory
Oracle 12c Database In-Memory
 
Oracle DB Standard Edition: Başka Bir Arzunuz?
Oracle DB Standard Edition: Başka Bir Arzunuz?Oracle DB Standard Edition: Başka Bir Arzunuz?
Oracle DB Standard Edition: Başka Bir Arzunuz?
 
Enterprise Manager 12c ASH Analytics
Enterprise Manager 12c ASH AnalyticsEnterprise Manager 12c ASH Analytics
Enterprise Manager 12c ASH Analytics
 
Using APEX to Create a Mobile User Interface for Enterprise Manager 12c
Using APEX to Create a Mobile User Interface for Enterprise Manager 12cUsing APEX to Create a Mobile User Interface for Enterprise Manager 12c
Using APEX to Create a Mobile User Interface for Enterprise Manager 12c
 

Último

TECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service providerTECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service provider
mohitmore19
 
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
Health
 
introduction-to-automotive Andoid os-csimmonds-ndctechtown-2021.pdf
introduction-to-automotive Andoid os-csimmonds-ndctechtown-2021.pdfintroduction-to-automotive Andoid os-csimmonds-ndctechtown-2021.pdf
introduction-to-automotive Andoid os-csimmonds-ndctechtown-2021.pdf
VishalKumarJha10
 
The title is not connected to what is inside
The title is not connected to what is insideThe title is not connected to what is inside
The title is not connected to what is inside
shinachiaurasa2
 

Último (20)

TECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service providerTECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service provider
 
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
 
%in Stilfontein+277-882-255-28 abortion pills for sale in Stilfontein
%in Stilfontein+277-882-255-28 abortion pills for sale in Stilfontein%in Stilfontein+277-882-255-28 abortion pills for sale in Stilfontein
%in Stilfontein+277-882-255-28 abortion pills for sale in Stilfontein
 
Chinsurah Escorts ☎️8617697112 Starting From 5K to 15K High Profile Escorts ...
Chinsurah Escorts ☎️8617697112  Starting From 5K to 15K High Profile Escorts ...Chinsurah Escorts ☎️8617697112  Starting From 5K to 15K High Profile Escorts ...
Chinsurah Escorts ☎️8617697112 Starting From 5K to 15K High Profile Escorts ...
 
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
 
AI & Machine Learning Presentation Template
AI & Machine Learning Presentation TemplateAI & Machine Learning Presentation Template
AI & Machine Learning Presentation Template
 
%in tembisa+277-882-255-28 abortion pills for sale in tembisa
%in tembisa+277-882-255-28 abortion pills for sale in tembisa%in tembisa+277-882-255-28 abortion pills for sale in tembisa
%in tembisa+277-882-255-28 abortion pills for sale in tembisa
 
A Secure and Reliable Document Management System is Essential.docx
A Secure and Reliable Document Management System is Essential.docxA Secure and Reliable Document Management System is Essential.docx
A Secure and Reliable Document Management System is Essential.docx
 
introduction-to-automotive Andoid os-csimmonds-ndctechtown-2021.pdf
introduction-to-automotive Andoid os-csimmonds-ndctechtown-2021.pdfintroduction-to-automotive Andoid os-csimmonds-ndctechtown-2021.pdf
introduction-to-automotive Andoid os-csimmonds-ndctechtown-2021.pdf
 
Define the academic and professional writing..pdf
Define the academic and professional writing..pdfDefine the academic and professional writing..pdf
Define the academic and professional writing..pdf
 
The title is not connected to what is inside
The title is not connected to what is insideThe title is not connected to what is inside
The title is not connected to what is inside
 
Unlocking the Future of AI Agents with Large Language Models
Unlocking the Future of AI Agents with Large Language ModelsUnlocking the Future of AI Agents with Large Language Models
Unlocking the Future of AI Agents with Large Language Models
 
Direct Style Effect Systems - The Print[A] Example - A Comprehension Aid
Direct Style Effect Systems -The Print[A] Example- A Comprehension AidDirect Style Effect Systems -The Print[A] Example- A Comprehension Aid
Direct Style Effect Systems - The Print[A] Example - A Comprehension Aid
 
LEVEL 5 - SESSION 1 2023 (1).pptx - PDF 123456
LEVEL 5   - SESSION 1 2023 (1).pptx - PDF 123456LEVEL 5   - SESSION 1 2023 (1).pptx - PDF 123456
LEVEL 5 - SESSION 1 2023 (1).pptx - PDF 123456
 
%in Bahrain+277-882-255-28 abortion pills for sale in Bahrain
%in Bahrain+277-882-255-28 abortion pills for sale in Bahrain%in Bahrain+277-882-255-28 abortion pills for sale in Bahrain
%in Bahrain+277-882-255-28 abortion pills for sale in Bahrain
 
%in ivory park+277-882-255-28 abortion pills for sale in ivory park
%in ivory park+277-882-255-28 abortion pills for sale in ivory park %in ivory park+277-882-255-28 abortion pills for sale in ivory park
%in ivory park+277-882-255-28 abortion pills for sale in ivory park
 
Microsoft AI Transformation Partner Playbook.pdf
Microsoft AI Transformation Partner Playbook.pdfMicrosoft AI Transformation Partner Playbook.pdf
Microsoft AI Transformation Partner Playbook.pdf
 
The Top App Development Trends Shaping the Industry in 2024-25 .pdf
The Top App Development Trends Shaping the Industry in 2024-25 .pdfThe Top App Development Trends Shaping the Industry in 2024-25 .pdf
The Top App Development Trends Shaping the Industry in 2024-25 .pdf
 
Introducing Microsoft’s new Enterprise Work Management (EWM) Solution
Introducing Microsoft’s new Enterprise Work Management (EWM) SolutionIntroducing Microsoft’s new Enterprise Work Management (EWM) Solution
Introducing Microsoft’s new Enterprise Work Management (EWM) Solution
 
Sector 18, Noida Call girls :8448380779 Model Escorts | 100% verified
Sector 18, Noida Call girls :8448380779 Model Escorts | 100% verifiedSector 18, Noida Call girls :8448380779 Model Escorts | 100% verified
Sector 18, Noida Call girls :8448380779 Model Escorts | 100% verified
 

Introduction to Spark with Python

  • 2. GÖKHAN ATIL ➤ Database Administrator ➤ Oracle ACE Director (2016)
 ACE (2011) ➤ 10g/11g and R12 Oracle Certified Professional (OCP) ➤ Co-author of Expert Oracle Enterprise Manager 12c ➤ Founding Member and Vice President of TROUG ➤ Blogger (since 2008) gokhanatil.com ➤ Twitter: @gokhanatil 2
  • 3. APACHE SPARK WITH PYTHON ➤ Introduction to Apache Spark ➤ Why Python (PySpark) instead of Scala? ➤ Spark RDD ➤ SQL and DataFrames ➤ Spark Streaming ➤ Spark Graphx ➤ Spark MLlib (Machine Learning) 3
  • 4. INTRODUCTION TO APACHE SPARK ➤ A fast and general engine for large-scale data processing ➤ Top-Level Apache Project since 2014. ➤ Response to limitations in the MapReduce ➤ Run programs up to 100x faster than Hadoop MapReduce in memory, or 10x faster on disk ➤ Implemented in Scala programming language, supports Java, Scala, Python, R ➤ Runs on Hadoop, Mesos, Kubernetes, standalone, cloud 4
  • 5. DOWNLOAD AND RUN ON YOUR PC ➤ https://spark.apache.org/downloads.html ➤ Extract and Spark is ready: tar -xzf spark-2.3.0-bin-hadoop2.7.tgz spark-2.3.0-bin-hadoop2.7/bin/spark ➤ You can also use PIP: pip install pyspark 5
  • 6. PYSPARK AND SPARK-SUBMIT ➤ PySpark is the interface that gives access to Spark using the Python programming language ➤ The spark-submit script in Spark’s bin directory is used to launch applications on a cluster spark-submit example1.py 6
  • 7. WHY PYTHON INSTEAD OF SCALA? ➤ If you know Scala, then use Scala! ➤ Learning curve: Python is comparatively easier to learn ➤ Easy to use: Code readability, maintainability and familiarity is far better with Python ➤ Libraries: Python comes with great libraries for data analysis, statistics and visualization (numpy, pandas, matplotlib etc...) ➤ Performance:  Scala is faster then Python but if your Python code just calls Spark libraries, the differences in performance is minimal (*) 7 Reminder: Any new feature added in Spark API will be available in Scala first
  • 8. RESILIENT DISTRIBUTED DATASET (RDD) ➤ RDDs are the core data structure in Spark ➤ Distributed, resilient, immutable, can store unstructured and structured data, lazy evaluated 8 node 1 RDD partition 1 node 2 RDD partition 2 node 3 RDD partition 3 RDD
  • 9. RDD TRANSFORMATIONS AND ACTIONS sc.textFile(*) 9 RDD T1 T2 T3 ACTION LAZY EVALUATATION SPARK CONTEXT .collect().reduceByKey(*).filter(*).map(*)
  • 10. TRANSFORMATIONS ACTIONS ➤ map ➤ filter ➤ flatMap ➤ mapPartitions ➤ reduceByKey ➤ union ➤ intersection ➤ join 10 ➤ collect ➤ count ➤ first ➤ take ➤ takeSample ➤ takeOrdered ➤ saveAsTextFile ➤ foreach
  • 11. HOW TO CREATE RDD IN PYSPARK ➤ Referencing a dataset in an external storage system: rdd = sc.textFile( ... ) ➤ Parallelizing already existing collection: rdd = sc.parallelize( ... ) ➤ Creating RDD from already existing RDDs: rdd2 = rdd1.map( ... ) 11
  • 12. USERS.CSV (MOVIELENS DATABASE) id | age | gender | occupation | zip 1|24|M|technician|85711 2|53|F|other|94043 3|23|M|writer|32067 4|24|M|technician|43537 5|33|F|other|15213 6|42|M|executive|98101 7|57|M|administrator|91344 8|36|M|administrator|05201 12 M = 670 F = 273
  • 13. EXAMPLE #1: USE RDD TO GROUP DATA FROM CSV from pyspark import SparkContext sc = SparkContext.getOrCreate() print sc.textFile( "users.csv" ) .map( lambda x: (x.split("|")[2], 1) ) .reduceByKey(lambda x,y:x+y).collect() sc.stop() 13 M, 1 M, 1 F, 1 M, 1 [(u'M', 670), (u'F', 273)]
  • 14. SPARK SQL AND DATAFRAMES 14 Catalyst RDD DataFrames/DataSetsSQL SPARKSQL MLlib GraphFrames Structured Streaming ➤ Spark SQL is Apache Spark's module for working with structured data
  • 15. DATAFRAMES AND DATASETS ➤ DataFrame is a distributed collection of "structured" data, organized into named columns. ➤ Spark DataSets are statically typed, while Python is a dynamically typed programming language so Python supports only DataFrames. 15
  • 16. EXAMPLE #2: USE DATAFRAME TO GROUP DATA FROM CSV from pyspark import SparkContext from pyspark.sql import SparkSession sc = SparkContext.getOrCreate() spark = SparkSession(sc) spark.read.load( "users.csv", format="csv", sep="|" ) .toDF( "id","age","gender","occupation","zip" ) .groupby( "gender" ).count().show() sc.stop() 16
  • 18. CATALYST OPTIMIZER ➤ Spark SQL uses Catalyst optimizer to optimize query plans. ➤ Supports cost-based optimization since Spark 2.2 18 SQL DataFrame DataSet Query Plan Optimized Query Plan RDD Code Generation
  • 19. CONVERSION BETWEEN RDD AND DATAFRAME ➤ An RDD can be converted to DataFrame using createDataFrame or toDF method: rdd = sc.parallelize([("osman",21),("ahmet",25)]) df = rdd.toDF( "name STRING, age INT" ) df.show() ➤ You can access underlying RDD of a DataFrame using rdd property: df.rdd.collect() [Row(name=u'osman',age=21),Row(name=u'ahmet',age=25)] 19
  • 20. EXAMPLE #3: CREATE TEMPORARY VIEWS FROM DATAFRAMES spark.read.load( "users.csv", format="csv", sep="|" ) .toDF( "id","age","gender","occupation","zip" ) .createOrReplaceTempView( "users" ) spark.sql( "select count(*) from users" ).show() spark.sql( "select case when age < 25 then '-25' when age between 25 and 39 then '25-40' when age >= 40 then '40+' end age_group, count(*) from users group by age_group order by 1" ).show() 20
  • 21. EXAMPLE #4: READ AND WRITE DATA df = spark.read.load( "users.csv", format="csv", sep="|" ) .toDF( "id","age","gender","occupation","zip" ) df.write.saveAsTable("users") df .write.save("users.json", format="json", mode="overwrite") spark.sql("SELECT gender, count(*) FROM json.`users.json` GROUP BY gender").show() 21 HIVE
  • 22. SPARK STREAMING (DSTREAMS) ➤ Scalable, high-throughput, fault-tolerant stream processing of live data streams ➤ Supports: File, Socket, Kafka, Flume, Kinesis ➤ Spark Streaming receives live input data streams and divides the data into batches 22
  • 23. EXAMPLE #5: DISCRETIZED STREAMS (DSTREAMS) ssc = StreamingContext(sc, 1) stream_data = ssc.textFileStream("file:///tmp/stream") .map( lambda x: x.split(",")) stream_data.pprint() ssc.start() ssc.awaitTermination() 23
  • 25. STRUCTURED STREAMING ➤ Stream processing engine built on the Spark SQL engine ➤ Supports File and Kafka sources for production; Socket and Rate sources for testing 25
  • 26. EXAMPLE #6: STRUCTURED STREAMING stream_data = spark.readStream .load( format="csv",path="/tmp/stream/*.csv", schema="name string, points int" ) .groupBy("name").sum("points").orderBy( "sum(points)", ascending=0 ) stream_data.writeStream.start( format="console", outputMode="complete" ).awaitTermination() 26
  • 28. GRAPHX (GRAPHFRAMES) ➤ GraphX is a new component in Spark for graphs and graph- parallel computation. 28
  • 29. EXAMPLE #7: GRAPHFRAMES vertex = spark.createDataFrame([ (1, "Ahmet"), (2, "Mehmet"), (3, "Cengiz"), (4, "Osman")], ["id", "name"]) edges = spark.createDataFrame([ ( 1, 2, "friend" ), ( 2, 1, "friend" ), ( 2, 3, "friend" ), ( 3, 2, "friend" ), ( 2, 4, "friend" ), ( 4, 2, "friend" ), ( 3, 4, "friend" ), ( 4, 3, "friend" )], ["src","dst", "relation"]) 29
  • 30. EXAMPLE #7: GRAPHFRAMES pyspark --packages graphframes:graphframes:0.5.0-spark2.1- s_2.11 import graphframes as gf g = gf.GraphFrame(vertex, edges) g.shortestPaths([4]).show() 30 1 2 3 4
  • 31. MLLIB (MACHINE LEARNING) ➤ Supports common ML Algorithms such as classification, regression, clustering, and collaborative filtering ➤ Featurization: ➤ Feature extraction (TF-IDF, Word2Vec, CountVectorizer ...) ➤ Transformation (Tokenizer, StopWordsRemover ...) ➤ Selection (VectorSlicer, RFormula ... ) ➤ Pipelines: combine multiple algorithms into a single pipeline, or workflow ➤ DataFrame-based API is primary API 31
  • 32. EXAMPLE #8: ALTERNATING LEAST SQUARES (ALS) def parseratings( x ): v = x.split("::") return (int(v[0]), int(v[1]), float(v[2])) ratings = sc.textFile("ratings.dat").map(parseratings) .toDF( ["user", "id", "rating"] ) als = ALS(userCol="user", itemCol="id", ratingCol="rating") model = als.fit(ratings) model.recommendForAllUsers(10).show() 32