Welcome
Chicago Data Engineering Meetup
- Our First Event – November 2018
- Objectives
- Every 2 months
- Format
- sharing experiences (open for volunteers)
- new tools / demos
- Open for suggestions
Agenda
01 Who I am
02 QuantumBlack
03 Today’s topic: Spark UDF Performance
04 Background
05 Benchmarking – Live demo
06 Conclusion and Our Approach
07 Q&A
Who I am
01
All content copyright © 2017 QuantumBlack, a McKinsey company
Client case studies
Experience across several industry sectors,
including telecoms, retail, financial services and
pharmaceuticals.
Financial sector – Advanced Analytics
projects for Fraud detection in Internet Banking
and Credit Risk Modelling.
Telecommunications – Petabyte scale
environment, delivering several use cases,
including: real-time failure detection using CDR
data, customer profiling and marketing
campaigns.
Manufacturing – data wrangling in a failure
detection project for computer parts
manufacturing in Europe.
Pharmaceuticals – Site selection optimisation
for a top pharma player.
Telematics (Car insurance) – machine learning
model that estimates the probability of crashing
for each driver, based on data from on-board
unit boxes installed in cars, containing
geo-location positions, speed and acceleration
of ~2 million drivers over a 2-year period.
Complex feature creation using terabyte-scale
data and external data sources such as weather,
street and traffic data.
Education
Guilherme has a BSc in Data Processing from
Mackenzie University and specialisations in
Machine Learning and Business Intelligence.
Role
Big Data technology expert based in Chicago.
Work with clients to translate business
hypotheses into data requirements and
technology solutions.
Expertise
Provides technical data engineering oversight
on projects and advises other data engineers
on architecture definition and performance
optimization for large-scale data wrangling.
Professional experience
Prior to joining QuantumBlack, Guilherme
specialised for over 18 years in Data
Warehouse and Business Intelligence projects
in large-scale environments. More recently, 6
years of experience in Big Data projects and
architecture, many of them at petabyte scale, as
well as real-time projects.
Previously led big data projects at Hortonworks,
SAP and large financial institutions.
BIOGRAPHY
Guilherme Braccialli
Principal Data Engineer, QuantumBlack,
Chicago
QuantumBlack
02
QB exploit data, analytics and
design to help our clients be the
best they can be
We were born and proven in
Formula One, where the smallest
margins are the difference
between winning and losing and
data has emerged as a
fundamental element of
competitive advantage
QuantumBlack
In elite sport the
smallest edge makes
the difference,
and the best teams
exploit this to outlearn
their rivals
Since then, we have applied our proven
methodology across multiple sectors
Advanced
Industries
Aerospace
Automotive
Semi-Conductors
Urban Infrastructure
Financial
Services
Asset Management
Payment Networks
Private Banking
Retail Banking
Health &
Wellbeing
Hospitals
Medical Devices
Pharmaceutical
Natural
Resources
Oil & Gas
Mining
Renewable Energy
Utilities
Sports
Basketball
Baseball
Formula One
Soccer
Spark UDF Performance
03
- Share our learnings
- Running Spark at scale
- Practical Examples
- Live demo (code)
Background
04
• Open Source
‒ We are a consulting company; we enable our clients to use Advanced Analytics
‒ We don’t sell an out-of-the-box solution / licensing
‒ We use open source tools, so clients can run them anywhere
• Scalable
‒ We deal with big data volumes
‒ Multiple TBs of data
‒ Spark can run in distributed mode on several cluster managers (Hadoop YARN, Kubernetes, Standalone)
• Flexibility and Integration
‒ Supports multiple languages: Python, SQL, Scala, Java, R
‒ Batch, Streaming, Graph, Machine Learning
‒ Easy to integrate with Data Scientist code, single data pipeline
Why we use Spark
BACKGROUND
• In the Cloud
‒ AWS (EMR)
‒ Azure (HDInsight)
‒ Google Cloud (DataProc)
‒ Databricks (AWS or Azure)
• On-premises
‒ Some clients have their own internal Hadoop cluster on premises
Where we run
BACKGROUND
Why PySpark / Performance implications
BACKGROUND
• PySpark is the best choice for integrating Data Engineering and Data Science work in a single data pipeline
• Same performance as Scala for DataFrame operations (PySpark is a wrapper that runs native Scala code)
• Performance hit when we use Python UDFs (data makes a round trip: Scala - Python - Scala)
• Pandas UDFs (Vectorized UDFs) + Arrow
‒ Announced in late 2017 for Spark 2.3
https://www.twosigma.com/insights/article/introducing-vectorized-udfs-for-pyspark/
https://databricks.com/blog/2017/10/30/introducing-vectorized-udfs-for-pyspark.html
‒ but… where are the Scala numbers?
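To make the three execution paths on this slide concrete, here is a minimal sketch (not the code from the talk) that expresses one conversion as a row-at-a-time Python UDF, as a Pandas UDF and as a native column expression. The function names, the `temp_f` column and the temperature example are hypothetical, and the sketch assumes Spark 3.0 or later with pandas and PyArrow installed.

```python
def fahrenheit_to_celsius(temp_f: float) -> float:
    """Plain Python logic; invoked once per row when wrapped with udf()."""
    return (temp_f - 32.0) * 5.0 / 9.0


def build_columns(df):
    """Return the same conversion expressed three ways.

    `df` is assumed to be a Spark DataFrame with a numeric column `temp_f`.
    Imports are local so the pure-Python function above works without Spark.
    """
    import pandas as pd
    from pyspark.sql import functions as F
    from pyspark.sql.types import DoubleType

    # 1) Row-at-a-time Python UDF: data is shipped JVM -> Python -> JVM
    #    and the function is called for every single value.
    row_udf = F.udf(fahrenheit_to_celsius, DoubleType())

    # 2) Pandas (vectorized) UDF: data moves in Arrow batches and the
    #    function receives a whole pandas Series at a time.
    @F.pandas_udf(returnType=DoubleType())
    def vectorized_udf(temp_f: pd.Series) -> pd.Series:
        return (temp_f - 32.0) * 5.0 / 9.0

    # 3) Native column expression: stays inside the JVM, no Python worker.
    native = (F.col("temp_f") - 32.0) * 5.0 / 9.0

    return df.select(
        row_udf("temp_f").alias("c_row_udf"),
        vectorized_udf("temp_f").alias("c_pandas_udf"),
        native.alias("c_native"),
    )
```

When the logic can be written as a native expression, as in option 3, no UDF is needed at all; the UDF variants matter for logic that the built-in functions cannot express.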
Benchmarking – Live Demo
05
Databricks Notebook – (try on Community version)
LIVE DEMO
https://bit.ly/2E4ehIm
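The notebook itself is not reproduced here. As a rough sketch of how such a comparison could be timed, the hypothetical helper below runs a callable several times and reports wall-clock statistics. Because Spark evaluates lazily, the callable has to trigger an action (for example an aggregation followed by `collect()`), otherwise only query planning is measured.

```python
import time
from statistics import median
from typing import Callable, Dict, List


def time_runs(action: Callable[[], object], repeats: int = 3) -> List[float]:
    """Run `action` `repeats` times and return wall-clock seconds per run."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        action()  # should force a Spark action, not just build a plan
        timings.append(time.perf_counter() - start)
    return timings


def summarize(timings: List[float]) -> Dict[str, float]:
    """Reduce a list of timings to min / median / max."""
    return {"min": min(timings), "median": median(timings), "max": max(timings)}
```

With Spark, a call might look like `summarize(time_runs(lambda: df.selectExpr("sum(my_udf(x))").collect()))`, where `my_udf` and `x` stand in for a registered UDF and a column of your own. Repeating the run helps separate warm-up effects from steady-state cost.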
Conclusion and Our Approach
06
Best of both worlds: PySpark with Scala performance
CONCLUSION AND OUR APPROACH
• Conclusion
‒ Pandas (vectorized) UDFs can be faster than regular PySpark UDFs, but not always
‒ PySpark UDFs (vectorized or not) are much slower than Scala UDFs
• Our Approach
‒ We use PySpark UDFs when the data volume is small, or for quick insights on sample data
‒ We built an internal library of reusable Scala UDFs
‒ We created Python wrappers to call the Scala UDFs
‒ Demo
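The internal library is not shown in the slides. One way such a Python wrapper could look, as a sketch only, is to register a JVM UDF class through the public `spark.udf.registerJavaFunction` API and call it through a SQL expression. The class name `com.example.udfs.PlusOne`, the function names and the helpers below are hypothetical; the class is assumed to implement one of Spark's Java `UDF1`..`UDF22` interfaces and to be on the classpath (for example via `--jars`).

```python
def scala_udf_call(function_name: str, *column_names: str) -> str:
    """Build the SQL expression that invokes a registered JVM UDF."""
    quoted = ", ".join(f"`{name}`" for name in column_names)
    return f"{function_name}({quoted})"


def register_scala_udf(spark, function_name: str, jvm_class: str, return_type=None):
    """Register a JVM UDF class and return a Python callable that uses it.

    `spark` is an active SparkSession. If `return_type` is None, Spark
    tries to infer the return type via reflection.
    """
    spark.udf.registerJavaFunction(function_name, jvm_class, return_type)

    def wrapper(*column_names: str):
        # Local import keeps scala_udf_call usable without Spark installed.
        from pyspark.sql import functions as F

        # The expression is evaluated entirely in the JVM, so rows are
        # never shipped to a Python worker.
        return F.expr(scala_udf_call(function_name, *column_names))

    return wrapper


# Hypothetical usage, assuming the jar with com.example.udfs.PlusOne is loaded:
#   plus_one = register_scala_udf(spark, "plus_one", "com.example.udfs.PlusOne")
#   df.select(plus_one("value").alias("value_plus_one"))
```

The wrapper keeps the pipeline in PySpark while the per-row work runs as compiled JVM code, which is the trade-off the conclusion above describes.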
Q&A
07
Thank you!
- Would you like to share your
experiences at upcoming events?
and…
- We are hiring!!!
