Apache Spark is a general-purpose big data execution engine. You can work with different data sources through the same set of APIs in both batch and streaming mode. Such flexibility is great if you are an experienced Spark developer solving a complicated data engineering problem that might include ML or streaming. At Airbnb, 95% of all data pipelines are daily batch jobs that read from Hive tables and write to Hive tables. For such jobs, you would like to trade some flexibility for more extensive functionality around writing to Hive and orchestrating multi-day processing. Another advantage of reducing flexibility is that it lets you establish best practices that less experienced data engineers can follow. At Airbnb, we've created a framework called "Sputnik" that tries to address these issues. In this talk, I'll show the typical boilerplate code that Sputnik reduces and the concepts it introduces to simplify pipeline development.
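For context, here is a minimal sketch of the kind of per-job boilerplate meant above; it is not Sputnik code, and the table names, the ds partition column, and the aggregation are hypothetical. Session setup, run-date parsing, and partition-aware writes like these tend to be repeated in every daily Hive-to-Hive job:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.lit

object DailyJob {
  def main(args: Array[String]): Unit = {
    // Every job parses its own run date argument, e.g. "2020-01-01".
    val ds = args(0)

    // Every job builds its own Hive-enabled session.
    val spark = SparkSession.builder()
      .appName("daily_job")
      .enableHiveSupport()
      .getOrCreate()

    // Read a single day's partition from an input Hive table.
    val input = spark.table("src_db.events").where(s"ds = '$ds'")

    // The actual business logic is often a small fraction of the file.
    val output = input.groupBy("user_id").count()

    // Write back to a partitioned Hive table, overwriting only this
    // run date's partition (requires the target table to exist).
    spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")
    output.withColumn("ds", lit(ds))
      .write
      .mode("overwrite")
      .insertInto("dst_db.user_event_counts")
  }
}
```

A framework aimed at this job shape can own the session, the date arithmetic (including backfills over ranges of days), and the Hive write path, leaving only the transformation to the pipeline author.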