SlideShare una empresa de Scribd logo
1 de 22
Descargar para leer sin conexión
Building an Analytics
Workflow using Apache
Airflow
Yohei Onishi
PyCon APAC 2019, Feb. 23-24 2019
Presenter Profile
● Yohei Onishi
● Twitter: legoboku, Github:
yohei1126
● Data Engineer at a Japanese
retail company
● Based in Singapore since Oct.
2018
● Apache Airflow Contributor
2
Session overview
● Expected audiences: Data engineers
○ who are working on building a pipleline
○ who are looking for a better workflow solution
● Goal: Provide the following so they can use Airflow
○ Airflow overview and how to author workflow
○ Server configuration and CI/CD in my usecase
○ Recommendations for new users (GCP Cloud
Composer)
3
Data pipeline
data source collect ETL analytics data consumer
micro services
enterprise
systems
IoT devices
object storage
message queue
micro services
enterprise
systems
BI tool
4
Our requirements for ETL worflow
● Already built a data lake on AWS S3 to store structured /
unstructured data
● Want to build a batch based analytics platform
● Requirements
○ Workflow generation by code (Python) rather than GUI
○ OSS: avoid vendor lock in
○ Scalable: batch data processing and workflow
○ Simple and easily extensible
○ Workflow visualization 5
Another workflow engine: Apache Nifi
6
Airflow overview
● Brief history
○ Open sourced by Airbnb and Apache top project
○ Cloud Composer: managed Airflow on GCP
● Characteristics
○ Dynamic workflow generation by Python code
○ Easily extensible so you can fit it to your usecase
○ Scalable by using a message queue to orchestrate
arbitrary number of workers
7
Example: Copy a file from s3 bucket to another
export records
as CSV Singapore region
US region
EU region
transfer it to a
regional bucket
8
local region
DEMO: UI and source code
sample code: https://github.com/yohei1126/pycon-apac-2019-airflow-sample 9
Concept: Directed acyclic graph, operator, task, etc
custom_param_per_dag = {'sg': { ... }, 'eu': { ... }, 'us': { ... }}
for region, v in custom_param_per_dag.items():
dag = DAG('shipment_{}'.format(region), ...)
t1 = PostgresToS3Operator(task_id='db_to_s3', ...)
t2 = S3CopyObjectOperator(task_id='s3_to_s3', ...)
t1 >> t2
globals()[dag] = dag
10
template
t1 = PostgresToS3Operator(
task_id='db_to_s3',
sql="SELECT * FROM shipment WHERE region = '{{ params.region }}'
AND ship_date = '{{ execution_date.strftime("%Y-%m-%d") }}'",
bucket=default_args['source_bucket'],
object_key='{{ params.region }}/{{
execution_date.strftime("%Y%m%d%H%M%S") }}.csv',
params={'region':region},
dag=dag) 11
Operator
class PostgresToS3Operator(BaseOperator):
template_fields = ('sql', 'bucket', 'object_key')
def __init__(self, ..., *args, **kwargs):
super(PostgresToS3Operator, self).__init__(*args, **kwargs)
...
def execute(self, context):
...
12
HA Airflow cluster
executor
(1..N)
worker node (1)
executor
(1..N)
worker node (2)
executor
(1..N)
worker node (1)
... scheduler
master node (1)
web
server
master node
(2)
web
server
LB
admin
Airflow metadata DBCelery result backend message broker 13
http://site.clairvoyantsoft.com/setting-apache-airflow-cluster/
CI/CD pipeline
AWS SNS AWS SQS
Github repo
raise / merge
a PR
Airflow worker
polling
run Ansible script
git pull
test
deployment
14
Monitoring
Airflow worker
(EC2)
AWS CloudWatch
notify an error
if DAG fails using
Airflow slack webhook
notify an error if a
CloudWatch Alarm is
triggered slack webhook
15
GCP Cloud Composer
● Fully managed Airflow cluster provided by GCP
○ Fully managed
○ Built in integrated with the other GCP services
● To focus on business logic, you should build Airflow
cluster using GCP composer
16
Create a cluster using CLI
$ gcloud composer environments create ENVIRONMENT_NAME 
--location LOCATION 
OTHER_ARGUMENTS
● New Airflow cluster will be deployed as Kubenetes cluster on GKE
● We usually specify the following options as OTHER_ARGUMENTS
○ infra: instance type, disk size, VPC network, etc.
○ software configuration: Python version, Airflow version, etc.
17
Deploy your source code to the cluster
$ gcloud composer environments storage dags import 
--environment my-environment --location us-central1 
--source test-dags/quickstart.py
● This will upload your source code to cluster specific GCS bucket.
○ You can also directly upload your file to the bucket
● Then the file will be automatically deployed
18
monitoring cluster using Stackdriver
19
Demo: GCP Cloud Composer
● Create an environment
● Stackdriver logging
● GKE as backend
20
Summary
● Data Engineers have to build reliable and scalable data
pipeline to accelate data analytics activities
● Airflow is great tool to author and monitor workflow
● HA Airflow cluster is required for high availablity
● GCP Cloud Compose enables us to build a cluster easily
and focus on business logic
21
References
● Apache Airflow
● GCP Cloud Composer
● Airflow: a workflow management platform
● ETL best practices in Airflow 1.8
● Data Science for Startups: Data Pipelines
● Airflow: Tips, Tricks, and Pitfalls
22

Más contenido relacionado

La actualidad más candente

Airflow presentation
Airflow presentationAirflow presentation
Airflow presentationIlias Okacha
 
Airflow tutorials hands_on
Airflow tutorials hands_onAirflow tutorials hands_on
Airflow tutorials hands_onpko89403
 
How I learned to time travel, or, data pipelining and scheduling with Airflow
How I learned to time travel, or, data pipelining and scheduling with AirflowHow I learned to time travel, or, data pipelining and scheduling with Airflow
How I learned to time travel, or, data pipelining and scheduling with AirflowPyData
 
Building a Data Pipeline using Apache Airflow (on AWS / GCP)
Building a Data Pipeline using Apache Airflow (on AWS / GCP)Building a Data Pipeline using Apache Airflow (on AWS / GCP)
Building a Data Pipeline using Apache Airflow (on AWS / GCP)Yohei Onishi
 
Airflow - a data flow engine
Airflow - a data flow engineAirflow - a data flow engine
Airflow - a data flow engineWalter Liu
 
Apache Airflow Introduction
Apache Airflow IntroductionApache Airflow Introduction
Apache Airflow IntroductionLiangjun Jiang
 
Apache Airflow Architecture
Apache Airflow ArchitectureApache Airflow Architecture
Apache Airflow ArchitectureGerard Toonstra
 
Building Better Data Pipelines using Apache Airflow
Building Better Data Pipelines using Apache AirflowBuilding Better Data Pipelines using Apache Airflow
Building Better Data Pipelines using Apache AirflowSid Anand
 
Apache Airflow in Production
Apache Airflow in ProductionApache Airflow in Production
Apache Airflow in ProductionRobert Sanders
 
Airflow Best Practises & Roadmap to Airflow 2.0
Airflow Best Practises & Roadmap to Airflow 2.0Airflow Best Practises & Roadmap to Airflow 2.0
Airflow Best Practises & Roadmap to Airflow 2.0Kaxil Naik
 
Orchestrating workflows Apache Airflow on GCP & AWS
Orchestrating workflows Apache Airflow on GCP & AWSOrchestrating workflows Apache Airflow on GCP & AWS
Orchestrating workflows Apache Airflow on GCP & AWSDerrick Qin
 
Airflow at lyft
Airflow at lyftAirflow at lyft
Airflow at lyftTao Feng
 
From airflow to google cloud composer
From airflow to google cloud composerFrom airflow to google cloud composer
From airflow to google cloud composerBruce Kuo
 

La actualidad más candente (20)

Airflow for Beginners
Airflow for BeginnersAirflow for Beginners
Airflow for Beginners
 
Airflow presentation
Airflow presentationAirflow presentation
Airflow presentation
 
Airflow presentation
Airflow presentationAirflow presentation
Airflow presentation
 
Airflow introduction
Airflow introductionAirflow introduction
Airflow introduction
 
Airflow tutorials hands_on
Airflow tutorials hands_onAirflow tutorials hands_on
Airflow tutorials hands_on
 
How I learned to time travel, or, data pipelining and scheduling with Airflow
How I learned to time travel, or, data pipelining and scheduling with AirflowHow I learned to time travel, or, data pipelining and scheduling with Airflow
How I learned to time travel, or, data pipelining and scheduling with Airflow
 
Building a Data Pipeline using Apache Airflow (on AWS / GCP)
Building a Data Pipeline using Apache Airflow (on AWS / GCP)Building a Data Pipeline using Apache Airflow (on AWS / GCP)
Building a Data Pipeline using Apache Airflow (on AWS / GCP)
 
Apache airflow
Apache airflowApache airflow
Apache airflow
 
Airflow - a data flow engine
Airflow - a data flow engineAirflow - a data flow engine
Airflow - a data flow engine
 
Apache Airflow Introduction
Apache Airflow IntroductionApache Airflow Introduction
Apache Airflow Introduction
 
Apache Airflow
Apache AirflowApache Airflow
Apache Airflow
 
Apache Airflow Architecture
Apache Airflow ArchitectureApache Airflow Architecture
Apache Airflow Architecture
 
Airflow Intro-1.pdf
Airflow Intro-1.pdfAirflow Intro-1.pdf
Airflow Intro-1.pdf
 
Building Better Data Pipelines using Apache Airflow
Building Better Data Pipelines using Apache AirflowBuilding Better Data Pipelines using Apache Airflow
Building Better Data Pipelines using Apache Airflow
 
Apache Airflow in Production
Apache Airflow in ProductionApache Airflow in Production
Apache Airflow in Production
 
Airflow Best Practises & Roadmap to Airflow 2.0
Airflow Best Practises & Roadmap to Airflow 2.0Airflow Best Practises & Roadmap to Airflow 2.0
Airflow Best Practises & Roadmap to Airflow 2.0
 
Airflow at WePay
Airflow at WePayAirflow at WePay
Airflow at WePay
 
Orchestrating workflows Apache Airflow on GCP & AWS
Orchestrating workflows Apache Airflow on GCP & AWSOrchestrating workflows Apache Airflow on GCP & AWS
Orchestrating workflows Apache Airflow on GCP & AWS
 
Airflow at lyft
Airflow at lyftAirflow at lyft
Airflow at lyft
 
From airflow to google cloud composer
From airflow to google cloud composerFrom airflow to google cloud composer
From airflow to google cloud composer
 

Similar a Building an analytics workflow using Apache Airflow

Scalable Clusters On Demand
Scalable Clusters On DemandScalable Clusters On Demand
Scalable Clusters On DemandBogdan Kyryliuk
 
Kubernetes Forum Seoul 2019: Re-architecting Data Platform with Kubernetes
Kubernetes Forum Seoul 2019: Re-architecting Data Platform with KubernetesKubernetes Forum Seoul 2019: Re-architecting Data Platform with Kubernetes
Kubernetes Forum Seoul 2019: Re-architecting Data Platform with KubernetesSeungYong Oh
 
From business requirements to working pipelines with apache airflow
From business requirements to working pipelines with apache airflowFrom business requirements to working pipelines with apache airflow
From business requirements to working pipelines with apache airflowDerrick Qin
 
Architecting Analytic Pipelines on GCP - Chicago Cloud Conference 2020
Architecting Analytic Pipelines on GCP - Chicago Cloud Conference 2020Architecting Analytic Pipelines on GCP - Chicago Cloud Conference 2020
Architecting Analytic Pipelines on GCP - Chicago Cloud Conference 2020Mariano Gonzalez
 
Improving Apache Spark Downscaling
 Improving Apache Spark Downscaling Improving Apache Spark Downscaling
Improving Apache Spark DownscalingDatabricks
 
Upcoming features in Airflow 2
Upcoming features in Airflow 2Upcoming features in Airflow 2
Upcoming features in Airflow 2Kaxil Naik
 
Sprint 121
Sprint 121Sprint 121
Sprint 121ManageIQ
 
Machine learning at scale with Google Cloud Platform
Machine learning at scale with Google Cloud PlatformMachine learning at scale with Google Cloud Platform
Machine learning at scale with Google Cloud PlatformMatthias Feys
 
Scaling 100PB Data Warehouse in Cloud
Scaling 100PB Data Warehouse in CloudScaling 100PB Data Warehouse in Cloud
Scaling 100PB Data Warehouse in CloudChangshu Liu
 
202107 - Orion introduction - COSCUP
202107 - Orion introduction - COSCUP202107 - Orion introduction - COSCUP
202107 - Orion introduction - COSCUPRonald Hsu
 
What's coming in Airflow 2.0? - NYC Apache Airflow Meetup
What's coming in Airflow 2.0? - NYC Apache Airflow MeetupWhat's coming in Airflow 2.0? - NYC Apache Airflow Meetup
What's coming in Airflow 2.0? - NYC Apache Airflow MeetupKaxil Naik
 
Building Kick Ass Video Games for the Cloud
Building Kick Ass Video Games for the CloudBuilding Kick Ass Video Games for the Cloud
Building Kick Ass Video Games for the CloudChris Schalk
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)wesley chun
 
Scaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesScaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesDatabricks
 
Integrating ChatGPT with Apache Airflow
Integrating ChatGPT with Apache AirflowIntegrating ChatGPT with Apache Airflow
Integrating ChatGPT with Apache AirflowTatiana Al-Chueyr
 
Unifying Frontend and Backend Development with Scala - ScalaCon 2021
Unifying Frontend and Backend Development with Scala - ScalaCon 2021Unifying Frontend and Backend Development with Scala - ScalaCon 2021
Unifying Frontend and Backend Development with Scala - ScalaCon 2021Taro L. Saito
 
A GitOps Kubernetes Native CICD Solution with Argo Events, Workflows, and CD
A GitOps Kubernetes Native CICD Solution with Argo Events, Workflows, and CDA GitOps Kubernetes Native CICD Solution with Argo Events, Workflows, and CD
A GitOps Kubernetes Native CICD Solution with Argo Events, Workflows, and CDJulian Mazzitelli
 
Designing flexible apps deployable to App Engine, Cloud Functions, or Cloud Run
Designing flexible apps deployable to App Engine, Cloud Functions, or Cloud RunDesigning flexible apps deployable to App Engine, Cloud Functions, or Cloud Run
Designing flexible apps deployable to App Engine, Cloud Functions, or Cloud Runwesley chun
 

Similar a Building an analytics workflow using Apache Airflow (20)

Scalable Clusters On Demand
Scalable Clusters On DemandScalable Clusters On Demand
Scalable Clusters On Demand
 
Kubernetes Forum Seoul 2019: Re-architecting Data Platform with Kubernetes
Kubernetes Forum Seoul 2019: Re-architecting Data Platform with KubernetesKubernetes Forum Seoul 2019: Re-architecting Data Platform with Kubernetes
Kubernetes Forum Seoul 2019: Re-architecting Data Platform with Kubernetes
 
From business requirements to working pipelines with apache airflow
From business requirements to working pipelines with apache airflowFrom business requirements to working pipelines with apache airflow
From business requirements to working pipelines with apache airflow
 
Architecting Analytic Pipelines on GCP - Chicago Cloud Conference 2020
Architecting Analytic Pipelines on GCP - Chicago Cloud Conference 2020Architecting Analytic Pipelines on GCP - Chicago Cloud Conference 2020
Architecting Analytic Pipelines on GCP - Chicago Cloud Conference 2020
 
Google Cloud Dataflow
Google Cloud DataflowGoogle Cloud Dataflow
Google Cloud Dataflow
 
Improving Apache Spark Downscaling
 Improving Apache Spark Downscaling Improving Apache Spark Downscaling
Improving Apache Spark Downscaling
 
Upcoming features in Airflow 2
Upcoming features in Airflow 2Upcoming features in Airflow 2
Upcoming features in Airflow 2
 
Sprint 121
Sprint 121Sprint 121
Sprint 121
 
Machine learning at scale with Google Cloud Platform
Machine learning at scale with Google Cloud PlatformMachine learning at scale with Google Cloud Platform
Machine learning at scale with Google Cloud Platform
 
Scaling 100PB Data Warehouse in Cloud
Scaling 100PB Data Warehouse in CloudScaling 100PB Data Warehouse in Cloud
Scaling 100PB Data Warehouse in Cloud
 
202107 - Orion introduction - COSCUP
202107 - Orion introduction - COSCUP202107 - Orion introduction - COSCUP
202107 - Orion introduction - COSCUP
 
What's coming in Airflow 2.0? - NYC Apache Airflow Meetup
What's coming in Airflow 2.0? - NYC Apache Airflow MeetupWhat's coming in Airflow 2.0? - NYC Apache Airflow Meetup
What's coming in Airflow 2.0? - NYC Apache Airflow Meetup
 
HPC on OpenStack
HPC on OpenStackHPC on OpenStack
HPC on OpenStack
 
Building Kick Ass Video Games for the Cloud
Building Kick Ass Video Games for the CloudBuilding Kick Ass Video Games for the Cloud
Building Kick Ass Video Games for the Cloud
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
Scaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesScaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on Kubernetes
 
Integrating ChatGPT with Apache Airflow
Integrating ChatGPT with Apache AirflowIntegrating ChatGPT with Apache Airflow
Integrating ChatGPT with Apache Airflow
 
Unifying Frontend and Backend Development with Scala - ScalaCon 2021
Unifying Frontend and Backend Development with Scala - ScalaCon 2021Unifying Frontend and Backend Development with Scala - ScalaCon 2021
Unifying Frontend and Backend Development with Scala - ScalaCon 2021
 
A GitOps Kubernetes Native CICD Solution with Argo Events, Workflows, and CD
A GitOps Kubernetes Native CICD Solution with Argo Events, Workflows, and CDA GitOps Kubernetes Native CICD Solution with Argo Events, Workflows, and CD
A GitOps Kubernetes Native CICD Solution with Argo Events, Workflows, and CD
 
Designing flexible apps deployable to App Engine, Cloud Functions, or Cloud Run
Designing flexible apps deployable to App Engine, Cloud Functions, or Cloud RunDesigning flexible apps deployable to App Engine, Cloud Functions, or Cloud Run
Designing flexible apps deployable to App Engine, Cloud Functions, or Cloud Run
 

Más de Yohei Onishi

Better parking experience with Automatic - Api Days San Francisco
Better parking experience with Automatic - Api Days San FranciscoBetter parking experience with Automatic - Api Days San Francisco
Better parking experience with Automatic - Api Days San FranciscoYohei Onishi
 
(日本人一人で)米国企業で働くために必要な3つのこと〜渡米後1ヶ月編〜
(日本人一人で)米国企業で働くために必要な3つのこと〜渡米後1ヶ月編〜(日本人一人で)米国企業で働くために必要な3つのこと〜渡米後1ヶ月編〜
(日本人一人で)米国企業で働くために必要な3つのこと〜渡米後1ヶ月編〜Yohei Onishi
 
誰かが言ってたけど人生はRPGのようだ
誰かが言ってたけど人生はRPGのようだ誰かが言ってたけど人生はRPGのようだ
誰かが言ってたけど人生はRPGのようだYohei Onishi
 
Test-Driven Development for [Embedded] C by James Grenning at Agile Japan 2013
Test-Driven Development for [Embedded] C by James Grenning at Agile Japan 2013Test-Driven Development for [Embedded] C by James Grenning at Agile Japan 2013
Test-Driven Development for [Embedded] C by James Grenning at Agile Japan 2013Yohei Onishi
 
ど根性駆動型コミュニティ開発
ど根性駆動型コミュニティ開発ど根性駆動型コミュニティ開発
ど根性駆動型コミュニティ開発Yohei Onishi
 
#tdd4ec is back!!〜テスト駆動開発による 組み込みプログラミングの集い〜
#tdd4ec is back!!〜テスト駆動開発による 組み込みプログラミングの集い〜#tdd4ec is back!!〜テスト駆動開発による 組み込みプログラミングの集い〜
#tdd4ec is back!!〜テスト駆動開発による 組み込みプログラミングの集い〜Yohei Onishi
 
自分のコミュニティを始めてみませんか?
自分のコミュニティを始めてみませんか?自分のコミュニティを始めてみませんか?
自分のコミュニティを始めてみませんか?Yohei Onishi
 
外乱光対策にまいまい式を使おう(ETロボコン2011東京連合第1回)
外乱光対策にまいまい式を使おう(ETロボコン2011東京連合第1回)外乱光対策にまいまい式を使おう(ETロボコン2011東京連合第1回)
外乱光対策にまいまい式を使おう(ETロボコン2011東京連合第1回)Yohei Onishi
 

Más de Yohei Onishi (8)

Better parking experience with Automatic - Api Days San Francisco
Better parking experience with Automatic - Api Days San FranciscoBetter parking experience with Automatic - Api Days San Francisco
Better parking experience with Automatic - Api Days San Francisco
 
(日本人一人で)米国企業で働くために必要な3つのこと〜渡米後1ヶ月編〜
(日本人一人で)米国企業で働くために必要な3つのこと〜渡米後1ヶ月編〜(日本人一人で)米国企業で働くために必要な3つのこと〜渡米後1ヶ月編〜
(日本人一人で)米国企業で働くために必要な3つのこと〜渡米後1ヶ月編〜
 
誰かが言ってたけど人生はRPGのようだ
誰かが言ってたけど人生はRPGのようだ誰かが言ってたけど人生はRPGのようだ
誰かが言ってたけど人生はRPGのようだ
 
Test-Driven Development for [Embedded] C by James Grenning at Agile Japan 2013
Test-Driven Development for [Embedded] C by James Grenning at Agile Japan 2013Test-Driven Development for [Embedded] C by James Grenning at Agile Japan 2013
Test-Driven Development for [Embedded] C by James Grenning at Agile Japan 2013
 
ど根性駆動型コミュニティ開発
ど根性駆動型コミュニティ開発ど根性駆動型コミュニティ開発
ど根性駆動型コミュニティ開発
 
#tdd4ec is back!!〜テスト駆動開発による 組み込みプログラミングの集い〜
#tdd4ec is back!!〜テスト駆動開発による 組み込みプログラミングの集い〜#tdd4ec is back!!〜テスト駆動開発による 組み込みプログラミングの集い〜
#tdd4ec is back!!〜テスト駆動開発による 組み込みプログラミングの集い〜
 
自分のコミュニティを始めてみませんか?
自分のコミュニティを始めてみませんか?自分のコミュニティを始めてみませんか?
自分のコミュニティを始めてみませんか?
 
外乱光対策にまいまい式を使おう(ETロボコン2011東京連合第1回)
外乱光対策にまいまい式を使おう(ETロボコン2011東京連合第1回)外乱光対策にまいまい式を使おう(ETロボコン2011東京連合第1回)
外乱光対策にまいまい式を使おう(ETロボコン2011東京連合第1回)
 

Último

presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century educationjfdjdjcjdnsjd
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...apidays
 
DBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDropbox
 
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...apidays
 
Elevate Developer Efficiency & build GenAI Application with Amazon Q​
Elevate Developer Efficiency & build GenAI Application with Amazon Q​Elevate Developer Efficiency & build GenAI Application with Amazon Q​
Elevate Developer Efficiency & build GenAI Application with Amazon Q​Bhuvaneswari Subramani
 
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...apidays
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfsudhanshuwaghmare1
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobeapidays
 
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWEREMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWERMadyBayot
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProduct Anonymous
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoffsammart93
 
Platformless Horizons for Digital Adaptability
Platformless Horizons for Digital AdaptabilityPlatformless Horizons for Digital Adaptability
Platformless Horizons for Digital AdaptabilityWSO2
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...DianaGray10
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FMESafe Software
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FMESafe Software
 
[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdfSandro Moreira
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAndrey Devyatkin
 
WSO2's API Vision: Unifying Control, Empowering Developers
WSO2's API Vision: Unifying Control, Empowering DevelopersWSO2's API Vision: Unifying Control, Empowering Developers
WSO2's API Vision: Unifying Control, Empowering DevelopersWSO2
 
CNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In PakistanCNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In Pakistandanishmna97
 

Último (20)

+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
 
DBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor Presentation
 
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
 
Elevate Developer Efficiency & build GenAI Application with Amazon Q​
Elevate Developer Efficiency & build GenAI Application with Amazon Q​Elevate Developer Efficiency & build GenAI Application with Amazon Q​
Elevate Developer Efficiency & build GenAI Application with Amazon Q​
 
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
 
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWEREMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
Platformless Horizons for Digital Adaptability
Platformless Horizons for Digital AdaptabilityPlatformless Horizons for Digital Adaptability
Platformless Horizons for Digital Adaptability
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of Terraform
 
WSO2's API Vision: Unifying Control, Empowering Developers
WSO2's API Vision: Unifying Control, Empowering DevelopersWSO2's API Vision: Unifying Control, Empowering Developers
WSO2's API Vision: Unifying Control, Empowering Developers
 
CNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In PakistanCNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In Pakistan
 

Building an analytics workflow using Apache Airflow

  • 1. Building an Analytics Workflow using Apache Airflow Yohei Onishi PyCon APAC 2019, Feb. 23-24 2019
  • 2. Presenter Profile ● Yohei Onishi ● Twitter: legoboku, Github: yohei1126 ● Data Engineer at a Japanese retail company ● Based in Singapore since Oct. 2018 ● Apache Airflow Contributor 2
  • 3. Session overview ● Expected audiences: Data engineers ○ who are working on building a pipleline ○ who are looking for a better workflow solution ● Goal: Provide the following so they can use Airflow ○ Airflow overview and how to author workflow ○ Server configuration and CI/CD in my usecase ○ Recommendations for new users (GCP Cloud Composer) 3
  • 4. Data pipeline data source collect ETL analytics data consumer micro services enterprise systems IoT devices object storage message queue micro services enterprise systems BI tool 4
  • 5. Our requirements for ETL worflow ● Already built a data lake on AWS S3 to store structured / unstructured data ● Want to build a batch based analytics platform ● Requirements ○ Workflow generation by code (Python) rather than GUI ○ OSS: avoid vendor lock in ○ Scalable: batch data processing and workflow ○ Simple and easily extensible ○ Workflow visualization 5
  • 6. Another workflow engine: Apache Nifi 6
  • 7. Airflow overview ● Brief history ○ Open sourced by Airbnb and Apache top project ○ Cloud Composer: managed Airflow on GCP ● Characteristics ○ Dynamic workflow generation by Python code ○ Easily extensible so you can fit it to your usecase ○ Scalable by using a message queue to orchestrate arbitrary number of workers 7
  • 8. Example: Copy a file from s3 bucket to another export records as CSV Singapore region US region EU region transfer it to a regional bucket 8 local region
  • 9. DEMO: UI and source code sample code: https://github.com/yohei1126/pycon-apac-2019-airflow-sample 9
  • 10. Concept: Directed acyclic graph, operator, task, etc custom_param_per_dag = {'sg': { ... }, 'eu': { ... }, 'us': { ... }} for region, v in custom_param_per_dag.items(): dag = DAG('shipment_{}'.format(region), ...) t1 = PostgresToS3Operator(task_id='db_to_s3', ...) t2 = S3CopyObjectOperator(task_id='s3_to_s3', ...) t1 >> t2 globals()[dag] = dag 10
  • 11. template t1 = PostgresToS3Operator( task_id='db_to_s3', sql="SELECT * FROM shipment WHERE region = '{{ params.region }}' AND ship_date = '{{ execution_date.strftime("%Y-%m-%d") }}'", bucket=default_args['source_bucket'], object_key='{{ params.region }}/{{ execution_date.strftime("%Y%m%d%H%M%S") }}.csv', params={'region':region}, dag=dag) 11
  • 12. Operator class PostgresToS3Operator(BaseOperator): template_fields = ('sql', 'bucket', 'object_key') def __init__(self, ..., *args, **kwargs): super(PostgresToS3Operator, self).__init__(*args, **kwargs) ... def execute(self, context): ... 12
  • 13. HA Airflow cluster executor (1..N) worker node (1) executor (1..N) worker node (2) executor (1..N) worker node (1) ... scheduler master node (1) web server master node (2) web server LB admin Airflow metadata DBCelery result backend message broker 13 http://site.clairvoyantsoft.com/setting-apache-airflow-cluster/
  • 14. CI/CD pipeline AWS SNS AWS SQS Github repo raise / merge a PR Airflow worker polling run Ansible script git pull test deployment 14
  • 15. Monitoring Airflow worker (EC2) AWS CloudWatch notify an error if DAG fails using Airflow slack webhook notify an error if a CloudWatch Alarm is triggered slack webhook 15
  • 16. GCP Cloud Composer ● Fully managed Airflow cluster provided by GCP ○ Fully managed ○ Built in integrated with the other GCP services ● To focus on business logic, you should build Airflow cluster using GCP composer 16
  • 17. Create a cluster using CLI $ gcloud composer environments create ENVIRONMENT_NAME --location LOCATION OTHER_ARGUMENTS ● New Airflow cluster will be deployed as Kubenetes cluster on GKE ● We usually specify the following options as OTHER_ARGUMENTS ○ infra: instance type, disk size, VPC network, etc. ○ software configuration: Python version, Airflow version, etc. 17
  • 18. Deploy your source code to the cluster $ gcloud composer environments storage dags import --environment my-environment --location us-central1 --source test-dags/quickstart.py ● This will upload your source code to cluster specific GCS bucket. ○ You can also directly upload your file to the bucket ● Then the file will be automatically deployed 18
  • 19. monitoring cluster using Stackdriver 19
  • 20. Demo: GCP Cloud Composer ● Create an environment ● Stackdriver logging ● GKE as backend 20
  • 21. Summary ● Data Engineers have to build reliable and scalable data pipeline to accelate data analytics activities ● Airflow is great tool to author and monitor workflow ● HA Airflow cluster is required for high availablity ● GCP Cloud Compose enables us to build a cluster easily and focus on business logic 21
  • 22. References ● Apache Airflow ● GCP Cloud Composer ● Airflow: a workflow management platform ● ETL best practices in Airflow 1.8 ● Data Science for Startups: Data Pipelines ● Airflow: Tips, Tricks, and Pitfalls 22