AIRFLOW
An Open Source Platform to Author and
Monitor Data Pipelines
WHAT IS THAT!?
A platform to monitor and control data pipelines
Pipelines are configured as code, allowing for dynamic pipeline generation
100% developed in Python
Easily define your own operators and executors, and extend the library
It’s all about DAGs (directed acyclic graphs)
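To make "pipelines as code" concrete, here is a minimal sketch that generates a pipeline in a loop and checks that it is a valid DAG. It uses only the Python standard library (`graphlib`, Python 3.9+), not Airflow's own API; the source names and the `build_pipeline` helper are hypothetical and exist only for illustration.

```python
from graphlib import TopologicalSorter

# Hypothetical source names; a real pipeline might read these from config.
SOURCES = ["orders", "customers", "invoices"]


def build_pipeline(sources):
    """Return a mapping of task name -> set of upstream task names."""
    dag = {"cleanup": set()}
    for name in sources:
        extract = f"extract_{name}"
        load = f"load_{name}"
        dag[extract] = {"cleanup"}  # every extract waits for cleanup
        dag[load] = {extract}       # each load waits for its own extract
    # The final task waits for every load task.
    dag["notify"] = {f"load_{name}" for name in sources}
    return dag


pipeline = build_pipeline(SOURCES)
# static_order() raises graphlib.CycleError if the graph contains a cycle.
execution_order = list(TopologicalSorter(pipeline).static_order())
print(execution_order)
```

Adding a fourth source to the list adds two more tasks without any other change, which is the practical benefit of generating the pipeline from code.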
WHY DO I NEED THAT?
• There are several critical processes to be maintained and monitored
• Different kinds of jobs run in different tools
• Jobs have dependencies and must run in a specific order
• A consistent notification method is needed
• Action must be taken in case things go wrong
VERY FLEXIBLE!
DAGs are made in code
Rich User Interface
Efficient CLI
Easily extensible
Allows communication between tasks
Backfill control
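As an illustration of what backfill control means, the sketch below re-runs a task for every daily interval in a date range, skipping intervals that already succeeded. It is a standard-library sketch of the idea, not Airflow's implementation; `backfill_dates`, `backfill`, and the `already_done` parameter are hypothetical names.

```python
from datetime import date, timedelta


def backfill_dates(start, end, step=timedelta(days=1)):
    """Yield each scheduled run date from start to end, inclusive."""
    current = start
    while current <= end:
        yield current
        current += step


def backfill(task, start, end, already_done=frozenset()):
    """Run task(run_date) for every date in the range that has not run yet."""
    ran = []
    for run_date in backfill_dates(start, end):
        if run_date in already_done:
            continue  # skip intervals that already succeeded
        task(run_date)
        ran.append(run_date)
    return ran


processed = []
ran = backfill(
    processed.append,
    date(2017, 1, 1),
    date(2017, 1, 5),
    already_done={date(2017, 1, 3)},
)
print(ran)
```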
ARCHITECTURE
Sequential Executor
• Runs one task at a time on one CPU core
• Not recommended for production
• Runs with SQLite
ARCHITECTURE
Local Executor
• Scales vertically
• Runs tasks in parallel processes on a single machine
• Suitable for production, usually when there are not many DAGs
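The idea of running tasks in parallel on one machine while still honoring dependencies can be sketched as follows. This is not Airflow's executor code: it is a standard-library illustration that uses a thread pool for brevity, and `run_parallel` and the task names are hypothetical.

```python
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait
from graphlib import TopologicalSorter


def run_parallel(dag, run_task, max_workers=4):
    """Run every task in dag, starting each one once its upstream tasks finish."""
    sorter = TopologicalSorter(dag)
    sorter.prepare()
    finished = []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        running = {}  # future -> task name
        while sorter.is_active():
            # Submit every task whose upstream tasks are all done.
            for name in sorter.get_ready():
                running[pool.submit(run_task, name)] = name
            done, _ = wait(list(running), return_when=FIRST_COMPLETED)
            for future in done:
                future.result()  # re-raise any exception from the task
                name = running.pop(future)
                finished.append(name)
                sorter.done(name)  # unblocks downstream tasks
    return finished


# Task name -> set of upstream task names.
dag = {
    "cleanup": set(),
    "etl_a": {"cleanup"},
    "etl_b": {"cleanup"},
    "notify": {"etl_a", "etl_b"},
}
log = []
order = run_parallel(dag, log.append)
print(order)
```

Here `etl_a` and `etl_b` may run at the same time, but neither starts before `cleanup` completes and `notify` always runs last.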
ARCHITECTURE
Celery Executor
• Scales horizontally across many nodes
• Each worker resides on its own node
• Requires Celery to manage nodes and Redis or RabbitMQ for communication
TECHNOLOGIES
User Interface: Flask, SQLAlchemy, d3.js and Highcharts
Templating: Jinja
Database: Usually Postgres or MySQL
Distributed Mode: Celery with RabbitMQ or Redis
USER INTERFACE
WHAT WE ARE DOING
OUR CASE
Airflow Webserver
Airflow Scheduler
PostgreSQL Database
Local Executor Architecture
OUR CASE
Using 100% IBM Bluemix
OUR CASE - PIPELINE
Database Cleanup
SSH Actions
Spark Jobs (ETLs)
Watson Explorer
Crawlers
Slack Notifications on Specific Channels
What we run with Airflow
PROS
• We are able to run tasks in parallel while ensuring dependencies are respected
• The whole process requires less time
• We have detailed graphical views for each of the tasks
• We get notifications from all steps of the flow in Slack
• All our flows are under version control on GitHub
• We are able to retry failed tasks after a predefined delay
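The retry behavior in the last point can be sketched in plain Python. In Airflow itself this is configured per task through the `retries` and `retry_delay` arguments; the `run_with_retries` helper and the `flaky` task below are hypothetical stand-ins that only illustrate the mechanism.

```python
import time


def run_with_retries(task, retries=3, retry_delay=0.0, sleep=time.sleep):
    """Call task(); on failure wait retry_delay seconds and try again.

    Makes at most retries + 1 attempts and re-raises the last error.
    """
    for attempt in range(retries + 1):
        try:
            return task()
        except Exception:
            if attempt == retries:
                raise  # out of retries: surface the failure
            sleep(retry_delay)


attempts = []


def flaky():
    """Hypothetical task that fails twice before succeeding."""
    attempts.append(1)
    if len(attempts) < 3:
        raise RuntimeError("transient failure")
    return "ok"


result = run_with_retries(flaky, retries=3, retry_delay=0.0)
print(result, len(attempts))
```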
CONS
• Lack of tutorials and detailed documentation
• Missing operators for some databases (we have to create our own)
• DAG synchronization across nodes is not handled by Airflow
• Not a good fit for those who don't like programming
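Writing your own operator for an unsupported database usually means subclassing a base operator and overriding its `execute` method. The sketch below shows that shape using a stand-in base class and the standard-library `sqlite3` module so it runs without Airflow installed; `SqliteQueryOperator` is a hypothetical name, and the `BaseOperator` here is not Airflow's real class.

```python
import sqlite3


class BaseOperator:
    """Stand-in base class for illustration; NOT Airflow's real BaseOperator."""

    def __init__(self, task_id):
        self.task_id = task_id

    def execute(self, context):
        raise NotImplementedError


class SqliteQueryOperator(BaseOperator):
    """Hypothetical operator that runs one SQL statement and returns its rows."""

    def __init__(self, task_id, database, sql):
        super().__init__(task_id)
        self.database = database
        self.sql = sql

    def execute(self, context):
        conn = sqlite3.connect(self.database)
        try:
            rows = conn.execute(self.sql).fetchall()
            conn.commit()
            return rows
        finally:
            conn.close()  # always release the connection


op = SqliteQueryOperator("count_rows", ":memory:", "SELECT 1 + 1")
rows = op.execute(context={})
print(rows)
```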
SOME LINKS
• My Airflow implementation using a Docker container - https://github.com/brunocfnba/docker-airflow
• Airflow official website - https://airflow.incubator.apache.org/
• Airflow GitHub - https://github.com/apache/incubator-airflow

Introducing Apache Airflow and how we are using it
