SlideShare una empresa de Scribd logo
1 de 32
Descargar para leer sin conexión
© 2018
Apache Airflow at Dailymotion
Wednesday20th June 2018
Cédric
Hourcade
Germain
Tanguy
• Data Engineer
• Github: https://github.com/germaintanguy
• Linkedin: https://fr.linkedin.com/in/germain-tanguy
• Data Engineering Manager
• Github: https://github.com/hc
• Linkedin: https://www.linkedin.com/in/hourcade/
Agenda
• Data at Dailymotion
• Apache Airflow
• Airflow at Dailymotion
• Deployment
• Working on a DAG
• Example of a pipeline
Data at
Dailymotion
© 2018
Data at Dailymotion
the team
Data at Dailymotion
who
what
Data at Dailymotion
who
tools
Apache Airflow
© 2018
Apache Airflow
Success
Retry
Failed
• Workflowmanagement system
• Workflow is defined as a Directed Acyclic
Grah (DAG)
• Each DAG is composed by task
• Workflowis defined in Python
• Schedulable
What airflow is?
Apache Airflow
What airflow is
solving?
• Scheduling dependanttasks
• Visualizing pipeline
• Monitoring / Alerting
• Backfilling
What airflow is
not?
• Data “streaming” :
Tasks do not move data from one to the other
Workflow
Command Line Interface:
• airflow list_dags
• airflow list_tasks
• airflow trigger_dag
• airflow test
• airflow clear
• airflow backfill ...
Architecture
Celery Executor
Operator / Sensor
Operatorbuilt-in:
• PythonOperator
• BashOperator
• BigQueryToBigQueryOperator
• GcsDataFlowJavaOperator
• ShortCircuitOperator
• HTTPOperator
• ... Sensorbuilt-in:
• HivePartitionSensor
• BigQueryTableSensor
• DatadogSensor
• S3PrefixSensor
• SqlSensor
• ...
Airflow at
Dailymotion
© 2018
Deployment
Deployment at Dailymotion
Our first deployment:
the LocalExecutor
Deployment at Dailymotion
Stage Prod
Deployment at Dailymotion
Deployment at Dailymotion
~$ make run
[] starting worker...
[] starting scheduler...
[] Starting webserver...
** Airflow is ready!
airflow@7ecb6a08559c:~$ |
Dev
Airflow at
Dailymotion
© 2018
Working on a DAG
Working on a DAG
Main property : idempotence
Working on a DAG
Example on a simple case :
We want to load some files from
storage to BigQuery table on a
daily basis.
Before loading data to
production partition, we want to
verify its quality (example :
number of row more or less the
same as last week).
Working on a DAG
Working on
a new DAG
Working on a DAG
Working on
a new DAG
2
4
Working on a DAG
Airflow at
Dailymotion
© 2018
Example of a pipeline
Example of a pipeline
Presentation of one real use case : Dailymotion’s new Advanced Analytics
Goals :
• External : allowing partners to monitor theirs activities on Dailymotion, through a
serie of dashboards
• Internal : allowing internal business teams (ads op, finance, content..) to query
aggregated data via tableau
Challenges:
• Differents sources with different frequency of updated with
different timezone
• Backfill anything at anytime
• Monitor and alerting
• Solution approchable by differentdata friendly profession
Tools:
• Airflow
• BigQuery
• Exasol
• Tableau
Example of a pipeline
Presentation of one real use case : Dailymotion’s new Advanced Analytics
Workflow of a pipeline
DAG
Workflow of a pipeline
DAG
Workflow of a pipeline
DAG
Workflow of a pipeline
DAG
© 2017
Confidential
© 2018
Thank you
Cédric Hourcade et Germain Tanguy

Más contenido relacionado

La actualidad más candente

La actualidad más candente (20)

Airflow presentation
Airflow presentationAirflow presentation
Airflow presentation
 
Running Airflow Workflows as ETL Processes on Hadoop
Running Airflow Workflows as ETL Processes on HadoopRunning Airflow Workflows as ETL Processes on Hadoop
Running Airflow Workflows as ETL Processes on Hadoop
 
Airflow at WePay
Airflow at WePayAirflow at WePay
Airflow at WePay
 
Clearing Airflow Obstructions
Clearing Airflow ObstructionsClearing Airflow Obstructions
Clearing Airflow Obstructions
 
Building Better Data Pipelines using Apache Airflow
Building Better Data Pipelines using Apache AirflowBuilding Better Data Pipelines using Apache Airflow
Building Better Data Pipelines using Apache Airflow
 
(CMP310) Data Processing Pipelines Using Containers & Spot Instances
(CMP310) Data Processing Pipelines Using Containers & Spot Instances(CMP310) Data Processing Pipelines Using Containers & Spot Instances
(CMP310) Data Processing Pipelines Using Containers & Spot Instances
 
Apache Airflow (incubating) NL HUG Meetup 2016-07-19
Apache Airflow (incubating) NL HUG Meetup 2016-07-19Apache Airflow (incubating) NL HUG Meetup 2016-07-19
Apache Airflow (incubating) NL HUG Meetup 2016-07-19
 
From AWS Data Pipeline to Airflow - managing data pipelines in Nielsen Market...
From AWS Data Pipeline to Airflow - managing data pipelines in Nielsen Market...From AWS Data Pipeline to Airflow - managing data pipelines in Nielsen Market...
From AWS Data Pipeline to Airflow - managing data pipelines in Nielsen Market...
 
What is Spark
What is SparkWhat is Spark
What is Spark
 
Airflow 4 manager
Airflow 4 managerAirflow 4 manager
Airflow 4 manager
 
Stream Processing Live Traffic Data with Kafka Streams
Stream Processing Live Traffic Data with Kafka StreamsStream Processing Live Traffic Data with Kafka Streams
Stream Processing Live Traffic Data with Kafka Streams
 
InfluxDB 2.0 Client Libraries by Noah Crowley
InfluxDB 2.0 Client Libraries by Noah CrowleyInfluxDB 2.0 Client Libraries by Noah Crowley
InfluxDB 2.0 Client Libraries by Noah Crowley
 
InfluxDB and Grafana: An Introduction to Time-Based Data Storage and Visualiz...
InfluxDB and Grafana: An Introduction to Time-Based Data Storage and Visualiz...InfluxDB and Grafana: An Introduction to Time-Based Data Storage and Visualiz...
InfluxDB and Grafana: An Introduction to Time-Based Data Storage and Visualiz...
 
InfluxDB 2.0: Dashboarding 101 by David G. Simmons
InfluxDB 2.0: Dashboarding 101 by David G. SimmonsInfluxDB 2.0: Dashboarding 101 by David G. Simmons
InfluxDB 2.0: Dashboarding 101 by David G. Simmons
 
GCPUG.TW - GCP學習資源分享
GCPUG.TW - GCP學習資源分享GCPUG.TW - GCP學習資源分享
GCPUG.TW - GCP學習資源分享
 
Intro to InfluxDB 2.0 and Your First Flux Query by Sonia Gupta
Intro to InfluxDB 2.0 and Your First Flux Query by Sonia GuptaIntro to InfluxDB 2.0 and Your First Flux Query by Sonia Gupta
Intro to InfluxDB 2.0 and Your First Flux Query by Sonia Gupta
 
GDG London Workshop: Build GCP infrastructure with Terraform
GDG London Workshop: Build GCP infrastructure with Terraform GDG London Workshop: Build GCP infrastructure with Terraform
GDG London Workshop: Build GCP infrastructure with Terraform
 
Care and Feeding of Large Scale Graphite Installations - DevOpsDays Austin 2013
Care and Feeding of Large Scale Graphite Installations - DevOpsDays Austin 2013Care and Feeding of Large Scale Graphite Installations - DevOpsDays Austin 2013
Care and Feeding of Large Scale Graphite Installations - DevOpsDays Austin 2013
 
Building a Data Ingestion & Processing Pipeline with Spark & Airflow
Building a Data Ingestion & Processing Pipeline with Spark & AirflowBuilding a Data Ingestion & Processing Pipeline with Spark & Airflow
Building a Data Ingestion & Processing Pipeline with Spark & Airflow
 
A True Story About Database Orchestration
A True Story About Database OrchestrationA True Story About Database Orchestration
A True Story About Database Orchestration
 

Similar a Apache Airflow at Dailymotion

Similar a Apache Airflow at Dailymotion (20)

Capstone presentation
Capstone presentationCapstone presentation
Capstone presentation
 
Big Data Ingestion Using Hadoop - Capstone Presentation
Big Data Ingestion Using Hadoop - Capstone PresentationBig Data Ingestion Using Hadoop - Capstone Presentation
Big Data Ingestion Using Hadoop - Capstone Presentation
 
Building CI from scratch
Building CI from scratchBuilding CI from scratch
Building CI from scratch
 
Gimel and PayPal Notebooks @ TDWI Leadership Summit Orlando
Gimel and PayPal Notebooks @ TDWI Leadership Summit OrlandoGimel and PayPal Notebooks @ TDWI Leadership Summit Orlando
Gimel and PayPal Notebooks @ TDWI Leadership Summit Orlando
 
PayPal Notebooks at Jupytercon 2018
PayPal Notebooks at Jupytercon 2018PayPal Notebooks at Jupytercon 2018
PayPal Notebooks at Jupytercon 2018
 
Kubernetes as data platform
Kubernetes as data platformKubernetes as data platform
Kubernetes as data platform
 
DevOpsDaysRiga 2018: Eric Skoglund, Lars Albertsson - Kubernetes as data plat...
DevOpsDaysRiga 2018: Eric Skoglund, Lars Albertsson - Kubernetes as data plat...DevOpsDaysRiga 2018: Eric Skoglund, Lars Albertsson - Kubernetes as data plat...
DevOpsDaysRiga 2018: Eric Skoglund, Lars Albertsson - Kubernetes as data plat...
 
The journey of Moving from AWS ELK to GCP Data Pipeline
The journey of Moving from AWS ELK to GCP Data PipelineThe journey of Moving from AWS ELK to GCP Data Pipeline
The journey of Moving from AWS ELK to GCP Data Pipeline
 
DevOps KPIs as a Service: Daimler’s Solution
DevOps KPIs as a Service: Daimler’s SolutionDevOps KPIs as a Service: Daimler’s Solution
DevOps KPIs as a Service: Daimler’s Solution
 
Apigee Insights: Data & Context-Driven Actions
Apigee Insights: Data & Context-Driven ActionsApigee Insights: Data & Context-Driven Actions
Apigee Insights: Data & Context-Driven Actions
 
Webinar - Building Custom Extensions With AppDynamics
Webinar - Building Custom Extensions With AppDynamicsWebinar - Building Custom Extensions With AppDynamics
Webinar - Building Custom Extensions With AppDynamics
 
Stream processing for the practitioner: Blueprints for common stream processi...
Stream processing for the practitioner: Blueprints for common stream processi...Stream processing for the practitioner: Blueprints for common stream processi...
Stream processing for the practitioner: Blueprints for common stream processi...
 
Datadog APM Product Launch
Datadog APM Product LaunchDatadog APM Product Launch
Datadog APM Product Launch
 
Why Open Source Works for DevOps Monitoring
Why Open Source Works for DevOps MonitoringWhy Open Source Works for DevOps Monitoring
Why Open Source Works for DevOps Monitoring
 
Gimel at Dataworks Summit San Jose 2018
Gimel at Dataworks Summit San Jose 2018Gimel at Dataworks Summit San Jose 2018
Gimel at Dataworks Summit San Jose 2018
 
Dataworks | 2018-06-20 | Gimel data platform
Dataworks | 2018-06-20 | Gimel data platformDataworks | 2018-06-20 | Gimel data platform
Dataworks | 2018-06-20 | Gimel data platform
 
Xtending nintex workflow cloud w azure functions - xchange conference
Xtending nintex workflow cloud w azure functions - xchange conferenceXtending nintex workflow cloud w azure functions - xchange conference
Xtending nintex workflow cloud w azure functions - xchange conference
 
SAP Teched 2012 Session Tec3438 Automate IaaS SAP deployments
SAP Teched 2012 Session Tec3438 Automate IaaS SAP deploymentsSAP Teched 2012 Session Tec3438 Automate IaaS SAP deployments
SAP Teched 2012 Session Tec3438 Automate IaaS SAP deployments
 
Dell Technology World - CloudOps - Leveraging DevOps Principles and Practice...
Dell Technology  World - CloudOps - Leveraging DevOps Principles and Practice...Dell Technology  World - CloudOps - Leveraging DevOps Principles and Practice...
Dell Technology World - CloudOps - Leveraging DevOps Principles and Practice...
 
Getting started with GCP ( Google Cloud Platform)
Getting started with GCP ( Google  Cloud Platform)Getting started with GCP ( Google  Cloud Platform)
Getting started with GCP ( Google Cloud Platform)
 

Último

NO1 Best Kala Jadu Expert Specialist In Germany Kala Jadu Expert Specialist I...
NO1 Best Kala Jadu Expert Specialist In Germany Kala Jadu Expert Specialist I...NO1 Best Kala Jadu Expert Specialist In Germany Kala Jadu Expert Specialist I...
NO1 Best Kala Jadu Expert Specialist In Germany Kala Jadu Expert Specialist I...
Amil baba
 
Exploratory Data Analysis - Dilip S.pptx
Exploratory Data Analysis - Dilip S.pptxExploratory Data Analysis - Dilip S.pptx
Exploratory Data Analysis - Dilip S.pptx
DilipVasan
 
Fuzzy Sets decision making under information of uncertainty
Fuzzy Sets decision making under information of uncertaintyFuzzy Sets decision making under information of uncertainty
Fuzzy Sets decision making under information of uncertainty
RafigAliyev2
 
一比一原版(Monash毕业证书)莫纳什大学毕业证成绩单如何办理
一比一原版(Monash毕业证书)莫纳什大学毕业证成绩单如何办理一比一原版(Monash毕业证书)莫纳什大学毕业证成绩单如何办理
一比一原版(Monash毕业证书)莫纳什大学毕业证成绩单如何办理
pyhepag
 
一比一原版加利福尼亚大学尔湾分校毕业证成绩单如何办理
一比一原版加利福尼亚大学尔湾分校毕业证成绩单如何办理一比一原版加利福尼亚大学尔湾分校毕业证成绩单如何办理
一比一原版加利福尼亚大学尔湾分校毕业证成绩单如何办理
pyhepag
 
Data Analytics for Digital Marketing Lecture for Advanced Digital & Social Me...
Data Analytics for Digital Marketing Lecture for Advanced Digital & Social Me...Data Analytics for Digital Marketing Lecture for Advanced Digital & Social Me...
Data Analytics for Digital Marketing Lecture for Advanced Digital & Social Me...
Valters Lauzums
 
一比一原版纽卡斯尔大学毕业证成绩单如何办理
一比一原版纽卡斯尔大学毕业证成绩单如何办理一比一原版纽卡斯尔大学毕业证成绩单如何办理
一比一原版纽卡斯尔大学毕业证成绩单如何办理
cyebo
 

Último (20)

Supply chain analytics to combat the effects of Ukraine-Russia-conflict
Supply chain analytics to combat the effects of Ukraine-Russia-conflictSupply chain analytics to combat the effects of Ukraine-Russia-conflict
Supply chain analytics to combat the effects of Ukraine-Russia-conflict
 
Formulas dax para power bI de microsoft.pdf
Formulas dax para power bI de microsoft.pdfFormulas dax para power bI de microsoft.pdf
Formulas dax para power bI de microsoft.pdf
 
The Significance of Transliteration Enhancing
The Significance of Transliteration EnhancingThe Significance of Transliteration Enhancing
The Significance of Transliteration Enhancing
 
AI Imagen for data-storytelling Infographics.pdf
AI Imagen for data-storytelling Infographics.pdfAI Imagen for data-storytelling Infographics.pdf
AI Imagen for data-storytelling Infographics.pdf
 
NO1 Best Kala Jadu Expert Specialist In Germany Kala Jadu Expert Specialist I...
NO1 Best Kala Jadu Expert Specialist In Germany Kala Jadu Expert Specialist I...NO1 Best Kala Jadu Expert Specialist In Germany Kala Jadu Expert Specialist I...
NO1 Best Kala Jadu Expert Specialist In Germany Kala Jadu Expert Specialist I...
 
Exploratory Data Analysis - Dilip S.pptx
Exploratory Data Analysis - Dilip S.pptxExploratory Data Analysis - Dilip S.pptx
Exploratory Data Analysis - Dilip S.pptx
 
Pre-ProductionImproveddsfjgndflghtgg.pptx
Pre-ProductionImproveddsfjgndflghtgg.pptxPre-ProductionImproveddsfjgndflghtgg.pptx
Pre-ProductionImproveddsfjgndflghtgg.pptx
 
Fuzzy Sets decision making under information of uncertainty
Fuzzy Sets decision making under information of uncertaintyFuzzy Sets decision making under information of uncertainty
Fuzzy Sets decision making under information of uncertainty
 
Machine Learning for Accident Severity Prediction
Machine Learning for Accident Severity PredictionMachine Learning for Accident Severity Prediction
Machine Learning for Accident Severity Prediction
 
一比一原版(Monash毕业证书)莫纳什大学毕业证成绩单如何办理
一比一原版(Monash毕业证书)莫纳什大学毕业证成绩单如何办理一比一原版(Monash毕业证书)莫纳什大学毕业证成绩单如何办理
一比一原版(Monash毕业证书)莫纳什大学毕业证成绩单如何办理
 
Data Visualization Exploring and Explaining with Data 1st Edition by Camm sol...
Data Visualization Exploring and Explaining with Data 1st Edition by Camm sol...Data Visualization Exploring and Explaining with Data 1st Edition by Camm sol...
Data Visualization Exploring and Explaining with Data 1st Edition by Camm sol...
 
2024 Q1 Tableau User Group Leader Quarterly Call
2024 Q1 Tableau User Group Leader Quarterly Call2024 Q1 Tableau User Group Leader Quarterly Call
2024 Q1 Tableau User Group Leader Quarterly Call
 
Generative AI for Trailblazers_ Unlock the Future of AI.pdf
Generative AI for Trailblazers_ Unlock the Future of AI.pdfGenerative AI for Trailblazers_ Unlock the Future of AI.pdf
Generative AI for Trailblazers_ Unlock the Future of AI.pdf
 
Atlantic Grupa Case Study (Mintec Data AI)
Atlantic Grupa Case Study (Mintec Data AI)Atlantic Grupa Case Study (Mintec Data AI)
Atlantic Grupa Case Study (Mintec Data AI)
 
一比一原版加利福尼亚大学尔湾分校毕业证成绩单如何办理
一比一原版加利福尼亚大学尔湾分校毕业证成绩单如何办理一比一原版加利福尼亚大学尔湾分校毕业证成绩单如何办理
一比一原版加利福尼亚大学尔湾分校毕业证成绩单如何办理
 
Data Analytics for Digital Marketing Lecture for Advanced Digital & Social Me...
Data Analytics for Digital Marketing Lecture for Advanced Digital & Social Me...Data Analytics for Digital Marketing Lecture for Advanced Digital & Social Me...
Data Analytics for Digital Marketing Lecture for Advanced Digital & Social Me...
 
一比一原版纽卡斯尔大学毕业证成绩单如何办理
一比一原版纽卡斯尔大学毕业证成绩单如何办理一比一原版纽卡斯尔大学毕业证成绩单如何办理
一比一原版纽卡斯尔大学毕业证成绩单如何办理
 
Slip-and-fall Injuries: Top Workers' Comp Claims
Slip-and-fall Injuries: Top Workers' Comp ClaimsSlip-and-fall Injuries: Top Workers' Comp Claims
Slip-and-fall Injuries: Top Workers' Comp Claims
 
2024 Q2 Orange County (CA) Tableau User Group Meeting
2024 Q2 Orange County (CA) Tableau User Group Meeting2024 Q2 Orange County (CA) Tableau User Group Meeting
2024 Q2 Orange County (CA) Tableau User Group Meeting
 
社内勉強会資料  Mamba - A new era or ephemeral
社内勉強会資料   Mamba - A new era or ephemeral社内勉強会資料   Mamba - A new era or ephemeral
社内勉強会資料  Mamba - A new era or ephemeral
 

Apache Airflow at Dailymotion