This document discusses Apache Airflow and its use at Dailymotion. It provides an agenda that covers data at Dailymotion, Apache Airflow, how Airflow is used at Dailymotion, deployment of Airflow at Dailymotion, working on a DAG (directed acyclic graph) pipeline, and an example pipeline for Dailymotion's new Advanced Analytics project. The example pipeline aggregates data from different sources with varying frequencies and timezones into BigQuery and Exasol for visualization in Tableau.
10. Apache Airflow
What is Airflow solving?
• Scheduling dependent tasks
• Visualizing pipelines
• Monitoring / alerting
• Backfilling
What is Airflow not?
• Data “streaming”: tasks do not move data from one to the other
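To illustrate what "scheduling dependent tasks" means, here is a minimal sketch in plain Python. It is not Airflow's scheduler and does not use Airflow's API; it only shows how a dependency graph determines a valid run order. The task names are hypothetical.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical task names; each key maps to the set of tasks it depends on.
dependencies = {
    "load_to_staging": {"wait_for_files"},
    "check_quality": {"load_to_staging"},
    "load_to_production": {"check_quality"},
}


def execution_order(deps):
    """Return an order in which every task runs after its dependencies."""
    return list(TopologicalSorter(deps).static_order())


order = execution_order(dependencies)
print(order)
```

Because the graph is acyclic, such an order always exists. A cycle would raise `graphlib.CycleError`, which is why pipelines are modelled as directed acyclic graphs.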
18. Deployment at Dailymotion
~$ make run
[] starting worker...
[] starting scheduler...
[] Starting webserver...
** Airflow is ready!
airflow@7ecb6a08559c:~$ |
Dev
21. Working on a DAG
Example on a simple case:
We want to load some files from storage into a BigQuery table on a daily basis.
Before loading the data into the production partition, we want to verify its quality (for example: the number of rows is more or less the same as last week).
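The quality check described above could look roughly like the sketch below. The function name and the 20% tolerance are assumptions made for illustration; the slides do not give the actual rule or threshold. In an Airflow DAG, a check like this would typically run as its own task between the staging load and the production load.

```python
def row_count_is_plausible(today_rows: int, last_week_rows: int,
                           tolerance: float = 0.2) -> bool:
    """Return True if today's row count is within `tolerance` of last week's.

    `tolerance` is a relative difference (0.2 means +/- 20%); the value is an
    assumption for this sketch, not a figure from the presentation.
    """
    if last_week_rows <= 0:
        # No usable reference point: treat the load as suspicious.
        return False
    relative_change = abs(today_rows - last_week_rows) / last_week_rows
    return relative_change <= tolerance


# Example: 1,050 rows today against 1,000 last week is a 5% change.
print(row_count_is_plausible(1050, 1000))
```

If the check fails, the pipeline can stop before the production partition is touched, leaving the previous data in place.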
26. Example of a pipeline
Presentation of one real use case: Dailymotion’s new Advanced Analytics
Goals:
• External: allowing partners to monitor their activity on Dailymotion through a series of dashboards
• Internal: allowing internal business teams (ad ops, finance, content...) to query aggregated data via Tableau
Challenges:
• Different sources, with different update frequencies and different timezones
• Backfill anything at any time
• Monitoring and alerting
• A solution approachable by people in different data-friendly professions
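The "backfill anything at any time" challenge is easier when each run is tied to a single day and can be repeated safely. The sketch below, in plain Python rather than Airflow's own backfill mechanism, re-runs a per-day job over a date range. The table name `analytics` and the function names are hypothetical; the `table$YYYYMMDD` form follows BigQuery's partition decorator syntax.

```python
from datetime import date, timedelta


def daily_partitions(start: date, end: date):
    """Yield every day from `start` to `end`, inclusive."""
    current = start
    while current <= end:
        yield current
        current += timedelta(days=1)


def backfill(start: date, end: date, run_for_day):
    """Re-run a per-day job for each day in the range and collect the results."""
    return [run_for_day(day) for day in daily_partitions(start, end)]


def rebuild_partition(day: date) -> str:
    # Hypothetical job: it only builds the name of the partition it would
    # overwrite; a real job would reload that partition's data.
    return f"analytics${day:%Y%m%d}"


rebuilt = backfill(date(2018, 1, 30), date(2018, 2, 2), rebuild_partition)
print(rebuilt)
```

Overwriting one partition per day keeps each run idempotent, so re-running a day replaces its data instead of duplicating it.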
Tools:
• Airflow
• BigQuery
• Exasol
• Tableau