
Data Lineage with Apache Airflow using Marquez


The term data quality describes the dependability, reliability, and usability of datasets. Data scientists and business analysts often judge the quality of a dataset by its trustworthiness and completeness. But what information is needed to differentiate useful data from noisy data? How quickly can data quality issues be identified and explored? More importantly, how can metadata enable data scientists to make better sense of the high volume of data within their organization, drawn from a variety of data sources?

With Airflow now ubiquitous for DAG orchestration, organizations increasingly depend on Airflow to manage complex inter-DAG dependencies and provide up-to-date runtime visibility into DAG execution. At WeWork, Airflow has quickly become an important component of our Data Platform, powering billing, space inventory, and more. But what effects (if any) would upstream DAGs have on downstream DAGs if dataset consumption were delayed? What alerting rules should be in place to notify downstream DAGs of possible upstream processing issues or failures?

At WeWork, we feel it’s critical that DAG metadata is collected, maintained, and shared across the organization. This investment in metadata enables:

● Data lineage
● Data governance
● Data discovery

In this talk, we introduce Marquez: an open source metadata service for the collection, aggregation, and visualization of a data ecosystem’s metadata. We will demonstrate how metadata management with Marquez helps maintain inter-DAG dependencies, catalog historical runs of DAGs, and minimize data quality issues.



  1. Data Lineage with Apache Airflow using Marquez (CRUNCH ‘19)
  2. Hey! I’m Willy Lulciuc, Software Engineer, Marquez Team, Data Platform (@wslulciuc)
  3. AGENDA
     01 Why metadata?
     02 Intro to Marquez
     03 Marquez + Airflow
     04 Demo
     05 Future work
  4. Today: Limited context
     DATA
     ● What is the data source?
     ● Does the data have a schema?
     ● Does the data have an owner?
     ● How often is the data updated?
  5. Data contextualization!
     DATA SOURCES: MYSQL, POSTGRESQL, REDSHIFT, SNOWFLAKE, KAFKA
     FORMATS: CSV, JSON, AVRO, PARQUET
  6. 01 Why metadata?
  7. Why manage and utilize metadata?
     Data lineage ● Add context to data
     Democratize ● Self-service data culture
     Data quality ● Build trust in data
  8. … creating a healthy data ecosystem
  9. A healthy data ecosystem
     Freedom ● Experiment ● Flexible ● Self-sufficient
     Self-service ● Discover ● Explore ● Global context
     Accountability ● Cost ● Trust
  10. WeWork Metadata
  11. Data Platform built around Marquez
      ● Integrations
        ○ Ingest (Kafka)
        ○ Storage (Iceberg / S3)
        ○ Compute (Flink, Airflow; Batch/ETL, Streaming)
  12. 02 Intro to Marquez
  13. http://cidrdb.org/cidr2017/papers/p111-hellerstein-cidr17.pdf
  14. Marquez: Data Lineage, Data Governance, Data Discovery
  15. Marquez: Design
      Metadata Service
      ● Centralized metadata management
        ○ Sources
        ○ Datasets
        ○ Jobs
      ● Modular framework
        ○ Data governance
        ○ Data lineage
        ○ Data discovery + exploration
      (Marquez Core, Lineage, Search; REST API; ETL, Batch, Stream)
  16. Integrations; APIs / Libraries
  17. Services (Core, Lineage, Search); Integrations; APIs / Libraries
  18. Services (Core, Lineage, Search); Integrations; APIs / Libraries; Storage (Metadata DB, Graph, Search)
  19. Marquez: Data model
      Source 1..* Dataset; Dataset 1..* DatasetVersion; Job 1..* JobVersion; JobVersion 1..* Run
      Source types: MYSQL, POSTGRESQL, REDSHIFT, SNOWFLAKE, KAFKA
      Job types: BATCH, STREAM, SERVICE
  20. Marquez: Data model
      Dataset types: DbTable, Filesystem, Stream
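The entity relationships on the data-model slides can be sketched as plain data classes. This is an illustrative reading of the diagram, not Marquez's actual schema:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class DatasetVersion:
    version: str

@dataclass
class Dataset:
    name: str                       # e.g. "wedata.room_bookings"
    source_name: str                # each Dataset belongs to one Source
    versions: List[DatasetVersion] = field(default_factory=list)

@dataclass
class Source:
    name: str                       # e.g. "analytics_db"
    type: str                       # MYSQL, POSTGRESQL, REDSHIFT, SNOWFLAKE, KAFKA
    datasets: List[Dataset] = field(default_factory=list)  # one Source, many Datasets

@dataclass
class Run:
    run_id: str

@dataclass
class JobVersion:
    version: str
    runs: List[Run] = field(default_factory=list)           # one JobVersion, many Runs

@dataclass
class Job:
    name: str
    type: str                       # BATCH, STREAM, SERVICE
    versions: List[JobVersion] = field(default_factory=list)

# Wiring the slide's example entities together:
source = Source(name="analytics_db", type="POSTGRESQL")
bookings = Dataset(name="wedata.room_bookings", source_name=source.name)
source.datasets.append(bookings)
job = Job(name="room_bookings_7_days", type="BATCH")
```

The one-to-many edges (a Source has many Datasets, a Job has many JobVersions, and so on) are what let Marquez answer version-level questions like the debugging example on the next slide.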
  21. Marquez: Data model (design benefits)
      ● Debugging
        ○ What job version(s) produced and consumed dataset version X?
      ● Backfilling
        ○ Full / incremental processing
  22. Marquez: Metadata collection
      How is metadata collected? (job + dataset metadata pushed to Marquez)
      ● Push-based metadata collection
      ● REST API
      ● Language-specific SDKs
        ○ Java
        ○ Python
        ○ Go
  23. Marquez: Metadata collection, 01 Source
      {
        "type": "POSTGRESQL",
        "name": "analytics_db",
        "connectionUrl": "jdbc:postgresql://we.db:5431/analytics",
        "description": "Contains tables such as office room bookings."
      }
  24. Marquez: Metadata collection, 02 Dataset
      {
        "type": "DB_TABLE",
        "name": "wedata.room_bookings",
        "physicalName": "wedata.room_bookings",
        "sourceName": "analytics_db",
        "namespace": "datascience",
        "columns": [...],
        "description": "All global room bookings for each office."
      }
  25. Marquez: Metadata collection, 03 Job
      {
        "type": "BATCH",
        "name": "room_bookings_7_days",
        "inputs": ["wedata.room_bookings"],
        "outputs": [],
        "location": "https://github.com/we/jobs/commit/124f6089...",
        "namespace": "datascience",
        "description": "Weekly email of room bookings occupancy patterns."
      }
  26. LINK SOURCE / LINK DATASET: the dataset references its source via sourceName; the job references its input datasets via inputs.
  27. OWNERSHIP: the namespace ("datascience") groups the dataset and job under their owning team.
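Since collection is push-based over a REST API, the three payloads above would be PUT to Marquez by the producing side. A minimal Python sketch, assuming a local Marquez instance; the endpoint paths in the comments follow Marquez's HTTP API conventions but should be checked against the docs for your version:

```python
import json

MARQUEZ_URL = "http://localhost:5000/api/v1"   # assumed default Marquez address
NAMESPACE = "datascience"

# 01 Source
source = {
    "type": "POSTGRESQL",
    "name": "analytics_db",
    "connectionUrl": "jdbc:postgresql://we.db:5431/analytics",
    "description": "Contains tables such as office room bookings.",
}

# 02 Dataset, linked to the source via sourceName
dataset = {
    "type": "DB_TABLE",
    "physicalName": "wedata.room_bookings",
    "sourceName": "analytics_db",
    "description": "All global room bookings for each office.",
}

# 03 Job, linked to its input datasets
job = {
    "type": "BATCH",
    "inputs": ["wedata.room_bookings"],
    "outputs": [],
    "location": "https://github.com/we/jobs/commit/124f6089...",
    "description": "Weekly email of room bookings occupancy patterns.",
}

# Each payload is PUT to its endpoint, e.g. with the requests library:
#   requests.put(f"{MARQUEZ_URL}/sources/analytics_db", json=source)
#   requests.put(f"{MARQUEZ_URL}/namespaces/{NAMESPACE}/datasets/wedata.room_bookings", json=dataset)
#   requests.put(f"{MARQUEZ_URL}/namespaces/{NAMESPACE}/jobs/room_bookings_7_days", json=job)
print(json.dumps(job, indent=2))
```

In practice the language SDKs (Java, Python, Go) wrap these calls, so jobs rarely construct the HTTP requests by hand.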
  28. Marquez: Metadata collection (DATASET, JOB, LINEAGE)
      01 Job v1
      {
        "type": "BATCH",
        "name": "room_bookings_7_days",
        "inputs": ["wedata.room_bookings"],
        "outputs": [],
        ...
      }
  29. 02 Job v2 adds an output:
      {
        "type": "BATCH",
        "name": "room_bookings_7_days",
        "inputs": ["wedata.room_bookings"],
        "outputs": ["wedata.room_bookings_aggs"],
        ...
      }
  30. 03 Job v3 adds an input:
      {
        "type": "BATCH",
        "name": "room_bookings_7_days",
        "inputs": ["wedata.room_bookings", "wedata.spaces"],
        "outputs": ["wedata.room_bookings_aggs"],
        ...
      }
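One way to read these three slides: each time the job's inputs, outputs, or code location change, a new job version is recorded, and each version contributes its own lineage edges. A minimal sketch of that idea; the hashing scheme here is illustrative only, not Marquez's actual versioning logic:

```python
import hashlib
from typing import List

def job_version(name: str, location: str,
                inputs: List[str], outputs: List[str]) -> str:
    """Derive a deterministic version ID from the job's defining fields.

    Illustrative only: this just shows why v1 -> v2 -> v3 appear as the
    job definition changes between runs.
    """
    payload = "|".join([name, location,
                        ",".join(sorted(inputs)),
                        ",".join(sorted(outputs))])
    return hashlib.sha256(payload.encode()).hexdigest()[:12]

loc = "https://github.com/we/jobs/commit/124f6089..."
v1 = job_version("room_bookings_7_days", loc,
                 ["wedata.room_bookings"], [])
v2 = job_version("room_bookings_7_days", loc,
                 ["wedata.room_bookings"], ["wedata.room_bookings_aggs"])
v3 = job_version("room_bookings_7_days", loc,
                 ["wedata.room_bookings", "wedata.spaces"],
                 ["wedata.room_bookings_aggs"])
assert len({v1, v2, v3}) == 3  # each definition change yields a new version
```

Keeping every version around is what enables the debugging and backfilling benefits from slide 21: you can always ask which job version read or wrote a given dataset version.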
  31. Marquez @ WeWork
  32. Marquez Deployment
      VERSION CONTROL -> CIRCLECI BUILD PIPELINE (BUILD, TEST, PUBLISH IMAGE) -> PUSH IMAGE to Quay.io
      DEPLOY STAGING (PULL IMAGE, HELM INSTALL) -> HOLD -> DEPLOY PRODUCTION (HELM INSTALL)
  33. 03 Airflow + Marquez
  34. Airflow
  35. Marquez: Airflow
      Airflow support for Marquez (DAGs report to the Marquez Lib.)
      ● Metadata
        ○ Task lifecycle
        ○ Task parameters
        ○ Task runs linked to versioned code
        ○ Task inputs / outputs
      ● Lineage
        ○ Track inter-DAG dependencies
      ● Built-in
        ○ SQL parser
        ○ Link to code builder (GitHub)
        ○ Metadata extractors
  36. Marquez: Airflow
      Airflow Operator -> Marquez Airflow Lib. Extractor -> Metadata
  37. Marquez: Airflow (example)
      airflow.operators.PostgresOperator -> marquez_airflow.extractors.PostgresExtractor
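The operator-to-extractor mapping above can be pictured as a simple registry: for each task, the library looks up an extractor class keyed on the operator type and asks it for metadata. A toy sketch; the PostgresOperator and PostgresExtractor names mirror the slide, but the classes and registry mechanics here are stand-ins, not the library's real implementation:

```python
# Toy sketch of operator -> extractor dispatch, as pictured on the slide.
class PostgresOperator:            # stand-in for airflow.operators.PostgresOperator
    def __init__(self, task_id: str, sql: str):
        self.task_id = task_id
        self.sql = sql

class PostgresExtractor:           # stand-in for marquez_airflow.extractors.PostgresExtractor
    def extract(self, task) -> dict:
        # A real extractor would parse task.sql (the built-in SQL parser)
        # to discover the task's input and output tables.
        return {"task_id": task.task_id, "sql": task.sql}

EXTRACTORS = {PostgresOperator: PostgresExtractor}

def extract_metadata(task) -> dict:
    extractor_cls = EXTRACTORS.get(type(task))
    if extractor_cls is None:
        return {"task_id": task.task_id}   # no extractor: minimal metadata only
    return extractor_cls().extract(task)

meta = extract_metadata(
    PostgresOperator("load_bookings", "SELECT * FROM wedata.room_bookings"))
```

The registry shape is what makes the library extensible: WeWork's custom operators (slide 41) can ship with matching custom extractors.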
  38. Marquez Airflow Lib.
      ● Open source!
      ● Enables global task-level metadata collection
      ● Extends Airflow’s DAG class
      room_bookings_7_days_dag.py:
      from marquez_airflow import DAG
      from airflow.operators.postgres_operator import PostgresOperator
      ...
  39. Marquez: Airflow
      Capturing task-level metadata in a nutshell:
      Airflow DAG -> Marquez Lib. integration -> Marquez REST API -> data model (Source, Dataset, DatasetVersion, Job, JobVersion, Run)
  40. Airflow @ WeWork
      ● Multiple Airflow instances
      ● Internally distribute our own Docker image (Airflow + Marquez Airflow Lib., via Quay.io)
      ● All DAGs are Marquez-enabled
        ○ Lineage across Airflow instances
        ○ DAG ownership enforcement
  41. The Data Platform team builds and pushes one image (AIRFLOW, MARQUEZ AIRFLOW LIB., CUSTOM OPERATORS, CUSTOM EXTRACTORS) to Quay.io; NAMESPACE 1, NAMESPACE 2, ... NAMESPACE N each pull the image and report metadata to MARQUEZ.
  42. Marquez: Data triggers
      Managing inter-DAG dependencies with data triggers
      ● Powered by data lineage
      ● Timely processing of data
        ○ No polling!
      ● Reduce manual handling of backfills
      ● Reduce production of bad data
        ○ Incomplete data
        ○ Low-quality data
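The data-trigger idea above can be sketched as: when a new dataset version lands, walk the lineage graph and kick off the jobs that consume that dataset, rather than having each downstream DAG poll for fresh data. Purely illustrative; the function and variable names are assumptions, not Marquez's API:

```python
from collections import defaultdict
from typing import Callable, Dict, List

# Lineage index: dataset name -> jobs that list it in their "inputs".
consumers: Dict[str, List[str]] = defaultdict(list)
consumers["wedata.room_bookings"] = ["room_bookings_7_days"]

triggered: List[str] = []

def run_job(job_name: str) -> None:
    triggered.append(job_name)      # stand-in for scheduling a DAG run

def on_new_dataset_version(dataset: str,
                           run: Callable[[str], None] = run_job) -> None:
    """Trigger every downstream consumer of `dataset` (no polling required)."""
    for job_name in consumers[dataset]:
        run(job_name)

# A new version of room_bookings arrives; its consumer is triggered at once.
on_new_dataset_version("wedata.room_bookings")
```

Because the trigger fires only when a complete new dataset version exists, downstream jobs avoid consuming partial data, which is the "reduce production of bad data" point above.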
  52. 04 Demo
  53. 05 Future work
  54. Marquez: Future work (roadmap)
      ● Short-term
        ○ Release marquez 0.5.0
        ○ Release marquez-airflow 0.2.0
        ○ Release early version of marquez-helm
        ○ Docs
      ● Long-term
        ○ Apache Iceberg metadata support
        ○ Search
        ○ Data triggers
  55. Thanks! <o/
  56. https://marquezproject.github.io/marquez
  57. github.com/MarquezProject / @MarquezProject
  58. Questions? (CRUNCH ‘19)
