Marquez: A Metadata Service for Data Abstraction, Data Lineage, and Event-based Triggers

At WeWork, it's critical that we understand the complete context for all datasets. We also want to be able to explore dependencies between jobs and the datasets they produce and consume. To do this, WeWork needs metadata. In this talk I will focus on Marquez, a core service for the collection, aggregation, and visualization of a data ecosystem's metadata. Marquez maintains the provenance of how datasets are consumed and produced while providing global visibility into job runtime.

Published in: Engineering

  1. Data Platform. Marquez: A Metadata Service for Data Abstraction, Data Lineage, and Event-based Triggers. DataEngConf NYC ’18
  2. Hey! I’m Willy Lulciuc, Data Engineer, Marquez Team, Data Platform. @wslulciuc
  3. 01 Space, 02 Community, 03 Services
  4. 268,000 members globally, 287 physical locations, 72 cities, 23 countries
  5. Agenda: 01 Why metadata? 02 Room bookings pipeline (naïve) 03 Intro to Marquez 04 Room bookings pipeline (take 2) 05 Future work
  6. 01 Why metadata?
  7. Why manage and utilize metadata? Data lineage ● Add context to data. Democratize ● Self-service data culture. Data quality ● Build trust in data
  8. … creating a healthy data ecosystem
  9. A healthy data ecosystem. Freedom ● Experiment ● Flexible ● Self-sufficient. Accountability ● Cost ● Trust. Self-service ● Discover ● Explore ● Global context
  10. Let’s get booking!
  11. 01 Location + floor
  12. 01 Location + floor, 02 Open time slot
  13. 01 Location + floor, 02 Open time slot, 03 Duration
  14. 01 Location + floor, 02 Open time slot, 03 Duration, 04 Confirm
  15. Which location has the most bookings?
  16. Set[RoomBooking] LocationID
  17. 02 Room bookings pipeline (naïve)
  18. Example: Room bookings pipeline (naïve). Requirements ● Read room bookings ● Sum room bookings by location ● Write top location ● Run once an hour. Diagram labels: Start, Read, Sum, Write
  19. Example: Room bookings pipeline (naïve). Diagram labels: S3, Postgres, .csv, .csv
  20. Example: Room bookings pipeline (naïve). Diagram labels: S3, Postgres, .csv, .csv. Input rows (LOCATION, TS, ROOM): b940314,1541624285,2; b648485,1541501885,9; b648485,1541710685,4
  21. Example: Room bookings pipeline (naïve). Diagram labels: S3, Postgres, .csv, .csv. Input rows (LOCATION, TS, ROOM): b940314,1541624285,2; b648485,1541501885,9; b648485,1541710685,4. Output row (ID, LOCATION, TS, BOOKINGS): 1, b648485, 1541721600, 2
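The read, sum, and write-top-location steps from slide 18 can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the pipeline from the talk: the rows are the three sample rows from slide 20, read from an in-memory string rather than S3, and the function names `read_bookings` and `top_location` are made up for this example.

```python
import csv
import io
from collections import Counter

# Sample input copied from slide 20: LOCATION, TS, ROOM (no header row).
RAW_BOOKINGS = """\
b940314,1541624285,2
b648485,1541501885,9
b648485,1541710685,4
"""


def read_bookings(text):
    """Parse CSV text into (location, ts, room) tuples with integer ts and room."""
    reader = csv.reader(io.StringIO(text))
    return [(location, int(ts), int(room)) for location, ts, room in reader]


def top_location(bookings):
    """Count bookings per location and return the busiest (location, count) pair."""
    counts = Counter(location for location, _ts, _room in bookings)
    return counts.most_common(1)[0]


bookings = read_bookings(RAW_BOOKINGS)
location, total = top_location(bookings)
print(location, total)  # prints "b648485 2"
```

The result matches the output row on slide 21, where location b648485 has 2 bookings. Note that the sketch silently assumes the ROOM field is an integer, which is exactly the kind of unstated dependency the later slides run into.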
  22. Example: Room bookings pipeline (naïve). We’re live! Diagram labels: Job Scheduler, Room Bookings Workflow, Job, Upstream, Downstream, S3, Postgres, Archival, Top Locations
  23. Example: Room bookings pipeline (naïve). Problems ● What’s our job’s input dataset? ● Does the dataset have an owner? ● How often is the dataset updated?
  24. Example: Room bookings pipeline (naïve). Curses, our job’s failing … (diagram as on slide 22)
  25. Example: Room bookings pipeline (naïve). Oh, might be our input data! (diagram as on slide 22)
  26. Example: Room bookings pipeline (naïve). Room field is of type string, not int. Input rows (LOCATION, TS, ROOM): b648485,1541501885,9A; b940314,1541624285,2G; b648485,1541710685,4F
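The failure on slide 26 comes from an upstream type change: ROOM used to parse as an integer and now carries values such as 9A. A minimal sketch of a pre-flight check that reports the offending rows instead of crashing mid-job is shown below. The function name `find_bad_rooms` is hypothetical, and the rows are the ones shown on the slide.

```python
def find_bad_rooms(rows):
    """Return the rows whose ROOM field no longer parses as an integer."""
    bad = []
    for location, ts, room in rows:
        try:
            int(room)
        except ValueError:
            bad.append((location, ts, room))
    return bad


# Rows from slide 26, after the upstream change made ROOM a string.
changed_rows = [
    ("b648485", "1541501885", "9A"),
    ("b940314", "1541624285", "2G"),
    ("b648485", "1541710685", "4F"),
]

bad_rows = find_bad_rooms(changed_rows)
print(len(bad_rows))  # prints 3
```

A check like this only detects the change after it has shipped. The talk's point is that a shared schema record would let the job's owners learn about the change before it breaks them.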
  27. Example: Room bookings pipeline (naïve). Problems ● What’s our job’s input dataset? ● Does the dataset have an owner? ● How often is the dataset updated? ● Coordinate changes
  28. Example: Room bookings pipeline (naïve). Ugh, gaps in output data (diagram as on slide 22)
  29. Example: Room bookings pipeline (naïve). Backfills! Time partitions: 00h 01h 02h 03h 04h 05h 06h 07h 08h 09h; latest
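Working out a backfill by hand amounts to finding which hourly partitions are missing. The sketch below shows one way to do that, assuming partitions are identified by their hour. The date, the choice of missing hours (03h and 06h), and the function name `missing_partitions` are invented for illustration.

```python
from datetime import datetime, timedelta


def missing_partitions(present, start, end):
    """List the hourly partitions in [start, end] that are absent from `present`."""
    missing = []
    current = start
    while current <= end:
        if current not in present:
            missing.append(current)
        current += timedelta(hours=1)
    return missing


# Hypothetical day: partitions 00h to 09h are expected, 03h and 06h were never written.
day = datetime(2018, 11, 8)
present = {day + timedelta(hours=h) for h in range(10) if h not in (3, 6)}

gaps = missing_partitions(present, day, day + timedelta(hours=9))
print([gap.strftime("%Hh") for gap in gaps])  # prints ['03h', '06h']
```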
  30. Example: Room bookings pipeline (naïve). Problems ● What’s our job’s input dataset? ● Does the dataset have an owner? ● How often is the dataset updated? ● Coordinate changes ● Figure out backfills
  31. Example: Room bookings pipeline (naïve). What we have so far … Diagram labels: Job Scheduler, S3, Postgres, Room Bookings Workflow
  32. Example: Room bookings pipeline (naïve). What we have so far … (diagram as on slide 31). Problems ● What’s our job’s input dataset? ● Does the dataset have an owner? ● How often is the dataset updated? ● Coordinate changes ● Figure out backfills
  33. … writing a job shouldn’t be this hard!
  34. 03 Intro to Marquez
  35. Marquez: Design. Metadata Service ● Centralized metadata management ○ Jobs ○ Datasets ● Modular ○ Data discovery ○ Data health ○ Data triggers. Diagram labels: Clients (JVM), Clients (Python), REST API, Marquez, Search, Health, Triggers
  36. Marquez: Data discovery. Module: Search ● Unified search ● Documentation ○ Owner ○ Schema ○ Datasource
  37. Marquez: Data discovery. Search box: room bo. Tabs: All, Datasets, Tags. Results: Room Bookings (SF), created: jul. 8, 2018, All San Francisco room bookings; Room Booking Metrics (GLBL), created: feb. 15, 2010, Global room booking metrics. Datasource label: S3
  38. Marquez: Data health. Module: Health ● Owner ○ Team / project ● Schema ● Location ● Description ● Size ○ Growth over time ○ Number of records ● Lineage
  39. Data graph. Legend: Dataset, Job
  40. Lineage queries! Legend: Dataset, Job, Lineage
  41. Marquez: Data triggers. Module: Triggers ● Timely processing of data ○ No polling! ● Reduce manual handling of backfills ● Reduce production of bad data ○ Incomplete data ○ Low-quality data
  42. Upstream failure detection! Legend: Dataset, Job, Job failure
  43. Affected paths! Legend: Dataset, Job, Job failure
  44. Cascading triggers! Legend: Dataset, Job, Trigger
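The "affected paths" idea on slides 42 to 44 is a graph walk: given a failed job, follow lineage edges to everything downstream of it. The sketch below uses a plain dictionary as the data graph and a breadth-first traversal. The node names and the `affected_by` function are assumptions for illustration and do not reflect how Marquez stores or queries lineage.

```python
from collections import deque

# Hypothetical lineage graph: each node maps to the nodes directly downstream.
# Datasets and jobs alternate, as in the data graph on the slides.
DOWNSTREAM = {
    "dataset:room_bookings": ["job:room_bookings_workflow"],
    "job:room_bookings_workflow": ["dataset:top_locations"],
    "dataset:top_locations": ["job:archival"],
    "job:archival": [],
}


def affected_by(failed_node, graph):
    """Breadth-first walk returning every node downstream of a failure, in visit order."""
    seen = []
    queue = deque(graph.get(failed_node, []))
    while queue:
        node = queue.popleft()
        if node in seen:
            continue  # already reached through another path
        seen.append(node)
        queue.extend(graph.get(node, []))
    return seen


print(affected_by("job:room_bookings_workflow", DOWNSTREAM))
# prints ['dataset:top_locations', 'job:archival']
```

The same walk over reversed edges would answer the upstream question ("which inputs could explain this failure?"), and a cascading trigger is the same traversal with "run this job" in place of "mark as affected".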
  45. Core concepts
  46. Marquez: Core concepts. Job + Datasets. Diagram labels: Input Dataset, Job, Output Dataset
  47. Marquez: Core concepts. Dataset versions! A dataset version contains a complete snapshot of data as of some point in time. Diagram labels: v1, v1, v2, v2, v3, Job
  48. Marquez: Core concepts. Deltas (“diffs”)! Diagram labels: v1, v1, v2, v2, v3, Job. INSERT INTO room_bookings (location, bookings) VALUES ('b648485', 2)
  49. Marquez: Core concepts. Deltas (“diffs”)! Diagram labels: v1, v1, v2, v2, v3, Job, Δv2→v3. INSERT INTO room_bookings (location, bookings) VALUES ('b648485', 2)
  50. Marquez: Core concepts. Job versions! A job version is created when business logic has changed. Diagram labels: v1, v1, v2, v2, v3, Job v1, Job v2
  51. Marquez: Core concepts. Job runs! Diagram labels: v1, v1, v2, v2, v3, Job v1, Job v2, New Run. Legend: Job, Dataset
  52. Marquez: Core concepts. Job runs! Diagram labels: v1, v1, v2, v2, v3, v4, Job v1, Job v2, New Run. Legend: Job, Dataset
  53. Marquez: Core concepts. Job runs! Diagram labels: v1, v1, v2, v2, v3, v4, Job v1, Job v2, New Run, Finish, Update. Legend: Job, Dataset
  54. Marquez: Core concepts. Data triggers! Diagram labels: v1, v1, v2, v2, v3, v4, Job v1, Job v2, Job v7, Job v10, New Run, Finish, Update, Trigger. Legend: Job, Dataset
  55. Marquez: Core concepts. Job failures! Diagram labels: v1, v1, v2, v2, v3, v4, Job v1, Job v2, New Run, Failure. Legend: Job, Dataset
  56. Marquez: Core concepts. Delayed datasets! Diagram labels: v1, v1, v2, v2, v3, v4, Job v1, Job v2, New Run, Failure, Delay. Legend: Job, Dataset
  57. Marquez: Core concepts. Design benefits ● Early upstream failure detection ● Debugging ○ What job version(s) produced / consumed dataset version X? ● Recoverability ○ Full / incremental processing ● Coordination
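The debugging question on slide 57 ("What job version(s) produced / consumed dataset version X?") becomes a simple lookup once every run records the dataset versions it read and wrote. The sketch below is an assumption-heavy illustration: the `RunRecord` type, the sample runs, and the `producers` and `consumers` helpers are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunRecord:
    job: str
    job_version: int
    inputs: tuple   # (dataset, version) pairs read by the run
    outputs: tuple  # (dataset, version) pairs written by the run


# Hypothetical run history for the room bookings job.
RUNS = [
    RunRecord("room_bookings_job", 1, (("room_bookings", 1),), (("top_locations", 1),)),
    RunRecord("room_bookings_job", 1, (("room_bookings", 2),), (("top_locations", 2),)),
    RunRecord("room_bookings_job", 2, (("room_bookings", 2),), (("top_locations", 3),)),
]


def producers(dataset, version, runs):
    """Job versions whose runs wrote the given dataset version."""
    return sorted({(r.job, r.job_version) for r in runs if (dataset, version) in r.outputs})


def consumers(dataset, version, runs):
    """Job versions whose runs read the given dataset version."""
    return sorted({(r.job, r.job_version) for r in runs if (dataset, version) in r.inputs})


print(producers("top_locations", 3, RUNS))  # prints [('room_bookings_job', 2)]
print(consumers("room_bookings", 2, RUNS))
# prints [('room_bookings_job', 1), ('room_bookings_job', 2)]
```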
  58. Data model
  59. Marquez: Data model. Entities: Job, JobVersion, JobRun, Dataset, DatasetVersion. Relationship multiplicities as labeled: * 1, * 1, * 1, 1 *, 1 *
  60. Marquez: Data model. Entities: Job, JobVersion, JobRun, Dataset, DatasetVersion. Relationship multiplicities as labeled: * 1, * 1, * 1, 1 *, 1 *. Datasource Types: DbTable, Filesystem, Stream
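The entities on slides 59 and 60 can be written down as plain Python dataclasses to make the one-to-many relationships concrete. The transcript does not say which multiplicity label belongs to which edge, so the nesting below (a job has many versions, a job version has many runs, a dataset has many versions) is an assumption, as are all field names and the initial run state "NEW". This is not the Marquez schema.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List


class DatasourceType(Enum):
    """Datasource types named on slide 60."""
    DB_TABLE = "DbTable"
    FILESYSTEM = "Filesystem"
    STREAM = "Stream"


@dataclass
class DatasetVersion:
    version: int


@dataclass
class Dataset:
    name: str
    datasource_type: DatasourceType
    versions: List[DatasetVersion] = field(default_factory=list)  # assumed 1 dataset : * versions


@dataclass
class JobRun:
    run_id: int
    state: str = "NEW"  # assumed initial state


@dataclass
class JobVersion:
    version: int
    runs: List[JobRun] = field(default_factory=list)  # assumed 1 job version : * runs


@dataclass
class Job:
    name: str
    versions: List[JobVersion] = field(default_factory=list)  # assumed 1 job : * versions


# Hypothetical instances for the room bookings example.
raw = Dataset("room_bookings", DatasourceType.FILESYSTEM, [DatasetVersion(1), DatasetVersion(2)])
job = Job("room_bookings_job", [JobVersion(1, [JobRun(1), JobRun(2)]), JobVersion(2)])
print(len(raw.versions), len(job.versions[0].runs))  # prints "2 2"
```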
  61. Metadata collection
  62. Marquez: Metadata collection. How is metadata collected? ● Marquez API ● Language-specific SDKs ○ Java ○ Python. Diagram labels: Job, record metadata, Marquez
  63. Marquez: Metadata collection. Workflow: Register Job ● Job version ● Inputs / outputs (logical names) ● Owner ● Description
  64. Marquez: Metadata collection. Workflow: Register Job ● Job version ● Inputs / outputs (logical names) ● Owner ● Description. Register Job Run
  65. Marquez: Metadata collection. Workflow: Register Job ● Job version ● Inputs / outputs (logical names) ● Owner ● Description. Register Job Run. Start ● Update job run state to STARTED. Complete ● Update job run state to COMPLETED
  66. Marquez: Metadata collection. Workflow: Register Job ● Job version ● Inputs / outputs (logical names) ● Owner ● Description. Register Job Run. Start ● Update job run state to STARTED. Complete ● Update job run state to COMPLETED. Register Job Run Outputs ● Outputs (physical locations)
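The workflow on slides 63 to 66 (register job, register job run, start, complete, register outputs) reads like a small state machine. The sketch below mimics it with an in-memory class. It is explicitly not the Marquez API or either SDK: the class `InMemoryMetadataStore`, its method names, the initial state "NEW", and the Postgres URL are hypothetical stand-ins chosen for this example. Only the states STARTED and COMPLETED come from the slides.

```python
class InMemoryMetadataStore:
    """Hypothetical stand-in for a metadata service; not the Marquez API."""

    def __init__(self):
        self.jobs = {}
        self.runs = {}

    def register_job(self, name, version, inputs, outputs, owner, description):
        """Record the job version, logical inputs / outputs, owner, and description."""
        self.jobs[name] = {
            "version": version,
            "inputs": list(inputs),
            "outputs": list(outputs),
            "owner": owner,
            "description": description,
        }

    def register_job_run(self, job_name):
        """Create a run for a registered job and return its id."""
        if job_name not in self.jobs:
            raise KeyError(f"unknown job: {job_name}")
        run_id = len(self.runs) + 1
        self.runs[run_id] = {"job": job_name, "state": "NEW", "outputs": []}
        return run_id

    def _transition(self, run_id, expected, new_state):
        run = self.runs[run_id]
        if run["state"] != expected:
            raise ValueError(f"run {run_id} is {run['state']}, expected {expected}")
        run["state"] = new_state

    def start(self, run_id):
        self._transition(run_id, "NEW", "STARTED")

    def complete(self, run_id):
        self._transition(run_id, "STARTED", "COMPLETED")

    def register_outputs(self, run_id, locations):
        """Attach the physical output locations to a run."""
        self.runs[run_id]["outputs"].extend(locations)


store = InMemoryMetadataStore()
store.register_job(
    "room_bookings_workflow",
    version=1,
    inputs=["room_bookings"],
    outputs=["top_locations"],
    owner="Data Engineering",
    description="Hourly top location by bookings",
)
run_id = store.register_job_run("room_bookings_workflow")
store.start(run_id)
store.complete(run_id)
store.register_outputs(run_id, ["postgres://example-host/analytics/top_locations"])
print(store.runs[run_id]["state"])  # prints "COMPLETED"
```

Guarding each transition means a run cannot be completed before it starts, which is one plausible way a service could notice jobs that crash without reporting back.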
  67. 04 Room bookings pipeline (take 2)
  68. Example: Room bookings pipeline (take 2). Recall, we are tasked with analyzing room booking trends …
  69. Example: Room bookings pipeline (take 2). Recall, we are tasked with analyzing room booking trends … Diagram labels: Scheduler, S3, Room Bookings Workflow, Job, Postgres, Top Locations
  70. Example: Room bookings pipeline (take 2). Problems ● What’s our job’s input dataset? ● Does the dataset have an owner? ● How often is the dataset updated? ● Coordinate changes ● Figure out backfills
  71. Enter Marquez
  72. Example: Room bookings pipeline (take 2). Search box: room bo. Tab: All. Results: Room Bookings (ALL), S3, created: feb. 15, 2010, All room bookings since beginning of time; Room Bookings (SF), S3, created: jul. 8, 2018, All San Francisco room bookings
  73. Example: Room bookings pipeline (take 2). Well, that was easy! Search box: room bo. Tab: All. Results: Room Bookings (ALL), S3, created: feb. 15, 2010, All room bookings since beginning of time; Room Bookings (SF), S3, created: jul. 8, 2018, All San Francisco room bookings
  74. Example: Room bookings pipeline (take 2). Room Bookings (ALL), S3, created: feb. 15, 2010, All room bookings since beginning of time. Info: Owner: Data Engineering; Location: s3://room_bookings/raw/; Schema: https://registry.wework.com/schemas/ids/1; Updated: Hourly; Description: All room bookings since beginning of time
  76. Example: Room bookings pipeline (take 2). Bonus! Room Bookings (ALL), S3, created: feb. 15, 2010, All room bookings since beginning of time. Info: Owner: Data Engineering; Location: s3://room_bookings/raw/; Schema: https://registry.wework.com/schemas/ids/1; Updated: Hourly; Description: All room bookings since beginning of time
  77. Example: Room bookings pipeline (take 2). Problems ● What’s our job’s input dataset? ● Does the dataset have an owner? ● How often is the dataset updated? ● Coordinate changes ● Figure out backfills
  81. Example: Room bookings pipeline (take 2). We also had to coordinate changes to our input data. Diagram labels: Scheduler, S3, Room Bookings Workflow, Job, Postgres, Top Locations
  82. Our view. Legend: Dataset, Job, Job failure. Label: Room bookings workflow
  83. Global view! Legend: Dataset, Job, Job failure. Labels: Room bookings workflow, Top locations dataset
  84. Example: Room bookings pipeline (take 2). Oh, version bumped! Room Bookings (ALL), S3, created: feb. 15, 2010, All room bookings since beginning of time. Info: Owner: Data Engineering; Location: s3://room_bookings/raw/; Schema: https://registry.wework.com/schemas/ids/2; Updated: Hourly; Description: All room bookings since beginning of time
  85. Patch, deploy, trigger! Legend: Dataset, Job, Trigger. Labels: Room bookings workflow, Top locations dataset
  86. Example: Room bookings pipeline (take 2). Problems ● What’s our job’s input dataset? ● Does the dataset have an owner? ● How often is the dataset updated? ● Coordinate changes ● Figure out backfills
  89. Recap ● Make it trivial to discover datasets ● Global context when debugging ● Easily handle backfills ○ Datasets as dependencies
  90. 05 Future work
  91. Marquez: Future work. WeWork + Marquez ● Data platform built around Marquez ● Internal integrations ○ Scheduling ○ Batching ○ Streaming
  92. Marquez: Future work. Roadmap ● Short-term ○ Release Marquez 0.1.0 ○ Docs ● Long-term ○ Marquez UI
  93. github.com/MarquezProject @MarquezProject
  94. Thanks!
  95. We’re hiring! contact: willy.lulciuc@wework.com
  96. Questions?
