Data lineage and metadata with OpenLineage

Presentation on data lineage and observability with OpenLineage at the "Data and AI Summit" (formerly Spark Summit), with a focus on the Apache Spark integration for OpenLineage.

Data lineage and observability
with OpenLineage
Julien Le Dem, CTO and Co-Founder Datakin | May 2021

AGENDA
● The need for metadata
● OpenLineage - the open standard for lineage
collection - and Marquez, its reference
implementation
● Spark observability with OpenLineage

Building a healthy data ecosystem
Team A Team B
Team C

Today: Limited context
● What is the data source?
● What is the schema?
● Who is the owner?
● How often is it updated?
● Where is it coming from?
● Who is using the data?
● What has changed?
DATA

Maslow’s Data hierarchy of needs
New Business Opportunities
Business Optimization
Data Quality
Data Freshness
Data Availability

OpenLineage contributors
Creators and contributors from major open source projects involved

Purpose:
Define an open standard for metadata and lineage collection by instrumenting data pipelines as they are running.

Problem
Before:
● Duplication of effort: each project has to instrument all jobs
● Integrations are external and can break with new versions
With OpenLineage:
● Effort of integration is shared
● Integration can be pushed in each project: no need to play catch up

OpenLineage scope
[Diagram: what is in the OpenLineage scope versus not in scope. Labels shown: Integrations (Warehouse, Schedulers, ...), Metadata and lineage collection standard, Backend, HTTP client, Kafka client, GraphDB client, Kafka topic, Graph db, Consumers]

Core Model:
- JSONSchema spec
- Consistent naming:
Jobs:
scheduler.job.task
Datasets:
instance.schema.table
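
The naming convention above can be sketched as follows, assuming the parts are simply joined with dots. The helper names `job_name` and `dataset_name` and the sample values are hypothetical and are not taken from the OpenLineage spec.

```python
def job_name(scheduler: str, job: str, task: str) -> str:
    """Build a job name following the scheduler.job.task pattern."""
    return ".".join([scheduler, job, task])


def dataset_name(instance: str, schema: str, table: str) -> str:
    """Build a dataset name following the instance.schema.table pattern."""
    return ".".join([instance, schema, table])


# Hypothetical values, for illustration only.
print(job_name("airflow", "daily_orders", "load"))
print(dataset_name("warehouse01", "public", "orders"))
```

Using one pattern for every integration means the same job or dataset gets the same name no matter which tool reported it.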

Protocol:
- Asynchronous events:
Unique run id to identify a run and correlate its events
- Configurable backend:
- Kafka
- HTTP
Examples:
● Run Start event
○ source code version
○ run parameters
● Run Complete event
○ input dataset
○ output dataset version and schema
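
A minimal sketch of the two events above, built as plain dictionaries and correlated by a shared run id. The field names (`eventType`, `eventTime`, `run.runId`, `job`, `inputs`, `outputs`) follow my reading of the OpenLineage JSONSchema spec and should be checked against it; the namespace, job and dataset names are hypothetical, required fields such as the producer are omitted, and sending the events to Kafka or HTTP is left out.

```python
import json
import uuid
from datetime import datetime, timezone

NAMESPACE = "example-namespace"  # hypothetical namespace


def make_event(event_type, run_id, job, inputs=(), outputs=()):
    """Build one run event as a plain dict (sketch, not schema-validated)."""
    return {
        "eventType": event_type,
        "eventTime": datetime.now(timezone.utc).isoformat(),
        "run": {"runId": run_id},
        "job": {"namespace": NAMESPACE, "name": job},
        "inputs": [{"namespace": NAMESPACE, "name": n} for n in inputs],
        "outputs": [{"namespace": NAMESPACE, "name": n} for n in outputs],
    }


# The same run id ties the start and complete events together.
run_id = str(uuid.uuid4())
start = make_event("START", run_id, "scheduler.job.task")
complete = make_event(
    "COMPLETE",
    run_id,
    "scheduler.job.task",
    inputs=["instance.schema.source_table"],
    outputs=["instance.schema.target_table"],
)

for event in (start, complete):
    print(json.dumps(event))
```

Because the events are asynchronous, a consumer can receive them in any order and still group them by `runId`.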

Facets
● Extensible:
Facets are atomic pieces of metadata identified by a unique name that can be attached to the core entities.
● Decentralized:
Prefixes in facet names allow the definition of custom facets that can be promoted to the spec at a later point.
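
The facet idea above can be sketched as a dictionary of named metadata pieces attached to an entity. This is only an illustration: the helpers `facet_key` and `attach_facet`, the `myorg` prefix, the underscore separator and the facet payloads are all assumptions, not the spec's definitions.

```python
def facet_key(prefix: str, name: str) -> str:
    """Name a custom facet with a prefix (separator is an assumption)."""
    return f"{prefix}_{name}"


def attach_facet(entity: dict, key: str, payload: dict) -> dict:
    """Attach one facet to a core entity (dataset, job or run) under its name."""
    entity.setdefault("facets", {})[key] = payload
    return entity


dataset = {"name": "instance.schema.table"}

# A facet with an unprefixed name, standing in for one defined by the spec.
attach_facet(dataset, "schema", {"fields": [{"name": "id", "type": "BIGINT"}]})

# A hypothetical custom facet, kept apart from spec facets by its prefix.
attach_facet(dataset, facet_key("myorg", "retentionPolicy"), {"days": 30})

print(sorted(dataset["facets"]))
```

The prefix keeps custom facets from colliding with each other or with spec facets, which is what lets them be defined without central coordination.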

Metadata:
● Data Platform built around Marquez
● Integrations
○ Ingest
○ Storage
○ Compute
[Diagram: OpenLineage metadata across Ingest, Storage and Compute (Streaming, Batch/ML). Components shown: Flink, Airflow, Kafka, Iceberg / S3, BI]

Marquez: Data model
[Entity diagram: Source, Dataset, Dataset Version, Job, Job Version and Run, connected by one-to-many (1 to *) relationships]
Types listed on the slide:
● MYSQL
● POSTGRESQL
● REDSHIFT
● SNOWFLAKE
● KAFKA
● S3
● ICEBERG
● DELTALAKE
● BATCH
● STREAM
● SERVICE

Datakin leverages Marquez metadata
● OpenLineage and Marquez standardize metadata collection
○ Job runs
○ Parameters
○ Version
○ Inputs / outputs
● Datakin enables
○ Understanding operational dependencies
○ Impact analysis
○ Troubleshooting: What has changed since the last time it worked?
[Diagram labels: Integrations, API, Graph, Lineage analysis]

Spark java agent
spark.driver.extraJavaOptions:
-javaagent:marquez-spark-0.13.1.jar={argument}
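
A small sketch of how the setting above could be assembled for a `spark-submit` command line. The helper names and the application file `my_job.py` are hypothetical, and the agent argument is kept as the unfilled `{argument}` placeholder from the slide because its actual format is not given here.

```python
def java_agent_option(jar_path: str, argument: str) -> str:
    """Format the -javaagent JVM option for the driver."""
    return f"-javaagent:{jar_path}={argument}"


def spark_submit_args(jar_path: str, argument: str, app: str) -> list:
    """Build spark-submit arguments that set spark.driver.extraJavaOptions."""
    option = java_agent_option(jar_path, argument)
    return [
        "spark-submit",
        "--conf",
        f"spark.driver.extraJavaOptions={option}",
        app,
    ]


# Placeholder left unfilled on purpose, as on the slide.
AGENT_ARGUMENT = "{argument}"

args = spark_submit_args("marquez-spark-0.13.1.jar", AGENT_ARGUMENT, "my_job.py")
print(" ".join(args))
```

Since the agent is attached through a driver JVM option, the Spark job code itself does not need to change.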

Metadata collected
Lineage: inputs/outputs
Data volume: row count/byte size
Logical plan

Example of OpenLineage metadata usage:
Data volume evolution
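
One way to use collected row counts to follow data volume evolution is sketched below, assuming the row count of an output dataset is available for each run in chronological order. The function names, the 50% threshold and the sample counts are made up for illustration; this is not how any particular tool implements it.

```python
def volume_changes(row_counts):
    """Return the relative change in row count between consecutive runs."""
    changes = []
    for previous, current in zip(row_counts, row_counts[1:]):
        if previous == 0:
            changes.append(None)  # relative change is undefined from zero
        else:
            changes.append((current - previous) / previous)
    return changes


def flag_anomalies(row_counts, threshold=0.5):
    """Return the indexes of runs whose volume moved by more than threshold."""
    return [
        index + 1
        for index, change in enumerate(volume_changes(row_counts))
        if change is not None and abs(change) > threshold
    ]


# Hypothetical row counts for five consecutive runs of the same job.
counts = [1000, 1050, 1100, 400, 1120]
print(flag_anomalies(counts))
```

A sudden drop or jump flagged this way is a starting point for asking what changed upstream since the last normal run.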

Join the conversation
OpenLineage:
Github: github.com/OpenLineage
Slack: OpenLineage.slack.com
Twitter: @OpenLineage
Email: groups.google.com/g/openlineage
Marquez:
Github: github.com/MarquezProject/marquez
Slack: MarquezProject.slack.com
Twitter: @MarquezProject

3. The need for metadata

7. OpenLineage

10. Purpose: EXIF for data pipelines

16. Facet examples
Dataset:
- Stats
- Schema
- Version
- Column level lineage
Job:
- Source code
- Dependencies
- Params
- Source control
- Query plan
- Query profile
Run:
- Schedule time
- Batch id

21. Spark observability with OpenLineage

24. Lineage model

25. Lineage Example across jobs
