Data Contracts: Consensus as Code - Pycon 2023

Ryan Collingwood
Business Analyst | Requirements Wrangler | Boundary Spanner | Continuously Learning
Data Contracts
Consensus as Code
Ryan Collingwood
2023-08-18
Who am I and my current context
• Ryan Collingwood, Head of Data & Analytics at Oroton
• Australia’s oldest luxury fashion company
• Centralised Data Team
• Monoliths (ERP & POS) surrounded by a number of SaaS products
• Data is mostly moved in batch
Why I think you might care about this
Responsibility in the modern data stack
Andrew Jones - Driving Data Quality with Data Contracts (2023)
Shout out to Andrew Jones
https://data-contracts.com/
Similar, Related, and Complementary Concepts
• APIs
• Data Dictionaries
• Data Mesh
• Event Storming
• Data Catalogs
• Domain Driven Design
I’d be curious to know what else you might add to this list
“Advice is a form of nostalgia. Dispensing it is a way of fishing the past from the disposal, wiping it off, painting over the ugly parts and recycling it for more than it's worth.”
“If I could offer you only one tip for the future, sunscreen would be it.”
Mary Schmich - https://www.chicagotribune.com/columns/chi-schmich-sunscreen-column-column.htm
What are Data Contracts?
... outlines how data can get exchanged between two parties. It defines the structure, format, and rules of exchange in a distributed data architecture. These formal agreements make sure that there aren’t any uncertainties or undocumented assumptions about data.
https://atlan.com/data-contracts/
... is an agreed interface between the generators of data and its consumers. It sets the expectations around that data, defines how it should be governed, and facilitates the explicit generation of quality data that meets the business requirements.
Andrew Jones - Driving Data Quality with Data Contracts (2023)
Data Producers and Data Consumers
[Diagram labels: Team A, Team B, Team C]
You can be a Data Producer without knowing about it
[Diagram labels: Non-consensual API, Team C]
Broken pipelines, broken non-promises
[Diagram labels: Team A, Team B, Team C; Non-consensual API (×3); ❌]
One of the largest impediments to addressing data quality at any organization is the lack of collaboration between data producers and data consumers.
...
A common workaround (is the) proliferation of non-consensual APIs. Can’t get a software engineer to emit the data you need to solve some business problem? Connect your ELT tool to a production source and extract a batch dump on a schedule. Easy (Until things start breaking…whoops).
Chad Sanderson - https://dataproducts.substack.com/p/the-production-grade-data-pipeline
What makes up a Data Contract
https://github.com/PacktPublishing/Driving-Data-Quality-with-Data-Contracts/blob/main/Chapter03/order_events.yaml
However, data contracts are more than just a schema... we need our data contracts to capture metadata that describes how the data can be used, how it is governed, and the controls around the data
Andrew Jones - Driving Data Quality with Data Contracts (2023)
What makes up a Data Contract
• Schema
• Contract Governance
• Semantics
• Service Level Objectives
• Dataset Governance
• Mechanisms of Transmission
• People
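As a minimal sketch of how these parts could sit together in one object: the class names (`FieldSpec`, `ServiceLevelObjective`, `DataContract`), their attributes, and the `order_events` values below are illustrative assumptions, not the structure used in the talk or in Andrew Jones's book. Standard-library dataclasses keep it dependency-free; Pydantic models would be a natural alternative.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class FieldSpec:
    """Schema (name, dtype) plus semantics (description, unit) for one field."""
    name: str
    dtype: str
    description: str
    unit: Optional[str] = None
    personal_data: bool = False  # governance flag (hypothetical)


@dataclass(frozen=True)
class ServiceLevelObjective:
    max_delay_hours: int
    completeness_pct: float


@dataclass(frozen=True)
class DataContract:
    dataset: str
    version: str
    owner: str                   # people: who answers questions about the data
    consumers: tuple[str, ...]   # people: who has registered as a consumer
    transmission: str            # mechanism of transmission
    slo: ServiceLevelObjective
    fields: tuple[FieldSpec, ...] = ()

    def personal_data_fields(self) -> list[str]:
        """Names of fields that need governance attention."""
        return [f.name for f in self.fields if f.personal_data]


# Example values are made up for illustration.
order_events = DataContract(
    dataset="order_events",
    version="0.1.0",
    owner="team-a@example.com",
    consumers=("team-c",),
    transmission="daily batch file",
    slo=ServiceLevelObjective(max_delay_hours=24, completeness_pct=99.0),
    fields=(
        FieldSpec("order_id", "string", "Unique identifier of the order"),
        FieldSpec("customer_email", "string", "Email supplied at checkout",
                  personal_data=True),
        FieldSpec("order_total", "decimal", "Order value including tax",
                  unit="AUD"),
    ),
)

print(order_events.personal_data_fields())
```

Keeping schema, semantics, SLOs, and people in a single definition means tooling can read any of them from one place.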
Schema versus Semantics
Schema
• Systems interoperability
• Support for implicit validation by database technologies
• Ensuring we capture and retrieve the data consistently
Semantics
• Human expectations
• Tends to require explicit validation by complementary solutions
• Ensuring we interpret the data consistently
Dates / times, monetary values - are a trap if considered only as schema.
What are your “schema” but “secretly semantic” situations?
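To make the date and money trap concrete, here is a small sketch in which a record passes a schema check and still fails the semantic ones. The field names (`ordered_at`, `order_total`, `currency`) and the two semantic rules are assumptions chosen for illustration.

```python
from datetime import datetime, timezone
from decimal import Decimal


def schema_ok(record: dict) -> bool:
    """Schema check: the expected keys hold the expected Python types."""
    return isinstance(record.get("ordered_at"), datetime) and isinstance(
        record.get("order_total"), Decimal
    )


def semantic_problems(record: dict) -> list[str]:
    """Semantic checks: can a human interpret the values unambiguously?"""
    problems = []
    ordered_at = record["ordered_at"]
    if ordered_at.tzinfo is None or ordered_at.utcoffset() is None:
        problems.append("ordered_at has no timezone, so the instant is ambiguous")
    if record.get("currency") is None:
        problems.append("order_total has no currency")
    return problems


naive = {
    "ordered_at": datetime(2023, 8, 18, 9, 30),
    "order_total": Decimal("129.50"),
}
aware = {
    "ordered_at": datetime(2023, 8, 18, 9, 30, tzinfo=timezone.utc),
    "order_total": Decimal("129.50"),
    "currency": "AUD",
}

for record in (naive, aware):
    print(schema_ok(record), semantic_problems(record))
```

Both records satisfy the schema; only the second one says which instant and which currency it means.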
Minimum Viable Data Contract Tooling
Andrew Jones - Driving Data Quality with Data Contracts (2023)
Meta-Data Powered Tooling
Andrew Jones - Driving Data Quality with Data Contracts (2023)
Data Quality Checks
Andrew Jones - Driving Data Quality with Data Contracts (2023)
Data Contract Tooling - My Context
[Diagram labels: Producer Boundaries; Semantics; Schema & SLOs; Checks and Tests]
Ok so how are we going to make this all happen?
Awesome humans who understand models, abstractions, constraints
You could even do it in ✨code✨
... and you should definitely version control it
Why Code? Why not Text?
● Entanglement of meaning and representation
● Finding References instead of text matches
● Enforcement of structure
● Refactoring
● Testable constraints
● More options for document generation
○ Including JSON and YAML
Although... I’ve been having a blast using Logseq (a graph-like outliner) and I might be crazy enough to give that a go as an IDE for this
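A small sketch of three of the points above (references instead of text matches, testable constraints, document generation). The `Term` and `Column` classes and the `daily_sales` example are hypothetical names invented for this illustration.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class Term:
    """A business term defined once and referenced by name elsewhere."""
    name: str
    definition: str


# Defined once: an editor's "find references" locates every use of NET_SALES,
# which a plain-text search for the words "net sales" cannot do reliably.
NET_SALES = Term("net_sales", "Sales after discounts and returns, excluding tax")
STORE = Term("store", "A physical or online point of sale")


@dataclass(frozen=True)
class Column:
    name: str
    dtype: str
    term: Term


daily_sales = [
    Column("store_id", "string", STORE),
    Column("net_sales_aud", "decimal", NET_SALES),
]


def columns_without_definition(columns: list[Column]) -> list[str]:
    """Testable constraint: every column must point at a defined term."""
    return [c.name for c in columns if not c.term.definition.strip()]


# Document generation: the same objects serialise straight to JSON.
document = json.dumps([asdict(c) for c in daily_sales], indent=2)
print(document)
print(columns_without_definition(daily_sales))
```

The JSON is an output of the definition rather than the definition itself, so a rename happens in one place and the generated documents follow.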
“Refactoring” Text
Expectation Reality
https://xkcd.com/208/
What was considered
● Scope & Allies
● Constraints & Guiding Principles
● People and Process Centric
● Contract Meta Schema
● Maximise Contribution Opportunities
Guiding Principles
● Primary Objective: Consensus
● Evolution
● Quick Feedback
● First Outcome: Data Tests
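One way "data tests as a first outcome" could look, as a sketch only: the contract fragment `ORDER_FIELDS`, its `(type, nullable)` layout, and the sample rows are assumptions, not the talk's actual meta schema.

```python
# Hypothetical contract fragment: field name -> (expected type, nullable)
ORDER_FIELDS: dict[str, tuple[type, bool]] = {
    "order_id": (str, False),
    "quantity": (int, False),
    "promo_code": (str, True),
}


def violations(rows: list[dict], fields: dict[str, tuple[type, bool]]) -> list[str]:
    """Return one message per value that breaks the contract."""
    found = []
    for i, row in enumerate(rows):
        for name, (expected, nullable) in fields.items():
            value = row.get(name)  # a missing key is treated as null
            if value is None:
                if not nullable:
                    found.append(f"row {i}: {name} must not be null")
            elif not isinstance(value, expected):
                found.append(
                    f"row {i}: {name} expected {expected.__name__}, "
                    f"got {type(value).__name__}"
                )
    return found


rows = [
    {"order_id": "A1", "quantity": 2, "promo_code": None},
    {"order_id": "A2", "quantity": "3", "promo_code": "WINTER"},
    {"order_id": None, "quantity": 1},
]

for message in violations(rows, ORDER_FIELDS):
    print(message)
```

Under Pytest, the same function could back a test such as `assert violations(batch, ORDER_FIELDS) == []`, which gives producers and consumers quick feedback from the agreed definition.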
Creating a Meta Model
● Focused around Events
● From UI to DB
● Schema and Semantics
● People
... still figuring it out
Don’t have to do it all at once!
The optimistic path to capturing and generating contracts
The Event Capture spreadsheet
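The slides do not show the spreadsheet's layout, so the following is only a guess at the general idea: rows captured by people in a sheet are grouped into one contract per event. The column names (`event`, `field`, `type`, `description`, `owner`) and the sample rows are invented for illustration, and the standard-library `csv` module stands in for whatever export the real sheet uses.

```python
import csv
import io

# Hypothetical export of an event capture sheet.
SHEET = """event,field,type,description,owner
order_placed,order_id,string,Unique identifier of the order,Team A
order_placed,order_total,decimal,Order value including tax in AUD,Team A
refund_issued,refund_id,string,Unique identifier of the refund,Team B
"""


def contracts_from_sheet(text: str) -> dict[str, dict]:
    """Group spreadsheet rows into one contract dictionary per event."""
    contracts: dict[str, dict] = {}
    for row in csv.DictReader(io.StringIO(text)):
        contract = contracts.setdefault(
            row["event"], {"owner": row["owner"], "fields": []}
        )
        contract["fields"].append(
            {
                "name": row["field"],
                "type": row["type"],
                "description": row["description"],
            }
        )
    return contracts


contracts = contracts_from_sheet(SHEET)
for event, contract in contracts.items():
    print(event, contract["owner"], [f["name"] for f in contract["fields"]])
```

A spreadsheet keeps the barrier to contribution low, while the generated structure is what tooling and tests consume.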
Who’s Going to Do The Work?
Andrew Jones - Driving Data Quality with Data Contracts (2023)
[Diagram labels: Probably these people; Hopefully these people]
Why Python?
● Gradual Typing*
● Static Analysis
● Well understood within the team
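A sketch of what gradual typing plus static analysis can offer here, assuming Mypy as the checker: `typing.NewType` gives semantic names to values that share a schema type. The `AUD` and `USD` types and both functions are hypothetical examples.

```python
from decimal import Decimal
from typing import NewType

# Same schema type (Decimal), different meaning.
AUD = NewType("AUD", Decimal)
USD = NewType("USD", Decimal)


def add_gst(amount: AUD) -> AUD:
    """Add 10% Australian GST to an AUD amount."""
    return AUD(amount * Decimal("1.10"))


def untyped_helper(amount):
    # No annotations: by default Mypy does not check this body, so typing
    # can be introduced gradually, one function at a time.
    return amount * 2


price = AUD(Decimal("100.00"))
print(add_gst(price))

# A type checker such as Mypy flags the next line as an incompatible
# argument type, although it would still execute at runtime:
# add_gst(USD(Decimal("100.00")))
```

`NewType` adds no runtime validation; the distinction exists only for the static checker, which is why it pairs with runtime data tests rather than replacing them.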
Helpful Python Libraries
● Pandas
● Pydantic
● Rope
● Pytest
● Mypy
● Black
Refactoring, doing variable extraction with Rope
https://colab.research.google.com/drive/1fHLit3hF2G0dFV0Xl11jnovcdPR87s-E
Code Refactoring - Other Libraries
• https://pybowler.io/ - doesn't have variable extraction, and has seen little development activity recently
• https://github.com/hchasestevens/astpath - useful for finding parts of the AST, but I'm not sure how to proceed from there; it seems to power a number of meta-programming libraries though
• traad - https://av.tib.eu/en/media/19947
Further explorations for wrangling generated code
• Abstract Syntax Tree - Options for querying
• Linting - Define my own rules as they apply to the meta schema
• Code duplication detection
• Network (Graph) Analysis
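For the AST querying and custom lint ideas, a minimal sketch with the standard-library `ast` module. The `GENERATED` source and the rule "every generated class needs a docstring" are assumptions made up for this example.

```python
import ast

# Stand-in for generated contract code (hypothetical).
GENERATED = '''
class OrderPlaced:
    """Raised when a customer completes checkout."""
    order_id: str
    order_total: float

class RefundIssued:
    refund_id: str
'''


def classes_missing_docstrings(source: str) -> list[str]:
    """Custom lint rule: report classes that carry no docstring."""
    tree = ast.parse(source)
    return [
        node.name
        for node in ast.walk(tree)
        if isinstance(node, ast.ClassDef) and ast.get_docstring(node) is None
    ]


def annotated_fields(source: str) -> dict[str, list[str]]:
    """Query: which annotated attributes does each class declare?"""
    result: dict[str, list[str]] = {}
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.ClassDef):
            result[node.name] = [
                stmt.target.id
                for stmt in node.body
                if isinstance(stmt, ast.AnnAssign)
                and isinstance(stmt.target, ast.Name)
            ]
    return result


print(classes_missing_docstrings(GENERATED))
print(annotated_fields(GENERATED))
```

Working on the syntax tree rather than on text means the rule keeps matching when the generated code is reformatted.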
Key Takeaways
• You can be a Data Producer without knowing about it; make it worthwhile for Consumers to “register” with you
• You can do this through having a contract which provides clarity and can be used to power tooling and generate artefacts
• Code is easier to refactor, find references in, and generally maintain than the alternatives
My References
• Andrew Jones - Driving Data Quality with Data Contracts (2023) - ISBN-13 978-1837635009
• Data Contracts: The Key to Scaling Distributed Data Architecture and Reducing Data Chaos - https://atlan.com/data-contracts/
• Chad Sanderson - The Production-Grade Data Pipeline - https://dataproducts.substack.com/p/the-production-grade-data-pipeline
• Chad Sanderson and Adrian Kreuziger - An Engineer's Guide to Data Contracts - https://mlops.community/an-engineers-guide-to-data-contracts-pt-1/
• Green Tree Snakes - the missing Python AST docs - https://greentreesnakes.readthedocs.io/en/latest/
• Rope - Refactoring Variable Extraction - https://rope.readthedocs.io/en/latest/library.html#performing-refactorings
Questions?
linkedin.com/in/ryancollingwood
mastodon.social/@ryancollingwood
twitter.com/ryancollingwood
www.meetup.com/en-AU/data-engineering-melbourne