5. How do Engineers Manage Complexity?
Encapsulate: create functions/classes/subsystems with clear APIs. This helps isolate complexity.
Integration Tests: ensure that the components interact correctly. This helps identify breaking changes.
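As a toy sketch of both ideas (all function and field names here are hypothetical, not from the deck), a small encapsulated component plus an integration-style test might look like:

```python
# Hypothetical sketch: two small components with clear APIs, plus an
# integration test that checks they interact correctly end to end.

def clean_user_ids(rows):
    """Drop rows whose 'user_id' is missing or empty (assumed schema)."""
    return [r for r in rows if r.get("user_id")]

def count_by_user(rows):
    """Count events per user id."""
    counts = {}
    for r in rows:
        counts[r["user_id"]] = counts.get(r["user_id"], 0) + 1
    return counts

# Integration test: a breaking change in either component fails this check.
rows = [{"user_id": "a"}, {"user_id": ""}, {"user_id": "a"}]
assert count_by_user(clean_user_ids(rows)) == {"a": 2}
```

The encapsulation boundary (each function takes and returns plain data) is what makes the integration test cheap to write.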
6. Data introduces a few complications
Pipelines take on many upstream dependencies.
Researcher use cases are frequently unknown and unanticipated by data providers.
Pushing requirements upstream to all producers is Sisyphean.
7. We are not talking about data pipeline tests
The data pipeline teams already ask questions like:
Are all rows that are produced stored?
• Counter fields to ensure no dropped rows
• Sentinel events to measure join fidelity
Are availability SLAs being met?
• Progressive server-client merging
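A counter-field check of the kind listed above might be sketched as follows (the function and the shape of the counter are hypothetical illustrations, not the deck's actual implementation):

```python
# Hypothetical sketch of a counter-field check: the producer reports how
# many rows it emitted; the consumer verifies nothing was dropped in transit.

def check_no_dropped_rows(produced_count, stored_rows):
    """Compare the producer's counter field against what actually landed."""
    stored = len(stored_rows)
    if stored != produced_count:
        raise AssertionError(
            f"dropped rows: produced {produced_count}, stored {stored}")
    return stored

# Counter field says 3 rows were produced; 3 rows were stored, so this passes.
assert check_no_dropped_rows(3, ["r1", "r2", "r3"]) == 3
```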
9. How do Data Scientists identify potential
errors?
Some follow-on fact is absurd…
… which leads to investigation…
… which finds a broader problem.
If [potential conclusion], then we must have 3 billion OneDrive users…
… because my user table doesn’t have a primary key…
… so I should aggregate by user.
11. What are your Assumptions?
If I conclude “Users who upload files to OneDrive are XXX% more likely to buy Office if they also sent mail through Mobile Outlook”, I’m making many silent assumptions:
Field: User Id
• Logged and PII-encrypted similarly in Outlook and OneDrive
• Correctly logging timestamp for Office purchase
• User Id isn’t empty or missing
Field: OneDrive activity
• Wasn’t automated traffic [identified by a certain flag]
Field: Email Activity
• Mobile client identifiers are correct
Field: All
• Any upstream changes to OneDrive, Office, or Exchange data have been communicated to pipeline owners
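Turning silent assumptions like these into explicit asserts might look like the sketch below. The field names (`user_id`, `is_automated`) are hypothetical stand-ins; in particular, `is_automated` is a placeholder for the deck's "[a certain flag]", whose real name the slides don't give.

```python
# Hypothetical sketch: make silent assumptions explicit before computing
# a conclusion from the joined data.

def assert_user_id_assumptions(events):
    """Fail loudly if the silent assumptions behind the analysis are false."""
    # Assumption: User Id isn't empty or missing.
    assert all(e.get("user_id") for e in events), "empty/missing user_id"
    # Assumption: automated traffic was filtered out
    # ('is_automated' is a placeholder for the real flag).
    assert not any(e.get("is_automated") for e in events), "bot traffic present"

events = [{"user_id": "u1", "is_automated": False}]
assert_user_id_assumptions(events)  # passes silently; a violation raises
```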
12. What are your Sanity Checks?
• If a column “OfficeId” is really a user id, it has certain known properties.
• Observation: these sorts of checks take place when the pipeline is set up, but they may not be re-checked very often.
Assumption: why does it matter?
• Never null/empty: nulls cause job-breaking data skew issues.
• Users are 1:* with Tenants: a logical constraint; if violated, a sign you are missing something.
• Very high cardinality: if this isn’t true, it’s unlikely that it’s a user id.
• All rows in event data join to it: otherwise, your data is incomplete.
• Matches a certain regex: sanity check; if this isn’t true, it’s unlikely that it’s a user id.
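A few of these sanity checks can be sketched in code. The function, the cardinality threshold, and the regex pattern below are all assumed placeholders (the deck only says "a certain regex"):

```python
import re

def sanity_check_user_id(values, row_count, id_pattern=r"^[0-9a-f]{16}$"):
    """Hypothetical sanity checks for a column claimed to be a user id."""
    # Never null/empty: nulls cause job-breaking data skew issues.
    assert all(v for v in values), "null/empty user id"
    # Very high cardinality: a real user id is nearly unique per user
    # (1% threshold is an illustrative choice, not from the deck).
    assert len(set(values)) > 0.01 * row_count, "cardinality too low"
    # Matches a known regex (16 hex chars here is an assumed placeholder).
    assert all(re.match(id_pattern, v) for v in values), "id format mismatch"
    return True

sanity_check_user_id(["0123456789abcdef", "fedcba9876543210"], row_count=2)
```

Re-running such checks on every pipeline execution, rather than only at setup, is the gap the slide points at.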
17. Data Asserts in Production: A Few Observations
• Most of the analysis-impacting assertion failures we’ve seen were actually errors in our assumptions, not errors in the pipeline.
• Good tests beget good code: we’ve had to modularize our code in order to produce testable chunks that get re-used across pipelines.
• Data Asserts is the backbone of data provenance: a data conclusion can directly link to all of the assumptions we made about the input.