Netflix is a famously data-driven company. Data is used to make informed decisions on everything from content acquisition to content delivery, and everything in-between. As with any data-driven company, it’s critical that data used by the business is accurate. Or, at worst, that the business has visibility into potential quality issues as soon as they arise. But even in the most mature data warehouses, data quality can be hard. How can we ensure high quality in a cloud-based, internet-scale, modern big data warehouse employing a variety of data engineering technologies?
In this talk, Michelle Ufford will share how the Data Engineering & Analytics team at Netflix is doing exactly that. We’ll kick things off with a quick overview of Netflix’s analytics environment, then dig into details of our data quality solution. We’ll cover what worked, what didn’t work so well, and what we plan to work on next. We’ll conclude with some tips and lessons learned for ensuring data quality on big data.
Tips & Lessons Learned.
● Query-based solution may be “good enough” for many.
● Not all tables need quality coverage.
● One size rarely fits all tables.
● Build components, not “all-or-nothing” frameworks.
https://dataworkssummit.com/san-jose-2017/sessions/whoops-the-numbers-are-wrong-scaling-data-quality-netflix/
WHOOPS, THE NUMBERS ARE WRONG! SCALING DATA QUALITY @ NETFLIX
DETAILS
This session is an intermediate-level talk in our Data Processing and Warehousing track. It focuses on Apache Hadoop, Apache Hive, Apache Pig, and Apache Spark, and is geared towards Architect, Data Analyst, Developer / Engineer, and Operations / IT audiences.
1500+ devices as of Q1 2017
The goal is to provide a behind-the-scenes look at how we’re approaching data quality (DQ).
We’re sharing ideas, not code – this is not an open-source announcement.
That’s cool but sounds like a lot of work
I need to:
● know what stats are available in Metacat
● know what quality templates exist
● figure out which templates I should use
● figure out a good configuration for each
● do everything we just walked through in WAP
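The WAP (Write-Audit-Publish) flow referenced above can be sketched in a few lines. This is a minimal illustration, not Netflix’s implementation: sqlite3 stands in for the warehouse, and the table names and audit rules are invented for the example.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE plays_staging (user_id INTEGER, title TEXT)")
con.execute("CREATE TABLE plays (user_id INTEGER, title TEXT)")

# 1. Write: land the new batch in a staging table, not production.
con.executemany("INSERT INTO plays_staging VALUES (?, ?)",
                [(1, "Stranger Things"), (2, "The Crown"), (3, "Narcos")])

# 2. Audit: run validation queries against the staged data only.
def audits_pass(con):
    rows = con.execute("SELECT COUNT(*) FROM plays_staging").fetchone()[0]
    nulls = con.execute(
        "SELECT COUNT(*) FROM plays_staging WHERE user_id IS NULL"
    ).fetchone()[0]
    return rows > 0 and nulls == 0

# 3. Publish: promote to the production table only if every audit passed.
if audits_pass(con):
    con.execute("INSERT INTO plays SELECT * FROM plays_staging")
    con.execute("DELETE FROM plays_staging")

published = con.execute("SELECT COUNT(*) FROM plays").fetchone()[0]
print(published)  # -> 3
```

The key property is that downstream consumers never see the data until the audits pass; a failed audit simply leaves the production table at its last good state.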
Takes ~5 minutes to enable WAP with 108 audits on a new Spark job.
Added 42 seconds to hourly processing time.
It’s the combination of these solutions that allows us to scale not only the processing time but the engineering time, too.
Stats
● Cardinality
● Histograms
● Map keys
RAD (Robust Anomaly Detection)
● High-cardinality dimensions
● Seasonality beyond week-over-week
● Atypical data distributions
● Reduce false positives
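As a rough illustration of why anomaly detection helps beyond fixed thresholds, here is a crude z-score check on historical row counts. This is a stand-in only, with invented numbers; RAD itself uses far more robust statistical techniques and handles seasonality, which this sketch does not.

```python
import statistics

def is_anomalous(history, latest, threshold=3.0):
    """Flag `latest` if it falls more than `threshold` standard
    deviations from the mean of `history` (a naive z-score test)."""
    mean = statistics.mean(history)
    stdev = statistics.stdev(history)
    if stdev == 0:
        return latest != mean
    return abs(latest - mean) / stdev > threshold

# Hourly row counts for the last seven runs (made-up data).
counts = [10120, 10340, 9980, 10210, 10050, 10400, 10150]

print(is_anomalous(counts, 10230))  # typical run -> False
print(is_anomalous(counts, 2500))   # sudden drop -> True
```

A static rule like “row_count > 5000” would also catch the drop, but a statistical check adapts as the table grows, which is what reduces false positives over time.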
Query-based solution
● Not efficient, but much less complicated to implement
● Good place to start
● Works well for small-to-medium datasets and/or nightly batch ETL
● Transient, experimental, single-user
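A query-based audit can be as simple as one SQL statement per check. The sketch below uses sqlite3 in place of the warehouse, with an invented table and threshold: it flags the latest partition if its row count deviates more than 25% from the same weekday last week.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE daily_counts (ds TEXT, row_count INTEGER)")
con.executemany("INSERT INTO daily_counts VALUES (?, ?)", [
    ("2017-06-05", 100000),  # same weekday, one week earlier
    ("2017-06-12", 63000),   # latest partition: a suspicious drop
])

# One audit = one query: compare the latest partition's row count
# against the same weekday last week.
ds, delta = con.execute("""
    SELECT cur.ds,
           ABS(cur.row_count - prev.row_count) * 1.0 / prev.row_count
    FROM daily_counts cur
    JOIN daily_counts prev ON prev.ds = date(cur.ds, '-7 days')
    WHERE cur.ds = '2017-06-12'
""").fetchone()

failed = delta > 0.25  # flag deviations beyond 25%
print(failed)  # -> True: a 37% drop trips the audit
```

Because each check is just a query, this approach rescans the data (hence “not efficient”), but it needs no extra infrastructure, which is why it’s a good place to start.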
Content vs. Streaming
2 main motivations:
Confidence
● Notify users when quality issues arise
● Only make data available after some basic validation
● Increase data consumers’ confidence that the data is good to use
Efficiency
● Catch issues faster
● Less business impact
● Much easier to simply not update downstream dependencies than to fix them after the fact