Michelle Ufford of Netflix presented on their approach to data quality. They developed Quinto, a data quality service that implements a Write-Audit-Publish pattern for ETL jobs. It audits metrics after data is written to check for issues like row counts being too high/low. Configurable rules determine if issues warrant failing or warning on a job. Future work includes expanding metadata tracking and anomaly detection. The presentation emphasized building modular components over monolithic frameworks and only implementing quality checks where needed.
48. 20170612
Tips & Lessons Learned.
● Query-based solution may be “good enough” for many.
● Not all tables need quality coverage.
● One size rarely fits all tables.
● Build components, not “all-or-nothing” frameworks.
https://dataworkssummit.com/san-jose-2017/sessions/whoops-the-numbers-are-wrong-scaling-data-quality-netflix/
WHOOPS, THE NUMBERS ARE WRONG! SCALING DATA QUALITY @ NETFLIX
Netflix is a famously data-driven company. Data is used to make informed decisions on everything from content acquisition to content delivery, and everything in-between. As with any data-driven company, it’s critical that data used by the business is accurate. Or, at worst, that the business has visibility into potential quality issues as soon as they arise. But even in the most mature data warehouses, data quality can be hard. How can we ensure high quality in a cloud-based, internet-scale, modern big data warehouse employing a variety of data engineering technologies?
In this talk, Michelle Ufford will share how the Data Engineering & Analytics team at Netflix is doing exactly that. We’ll kick things off with a quick overview of Netflix’s analytics environment, then dig into details of our data quality solution. We’ll cover what worked, what didn’t work so well, and what we plan to work on next. We’ll conclude with some tips and lessons learned for ensuring data quality on big data.
DETAILS
This session is a (Intermediate) talk in our Data Processing and Warehousing track. It focuses on Apache Hadoop, Apache Hive, Apache Pig, Apache Spark and is geared towards Architect, Data Analyst, Developer / Engineer, Operations / IT audiences.