Se ha denunciado esta presentación.
Se está descargando tu SlideShare. ×

Scaling Data Quality @ Netflix

Ad

SlideShare ya no admite vídeos de YouTube

Ver el vídeo original en YouTube

Ad

WHOOPS, THE NUMBERS
ARE WRONG!
SCALING DATA QUALITY
@
MICHELLE UFFORD
DATA ENGINEERING & ANALYTICS, NETFLIX
HADOOP SUMMIT ...

Ad

Overview.

Ad

Ad

Ad

Ad

Ad

Ad

Ad

Ad

Ad

Ad

Ad

Ad

Ad

Ad

Ad

Ad

Ad

Ad

Ad

Ad

Ad

Ad

Ad

Ad

Ad

Ad

Ad

Ad

Ad

Ad

Ad

Ad

Ad

Ad

Ad

Ad

Ad

Ad

Ad

Ad

Ad

Ad

Ad

Ad

Ad

Ad

Ad

Eche un vistazo a continuación

1 de 50 Anuncio
1 de 50 Anuncio

Scaling Data Quality @ Netflix

Netflix is a famously data-driven company. Data is used to make informed decisions on everything from content acquisition to content delivery, and everything in-between. As with any data-driven company, it’s critical that data used by the business is accurate. Or, at worst, that the business has visibility into potential quality issues as soon as they arise. But even in the most mature data warehouses, data quality can be hard. How can we ensure high quality in a cloud-based, internet-scale, modern big data warehouse employing a variety of data engineering technologies?

In this talk, Michelle Ufford will share how the Data Engineering & Analytics team at Netflix is doing exactly that. We’ll kick things off with a quick overview of Netflix’s analytics environment, then dig into the architecture of our current data quality solution. We’ll cover what worked, what didn’t work so well, and what we're working on next. We’ll conclude with some tips & lessons learned for ensuring high quality on big data.

This talk was presented at DataWorks/Hadoop Summit 2017 on June 13, 2017.

Netflix is a famously data-driven company. Data is used to make informed decisions on everything from content acquisition to content delivery, and everything in-between. As with any data-driven company, it’s critical that data used by the business is accurate. Or, at worst, that the business has visibility into potential quality issues as soon as they arise. But even in the most mature data warehouses, data quality can be hard. How can we ensure high quality in a cloud-based, internet-scale, modern big data warehouse employing a variety of data engineering technologies?

In this talk, Michelle Ufford will share how the Data Engineering & Analytics team at Netflix is doing exactly that. We’ll kick things off with a quick overview of Netflix’s analytics environment, then dig into the architecture of our current data quality solution. We’ll cover what worked, what didn’t work so well, and what we're working on next. We’ll conclude with some tips & lessons learned for ensuring high quality on big data.

This talk was presented at DataWorks/Hadoop Summit 2017 on June 13, 2017.

Más Contenido Relacionado

Scaling Data Quality @ Netflix

  1. 1. WHOOPS, THE NUMBERS ARE WRONG! SCALING DATA QUALITY @ MICHELLE UFFORD DATA ENGINEERING & ANALYTICS, NETFLIX HADOOP SUMMIT 2017
  2. 2. Overview.
  3. 3. The business. 20170612 100+ million members $6 billion on content 125+ million hours watched launched in 1997 every. day. $
  4. 4. Anytime. Anywhere.* 20170612 * Well, almost anywhere.
  5. 5. Any device. 20170612
  6. 6. 300 terabyte DW writes 5 petabyte DW reads The data. 20170612 60+ petabyte data warehouse 700+ billion events written
  7. 7. 300 terabyte DW writes 5 petabyte DW reads The data. 20170612 60+ petabyte data warehouse 700+ billion events written
  8. 8. 20170612 data access AWS S3 Amazon Redshift data processing fast storage data viz METACA T data services events data operational data elastic storage Apache Pig Big Data Platform
  9. 9. Data Quality.
  10. 10. Federated metastore & extensible data catalog
  11. 11. 20170612 Metacat Federated Metastore s3://…/dw/fact_table_f/utc_date=20170101/batchid=1483229855 … s3://…/dw/fact_table_f/utc_date=20170611/batchid=1497226702 s3://…/dw/fact_table_f/utc_date=20170612/batchid=1497312541 dw.fact_table_f
  12. 12. 20170612 Metacat Federated Metastore s3://…/dw/fact_table_f/utc_date=20170101/batchid=1483229855 … s3://…/dw/fact_table_f/utc_date=20170611/batchid=1497226702 s3://…/dw/fact_table_f/utc_date=20170612/batchid=1497312541 dw.fact_table_f utc_date=20170101 utc_date=20170611 utc_date=20170612 …
  13. 13. 20170612 Metacat Federated Metastore utc_date=20170101
  14. 14. 20170612 Metacat Federated Metastore utc_date=20170101
  15. 15. 20170612 Extended table attributes ● primary key(s) ● column types ● lifecycle ● audience ● “valid-thru” timestamp ● … and much more Metacat Federated Metastore
  16. 16. Data Quality Service.
  17. 17. 20170612 Quinto Data Quality Service
  18. 18. 20170612 Quinto Data Quality Service
  19. 19. 20170612 Quinto Data Quality Service
  20. 20. 20170612 Quinto Data Quality Service
  21. 21. Write - Audit - Publish ETL pattern for high-quality big data jobs
  22. 22. 20170612 s3://…/utc_date=20170101/batchid=1483229855 … s3://…/utc_date=20170611/batchid=1497226702 dw.my_table_f WAP ETL Pattern
  23. 23. 20170612 s3://…/utc_date=20170101/batchid=1483229855 … s3://…/utc_date=20170611/batchid=1497226702 dw.my_table_f audit.my_table_f_1497312000 WAPStage-0: Prep ETL Pattern
  24. 24. 20170612 s3://…/utc_date=20170101/batchid=1483229855 … s3://…/utc_date=20170611/batchid=1497226702 s3://…/utc_date=20170612/batchid=1497312541 WAPStage-1: Write audit.my_table_f_1497312000dw.my_table_f ETL Pattern
  25. 25. 20170612 s3://…/utc_date=20170101/batchid=1483229855 … s3://…/utc_date=20170611/batchid=1497226702 WAPStage-1: Write audit.my_table_f_1497312000 $TABLE dw.my_table_f utc_date=20170612 com.netflix.dse.mds.metric.RowCount: 17240 com.netflix.dse.mds.metric.NullCount: 17240 ... ETL Pattern
  26. 26. 20170612 s3://…/utc_date=20170101/batchid=1483229855 … WAPStage-2: Audit audit.my_table_f_1497312000dw.my_table_f utc_date=20170611 com.netflix.dse.mds.metric.RowCount: 16135 com.netflix.dse.mds.metric.NullCount: 21 ... Quint o utc_date=20170612 com.netflix.dse.mds.metric.RowCount: 17240 com.netflix.dse.mds.metric.NullCount: 17240 ... ETL Pattern
  27. 27. 20170612 … WAPStage-2: Audit audit.my_table_f_1497312000dw.my_table_f Quint o metric eval behavior result -------------------------------------------------- RowCount >= zero fail job RowCount >= prior value fail job NullCount normal dist warn job Quinto configuration utc_date=20170612 com.netflix.dse.mds.metric.RowCount: 17240 com.netflix.dse.mds.metric.NullCount: 17240 ... utc_date=20170611 com.netflix.dse.mds.metric.RowCount: 16135 com.netflix.dse.mds.metric.NullCount: 21 ... ETL Pattern
  28. 28. 20170612 … WAPStage-2: Audit audit.my_table_f_1497312000dw.my_table_f Quint o metric eval behavior result -------------------------------------------------- RowCount >= zero fail job RowCount >= prior value fail job NullCount normal dist warn job utc_date=20170612 com.netflix.dse.mds.metric.RowCount: 17240 com.netflix.dse.mds.metric.NullCount: 17240 ... utc_date=20170611 com.netflix.dse.mds.metric.RowCount: 16135 com.netflix.dse.mds.metric.NullCount: 21 ... Quinto configuration ETL Pattern
  29. 29. 20170612 … WAPStage-2: Audit audit.my_table_f_1497312000dw.my_table_f Quint o metric eval behavior result -------------------------------------------------- RowCount >= zero fail job RowCount >= prior value fail job NullCount normal dist warn job utc_date=20170612 com.netflix.dse.mds.metric.RowCount: 17240 com.netflix.dse.mds.metric.NullCount: 17240 ... utc_date=20170611 com.netflix.dse.mds.metric.RowCount: 16135 com.netflix.dse.mds.metric.NullCount: 21 ... Quinto configuration ETL Pattern
  30. 30. 20170612 … WAPStage-2: Audit audit.my_table_f_1497312000dw.my_table_fmetric eval behavior result -------------------------------------------------- RowCount >= zero fail job pass RowCount >= prior value fail job NullCount normal dist warn job Quint o utc_date=20170612 com.netflix.dse.mds.metric.RowCount: 17240 com.netflix.dse.mds.metric.NullCount: 17240 ... utc_date=20170611 com.netflix.dse.mds.metric.RowCount: 16135 com.netflix.dse.mds.metric.NullCount: 21 ... Quinto configuration ETL Pattern
  31. 31. 20170612 … WAPStage-2: Audit audit.my_table_f_1497312000dw.my_table_fmetric eval behavior result -------------------------------------------------- RowCount >= zero fail job pass RowCount >= prior value fail job NullCount normal dist warn job Quint o utc_date=20170612 com.netflix.dse.mds.metric.RowCount: 17240 com.netflix.dse.mds.metric.NullCount: 17240 ... utc_date=20170611 com.netflix.dse.mds.metric.RowCount: 16135 com.netflix.dse.mds.metric.NullCount: 21 ... Quinto configuration ETL Pattern
  32. 32. 20170612 … WAPStage-2: Audit audit.my_table_f_1497312000dw.my_table_fmetric eval behavior result -------------------------------------------------- RowCount >= zero fail job pass RowCount >= prior value fail job NullCount normal dist warn job Quint o utc_date=20170612 com.netflix.dse.mds.metric.RowCount: 17240 com.netflix.dse.mds.metric.NullCount: 17240 ... utc_date=20170611 com.netflix.dse.mds.metric.RowCount: 16135 com.netflix.dse.mds.metric.NullCount: 21 ... Quinto configuration ETL Pattern
  33. 33. 20170612 … WAPStage-2: Audit audit.my_table_f_1497312000dw.my_table_fmetric eval behavior result -------------------------------------------------- RowCount >= zero fail job pass RowCount >= prior value fail job pass NullCount normal dist warn job Quint o utc_date=20170612 com.netflix.dse.mds.metric.RowCount: 17240 com.netflix.dse.mds.metric.NullCount: 17240 ... utc_date=20170611 com.netflix.dse.mds.metric.RowCount: 16135 com.netflix.dse.mds.metric.NullCount: 21 ... Quinto configuration ETL Pattern
  34. 34. 20170612 … WAPStage-2: Audit audit.my_table_f_1497312000dw.my_table_fmetric eval behavior result -------------------------------------------------- RowCount >= zero fail job pass RowCount >= prior value fail job pass NullCount normal dist warn job Quint o utc_date=20170612 com.netflix.dse.mds.metric.RowCount: 17240 com.netflix.dse.mds.metric.NullCount: 17240 ... utc_date=20170611 com.netflix.dse.mds.metric.RowCount: 16135 com.netflix.dse.mds.metric.NullCount: 21 ... Quinto configuration ETL Pattern
  35. 35. 20170612 … WAPStage-2: Audit audit.my_table_f_1497312000dw.my_table_fmetric eval behavior result -------------------------------------------------- RowCount >= zero fail job pass RowCount >= prior value fail job pass NullCount normal dist warn job Quint o utc_date=20170612 com.netflix.dse.mds.metric.RowCount: 17240 com.netflix.dse.mds.metric.NullCount: 17240 ... utc_date=20170611 com.netflix.dse.mds.metric.RowCount: 16135 com.netflix.dse.mds.metric.NullCount: 21 ... Quinto configuration ETL Pattern
  36. 36. 20170612 … WAPStage-2: Audit audit.my_table_f_1497312000dw.my_table_fmetric eval behavior result -------------------------------------------------- RowCount >= zero fail job pass RowCount >= prior value fail job pass NullCount normal dist warn job fail Quint o utc_date=20170612 com.netflix.dse.mds.metric.RowCount: 17240 com.netflix.dse.mds.metric.NullCount: 17240 ... utc_date=20170611 com.netflix.dse.mds.metric.RowCount: 16135 com.netflix.dse.mds.metric.NullCount: 21 ... Quinto configuration ETL Pattern
  37. 37. 20170612 … WAPStage-2: Audit audit.my_table_f_1497312000dw.my_table_fmetric eval behavior result -------------------------------------------------- RowCount >= zero fail job pass RowCount >= prior value fail job pass NullCount normal dist warn job fail Quint o utc_date=20170612 com.netflix.dse.mds.metric.RowCount: 17240 com.netflix.dse.mds.metric.NullCount: 17240 ... utc_date=20170611 com.netflix.dse.mds.metric.RowCount: 16135 com.netflix.dse.mds.metric.NullCount: 21 ... Quinto configuration ETL Pattern
  38. 38. 20170612 s3://…/utc_date=20170101/batchid=1483229855 … s3://…/utc_date=20170611/batchid=1497226702 WAPStage-3: Publish audit.my_table_f_1497312000dw.my_table_f s3://…/utc_date=20170612/batchid=1497312541 ETL Pattern
  39. 39. 20170612 s3://…/utc_date=20170101/batchid=1483229855 … s3://…/utc_date=20170611/batchid=1497226702 WAPStage-3: Publish audit.my_table_f_1497312000dw.my_table_f s3://…/utc_date=20170612/batchid=1497312541 ETL Pattern
  40. 40. 20170612 s3://…/utc_date=20170101/batchid=1483229855 … s3://…/utc_date=20170611/batchid=1497226702 WAPStage-3: Publish s3://…/utc_date=20170612/batchid=1497312541 dw.my_table_f valid_thru_ts = 20170613 00:00:00 ETL Pattern
  41. 41. 20170612 Quinto evaluations ● intelligent recommendations ● multiple tiers of coverage ● configurable rules Jumpstarter. Python Library
  42. 42. WAP. Python Library Minimal requirements ● parameterized destination table
  43. 43. WAP. Python Library Running WAP.
  44. 44. 20170612 What’s Next. ● additional Metacat statistics ● robust anomaly detection (RAD) ● complete migration for all prod tables
  45. 45. 20170612 Tips & Lessons Learned. ● Query-based solution may be “good enough” for many. ● Not all tables need quality coverage. ● One size rarely fits all tables. ● Build components, not “all-or-nothing” frameworks.
  46. 46. MICHELLE UFFORD mufford@netflix.com twitter.com/MichelleUfford DATA techblog.netflix.com medium.com/netflix-techblog twitter.com/NetflixData tinyurl.com/NetflixData Thank you! WE’RE HIRING! jobs.netflix.com

Notas del editor

  • https://dataworkssummit.com/san-jose-2017/sessions/whoops-the-numbers-are-wrong-scaling-data-quality-netflix/

    WHOOPS, THE NUMBERS ARE WRONG! SCALING DATA QUALITY @ NETFLIX
    Netflix is a famously data-driven company. Data is used to make informed decisions on everything from content acquisition to content delivery, and everything in-between. As with any data-driven company, it’s critical that data used by the business is accurate. Or, at worst, that the business has visibility into potential quality issues as soon as they arise. But even in the most mature data warehouses, data quality can be hard. How can we ensure high quality in a cloud-based, internet-scale, modern big data warehouse employing a variety of data engineering technologies?

    In this talk, Michelle Ufford will share how the Data Engineering & Analytics team at Netflix is doing exactly that. We’ll kick things off with a quick overview of Netflix’s analytics environment, then dig into details of our data quality solution. We’ll cover what worked, what didn’t work so well, and what we plan to work on next. We’ll conclude with some tips and lessons learned for ensuring data quality on big data.

    DETAILS
    This session is a (Intermediate) talk in our Data Processing and Warehousing track. It focuses on Apache Hadoop, Apache Hive, Apache Pig, Apache Spark and is geared towards Architect, Data Analyst, Developer / Engineer, Operations / IT audiences.

×