SlideShare una empresa de Scribd logo
1 de 49
WHOOPS, THE NUMBERS
ARE WRONG!
SCALING DATA QUALITY
@
MICHELLE UFFORD
DATA ENGINEERING & ANALYTICS, NETFLIX
HADOOP SUMMIT 2017
Overview.
The business.
20170612
100+ million
members
$6 billion
on content
125+ million
hours watched
launched
in 1997
every. day.
$
Anytime. Anywhere.*
20170612
* Well, almost anywhere.
Any device.
20170612
300 terabyte
DW writes
5 petabyte
DW reads
The data.
20170612
60+ petabyte
data warehouse
700+ billion
events written
300 terabyte
DW writes
5 petabyte
DW reads
The data.
20170612
60+ petabyte
data warehouse
700+ billion
events written
20170612
data access
AWS
S3
Amazon
Redshift
data processing
fast storage data viz
METACA
T
data services
events data
operational data
elastic storage
Apache Pig
Big Data Platform
Data Quality.
Federated metastore &
extensible data catalog
20170612
Metacat
Federated Metastore
s3://…/dw/fact_table_f/utc_date=20170101/batchid=1483229855
…
s3://…/dw/fact_table_f/utc_date=20170611/batchid=1497226702
s3://…/dw/fact_table_f/utc_date=20170612/batchid=1497312541
dw.fact_table_f
20170612
Metacat
Federated Metastore
s3://…/dw/fact_table_f/utc_date=20170101/batchid=1483229855
…
s3://…/dw/fact_table_f/utc_date=20170611/batchid=1497226702
s3://…/dw/fact_table_f/utc_date=20170612/batchid=1497312541
dw.fact_table_f
utc_date=20170101
utc_date=20170611
utc_date=20170612
…
20170612
Metacat
Federated Metastore
utc_date=20170101
20170612
Metacat
Federated Metastore
utc_date=20170101
20170612
Extended table attributes
● primary key(s)
● column types
● lifecycle
● audience
● “valid-thru” timestamp
● … and much more
Metacat
Federated Metastore
Data Quality Service.
20170612
Quinto
Data Quality Service
20170612
Quinto
Data Quality Service
20170612
Quinto
Data Quality Service
20170612
Quinto
Data Quality Service
Write - Audit - Publish
ETL pattern
for high-quality
big data jobs
20170612
s3://…/utc_date=20170101/batchid=1483229855
…
s3://…/utc_date=20170611/batchid=1497226702
dw.my_table_f
WAP
ETL Pattern
20170612
s3://…/utc_date=20170101/batchid=1483229855
…
s3://…/utc_date=20170611/batchid=1497226702
dw.my_table_f audit.my_table_f_1497312000
WAPStage-0: Prep
ETL Pattern
20170612
s3://…/utc_date=20170101/batchid=1483229855
…
s3://…/utc_date=20170611/batchid=1497226702
s3://…/utc_date=20170612/batchid=1497312541
WAPStage-1: Write
audit.my_table_f_1497312000dw.my_table_f
ETL Pattern
20170612
s3://…/utc_date=20170101/batchid=1483229855
…
s3://…/utc_date=20170611/batchid=1497226702
WAPStage-1: Write
audit.my_table_f_1497312000
$TABLE
dw.my_table_f
utc_date=20170612
com.netflix.dse.mds.metric.RowCount: 17240
com.netflix.dse.mds.metric.NullCount: 17240
...
ETL Pattern
20170612
s3://…/utc_date=20170101/batchid=1483229855
…
WAPStage-2: Audit
audit.my_table_f_1497312000dw.my_table_f
utc_date=20170611
com.netflix.dse.mds.metric.RowCount: 16135
com.netflix.dse.mds.metric.NullCount: 21
...
Quint
o
utc_date=20170612
com.netflix.dse.mds.metric.RowCount: 17240
com.netflix.dse.mds.metric.NullCount: 17240
...
ETL Pattern
20170612
…
WAPStage-2: Audit
audit.my_table_f_1497312000dw.my_table_f
Quint
o
metric eval behavior result
--------------------------------------------------
RowCount >= zero fail job
RowCount >= prior value fail job
NullCount normal dist warn job
Quinto configuration
utc_date=20170612
com.netflix.dse.mds.metric.RowCount: 17240
com.netflix.dse.mds.metric.NullCount: 17240
...
utc_date=20170611
com.netflix.dse.mds.metric.RowCount: 16135
com.netflix.dse.mds.metric.NullCount: 21
...
ETL Pattern
20170612
…
WAPStage-2: Audit
audit.my_table_f_1497312000dw.my_table_f
Quint
o
metric eval behavior result
--------------------------------------------------
RowCount >= zero fail job
RowCount >= prior value fail job
NullCount normal dist warn job
utc_date=20170612
com.netflix.dse.mds.metric.RowCount: 17240
com.netflix.dse.mds.metric.NullCount: 17240
...
utc_date=20170611
com.netflix.dse.mds.metric.RowCount: 16135
com.netflix.dse.mds.metric.NullCount: 21
...
Quinto configuration
ETL Pattern
20170612
…
WAPStage-2: Audit
audit.my_table_f_1497312000dw.my_table_f
Quint
o
metric eval behavior result
--------------------------------------------------
RowCount >= zero fail job
RowCount >= prior value fail job
NullCount normal dist warn job
utc_date=20170612
com.netflix.dse.mds.metric.RowCount: 17240
com.netflix.dse.mds.metric.NullCount: 17240
...
utc_date=20170611
com.netflix.dse.mds.metric.RowCount: 16135
com.netflix.dse.mds.metric.NullCount: 21
...
Quinto configuration
ETL Pattern
20170612
…
WAPStage-2: Audit
audit.my_table_f_1497312000dw.my_table_fmetric eval behavior result
--------------------------------------------------
RowCount >= zero fail job pass
RowCount >= prior value fail job
NullCount normal dist warn job
Quint
o
utc_date=20170612
com.netflix.dse.mds.metric.RowCount: 17240
com.netflix.dse.mds.metric.NullCount: 17240
...
utc_date=20170611
com.netflix.dse.mds.metric.RowCount: 16135
com.netflix.dse.mds.metric.NullCount: 21
...
Quinto configuration
ETL Pattern
20170612
…
WAPStage-2: Audit
audit.my_table_f_1497312000dw.my_table_fmetric eval behavior result
--------------------------------------------------
RowCount >= zero fail job pass
RowCount >= prior value fail job
NullCount normal dist warn job
Quint
o
utc_date=20170612
com.netflix.dse.mds.metric.RowCount: 17240
com.netflix.dse.mds.metric.NullCount: 17240
...
utc_date=20170611
com.netflix.dse.mds.metric.RowCount: 16135
com.netflix.dse.mds.metric.NullCount: 21
...
Quinto configuration
ETL Pattern
20170612
…
WAPStage-2: Audit
audit.my_table_f_1497312000dw.my_table_fmetric eval behavior result
--------------------------------------------------
RowCount >= zero fail job pass
RowCount >= prior value fail job
NullCount normal dist warn job
Quint
o
utc_date=20170612
com.netflix.dse.mds.metric.RowCount: 17240
com.netflix.dse.mds.metric.NullCount: 17240
...
utc_date=20170611
com.netflix.dse.mds.metric.RowCount: 16135
com.netflix.dse.mds.metric.NullCount: 21
...
Quinto configuration
ETL Pattern
20170612
…
WAPStage-2: Audit
audit.my_table_f_1497312000dw.my_table_fmetric eval behavior result
--------------------------------------------------
RowCount >= zero fail job pass
RowCount >= prior value fail job pass
NullCount normal dist warn job
Quint
o
utc_date=20170612
com.netflix.dse.mds.metric.RowCount: 17240
com.netflix.dse.mds.metric.NullCount: 17240
...
utc_date=20170611
com.netflix.dse.mds.metric.RowCount: 16135
com.netflix.dse.mds.metric.NullCount: 21
...
Quinto configuration
ETL Pattern
20170612
…
WAPStage-2: Audit
audit.my_table_f_1497312000dw.my_table_fmetric eval behavior result
--------------------------------------------------
RowCount >= zero fail job pass
RowCount >= prior value fail job pass
NullCount normal dist warn job
Quint
o
utc_date=20170612
com.netflix.dse.mds.metric.RowCount: 17240
com.netflix.dse.mds.metric.NullCount: 17240
...
utc_date=20170611
com.netflix.dse.mds.metric.RowCount: 16135
com.netflix.dse.mds.metric.NullCount: 21
...
Quinto configuration
ETL Pattern
20170612
…
WAPStage-2: Audit
audit.my_table_f_1497312000dw.my_table_fmetric eval behavior result
--------------------------------------------------
RowCount >= zero fail job pass
RowCount >= prior value fail job pass
NullCount normal dist warn job
Quint
o
utc_date=20170612
com.netflix.dse.mds.metric.RowCount: 17240
com.netflix.dse.mds.metric.NullCount: 17240
...
utc_date=20170611
com.netflix.dse.mds.metric.RowCount: 16135
com.netflix.dse.mds.metric.NullCount: 21
...
Quinto configuration
ETL Pattern
20170612
…
WAPStage-2: Audit
audit.my_table_f_1497312000dw.my_table_fmetric eval behavior result
--------------------------------------------------
RowCount >= zero fail job pass
RowCount >= prior value fail job pass
NullCount normal dist warn job fail
Quint
o
utc_date=20170612
com.netflix.dse.mds.metric.RowCount: 17240
com.netflix.dse.mds.metric.NullCount: 17240
...
utc_date=20170611
com.netflix.dse.mds.metric.RowCount: 16135
com.netflix.dse.mds.metric.NullCount: 21
...
Quinto configuration
ETL Pattern
20170612
…
WAPStage-2: Audit
audit.my_table_f_1497312000dw.my_table_fmetric eval behavior result
--------------------------------------------------
RowCount >= zero fail job pass
RowCount >= prior value fail job pass
NullCount normal dist warn job fail
Quint
o
utc_date=20170612
com.netflix.dse.mds.metric.RowCount: 17240
com.netflix.dse.mds.metric.NullCount: 17240
...
utc_date=20170611
com.netflix.dse.mds.metric.RowCount: 16135
com.netflix.dse.mds.metric.NullCount: 21
...
Quinto configuration
ETL Pattern
20170612
s3://…/utc_date=20170101/batchid=1483229855
…
s3://…/utc_date=20170611/batchid=1497226702
WAPStage-3: Publish
audit.my_table_f_1497312000dw.my_table_f
s3://…/utc_date=20170612/batchid=1497312541
ETL Pattern
20170612
s3://…/utc_date=20170101/batchid=1483229855
…
s3://…/utc_date=20170611/batchid=1497226702
WAPStage-3: Publish
audit.my_table_f_1497312000dw.my_table_f
s3://…/utc_date=20170612/batchid=1497312541
ETL Pattern
20170612
s3://…/utc_date=20170101/batchid=1483229855
…
s3://…/utc_date=20170611/batchid=1497226702
WAPStage-3: Publish
s3://…/utc_date=20170612/batchid=1497312541
dw.my_table_f
valid_thru_ts = 20170613 00:00:00
ETL Pattern
20170612
Quinto evaluations
● intelligent recommendations
● multiple tiers of coverage
● configurable rules
Jumpstarter.
Python Library
WAP.
Python Library
Minimal requirements
● parameterized destination table
WAP.
Python Library
Running WAP.
20170612
What’s Next.
● additional Metacat statistics
● robust anomaly detection (RAD)
● complete migration for all prod tables
20170612
Tips & Lessons Learned.
● Query-based solution may be “good enough” for many.
● Not all tables need quality coverage.
● One size rarely fits all tables.
● Build components, not “all-or-nothing” frameworks.
MICHELLE UFFORD
mufford@netflix.com
twitter.com/MichelleUfford
DATA
techblog.netflix.com
medium.com/netflix-techblog
twitter.com/NetflixData
tinyurl.com/NetflixData
Thank you!
WE’RE HIRING! jobs.netflix.com

Más contenido relacionado

La actualidad más candente

Squirreling Away $640 Billion: How Stripe Leverages Flink for Change Data Cap...
Squirreling Away $640 Billion: How Stripe Leverages Flink for Change Data Cap...Squirreling Away $640 Billion: How Stripe Leverages Flink for Change Data Cap...
Squirreling Away $640 Billion: How Stripe Leverages Flink for Change Data Cap...Flink Forward
 
Data In Motion Paris 2023
Data In Motion Paris 2023Data In Motion Paris 2023
Data In Motion Paris 2023confluent
 
Disaster Recovery Plans for Apache Kafka
Disaster Recovery Plans for Apache KafkaDisaster Recovery Plans for Apache Kafka
Disaster Recovery Plans for Apache Kafkaconfluent
 
Rainbird: Realtime Analytics at Twitter (Strata 2011)
Rainbird: Realtime Analytics at Twitter (Strata 2011)Rainbird: Realtime Analytics at Twitter (Strata 2011)
Rainbird: Realtime Analytics at Twitter (Strata 2011)Kevin Weil
 
Building Reliable Lakehouses with Apache Flink and Delta Lake
Building Reliable Lakehouses with Apache Flink and Delta LakeBuilding Reliable Lakehouses with Apache Flink and Delta Lake
Building Reliable Lakehouses with Apache Flink and Delta LakeFlink Forward
 
RedisConf17- Using Redis at scale @ Twitter
RedisConf17- Using Redis at scale @ TwitterRedisConf17- Using Redis at scale @ Twitter
RedisConf17- Using Redis at scale @ TwitterRedis Labs
 
Making Apache Spark Better with Delta Lake
Making Apache Spark Better with Delta LakeMaking Apache Spark Better with Delta Lake
Making Apache Spark Better with Delta LakeDatabricks
 
Scaling Data Quality @ Netflix
Scaling Data Quality @ NetflixScaling Data Quality @ Netflix
Scaling Data Quality @ NetflixMichelle Ufford
 
Anomaly Detection at Scale!
Anomaly Detection at Scale!Anomaly Detection at Scale!
Anomaly Detection at Scale!Databricks
 
Hardening Kafka Replication
Hardening Kafka Replication Hardening Kafka Replication
Hardening Kafka Replication confluent
 
How Uber scaled its Real Time Infrastructure to Trillion events per day
How Uber scaled its Real Time Infrastructure to Trillion events per dayHow Uber scaled its Real Time Infrastructure to Trillion events per day
How Uber scaled its Real Time Infrastructure to Trillion events per dayDataWorks Summit
 
Bash: the #1 skill for Devops Engineers
Bash: the #1 skill for Devops EngineersBash: the #1 skill for Devops Engineers
Bash: the #1 skill for Devops Engineersandersjanmyr
 
Air traffic controller - Streams Processing meetup
Air traffic controller  - Streams Processing meetupAir traffic controller  - Streams Processing meetup
Air traffic controller - Streams Processing meetupEd Yakabosky
 
Product management class rookie to pro
Product management class rookie to proProduct management class rookie to pro
Product management class rookie to proBim Akinfenwa
 
Log management with ELK
Log management with ELKLog management with ELK
Log management with ELKGeert Pante
 
When Apache Spark Meets TiDB with Xiaoyu Ma
When Apache Spark Meets TiDB with Xiaoyu MaWhen Apache Spark Meets TiDB with Xiaoyu Ma
When Apache Spark Meets TiDB with Xiaoyu MaDatabricks
 
Road to (Enterprise) Observability
Road to (Enterprise) ObservabilityRoad to (Enterprise) Observability
Road to (Enterprise) ObservabilityChristoph Engelbert
 
Lessons learned processing 70 billion data points a day using the hybrid cloud
Lessons learned processing 70 billion data points a day using the hybrid cloudLessons learned processing 70 billion data points a day using the hybrid cloud
Lessons learned processing 70 billion data points a day using the hybrid cloudDataWorks Summit
 
Flexible and Real-Time Stream Processing with Apache Flink
Flexible and Real-Time Stream Processing with Apache FlinkFlexible and Real-Time Stream Processing with Apache Flink
Flexible and Real-Time Stream Processing with Apache FlinkDataWorks Summit
 

La actualidad más candente (20)

Squirreling Away $640 Billion: How Stripe Leverages Flink for Change Data Cap...
Squirreling Away $640 Billion: How Stripe Leverages Flink for Change Data Cap...Squirreling Away $640 Billion: How Stripe Leverages Flink for Change Data Cap...
Squirreling Away $640 Billion: How Stripe Leverages Flink for Change Data Cap...
 
Data In Motion Paris 2023
Data In Motion Paris 2023Data In Motion Paris 2023
Data In Motion Paris 2023
 
Disaster Recovery Plans for Apache Kafka
Disaster Recovery Plans for Apache KafkaDisaster Recovery Plans for Apache Kafka
Disaster Recovery Plans for Apache Kafka
 
Rainbird: Realtime Analytics at Twitter (Strata 2011)
Rainbird: Realtime Analytics at Twitter (Strata 2011)Rainbird: Realtime Analytics at Twitter (Strata 2011)
Rainbird: Realtime Analytics at Twitter (Strata 2011)
 
Building Reliable Lakehouses with Apache Flink and Delta Lake
Building Reliable Lakehouses with Apache Flink and Delta LakeBuilding Reliable Lakehouses with Apache Flink and Delta Lake
Building Reliable Lakehouses with Apache Flink and Delta Lake
 
RedisConf17- Using Redis at scale @ Twitter
RedisConf17- Using Redis at scale @ TwitterRedisConf17- Using Redis at scale @ Twitter
RedisConf17- Using Redis at scale @ Twitter
 
Making Apache Spark Better with Delta Lake
Making Apache Spark Better with Delta LakeMaking Apache Spark Better with Delta Lake
Making Apache Spark Better with Delta Lake
 
Scaling Data Quality @ Netflix
Scaling Data Quality @ NetflixScaling Data Quality @ Netflix
Scaling Data Quality @ Netflix
 
Anomaly Detection at Scale!
Anomaly Detection at Scale!Anomaly Detection at Scale!
Anomaly Detection at Scale!
 
Hardening Kafka Replication
Hardening Kafka Replication Hardening Kafka Replication
Hardening Kafka Replication
 
How Uber scaled its Real Time Infrastructure to Trillion events per day
How Uber scaled its Real Time Infrastructure to Trillion events per dayHow Uber scaled its Real Time Infrastructure to Trillion events per day
How Uber scaled its Real Time Infrastructure to Trillion events per day
 
Bash: the #1 skill for Devops Engineers
Bash: the #1 skill for Devops EngineersBash: the #1 skill for Devops Engineers
Bash: the #1 skill for Devops Engineers
 
Observability
ObservabilityObservability
Observability
 
Air traffic controller - Streams Processing meetup
Air traffic controller  - Streams Processing meetupAir traffic controller  - Streams Processing meetup
Air traffic controller - Streams Processing meetup
 
Product management class rookie to pro
Product management class rookie to proProduct management class rookie to pro
Product management class rookie to pro
 
Log management with ELK
Log management with ELKLog management with ELK
Log management with ELK
 
When Apache Spark Meets TiDB with Xiaoyu Ma
When Apache Spark Meets TiDB with Xiaoyu MaWhen Apache Spark Meets TiDB with Xiaoyu Ma
When Apache Spark Meets TiDB with Xiaoyu Ma
 
Road to (Enterprise) Observability
Road to (Enterprise) ObservabilityRoad to (Enterprise) Observability
Road to (Enterprise) Observability
 
Lessons learned processing 70 billion data points a day using the hybrid cloud
Lessons learned processing 70 billion data points a day using the hybrid cloudLessons learned processing 70 billion data points a day using the hybrid cloud
Lessons learned processing 70 billion data points a day using the hybrid cloud
 
Flexible and Real-Time Stream Processing with Apache Flink
Flexible and Real-Time Stream Processing with Apache FlinkFlexible and Real-Time Stream Processing with Apache Flink
Flexible and Real-Time Stream Processing with Apache Flink
 

Similar a Whoops, The Numbers Are Wrong! Scaling Data Quality @ Netflix

SRV405 Deep Dive Amazon Redshift & Redshift Spectrum at Cardinal Health
SRV405 Deep Dive Amazon Redshift & Redshift Spectrum at Cardinal HealthSRV405 Deep Dive Amazon Redshift & Redshift Spectrum at Cardinal Health
SRV405 Deep Dive Amazon Redshift & Redshift Spectrum at Cardinal HealthAmazon Web Services
 
AWS re:Invent re:Cap - 데이터 분석: Amazon EC2 C4 Instance + Amazon EBS - 김일호
AWS re:Invent re:Cap - 데이터 분석: Amazon EC2 C4 Instance + Amazon EBS - 김일호AWS re:Invent re:Cap - 데이터 분석: Amazon EC2 C4 Instance + Amazon EBS - 김일호
AWS re:Invent re:Cap - 데이터 분석: Amazon EC2 C4 Instance + Amazon EBS - 김일호Amazon Web Services Korea
 
Launching Your First Big Data Project on AWS
Launching Your First Big Data Project on AWSLaunching Your First Big Data Project on AWS
Launching Your First Big Data Project on AWSAmazon Web Services
 
LendingClub RealTime BigData Platform with Oracle GoldenGate
LendingClub RealTime BigData Platform with Oracle GoldenGateLendingClub RealTime BigData Platform with Oracle GoldenGate
LendingClub RealTime BigData Platform with Oracle GoldenGateRajit Saha
 
Extending Analytics Beyond the Data Warehouse, ft. Warner Bros. Analytics (AN...
Extending Analytics Beyond the Data Warehouse, ft. Warner Bros. Analytics (AN...Extending Analytics Beyond the Data Warehouse, ft. Warner Bros. Analytics (AN...
Extending Analytics Beyond the Data Warehouse, ft. Warner Bros. Analytics (AN...Amazon Web Services
 
Oracle Goldengate for Big Data - LendingClub Implementation
Oracle Goldengate for Big Data - LendingClub ImplementationOracle Goldengate for Big Data - LendingClub Implementation
Oracle Goldengate for Big Data - LendingClub ImplementationVengata Guruswamy
 
Managing your Black Friday Logs NDC Oslo
Managing your  Black Friday Logs NDC OsloManaging your  Black Friday Logs NDC Oslo
Managing your Black Friday Logs NDC OsloDavid Pilato
 
Rakuten Technology Conference 2017 A Distributed SQL Database For Data Analy...
Rakuten Technology Conference 2017 A Distributed SQL Database  For Data Analy...Rakuten Technology Conference 2017 A Distributed SQL Database  For Data Analy...
Rakuten Technology Conference 2017 A Distributed SQL Database For Data Analy...Rakuten Group, Inc.
 
PayPal merchant ecosystem using Apache Spark, Hive, Druid, and HBase
PayPal merchant ecosystem using Apache Spark, Hive, Druid, and HBase PayPal merchant ecosystem using Apache Spark, Hive, Druid, and HBase
PayPal merchant ecosystem using Apache Spark, Hive, Druid, and HBase DataWorks Summit
 
Applying AI to Performance Engineering: Shift-Left, Shift-Right, Self-Healing
Applying AI to Performance Engineering: Shift-Left, Shift-Right, Self-HealingApplying AI to Performance Engineering: Shift-Left, Shift-Right, Self-Healing
Applying AI to Performance Engineering: Shift-Left, Shift-Right, Self-HealingAndreas Grabner
 
Managing your black friday logs - Code Europe
Managing your black friday logs - Code EuropeManaging your black friday logs - Code Europe
Managing your black friday logs - Code EuropeDavid Pilato
 
OLTP+OLAP=HTAP
 OLTP+OLAP=HTAP OLTP+OLAP=HTAP
OLTP+OLAP=HTAPEDB
 
Refactoring your EDW with Mobile Analytics Products
Refactoring your EDW with Mobile Analytics ProductsRefactoring your EDW with Mobile Analytics Products
Refactoring your EDW with Mobile Analytics ProductsLuke Han
 
GPSWKS401_Designing a Cloud Enterprise Data Warehouse
GPSWKS401_Designing a Cloud Enterprise Data WarehouseGPSWKS401_Designing a Cloud Enterprise Data Warehouse
GPSWKS401_Designing a Cloud Enterprise Data WarehouseAmazon Web Services
 
Running Presto and Spark on the Netflix Big Data Platform
Running Presto and Spark on the Netflix Big Data PlatformRunning Presto and Spark on the Netflix Big Data Platform
Running Presto and Spark on the Netflix Big Data PlatformEva Tse
 
Flink Forward SF 2017: Chinmay Soman - Real Time Analytics in the real World ...
Flink Forward SF 2017: Chinmay Soman - Real Time Analytics in the real World ...Flink Forward SF 2017: Chinmay Soman - Real Time Analytics in the real World ...
Flink Forward SF 2017: Chinmay Soman - Real Time Analytics in the real World ...Flink Forward
 
Redis Day TLV 2018 - 10 Reasons why Redis should be your Primary Database
Redis Day TLV 2018 - 10 Reasons why Redis should be your Primary DatabaseRedis Day TLV 2018 - 10 Reasons why Redis should be your Primary Database
Redis Day TLV 2018 - 10 Reasons why Redis should be your Primary DatabaseRedis Labs
 
(SDD409) Amazon RDS for PostgreSQL Deep Dive | AWS re:Invent 2014
(SDD409) Amazon RDS for PostgreSQL Deep Dive | AWS re:Invent 2014(SDD409) Amazon RDS for PostgreSQL Deep Dive | AWS re:Invent 2014
(SDD409) Amazon RDS for PostgreSQL Deep Dive | AWS re:Invent 2014Amazon Web Services
 
AWS Summit Tel Aviv - Startup Track - Data Analytics & Big Data
AWS Summit Tel Aviv - Startup Track - Data Analytics & Big DataAWS Summit Tel Aviv - Startup Track - Data Analytics & Big Data
AWS Summit Tel Aviv - Startup Track - Data Analytics & Big DataAmazon Web Services
 

Similar a Whoops, The Numbers Are Wrong! Scaling Data Quality @ Netflix (20)

SRV405 Deep Dive Amazon Redshift & Redshift Spectrum at Cardinal Health
SRV405 Deep Dive Amazon Redshift & Redshift Spectrum at Cardinal HealthSRV405 Deep Dive Amazon Redshift & Redshift Spectrum at Cardinal Health
SRV405 Deep Dive Amazon Redshift & Redshift Spectrum at Cardinal Health
 
AWS re:Invent re:Cap - 데이터 분석: Amazon EC2 C4 Instance + Amazon EBS - 김일호
AWS re:Invent re:Cap - 데이터 분석: Amazon EC2 C4 Instance + Amazon EBS - 김일호AWS re:Invent re:Cap - 데이터 분석: Amazon EC2 C4 Instance + Amazon EBS - 김일호
AWS re:Invent re:Cap - 데이터 분석: Amazon EC2 C4 Instance + Amazon EBS - 김일호
 
Launching Your First Big Data Project on AWS
Launching Your First Big Data Project on AWSLaunching Your First Big Data Project on AWS
Launching Your First Big Data Project on AWS
 
LendingClub RealTime BigData Platform with Oracle GoldenGate
LendingClub RealTime BigData Platform with Oracle GoldenGateLendingClub RealTime BigData Platform with Oracle GoldenGate
LendingClub RealTime BigData Platform with Oracle GoldenGate
 
SQREAM DB on IBM Power9
SQREAM DB on IBM Power9SQREAM DB on IBM Power9
SQREAM DB on IBM Power9
 
Extending Analytics Beyond the Data Warehouse, ft. Warner Bros. Analytics (AN...
Extending Analytics Beyond the Data Warehouse, ft. Warner Bros. Analytics (AN...Extending Analytics Beyond the Data Warehouse, ft. Warner Bros. Analytics (AN...
Extending Analytics Beyond the Data Warehouse, ft. Warner Bros. Analytics (AN...
 
Oracle Goldengate for Big Data - LendingClub Implementation
Oracle Goldengate for Big Data - LendingClub ImplementationOracle Goldengate for Big Data - LendingClub Implementation
Oracle Goldengate for Big Data - LendingClub Implementation
 
Managing your Black Friday Logs NDC Oslo
Managing your  Black Friday Logs NDC OsloManaging your  Black Friday Logs NDC Oslo
Managing your Black Friday Logs NDC Oslo
 
Rakuten Technology Conference 2017 A Distributed SQL Database For Data Analy...
Rakuten Technology Conference 2017 A Distributed SQL Database  For Data Analy...Rakuten Technology Conference 2017 A Distributed SQL Database  For Data Analy...
Rakuten Technology Conference 2017 A Distributed SQL Database For Data Analy...
 
PayPal merchant ecosystem using Apache Spark, Hive, Druid, and HBase
PayPal merchant ecosystem using Apache Spark, Hive, Druid, and HBase PayPal merchant ecosystem using Apache Spark, Hive, Druid, and HBase
PayPal merchant ecosystem using Apache Spark, Hive, Druid, and HBase
 
Applying AI to Performance Engineering: Shift-Left, Shift-Right, Self-Healing
Applying AI to Performance Engineering: Shift-Left, Shift-Right, Self-HealingApplying AI to Performance Engineering: Shift-Left, Shift-Right, Self-Healing
Applying AI to Performance Engineering: Shift-Left, Shift-Right, Self-Healing
 
Managing your black friday logs - Code Europe
Managing your black friday logs - Code EuropeManaging your black friday logs - Code Europe
Managing your black friday logs - Code Europe
 
OLTP+OLAP=HTAP
 OLTP+OLAP=HTAP OLTP+OLAP=HTAP
OLTP+OLAP=HTAP
 
Refactoring your EDW with Mobile Analytics Products
Refactoring your EDW with Mobile Analytics ProductsRefactoring your EDW with Mobile Analytics Products
Refactoring your EDW with Mobile Analytics Products
 
GPSWKS401_Designing a Cloud Enterprise Data Warehouse
GPSWKS401_Designing a Cloud Enterprise Data WarehouseGPSWKS401_Designing a Cloud Enterprise Data Warehouse
GPSWKS401_Designing a Cloud Enterprise Data Warehouse
 
Running Presto and Spark on the Netflix Big Data Platform
Running Presto and Spark on the Netflix Big Data PlatformRunning Presto and Spark on the Netflix Big Data Platform
Running Presto and Spark on the Netflix Big Data Platform
 
Flink Forward SF 2017: Chinmay Soman - Real Time Analytics in the real World ...
Flink Forward SF 2017: Chinmay Soman - Real Time Analytics in the real World ...Flink Forward SF 2017: Chinmay Soman - Real Time Analytics in the real World ...
Flink Forward SF 2017: Chinmay Soman - Real Time Analytics in the real World ...
 
Redis Day TLV 2018 - 10 Reasons why Redis should be your Primary Database
Redis Day TLV 2018 - 10 Reasons why Redis should be your Primary DatabaseRedis Day TLV 2018 - 10 Reasons why Redis should be your Primary Database
Redis Day TLV 2018 - 10 Reasons why Redis should be your Primary Database
 
(SDD409) Amazon RDS for PostgreSQL Deep Dive | AWS re:Invent 2014
(SDD409) Amazon RDS for PostgreSQL Deep Dive | AWS re:Invent 2014(SDD409) Amazon RDS for PostgreSQL Deep Dive | AWS re:Invent 2014
(SDD409) Amazon RDS for PostgreSQL Deep Dive | AWS re:Invent 2014
 
AWS Summit Tel Aviv - Startup Track - Data Analytics & Big Data
AWS Summit Tel Aviv - Startup Track - Data Analytics & Big DataAWS Summit Tel Aviv - Startup Track - Data Analytics & Big Data
AWS Summit Tel Aviv - Startup Track - Data Analytics & Big Data
 

Más de DataWorks Summit

Floating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache RatisFloating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache RatisDataWorks Summit
 
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFiTracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFiDataWorks Summit
 
HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...DataWorks Summit
 
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...DataWorks Summit
 
Managing the Dewey Decimal System
Managing the Dewey Decimal SystemManaging the Dewey Decimal System
Managing the Dewey Decimal SystemDataWorks Summit
 
Practical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist ExamplePractical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist ExampleDataWorks Summit
 
HBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberHBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberDataWorks Summit
 
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixScaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixDataWorks Summit
 
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiBuilding the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiDataWorks Summit
 
Supporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsSupporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsDataWorks Summit
 
Security Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureSecurity Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureDataWorks Summit
 
Presto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EnginePresto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EngineDataWorks Summit
 
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...DataWorks Summit
 
Extending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudExtending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudDataWorks Summit
 
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiEvent-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiDataWorks Summit
 
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerSecuring Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerDataWorks Summit
 
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...DataWorks Summit
 
Computer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouComputer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouDataWorks Summit
 
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkBig Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkDataWorks Summit
 

Más de DataWorks Summit (20)

Data Science Crash Course
Data Science Crash CourseData Science Crash Course
Data Science Crash Course
 
Floating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache RatisFloating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache Ratis
 
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFiTracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
 
HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...
 
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
 
Managing the Dewey Decimal System
Managing the Dewey Decimal SystemManaging the Dewey Decimal System
Managing the Dewey Decimal System
 
Practical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist ExamplePractical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist Example
 
HBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberHBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at Uber
 
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixScaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
 
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiBuilding the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
 
Supporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsSupporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability Improvements
 
Security Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureSecurity Framework for Multitenant Architecture
Security Framework for Multitenant Architecture
 
Presto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EnginePresto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything Engine
 
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
 
Extending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudExtending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google Cloud
 
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiEvent-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
 
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerSecuring Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
 
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
 
Computer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouComputer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near You
 
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkBig Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
 

Último

IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsEnterprise Knowledge
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking MenDelhi Call girls
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationRadu Cotescu
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...Neo4j
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?Antenna Manufacturer Coco
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsJoaquim Jorge
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreternaman860154
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024The Digital Insurer
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024Results
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)Gabriella Davis
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slidespraypatel2
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEarley Information Science
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Servicegiselly40
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024Rafal Los
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking MenDelhi Call girls
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationMichael W. Hawkins
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking MenDelhi Call girls
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptxHampshireHUG
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024The Digital Insurer
 

Último (20)

IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slides
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
 

Whoops, The Numbers Are Wrong! Scaling Data Quality @ Netflix

Notas del editor

  1. https://dataworkssummit.com/san-jose-2017/sessions/whoops-the-numbers-are-wrong-scaling-data-quality-netflix/ WHOOPS, THE NUMBERS ARE WRONG! SCALING DATA QUALITY @ NETFLIX Netflix is a famously data-driven company. Data is used to make informed decisions on everything from content acquisition to content delivery, and everything in-between. As with any data-driven company, it’s critical that data used by the business is accurate. Or, at worst, that the business has visibility into potential quality issues as soon as they arise. But even in the most mature data warehouses, data quality can be hard. How can we ensure high quality in a cloud-based, internet-scale, modern big data warehouse employing a variety of data engineering technologies? In this talk, Michelle Ufford will share how the Data Engineering & Analytics team at Netflix is doing exactly that. We’ll kick things off with a quick overview of Netflix’s analytics environment, then dig into details of our data quality solution. We’ll cover what worked, what didn’t work so well, and what we plan to work on next. We’ll conclude with some tips and lessons learned for ensuring data quality on big data. DETAILS This session is a (Intermediate) talk in our Data Processing and Warehousing track. It focuses on Apache Hadoop, Apache Hive, Apache Pig, Apache Spark and is geared towards Architect, Data Analyst, Developer / Engineer, Operations / IT audiences.
  2. 1500+ devices as of Q1 2017
  3. Goal is to provide behind-the-scenes look at how we’re approaching DQ. We’re sharing ideas, not code – no open-source announcement.
  4. That’s cool but sounds like a lot of work I need to: know what stats are available in Metacat know what quality templates exist figure out which ones I should use figure out a good configuration for each and do everything we just walked through in WAP
  5. Takes ~5 minutes to enable WAP with 108 audits on a new Spark job. Added 42 seconds to hourly processing time.
  6. It’s a combination of these solutions that allow us to scale not only the processing time but the engineering time too. Takes ~5 minutes to enable WAP with 108 audits on a new Spark job. Added 42 seconds to hourly processing time.
  7. Stats Cardinality Histograms Map keys RAD High cardinality dimensions Seasonality beyond week-over-week Atypical data distributions Reduce false positives
  8. Query-based solution Not efficient but much less complicated to implement Good place to start Works well for small-to-medium datasets and/or nightly batch ETL Transient, experimental, single-user Content vs. Streaming
  9. 2 main motivations Confidence Notify users when quality issues arise Only make data available after some basic validation Increase confidence for data consumers that data is good to use Efficiency Catch issues faster Less business impact Much easier to simply not update downstream dependencies than to fix it after-the-fact