Data is Yahoo!'s most strategic asset, from user engagement and insights data to revenue and billing data. Three years ago, Yahoo! invested in a Data Quality program.
By applying industry principles and techniques, the Data Quality program has provided proactive and reactive system solutions to Audience data issues and their root causes. It does so by addressing the technical challenges of data quality at scale and by engaging the rest of the organization in the solution: from product teams, through the data stack (data sourcing, ETL, aggregations, and analytics), to the analysts and sciences teams who consume the data. This methodology is now being scaled to all data across Yahoo!, including Search and Display Advertising.
4. The Anatomy of a Yahoo! Web Page: Buzz, Targeted Content, Apps, Ads, Content, Y! links
5. What Does Yahoo! Do With Its Data? Analytics & Business Insights (data-driven decisions): How many people visited Home Page today and what did they click on? What impact did the Japan tsunami have on News and global engagement? Targeting: What products are you interested in based on your recent web usage? Advertisers pay a lot of $$ for good targeting. Targeted content means better user engagement. Experimentation (“live user testing”): What layout do users like best? Which layouts are most profitable?
8. Yahoo! Has a LOT of Data. Leading Internet Portal and Software Supplier [1]. Serves 640 MM users, or 84.5% of US internet users. Top-ranked site in Mail, Messenger, Home Page, and more. Collects over 25 terabytes of behavioral data per day: 2 U.S. Library of Congress equivalents every day. [1] US Yahoo! Audience Measurement Report. comScore, Jan 2011
10. Processes data from the web server logs of all Yahoo! properties and delivers audience engagement metrics
41. Support for DQ in the QE Cycle: Data Validation. Test Environment: E2E data validation tests covering major customer use cases in the pre-release QE cycle. Note: Specific tools are not currently part of the DQ standard, but partnership in this area may make sense.
43. Compare results from legacy system or previous version of system (with production data)
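One way to picture this comparison is a small regression check that diffs the metrics produced by the legacy system against those from the new version on the same input. This is only a sketch: the function name `compare_metrics`, the metric names, the values, and the tolerance are all hypothetical and are not taken from the Yahoo! tooling.

```python
from math import isclose


def compare_metrics(legacy, candidate, rel_tol=0.001):
    """Return the metrics whose values differ between two runs.

    Both arguments map a metric name to a number. A metric missing
    from one side is reported with None for that side.
    """
    diffs = {}
    for key in sorted(set(legacy) | set(candidate)):
        old = legacy.get(key)
        new = candidate.get(key)
        if old is None or new is None:
            diffs[key] = (old, new)  # metric present on one side only
        elif not isclose(old, new, rel_tol=rel_tol):
            diffs[key] = (old, new)  # values disagree beyond the tolerance
    return diffs


# Hypothetical page-view counts from the two systems for the same day.
legacy_run = {"homepage_pv": 1000, "mail_pv": 500, "news_pv": 250}
candidate_run = {"homepage_pv": 1000, "mail_pv": 520, "sports_pv": 75}

mismatches = compare_metrics(legacy_run, candidate_run)
for name, (old, new) in mismatches.items():
    print(f"{name}: legacy={old} candidate={new}")
```

A relative tolerance is used here on the assumption that small, explainable drifts between versions should not fail the comparison; an exact match would be the stricter choice.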
46. Accuracy - Test that the data input equals the data output. If data is requested for a specific day in one time zone but fetched in another, the data will not be accurate.
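The time zone pitfall can be illustrated with a few timestamps counted against the same calendar day in two zones. The events, the helper `count_events_on_day`, and the fixed UTC-8 offset (which ignores daylight saving) are assumptions made for this sketch.

```python
from datetime import date, datetime, timedelta, timezone

UTC = timezone.utc
PACIFIC = timezone(timedelta(hours=-8))  # fixed offset; ignores daylight saving


def count_events_on_day(events, day, tz):
    """Count events whose timestamp falls on `day` when viewed in `tz`."""
    return sum(1 for ts in events if ts.astimezone(tz).date() == day)


# Hypothetical event timestamps, stored in UTC.
events = [
    datetime(2011, 3, 11, 2, 0, tzinfo=UTC),   # 2011-03-10 18:00 at UTC-8
    datetime(2011, 3, 11, 9, 0, tzinfo=UTC),   # 2011-03-11 01:00 at UTC-8
    datetime(2011, 3, 11, 20, 0, tzinfo=UTC),  # 2011-03-11 12:00 at UTC-8
]

day = date(2011, 3, 11)
utc_count = count_events_on_day(events, day, UTC)
pacific_count = count_events_on_day(events, day, PACIFIC)
print(utc_count, pacific_count)
```

The same request for "March 11" yields different counts depending on which zone defines the day boundary, which is the kind of mismatch an accuracy test would need to catch.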
47. Support for DQ in the QE Cycle: QE Coverage of DQ Features. Functional test coverage for built-in DQ features, e.g., in-line DQ checks.
71. Owned by DQ team with review input from PM, Dev, QE, & Customer
72. DQ Central – End to End Audience Data. Features: data statistics/trending of audience PVs for a property at each stage of the audience pipeline; end-to-end data transparency per page and server; critical traffic fluctuation notification for properties and custom monitoring for any data customers; data issue investigation and diagnostics; open/overdue data quality bug tracking.
73. DQ Central – DQ Champion Engagement: 1. Data Source Metrics are monitored and an anomaly is found. 2. Each alert is registered in the DQ database. 3. An email detailing the alert(s) is sent to the DQ Champion. 4. DQ Champion manages alert sign-off in the DQ Central UI. 5. Sign-off information is captured for the alert in the DQ database. 6. Explanation (Reason, BugID) is overlaid on the data.
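The six steps on this slide can be sketched as a tiny in-memory workflow. Everything here is illustrative: the slide does not say how anomalies are detected, so the deviation-from-mean rule, the 30% threshold, the `Alert` fields, and the sample numbers are assumptions, and a plain list stands in for the DQ database.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Optional


@dataclass
class Alert:
    metric: str
    day: str
    value: float
    baseline: float
    reason: Optional[str] = None  # filled in when the DQ Champion signs off
    bug_id: Optional[str] = None

    @property
    def signed_off(self):
        return self.reason is not None


def find_anomaly(metric, history, day, value, max_change=0.3):
    """Step 1: flag `value` if it deviates from the mean of a non-empty
    `history` by more than `max_change` (a fraction of the mean)."""
    baseline = mean(history)
    if baseline and abs(value - baseline) / baseline > max_change:
        return Alert(metric, day, value, baseline)
    return None


def overlay(alert):
    """Step 6: text that a chart could show next to the data point."""
    return f"{alert.day} {alert.metric}: {alert.reason} (bug: {alert.bug_id})"


alerts = []  # stand-in for the DQ database

alert = find_anomaly("news_pv", [100, 110, 90, 100], "2011-03-11", 180)
if alert is not None:
    alerts.append(alert)  # step 2: register the alert
    # Step 3 (email to the DQ Champion) is omitted from this sketch.
    alert.reason = "Traffic spike from a major news event"  # steps 4-5
    print(overlay(alert))
```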
78. Extensive end-to-end analysis of feeds with missing data, upstream feeds, and data sources, slicing and dicing by interesting dimensions to understand the source and cause of the issue
79. Conclusion: Expected behavior; the field of interest was populated according to sampling rates as designed, but this was known only by Serving teams, not by Sciences customers
83. Reduces or removes confusion regarding differences in two seemingly-similar data sets
84. There is a close relationship between Data Lineage and Data Transparency: the former describes the processing rules behind the latter's transactional data.
139. Challenge of dealing with senior technologists/architects who want the perfect technical solution vs. the need to make progress with interim, viable (if ugly or manual) solutions
“Riding Giants” is only possible by using a recently-discovered method: tow-in surfing
Show video: http://www.youtube.com/watch?v=LhKFTqxn6qs
70’ wave = power of data
UDA DSI = jet-ski method (unlocking the data to harness the 70’ wave)
UDA DQ = getting the GPS coordinates correct so you are in the right place to catch it – without high quality data we miss the wave altogether!