How McAfee Built High-Quality Pipelines with Azure Databricks to Power Customer Insights on 250TB+ of Data: Lessons Learned in Data Governance and Lineage

"How do you make over 250TB of data useful to data scientists? Bring a lot of CPUs? If only it were so simple!

Understanding customer behavior requires high-quality, reliable data. Any data quality challenges are magnified with high volumes of data, limiting data scientists’ ability to understand, clean, and use data. If you have to clean up 600 million events per day, it’s like cleaning Moscone West’s floors: as soon as you’re done, you have to start all over again.

Come learn how McAfee built a data-driven pipeline using Azure Databricks to maintain high data quality and comprehensive lineage to enable data scientists to be more productive and make sound statistical inferences."

Published in: Data & Analytics

  1. WIFI SSID: SparkAISummit | Password: UnifiedAnalytics
  2. David Newell & Geoff Oitment, McAfee. Customer Insights from 250TB+ of Data: Lessons Learned in Data Governance and Lineage. #UnifiedAnalytics #SparkAISummit
  3. “At the heart of [AI] technology is data.” Henry Schuck, Forbes: https://www.forbes.com/sites/forbestechcouncil/2018/05/02/why-data-accuracy-is-critical-to-the-evolution-of-artificial-intelligence-in-b2b-sales/#57bebedf466d
  4. What’s so hard about Big Data? • Technology Moves Fast • Siloed Data Sources • Finding Signal in the Noise • Limited Big Data Skills
  5. McAfee started no better • Current data systems are 15-20 years old • Multiple data silos • What activity does an event track? • Limited data science talent across business functions. Typical McAfee Data Analyst: “I don’t know how to start!”
  6. Goal: enable self-service analytics • Comprehensive Customer 360: all the data in one place • Accurate Calculations: standardized, certified metrics • Easy to Access: self-service in tool of choice • Scalable & Responsive: get data fast, even in real-time
  7. The data lake fallacy: it’s magic! (Photos by unknown authors, licensed under CC BY-NC-SA.) A data lake is not a magic wand: • Veracity still dependent on data quality • Volume still dependent on inputs • Value still dependent on skill set
  8. How we kept the data lake clean: portal for systematic data governance • Data validation rules • Source data collection driven by documentation. Comprehensive documentation at all ingestion points. • Automatic documentation and governance enforcement • Generated tracking manifests • Data cleaned at the source. Centralized governance & documentation system.
  9. What we built: documentation-driven pipeline • All ETL jobs defined in configuration • Data lineage derived from same configuration. Configuration-driven design. • Enables end-to-end data flow visualization • Configuration generated by governance system. Integrates into the centralized portal.
  10. Architecture (diagram labels, in extraction order): Anonymized Source Data Data Governance Data Portal Event Processing Client SDK Collection & Queuing Event Hubs (Kafka-compatible) Storage Data Lake Data Lake Store (HDFS-compatible) Self-service Analysis or tool of choice! Data Science Pipeline Processing Azure Databricks Analytics Platform Azure Databricks Data Catalog Proprietary Marketing Systems Proprietary & Third-party Profiles • Account • Device Orchestration • Databricks Jobs • Azure Data Factory
  11. Lessons Learned • Don’t compromise on data quality; you’ll regret it later • Create small, manageable features that enable quick iteration • Think ahead; don’t do something that will handicap future development
  12. Don’t compromise on data quality; you’ll regret it later
  13. DON’T FORGET TO RATE AND REVIEW THE SESSIONS. SEARCH SPARK + AI SUMMIT
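
Slide 8 lists "data validation rules" and "source data collection driven by documentation" but does not show what such rules look like. The sketch below is one possible reading, not McAfee's implementation: each documented field doubles as a validation rule, and events that fail are reported before they reach the lake. The names `FieldRule`, `APP_LAUNCH_RULES`, and `validate`, and the example event fields, are all hypothetical.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass(frozen=True)
class FieldRule:
    """One documented field of an event; the documentation is also the rule."""
    name: str
    type_: type
    required: bool = True
    check: Callable[[Any], bool] = lambda value: True
    description: str = ""


# Hypothetical documentation entry for a single event type (an assumption,
# not taken from the slides).
APP_LAUNCH_RULES = [
    FieldRule("event_name", str, check=lambda v: v == "app_launch",
              description="Fixed identifier of the event type"),
    FieldRule("device_id", str, check=lambda v: len(v) > 0,
              description="Anonymized device identifier"),
    FieldRule("timestamp_ms", int, check=lambda v: v > 0,
              description="Client timestamp in milliseconds since epoch"),
    FieldRule("app_version", str, required=False,
              description="Optional client version string"),
]


def validate(event: dict, rules: list) -> list:
    """Return a list of human-readable problems; an empty list means valid."""
    errors = []
    for rule in rules:
        if rule.name not in event:
            if rule.required:
                errors.append(f"missing required field: {rule.name}")
            continue
        value = event[rule.name]
        if not isinstance(value, rule.type_):
            errors.append(
                f"{rule.name}: expected {rule.type_.__name__}, "
                f"got {type(value).__name__}"
            )
        elif not rule.check(value):
            errors.append(f"{rule.name}: failed validation check")
    return errors


if __name__ == "__main__":
    good = {"event_name": "app_launch", "device_id": "abc", "timestamp_ms": 1}
    bad = {"event_name": "app_launch", "timestamp_ms": "123"}
    print(validate(good, APP_LAUNCH_RULES))
    print(validate(bad, APP_LAUNCH_RULES))
```

Keeping the description and the check in the same record is one way to make documentation and enforcement hard to separate, which appears to be the intent behind "automatic documentation and governance enforcement".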

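Slide 9 states that all ETL jobs are defined in configuration and that data lineage is derived from the same configuration. A minimal sketch of that idea follows, under the assumption that each job declares its input and output datasets; the job names, dataset names, and the `JOBS` layout are hypothetical and do not come from the talk. The same dictionary yields both a run order and the upstream lineage of any dataset.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical job configuration: each job names its inputs and its output.
JOBS = {
    "clean_events_job": {"inputs": ["raw_events"], "output": "clean_events"},
    "sessionize_job": {"inputs": ["clean_events"], "output": "sessions"},
    "build_customer_360": {"inputs": ["sessions", "profiles"],
                           "output": "customer_360"},
}


def lineage_edges(jobs: dict) -> list:
    """Return (source dataset, target dataset, job) triples, sorted."""
    return sorted(
        (src, spec["output"], name)
        for name, spec in jobs.items()
        for src in spec["inputs"]
    )


def run_order(jobs: dict) -> list:
    """Order jobs so that every job runs after the jobs producing its inputs."""
    producer = {spec["output"]: name for name, spec in jobs.items()}
    graph = {
        name: {producer[src] for src in spec["inputs"] if src in producer}
        for name, spec in jobs.items()
    }
    return list(TopologicalSorter(graph).static_order())


def upstream(dataset: str, jobs: dict) -> set:
    """Return every dataset that the given dataset depends on, transitively."""
    inputs_of = {spec["output"]: spec["inputs"] for spec in jobs.values()}
    seen = set()
    stack = [dataset]
    while stack:
        current = stack.pop()
        for src in inputs_of.get(current, []):
            if src not in seen:
                seen.add(src)
                stack.append(src)
    return seen


if __name__ == "__main__":
    print(run_order(JOBS))
    print(sorted(upstream("customer_360", JOBS)))
```

Because lineage is computed rather than written by hand, it cannot drift from what the jobs actually read and write, which is presumably what makes the end-to-end data flow visualization on the slide possible.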