How McAfee Built High-Quality Pipelines with Azure Databricks to Power Customer Insights on 250TB+ of Data: Lessons Learned in Data Governance and Lineage
"How do you make over 250TB of data useful to data scientists? Bring a lot of CPUs? If only it were so simple!
Understanding customer behavior requires high-quality, reliable data. Data quality challenges are magnified at high volumes, limiting data scientists’ ability to understand, clean, and use the data. Cleaning up 600 million events per day is like cleaning Moscone West’s floors: as soon as you’re done, you have to start all over again.
Come learn how McAfee built a data-driven pipeline using Azure Databricks to maintain high data quality and comprehensive lineage to enable data scientists to be more productive and make sound statistical inferences."
2. David Newell & Geoff Oitment, McAfee
Customer Insights from 250TB+ of Data:
Lessons Learned in Data Governance and Lineage
#UnifiedAnalytics #SparkAISummit
3. “At the heart of [AI] technology is data.”
– Henry Schuck, Forbes
https://www.forbes.com/sites/forbestechcouncil/2018/05/02/why-data-accuracy-is-critical-to-the-evolution-of-artificial-intelligence-in-b2b-sales/#57bebedf466d
4. What’s so hard about Big Data?
Technology Moves Fast
Siloed Data Sources
Finding Signal in the Noise
Limited Big Data Skills
5. McAfee started no better
Current data systems are 15-20 years old
Multiple data silos
What activity does an event track?
Limited data science talent across
business functions
Typical McAfee data analyst: “I don’t know how to start!”
6. Goal: enable self-service analytics
Comprehensive Customer 360
All the data in one place
Accurate Calculations
Standardized, certified metrics
Easy to Access
Self-service in tool of choice
Scalable & Responsive
Get data fast, even in real-time
7. The data lake fallacy: it’s magic!
A data lake is not a magic wand:
• Veracity still dependent on data quality
• Volume still dependent on inputs
• Value still dependent on skill set
8. How we kept the data lake clean:
portal for systematic data governance
Comprehensive documentation at all ingestion points
• Data validation rules
• Source data collection driven by documentation
Centralized governance & documentation system
• Automatic documentation and governance enforcement
• Generated tracking manifests
• Data cleaned at the source
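The idea of cleaning data at the source with documentation-driven validation rules can be sketched as follows. This is a minimal illustration, not McAfee's implementation; the field names and rules are hypothetical stand-ins for what a governance portal would generate:

```python
# Hypothetical validation rules, as a governance portal might generate them
# from the documented event schema. Field names here are illustrative only.
RULES = {
    "event_type": lambda v: isinstance(v, str) and v != "",
    "timestamp":  lambda v: isinstance(v, (int, float)) and v > 0,
    "device_id":  lambda v: isinstance(v, str) and len(v) == 32,
}

def validate_event(event: dict) -> tuple[bool, list[str]]:
    """Check one raw event against the documented rules.

    Returns (is_valid, list_of_failed_fields) so bad events can be
    rejected or quarantined at the ingestion point rather than
    polluting the data lake downstream.
    """
    failures = [field for field, rule in RULES.items()
                if field not in event or not rule(event.get(field))]
    return (not failures, failures)
```

Because the rules live in one place alongside the documentation, every ingestion point enforces the same contract, and a rule change is a data change rather than a code change.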
9. What we built:
documentation-driven pipeline
Configuration-driven design
• All ETL jobs defined in configuration
• Data lineage derived from the same configuration
Integrates into the centralized portal
• Enables end-to-end data flow visualization
• Configuration generated by governance system
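Deriving lineage from the same configuration that defines the ETL jobs can be sketched like this. The job names and table names are hypothetical; the point is that lineage edges fall out of the config for free, with no separate lineage bookkeeping:

```python
# Hypothetical job configuration; in the deck's design this would be
# generated by the centralized governance portal, not written by hand.
JOBS = [
    {"name": "clean_events", "inputs": ["raw_events"],   "output": "clean_events"},
    {"name": "device_daily", "inputs": ["clean_events"], "output": "device_daily_agg"},
]

def lineage_edges(jobs: list[dict]) -> list[tuple[str, str, str]]:
    """Derive (source_table, job_name, target_table) lineage edges
    from the same configuration that drives the ETL jobs."""
    return [(src, job["name"], job["output"])
            for job in jobs
            for src in job["inputs"]]
```

The same edge list can then feed an end-to-end data-flow visualization, since every hop from source table through job to target table is explicit.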
10. Architecture
Anonymized source data flows through:
• Event Processing — Client SDK
• Collection & Queuing — Event Hubs (Kafka-compatible)
• Storage — Data Lake Store (HDFS-compatible) data lake
• Data Science Pipeline Processing — Azure Databricks
• Analytics Platform — Azure Databricks
• Self-service Analysis — in the tool of choice!
• Data Governance — Data Portal, Data Catalog
• Profiles (proprietary & third-party) — Account, Device
• Proprietary Marketing Systems
• Orchestration — Databricks Jobs, Azure Data Factory
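Because Event Hubs exposes a Kafka-compatible endpoint, the collection tier can be consumed with any standard Kafka client, including Spark's Kafka source on Databricks. A minimal sketch of the connection options, assuming a hypothetical namespace and topic (the deck does not show McAfee's actual configuration):

```python
def eventhubs_kafka_options(namespace: str, topic: str,
                            connection_string: str) -> dict:
    """Build Kafka source options for the Event Hubs Kafka endpoint.

    Event Hubs' Kafka-compatible endpoint listens on port 9093 and
    authenticates via SASL PLAIN with the literal username
    "$ConnectionString" and the Event Hubs connection string as password.
    The resulting dict can be passed to a Kafka consumer, e.g.
    spark.readStream.format("kafka").options(**opts).
    """
    return {
        "kafka.bootstrap.servers": f"{namespace}.servicebus.windows.net:9093",
        "subscribe": topic,
        "kafka.security.protocol": "SASL_SSL",
        "kafka.sasl.mechanism": "PLAIN",
        "kafka.sasl.jaas.config": (
            "org.apache.kafka.common.security.plain.PlainLoginModule required "
            f'username="$ConnectionString" password="{connection_string}";'
        ),
    }
```

Keeping the queuing layer Kafka-compatible (and the storage layer HDFS-compatible) is what lets the processing and analysis tiers use off-the-shelf Spark tooling instead of vendor-specific connectors.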
11. Lessons Learned
Don’t compromise on data quality;
you’ll regret it later
Create small, manageable features that
enable quick iteration
Think ahead; don’t do something that will
handicap future development
12. Don’t compromise on data quality;
you’ll regret it later
13. DON’T FORGET TO RATE
AND REVIEW THE SESSIONS
SEARCH SPARK + AI SUMMIT