2. 2
Salesforce Confidential
About Us
Kamal Duggireddy
Kamal Duggireddy currently leads the Data Engineering, Product Data Science team at Salesforce.com. Prior to this, he served as Director - Big Data Architecture at American Express. Combining deep technical skills with business knowledge and strong execution experience, Kamal developed reference architectures and new enterprise-level capabilities with the Hadoop stack.
Prashant Gokhale
Prashant is currently working on solving big data problems at Salesforce.com using Hadoop and its ecosystem components. Prior to this, he held several critical engineering positions at Yahoo, Cloudera, and Lookout.
4. 4
The Use Case | Flow
Ad-Hoc Requests
Predictive Data Apps
Data Engineering & Curation
Smart Data Dashboards (Salesforce Wave)
Advanced Analysis
Instrumentation
150+ log lines
Hadoop
Data Processing
Traditional Data Warehouses
Dimensions
6. 6
Milestones | Along the way
Reusability
Declarative
Data Lake
Data Dictionary
Self Service
Automation
Security
Visualization
Governance
7. 7
The Framework | Finally!
Datasets (various grain)
Data Lake
Log Processing
Metadata
Flow Engine
WebApp
Self Service
Log Sources
Cloud Metrics
Data Profiler
Data Science
Kafka
Splunk
Files
Warehouse
Objects
Hadoop
Cubes (custom grain)
8. 8
Goals
Scalable: Process hundreds of billions of log lines.
Flexible: Handle thousands of log schemas. Support variable grain and transformations using custom code.
Data Quality: Automated data profiling, monitoring, and alerting.
Self Service: Enable ad-hoc analysis.
9. 9
Key Building Blocks
Log Processing Engine
• Declaratively define features and flows.
• Normalize data across multiple log lines.
• Custom code injection for data transformation.
Data Profiler
• Profile data at scale to detect anomalies.
Web App
• Interface to manage features and flows.
Job Automation Engine
• End-to-end automation from features/flows to curated data sets in Wave.
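The slides do not say how the Data Profiler detects anomalies. As a minimal sketch of one common approach, assuming a per-column profile (row count, null rate, distinct count) and a simple z-score check against recent history, it could look like the following. The function names `profile_column` and `is_anomalous` and the threshold of 3.0 are hypothetical, not taken from the deck.

```python
from statistics import mean, pstdev
from typing import Dict, List, Optional


def profile_column(values: List[Optional[str]]) -> Dict[str, float]:
    """Compute a tiny profile for one column (hypothetical metric set)."""
    total = len(values)
    # Treat None and empty strings as missing values.
    present = [v for v in values if v is not None and v != ""]
    nulls = total - len(present)
    return {
        "rows": float(total),
        "null_rate": nulls / total if total else 0.0,
        "distinct": float(len(set(present))),
    }


def is_anomalous(history: List[float], today: float, threshold: float = 3.0) -> bool:
    """Flag today's metric if it is more than `threshold` standard
    deviations from the mean of past observations (assumed rule)."""
    if len(history) < 2:
        return False  # not enough history to judge
    mu = mean(history)
    sigma = pstdev(history)
    if sigma == 0:
        return today != mu
    return abs(today - mu) / sigma > threshold


profile = profile_column(["a", "b", "a", None, ""])
null_rate_history = [0.010, 0.020, 0.015, 0.012]
print(profile, is_anomalous(null_rate_history, 0.40))
```

A production profiler would compute these metrics in a distributed job rather than in memory; the sketch only shows the shape of the check that monitoring and alerting could hang off.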
10. 10
Log Processing Engine
Feature definitions:
logType=='X' and event=='Create Event' and page=='Home Landing',"Feat 1","eval_code(event.toUpperCase())",page,…..
logType=='ABC' and event=='Create Event' and page=='Home',"Feat 2","eval_code(event.substring(5))",event,…..
Usage log files + feature definitions → Hive tables
Data Normalization, Data Cleansing, Data Transformation
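The feature definitions above pair a filter expression with a feature name, a code snippet, and pass-through fields. A minimal Python sketch of that idea follows, assuming each log line has already been parsed into a dictionary. The `FeatureDef` class, the `extract_features` function, and the output column names are hypothetical; Python callables stand in for the engine's own expression and `eval_code` syntax, which the slide shows only in part.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, Iterator, Sequence

LogLine = Dict[str, str]  # one parsed log line: field name -> value


@dataclass(frozen=True)
class FeatureDef:
    """Hypothetical in-memory form of one feature definition row."""
    name: str
    matches: Callable[[LogLine], bool]   # the row's filter expression
    transform: Callable[[LogLine], str]  # stands in for eval_code(...)
    passthrough: Sequence[str]           # fields copied through unchanged


FEATURES = (
    FeatureDef(
        name="Feat 1",
        matches=lambda r: r.get("logType") == "X"
        and r.get("event") == "Create Event"
        and r.get("page") == "Home Landing",
        transform=lambda r: r["event"].upper(),  # event.toUpperCase()
        passthrough=("page",),
    ),
    FeatureDef(
        name="Feat 2",
        matches=lambda r: r.get("logType") == "ABC"
        and r.get("event") == "Create Event"
        and r.get("page") == "Home",
        transform=lambda r: r["event"][5:],  # event.substring(5)
        passthrough=("event",),
    ),
)


def extract_features(
    lines: Iterable[LogLine], features: Sequence[FeatureDef] = FEATURES
) -> Iterator[Dict[str, str]]:
    """Yield one normalized row per (log line, matching feature) pair."""
    for line in lines:
        for feat in features:
            if feat.matches(line):
                row = {"feature": feat.name, "value": feat.transform(line)}
                row.update({f: line.get(f, "") for f in feat.passthrough})
                yield row


sample = [
    {"logType": "X", "event": "Create Event", "page": "Home Landing"},
    {"logType": "ABC", "event": "Create Event", "page": "Home"},
    {"logType": "X", "event": "Delete Event", "page": "Home Landing"},
]
for row in extract_features(sample):
    print(row)
```

Keeping the definitions as data rather than hard-coded branches is what lets new features be added without touching the processing loop, which matches the declarative goal stated on the slide.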
12. 12
Everything put together
Datasets (various grain)
Data Lake
Log Processing
Metadata
Flow Engine
WebApp
Self Service
Log Sources
Cloud Metrics
Data Profiler
Data Science
Kafka
Splunk
Files
Warehouse
Objects
Hadoop
Cubes (custom grain)
13. 13
Data Volumetrics
Totals:
Avg. volume of app logs processed (compressed): 100s of TB/month
Avg. number of jobs: 6,000+/month
Avg. log size/volume growth rate: A lot!
Number of log record types: 1,000s
Number of fields: 10s of 1,000s
Events per day: 200+ B
Features: 500+
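To put the daily figures in per-second terms, a quick back-of-the-envelope calculation helps. It uses the slide's lower bounds (200 B events/day, 6,000 jobs/month) and assumes a 30-day month and an even spread over the day, so real peaks would be higher.

```python
# Lower bounds taken from the slide; actual values are "200+ B" and "6000+".
EVENTS_PER_DAY = 200_000_000_000
JOBS_PER_MONTH = 6_000

SECONDS_PER_DAY = 24 * 60 * 60  # 86,400
DAYS_PER_MONTH = 30             # assumption for a rough average

events_per_second = EVENTS_PER_DAY / SECONDS_PER_DAY
jobs_per_day = JOBS_PER_MONTH / DAYS_PER_MONTH

print(f"{events_per_second:,.0f} events/second")  # 2,314,815 events/second
print(f"{jobs_per_day:,.0f} jobs/day")            # 200 jobs/day
```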