4. HUGE INVESTMENT IN ENTERPRISE IT & BIG DATA
Companies invested $3-4 trillion in IT over the last 20+ years
And now they are investing billions in “Big Data” and Analytics 3.0...
5. DIRTY LITTLE SECRET: DATA VARIETY IN ENTERPRISE
Most investments are oriented toward some “silo” in the enterprise:
● application
● function
● division
● geography
Data tied to these investments is extremely siloed
6. BIG DATA & ANALYTICS NEED CLEAN + UNIFIED DATA
“Consider the more than $44 billion projected by Gartner to be spent on big data in
2014. The vast majority of it — $37.4 billion — is going to IT services. Enterprise software
only accounts for about a tenth. The disproportionate spending on services is a sign of
immaturity in how we manage data.” - Mahesh S. Kumar, Harvard Business Review
7. TACKLING THE ENTERPRISE DATA SILO PROBLEM
All are necessary but not sufficient to truly address next-gen challenges
● Democratized visualization and modeling - radical consumption heterogeneity
● SemanticWeb/LinkedData - radical source heterogeneity
● Provenance for data to improve reliability
● Rapid iteration/change requires reproducibility from source
● Desire for longitudinal data across many entities
● Need for automated data quality / assurance
Traditional approaches...
● Standardization - worth trying
● Aggregation - yes - but actually makes the problem worse
● Top-down modeling (MDM/ETL) - ok for app-specific or well-defined data
8. THE MYTH OF THE SINGLE TECH VENDOR SOLUTION
“Use my brand and data unification will just happen!”
REALLY?
9. HEALTHCARE/BIOPHARMA IS THE FRONT LINE
The diversity of data and the decentralized nature of healthcare, and specifically of biopharmaceutical research, make our industry the place where next-gen data management will develop.
11. CURATION AT SCALE
Hiring More Data Scientists Makes the Problem Worse
Reality Enterprise RealityGoal
• Manual data collection and preparation
• Long lead time to analyses
• Limited individual view on variety of data
• Extensive rework
• No cohesive view of data efforts
• Expertise across organization underutilized
12. NEW TOOLS ARE NECESSARY
New transformation tools are necessary… but not sufficient to
solve the enterprise data variety problem
Unified View
A few sources...
Thousands of sources
13. SOLUTION: BOTTOM-UP, PROBABILISTIC DATA MODELING & “COLLABORATIVE CURATION”
Time to embrace the reality of extreme data variety
across the entire enterprise - “Unified Data”
Back to the future
● 1990s web: probabilistic search / website connection
● 2020s enterprise: probabilistic data source connection & curation
Requires a bottom-up, probabilistic and collaborative approach to data (complements deterministic)
● Rules for transformation are necessary but not sufficient to solve the broad integration problem
● Mix of 80% probabilistic & 20% deterministic
● Iteratively and systematically engage data experts
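A minimal sketch of how the three bullets above could fit together, assuming a toy setup: a few deterministic rules are applied first, a probabilistic name-similarity score handles the rest, and low-confidence matches are routed to a data expert. The attribute names, the rule table, the 0.8 threshold and the use of Python's `difflib` are all illustrative assumptions, not Tamr's actual method.

```python
from difflib import SequenceMatcher

# Hypothetical unified attributes and deterministic rules (illustrative only).
UNIFIED_ATTRIBUTES = ["patient_id", "birth_date", "compound_name", "assay_result"]
EXACT_RULES = {"pid": "patient_id", "dob": "birth_date"}


def normalize(name):
    """Lower-case a source attribute name and unify its separators."""
    return name.strip().lower().replace(" ", "_").replace("-", "_")


def match_attribute(source_name, accept=0.8):
    """Return (unified_attribute, score, route) for one source attribute.

    route is "rule" (deterministic), "auto" (probabilistic, high score)
    or "expert" (probabilistic, low score: ask a data expert).
    """
    key = normalize(source_name)
    # Deterministic step: a hand-written rule wins when one exists.
    if key in EXACT_RULES:
        return EXACT_RULES[key], 1.0, "rule"
    # Probabilistic step: pick the most similar unified attribute name.
    best, score = max(
        ((target, SequenceMatcher(None, key, target).ratio())
         for target in UNIFIED_ATTRIBUTES),
        key=lambda pair: pair[1],
    )
    route = "auto" if score >= accept else "expert"
    return best, round(score, 2), route


for column in ["DOB", "Patient ID", "cmpd"]:
    print(column, match_attribute(column))
```

Here `DOB` is resolved by a rule, `Patient ID` is matched automatically, and `cmpd` falls below the threshold and is queued for an expert. In a fuller system the expert's answer would feed back into the matcher, which is the iterative part of the approach.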
14. CORE OF TAMR
Machine Learning with Human Insight
Identify sources, understand relationships and curate the massive variety of siloed data
Diagram labels:
● Structured and Semi-structured Data Sources
● Advanced Algorithms & Machine Learning
● Expert Input
● Collaborative Curation
● Data Experts (Source owners)
● Data Stewards and Curators
● Expert Directory
● Data Inventory
● Integrated Data & Metadata
● APIs
● Systems
● Tools
● Data Scientists
15. FORTUNE 5 BIOPHARMA
Challenges
• 7k+ scientists
• Decentralized organization
• Assay data in spreadsheets
• 30k+ tables
• 100k+ unique attributes
• Error detection in units
Tamr Unified View
Thousands of Potential Sources
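The "error detection in units" challenge listed above can be illustrated with a small sketch. It assumes repeated measurements of the same assay, each recorded with a concentration unit, and flags entries whose normalized value sits orders of magnitude away from the median, which often points to a mislabeled unit. The unit table, the function name and the 2.5 log10 threshold are hypothetical choices for illustration, not a description of how this was handled in practice.

```python
import math
from statistics import median

# Conversion factors to nanomolar; the unit set is illustrative.
TO_NANOMOLAR = {"nM": 1.0, "uM": 1e3, "mM": 1e6}


def flag_unit_suspects(measurements, max_log10_gap=2.5):
    """Return indices of measurements that look like unit errors.

    measurements: list of (value, unit) pairs for the same assay and
    compound; values must be positive. An entry is flagged when its
    normalized value differs from the median by at least
    max_log10_gap orders of magnitude.
    """
    logs = [math.log10(value * TO_NANOMOLAR[unit])
            for value, unit in measurements]
    center = median(logs)
    return [i for i, lg in enumerate(logs)
            if abs(lg - center) >= max_log10_gap]


readings = [(12.0, "nM"), (0.015, "uM"), (11.0, "nM"), (14.0, "uM")]
print(flag_unit_suspects(readings))  # [3]
```

The last reading is flagged because 14 uM is about a thousand times larger than its neighbours; it was plausibly measured in nM and labeled with the wrong unit. A flagged entry is only a candidate, so it would still need review by the scientist who owns the spreadsheet.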
16. SOLUTION OVERVIEW: CDISC CONVERSION
The Problem
• Clinical trial data reported in a wide variety of formats, ontologies and standards
• Underspecified attribute names, varying qualities of annotation, duplicate data, etc…
The Solution
• A scalable, replicable way to automatically unify and convert clinical trial data to CDISC format
Benefit
• Tamr technology solves common CDISC problems: schema mapping and expert sourcing
• Faster way to aggregate and report ongoing trial data for regulatory filings
• Simplified reporting for various agency ontologies