1) The document discusses how Zephyr Health is solving the problem of disconnected healthcare data by building a platform that ingests and integrates data from various sources using algorithms and MongoDB.
2) It organizes data into entity-centric profiles and uses a graph-based index to allow complex queries across the integrated data.
3) The platform powers various analytical applications that help address real business problems by leveraging the integrated data in a standardized way.
Solving the Disconnected Data Problem in Healthcare Using MongoDB
1. Solving the Disconnected Data Problem in Healthcare Using MongoDB
A MongoSF talk – December 3rd 2014
Sven Junkergård - CTO
2. ME
I am a reformed consultant who used to do architecture consulting…
• MSc Computer Science and Engineering – Chalmers University
of Technology in Gothenburg
• AMS, Capgemini
• Cake Financial – aggregating retail investor portfolios and
generating investment insights from the best of the best
• Billfloat – novel financial credit product with highly differentiated
underwriting method
• Zephyr Health – built out technology and engineering team to
deliver on a big vision – integrate disconnected data in
healthcare and solve real problems. Now CTO.
3. WHO I WORK FOR – ZEPHYR HEALTH
ORGANIZATIONAL EXPERTISE
• Life Sciences
• Brand Management
• Big Data
• Applied Mathematics
• Algorithms
• IaaS | SaaS | PaaS
• Machine Learning
• Artificial Intelligence
• Statistics & Modeling
• Data Science
• Visualization
• App Development
OFFICE LOCATIONS
• San Francisco
• London
• India
CURRENT CLIENTS
Include members of:
• GLOBAL TOP 5 BIOPHARM
• GLOBAL TOP 5 PHARM
• GLOBAL TOP 5 MEDICAL DEVICES
OUR FOCUS
• Organize disconnected data in healthcare and life science
• Visualize the combination of heterogeneous data sources in analytical problems
• Solve important and challenging problems for our customers
4. SOLVING THE VARIETY PROBLEM
V: Healthcare example
• Volume: Genomic sequencing
• Velocity: Streaming device data
• Variety: Understanding healthcare landscape and treatment effectiveness (Internal, Vendor, Public)
• Visualization: Providing relevant and powerful visualizations that provide real insights (Data trends)
Image sources: illumina and iRhythm
5. WHY HEALTHCARE DATA IS A DIFFERENT WORLD ENTIRELY
Loan application decision
• Applicant demographics, bank account, credit report, identity check and income verification are all keyed on SSN
Clinical trial investigator decision
• Entities: Investigator, Site, Patients
• Data: Research, Published trials, Current sponsored trials, Prescriptions, Claims, Funding, Network leadership, Site profile, Site certification, Site statistics
• Inconsistent or missing keys
6. THE TYPES OF PROBLEMS THAT CAN BE SOLVED WITH INTEGRATED DISPARATE DATA
Problem: What is it?
• Site selection: Finding the right locations to house clinical trials
• Trial outcomes: Visualizing data from different sources within clinical trials
• Medical expertise communication: Identifying the healthcare professionals with the right expertise
• Scoring and ranking: Finding the top ranking healthcare professionals or institutions for a particular purpose
• Network leadership analysis: Understanding who is connected to whom and how information is disseminated
• Care delivery effectiveness: Identifying areas of great or poor performance and the underlying reason
• Patient outcomes: Relating patient outcomes to specific market activities
• Health economics: Understanding the financial effectiveness of an intervention or of introducing a new standard of care
7. DATA CATEGORIES AND EXAMPLES
Creating a complete picture requires combining disconnected data from an enormous variety of sources
Internal
• Examples: CRM, Trials, Payments, Sales, Partners, Speakers
• Keys: Controlled
• Formats: Spreadsheets (structured)
Vendors
• Examples: Rx, Claims, Referral patterns, Primary research, Consulting
• Keys: Vendor specific
• Formats: Flat files
Public
• Examples: Providers, Grants, Public trials, Research
• Keys: Anything and nothing
• Formats: Anything
Managing data variety is the key to solving the problem
8. A DIFFERENT PROBLEM REQUIRES A DIFFERENT SOLUTION
ETL → DW → DM → OLAP Cube → BI → Insight
Instead…
• A different data model based on descriptive meta data
• A non-traditional data store
• Something other than Informatica
• Automated intelligent algorithms
• A few special tricks
• An API
• Some really great applications...
9. ENTITY CENTRIC DATA MODEL
Traditional, relational model: Entity table, Data source 1, Data source 2, … Data source n
Entity centric model: Entity with Attributes (repeated per entity), Meta data
10. ONTOLOGY-BASED DEVELOPMENT
Requirements
• Flexible
• Extensible and adaptive
• Easy to maintain
Solution
• Ontology: used to formally represent knowledge within a
domain
• Vocabulary: Collection of entities, attributes, relationships
that provides context within the domain
• Taxonomy (Classification): A hierarchical collection of
controlled terms from vocabulary
11. VOCABULARY
• Entities: Real world things or events. E.g. Institution, patient, sales, potential, etc.
• Organic Attributes: Data points coming from datasets. E.g. first_name, age, revenue, date, etc.
• Entity Relationships: Relationships between different entities
• Derived Attributes: Processed key-value pairs from existing organic and/or derived attributes
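The four vocabulary building blocks can be pictured as a small registry. This is only a sketch under stated assumptions: the `Vocabulary` class, its field names and the `age_band` and `treated_at` terms are hypothetical illustrations, not the Zephyr data model.

```python
from dataclasses import dataclass, field


@dataclass
class Vocabulary:
    """Hypothetical registry of the terms a domain is allowed to use."""
    entities: set = field(default_factory=set)
    # organic attribute name -> entity type it describes
    organic_attributes: dict = field(default_factory=dict)
    # derived attribute name -> names of the attributes it is computed from
    derived_attributes: dict = field(default_factory=dict)
    # (source entity, relation name, target entity)
    relationships: set = field(default_factory=set)

    def describe(self, attribute):
        """Classify an attribute name as organic, derived or unknown."""
        if attribute in self.organic_attributes:
            return "organic"
        if attribute in self.derived_attributes:
            return "derived"
        return "unknown"


vocab = Vocabulary(
    entities={"institution", "patient"},
    organic_attributes={"first_name": "patient", "age": "patient"},
    derived_attributes={"age_band": ["age"]},  # made-up derived attribute
    relationships={("patient", "treated_at", "institution")},  # made-up relation
)
print(vocab.describe("age"), vocab.describe("age_band"), vocab.describe("revenue"))
```

Keeping the vocabulary as data rather than as schema is what lets new attributes appear without a migration: an unknown name is simply reported as unknown until someone registers it.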
12. WHY MONGODB?
Our requirements
• Extremely flexible data storage
• Low cost of evolving schema
• Highly performant for complex joins, recursive queries, etc.
• Scalable to large volumes of connected information
MongoDB:
• Document store is a great fit for storing arbitrary information
• Key-value pairs in JSON format (allowed for both adding data traceability and cheap data evolution)
• Secondary indexes and strict consistency
• Map-reduce functionality
Challenges:
• Queries are powerful but not easy to write
• We needed complex joins across arbitrary information (how do you create an index on an attribute you do not know about ahead of time?)
13. DATA ORGANIZATION
Diagram labels: dataset; dataset_; File; Raw records; Info; Data; Full Profile; Main Profile; Entity Relationships; Attribute References; Identity Section; Attributes (Organic + Derived); Geo locations
14. DATA INTEGRATION
{
  "first_name": "Charles",
  "last_name": "Morris",
  "street": "200 First St.",
  "city": "Rochester",
  "state": "MN",
  "zip": "55905",
  "phone": "802-555-1234",
  "email": "cmorris@mayoclinic.com",
  "headshot": "<AF6713…>",
  "thought_leader_score": 8,
  "pub_count": 203
}
DISPARATE SOURCES OF INFORMATION → STRUCTURED PROFILE → APPLICATION REPRESENTATION
All enabled through a series of data integration algorithms
15. ALGORITHM EXAMPLES
• Record linkage: Do the two records below refer to the same person?
  Record A: C Morris, Heart and Vascular Center, 123 Main St, Rochester, MN 55903, 802-555-9988
  Record B: Charles “Chuck” Morris, Cardiologist, 200 First St., Rochester, MN 55905, 802-555-1234, cmorris@mayoclinic.com
• Disambiguation: Automatically choosing the most authoritative version of an attribute
• Dataset identification: Maximizing re-use of meta data describing imported data sets
• Clustering: Pre-calculating clusters in weakly attributed data
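To make the record linkage question concrete, here is a minimal sketch of weighted fuzzy matching on the two Morris records. It is not the algorithm used in the platform, which the slides do not describe: the field weights, the `link_score` helper and the unrelated third record are all assumptions made for illustration.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Case-insensitive string similarity between 0.0 and 1.0."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def link_score(rec_a, rec_b, weights):
    """Weighted average similarity; a field missing on either side scores zero."""
    total = sum(weights.values())
    score = 0.0
    for field_name, weight in weights.items():
        a, b = rec_a.get(field_name), rec_b.get(field_name)
        if a and b:
            score += weight * similarity(a, b)
    return score / total


# Hypothetical weights: the name counts most, state least.
WEIGHTS = {"name": 5, "city": 2, "state": 1, "phone": 2}

record_a = {"name": "C Morris", "city": "Rochester", "state": "MN", "phone": "802-555-9988"}
record_b = {"name": "Charles Morris", "city": "Rochester", "state": "MN", "phone": "802-555-1234"}
# Made-up unrelated record, used only as a contrast case.
record_c = {"name": "Tom Smith", "city": "San Francisco", "state": "CA", "phone": "415-555-0000"}

print(link_score(record_a, record_b, WEIGHTS))
print(link_score(record_a, record_c, WEIGHTS))
```

A real linker would also need a decision threshold and blocking to avoid comparing every pair, both of which are left out here.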
17. ADDING ADDITIONAL ATTRIBUTES
Dataset 1
NPI | FirstName | LastName | Specialty
1 | Tom | Smith | Cardiologist
Dataset 2
NPI | Institution | ClinicalTrial Name | Start Date | End Date
1 | UCSF Medical Center | Heart Valve Clinical Trial | 01/01/2011 | 03/25/2013
Resulting profile
{
  "_id": "53bcf9cae4b03f352d4b47c7",
  "identity": {"npi": "1",
    "specialty": ["Cardiologist"],
    "first_name": "Tom",
    "last_name": "Smith"},
  "attributes": {
    "npi": {1},
    "first_name": {"Tom"},
    "last_name": {"Smith"},
    "specialty": {"Cardiologist"},
    "institution": {"UCSF Medical Center"},
    "clinical_trial": {"Heart Valve Clinical Trial"},
    "start_date": {"01/01/2011"},
    "end_date": {"03/25/2013"}
  }
}
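The way a second dataset adds attributes to an existing profile can be sketched in a few lines. This is an illustration under assumptions: the `merge_rows` helper, the dataset names and the `{"value", "source"}` shape used for traceability are hypothetical and are not taken from the platform.

```python
def merge_rows(identity_key, datasets):
    """Fold rows from several datasets into one profile per identity value.

    Every attribute keeps a list of observations so each value stays
    traceable to the dataset it came from (an assumed layout).
    """
    profiles = {}
    for source, rows in datasets.items():
        for row in rows:
            key = row[identity_key]
            profile = profiles.setdefault(
                key, {"identity": {identity_key: key}, "attributes": {}}
            )
            for name, value in row.items():
                profile["attributes"].setdefault(name, []).append(
                    {"value": value, "source": source}
                )
    return profiles


datasets = {
    "providers": [
        {"npi": "1", "first_name": "Tom", "last_name": "Smith", "specialty": "Cardiologist"}
    ],
    "trials": [
        {"npi": "1", "institution": "UCSF Medical Center",
         "clinical_trial": "Heart Valve Clinical Trial",
         "start_date": "01/01/2011", "end_date": "03/25/2013"}
    ],
}

profiles = merge_rows("npi", datasets)
print(sorted(profiles["1"]["attributes"]))
```

Because attributes are keyed by name at run time, the second dataset introduces `institution` and `clinical_trial` without any schema change.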
18. TRICKS TO TAME THE WILD DATA
• Ontology – how we keep track of all ingested information
• Vocabulary – bringing structure to large variety of information
• Derived attributes – encapsulate complexity
• GIS transformations – practical integration of geo data
• Indexing – fast access to complex information in MongoDB
19. DERIVED ATTRIBUTES
What’s the problem?
• Data is rarely clean and business rules are
complex
What are we doing about it?
• Use existing (organic) attributes and apply
rules to generate new (derived) attributes
• Derived attributes generated through
queries or map-reduce jobs
Why it matters
• Too complex and expensive to consider all
business rules at run-time with every query
• Hides the complexity and introduces
uniformity
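As a rough illustration of precomputing a derived attribute from organic ones, the sketch below turns a trial's start and end dates into a duration. The `trial_duration_days` attribute, the rule function and the `apply_rules` helper are assumptions for the example; the slides do not name any specific rule.

```python
from datetime import datetime


def derive_trial_duration_days(attributes):
    """Hypothetical business rule: trial length in days from two organic dates."""
    start = datetime.strptime(attributes["start_date"], "%m/%d/%Y")
    end = datetime.strptime(attributes["end_date"], "%m/%d/%Y")
    return (end - start).days


def apply_rules(profile, rules):
    """Return a copy of the profile with derived attributes computed once, ahead of query time."""
    derived = {name: rule(profile["attributes"]) for name, rule in rules.items()}
    return {**profile, "derived": derived}


profile = {"attributes": {"start_date": "01/01/2011", "end_date": "03/25/2013"}}
enriched = apply_rules(profile, {"trial_duration_days": derive_trial_duration_days})
print(enriched["derived"])
```

Queries can then filter on `trial_duration_days` directly instead of re-parsing dates on every request, which is the run-time cost the slide refers to.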
20. GEOSPATIAL MAPPING APPROACH FOR AWKWARD GEO DATA
Using traditional method
Reporting unit → Postal codes (e.g. 70173) → Mapping + calculations → District (Stuttgart) → State (Baden-Württemberg)
• Additional challenges with mismatches between reporting unit postal codes and mapping postal codes
• Have to compensate for missing postal codes
• Split patients or metrics across multiple regions when reporting unit spans multiple regions
Using geospatial method
Geocoded reporting unit → Mapping + calculation → District (Stuttgart) → State (Baden-Württemberg)
• Requires determining a single central point for each reporting unit
• Uses no mapping documents
• No compensatory calculations required
• Overall accuracy increases
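The geospatial method boils down to asking which region polygons contain a reporting unit's central point. Below is a minimal sketch using the standard ray casting test. The region shapes and coordinates are made-up abstract units rather than real geography, and the function names are hypothetical.

```python
def point_in_polygon(point, polygon):
    """Ray casting: count how many polygon edges a ray going right from the point crosses."""
    x, y = point
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Only edges that straddle the horizontal line through the point can be crossed.
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def regions_for(point, regions):
    """Names of all regions (district, state, ...) that contain the point."""
    return [name for name, polygon in regions.items() if point_in_polygon(point, polygon)]


# Made-up rectangles standing in for a district nested inside a state, plus a neighbor.
regions = {
    "district_a": [(0, 0), (4, 0), (4, 4), (0, 4)],
    "state_a": [(0, 0), (10, 0), (10, 10), (0, 10)],
    "state_b": [(10, 0), (20, 0), (20, 10), (10, 10)],
}

reporting_unit_center = (2, 3)
print(regions_for(reporting_unit_center, regions))
```

With one central point per reporting unit, each unit lands in exactly one district and one state, so no postal code mapping table or metric splitting is needed.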
21. INDEXING
Why MongoDB alone does not get it done
• Cross collection queries required for large number of scenarios
• Indexing challenges when dealing with unknown information
What we did
• Graph based index
• Entities and attributes are nodes
• Entity – attribute ownership and entity to entity relationships are edges
How we use it
• zQueries allow us to do complex
queries from web front ends
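A toy version of such a graph index, with attribute values and entities as nodes, might look like the sketch below. It only illustrates the idea of indexing attributes that are not known ahead of time. The `GraphIndex` class and its methods are hypothetical and say nothing about how the real index or zQuery is implemented.

```python
from collections import defaultdict


class GraphIndex:
    """Toy in-memory graph: attribute nodes point to owning entities, entities to related entities."""

    def __init__(self):
        self.owners = defaultdict(set)     # (attribute name, value) node -> entity ids that own it
        self.neighbors = defaultdict(set)  # entity id -> related entity ids

    def add_attribute(self, entity_id, name, value):
        # Any attribute name is accepted, so nothing has to be declared in advance.
        self.owners[(name, value)].add(entity_id)

    def add_relationship(self, a, b):
        self.neighbors[a].add(b)
        self.neighbors[b].add(a)

    def find(self, name, value):
        """Entities owning the given attribute value."""
        return set(self.owners.get((name, value), set()))

    def find_related(self, name, value, related_name, related_value):
        """Entities matching one attribute that are linked to an entity matching another."""
        targets = self.find(related_name, related_value)
        return {e for e in self.find(name, value) if self.neighbors.get(e, set()) & targets}


index = GraphIndex()
index.add_attribute("hcp-1", "specialty", "Cardiologist")
index.add_attribute("hcp-2", "specialty", "Cardiologist")
index.add_attribute("inst-1", "name", "UCSF Medical Center")
index.add_relationship("hcp-1", "inst-1")

print(index.find_related("specialty", "Cardiologist", "name", "UCSF Medical Center"))
```

The cross-entity lookup at the end is the kind of join that would otherwise require a cross-collection query in the document store.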
22. THE ZEPHYR PLATFORM
100,000,000+ data points ingested and indexed each year
Disconnected Data → Apps for Life Sciences
• Sources: Internal, Vendor, Public
• Algorithm Driven Data Ingestion, Synchronization
• Ontology Driven Data Tier: Data Organized in Connected Profile Documents, Graph Based Materialized Query Index
• Proprietary REST API, zQuery
23. CONSUMING INTEGRATED DISPARATE DATA
Analytical applications use the zAPI and the ontology to produce applications that adapt to changing data
Zephyr Platform
• Ontology Driven Data Store
• API
REST API
• Exposes both data and the ontology
zQueries
• JSON based query language for queries against dynamic and connected data
Analytical Apps
• Functional Focus: Solving specific business problems with focused apps
• Design: Single page apps with targeted data visualizations
24. TARGETED ANALYTICAL APPLICATIONS
Apps for real business problems leveraged by everyday business users
• Illuminate
• Voyager
• Kaleidoscope
• Lighthouse
26. LEARNINGS
• There was no one technology or one database that provided a complete solution: embrace diversity
• Create a generic platform, pour effort into specialized algorithms to populate data intelligently
• Ontology driven development can be very powerful, but data organization is still a challenge
• Indexing on a priori unknown attributes is challenging
• Data modeling is always important: large profiles had to be broken down
27. SUMMARY
Wrapping it all up in five points
1. Healthcare is different and has lots of critical data that is disconnected
2. Generic, MongoDB-based data storage model using meta-data
3. Data integration powered by algorithms
4. Document profiles for facts, graph for querying
5. Diverse set of end user analytical applications powered by the generic data
platform
Why this matters
• Standards are really important, but slow to develop
• Huge amount of change occurring in our healthcare system
• We need to make decisions today based on available data sets despite existing
challenges
28. THANK YOU!
Brian Roy – Strategy and architecture
Mahesh Chaudhari – Database architecture
Cesar Arevalo – Data integration implementation
The guys that made all of it come together!
29. Zephyr Health
CONTACT INFORMATION
450 Mission St. Suite 201
San Francisco, California 94105
+1.415.529.7649
zephyrhealth.com
Sven Junkergård
CTO
+1.415.503.7412
sven@zephyrhealth.com
The company is focused on Big Data in Life Sciences. Commercial, Medical Affairs and other stakeholders work with many different data points when planning their strategic and tactical initiatives. These originate from primary market research, CRM systems, the conferences they attend, Ad Boards, etc. Zephyr Health focuses on putting all of this data together in a meaningful way for its clients.
Zephyr Health is a technology company but what differentiates it from the competition is state-of-the-art big data technology coupled with deep Life Sciences domain expertise. Zephyr currently works with top 5 global Pharma, Biotech and Med Device companies.
Market focus on volume and velocity
A ton of companies focus on Volume today
Mention different technologies in each category
Variety still a wide open and unsolved problem
In reality, many, many challenges exist in this category