9. Queries: SQL v. Gremlin
-- SQL: posts sent by 'keith', with each recipient's name
select p2.name, p1.post
from posts p1
inner join persons p2
  on p1.to_id = p2.id
where p1.from_id in (select id from persons where name = 'keith');
// Gremlin: the equivalent traversal over outgoing 'posts' edges
graph.traversal().V().has('name', 'keith').outE('posts')
  .as('msg').inV().as('who')
  .select('msg', 'who')
  .by('comment').by('name')
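A minimal sketch of the comparison above, assuming a tiny made-up data set (the people "ana" and "raj" and all message texts are hypothetical). The first half runs the SQL query as written against an in-memory SQLite database, with an added `order by` only to make the output deterministic. The second half mimics the Gremlin steps over plain Python dictionaries; it illustrates the traversal shape and is not how a graph database executes it.

```python
import sqlite3

# Relational side: the two tables the SQL query above expects.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table persons (id integer primary key, name text);
    create table posts (from_id integer, to_id integer, post text);
    insert into persons values (1, 'keith'), (2, 'ana'), (3, 'raj');
    insert into posts values
        (1, 2, 'hello ana'),
        (1, 3, 'hello raj'),
        (2, 1, 'hi keith');
""")

query = """
    select p2.name, p1.post from posts p1
    inner join persons p2 on p1.to_id = p2.id
    where p1.from_id in (select id from persons where name = 'keith')
    order by p2.name
"""
rows = conn.execute(query).fetchall()

# Graph side: vertices keyed by id, 'posts' edges carrying a 'comment' property.
vertices = {1: {"name": "keith"}, 2: {"name": "ana"}, 3: {"name": "raj"}}
edges = [
    {"label": "posts", "out": 1, "in": 2, "comment": "hello ana"},
    {"label": "posts", "out": 1, "in": 3, "comment": "hello raj"},
    {"label": "posts", "out": 2, "in": 1, "comment": "hi keith"},
]

def posts_from(name):
    """Follow outgoing 'posts' edges from every vertex with the given name."""
    # V().has('name', name)
    start = {vid for vid, props in vertices.items() if props["name"] == name}
    # outE('posts').as('msg').inV().as('who').select(...).by('comment').by('name')
    return [
        {"msg": e["comment"], "who": vertices[e["in"]]["name"]}
        for e in edges
        if e["label"] == "posts" and e["out"] in start
    ]

print(rows)
print(posts_from("keith"))
```

Both halves return the two messages keith sent, paired with the recipient's name. The SQL version needs a join plus a subquery, while the traversal walks from the vertex to its edges directly.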
18. Modern life science R&D is challenging
● $50B “Cottage industry” now globalized and highly collaborative
○ Distributed teams
○ Universities, clinicians, non-profit labs, CROs, Biotech, Pharma
● Petabytes (PB) of data, millions of experiments per year
● Science is complicated
R&D organizations are expected to produce efficient pipelines from
academic research to clinical development
19. 70% of data collected annually in life science goes “dark”: inaccessible, undiscoverable, or unusable
20. $35B of data collected annually in life science goes “dark”: inaccessible, undiscoverable, or unusable
23. A scientific data layer stops data from going dark
● Secure cloud storage (HIPAA, 21 CFR 11)
● Metadata tied to files
● File/data provenance across collaborators and analyses
● Integrated annotation, chat
● Low threshold: continue to use preferred capture, analysis tools
25. Social+Data graph
● Real-time
● Structural information: projects, experiments, people
● High-information events
○ Researcher annotation
○ Communication
○ File selection
26. Global Social+Data graph
● 350,000 researchers
● O(100B) files
● Average academic researcher writes 1 paper per year with 3 other colleagues in more than one country
● k=8
● 40,000 users to a fully connected graph
27. Assisting R&D organizations to mobilize idle assets
1. Find relevant internal experts
2. Recommend existing, relevant data (and the resources to utilize it)
3. Identify the best external resources and opportunities
4. Organizational analytics
a. Who are the effective collaborators?
b. Which are the most valuable data sets?
29. Bi-cluster to identify relevant groups
● Shuffle rows & columns of the matrix to minimize loss (spectral, information, etc.)
● Well-studied in bioinformatics (not that different) and text classification
● NP-complete
● Clusters allow us to look up in both directions
○ User → Data
○ Data → Users
○ (Users → Users)
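A rough sketch of the two-way lookup that bi-clusters enable, assuming a hypothetical user-by-file access matrix given as (user, file) pairs. It uses the simplest possible stand-in for bi-clustering: connected components of the bipartite user/file graph. This recovers only perfectly separated blocks; it is not the spectral or information-theoretic loss minimization the slide refers to, and real data would rarely separate this cleanly. All names and helper functions are illustrative.

```python
from collections import defaultdict

# Hypothetical non-zero entries of the user-by-file matrix.
accesses = [
    ("ana", "assay_01.csv"),
    ("ana", "assay_02.csv"),
    ("raj", "assay_02.csv"),
    ("mei", "scan_17.tif"),
    ("tom", "scan_17.tif"),
    ("tom", "scan_18.tif"),
]

def bicluster(pairs):
    """Group users and files into blocks (connected components of the bipartite graph)."""
    neighbors = defaultdict(set)
    for user, data in pairs:
        neighbors[("user", user)].add(("data", data))
        neighbors[("data", data)].add(("user", user))
    seen, clusters = set(), []
    for node in sorted(neighbors):  # sorted for a deterministic cluster order
        if node in seen:
            continue
        stack, users, files = [node], set(), set()
        seen.add(node)
        while stack:  # depth-first walk over one component
            kind, name = stack.pop()
            (users if kind == "user" else files).add(name)
            for nxt in neighbors[(kind, name)]:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        clusters.append((users, files))
    return clusters

clusters = bicluster(accesses)
user_cluster = {u: i for i, (users, _) in enumerate(clusters) for u in users}
data_cluster = {d: i for i, (_, files) in enumerate(clusters) for d in files}

def data_for_user(user):
    """User -> Data: every file in the user's cluster."""
    return sorted(clusters[user_cluster[user]][1])

def users_for_data(data):
    """Data -> Users: every user in the file's cluster."""
    return sorted(clusters[data_cluster[data]][0])

def peers_of(user):
    """Users -> Users: other members of the same cluster."""
    return sorted(clusters[user_cluster[user]][0] - {user})

print(data_for_user("raj"))
print(users_for_data("scan_18.tif"))
print(peers_of("ana"))
```

In this toy data "raj" never opened assay_01.csv, yet the cluster surfaces it as relevant because "ana" works on both assay files. The same grouping answers the reverse question of who is likely to care about a given file.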