Big data supporting drug discovery - cautionary tales from the world of chemistry for translational informatics
1. Big Data Supporting Drug Discovery: Cautionary Tales from the World of Chemistry for Translational Informatics
Valery Tkachenko
RSC-CSIR/OSDD meeting
Pune, India
February 3rd 2014
2. Big Data
Chemical Space
Drug Discovery pipeline
Machine learning
Training sets
RSC/ChemSpider platforms
RSC/Archive
Research data management
Data quality, crowdsourcing and AltMetrics
Building Global Chemistry Network
18.
• ~30 million chemicals and growing
• Data sourced from >500 different sources
• Crowdsourced curation and annotation
• Ongoing deposition of data from our journals and our collaborators
• A structure centric hub for web-searching
39. Research data inflow
All databases are sliced by data sources/data collections and have a simple security model where each data slice/source is private, public or embargoed.
Deposition Gateway inputs: Web UI for unified depositions; API, FTP, etc; DropBox, Google Drive, SkyDrive, etc; LabTrove and other templated data
Raw data modules: Compounds Module, Reactions Module, Spectra Module, Materials Module, Textmining Module, … Module
Staging databases
Validated data: Compounds, Reactions, Spectra, Materials, Documents, Articles / CSSP
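The private/public/embargoed security model for data slices can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and is not taken from the RSC platform: the names `DataSlice`, `Visibility` and `can_read`, the rule that an owner can always read their own slice, and the idea that an embargo lifts on a fixed date.

```python
from dataclasses import dataclass
from datetime import date
from enum import Enum
from typing import Optional


class Visibility(Enum):
    PRIVATE = "private"
    PUBLIC = "public"
    EMBARGOED = "embargoed"


@dataclass(frozen=True)
class DataSlice:
    """One data source/collection with a single visibility setting (hypothetical)."""
    name: str
    owner: str
    visibility: Visibility
    embargo_until: Optional[date] = None  # only meaningful for EMBARGOED slices


def can_read(data_slice: DataSlice, user: str, today: date) -> bool:
    """Return True if `user` may read the slice on the given day."""
    # Assumption: the depositor always keeps access to their own slice.
    if user == data_slice.owner:
        return True
    if data_slice.visibility is Visibility.PUBLIC:
        return True
    if data_slice.visibility is Visibility.EMBARGOED:
        # Assumption: an embargoed slice becomes readable once its date is reached.
        return (data_slice.embargo_until is not None
                and today >= data_slice.embargo_until)
    return False  # PRIVATE: owner only


spectra = DataSlice("thesis-spectra", "alice", Visibility.EMBARGOED, date(2014, 6, 1))
print(can_read(spectra, "bob", date(2014, 2, 3)))
print(can_read(spectra, "bob", date(2014, 6, 1)))
```

Attaching the visibility to the slice rather than to individual records keeps the check to a single lookup per data source, which matches the "simple security model" wording on the slide.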
40. Research data outflow
User interface tier (examples): Paid 3rd party integrations (various platforms: SharePoint, Google, etc); Electronic Laboratory Notebook; Analytical Laboratory application; Chemical Inventory application
User interface components tier: Compounds Widgets, Reactions Widgets, Spectra Widgets, Materials Widgets, Documents Widgets
Data access tier: Compounds API, Reactions API, Spectra API, Materials API, Documents API
Data tier: Compounds, Reactions, Spectra, Materials, Documents
45. It is so difficult to navigate…
IP?
What’s the structure?
Are they in our file?
What’s similar?
Pharmacology data?
What’s the target?
Known Pathways?
Competitors?
Connections to disease?
Working On Now?
Expressed in right cell type?
46. Data quality issue and CVSP
– Robochemistry
– Proliferation of errors in public and private databases
– Automated quality control system
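As an illustration of what a single automated quality control rule might look like, the sketch below flags records whose declared mass disagrees with the mass computed from the molecular formula. It is not CVSP's rule set: the record layout, the function names `formula_mass` and `check_record`, the tolerance and the small element table are all assumptions chosen for the example.

```python
import re
from typing import List

# Average atomic masses (g/mol) for a few common elements; extend as needed.
ATOMIC_MASS = {"H": 1.008, "C": 12.011, "N": 14.007,
               "O": 15.999, "S": 32.06, "Cl": 35.45}

_TOKEN = re.compile(r"([A-Z][a-z]?)(\d*)")


def formula_mass(formula: str) -> float:
    """Mass of a flat formula such as 'C9H8O4' (no brackets, charges or isotopes)."""
    total = 0.0
    consumed = 0
    for match in _TOKEN.finditer(formula):
        symbol, count = match.groups()
        if symbol not in ATOMIC_MASS:
            raise ValueError(f"unknown element {symbol!r}")
        total += ATOMIC_MASS[symbol] * (int(count) if count else 1)
        consumed += len(match.group(0))
    # Reject empty input and any characters the tokenizer skipped over.
    if not formula or consumed != len(formula):
        raise ValueError(f"cannot parse formula {formula!r}")
    return total


def check_record(record: dict, tolerance: float = 0.05) -> List[str]:
    """Return a list of human-readable issues; an empty list means no rule fired."""
    issues = []
    if not record.get("name", "").strip():
        issues.append("missing name")
    try:
        expected = formula_mass(record.get("formula", ""))
    except ValueError as exc:
        issues.append(str(exc))
        return issues
    declared = record.get("mass")
    if declared is None:
        issues.append("missing mass")
    elif abs(declared - expected) > tolerance:
        issues.append(
            f"declared mass {declared} differs from formula mass {expected:.3f}")
    return issues


print(check_record({"name": "aspirin", "formula": "C9H8O4", "mass": 151.16}))
```

Running rules like this at deposition time is one way to stop a bad record before it is copied between databases, which is the error proliferation the slide warns about.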
49. Research data management
Stakeholders: Scientists, Funding bodies, External clients, Publishers
Data Repository: Indexes; indexed storage; Chemically intelligent services; Data; Data Repository provided data storage
Sites: University 1, University 2, Company 3 (each with a Data Hub and Workstations)
53. RSC/Rewards and Recognition
The First Step badge is awarded when a user submits (& has published) their 1st CSSP article.
Congratulations! Your 1st CSSP article has been published. Philosopher Lao Tzu said “A journey of a thousand miles begins with a single step”. In the same way we hope that this will be the first of many submissions that you make to CSSP.
54. Big Data
Chemical Space
Drug Discovery pipeline
Machine learning
Training sets
RSC/ChemSpider platforms
RSC/Archive
Research data management
Visualization and navigation
Building Global Chemistry Network
62. http://www.openphacts.org
Open PHACTS is an Innovative Medicines Initiative (IMI) project, aiming to reduce the barriers to drug discovery in industry, academia and for small businesses.
The Semantic Web is one of the cornerstones.