Introduction to Chemoinformatics

Introduction to Chemoinformatics

Dr. Igor V. Tetko
Helmholtz Zentrum München - German Research Center for Environmental Health (GmbH)
Institute of Bioinformatics & Systems Biology (HMGU)

Kyiv, 13 August 2010, V Summer School AACIMP 2010

Helmholtz Zentrum München statistics

•  Part of Helmholtz Network (€3 Milliards, 30000 people)
•  Leading center for Environmental Health in Germany
•  25 institutes (1789 people, 569 scientists & 280 PhD students)*
•  70 contracts with EU
•  Disciplines
•  Biology 40%
•  Chemistry/biochemistry 15%
•  Physics 10%
•  Medicine 8%

•  Chemoinformatics group, Institute for Bioinformatics & Systems Biology
  9 peoples, strong expertise in in silico data analysis, machine learning methods,
chemoinfomatics software development, data dissemination

*January 2010

Layout

•  Chemoiformatics??? What does it mean?
•  Role of chemoinformatics
•  in drug discovery
•  Chemical Biology, NIH Roadmaps
•  REACH, chemical safety
•  Definitions of chemoinformatics
•  OCHEM
•  Overview of the demonstration course

Cheminformatics & Bioinformatics keywords in
SCOPUS
250 70000
2000
200 2001 60000 2000
2002 2001
50000
2002
150 2003
40000 2003
2004 2004
100 2005 30000 2005
2006 2006
20000
2007
50 2007
2008
10000
2008 2009
0 2009 0
Chemoinfomatics Cheminformatics Bioinfomatics

There was no dedicated journal, about 10x lower if journals
many authors does not explicitly use world and references are exclued
chemoinformatics

Dmitrii Ivanovich Mendeleev,
1834-1907

Discoverer of the Periodic Table — An Early “Chemoinformatician”

Why Mendeleev?
Faced with a large amount of data, with many gaps, Mendeleev:
•  Sought patterns where none were obvious,
•  Made predictions about properties of unknown chemical substances,
based on observed properties of known substances,
•  Created a great visualization tool!

Why he? Why at that time?
“The Law of transformation of Quantity into Quality”
by Karl Marx

Quantity aspects of chemoinformatics

Major challenges of Chemoinformatics

Millions of structures
Thousand of publications

Storage, organization and search of information
QSAR / QSPR studies: prediction of properties and activities

Chemical biology & drug discovery
In silico environmental toxicology, REACH

QSAR/QSPR - Quantitative Structure-Activity (Property) Relationship Studies

Goal of chemistry: Synthesis of Properties

The most fundamental and lasting objective
of synthesis is
not production of new compounds
but
production of properties

George S. Hammond
Norris Award Lecture, 1968

Where Chemoinformatics is required?

Pharma companies and public organisation
•  data collection and handling
•  model development QSAR/QSPR, in silico design
•  In vitro, in vivo data analysis and interpretation

REACH
•  > 140,000 compounds
•  Development of in silico methods to predict toxicity of chemical compounds

Chemical industry
•  Design of new properties; chemical synthesis
•  Prediction of toxicity of chemicals BEFORE they enter market

Pharma R&D: Cost and Productivity issues
Compound numbers
1.000.000 Cpds <5000 Cpds < 500 Cpds < 5 Cpds
Up to 15 Years:
1 drug

Lead Lead Optimisation Lead Profiling Clinics Approval
Target Discovery Discovery ADME/T ADME/T

In vitro and in vivo property determination:
Millions of screens for solubility, stability, absorption, metabolism, transport,
reactive products, drug interactions, etc etc
Preclinics Costs: > $300m PER COMPOUND to reach approval

Pharma: Declining R&D productivity in the
drug discovery industry

Approved medicine

See http://www.frost.com/prod/servlet/market-insight-top.pag?docid=128394740

Pharma R&D Cost and productivity:
Reasons for compound failure

Pharmacokinetics
Animal toxicity
Adverse effects
Lack of efficacy
Commercial reasons
Miscellaneous

TOP four reasons are connected to compound
Absorption, Distribution, Metabolism and Excretion, all of which may
contribute to lack of efficacy and Toxicity: ADME/T issues
Experimental measurements or computational predictions?

Chemoinformatics!

Public organizations: NIH Roadmaps

Dr. Elias Zerhouni
Former NIH director (2002-2008)
$29.5 billions in 2008 for
27 Institutes

Ukraine GDP is ca $127 billions
Chemical Biology Belarus GDP is ca $48 billions

Public organizations: NIH Roadmaps

New pathways to discovery Molecular Libraries Screening Center Network
•  understand complex biological (MLSCN)
systems •  10 centers were created
•  understand their connections •  250 assays to screen >200,000 molecules
•  build better "toolbox" for
researchers
Chemoinfomatics!
Research teams of the future
•  PubChem database to handle data
Re-engineering the clinical research team

REACH

Registration, Evaluation, Authorisation and
Restriction of Chemical substances

European Chemicals Agency (ECHA) in Helsinki

REACH and Environmental Chemoinformatics

> 140,000 chemicals to be registered
It is expensive to measure all of them ($200,000 per compound), a lot of
animal testing => €24 billions over next 10 years
QSAR models can be used to prioritize compounds
•  Compound is predicted to be toxic
•  Biological testing will be done to prove/
disprove the models
•  Compound is predicted to be not toxic
•  tests can be avoided, saving money, animals
•  but ... only if we are confident in the predictions

"One can not embrace the unembraceable.”

Possible: 1060 - 10100 molecules theoretically exist

Achievable: 1020 - 1024 can be synthesized now
by companies (weight of the Moon is ca 1023 kg)

Available: 2*107 molecules are on the market

Measured: 102 - 105 molecules with ADME/T data
Kozma Prutkov
Problem: To predict ADME/T properties of just molecules 1080 atoms in the Universe
on the market we must extrapolate data from one to
1,000 - 100,000 molecules!

Applicability domain: Key points

There are NO universal computational models that work well
If x is small, on the whole chemical space
Sin(x) ≈ x

Applicability domain: Key points

There are NO universal computational models that work well
on the whole chemical space

Performance on the test set Performance on a dissimilar set

Making predictions with the experimental accuracy:
case study of 96,000 Pfizer molecules

Non-reliable Reliable

60% of molecules
are predicted
with average accuracy of
0.3 log units

Tetko et al, Chem. & Biodiver., 2009, 6(11), 197-202.

Definitions of Chemoinformatics

Chemoinformatics is a generic term that encompasses the design, creation, organization,
management, retrieval, analysis, dissemination, visualization, and use of chemical
information. G. Paris, 1988
Chemoinformatics is the mixing of those information resources to transform data in to
information and information in to knowledge for the intended purpose of making better
decisions faster in the area of drug lead identification and optimization. F.K. Brown, 1998

Chemoinformatics is the application of informatics methods to solve chemical problems.
J. Gasteiger, 2004

Chemoinformatics is a field dealing with molecular objects (graphs, vectors) in
multidimensional chemical space. A. Varnek & A. Tropsha, 2007.

Chemoinfomatics: data to knowledge

What is required to create a good
(chemoinformatics) model?

•  Data is the most essential part!
•  Appropriate representation of molecules (choice of
descriptors is crucial)
•  Good statistical methods

>100,000 lines of code in Java

Database schema
Simplified overview
Property Condition
Filtering
Toxicology, Biology, Temperature,
logP = 0.5 Partition coefficient. pH, species,
Melting Point = 100
tissue, method
°C

Tags Data Point Introducer
Bill G., Sergey B.
Toxicology, Biology,
Partition coefficient.
Date of modification
Informationsystem

Structure Article
Manipulation
Benzene, Urea, ... Editing Garberg, P
Organization “In vitro models for …”
Working sets<

Practical course on using OCHEM and data
modeling on-line

Tomorrow 10:00-12:00 auditorium 235-6

Advice: make yourself familiar with the on-line tutorials available as
http://ochem/tutorials

If you have data (SMILES/sdf file and molecular activities) – bring it with you
– we can try to build model.

Interested in Short-Term stays (3-12 months)?
Apply!!!

http://www.eco-itn.eu

Introduction to Chemoinformatics

Recommended

Recommended

More Related Content

What's hot

What's hot (20)

Viewers also liked

Viewers also liked (20)

Similar to Introduction to Chemoinformatics

Similar to Introduction to Chemoinformatics (20)

More from SSA KPI

More from SSA KPI (20)

Recently uploaded

Recently uploaded (20)

Introduction to Chemoinformatics