The document discusses computational protein design techniques. It covers topics like sequence-based and structure-based computational protein design, molecular force fields, knowledge-based potentials, and predicting protein dynamics. The author aims to provide an overview of different computational protein design approaches and challenges in the field.
1. Computational Protein Design
2. Computational Protein Design Techniques
Pablo Carbonell
pablo.carbonell@issb.genopole.fr
iSSB, Institute of Systems and Synthetic Biology
Genopole, University d’Évry-Val d’Essonne, France
mSSB: December 2010
Pablo Carbonell (iSSB) Computational Protein Design mSSB: December 2010 1 / 45
2. Outline
1 Introduction
2 Computational Protein Descriptors
3 Sequence-based CPD
4 Structure-based CPD
5 Search Algorithms in CPD
6 De Novo Design
7 Challenges in Sequence and Structure-Based CPD
5. A Blueprint of CPD Approaches
∗ RS : research studies
7. Molecular Signature Descriptors
A 2D representation of the molecular graph as an undirected colored graph G(V, E, C),
with V: atoms, E: bonds, C: atom type
The signature descriptor of height h of atom x in the molecular graph G, or ${}^{h}\sigma(x)$, is a
canonical representation of the subgraph of G containing all atoms that are at
distance up to h from x
Atomic signature: the signature of the graph is the sum of the atomic signatures of its atoms
\[ {}^{h}\sigma(G) = \sum_{x \in V} {}^{h}\sigma(x) \tag{1} \]
The signature is a systematic codification of the molecular graph [Faulon et al., 2004]
σ(methylcyclopropane) =
1 [C]([H][C]([H][H][C,0])[C,0]([H][H])[C]([H][H][H]))
2 [C]([H][H][C]([H][C,0][C]([H][H][H]))[C,0]([H][H]))
1 [C]([H][H][H][C]([H][C]([H][H][C,0])[C,0]([H][H])))
1 [H]([C]([C]([H][H][C,0])[C,0]([H][H])[C]([H][H][H])))
4 [H]([C]([H][C]([H][C,0][C]([H][H][H]))[C,0]([H][H])))
3 [H]([C]([H][H][C]([H][C]([H][H][C,0])[C,0]([H][H]))))
8. Molecular Signature of Reactions and Proteins
Signature of a reaction. The signature of reaction R
\[ S_1 + S_2 + \dots + S_n \rightarrow P_1 + P_2 + \dots + P_m \tag{2} \]
that transforms n substrates into m products is given by the difference between the
signature of the products and the signature of the substrates:
\[ {}^{h}\sigma(R) = \sum_{p \in P} {}^{h}\sigma(p) - \sum_{s \in S} {}^{h}\sigma(s) \tag{3} \]
Signature of protein sequences. The protein P is represented by the linear
chain given by its collapsed graph at residue level, a reduced molecular graph
representation G(V, E, C) known as string signature, where V: residues a ∈ A,
E: contiguous in sequence, C: amino acid type
\[ {}^{h}\sigma(P) = \sum_{a \in A} {}^{h}\sigma(a) \tag{4} \]
9. Protein Contact Maps
The protein contact map is a graph representation of the 3D interactions
at residue level, G(V, E, C), where V: residues, E: contacts, C: amino acid type
Two residues are considered to interact when atoms of the two residues lie
closer than a predetermined threshold (typically 4.5 to 5 Å)
Contact maps can account for long-range interactions and conformational states
Song et al. [2010]
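The contact criterion can be sketched in a few lines of Python. This is a minimal illustration, assuming that a contact means at least one inter-residue atom pair closer than the threshold; the function name `contact_map` and the toy coordinates are hypothetical and not taken from the slides.

```python
import math

def contact_map(residues, threshold=5.0):
    """Return the set of residue pairs (i, j), i < j, that are in contact.

    residues: list of residues, each a list of (x, y, z) atom coordinates in Å.
    Two residues are in contact if any pair of their atoms is closer than
    `threshold` (assumed reading of the criterion).
    """
    contacts = set()
    for i in range(len(residues)):
        for j in range(i + 1, len(residues)):
            if any(math.dist(a, b) < threshold
                   for a in residues[i] for b in residues[j]):
                contacts.add((i, j))
    return contacts

# Hypothetical toy coordinates: three residues with two atoms each.
toy = [
    [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0)],
    [(3.8, 0.0, 0.0), (5.3, 0.0, 0.0)],
    [(12.0, 0.0, 0.0), (13.5, 0.0, 0.0)],
]
print(contact_map(toy))
```

The resulting edge set E, together with the residue types as colors C, gives the graph G(V, E, C) described on this slide.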
11. Sequence and Structure-Based CPD
Sequence-based CPD methods are in some cases a good trade-off between
complexity of the model and accuracy of the predictions
12. Sequence-based Knowledge-based potentials
The simplest way to score a protein and to identify active regions is through amino
acid scales or indexes
AAindex is a database of
544 amino acid indexes
94 Amino Acid Matrices
47 amino acid pair-wise contact potentials
Examples: hydrophobicity,
accessibility, van der Waals volume,
secondary structure propensity,
flexibility
This approach is widely used when
analyzing conserved motifs and
correlated mutations in protein fold
families through multiple alignments
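As an illustration of scoring a sequence with an amino acid index, the sketch below averages a hydrophobicity scale over a sliding window. The Kyte-Doolittle hydropathy values are used as one example of such a scale; the function name, the window size and the example sequence are hypothetical choices.

```python
# Kyte-Doolittle hydropathy index, one example of an amino acid scale.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

def scale_profile(sequence, scale=KYTE_DOOLITTLE, window=5):
    """Average the scale over a sliding window; one value per window start."""
    values = [scale[aa] for aa in sequence]
    return [sum(values[i:i + window]) / window
            for i in range(len(values) - window + 1)]

seq = "MKTLLILAVVDEEK"  # hypothetical sequence
profile = scale_profile(seq)
print([round(v, 2) for v in profile])
```

Windows with high average values flag hydrophobic stretches; any other index (accessibility, volume, flexibility) can be swapped in by passing a different `scale` dictionary.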
13. Quantitative Structure-Activity Relationship (QSAR) Techniques
QSAR is a statistical method used extensively by the chemical and
pharmaceutical industries in small-molecule and peptide optimization
The goal is to model causal relationships between
structures of interacting molecules
measurable properties of scientific or commercial interest, such as
ADME/Tox (absorption, distribution, metabolism, excretion, and toxicity) of drugs
14. QSAR Model Evaluation
Model predictability is generally evaluated through the leave-one-out (LOO)
cross-validation correlation coefficient $q^2$
Partial least-squares (PLS) regression is commonly used
Additional nonlinear terms can be added through the use of nonlinear regression
or machine learning techniques (kernel methods, random forests, etc)
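The leave-one-out evaluation can be sketched as follows. For brevity the example swaps PLS for an ordinary least-squares fit on a single descriptor, and it assumes the common definition $q^2 = 1 - \mathrm{PRESS}/SS_{total}$ with the mean taken over all observations; the function names and the data are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit y = a + b*x for a single descriptor."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    return my - b * mx, b

def loo_q2(xs, ys):
    """Leave-one-out cross-validated q2 = 1 - PRESS / SS_total."""
    press = 0.0
    for i in range(len(xs)):
        # Refit the model without observation i, then predict it.
        a, b = fit_line(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:])
        press += (ys[i] - (a + b * xs[i])) ** 2
    mean_y = sum(ys) / len(ys)
    ss_total = sum((y - mean_y) ** 2 for y in ys)
    return 1.0 - press / ss_total

# Hypothetical descriptor values and measured activities.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
print(loo_q2(xs, ys))
```

Unlike the fitted correlation coefficient, $q^2$ can become negative when the model predicts held-out points worse than the mean does, which is why it is a stricter check of predictability.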
15. QSAR Modeling Workflow
18. The ProSAR Algorithm
An extension of SAR-based approaches to CPD
It formalizes the decision-making processes about which mutations to include in
combinatorial libraries
\[ y = \sum_{i=1}^{N} \sum_{j \in A} c_{ij}\, x_{ij} \tag{5} \]
y: the predicted function (activity) of the protein sequence
N: the number of sequence positions
$c_{ij}$: the regression coefficient corresponding to the mutational effect of having residue
j among the 20 amino acids A at position i
$x_{ij}$: binary variable indicating the presence or absence of residue j at position i
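The additive model of Eq. (5) can be illustrated with a short sketch. The coefficient values below are hypothetical placeholders standing in for the output of a regression fit on screening data; the function names are illustrative and not part of any ProSAR software.

```python
def encode(sequence):
    """Binary indicators x_ij: 1 if residue j is present at position i."""
    return {(i, aa): 1 for i, aa in enumerate(sequence)}

def predict(sequence, coeffs):
    """y = sum_i sum_j c_ij * x_ij; coefficients that are absent count as 0."""
    return sum(coeffs.get(key, 0.0) * x for key, x in encode(sequence).items())

def beneficial_mutations(coeffs):
    """Residue choices with positive coefficients, largest effect first."""
    hits = [(c, i, aa) for (i, aa), c in coeffs.items() if c > 0]
    return [(i, aa) for c, i, aa in sorted(hits, reverse=True)]

# Hypothetical regression coefficients c_ij keyed by (position, residue).
coeffs = {(0, "M"): 0.0, (1, "K"): 0.2, (1, "R"): 0.9,
          (2, "T"): -0.4, (2, "V"): 0.5}
print(predict("MKT", coeffs))
print(predict("MRV", coeffs))
print(beneficial_mutations(coeffs))
```

Ranking the coefficients in this way is what supports the decision about which mutations to carry into the next combinatorial library.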
19. Improving Catalytic Function by ProSAR-driven Enzyme Evolution
Statistical analysis of protein sequence
activity relationships
Bacterial biocatalysis of
Atorvastatin (Lipitor)
(cholesterol-lowering drug)
Codexis Inc.
21. Structure-based CPD
Energy functions and molecular force fields
Local conformational restrictions
Predicting entropic factors
Protein topological properties
From Narasimhan et al. [2010]
22. Energy Functions and Molecular Force Fields
In structure-based CPD, folds are usually
represented by the spatial coordinates of the
backbone atoms or design scaffold
Protein design is done by placing amino acid side
chains along the scaffold
Side chains are only permitted to assume a
discrete set of statistically preferred
conformations: rotamers
Rotamer/backbone and rotamer/rotamer
interaction energies are tabulated
These potential energies can then be
approximated by using any of the standard
force fields : CHARMM, AMBER, GROMOS
23. Molecular Force Fields
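AMBER: a classical force field for energy and MD calculations:
\[
\begin{aligned}
V(r^N) ={}& \sum_{\text{bonds}} \tfrac{1}{2} k_b (l - l_0)^2
 + \sum_{\text{angles}} \tfrac{1}{2} k_a (\theta - \theta_0)^2
 + \sum_{\text{torsions}} \tfrac{1}{2} V_n \left[ 1 + \cos(n\omega - \gamma) \right] \\
 &+ \sum_{j=1}^{N-1} \sum_{i=j+1}^{N} \left\{ \epsilon_{ij} \left[ \left( \frac{r_{0ij}}{r_{ij}} \right)^{12} - 2 \left( \frac{r_{0ij}}{r_{ij}} \right)^{6} \right] + \frac{q_i q_j}{4 \pi \epsilon_0 r_{ij}} \right\}
\end{aligned}
\tag{6}
\]
1 $\sum_{\text{bonds}}(\cdot)$: energy between covalently bonded atoms.
2 $\sum_{\text{angles}}(\cdot)$: energy due to the geometry of electron orbitals involved in covalent bonding.
3 $\sum_{\text{torsions}}(\cdot)$: energy for twisting a bond due to bond order (e.g. double bonds) and neighboring bonds or lone pairs of electrons.
4 $\sum_{j=1}^{N-1} \sum_{i=j+1}^{N}(\cdot)$: non-bonded energy between all atom pairs:
  1 van der Waals energies
  2 Electrostatic energies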
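The non-bonded double sum of Eq. (6) can be evaluated directly, as in the sketch below. It sums over all atom pairs exactly as the equation is written, whereas production force fields typically exclude or scale pairs that are close in the bond graph. The Coulomb prefactor of about 332.06 kcal Å mol⁻¹ e⁻² is an assumed unit choice, and all names and parameter values are hypothetical.

```python
import math

# Approximate value of 1/(4*pi*eps0) in kcal*Å/(mol*e^2); assumed unit system.
COULOMB = 332.06

def nonbonded_energy(coords, charges, eps, r0):
    """Sum the van der Waals and electrostatic terms over all atom pairs.

    coords: list of (x, y, z) positions in Å.
    charges: partial charges in units of e.
    eps[i][j], r0[i][j]: pair well depth and minimum-energy distance.
    """
    total = 0.0
    n = len(coords)
    for j in range(n - 1):
        for i in range(j + 1, n):
            r = math.dist(coords[i], coords[j])
            ratio6 = (r0[i][j] / r) ** 6
            lennard_jones = eps[i][j] * (ratio6 ** 2 - 2.0 * ratio6)
            coulomb = COULOMB * charges[i] * charges[j] / r
            total += lennard_jones + coulomb
    return total

# Hypothetical pair of neutral atoms placed at their minimum-energy distance.
coords = [(0.0, 0.0, 0.0), (3.5, 0.0, 0.0)]
eps = [[0.0, 0.1], [0.1, 0.0]]
r0 = [[0.0, 3.5], [3.5, 0.0]]
print(nonbonded_energy(coords, [0.0, 0.0], eps, r0))
```

With this 12-6 form the van der Waals term reaches its minimum of $-\epsilon_{ij}$ at $r_{ij} = r_{0ij}$, which is a quick sanity check for any implementation.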
24. Structure-based Knowledge-based Potentials
They are built by performing a large-scale statistical study of structural databases
such as PDB (Protein Data Bank)
Rotamer libraries (∼ 150 rotameric states)
Binary patterning: only some types of amino acids are allowed, depending on the
hydrophobic environment
An implicit solvation model
Secondary structure propensity
Frequency of small segments in the PDB
Pairwise potentials
van der Waals interactions
Hydrogen bonding
Electrostatics
Entropy-based penalties for flexible side-chains
From Boas and Harbury [2007]
25. Energy Functions
Design along the backbone or scaffold
Rotamer/backbone and rotamer/rotamer interaction energies are tabulated
Precomputed from molecular force fields: CHARMM, AMBER, GROMOS
Total energy of the protein
\[ E_{TOT} = \sum_{k} E_k(r_k) + \sum_{k \neq l} E_{kl}(r_k, r_l) \tag{7} \]
N: length of the protein
$r_k$: the rotamer of the kth side chain
$E_k(r_k)$: the self-energy of a particular rotamer $r_k$
$E_{kl}(r_k, r_l)$: the pair energy of rotamers $r_k$, $r_l$
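A minimal sketch of how tabulated energies combine into the total energy of one rotamer assignment is given below. It assumes that each unordered pair of positions is stored and counted once (k < l); under the $k \neq l$ convention with a symmetric table, the tabulated pair energies would be halved accordingly. The tables are hypothetical.

```python
from itertools import product

def total_energy(assignment, self_energy, pair_energy):
    """E_TOT for one rotamer assignment.

    assignment[k]: index of the rotamer chosen at position k.
    self_energy[k][r]: E_k(r).
    pair_energy[(k, l)][r][s]: E_kl(r, s), stored once per pair with k < l.
    """
    n = len(assignment)
    energy = sum(self_energy[k][assignment[k]] for k in range(n))
    for k in range(n):
        for l in range(k + 1, n):
            energy += pair_energy[(k, l)][assignment[k]][assignment[l]]
    return energy

# Hypothetical tables: 2 positions with 2 rotamers each.
self_energy = [[1.0, 0.5], [0.2, 0.8]]
pair_energy = {(0, 1): [[0.0, -1.0], [2.0, 0.3]]}

# Brute-force enumeration of all assignments, feasible only for tiny problems.
best = min(product(range(2), repeat=2),
           key=lambda a: total_energy(a, self_energy, pair_energy))
print(best, total_energy(best, self_energy, pair_energy))
```

Because every term is a table lookup, evaluating a candidate design is cheap; the cost of design lies in the number of assignments that must be considered.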
26. The Role of Dynamics
Besides protein structure, protein dynamics can play a direct role in molecular
recognition
Flexible proteins recognize their targets through induced fit or conformational
selection, likely showing promiscuity
Binding is commonly enthalpy-driven, but in some cases entropy is important, for
instance:
Proteins with multiple binding sites
Small hydrophobic molecules
Two sources of protein motion:
Protein flexibility: intraconformational dynamics (fast time scale motions)
Conformational heterogeneity: interconformational dynamics
Gibbs free energy:
\[ \Delta G = \Delta H - T \Delta S \tag{8} \]
\[ \Delta S = \Delta S_{solv} + \Delta S_{conf} + \Delta S_{rt} \tag{9} \]
$\Delta S_{solv}$: solvent entropy
$\Delta S_{conf}$: conformational entropy of protein and ligand
$\Delta S_{rt}$: rotational and translational degrees of freedom
27. Predicting Side-chain Dynamics from Structural Descriptors
The Lipari-Szabo model-free approach quantifies motions from
NMR experiments by computing the generalized order parameter $S^2$
Protein backbone dynamics: ¹⁵N-H and ¹³Cα-H NMR relaxation methods
Protein side-chain methyl dynamics: ¹³Cα-H NMR relaxation methods (side-chain
motions in the picosecond-to-nanosecond time regime)
From the BMRB we compiled $S^2$ data for 18 proteins, including 10 proteins in 2 or
more different states: calmodulin, barnase, PDZ, MUP, DHFR, staphylococcal
nuclease, Pin1, SH3 domain, MSG
This technique provides measurements only for the Cα of methyl groups in side
chains: ALA, LEU, ILE, MET, THR, VAL
28. Structural Descriptors of Methyl Dynamics
We consider the following parameters influencing side-chain dynamics :
Packing density at the methyl site i and its neighboring residues j within a sphere of
r = 5 Å
\[ P_i = \sum_{r_{ij} < 5\,\text{Å}} C_j\, e^{-r_{ij}} = \sum_{r_{ij} < 5\,\text{Å}} \Bigg( \sum_{r_{jk} < 5\,\text{Å}} e^{-r_{jk}} \Bigg) e^{-r_{ij}} \tag{10} \]
Side chain stiffness: number of dihedral angles separating the backbone from the
methyl carbon, weighted by the side-chain packing
Rotameric state: angular distance Δχ = χ − χ0 to the closest rotameric state χ0 in
the library
Elongation: distance from the methyl site to the Cα
Pairwise contact potential: a knowledge-based potential of the frequency of contacts
between residues at several distances, computed from the PDB
Solvation effect : DSSP accessibility and residue hydrophobicity
Van der Waals contacts
Hydrogen bonds (in the case of Threonine)
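The packing density descriptor of Eq. (10) can be sketched as follows. The example assumes one representative point per site, distances in Å, and that a site is not counted as its own neighbor; these choices and the function name are illustrative assumptions rather than the published implementation.

```python
import math

def packing_density(sites, cutoff=5.0):
    """P_i = sum_{r_ij < cutoff} C_j * exp(-r_ij),
    with C_j = sum_{r_jk < cutoff} exp(-r_jk).

    sites: list of (x, y, z) coordinates in Å, one point per site.
    """
    n = len(sites)
    dist = [[math.dist(sites[a], sites[b]) for b in range(n)] for a in range(n)]

    def neighbors(a):
        # Sites within the cutoff sphere, excluding the site itself (assumption).
        return [b for b in range(n) if b != a and dist[a][b] < cutoff]

    # C_j: local density around each neighbor j.
    c = [sum(math.exp(-dist[j][k]) for k in neighbors(j)) for j in range(n)]
    return [sum(c[j] * math.exp(-dist[i][j]) for j in neighbors(i))
            for i in range(n)]

# Hypothetical coordinates: two close sites and one isolated site.
sites = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (100.0, 0.0, 0.0)]
print(packing_density(sites))
```

The nested sum means that a site scores high not only when it has close neighbors, but when those neighbors are themselves densely packed.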
29. Predicting Methyl Side-chain Dynamics
Algorithm: neural network
Cross-validation: r = 0.71 ± 0.029 (p-value = 4.6 × 10⁻⁸⁷)
Example: experimental and predicted changes in ∆S² of barnase after binding barstar
(figure: ∆S² > 0, rigidification; ∆S² < 0, flexibilization)
Protein      MD method     r (MD)   r (nnet)
ubiquitin    AMBER99SB     0.81     0.81
TNfn3        CHARMM 22     0.62     0.79
FNfn10       CHARMM 22     0.51     0.64
barnase      OPLS-AA/L     0.55     0.64
calmodulin   FDPB          0.60     0.72
[Carbonell and del Sol, 2009]
31. Search Algorithms in CPD
32. Search Algorithms
Objective: finding the best design within the space of all possible amino
acid/rotameric states
A vast search space: $20^N$ or $p^N$
N: number of positions to mutate
p: number of rotameric states
Strategies
Deterministic algorithms
Dead-end elimination (DEE) algorithm: a pruning method.
Some accelerations of the DEE algorithm: upper-bound estimation; the “magic bullet” metric;
conformational splitting; background optimization
Stochastic algorithms
Monte Carlo
Simulated annealing
Genetic algorithms
33. The DEE Algorithm
It assumes that the energy of the protein can be written as
\[ E_{TOT} = \sum_{k} E_k(r_k) + \sum_{k \neq l} E_{kl}(r_k, r_l) \tag{11} \]
N: length of the protein
$r_k$: the rotamer of the kth side chain
$E_k(r_k)$: the self-energy of a particular rotamer $r_k$
$E_{kl}(r_k, r_l)$: the pair energy of the rotamers $r_k$, $r_l$
Complexity:
Single search scales quadratically with the total number of rotamers, $O((p \times N)^2)$
Pair search scales cubically, $O((p \times N)^3)$
Brute-force enumeration: $O(p^N)$
34. The DEE Algorithm
Single rotamers and rotamer pairs are eliminated during the computational cycles
Single elimination: eliminate rotamer $r_k^A$ if some other rotamer $r_k^B$ at the same
position gives a better energy whatever rotamers $r_l^X$ the other positions adopt
\[ E_k(r_k^A) + \sum_{l=1,\, l \neq k}^{N} \min_X E_{kl}(r_k^A, r_l^X) > E_k(r_k^B) + \sum_{l=1,\, l \neq k}^{N} \max_X E_{kl}(r_k^B, r_l^X) \tag{12} \]
Pairs elimination: eliminate a pair of rotamers $(r_k^A, r_l^B)$ at two positions if there exists another
pair $(r_k^C, r_l^D)$ that gives a better energy
\[ U_{kl}^{AB} \overset{\text{def}}{=} E_k(r_k^A) + E_l(r_l^B) + E_{kl}(r_k^A, r_l^B) \tag{13} \]
\[ U_{kl}^{AB} + \sum_{i=1,\, i \neq k,l}^{N} \min_X \left( E_{ki}(r_k^A, r_i^X) + E_{li}(r_l^B, r_i^X) \right) > U_{kl}^{CD} + \sum_{i=1,\, i \neq k,l}^{N} \max_X \left( E_{ki}(r_k^C, r_i^X) + E_{li}(r_l^D, r_i^X) \right) \tag{14} \]
Values are precomputed and stored in energy matrices
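The single-rotamer criterion of Eq. (12) can be sketched as an iterative pruning loop. This is a minimal illustration on hypothetical energy tables, assuming pair energies are stored once per pair of positions; it omits the pairs criterion and the accelerations listed earlier.

```python
def pair(pair_energy, k, a, l, b):
    """E_kl(a, b) from a table stored once per pair of positions (k < l)."""
    return pair_energy[(k, l)][a][b] if k < l else pair_energy[(l, k)][b][a]

def dee_singles(self_energy, pair_energy):
    """Repeatedly apply the single-rotamer elimination criterion.

    Returns, for each position, the sorted list of surviving rotamer indices.
    """
    n = len(self_energy)
    alive = [set(range(len(e))) for e in self_energy]
    changed = True
    while changed:
        changed = False
        for k in range(n):
            for a in sorted(alive[k]):
                # Best case for a: each other position picks its most favorable rotamer.
                best_a = self_energy[k][a] + sum(
                    min(pair(pair_energy, k, a, l, x) for x in alive[l])
                    for l in range(n) if l != k)
                for b in alive[k] - {a}:
                    # Worst case for competitor b.
                    worst_b = self_energy[k][b] + sum(
                        max(pair(pair_energy, k, b, l, x) for x in alive[l])
                        for l in range(n) if l != k)
                    if best_a > worst_b:
                        # a can never be part of the global minimum.
                        alive[k].discard(a)
                        changed = True
                        break
    return [sorted(s) for s in alive]

# Hypothetical tables: 2 positions with 2 rotamers each.
self_energy = [[0.0, 5.0], [0.0, 1.0]]
pair_energy = {(0, 1): [[0.0, 0.5], [0.2, 0.0]]}
print(dee_singles(self_energy, pair_energy))
```

Each elimination shrinks the min and max ranges for the remaining positions, which is why the criterion is applied in repeated cycles until nothing more can be pruned.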
35. Stochastic Algorithms
Search in the space of feasible designs by making a series of combinations of
random and directed moves
Monte Carlo Metropolis: a move consists of exchanging one rotamer for another
at a randomly chosen position; a modification is accepted if it lowers the energy,
and otherwise with probability exp(−∆E/kT)
Simulated annealing: the temperature is lowered gradually, which allows higher-energy
nearby solutions to be explored during the initial cycles of the search
Genetic Algorithms: a population of models is propagated (evolved) throughout
the course of the run and genetic operators, such as recombination, are used to
create new models from existing parents
They are fast, can be scaled up to problems of large complexity
They are not guaranteed to converge to the optimal solution
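A minimal sketch of a Metropolis search with simulated annealing over rotamer assignments is shown below. The geometric cooling schedule, the step count, the temperature range and the energy tables are all hypothetical choices made for illustration.

```python
import math
import random

def anneal(self_energy, pair_energy, steps=2000, t_start=5.0, t_end=0.05, seed=0):
    """Return the lowest-energy assignment found and its energy."""
    rng = random.Random(seed)
    n = len(self_energy)

    def energy(state):
        e = sum(self_energy[k][state[k]] for k in range(n))
        for k in range(n):
            for l in range(k + 1, n):
                e += pair_energy[(k, l)][state[k]][state[l]]
        return e

    state = [rng.randrange(len(self_energy[k])) for k in range(n)]
    current = energy(state)
    best_state, best = list(state), current
    for step in range(steps):
        # Geometric cooling from t_start down to t_end (assumed schedule).
        t = t_start * (t_end / t_start) ** (step / (steps - 1))
        # Move: exchange the rotamer at one randomly chosen position.
        k = rng.randrange(n)
        trial = list(state)
        trial[k] = rng.randrange(len(self_energy[k]))
        trial_energy = energy(trial)
        delta = trial_energy - current
        # Metropolis criterion: always accept downhill, sometimes uphill.
        if delta <= 0 or rng.random() < math.exp(-delta / t):
            state, current = trial, trial_energy
            if current < best:
                best_state, best = list(state), current
    return best_state, best

# Hypothetical tables: 2 positions with 2 rotamers each.
self_energy = [[1.0, 0.5], [0.2, 0.8]]
pair_energy = {(0, 1): [[0.0, -1.0], [2.0, 0.3]]}
print(anneal(self_energy, pair_energy))
```

Keeping track of the best state seen so far is a common safeguard, since the final state of a stochastic run is not necessarily the lowest-energy one visited.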
36. The SCHEMA Algorithm
Equivalent to an in silico directed evolution
Consists of scoring libraries of hybrid protein
sequences against the parental sequence
Scoring:
Calculate the number of interactions between residues
(contacts within 4.5 Å) that are disrupted in the creation
of hybrid proteins
Hybrids are scored for stability by counting the number of
disruptions
Protein is partitioned into blocks that should not be
interrupted by crossovers (analogous to genetic algorithms)
From [Meyer et al., 2006]
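The disruption count can be sketched as follows, assuming the common reading in which a contact counts as disrupted when the residue pair found in the hybrid does not occur at the same two positions in any parent. The aligned parent sequences, the contact list and the block boundaries are hypothetical.

```python
def chimera(parents, boundaries, choice):
    """Assemble a hybrid: block b spans boundaries[b]..boundaries[b + 1]
    and is copied from parent choice[b]."""
    return "".join(parents[choice[b]][boundaries[b]:boundaries[b + 1]]
                   for b in range(len(choice)))

def schema_disruption(hybrid, parents, contacts):
    """Count contacts (i, j) whose residue pair in the hybrid is not seen
    at positions i and j in any parent (aligned, equal-length sequences)."""
    disrupted = 0
    for i, j in contacts:
        parental_pairs = {(p[i], p[j]) for p in parents}
        if (hybrid[i], hybrid[j]) not in parental_pairs:
            disrupted += 1
    return disrupted

# Hypothetical aligned parents, contact list and a single crossover at position 3.
parents = ["ACDEFG", "AKDQFW"]
contacts = [(1, 3), (1, 5), (2, 4)]
hybrid = chimera(parents, [0, 3, 6], [0, 1])
print(hybrid, schema_disruption(hybrid, parents, contacts))
```

Contacts between positions that are identical in all parents can never be disrupted, so only contacts spanning blocks taken from different parents contribute to the score.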
37. The OPTCOM and IPRO Algorithms for Library Design
The OPTCOM algorithm:
Balances size and quality of the library
The IPRO algorithm:
Identifies point mutations in the parent sequences using energy-based scoring functions
Residue and rotamer choices are driven by a mixed-integer linear programming
(MILP) formulation
From [Saraf et al., 2006]
38. Some Web Resources
IPRO: Iterative Protein Redesign and Optimization.
http://maranas.che.psu.edu/IPRO.htm
EGAD: A Genetic Algorithm for protein Design.
http://egad.ucsd.edu/software.php
RosettaDesign: A software package.
http://rosettadesign.med.unc.edu/
SCHEMA: A pair-wise energy function for scoring protein chimeras made from
homologous proteins.
http://www.che.caltech.edu/groups/fha/schema-tools/schema-overview.html
SHARPEN: Systematic Hierarchical Algorithms for Rotamers and Proteins on
an Extended Network.
http://koko.che.caltech.edu/sharpenabout.html
WHAT IF: Software for protein modelling, design, validation, and
visualisation. http://swift.cmbi.ru.nl/whatif/
FoldX: A force field for energy calculations and protein design.
http://foldx.crg.es/
40. De Novo-Designed Proteins
In de novo designs, some assumptions are needed in order to make the search
space tractable
Usually we start from some basic motifs or domains as scaffolds for the design
Examples:
βαβ motif resembling a zinc finger
3 and 4 helix bundles
Helical coiled-coils
Helix bundle motifs can be parametrized using a few global variables that
describe the global structure
Applications:
New metal-binding sites
Nonbiological cofactors for novel biomaterials and electromechanical devices
Novel enzymatic activities
41. Example: De Novo Design of a Metalloprotein
Computational de novo design of a four-helix (108 residues) bundle containing the
non-biological cofactor iron diphenyl porphyrin (DPP-Fe) [Bender et al., 2007]
The initial helix bundle was selected as the low-energy structure computed with Monte Carlo simulated annealing (MCSA)
STITCH: a program to select loops connecting helices from PDB Select
CHARMM and PROCHECK for removing overlaps
4 His and 4 Thr residues to support the 6-point coordination of the Fe(III) cations
SCADS: provides site-dependent amino acid probabilities in each round
43. Challenges in Sequence and Structure-Based CPD
Modeling
Greater availability of 3D protein structural information
More accurate energy functions
Improvement of rigid and flexible docking
Design
Improvement in search algorithms
Parametrization for non-natural amino acids
Prediction
Beyond additive models: using machine-learning algorithms
More complete environment descriptors
45. Bibliography I
Gretchen M. Bender, Andreas Lehmann, Hongling Zou, Hong Cheng, H. Christopher Fry, Don Engel, Michael J. Therien, J. Kent Blasie, Heinrich Roder,
Jeffrey G. Saven, and William F. DeGrado. De Novo Design of a Single-Chain Diphenylporphyrin Metalloprotein. Journal of the American Chemical
Society, 129(35):10732–10740, September 2007. ISSN 0002-7863. doi: 10.1021/ja071199j. URL http://dx.doi.org/10.1021/ja071199j.
F. Edward Boas and Pehr B. Harbury. Potential energy functions for protein design. Current opinion in structural biology, 17(2):199–204, April 2007. ISSN
0959-440X. doi: 10.1016/j.sbi.2007.03.006. URL http://dx.doi.org/10.1016/j.sbi.2007.03.006.
Pablo Carbonell and Antonio del Sol. Methyl side-chain dynamics prediction based on protein structure. Bioinformatics, pages btp463+, July 2009. doi:
10.1093/bioinformatics/btp463. URL http://dx.doi.org/10.1093/bioinformatics/btp463.
Jean-Loup L. Faulon, Michael J. Collins, and Robert D. Carr. The signature molecular descriptor. 4. Canonizing molecules using extended valence
sequences. Journal of chemical information and computer sciences, 44(2):427–436, 2004. ISSN 0095-2338. doi: 10.1021/ci0341823. URL
http://dx.doi.org/10.1021/ci0341823.
Michelle M. Meyer, Lisa Hochrein, and Frances H. Arnold. Structure-guided SCHEMA recombination of distantly related β-lactamases. Protein Engineering
Design and Selection, 19(12):563–570, December 2006. ISSN 1741-0126. doi: 10.1093/protein/gzl045. URL
http://dx.doi.org/10.1093/protein/gzl045.
Diwahar Narasimhan, Mark R. Nance, Daquan Gao, Mei-Chuan Ko, Joanne Macdonald, Patricia Tamburi, Dan Yoon, Donald M. Landry, James H. Woods,
Chang-Guo Zhan, John J. G. Tesmer, and Roger K. Sunahara. Structural analysis of thermostabilizing mutations of cocaine esterase. Protein
Engineering Design and Selection, 23(7):537–547, July 2010. doi: 10.1093/protein/gzq025. URL http://dx.doi.org/10.1093/protein/gzq025.
Manish C. Saraf, Gregory L. Moore, Nina M. Goodey, Vania Y. Cao, Stephen J. Benkovic, and Costas D. Maranas. IPRO: an iterative computational protein
library redesign and optimization procedure. Biophysical journal, 90(11):4167–4180, June 2006. ISSN 0006-3495. doi: 10.1529/biophysj.105.079277. URL
http://dx.doi.org/10.1529/biophysj.105.079277.
Jiangning Song, Kazuhiro Takemoto, Hongbin Shen, Hao Tan, Michael M. Gromiha, and Tatsuya Akutsu. Prediction of Protein Folding Rates from Structural
Topology and Complex Network Properties. IPSJ Transactions on Bioinformatics, 3:40–53, 2010. doi: 10.2197/ipsjtbio.3.40. URL
http://dx.doi.org/10.2197/ipsjtbio.3.40.