Computational Protein Design. 3. Applications in Systems and Synthetic Biology

Computational Protein Design
3. Applications of Computational Protein Design

Pablo Carbonell
pablo.carbonell@issb.genopole.fr

iSSB, Institute of Systems and Synthetic Biology
Genopole, University d’Évry-Val d’Essonne, France

mSSB: December 2010


Outline

1 Applications in Systems and Synthetic Biology

2 Protein Affinity Enhancement

3 Protein Modular Design

4 Protein Promiscuity Reengineering

5 Conclusions




Applications of CPD in Systems Biology

The challenge: robust and reliable methods of information correlation and integration of HT -omics networks
Unveiling new relationships that close the gap between
molecular characteristics of proteins and other compounds within the cell
systems characteristics of the cell as a whole
Computational intelligence algorithms for large-scale discovery studies
Choosing the right set of descriptors
Generating cellular interaction networks: the structural interactome

Figure: The Structural Interactome


Applications of CPD in Synthetic Biology

Engineering signal transduction: modifying the specificity of receptors
Engineering genetic networks
Modifying transcription
Targeting gene repair and modification
Novel biosensors
Minimal cells and synthetic genomes
Metabolic pathway engineering
Feedback loops design and sensitivity analysis
Programmable switches: allosteric, epigenetic, riboswitches
Conditional delivery of drugs
Modulation of signal transduction pathways
Inhibition of protein function
Adoption of a toxic conformation
Cell-cell communication
Orthogonal genes
Mathematical dynamical models




Antibody-Antigen Interactions

Antibodies are gamma globulin proteins found
in the immune system of vertebrates
Basic structural units:
Two large heavy chains (VH )
Two small light chains (VL )
The Fab region or fragment antigen-binding is
a region of an antibody that binds to antigens
The Fc region or fragment crystallizable region is the tail region that interacts with cell surface receptors
The FV region : variable domain


The Variable Domain FV

The variable domain is the most important
region for binding to antigens
The FV contains
3 variable loops of β-strands on the light chain VL
3 variable loops of β-strands on the heavy chain VH
These loops are referred to as the
complementarity determining regions (CDRs)


In Silico Design of Immunodiagnostics Assays for Anti TNF-α

Tumor necrosis factor-alpha (TNF-α), a cytokine involved in systemic inflammation,
can induce several cell responses depending on the cellular context:
activation of NF-κB-mediated proliferative programs
programmed cell death.
The early detection of unusual concentrations of TNF-α is a diagnostic biomarker of inflammatory conditions such as metabolic disorders (obesity), rheumatoid arthritis, tuberculosis, and cancer.
Moreover, TNF-α inhibitors have emerged in recent years as a new therapeutic approach for inflammatory immune-mediated diseases.
The currently used TNF-α inhibitory molecules are antibodies or soluble TNF
receptors which sequester TNF-α.


Computational Protein Affinity Design for Anti TNF-α Antibodies


Building the Model

No crystal structure of the TNF-α antibody-antigen complex is available

Therefore, our first step is to build a model of the complex through structural homology and docking

TNF-α trimer

Anti-TNF-α model from Swiss-Model

Docking and Scoring

Using ZDOCK (Accelrys Inc.) for the generation of docked complexes
Fast Fourier Transform based protein docking program
The top 2000 ranked predictions are returned

Scoring the complexes with FastContact
Contact binding free energy scoring tool for protein-protein complex structures
The estimates are based on rigid bodies


Hot-spots and Energy Minimization

Predicting hot-spots
Using FoldX, we performed an in silico alanine scanning to predict consensus hot-spots for the models.
These hot-spots were verified experimentally in the laboratory.

3 initial models were selected based on different criteria:
minimum predicted binding energy in FastContact
highest coverage of known hot-spots in anti-TNF-α
Energy was then minimized for the complexes using Discovery Studio (Accelrys Inc.).


In Silico Combinatorial Library

In silico combinatorial libraries of mutants around the complementarity-determining regions (CDRs) were built as follows:
Models for single-mutation variants were computed with Biopolymer and Builder (Accelrys Inc.) for rotamer selection and side chain positioning
Mutants were then submitted to a cluster of 64 × 4-core nodes for local energy minimization of the CDRs with GROMACS
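
The enumeration step behind such a single-mutation library can be sketched in a few lines of Python. This is only an illustration of how the list of variants might be generated before structural modelling; the function name `single_mutants`, the toy sequence, and the choice of 0-based CDR positions are assumptions, and the rotamer selection and energy minimization done with the Accelrys tools and GROMACS are not reproduced here.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def single_mutants(wild_type, positions):
    """Yield (label, sequence) for every single substitution at the
    given 0-based positions (hypothetical CDR positions)."""
    for pos in positions:
        for aa in AMINO_ACIDS:
            if aa == wild_type[pos]:
                continue  # skip the wild-type residue itself
            label = f"{wild_type[pos]}{pos + 1}{aa}"  # e.g. Y2A, 1-based numbering
            yield label, wild_type[:pos] + aa + wild_type[pos + 1:]


# Toy example: a 4-residue fragment with two positions open to mutation
library = list(single_mutants("GYTF", [1, 2]))
```

Each open position contributes 19 variants, so the library size grows linearly with the number of CDR positions scanned.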


Virtual Screening

The most beneficial mutations were selected to build a combinatorial library of double and triple mutants.
Variants with the lowest predicted binding energy (i.e. the highest predicted affinity) were shortlisted and compared with beneficial mutations reported in the literature.
Computation time: 2 weeks on a cluster of 64 × 4-core nodes.

The 6 best mutations were transferred to the molecular biology laboratory to be tested through ELISA immunoprecipitation assays.
Then, a new round of virtual screening was launched starting from the best
predicted variants.
After three rounds, values close to a 3-fold improvement in binding affinity
(measured as − log10 Kd ) were obtained.
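
As a rough sketch of the screening step, the snippet below combines beneficial single mutations into double and triple mutants and ranks them. It assumes, purely for illustration, that the predicted energy changes of single mutations are additive; in the workflow described above each variant was instead modelled and minimized individually. The mutation labels, positions, and energy values are hypothetical.

```python
from itertools import combinations


def combinatorial_library(singles, orders=(2, 3)):
    """Rank double and triple mutants built from beneficial single mutations.

    singles maps a mutation label to (position, predicted energy change
    in kcal/mol). The additive score is an assumption of this sketch.
    """
    library = []
    for order in orders:
        for combo in combinations(sorted(singles), order):
            positions = {singles[m][0] for m in combo}
            if len(positions) < order:
                continue  # two substitutions cannot share one residue
            score = sum(singles[m][1] for m in combo)
            library.append((score, combo))
    return sorted(library)  # lowest (most favourable) energy first


# Hypothetical beneficial single mutations
singles = {
    "Y32W": (32, -1.2),
    "Y32F": (32, -0.4),
    "S52R": (52, -0.9),
    "N100D": (100, -0.3),
}
ranked = combinatorial_library(singles)
shortlist = ranked[:6]  # e.g. candidates to pass on for experimental testing
```

The best variants from one round can be fed back as the starting point of the next round of screening.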




The Modular Organization of Binding Sites


The Modular Distribution of Domain-Domain Binding

Why choose domains?
Domains form independent structural and functional units
Domains are building blocks that can be rearranged to create proteins with different functions
Domains are evolutionarily conserved: different organisms use the same domains in protein-protein interactions
Objective: large-scale topological analysis of binding domains

Dataset
Source: iPFAM
330 protein domains
370 domain-domain interactions
Multiple alignments
5 organisms: E. coli, S. cerevisiae, C. elegans, D. melanogaster, H. sapiens
Binding site clustering


Graph Modular Decomposition

Domains can be decomposed further into connectivity modules by clustering the domain contact map G(V, E, C)

Girvan-Newman algorithm [PNAS (2002)] with maximum modularity stop rule [Kashtan and Alon, PNAS (2005)]:
1 The betweenness of all existing edges in the network is calculated first.
Edge betweenness: the number of shortest paths between pairs of nodes that run along the edge
2 The edge with the highest betweenness is removed
3 The betweenness of all edges affected by the removal is recalculated
4 Repeat 2 and 3 until the modularity Q for the K connected clusters in the network becomes maximum

Q = \sum_{s=1}^{K} \left[ \frac{l_s}{L} - \left( \frac{d_s}{2L} \right)^2 \right]    (1)

l_s = number of edges between nodes in module s
d_s = sum of node degrees in module s
L = total number of edges in the network
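
The decomposition can be sketched in plain Python as follows. This is a minimal illustration rather than the implementation used in the study: it recomputes the betweenness of all edges after every removal (instead of only the affected ones), and it scans all removals and keeps the partition with the highest Q rather than stopping at the first maximum. The function names and the toy graph are assumptions.

```python
from collections import deque


def edge_betweenness(adj):
    """Edge betweenness of an unweighted, undirected graph (Brandes-style)."""
    bet = {frozenset((u, v)): 0.0 for u in adj for v in adj[u]}
    for s in adj:
        dist, order = {s: 0}, []
        sigma = {v: 0.0 for v in adj}   # number of shortest paths from s
        sigma[s] = 1.0
        preds = {v: [] for v in adj}
        queue = deque([s])
        while queue:                    # breadth-first search from s
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = {v: 0.0 for v in adj}
        for w in reversed(order):       # accumulate path counts onto edges
            for v in preds[w]:
                c = sigma[v] / sigma[w] * (1.0 + delta[w])
                bet[frozenset((v, w))] += c
                delta[v] += c
    return bet                          # each path is counted twice; ranking unaffected


def components(adj):
    """Connected components, i.e. the current modules."""
    seen, comps = set(), []
    for s in adj:
        if s in seen:
            continue
        comp, stack = set(), [s]
        while stack:
            v = stack.pop()
            if v not in comp:
                comp.add(v)
                stack.extend(adj[v] - comp)
        seen |= comp
        comps.append(comp)
    return comps


def modularity(orig, modules):
    """Q of Eq. (1), evaluated on the original (uncut) graph."""
    L = sum(len(nbrs) for nbrs in orig.values()) / 2
    q = 0.0
    for mod in modules:
        l_s = sum(1 for u in mod for v in orig[u] if v in mod) / 2
        d_s = sum(len(orig[u]) for u in mod)
        q += l_s / L - (d_s / (2 * L)) ** 2
    return q


def girvan_newman_max_q(edges):
    orig = {}
    for u, v in edges:
        orig.setdefault(u, set()).add(v)
        orig.setdefault(v, set()).add(u)
    adj = {u: set(nbrs) for u, nbrs in orig.items()}
    best = components(adj)
    best_q = modularity(orig, best)
    while any(adj.values()):            # until no edges are left
        bet = edge_betweenness(adj)
        u, v = tuple(max(bet, key=bet.get))
        adj[u].discard(v)
        adj[v].discard(u)
        mods = components(adj)
        q = modularity(orig, mods)
        if q > best_q:
            best_q, best = q, mods
    return best_q, best


# Toy contact map: two triangles joined by a single bridge
edges = [(0, 1), (1, 2), (0, 2), (2, 3), (3, 4), (4, 5), (3, 5)]
q, modules = girvan_newman_max_q(edges)
```

On this toy graph the bridge carries the most shortest paths, so it is removed first and the two triangles emerge as the modules.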


Modularity

Modularity Q_s is a measure of how tightly members of a module s interact

Q_s = \frac{l_s}{L} - \left( \frac{d_s}{2L} \right)^2    (2)

l_s = number of edges between nodes in module s
d_s = sum of node degrees in module s
L = total number of edges in the network
\frac{l_s}{L}: fraction of edges in the network that connect vertices in the module s
\left( \frac{d_s}{2L} \right)^2: the expected value of the same quantity if edges fall at random

\hat{l}_s = \frac{d_s}{2} p_s = \frac{d_s}{2} \cdot \frac{d_s/2}{L}    (3)

p_s: probability of an edge to connect nodes in module s
In a randomly partitioned network, the expected modularity is \hat{Q}_s = 0


Binding Site and Modular Overlaps

Modular composition of binding site j:

m_j = (m_{j1}, m_{j2}, \ldots, m_{jM})    (4)

Similarity in modular composition between binding sites i and j:

M(i, j) = \frac{\sum_{k=1}^{M} m_{ik} m_{jk}}{|m_i| |m_j|}    (5)

Relative interface between i and j:

C(i, j) = \frac{1}{2} \left[ \frac{n_i}{N_i} + \frac{n_j}{N_j} \right]    (6)

n_i (n_j): number of residues in i (j) with contacts in j (i)
N_i (N_j): number of residues in binding site i (j)

Example: Kringle domain (PF00051), binding site A (blue) and binding site B (red)

C(A, B) = \frac{1}{2} \left( \frac{4}{10} + \frac{3}{8} \right)    (7)

M(A, B) = \frac{(2, 8, 0, 0, 0) \cdot (0, 2, 3, 3, 0)^T}{\sqrt{68} \sqrt{23}}    (8)
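
The Kringle domain example can be checked numerically with a short Python sketch. The function names are illustrative. Note that the norm is computed here directly from the listed composition vectors, which gives √22 for the second vector rather than the √23 printed in Eq. (8), so the value of M(A, B) below follows the vectors.

```python
import math


def relative_interface(n_i, total_i, n_j, total_j):
    """C(i, j) of Eq. (6): mean fraction of residues in contact."""
    return 0.5 * (n_i / total_i + n_j / total_j)


def modular_similarity(m_i, m_j):
    """M(i, j) of Eq. (5): cosine similarity of modular compositions."""
    dot = sum(a * b for a, b in zip(m_i, m_j))
    norm_i = math.sqrt(sum(a * a for a in m_i))
    norm_j = math.sqrt(sum(b * b for b in m_j))
    return dot / (norm_i * norm_j)


# Values taken from the Kringle domain example
c_ab = relative_interface(4, 10, 3, 8)
m_ab = modular_similarity((2, 8, 0, 0, 0), (0, 2, 3, 3, 0))
print(f"C(A,B) = {c_ab:.4f}, M(A,B) = {m_ab:.4f}")
```

Both measures range from 0 (no shared residues or modules) to 1 (identical sites), which makes them directly comparable.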


The Modular Organization of Domain-Domain Interfaces

Non-overlapping binding sites
are assigned to different
modules

Modules with high modularity Q contain a significant percentage of binding site regions

[Del Sol, Carbonell, PLOS Comp. Biology, (2007)]


Using Modularity to Identify Binding Regions

Modularity can be used to identify binding surfaces
Accuracy and coverage of modularity and of surface hydrophobic patches are greater than those of residue conservation
Combining modularity with the other two methods notably improves performance


Intra-Module Cooperativity and Inter-Module Independence

Human IL-4: a cytokine that plays a
regulatory role in the immune system
IL-4 contains 3 energetically
independent clusters of hot-spots
located in 3 modules
These hot-spots can be used to generate binding affinity and specificity



TEM1 β-lactamase confers antibiotic
resistance to E. coli
This enzyme is inhibited by BLIP
A mutagenesis study showed that
there are 2 hot-spot clusters which are
energetically independent
These clusters are located in different
modules



TCR hVβ2.1 (TSST-1 antibody). 2 cooperative distant clusters of hot-spots around the binding site
hGHbp (human growth hormone binding protein). Cooperative hot-spots located in 1 module distant to the binding site
CI-2 Serine protease Chymotrypsin inhibitor. A cluster of hot-spots located far away from the binding interface
RI (ribonuclease inhibitor). Hot-spots located in different modules are known to be independent


Modularity as a Measure of Residue Cooperativity

Protein domains can be decomposed into a set of modules that contain groups of
specialized residues
Binding sites are usually located in highly cooperative modules
Modularity, combined with sequence conservation and surface patches, can be
used to predict functional regions
This modular architecture confers robustness to protein structures and contributes to the determination of binding affinity and specificity


Energetic Determinants of Protein Binding Affinity

The modular decomposition of protein
structures is a structural characterization of
protein interactions
To learn more about the interplay between binding affinity and specificity, a thermodynamic characterization is necessary
We focus in this study on one specific interactome: the yeast interactome (main source: MIPS)
Structural interactome: 259 hubs (>5 partners) participating in 877 different interactions


Binding Site Clustering
Single and multiple interfaces
Binding sites correspond to residues interacting with the partner at a distance ≤ 5 Å
Binding sites are mapped onto the reference sequence of the hub and clustered using a version of the algorithm in Teyra et al. [2008]
1 Compute the N × N binary distance matrix D, where

D(i, j) = \delta_{ij} = \begin{cases} 1 & i \cap j = \emptyset \\ 0 & i \cap j \neq \emptyset \end{cases}    (9)

2 Start with k = N clusters
3 Compute the (k − 1)-means clustering of D
4 Recompute D for the k − 1 clusters
5 Repeat step 3 while all binding sites within clusters overlap
Total interfaces: 539, involved in 1 to 5 interactions
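
The idea behind the clustering can be illustrated with a simplified Python sketch. It builds the binary distance matrix of Eq. (9) and then, as a stand-in for the (k − 1)-means step, greedily merges two clusters whenever every pair of their binding sites overlaps; this swapped-in merging rule, the function names, and the residue sets are assumptions and not the algorithm of Teyra et al.

```python
from itertools import combinations


def binary_distance(sites):
    """D(i, j) = 1 if sites i and j share no residue, 0 otherwise."""
    n = len(sites)
    return [[0 if sites[i] & sites[j] else 1 for j in range(n)]
            for i in range(n)]


def cluster_overlapping(sites):
    """Merge clusters while all binding sites within a cluster overlap."""
    d = binary_distance(sites)
    clusters = [[i] for i in range(len(sites))]  # start with k = N clusters
    merged = True
    while merged:
        merged = False
        for a, b in combinations(range(len(clusters)), 2):
            if all(d[i][j] == 0 for i in clusters[a] for j in clusters[b]):
                clusters[a] += clusters[b]       # k -> k - 1 clusters
                del clusters[b]
                merged = True
                break
    return clusters


# Hypothetical binding sites of one hub, as sets of residue numbers
sites = [{1, 2, 3}, {3, 4}, {10, 11}, {11, 12}, {2, 3, 4}]
interfaces = cluster_overlapping(sites)
```

Each resulting cluster corresponds to one interface of the hub, which may be used by one or several partners.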


Protein Binding Affinity and Specificity

Binding energies and alanine scanning for each complex estimated using FoldX
[Schymkowitz et al., 2005]
Specific binding sites tend to bind their partners with higher affinity than
promiscuous sites
Interactions between promiscuous binding sites tend to be weaker

Interaction type            −ΔG [(kcal/mol)/residue]

Specific-specific           0.93
Promiscuous-promiscuous     0.85
Specific-promiscuous        0.50


Hot-Spots and Partner Motifs

A hot-spot: |\Delta\Delta G_{bind}| = |\Delta G_{MUT \to ALA} - \Delta G_{WT}| \geq 2 kcal/mol
In most of the cases, hot-spots are specific to one interaction. Some of them are
promiscuous
Are hot-spots specific?
Binding site motifs of interacting partners are determinants of specificity
As the promiscuity of the hot-spots increases, the number of common motifs in the partners increases
A common evolutionary origin of divergent partners in promiscuous binding

Number of interactions in hot-spots    Average number of common motifs interacting with hot-spots

1                                      1.4
2                                      2.5
3                                      3.0
4                                      4.0
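
The hot-spot criterion given at the top of this slide translates directly into a small filter over alanine-scanning results. The sketch below assumes the binding energies have already been estimated (for example with FoldX); the residue labels and energy values are hypothetical.

```python
HOT_SPOT_THRESHOLD = 2.0  # kcal/mol, from the hot-spot definition


def hot_spots(dg_wt, dg_ala_by_residue, threshold=HOT_SPOT_THRESHOLD):
    """Return residues whose alanine mutation changes the binding
    energy by at least `threshold` in absolute value."""
    return sorted(
        residue
        for residue, dg_ala in dg_ala_by_residue.items()
        if abs(dg_ala - dg_wt) >= threshold
    )


# Hypothetical binding energies (kcal/mol) for wild type and alanine mutants
dg_wild_type = -12.0
alanine_scan = {"Y32": -9.5, "S52": -11.4, "W100": -8.0, "N31": -10.0}
predicted = hot_spots(dg_wild_type, alanine_scan)
```

Repeating the filter for every interaction of a hub shows which hot-spots are specific to one partner and which are shared.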


Hot-spots Modular Distribution and Specificity

We have already shown examples of energetic independence of hot-spots in modules
Furthermore, the relative number of binding site modules containing hot-spots increases with the number of partners
A small fraction of hot-spots participate in more than one interaction, probably acting as binding site anchors

[ Carbonell, Nussinov, Del Sol, Proteomics, 2009]


Modular Distribution of Hot-spots and Specificity

Ubiquitin. A promiscuous protein with weak interactions
Cytochrome b. An example of a specific binding site
Calmodulin-dependent kinase. An example of a specific binding site
cdc42 GTPase. It contains a central module acting as a site anchor

The Role of Thermodynamics in Promiscuous Binding

In general, protein-protein interactions involving promiscuous binding sites are
weaker
Proteins generally interact with partners with a similar degree of promiscuity
Hot-spots in promiscuous binding sites tend to be more distributed over different
modules
Knowing the modular distribution of hot-spots involved in different interactions might allow us to rationally modify binding specificity and affinity


Large-scale Analysis Workflow




Applications in Synthetic Biology: Design of Metabolic Pathways
The Bio-RetroSynth project

ANR Chair d’Excellence, Faulon’s Lab


Tasks in the Bio-RetroSynth project

Bioretrosynthesis. Graphs for heterologous compounds production in E. coli
Computational protein design. Machine learning to mine genomic databases for
predicting protein function
Pathway design. Rank pathways to select the best to engineer
Quantitative Structure-Activity Relationship (QSAR) for enzyme activity and
inhibition based on experimental databases and toxicity assays.
Metabolic engineering. E. coli plasmids in order to construct combinatorial
libraries of highest rank heterologous pathways found to produce a target product
Engineering optimization. Flux Balance Analysis (FBA) and non-linear
optimization methods to maximize target yield


The Signature Reaction Space σ(R)


Examples of Retrosynthesis Graphs in the Reaction Signature Space

RetroPath: an online tool for retrosynthesis search of metabolic pathways

[D. Fichera, P. Carbonell, J.L. Faulon, Predicting heterologous compound-forming reaction pathways through retrosynthesis hypergraphs, in preparation]

Penicillin (antibiotic)
Galantamine (treatment of Alzheimer's disease)


Ranking Pathways

Gene heterogeneity
Heterologous gene expression
Enzyme performance for the specified reaction
Compound toxicity
Estimation of nominal fluxes
Consistency of the predicted phenotype
C(p) = \sum_{genes(p)} \left( \frac{1}{perf(gene)} + het(gene) + \sum_{prod(gene)} tox(prod) \right) + \frac{1}{flux}    (10)

p^* = \arg\min_p C(p)    (11)
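
The ranking in Eqs. (10) and (11) can be sketched in Python as below. The dictionary layout, key names, and all numerical values are hypothetical, chosen only to show how the terms combine; in practice the performance, heterogeneity, toxicity, and flux estimates would come from the predictors described in this section.

```python
def pathway_cost(pathway):
    """C(p): per-gene penalties plus the inverse of the nominal flux."""
    gene_terms = sum(
        1.0 / gene["perf"]      # poor enzyme performance is penalized
        + gene["het"]           # heterologous expression penalty
        + sum(gene["tox"])      # toxicity of the products of this gene
        for gene in pathway["genes"]
    )
    return gene_terms + 1.0 / pathway["flux"]


def best_pathway(pathways):
    """p* = arg min over candidate pathways of C(p)."""
    return min(pathways, key=pathway_cost)


# Two hypothetical candidate pathways
candidates = [
    {"name": "p1", "flux": 4.0,
     "genes": [{"perf": 2.0, "het": 0.5, "tox": [0.1, 0.2]}]},
    {"name": "p2", "flux": 2.0,
     "genes": [{"perf": 1.0, "het": 0.0, "tox": [0.0]},
               {"perf": 0.5, "het": 1.0, "tox": [0.5]}]},
]
winner = best_pathway(candidates)
```

Because every term is additive, longer pathways accumulate cost unless their enzymes perform well and their intermediates are non-toxic.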


Predicting Compound Toxicity
MIC (IC50) assays in E. coli for commercial chemical compounds, including
antibiotics
Molecular signature-based QSAR model

[A.G. Planson, E. Paillard, F. Vogliolo, P. Carbonell, J.L. Faulon, unpublished]


Enzyme Performance

Putative reactions R* discovered in the signature space ʰσ(R) by the retrosynthesis algorithm often lack annotated enzyme sequences in databases
A protein design procedure has to be implemented in order to identify the best heterologous enzyme sequence candidate to insert

Conceptually, the idea is to define
a metric in the reaction σ(R) and sequence σ(S) signature spaces
a convolution operation * between both spaces that generates the kernel function k((R1, S1), (R2, S2))
a machine-learning algorithm

In practical terms, we are searching in the sequence space S for enzymes with a putative level of promiscuity for the desired reaction R*


Taking Advantage of Enzyme Promiscuity in Protein Engineering

Enzymes can potentially process multiple substrates or reactions
We can study enzyme promiscuity to enhance enzyme efficiency by protein engineering techniques
Enzyme promiscuity is an intermediate step in directed evolution

[Tracewell and Arnold, 2009]


A Quantitative Definition of Enzyme Promiscuity

Definitions
Enzyme multispecificity: the ability of enzymes to transform a broad range of closely
related substrates
Promiscuous function: enzyme activities other than the native one

Using reaction signatures to measure promiscuity :
An enzyme is promiscuous if it catalyzes at least 2 reactions with different signatures
Reaction chemical diversity for reactions R_A and R_B at height h:

{}^{h}d(R_A, R_B) = 1 - \frac{\| {}^{h}\sigma(R_A) \cdot {}^{h}\sigma(R_B) \|}{\| {}^{h}\sigma(R_A) \|^2 + \| {}^{h}\sigma(R_B) \|^2 - \| {}^{h}\sigma(R_A) \cdot {}^{h}\sigma(R_B) \|}    (12)
Depending on the chosen h range, it is possible to distinguish between catalytic
promiscuity and substrate specificity
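
The chemical diversity measure of Eq. (12) has the form of a Tanimoto-style distance and can be sketched in Python by representing each reaction signature as a sparse mapping from a fragment label to its count. The fragment labels below are made up for illustration, and treating two empty signatures as identical is an assumption of this sketch.

```python
def dot(sig_a, sig_b):
    """Dot product of two sparse signature vectors."""
    return sum(count * sig_b.get(key, 0) for key, count in sig_a.items())


def signature_distance(sig_a, sig_b):
    """Distance between two reaction signatures, following Eq. (12)."""
    ab = abs(dot(sig_a, sig_b))  # counts may be negative for consumed fragments
    denom = dot(sig_a, sig_a) + dot(sig_b, sig_b) - ab
    if denom == 0:
        return 0.0               # both signatures empty (assumption)
    return 1.0 - ab / denom


# Hypothetical signatures at some fixed height h
r_a = {"frag1": 1, "frag2": 2}
r_b = {"frag2": 2, "frag3": 1}
d_ab = signature_distance(r_a, r_b)
```

Identical signatures give a distance of 0 and signatures with no fragment in common give 1, so an enzyme can be flagged as promiscuous when any pair of its reactions has a non-zero distance.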


Catalytic and Substrate Promiscuity

Given two reactions R_A and R_B that an enzyme can process:
The enzyme has catalytic promiscuity if

{}^{1}\sigma(R_A) \neq {}^{1}\sigma(R_B)    (13)

(We look at the bonds that are created and/or broken by the chemical transformation)

The enzyme has substrate promiscuity if

{}^{0-3}\sigma(R_A) \neq {}^{0-3}\sigma(R_B)    (14)

(We look at the chemical structures of the substrates)


Molecular Signatures-Based Prediction of Enzyme Promiscuity

Building the dataset


Support Vector Machine Algorithm

Signature space is high-dimensional:
2-mers: 20^2
3-mers: 20^3
4-mers: 20^4
...

The SVM algorithm selects the weighted combination of data points (support
vectors) that performs the best separation
We compute from the support vectors the contribution or α-value of each
signature to the prediction of promiscuity
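
A minimal sketch of how k-mer signatures and their contributions might be computed is given below. It assumes a linear kernel, for which the weight of each signature is the sum over support vectors of the dual coefficient (α_i times the class label) multiplied by the k-mer count; the sequences and coefficients are invented, and training the SVM itself is left to a dedicated library.

```python
from collections import Counter, defaultdict


def kmer_signature(sequence, k):
    """Count every overlapping k-mer in a protein sequence."""
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))


def signature_weights(support_vectors):
    """Per-signature contribution for a linear kernel (assumption).

    support_vectors is a list of (dual_coefficient, signature) pairs,
    where dual_coefficient = alpha_i * y_i.
    """
    weights = defaultdict(float)
    for coefficient, signature in support_vectors:
        for kmer, count in signature.items():
            weights[kmer] += coefficient * count
    return dict(weights)


# The signature space has 20**k dimensions for k-mers over 20 residues
dimensions = {k: 20 ** k for k in (2, 3, 4)}

# Hypothetical support vectors with made-up dual coefficients
svs = [(0.5, kmer_signature("ALAA", 2)), (-0.25, kmer_signature("LAAA", 2))]
weights = signature_weights(svs)
top_kmers = sorted(weights, key=weights.get, reverse=True)
```

Ranking k-mers by weight is what yields a list of top signatures associated with promiscuity.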


Performance of the SVM Predictor

Accuracy reaches 85% for the whole dataset
Eukaryotes 88%
Prokaryotes 87%

4-mer    α-value    frequency [%]
ALAA     10.9       13.9
AVAA     10.4       12.7
LAAA     11.3       11.4
ELAA     11.5       10.9
...      ...        ...

Distance to catalytic residues (Catalytic Site Atlas)

The distribution of top k-mers provides insights into promiscuous active regions of the enzyme
Top k-mers are depleted around catalytic sites of non-promiscuous enzymes


Secondary Structure Around Catalytic Sites

Secondary structure distribution
                   Beta      Helix     Loop
All residues       15.69%    40.64%    43.67%
Catalytic sites    23.79%    32.15%    44.05%
Non-promiscuous    20.85%    33.65%    45.50%
Promiscuous        30.00%    29.00%    41.00%

Average deviation from random

Helices are in general underrepresented in catalytic residues
Beta strands are significantly overrepresented in promiscuous enzymes


Top k-mers in Promiscuity


Application: Reverse Engineering of a Promiscuous Transaminase
Promiscuity induced by directed evolution [Rothman and Kirsch, 2003]:
AATase (EC 2.6.1.1) → TATase (EC 2.6.1.5)

Signatures (k-mers) with highest α-value change
[Carbonell, P., Faulon, J.L., Bioinformatics, 2010]



Conclusions

Computational analysis of biological networks can provide insights into the mechanisms of protein binding affinity and specificity

We use molecular graph descriptors in combination with systems-level
characteristics to train machine-learning predictors of protein activity

Applications
Protein optimization
Understanding protein function and evolution
Design of synthetic biological circuits


Acknowledgments

University of Evry / Genopole
iSSB - Faulon's Lab
Metabolic Engineering & Synthetic Biology
Jean-Loup Faulon, Anne-Gaelle Planson, Davide Fichera, Ioana Popescu, Julio Peyroncely, Elodie Paillard, Florence Vogliolo, Chloe Sarnowski, Antoine Decrulle

Fujirebio
Structural Bioinformatics
Antonio del Sol, Hirotomo Fujihashi, Dolors Amoros, Marcos Arauzo-Bravo

Swiss Institute of Bioinformatics
Peptide identification in HPLC/MS
Ron D. Appel, Alexandre Masselot

National Museum of Natural History
Promiscuity & Evolution
Guillaume Lecointre

National Cancer Institute (NIH)
Hot-spots & Specificity
Ruth Nussinov

University of North Carolina
NMR spectroscopy
Andrew Lee

Polytechnic University of Valencia
Computational Intelligence
Jose Luis Navarro, Adolfo Hilario

Polytechnic Institute of NYU
Nonlinear dynamics
Zhong-Ping Jiang, Shiwendra Panwar




Bibliography I

S. C. Rothman and J. F. Kirsch. How does an enzyme evolved in vitro compare to naturally occurring homologs possessing the targeted function? Tyrosine aminotransferase from aspartate aminotransferase. Journal of Molecular Biology, 327(3):593–608, March 2003. ISSN 0022-2836. URL http://view.ncbi.nlm.nih.gov/pubmed/12634055.

Joost Schymkowitz, Jesper Borg, Francois Stricher, Robby Nys, Frederic Rousseau, and Luis Serrano. The FoldX web server: an online force field. Nucleic Acids Research, 33(Web Server issue), July 2005. ISSN 1362-4962. doi: 10.1093/nar/gki387. URL http://dx.doi.org/10.1093/nar/gki387.

Joan Teyra, Maciej Paszkowski-Rogacz, Gerd Anders, and M. Teresa Pisabarro. SCOWLP classification: structural comparison and analysis of protein binding regions. BMC Bioinformatics, 9:9, January 2008. ISSN 1471-2105. doi: 10.1186/1471-2105-9-9. URL http://dx.doi.org/10.1186/1471-2105-9-9.

Cara A. Tracewell and Frances H. Arnold. Directed enzyme evolution: climbing fitness peaks one amino acid at a time. Current Opinion in Chemical Biology, 13(1):3–9, February 2009. ISSN 1879-0402. doi: 10.1016/j.cbpa.2009.01.017. URL http://dx.doi.org/10.1016/j.cbpa.2009.01.017.

