3. Divergence of protein identifiers
2. Methods
7. References
Will the real pharmacologically significant
proteins please stand up?
1. Introduction
Even in their more contemplative moments probably few pharmacologists cogitate on
“so how many human proteins actually exist?” Nevertheless, on a practical level their
engagement with names and identifiers (IDs) for pharmacological protein targets and
disease mechanistic components is intense and includes navigating between
databases and the literature. This work addresses three important aspects of protein
equivocality that pharmacologists may be less aware of but that we encounter head-on
during curation of the IUPHAR/BPS Guide to PHARMACOLOGY (GtoPdb) [1, 2]. These are:
1. Variability in canonical protein counts, from 19,198 in the HUGO Gene Nomenclature
Committee (HGNC) up to 21,341 in GeneCards, indicating a surprising annotation
discordance for at least 10% of the human proteome
2. Uncertainty of alternatively spliced (AS) protein existence. While Ensembl predicts
over 100,000 AS mRNAs, the verification of these by proteomics is 30-fold lower than
expected, implying that the majority do not exist in vivo [3]
3. Evidence that some canonical Swiss-Prot (SP) entries are not the major isoform
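As a rough check on the size of the discordance in point 1, the two quoted counts can be compared directly. The short Python sketch below is illustrative only and not part of the original analysis; the function name `count_discordance` is hypothetical.

```python
def count_discordance(low: int, high: int) -> tuple[int, float, float]:
    """Return the gap between two protein counts and that gap as a percentage of each."""
    gap = high - low
    return gap, 100.0 * gap / low, 100.0 * gap / high


# Counts quoted in point 1 above.
hgnc_count = 19_198
genecards_count = 21_341

gap, pct_of_hgnc, pct_of_genecards = count_discordance(hgnc_count, genecards_count)
print(f"gap = {gap}; {pct_of_hgnc:.1f}% of HGNC; {pct_of_genecards:.1f}% of GeneCards")
```

With these two figures the gap is 2,143 entries, roughly 10 to 11% depending on which count is taken as the denominator.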
Using UniProt we ascertained the 4-way intersect between SP protein IDs, HGNC Gene
Symbols, Ensembl genes and NCBI Gene IDs. The four sets were selected using
cross-reference queries from the UniProt interface. We then accessed our internal
protein statistics including the total human UniProt IDs that we had curated into GtoPdb
and those for which we had annotated data-supported and pharmacologically-relevant
ligand interactions. These were compared to the 4-way sequence set. We also counted
proteins for which UniProt had curated splice forms using the query “Alternative splicing
(KW-0025)”. We then compared these with our ligand interaction set. We also
inspected one splice form that has been annotated in GtoPdb and checked the
information in SP. To address the isoform abundance question we queried the
Annotation of principal and alternative splice isoforms (APPRIS) database to check
targets [4].
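The 4-way intersect described above is, in essence, a set intersection over Swiss-Prot accessions. A minimal sketch in Python follows, assuming each UniProt cross-reference query has been exported as a set of accessions. The function name and the accessions (`ACC1` etc.) are placeholders for illustration, not real query results.

```python
def four_way_intersect(all_sp, with_hgnc, with_ensembl, with_gene_id):
    """Split Swiss-Prot accessions into the 4-way consensus and the SP-only remainder.

    Each argument is a set of Swiss-Prot accessions; the last three hold the
    accessions that carry the corresponding cross-reference.
    """
    consensus = all_sp & with_hgnc & with_ensembl & with_gene_id
    sp_only = all_sp - (with_hgnc | with_ensembl | with_gene_id)
    return consensus, sp_only


# Placeholder accessions for illustration only.
all_sp = {"ACC1", "ACC2", "ACC3", "ACC4", "ACC5"}
with_hgnc = {"ACC1", "ACC2", "ACC3"}
with_ensembl = {"ACC1", "ACC2", "ACC4"}
with_gene_id = {"ACC1", "ACC2", "ACC4"}

consensus, sp_only = four_way_intersect(all_sp, with_hgnc, with_ensembl, with_gene_id)
print(sorted(consensus))
print(sorted(sp_only))
```

In this toy case two accessions carry all three cross-references and one has none, mirroring the consensus and SP-only segments of a 4-way Venn diagram.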
1. Harding SD, et al. (2018). Nucl. Acids Res. 46 (Database Issue): D1091-D1106.
2. Southan C, et al. (2018). ACS Omega 3(7). PMID: 30087946.
3. Rodriguez JM, et al. (2018). Nucl. Acids Res. 46 (Database Issue): D213-D217.
4. Tress ML, et al. (2017). Trends Biochem Sci. 42(2): 98-110.
5. Southan C (2017). F1000Res. 7;6:448.
5. Protein alternative splicing
Christopher Southan, Simon D. Harding, Elena Faccenda, Adam J. Pawson and Jamie A. Davies.
IUPHAR/BPS Guide to PHARMACOLOGY, Centre for Discovery Brain Sciences, University of Edinburgh, UK
6. Discussion points
• In addition to AS touched on here, additional sources of protein equivocality and
heterogeneity include alternative initiations and post-translational modifications.
• Multiplexing these across a canonical set of ~19,000 proteins (itself still without a
consensus) is predicted to yield millions of distinct protein forms.
• The significance of this for pharmacology, systems biology and drug discovery is
acknowledged to be high but getting solid experimental data is difficult.
• GtoPdb users are welcome to alert us to potentially curatable papers on differential
ligand interactions related to any form of protein heterogeneity.
www.guidetopharmacology.org enquiries@guidetopharmacology.org @GuidetoPHARM
4. Comparing the consensus with GtoPdb
We especially thank all contributors, collaborators and NC-IUPHAR members
In the Venn diagram on the right, the 4-way intersect shows that these four major global
pipelines concur for fewer than 19,000 protein-coding genes. Most divergent is the 829
SP-only set. Inspection established that many of these are categorised as pseudogenes by
HGNC [5]. This surprising result includes some missing genomic cross-mappings inside SP.
However, the consensus is close to the HGNC count of 19,118 (note that Ensembl and NCBI
reciprocally cross-map, hence the empty sections).
Our next step was to compare the 4-way set from the comparison above (blue) with a) all
the human proteins we have entered in GtoPdb (yellow) and b) those proteins that have a
curated interaction (mostly quantitative) against one or more of the 9405 ligands (green).
The results were generally as expected in confirming that the majority of our proteins are
within the 4-way set (i.e. solidly supported). However, the analysis was valuable in
detecting minor anomalies (represented by the segments of 5, 6 and 23). These are being
followed up, but a major factor is that some of these are missing GeneID cross-references
in Swiss-Prot (i.e. they are false negatives for the blue set).
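The segment counts of such a three-set comparison can be derived programmatically. The Python sketch below is one possible way to do this, assuming the three protein lists are available as sets of accessions. The set names and accessions are placeholders, and the helper `venn_segments` is hypothetical rather than the tool used for the figure.

```python
from collections import Counter


def venn_segments(named_sets):
    """Count members of each exclusive Venn segment.

    Keys of the result are frozensets naming the sets that contain a member.
    """
    counts = Counter()
    for item in set().union(*named_sets.values()):
        key = frozenset(name for name, members in named_sets.items() if item in members)
        counts[key] += 1
    return counts


# Placeholder accessions for illustration only.
sets = {
    "four_way": {"ACC1", "ACC2", "ACC3", "ACC4"},
    "gtopdb_all": {"ACC3", "ACC4", "ACC5", "ACC6"},
    "gtopdb_interaction": {"ACC4", "ACC6"},
}
segments = venn_segments(sets)

# Proteins curated in GtoPdb but absent from the 4-way set are the ones to follow up.
anomalies = sets["gtopdb_all"] - sets["four_way"]
print(sorted(anomalies))
```

Listing the members of the small segments outside the 4-way set, rather than only counting them, makes it straightforward to check each one for a missing cross-reference.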
It is difficult to find papers with solid data showing AS affecting proteins for which we
have curated ligand interactions and which may thus exhibit differential pharmacology. Many
publications indicate that AS transcription a) is widespread, b) affects the majority of the
mammalian proteome and c) is likely to be functionally important in various biological
contexts (e.g. tumours and brain tissue) even if the mechanisms are unclear.
Notwithstanding, there are major uncertainties in proving the existence of AS proteins
since they are difficult to verify in vivo. We approached this question by counting our
interaction proteins with AS sequence variants annotated in Swiss-Prot.
The results of this are shown on the right. The yellow circle indicates that 52% of human
SP entries have at least one AS protein sequence annotated. This rises slightly to 54% in
our interaction set (blue). Importantly, AS in SP is target-class specific, rising to 70%
for kinases but only 14% for GPCRs (since many are single-exon genes). Note that Ensembl
predicts considerably more potential AS sequences than SP curates.
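A per-class breakdown of this kind amounts to counting, for each target class, the share of proteins carrying the AS keyword. The Python sketch below illustrates the calculation under the assumption that each protein is available as an (accession, target class, has-AS-annotation) record. The records are invented toy values and do not reproduce the percentages reported above.

```python
from collections import defaultdict


def as_percent_by_class(records):
    """Return, per target class, the percentage of proteins with an AS annotation."""
    totals = defaultdict(int)
    with_as = defaultdict(int)
    for _accession, target_class, has_as in records:
        totals[target_class] += 1
        if has_as:
            with_as[target_class] += 1
    return {cls: 100.0 * with_as[cls] / totals[cls] for cls in totals}


# Toy records for illustration only: (accession, target class, AS keyword present).
records = [
    ("ACC1", "kinase", True),
    ("ACC2", "kinase", True),
    ("ACC3", "kinase", True),
    ("ACC4", "kinase", False),
    ("ACC5", "GPCR", True),
    ("ACC6", "GPCR", False),
    ("ACC7", "GPCR", False),
    ("ACC8", "GPCR", False),
]

percentages = as_percent_by_class(records)
print(percentages)
```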
In GtoPdb we only assign quantitative and differentially-specific AS-ligand interactions
if the papers meet our curatorial stringency. We also need evidence that data-supported
differential binding has pharmacological significance. This is challenging for
many reasons that cannot be expanded on here (but we would be pleased to discuss).
Consequently, we have only one AS entry: the interaction between protein target
2903 (claudin18) and antibody ligand 9209 (below, together with the AS first exon).
The specific case of claudin18, and extrapolation to other AS proteins in GtoPdb,
raises the question as to which sequence may be quantitatively dominant (i.e. the
principal isoform in vivo). However, there are inherent challenges in quantifying
AS-specific peptides by mass-spec proteomics or estimating surrogate relative
abundances from transcription data. We thus chose the APPRIS database, which
uses a range of computational methods and fold coverage scores to select the most
likely principal isoform. In this case the two SP isoforms scored equally.