This project developed a process to characterize gene and protein conservation across mammalian species for biological pathways. The process retrieves data from KEGG, BioCarta, Homologene, and UniProt databases to generate matrices showing conservation of genes in pathways across humans, mice, rats, dogs, cows, and chimpanzees. It also identifies known protein variations. The results showed most genes were highly conserved, with some exceptions. Future work includes fully automating the process.
Characterization of genes and proteins across species in biological pathways
Characterization of genes and proteins of cross-species biological pathways
Jennifer Ivy Dong, Douglas James Joubert–NIH Library, Raina Kumar, & Robert Stephen–ABCC/NCI
Introduction

The new era of genomics and proteomics, with the advent of high-throughput technologies such as microarrays and next-generation sequencing, has opened up great opportunities for the life science research community to better understand biological processes. The gene lists obtained from data through these experiments are generally analyzed further in the context of biological pathways as well as with available biological knowledge sets such as specifically described gene ontologies, gene sets and gene enrichments. Efforts are underway to develop new methods to derive biologically meaningful information from the gene lists obtained from such technologies. Although considerable effort has been expended on building, maintaining and distributing these gene sets, a system allowing visualization of their conservation across mammalian species has not been developed. We have developed a process to retrieve information from two pathway databases, KEGG and BioCarta, and combine it with information from other biological databases such as Homologene and UniProt to characterize cross-species conservation of genes and proteins and gain insights into new biological knowledge.

Specifically, we are trying to understand which genes and proteins are common in given pathways across species among mammals such as human (Homo sapiens), mouse (Mus musculus), rat (Rattus norvegicus), dog (Canis lupus familiaris), cow (Bos taurus), and chimpanzee (Pan troglodytes). We also explore the problem of finding the variations or mutations in these genes and proteins that are well tolerated across these species.

Objectives

This project focused on developing methods for deriving the cross-species annotations for genes and protein groups identified in candidate pathways. The project had three primary goals:

1. Produce a matrix containing genes in a particular biological pathway

Materials and methods

The process has four major modules:

1. Identify homologous proteins using the Homologene database
2. Identify homologous proteins using similarity search
3. Find variations using multiple sequence alignments
4. Find all known variations from the UniProt database

The workflow starts with a BioCarta/KEGG pathway name and retrieves the gene list from CGAP with gene sequence IDs. The branches of the flowchart, run with Perl scripts, are:

Identify homologous proteins in the Homologene database
  1. Retrieve the homolog group ID for each gene from the Homologene database at NCBI
  2. Report value 1 for each mammalian species that has a homolog of the gene
  3. Populate matrices (heat map), where genes are at X-axis and species at Y-axis

Identify homologous proteins by similarity search
  1. Map sequence ID to protein ID using bioDBnet
  2. Perform BLASTP for the proteins
  3. Populate matrices with best hits using the taxonomy report

Find variations from MSA
  1. Fetch sequences using the protein sequence ID for all the homologous genes for each pathway gene for mammals
  2. Perform MSA with ClustalW
  3. Use the *.dnd file to make a cladogram
  4. Search for variations in the *.aln files
  5. Report variations in tab-delimited files

Find known variations
  1. Identify protein IDs of all the proteins for the same species in the NCBI database using the sequence ID, or map the sequence ID to a UniProt ID using bioDBnet
  2. For each protein, search for the UniProt entry in files derived from UniProt
  3. Read known variations from the flat file and return the annotation in a tab-delimited file

Results

Six pathways, three each from BioCarta and KEGG, were analyzed using this process and the results for these pathways are presented below. The matrices for one of the pathways are also shown for illustration.

BioCarta Pathways

Interferon Gamma (IGP): The IGP pathway has a significant role in the body's immune response. It has 6 genes, all well conserved among mammals except for JAK1 and STAT1 in Pan troglodytes.

Nerve Growth Factor (NGF): NGF is important for the survival of neurons during embryonic development and has an effect on the growth of sensory and sympathetic ganglia. It has 20 genes and most are well conserved. Across species the exceptions include DPM2, ELK1, and KLK2. Within species, only Canis lupus familiaris had NGF genes that were less conserved.

Protein Kinase C through G-protein coupled receptor (PKC): GPCRs are involved in signal transduction and play a role in various cellular functions. There are 9 genes in this pathway, and all the genes are extremely well conserved.

KEGG Pathways

Hedgehog Pathway: The hedgehog signaling pathway is believed to govern the growth of embryonic stem cells as well as metamorphosis in general. It has 44 genes, of which 23 are conserved among all represented mammals. Three genes, SPA18, DYRK1A, and BTRC, are common in all mammals.

Basal Transcription Factors (BTF): BTF is a major control point for gene expression in eukaryotes and it contains 34 genes. Most genes in this pathway are well conserved except GTF2A1L and STON1.

Dorsal-ventral axis formation (DVF): The DVF pathway is controlled by GRK and EGFR and is important in limb development. It has 29 genes and most of the genes are well conserved, the exception being FMN2. Matrices obtained through the Homologene and similarity search methods are shown below:

[Figure: conservation matrices (heat maps) for the DVF pathway. Rows are gene symbols and columns are species; 1 marks a homolog found in that species and 0 marks none found. Rows shown include BRAF, CPEB1, EGFR, ERBB2, ERBB4, ETS1, ETS2, ETV6, ETV7, FMN2, and GRB2. The rows listed below are those that survive intact; columns follow the order of the original figure.]

Homologene matrix (6 species: Human, Mouse, Rat, Dog, Cow, Chimp)
ERBB2  1 1 1 0 1 0
ERBB4  1 1 1 0 1 1
ETS1   1 1 1 1 1 0
ETS2   1 1 1 1 1 1
ETV6   1 1 1 1 1 1
ETV7   1 0 1 0 0 1

Similarity Search matrix (22 species: Human, Chimp, Bonobo, Gorilla, Sumatran Orangutan, Rhesus Macaque, Cynomolgus monkey, Western baboon, Mouse, mice, Rat, Syrian Hamster, European Rabbit, Dog, Cat, Cow, Domestic Sheep, Horse, Wild boar, White Bear, Opossum, Platypus)
ERBB2  1 1 1 1 1 1 1 1 1 1 0 0 1 0 1 1 0 0 0 0 0 0
ERBB4  1 1 1 1 1 1 1 1 1 1 0 0 1 0 1 1 0 0 0 0 0 0
ETS1   1 1 1 1 1 1 1 1 1 1 0 0 1 0 0 0 1 1 0 0 0 0
ETS2   1 1 1 1 1 1 1 1 1 1 0 0 1 0 0 0 0 1 0 0 0 0
ETV6   1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0
ETV7   1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0

Conclusions

We developed a process for characterizing cross-species conservation of genes and proteins for mammals, and finding

Future work

Future work includes:

1. Fully automate the process
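As an illustration of the Homologene module described under Materials and methods (look up the homolog group of each pathway gene, then report 1 for every species represented in that group), the step could be sketched in Python roughly as follows. This is a minimal sketch and not the project's Perl implementation. The function names, the toy gene names, and the toy rows are hypothetical, and the input layout (tab-delimited, with group ID, taxonomy ID, gene ID, and gene symbol as the first four columns) is an assumption. The taxonomy IDs are the NCBI Taxonomy identifiers for the six mammals named in the Introduction.

```python
import csv
import io

# NCBI Taxonomy IDs for the six mammals named in the Introduction.
SPECIES = {
    "human": 9606,
    "mouse": 10090,
    "rat": 10116,
    "dog": 9615,
    "cow": 9913,
    "chimp": 9598,
}


def parse_homolog_rows(handle):
    """Yield (group_id, tax_id, symbol) from tab-delimited rows.

    Assumed column order: group ID, taxonomy ID, gene ID, gene symbol.
    """
    for row in csv.reader(handle, delimiter="\t"):
        if len(row) < 4:
            continue  # skip blank or malformed lines
        yield row[0], int(row[1]), row[3]


def conservation_matrix(rows, pathway_genes, reference_tax=9606):
    """Return {gene: {species: 0 or 1}} for the given pathway genes."""
    rows = list(rows)
    # Map each reference-species gene symbol to its homolog group.
    group_of = {sym: gid for gid, tax, sym in rows if tax == reference_tax}
    # Collect the taxonomy IDs present in each homolog group.
    taxa_in_group = {}
    for gid, tax, _ in rows:
        taxa_in_group.setdefault(gid, set()).add(tax)
    matrix = {}
    for gene in pathway_genes:
        present = taxa_in_group.get(group_of.get(gene), set())
        matrix[gene] = {name: int(tax in present) for name, tax in SPECIES.items()}
    return matrix


# Toy rows for illustration only; these are not real database records.
TOY_ROWS = [
    ("1", "9606", "101", "GENE_A"),
    ("1", "10090", "201", "Gene_a"),
    ("1", "10116", "301", "Gene_a"),
    ("1", "9615", "401", "GENE_A"),
    ("1", "9913", "501", "GENE_A"),
    ("1", "9598", "601", "GENE_A"),
    ("2", "9606", "102", "GENE_B"),
    ("2", "10090", "202", "Gene_b"),
]
TOY_TEXT = "\n".join("\t".join(r) for r in TOY_ROWS)

matrix = conservation_matrix(
    parse_homolog_rows(io.StringIO(TOY_TEXT)),
    ["GENE_A", "GENE_B", "GENE_C"],
)

# Print one tab-delimited row per gene, species as columns.
print("\t".join(["gene"] + list(SPECIES)))
for gene, row in matrix.items():
    print("\t".join([gene] + [str(row[name]) for name in SPECIES]))
```

In this toy run, GENE_A gets a 1 in every column, GENE_B only for human and mouse, and GENE_C (absent from the input) a row of zeros. Writing the matrix as tab-delimited text mirrors the tab-delimited reports mentioned in the workflow; drawing the heat map would be a separate step.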