Annotating nc-RNAs with Rfam

Luca Cozzuto @ Bioinformatics Core
http://rfam.sanger.ac.uk/

Non-coding RNA genes encode a functional RNA
product rather than a protein.

Families of functional RNAs:

Biological function: RNA families

- Involved in protein synthesis: tRNA, rRNA, SRP RNA, tmRNA
- Post-transcriptional modification or DNA replication: snRNA, snoRNA, SmY, scaRNA, gRNA, RNase P, RNase MRP, Y RNA, telomerase RNA
- Regulatory RNAs: aRNA, NAT, crRNA, long ncRNA, miRNA, piRNA, siRNA, tasiRNA, rasiRNA, 7SK
- Parasitic RNA: retrotransposon, viroid, satellite RNA

The majority of functional RNAs fold into stable
structures that are essential for their biological
activity.

[Figure: example secondary structures of a microRNA precursor, a tRNA, the U2 spliceosomal RNA, and part of a riboswitch]

Unlike protein-coding genes, functional RNAs often
show no significant sequence similarity but preserve a
base-paired secondary structure.

ncRNA_1 AAAAAAGGGGTTTTTT
ncRNA_2 AAATAAGGGGTTATTT
Struct  ((((((....))))))

This makes it very difficult to search for those genes
by looking only at sequence similarity (e.g. by using
BLAST or FASTA).
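The two example sequences above can be checked against the shared structure with a few lines of Python. This is only a minimal sketch: the function names are invented for illustration, T stands in for U as on the slide, and allowing G-T (wobble) pairs is an assumption.

```python
# Base pairs accepted in a stem. The slide writes sequences with T, so T stands in for U.
# Including G-T (the G-U wobble pair) is an assumption of this sketch.
ALLOWED_PAIRS = {("A", "T"), ("T", "A"), ("G", "C"), ("C", "G"), ("G", "T"), ("T", "G")}


def base_pairs(structure):
    """Return (i, j) index pairs for matching parentheses in a dot-bracket string."""
    stack, pairs = [], []
    for i, symbol in enumerate(structure):
        if symbol == "(":
            stack.append(i)
        elif symbol == ")":
            if not stack:
                raise ValueError("unbalanced structure")
            pairs.append((stack.pop(), i))
    if stack:
        raise ValueError("unbalanced structure")
    return pairs


def fits_structure(sequence, structure):
    """True if every paired position in the structure holds an allowed base pair."""
    return all((sequence[i], sequence[j]) in ALLOWED_PAIRS for i, j in base_pairs(structure))


def identity(seq_a, seq_b):
    """Fraction of positions with identical bases (sequences of equal length)."""
    return sum(a == b for a, b in zip(seq_a, seq_b)) / len(seq_a)


ncrna_1 = "AAAAAAGGGGTTTTTT"
ncrna_2 = "AAATAAGGGGTTATTT"
struct = "((((((....))))))"

print(fits_structure(ncrna_1, struct))  # True
print(fits_structure(ncrna_2, struct))  # True: the A->T change is compensated by T->A
print(identity(ncrna_1, ncrna_2))       # 0.875
```

In ncRNA_2 both partners of one base pair have changed, so the stem is kept even though the sequence differs. A sequence-only comparison sees two mismatches and has no way to know that they compensate each other.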

In the Rfam database, a functional RNA family is
represented by a multiple sequence alignment and a
covariance model.

The model takes into account both sequence and
structure and can be used to scan a genomic
sequence to detect new members of the same family.

The Rfam Seed alignment for the U12 minor spliceosomal RNA family.

Sequence search input: only one sequence,
up to 10 kb

Search methodology

The query sequence is scanned against a library of Rfam sequences using WU-BLAST, with an E-value threshold of 1.0.
Any matches are then scanned against the corresponding covariance model, using the hand-curated threshold for
that family.
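The effect of a family-specific threshold can be sketched as follows. The family names, threshold values and hit records are all hypothetical and only illustrate the idea; the real thresholds are the ones curated by Rfam for each family.

```python
# Hypothetical per-family bit-score thresholds (real values are curated per family by Rfam).
FAMILY_THRESHOLDS = {"FAM_A": 40.0, "FAM_B": 25.0}


def passes_family_threshold(hits, thresholds):
    """Keep only the hits whose bit score reaches the threshold of their own family."""
    return [hit for hit in hits if hit["bits"] >= thresholds[hit["family"]]]


# Hypothetical candidate hits that survived the BLAST prefilter.
candidate_hits = [
    {"family": "FAM_A", "bits": 55.2},
    {"family": "FAM_A", "bits": 31.0},
    {"family": "FAM_B", "bits": 31.0},
]

kept = passes_family_threshold(candidate_hits, FAMILY_THRESHOLDS)
for hit in kept:
    print(hit["family"], hit["bits"])
```

The same score of 31.0 is accepted for one family and rejected for the other, which is the point of curating the cutoff per family instead of using a single global value.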

Results
Positive hits are reported together with the bit score, E-value and alignment to the family CM.

Bit score: how well the sequence matches your model.
The score reflects whether the sequence matches better to the profile model (positive score) or to the null model of
nonhomologous sequences (negative score).
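The sign of the score follows from its log-odds form, which can be illustrated with a small sketch. The probabilities below are invented toy numbers, not values computed by any real covariance model.

```python
import math


def bit_score(p_model, p_null):
    """Log-odds score in bits: log2 of P(sequence | model) / P(sequence | null model)."""
    return math.log2(p_model / p_null)


# Toy probabilities, chosen only to show the sign of the score.
print(bit_score(1e-10, 1e-13) > 0)  # True: the family model explains the sequence better
print(bit_score(1e-15, 1e-13) < 0)  # True: the null model explains the sequence better
print(bit_score(8.0, 1.0))          # 3.0 (a ratio of 8 is exactly 3 bits)
```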

E-value: expected number of false positives with bit scores at least as high as your hit.
The value depends on the size of the database used for the search.

I Predicted secondary structure
- “<>”, “[ ]”, “{ }”: base pairs
- “_”: hairpin loop
- “-”: interior loop and bulge
- “,”: single-stranded residues in a multifurcation loop
- “:”: external single-stranded residues
- “.”: insertion relative to the consensus

II Consensus of the query model

III Alignment to the model and scoring system
- Capital letter: maximum score
- “:” and “+”: score >= 0, for base pairs and single-stranded residues
- “ ” (blank): negative score

IV Target sequence
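The structure line (I) can be turned into a list of paired positions with a small stack-based parser. The sketch below handles only the three bracket types listed in the legend and assumes they are properly nested; the example string is made up and does not come from a real Rfam hit.

```python
# Bracket types from the legend above; every other symbol is treated as unpaired.
OPENERS = {"<": ">", "[": "]", "{": "}"}
CLOSERS = {close: open_ for open_, close in OPENERS.items()}


def paired_positions(structure):
    """Return sorted 0-based (i, j) index pairs for a properly nested structure line."""
    stack, pairs = [], []
    for i, symbol in enumerate(structure):
        if symbol in OPENERS:
            stack.append((symbol, i))
        elif symbol in CLOSERS:
            if not stack or stack[-1][0] != CLOSERS[symbol]:
                raise ValueError(f"unmatched {symbol!r} at position {i}")
            pairs.append((stack.pop()[1], i))
    if stack:
        raise ValueError("unclosed bracket in structure line")
    return sorted(pairs)


# Made-up structure line: an outer stem enclosing two hairpins.
example = "[[,<<<___>>>,,<<___>>,]]"
for i, j in paired_positions(example):
    print(i, j)
```

Knowing the paired positions makes it easy to inspect which residues of the target sequence (IV) face each other in a stem.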

Going to the family information
A Wikipedia summary of the family is shown together with the information stored in the database.

The sequences belonging to the family can be viewed (if there are not too many).

Both seed and full alignments of members can be displayed.

The secondary structure can be viewed.

The tree of genomes containing members of the family can also be browsed.

If a PDB entry is available, the three-dimensional structure can also be viewed.

Publications on the family are linked as well.

Problems in searching sequences

- To speed up the search, a filtering step based on a BLAST search is necessary. This decreases the sensitivity in
finding true homologues of the functional RNA family.

- The genomes of higher eukaryotes contain many ncRNA-derived pseudogenes and repeats that look like structured
functional RNAs.

Gardner PP, et al. Bateman A. Rfam: updates to the RNA families database. Nucleic Acids Res. 2009

Batch search
You can upload a file containing several sequences in FASTA format. A job generally takes 48 hours.

Files must have fewer than 100,000 lines and fewer
than 1000 sequences, each shorter than
200,000 nucleotides.
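A file can be checked against these limits before uploading. The sketch below is not part of Rfam: the function name is invented, and it assumes that the 200,000-nucleotide limit applies to each sequence separately.

```python
MAX_LINES = 100_000       # files must have fewer lines than this
MAX_SEQUENCES = 1000      # files must have fewer sequences than this
MAX_SEQ_LENGTH = 200_000  # assumed to be a per-sequence limit


def check_batch_fasta(text):
    """Return a list of problems found; an empty list means all limits are met."""
    lines = text.splitlines()
    lengths = []  # one entry per sequence, in file order
    for line in lines:
        if line.startswith(">"):
            lengths.append(0)
        elif lengths and line.strip():
            lengths[-1] += len(line.strip())

    problems = []
    if len(lines) >= MAX_LINES:
        problems.append(f"too many lines: {len(lines)}")
    if len(lengths) >= MAX_SEQUENCES:
        problems.append(f"too many sequences: {len(lengths)}")
    if any(length >= MAX_SEQ_LENGTH for length in lengths):
        problems.append("at least one sequence is too long")
    return problems


print(check_batch_fasta(">seq1\nACGTACGT\n>seq2\nGGCCGGCC\n"))  # []
```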

Browsing by genome
Genomes scanned for the presence of Rfam families are listed in the Browse tab.

For each genome, the species, kingdom, number of Rfam families and number of members found within the species (Regions) are reported.

Running a complete search for a whole genome.

You can install the Infernal program locally; it is available at
http://infernal.janelia.org/.

To speed up the search you can also install the rfam_scan.pl script,
available at ftp://ftp.sanger.ac.uk/pub/databases/Rfam/tools/, which relies
on the BLAST program.

Typical usage of Infernal:

cmsearch -o output.aln --tabfile output.tab Rfam.cm infile.fna

Typical usage of rfam_scan.pl

perl rfam_scan.pl -blastdb Rfam.fasta -o outfile.out Rfam.cm infile.fna
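For whole-genome runs it can be convenient to build the command from a script. The sketch below only assembles and prints a cmsearch command line. The helper names are invented, the option names are copied from the slide (other Infernal releases may use different ones), and it assumes that cmsearch expects the CM file before the sequence file.

```python
import shlex
import shutil
import subprocess


def cmsearch_command(cm_file, seq_file, aln_out, tab_out):
    """Build the argument list with the options named on the slide.

    Assumption: the CM file comes before the sequence file.
    """
    return ["cmsearch", "-o", aln_out, "--tabfile", tab_out, cm_file, seq_file]


def run_if_installed(command):
    """Run the command only when its executable is found on PATH."""
    if shutil.which(command[0]) is None:
        print(f"{command[0]} not found on PATH; command not run")
        return None
    return subprocess.run(command, check=True)


command = cmsearch_command("Rfam.cm", "infile.fna", "output.aln", "output.tab")
print(shlex.join(command))
```

Calling `run_if_installed(command)` would start the search only if Infernal is installed. Passing the arguments as a list avoids shell quoting problems with unusual file names.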
