Pau Corral Montañés

         RUGBCN
        03/05/2012
VIDEO:
“what we used to do in Bioinformatics”


note: due to memory space limitations, the video has not been embedded into
this presentation.

Content of the video:
A search in NCBI (www.ncbi.nlm.nih.gov) for the entry NC_001477 is done
interactively, leading to a single entry in the database related to Dengue Virus,
complete genome. Thanks to the facilities that the site provides, a FASTA file for
this complete genome is downloaded and thereafter opened with a text editor.

The purpose of the video is to show how tedious it is to access the site and
download an entry: it takes no fewer than 5 mouse clicks.
Unified records with in-house coding:
        The 3 databases share all sequences, but they use different
        accession numbers (IDs) to refer to each entry.
        They update every night.

        NCBI – ex.: #000134
        EMBL – ex.: #000012
        DDBJ – ex.: #002221

ACNUC
        A "mirror" of the content of other databases.
        A series of commands and their arguments were defined that allow
        (1) database opening,
        (2) query execution,
        (3) annotation and sequence display,
        (4) annotation, species and keywords browsing, and
        (5) sequence extraction

Other Databases
– NCBI - National Center for Biotechnology Information - (www.ncbi.nlm.nih.gov)
– EMBL - European Molecular Biology Laboratory - (www.ebi.ac.uk/embl)
– DDBJ - DNA Data Bank of Japan - (www.ddbj.nig.ac.jp)
– ACNUC - http://pbil.univ-lyon1.fr/databases/acnuc/acnuc.html
ACNUC retrieval programs:

     i) SeqinR - (http://pbil.univ-lyon1.fr/software/seqinr/)

     ii) C language API - (http://pbil.univ-lyon1.fr/databases/acnuc/raa_acnuc.html)

     iii) Python language API - (http://pbil.univ-lyon1.fr/cgi-bin/raapythonhelp.csh)


     a.I) Query_win – GUI client for remote ACNUC retrieval operations
                   (http://pbil.univ-lyon1.fr/software/query_win.html)
     a.II) raa_query – same functionality as Query_win in command line interface
                  (http://pbil.univ-lyon1.fr/software/query.html)
Why use seqinR:
the Wheel          - available packages
Hotline            - discussion forum
Automation         - same source code for different purposes
Reproducibility    - tutorials in vignette format
Fine tuning        - function arguments

Usage of the R environment!!



#Install the package seqinr_3.0-6 (April 2012)

>install.packages("seqinr")
package ‘seqinr’ was built under R version 2.14.2




A vignette:
http://seqinr.r-forge.r-project.org/seqinr_2_0-7.pdf
#Choose a mirror
>chooseCRANmirror()
>install.packages("seqinr")

>library(seqinr)

#The command lseqinr() lists everything that is defined in the package:
>lseqinr()[1:9]

  [1]   "a"              "aaa"
  [3]   "AAstat"         "acnucclose"
  [5]   "acnucopen"      "al2bp"
  [7]   "alllistranks"   "alr"
  [9]   "amb"

>length(lseqinr())
[1] 209
How many different ways are there to work with biological sequences using SeqinR?

1) Sequences you have locally:

     i) read.fasta(), s2c(), c2s(), GC(), count() and translate()
     ii) write.fasta()
     iii) read.alignment() and consensus()

2) Sequences you download from a Database:

     i) browse Databases
     ii) query() and getSequence()
FASTA files example:
Example with DNA data: (4 different characters, normally)
Example with Protein data: (20 different characters, normally)

The FASTA format is very simple and widely used for simple import of biological
sequences. It begins with a single-line description starting with the character '>',
followed by lines of sequence data of at most 80 characters each. Lines starting
with a semicolon character ';' are comment lines.

Check Wikipedia for:
     i) Sequence representation
     ii) Sequence identifiers
     iii) File extensions
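The format rules above can be sketched in a few lines of base R. This is only an illustration of the format, not how read.fasta() is implemented; the function name parse_fasta_sketch and the toy records are made up for this example, and the sketch assumes every header is followed by at least one sequence line.

```r
# Hypothetical two-record FASTA text, made up for illustration
fasta_lines <- c(
  ">seq1 first toy record",
  "; a comment line",
  "acgtac",
  "gtta",
  ">seq2 second toy record",
  "ttgaca"
)

parse_fasta_sketch <- function(lines) {
  # drop comment lines (starting with ';') and blank lines
  lines <- lines[!startsWith(lines, ";") & nzchar(lines)]
  is_header <- startsWith(lines, ">")
  record_id <- cumsum(is_header)   # record index of every line
  # concatenate the sequence lines that belong to the same record
  seqs <- tapply(lines[!is_header], record_id[!is_header], paste, collapse = "")
  # name each sequence by the first word of its header, without the '>'
  ids <- sub("\\s.*$", "", sub("^>", "", lines[is_header]))
  setNames(as.list(as.vector(seqs)), ids)
}

records <- parse_fasta_sketch(fasta_lines)
print(records)
```

Each record comes back as one string, comparable to what read.fasta() returns with as.string = TRUE.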
Read a file with read.fasta()

#Read the sequence from a local directory
> setwd("H:/Documents and Settings/Pau/Mis documentos/R_test")
> dir()
[1] "dengue_whole_sequence.fasta"

#Use the read.fasta (see next slide) function to load the sequence
> read.fasta(file="dengue_whole_sequence.fasta", seqtype="DNA")
$`gi|9626685|ref|NC_001477.1|`
[1] "a" "g" "t" "t" "g" "t" "t" "a" "g" "t" "c" "t" "a" "c" "g" "t" "g" "g"
[19] "a" "c" "c" "g" "a" "c" "a" "a" "g" "a" "a" "c" "a" "g" "t" "t" "t" "c"
[37] "g" "a" "a" "t" "c" "g" "g" "a" "a" "g" "c" "t" "t" "g" "c" "t" "t" "a"
[55] "a" "c" "g" "t" "a" "g" "t" "t" "c" "t" "a" "a" "c" "a" "g" "t" "t" "t"
  [............................................................]
[10711] "t" "g" "g" "t" "g" "c" "t" "g" "t" "t" "g" "a" "a" "t" "c" "a" "a" "c"
[10729] "a" "g" "g" "t" "t" "c" "t"
attr(,"name")
[1] "gi|9626685|ref|NC_001477.1|"
attr(,"Annot")
[1] ">gi|9626685|ref|NC_001477.1| Dengue virus 1, complete genome"
attr(,"class")
[1] "SeqFastadna"



> read.fasta(file="dengue_whole_sequence.fasta", seqtype="DNA", as.string=T, set.attributes=F)
$`gi|9626685|ref|NC_001477.1|`
[1] "agttgttagtctacgtggaccgacaagaacagtttcgaatcggaagcttgcttaacgtag...__truncated__......"

> seq <- read.fasta(file="dengue_whole_sequence.fasta", seqtype="DNA", as.string=T, set.attributes=F)
> str(seq)
List of 1
 $ gi|9626685|ref|NC_001477.1|: chr "agttgttagtctacgtggaccgacaagaacagtttcgaatcggaagcttgcttaacgtag...__truncated__......"
Usage:
read.fasta(file, seqtype = c("DNA", "AA"), as.string = FALSE, forceDNAtolower = TRUE,
    set.attributes = TRUE, legacy.mode = TRUE, seqonly = FALSE, strip.desc = FALSE,
    bfa = FALSE, sizeof.longlong = .Machine$sizeof.longlong,
    endian = .Platform$endian, apply.mask = TRUE)

Arguments:
file - path to the FASTA file (relative to the working directory, see getwd(), if no absolute path is given)
seqtype - the nature of the sequence: DNA or AA
as.string - if TRUE sequences are returned as a string instead of a vector of single characters
forceDNAtolower - if TRUE, DNA sequences are returned in lower case
set.attributes - whether sequence attributes should be set
legacy.mode - if TRUE lines starting with a semicolon ';' are ignored
seqonly - if TRUE, only sequences are returned (execution time is divided approximately by a factor 3)
strip.desc - if TRUE, removes the '>' at the beginning of the description line
bfa - if TRUE the fasta file is in MAQ binary format
sizeof.longlong - related to MAQ files
endian - related to MAQ files
apply.mask - related to MAQ files

Value:
a list of vectors of chars
Basic manipulations:
#Turn the sequence into characters and count how many there are
> length(s2c(seq[[1]]))
[1] 10735

#Count the occurrences of each different character
> table(s2c(seq[[1]]))
   a    c    g    t
3426 2240 2770 2299

#Count the fraction of G and C bases in the sequence
> GC(s2c(seq[[1]]))
[1] 0.4666977
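As a rough cross-check, the same three quantities can be computed in base R without the package. This is a sketch of what the calls above report, not SeqinR's own code; the toy string is the first 30 bases printed on the read.fasta() slide, and the variable names are chosen for this example.

```r
toy <- "agttgttagtctacgtggaccgacaagaac"   # first 30 bases of the dengue genome shown earlier

bases <- strsplit(toy, "")[[1]]   # string -> vector of single characters, like s2c()
n_bases <- length(bases)          # sequence length
base_counts <- table(bases)       # occurrences of each character

# G+C fraction: g and c counts over all a/c/g/t counts
gc_fraction <- sum(bases %in% c("g", "c")) / sum(bases %in% c("a", "c", "g", "t"))

# The slide's table gives the whole-genome value by plain arithmetic
genome_gc <- (2240 + 2770) / (3426 + 2240 + 2770 + 2299)

print(n_bases)
print(base_counts)
print(gc_fraction)
print(genome_gc)
```

The last value agrees with the 0.4666977 returned by GC() for the full sequence.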

#Count all possible words in a sequence with a sliding window of size = wordsize
> seq_2 <- "actg"
> count(s2c(seq_2), wordsize=2)
aa ac ag at ca cc cg ct ga gc gg gt ta tc tg tt
 0  1  0  0  0  0  0  1  0  0  0  0  0  0  1  0

> count(s2c(seq_2), wordsize=2, by=2)
aa ac ag at ca cc cg ct ga gc gg gt ta tc tg tt
 0  1  0  0  0  0  0  0  0  0  0  0  0  0  1  0
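The sliding window behind these two results can be sketched in base R. The function count_words_sketch is a hypothetical name for this illustration, not part of SeqinR; it assumes a lower-case a/c/g/t string that is at least wordsize characters long.

```r
count_words_sketch <- function(dna, wordsize = 2, by = 1) {
  alphabet <- c("a", "c", "g", "t")
  # every possible word of the given size, in alphabetical order
  grid <- expand.grid(rep(list(alphabet), wordsize), stringsAsFactors = FALSE)
  words <- sort(do.call(paste0, grid))
  # window start positions: every position for by = 1, every second one for by = 2
  starts <- seq(1, nchar(dna) - wordsize + 1, by = by)
  observed <- substring(dna, starts, starts + wordsize - 1)
  # count the observed words, keeping zero counts for the words never seen
  table(factor(observed, levels = words))
}

overlapping <- count_words_sketch("actg", wordsize = 2)           # windows: ac, ct, tg
non_overlap <- count_words_sketch("actg", wordsize = 2, by = 2)   # windows: ac, tg
print(overlapping)
print(non_overlap)
```

With by = 1 the windows overlap, so "ct" is counted; with by = 2 the window jumps past it, which is why that column drops to 0 in the second output.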

#Translate the genetic sequence into amino acids.
> c2s(translate(s2c(seq[[1]]), frame=0, sens='F', numcode=1))
[1] "SC*STWTDKNSFESEACLT*F*QFFIREQISDEQPTEKDGSTVFQYAETREKPRVNCFTVGEEILK...-Truncated-...
RIAFRPRTHEIGDGFYSIPKISSHTSNSRNFG*MGLIQEEWSDQGDPSQDTTQQRGPTPGEAVPWW*GLEVRGDPPHNNKQHIDAGRD
QRSCCLYSIIPGTERQKMEWCC*INRF"
#Note:
#    1)frame=0 means that the first nucleotide is taken as the first position in the codon
#    2)sens='F' means that the sequence is translated in the forward sense
#    3)numcode=1 means that the standard genetic code is used (in SeqinR, up to 23 variations)
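To make the codon lookup concrete, here is a minimal base R sketch of a forward, frame 0 translation with the standard genetic code. It is an illustration only, not SeqinR's translate(); the names genetic_code and translate_sketch are made up for this example, and the sketch handles only lower-case a/c/g/t input.

```r
# Standard genetic code, with codons enumerated in the order t, c, a, g
bases <- c("t", "c", "a", "g")
grid <- expand.grid(third = bases, second = bases, first = bases,
                    stringsAsFactors = FALSE)
codons <- paste0(grid$first, grid$second, grid$third)   # ttt, ttc, tta, ttg, tct, ...
amino_acids <- strsplit(
  "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG", "")[[1]]
genetic_code <- setNames(amino_acids, codons)

translate_sketch <- function(dna) {
  n_codons <- nchar(dna) %/% 3                      # ignore a trailing partial codon
  starts <- seq(1, by = 3, length.out = n_codons)   # frame 0: start at the first base
  paste(genetic_code[substring(dna, starts, starts + 2)], collapse = "")
}

protein_start <- translate_sketch("agttgttagtctacgtggaccgacaagaac")
print(protein_start)
```

The input is the first 30 bases of the dengue genome, so the result "SC*STWTDKN" matches the beginning of the translation printed above.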
Write a FASTA file (and how to save time)
> dir()
[1] "dengue_whole_sequence.fasta" "ER.fasta"
> system.time(seq <- read.fasta(file="ER.fasta", seqtype="DNA", as.string=T, set.attributes=F))
   user system elapsed
   3.28    0.00    3.33

> length(seq)
[1] 9984
> seq[[1]]
[1] "gtggtatcaacgcagagtacgcggggacgtttatatggacgctcctacaaggaaaccctagcctt..._truncated_...ctcata"
> names(seq)[1:5]
[1] "ER0101A01F" "ER0101A03F" "ER0101A05F" "ER0101A07F" "ER0101A09F"

#Clean the sequences keeping only the ones >= 25 nucleotides
> char_seq = rapply(seq, s2c, how="list")       # create a list turning strings to characters
> len_seq = rapply(char_seq, length, how="list") # count how many characters are in each list
> bigger_25_list = c()
> for (val in 1:length(len_seq)){
+     if (len_seq[[val]] >= 25){
+         bigger_25_list = c(bigger_25_list, val)
+     }
+ }
> seq_25 = seq[bigger_25_list]            #indexing to get the desired list
> length(seq_25)
[1] 8928
> seq_25[1]
$ER0101A01F
[1] "gtggtatcaacgcagagtacgcggggacgtttatatggacgctcctacaaggaaaccctagccttctcatacct...truncated..."
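When the sequences are held as strings, the same length filter can be written without a loop. The following is a sketch on a made-up list (toy_seqs and its read names are hypothetical stand-ins for the list returned by read.fasta() with as.string = TRUE).

```r
# Hypothetical stand-in for a list of sequences read as strings
toy_seqs <- list(readA = strrep("acgt", 10),   # 40 nt
                 readB = "acgtacgt",           #  8 nt
                 readC = strrep("a", 25))      # exactly 25 nt

# nchar() is vectorised, so one logical index replaces the for loop
keep <- nchar(unlist(toy_seqs)) >= 25
toy_25 <- toy_seqs[keep]

print(names(toy_25))
```

The comparison is >= 25, as in the loop above, so a read of exactly 25 nucleotides is kept.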
#Write a FASTA file (see next slide)
> write.fasta(seq_25, names = names(seq_25), file.out="clean_seq.fasta", open="w")
> dir()
[1] "clean_seq.fasta" "dengue_whole_sequence.fasta" "ER.fasta"
Usage:
write.fasta(sequences, names, file.out, open = "w", nbchar = 60)

Arguments:
sequences - A DNA or protein sequence (in the form of a vector of single characters) or a list of such sequences.
names - The name(s) of the sequences.
file.out - The name of the output file.
open - Mode used to open the output file: use "w" to write into a new file, use "a" to append at the end of an already existing file.
nbchar - The number of characters per line (default: 60)


Value:
None in the R workspace. A FASTA-formatted file is created.
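What the line-width argument does to the output file can be sketched in base R. This is not SeqinR's write.fasta(): the function write_fasta_sketch is hypothetical, it takes each sequence as a single non-empty string rather than a vector of characters, and it always overwrites the output file.

```r
write_fasta_sketch <- function(sequences, seq_names, file.out, nbchar = 60) {
  con <- file(file.out, open = "w")
  on.exit(close(con))
  for (i in seq_along(sequences)) {
    s <- sequences[[i]]
    starts <- seq(1, nchar(s), by = nbchar)            # start of each output line
    ends <- pmin(starts + nbchar - 1, nchar(s))        # last line may be shorter
    writeLines(paste0(">", seq_names[i]), con)         # description line
    writeLines(substring(s, starts, ends), con)        # wrapped sequence lines
  }
}

out_file <- tempfile(fileext = ".fasta")
write_fasta_sketch(list(strrep("acgt", 40)), "toy160", out_file, nbchar = 60)
written <- readLines(out_file)
print(nchar(written))   # header length, then the length of each sequence line
```

A 160-nucleotide sequence wrapped at 60 characters gives two full lines and one shorter line after the header.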
Write a FASTA file (and how to save time)
# Remember what we did before
> system.time(seq <- read.fasta(file="ER.fasta", seqtype="DNA", as.string=T, set.attributes=F))
   user system elapsed
   3.28    0.00    3.33


# After cleaning seq, we produced a file "clean_seq.fasta" that we read now
> system.time(seq_1 <- read.fasta(file="clean_seq.fasta", seqtype="DNA", as.string=T, set.attributes=F))
   user system elapsed
   1.30    0.01    1.31

#Time can be saved with the save function:
> save(seq_25, file = "ER_CLEAN_seqs.RData")

> system.time(load("ER_CLEAN_seqs.RData"))
   user system elapsed
   0.11    0.02    0.12
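One detail of save() and load() is worth a small sketch: load() restores objects under the names they were saved with, so after the call above the data is back as seq_25. The example below uses made-up random sequences and a temporary file; all names in it are hypothetical.

```r
set.seed(1)
# Hypothetical data: 2000 random sequences of 200 nucleotides each
toy_seqs <- replicate(
  2000,
  paste(sample(c("a", "c", "g", "t"), 200, replace = TRUE), collapse = ""),
  simplify = FALSE
)
names(toy_seqs) <- sprintf("read%04d", seq_along(toy_seqs))

rdata_file <- tempfile(fileext = ".RData")
save(toy_seqs, file = rdata_file)

original <- toy_seqs
rm(toy_seqs)                        # simulate a fresh session
loaded_names <- load(rdata_file)    # returns the names of the restored objects

print(loaded_names)
print(identical(toy_seqs, original))
```

Because the binary file stores the R object directly, no FASTA parsing is needed on reload, which is where the time saving on the slide comes from.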
Read an alignment (or create an alignment object to be aligned):
#Create an alignment object through reading a FASTA file with two sequences (see next slide)
> fasta <- read.alignment(file = system.file("sequences/Anouk.fasta", package = "seqinr"), format ="fasta")
> fasta
$nb
[1] 2
$nam
[1] "LmjF01.0030" "LinJ01.0030"
$seq
$seq[[1]]
[1] "atgatgtcggccgagccgccgtcgtcgcagccgtacatcagcgacgtgctgcggcggtaccagc...truncated..."
$seq[[2]]
[1] "atgatgtcggccgagccgccgtcgtcgcagccgtacatcagcgacgtgctgcggcggtaccagc...truncated..."
$com
[1] NA
attr(,"class")
[1] "alignment"

# The consensus() function computes a consensus sequence from the aligned sequences. With method="IUPAC", IUPAC ambiguity symbols are used.
> fixed_align = consensus(fasta, method="IUPAC")

> table(fixed_align)
fixed_align
  a   c   g   k   m   r   s   t   w   y
411 636 595   3   5  20  13 293   2  20
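The symbols k, m, r, s, w and y in the table are IUPAC ambiguity codes for positions where the two sequences differ. A minimal base R sketch of such a column-wise consensus follows; it is not SeqinR's consensus(), the name consensus_sketch is made up, and it assumes equal-length, gap-free, lower-case a/c/g/t sequences.

```r
# IUPAC nucleotide codes, keyed by the sorted set of bases seen in a column
iupac <- c(a = "a", c = "c", g = "g", t = "t",
           ag = "r", ct = "y", cg = "s", at = "w", gt = "k", ac = "m",
           cgt = "b", agt = "d", act = "h", acg = "v", acgt = "n")

consensus_sketch <- function(aligned) {
  cols <- do.call(rbind, strsplit(aligned, ""))   # one row per sequence
  # for every alignment column, build the key from the distinct bases present
  keys <- apply(cols, 2, function(col) paste(sort(unique(col)), collapse = ""))
  paste(unname(iupac[keys]), collapse = "")
}

toy_consensus <- consensus_sketch(c("acgta", "acata"))
print(toy_consensus)
```

In the toy alignment only the third column differs (g versus a), so that position becomes r while the identical columns keep their base.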
Usage:
read.alignment(file, format, forceToLower = TRUE)

Arguments:
file - The name of the file which the aligned sequences are to be read from. If it does not contain an absolute or relative
path, the file name is relative to the current working directory, getwd.
format - A character string specifying the format of the file: mase, clustal, phylip, fasta or msf
forceToLower - A logical defaulting to TRUE stating whether the returned characters in the sequence should be in
lower case



Value:
An object is created of class alignment which is a list with the following components:
nb ->the number of aligned sequences
nam ->a vector of strings containing the names of the aligned sequences
seq ->a vector of strings containing the aligned sequences
com ->a vector of strings containing the commentaries for each sequence or NA if there are no comments
Access a remote server:
> choosebank()
 [1] "genbank"       "embl"          "emblwgs"         "swissprot"   "ensembl"       "hogenom"
 [7] "hogenomdna"    "hovergendna"   "hovergen"        "hogenom5"    "hogenom5dna"   "hogenom4"
[13] "hogenom4dna"   "homolens"      "homolensdna"     "hobacnucl"   "hobacprot"     "phever2"
[19] "phever2dna"    "refseq"        "greviews"        "bacterial"   "protozoan"     "ensbacteria"
[25] "ensprotists"   "ensfungi"      "ensmetazoa"      "ensplants"   "mito"          "polymorphix"
[31] "emglib"        "taxobacgen"    "refseqViruses"

#Access a bank and see complementary information:
> choosebank(bank="genbank", infobank=T)
> ls()
[1] "banknameSocket"
> str(banknameSocket)
List of 9
 $ socket :Classes 'sockconn', 'connection' atomic [1:1] 3
  .. ..- attr(*, "conn_id")=<externalptr>
 $ bankname: chr "genbank"
 $ banktype: chr "GENBANK"
 $ totseqs : num 1.65e+08
 $ totspecs: num 968157
 $ totkeys : num 3.1e+07
 $ release : chr "           GenBank Rel. 189 (15 April 2012) Last Updated: May 1, 2012"
 $ status :Class 'AsIs' chr "on"
 $ details : chr [1:4] "              ****     ACNUC Data Base Content      ****                        " "
      GenBank Rel. 189 (15 April 2012) Last Updated: May 1, 2012" "139,677,722,280 bases; 152,280,170
sequences; 12,313,982 subseqs; 684,079 refers." "Software by M. Gouy, Lab. Biometrie et Biologie Evolutive,
Universite Lyon I "
> banknameSocket$details
[1] "              ****     ACNUC Data Base Content      ****                        "
[2] "           GenBank Rel. 189 (15 April 2012) Last Updated: May 1, 2012"
[3] "139,677,722,280 bases; 152,280,170 sequences; 12,313,982 subseqs; 684,079 refers."
[4] "Software by M. Gouy, Lab. Biometrie et Biologie Evolutive, Universite Lyon I "
Make a query:
# query (see next slide) all sequences that contain the words "virus" and "dengue" in the taxonomy field and
# that are not partial sequences
> system.time(query("All_Dengue_viruses_NOTpartial", '"sp=@virus@" AND "sp=@dengue@" AND NOT "k=partial"'))
   user system elapsed
   0.78    0.00    6.72

> All_Dengue_viruses_NOTpartial[1:4]
$call
query(listname = "All_Dengue_viruses_NOTpartial", query = "\"sp=@virus@\" AND \"sp=@dengue@\" AND NOT \"k=partial\"")
$name
[1] "All_Dengue_viruses_NOTpartial"
$nelem
[1] 7741
$typelist
[1] "SQ"

> All_Dengue_viruses_NOTpartial[[5]][[1]]
    name   length    frame   ncbicg
"A13666"    "456"      "0"      "1"

> myseq = getSequence(All_Dengue_viruses_NOTpartial[[5]][[1]])
> myseq[1:20]
 [1] "a" "t" "g" "g" "c" "c" "a" "t" "g" "g" "a" "c" "c" "t" "t" "g" "g" "t" "g" "a"

> closebank()
Usage:
query(listname, query, socket = autosocket(), invisible = T, verbose = F, virtual = F)

Arguments:
listname - The name of the list as a quoted string of chars
query - A quoted string of chars containing the request with the syntax given in the details section
socket - An object of class sockconn connecting to a remote ACNUC database (default is a socket to the last opened
database).
invisible - if FALSE, the result is returned visibly.
verbose - if TRUE, verbose mode is on
virtual - if TRUE, no attempt is made to retrieve the information about all the elements of the list. In this case, the req
component of the list is set to NA.

Value:
The result is a list with the following 6 components:

call - the original call
name - the ACNUC list name
nelem - the number of elements (for instance sequences) in the ACNUC list
typelist - the type of the elements of the list. Could be SQ for a list of sequence names, KW for a list of keywords, SP for
a list of species names.
req - a list of sequence names that fit the required criteria, or NA when called with virtual = TRUE
socket - the socket connection that was used
Pau Corral Montañés

         RUGBCN
        03/05/2012

Más contenido relacionado

La actualidad más candente

La actualidad más candente (20)

C programming notes
C programming notesC programming notes
C programming notes
 
Gene prediction methods vijay
Gene prediction methods  vijayGene prediction methods  vijay
Gene prediction methods vijay
 
Strings in c language
Strings in  c languageStrings in  c language
Strings in c language
 
Collection Framework in java
Collection Framework in javaCollection Framework in java
Collection Framework in java
 
RNA-Seq
RNA-SeqRNA-Seq
RNA-Seq
 
STL in C++
STL in C++STL in C++
STL in C++
 
Systems biology: Bioinformatics on complete biological systems
Systems biology: Bioinformatics on complete biological systemsSystems biology: Bioinformatics on complete biological systems
Systems biology: Bioinformatics on complete biological systems
 
07 java collection
07 java collection07 java collection
07 java collection
 
How to design a DNA primer on NCBI.pptx
How to design a DNA primer on NCBI.pptxHow to design a DNA primer on NCBI.pptx
How to design a DNA primer on NCBI.pptx
 
Biological data base
Biological data baseBiological data base
Biological data base
 
Chapter 03 python libraries
Chapter 03 python librariesChapter 03 python libraries
Chapter 03 python libraries
 
BioInformatics MCQ
BioInformatics MCQBioInformatics MCQ
BioInformatics MCQ
 
Arrays in c language
Arrays in c languageArrays in c language
Arrays in c language
 
How To Install Python Pip On Windows | Edureka
How To Install Python Pip On Windows | EdurekaHow To Install Python Pip On Windows | Edureka
How To Install Python Pip On Windows | Edureka
 
Java Collections Framework
Java Collections FrameworkJava Collections Framework
Java Collections Framework
 
Java Strings
Java StringsJava Strings
Java Strings
 
Python programming : Strings
Python programming : StringsPython programming : Strings
Python programming : Strings
 
Introduction to Java Strings, By Kavita Ganesan
Introduction to Java Strings, By Kavita GanesanIntroduction to Java Strings, By Kavita Ganesan
Introduction to Java Strings, By Kavita Ganesan
 
Java Input Output and File Handling
Java Input Output and File HandlingJava Input Output and File Handling
Java Input Output and File Handling
 
Dna fingerprinting
Dna fingerprinting Dna fingerprinting
Dna fingerprinting
 

Destacado

Phylogenetics in R
Phylogenetics in RPhylogenetics in R
Phylogenetics in Rschamber
 
Phylogeny in R - Bianca Santini Sheffield R Users March 2015
Phylogeny in R - Bianca Santini Sheffield R Users March 2015Phylogeny in R - Bianca Santini Sheffield R Users March 2015
Phylogeny in R - Bianca Santini Sheffield R Users March 2015Paul Richards
 
Phylogenetics Analysis in R
Phylogenetics Analysis in RPhylogenetics Analysis in R
Phylogenetics Analysis in RKlaus Schliep
 
Evolution of transposons, genomes, and organisms (Hertweck Fall 2014)
Evolution of transposons, genomes, and organisms (Hertweck Fall 2014)Evolution of transposons, genomes, and organisms (Hertweck Fall 2014)
Evolution of transposons, genomes, and organisms (Hertweck Fall 2014)Kate Hertweck
 
Where can tell me who I am?
Where can tell me who I am?Where can tell me who I am?
Where can tell me who I am?seltzoid
 
Zed-Sales™ - a flagship product of Zed-Axis Technologies Pvt. Ltd.
Zed-Sales™ - a flagship product of Zed-Axis Technologies Pvt. Ltd.Zed-Sales™ - a flagship product of Zed-Axis Technologies Pvt. Ltd.
Zed-Sales™ - a flagship product of Zed-Axis Technologies Pvt. Ltd.Rakesh Kumar
 
Hacking your Kindle (OSCON Lightning Talk)
Hacking your Kindle (OSCON Lightning Talk)Hacking your Kindle (OSCON Lightning Talk)
Hacking your Kindle (OSCON Lightning Talk)Jesse Vincent
 
Exotic Orient
Exotic OrientExotic Orient
Exotic OrientRenny
 
IBM Big Data References
IBM Big Data ReferencesIBM Big Data References
IBM Big Data ReferencesRob Thomas
 
Engage Workshop Berlin09 Part2
Engage Workshop Berlin09 Part2Engage Workshop Berlin09 Part2
Engage Workshop Berlin09 Part2Paul Pivec
 
Blackwell Esteem AFSL
Blackwell Esteem AFSLBlackwell Esteem AFSL
Blackwell Esteem AFSLsamueltay77
 
Iman bysajib hossain akash-01725-340978.
Iman bysajib hossain akash-01725-340978.Iman bysajib hossain akash-01725-340978.
Iman bysajib hossain akash-01725-340978.Sajib Hossain Akash
 
Exposing Opportunities in China A50 using CFD
Exposing Opportunities in China A50 using CFDExposing Opportunities in China A50 using CFD
Exposing Opportunities in China A50 using CFDPhillip CFD
 
360Gate Business Objects portal
360Gate Business Objects portal360Gate Business Objects portal
360Gate Business Objects portalSebastien Goiffon
 
Ushul Firaq wal Adyaan wal Madzaahib Al Fikriyah (Ar)
Ushul Firaq wal Adyaan wal Madzaahib Al Fikriyah (Ar)Ushul Firaq wal Adyaan wal Madzaahib Al Fikriyah (Ar)
Ushul Firaq wal Adyaan wal Madzaahib Al Fikriyah (Ar)Sayogyo Rahman Doko
 
Tonometer Final NSF I-Corps presentation
Tonometer Final NSF I-Corps presentationTonometer Final NSF I-Corps presentation
Tonometer Final NSF I-Corps presentationStanford University
 
ShareThis Auto Study
ShareThis Auto Study ShareThis Auto Study
ShareThis Auto Study ShareThis
 

Destacado (20)

Phylogenetics in R
Phylogenetics in RPhylogenetics in R
Phylogenetics in R
 
Phylogeny in R - Bianca Santini Sheffield R Users March 2015
Phylogeny in R - Bianca Santini Sheffield R Users March 2015Phylogeny in R - Bianca Santini Sheffield R Users March 2015
Phylogeny in R - Bianca Santini Sheffield R Users March 2015
 
Phylogenetics Analysis in R
Phylogenetics Analysis in RPhylogenetics Analysis in R
Phylogenetics Analysis in R
 
Evolution of transposons, genomes, and organisms (Hertweck Fall 2014)
Evolution of transposons, genomes, and organisms (Hertweck Fall 2014)Evolution of transposons, genomes, and organisms (Hertweck Fall 2014)
Evolution of transposons, genomes, and organisms (Hertweck Fall 2014)
 
Where can tell me who I am?
Where can tell me who I am?Where can tell me who I am?
Where can tell me who I am?
 
Zed-Sales™ - a flagship product of Zed-Axis Technologies Pvt. Ltd.
Zed-Sales™ - a flagship product of Zed-Axis Technologies Pvt. Ltd.Zed-Sales™ - a flagship product of Zed-Axis Technologies Pvt. Ltd.
Zed-Sales™ - a flagship product of Zed-Axis Technologies Pvt. Ltd.
 
Hacking your Kindle (OSCON Lightning Talk)
Hacking your Kindle (OSCON Lightning Talk)Hacking your Kindle (OSCON Lightning Talk)
Hacking your Kindle (OSCON Lightning Talk)
 
Exotic Orient
Exotic OrientExotic Orient
Exotic Orient
 
IBM Big Data References
IBM Big Data ReferencesIBM Big Data References
IBM Big Data References
 
Engage Workshop Berlin09 Part2
Engage Workshop Berlin09 Part2Engage Workshop Berlin09 Part2
Engage Workshop Berlin09 Part2
 
Blackwell Esteem AFSL
Blackwell Esteem AFSLBlackwell Esteem AFSL
Blackwell Esteem AFSL
 
Bailey capítulo-6
Bailey capítulo-6Bailey capítulo-6
Bailey capítulo-6
 
Iman bysajib hossain akash-01725-340978.
Iman bysajib hossain akash-01725-340978.Iman bysajib hossain akash-01725-340978.
Iman bysajib hossain akash-01725-340978.
 
Exposing Opportunities in China A50 using CFD
Exposing Opportunities in China A50 using CFDExposing Opportunities in China A50 using CFD
Exposing Opportunities in China A50 using CFD
 
Gscm1
Gscm1Gscm1
Gscm1
 
360Gate Business Objects portal
360Gate Business Objects portal360Gate Business Objects portal
360Gate Business Objects portal
 
Ushul Firaq wal Adyaan wal Madzaahib Al Fikriyah (Ar)
Ushul Firaq wal Adyaan wal Madzaahib Al Fikriyah (Ar)Ushul Firaq wal Adyaan wal Madzaahib Al Fikriyah (Ar)
Ushul Firaq wal Adyaan wal Madzaahib Al Fikriyah (Ar)
 
Geohash
GeohashGeohash
Geohash
 
Tonometer Final NSF I-Corps presentation
Tonometer Final NSF I-Corps presentationTonometer Final NSF I-Corps presentation
Tonometer Final NSF I-Corps presentation
 
ShareThis Auto Study
ShareThis Auto Study ShareThis Auto Study
ShareThis Auto Study
 

Similar a SeqinR - biological data handling

Parsing and Type checking all 2^10000 configurations of the Linux kernel
Parsing and Type checking all 2^10000 configurations of the Linux kernelParsing and Type checking all 2^10000 configurations of the Linux kernel
Parsing and Type checking all 2^10000 configurations of the Linux kernelchk49
 
Next-generation sequencing data format and visualization with ngs.plot 2015
Next-generation sequencing data format and visualization with ngs.plot 2015Next-generation sequencing data format and visualization with ngs.plot 2015
Next-generation sequencing data format and visualization with ngs.plot 2015Li Shen
 
Learning to Spot and Refactor Inconsistent Method Names
Learning to Spot and Refactor Inconsistent Method NamesLearning to Spot and Refactor Inconsistent Method Names
Learning to Spot and Refactor Inconsistent Method NamesDongsun Kim
 
Bioinfo ngs data format visualization v2
Bioinfo ngs data format visualization v2Bioinfo ngs data format visualization v2
Bioinfo ngs data format visualization v2Li Shen
 
Profiling distributed Java applications
Profiling distributed Java applicationsProfiling distributed Java applications
Profiling distributed Java applicationsConstantine Slisenka
 
Как разработать DBFW с нуля
Как разработать DBFW с нуляКак разработать DBFW с нуля
Как разработать DBFW с нуляPositive Hack Days
 
Database Firewall from Scratch
Database Firewall from ScratchDatabase Firewall from Scratch
Database Firewall from ScratchDenis Kolegov
 
Terence Barr - jdk7+8 - 24mai2011
Terence Barr - jdk7+8 - 24mai2011Terence Barr - jdk7+8 - 24mai2011
Terence Barr - jdk7+8 - 24mai2011Agora Group
 
2016 bioinformatics i_bio_python_wimvancriekinge
2016 bioinformatics i_bio_python_wimvancriekinge2016 bioinformatics i_bio_python_wimvancriekinge
2016 bioinformatics i_bio_python_wimvancriekingeProf. Wim Van Criekinge
 
Fosdem10
Fosdem10Fosdem10
Fosdem10wremes
 
Language Integrated Query - LINQ
Language Integrated Query - LINQLanguage Integrated Query - LINQ
Language Integrated Query - LINQDoncho Minkov
 
Direct Code Execution - LinuxCon Japan 2014
Direct Code Execution - LinuxCon Japan 2014Direct Code Execution - LinuxCon Japan 2014
Direct Code Execution - LinuxCon Japan 2014Hajime Tazaki
 
Java user group 2015 02-09-java8
Java user group 2015 02-09-java8Java user group 2015 02-09-java8
Java user group 2015 02-09-java8marctritschler
 
Java user group 2015 02-09-java8
Java user group 2015 02-09-java8Java user group 2015 02-09-java8
Java user group 2015 02-09-java8Marc Tritschler
 

Similar a SeqinR - biological data handling (20)

Parsing and Type checking all 2^10000 configurations of the Linux kernel
Parsing and Type checking all 2^10000 configurations of the Linux kernelParsing and Type checking all 2^10000 configurations of the Linux kernel
Parsing and Type checking all 2^10000 configurations of the Linux kernel
 
2015 bioinformatics bio_python
2015 bioinformatics bio_python2015 bioinformatics bio_python
2015 bioinformatics bio_python
 
Next-generation sequencing data format and visualization with ngs.plot 2015
Next-generation sequencing data format and visualization with ngs.plot 2015Next-generation sequencing data format and visualization with ngs.plot 2015
Next-generation sequencing data format and visualization with ngs.plot 2015
 
NGS: Mapping and de novo assembly
NGS: Mapping and de novo assemblyNGS: Mapping and de novo assembly
NGS: Mapping and de novo assembly
 
TiReX: Tiled Regular eXpression matching architecture
TiReX: Tiled Regular eXpression matching architectureTiReX: Tiled Regular eXpression matching architecture
TiReX: Tiled Regular eXpression matching architecture
 
Biopython
BiopythonBiopython
Biopython
 
Learning to Spot and Refactor Inconsistent Method Names
Learning to Spot and Refactor Inconsistent Method NamesLearning to Spot and Refactor Inconsistent Method Names
Learning to Spot and Refactor Inconsistent Method Names
 
Bioinfo ngs data format visualization v2
Bioinfo ngs data format visualization v2Bioinfo ngs data format visualization v2
Bioinfo ngs data format visualization v2
 
Profiling distributed Java applications
Profiling distributed Java applicationsProfiling distributed Java applications
Profiling distributed Java applications
 
Как разработать DBFW с нуля
Как разработать DBFW с нуляКак разработать DBFW с нуля
Как разработать DBFW с нуля
 
Database Firewall from Scratch
Database Firewall from ScratchDatabase Firewall from Scratch
Database Firewall from Scratch
 
Biopython: Overview, State of the Art and Outlook
Biopython: Overview, State of the Art and OutlookBiopython: Overview, State of the Art and Outlook
Biopython: Overview, State of the Art and Outlook
 
Terence Barr - jdk7+8 - 24mai2011
Terence Barr - jdk7+8 - 24mai2011Terence Barr - jdk7+8 - 24mai2011
Terence Barr - jdk7+8 - 24mai2011
 
Java
JavaJava
Java
 
2016 bioinformatics i_bio_python_wimvancriekinge
2016 bioinformatics i_bio_python_wimvancriekinge2016 bioinformatics i_bio_python_wimvancriekinge
2016 bioinformatics i_bio_python_wimvancriekinge
 
Fosdem10
Fosdem10Fosdem10
Fosdem10
 
Language Integrated Query - LINQ
Language Integrated Query - LINQLanguage Integrated Query - LINQ
Language Integrated Query - LINQ
 
Direct Code Execution - LinuxCon Japan 2014
Direct Code Execution - LinuxCon Japan 2014Direct Code Execution - LinuxCon Japan 2014
Direct Code Execution - LinuxCon Japan 2014
 
Java user group 2015 02-09-java8
Java user group 2015 02-09-java8Java user group 2015 02-09-java8
Java user group 2015 02-09-java8
 
Java user group 2015 02-09-java8
Java user group 2015 02-09-java8Java user group 2015 02-09-java8
Java user group 2015 02-09-java8
 



SeqinR - biological data handling

  • 1. Pau Corral Montañés RUGBCN 03/05/2012
  • 2. VIDEO: "what we used to do in Bioinformatics" Note: due to memory space limitations, the video has not been embedded into this presentation. Content of the video: A search in NCBI (www.ncbi.nlm.nih.gov) for the entry NC_001477 is done interactively, leading to a single entry in the database related to Dengue Virus, complete genome. Thanks to the facilities that the site provides, a FASTA file for this complete genome is downloaded and then opened with a text editor. The purpose of the video is to show how tedious it is to access the site and download an entry, which takes no fewer than 5 mouse clicks.
  • 3. Unified records with in-house coding:
      The 3 databases share all sequences, but they use different accession numbers (IDs) to refer to each entry. They update every night.
        NCBI – ex.: #000134
        EMBL – ex.: #000012
        DDBJ – ex.: #002221
        Other Databases
      ACNUC: A "mirror" of the content of other databases. A series of commands and their arguments were defined that allow
        (1) database opening,
        (2) query execution,
        (3) annotation and sequence display,
        (4) annotation, species and keywords browsing, and
        (5) sequence extraction
      – NCBI - National Center for Biotechnology Information - (www.ncbi.nlm.nih.gov)
      – EMBL - European Molecular Biology Laboratory - (www.ebi.ac.uk/embl)
      – DDBJ - DNA Data Bank of Japan - (www.ddbj.nig.ac.jp)
      – ACNUC - http://pbil.univ-lyon1.fr/databases/acnuc/acnuc.html
  • 4.
  • 5. ACNUC retrieval programs:
      i) SeqinR - (http://pbil.univ-lyon1.fr/software/seqinr/)
      ii) C language API - (http://pbil.univ-lyon1.fr/databases/acnuc/raa_acnuc.html)
      iii) Python language API - (http://pbil.univ-lyon1.fr/cgi-bin/raapythonhelp.csh)
      a.I) Query_win – GUI client for remote ACNUC retrieval operations (http://pbil.univ-lyon1.fr/software/query_win.html)
      a.II) raa_query – same functionality as Query_win in command line interface (http://pbil.univ-lyon1.fr/software/query.html)
  • 6. Why use seqinR:
      the Wheel - available packages
      Hotline - discussion forum
      Automation - same source code for different purposes
      Reproducibility - tutorials in vignette format
      Fine tuning - function arguments
      Usage of the R environment!!
      #Install the package seqinr_3.0-6 (April 2012)
      > install.packages("seqinr")
      package ‘seqinr’ was built under R version 2.14.2
      A vignette: http://seqinr.r-forge.r-project.org/seqinr_2_0-7.pdf
  • 7. #Choose a mirror (pick one from the interactive list)
      > chooseCRANmirror()
      > install.packages("seqinr")
      > library(seqinr)
      #The command lseqinr() lists everything that is defined in the package:
      > lseqinr()[1:9]
      [1] "a" "aaa"
      [3] "AAstat" "acnucclose"
      [5] "acnucopen" "al2bp"
      [7] "alllistranks" "alr"
      [9] "amb"
      > length(lseqinr())
      [1] 209
  • 8. How many different ways are there to work with biological sequences using SeqinR?
      1) Sequences you have locally:
         i) read.fasta() and s2c() and c2s() and GC() and count() and translate()
         ii) write.fasta()
         iii) read.alignment() and consensus()
      2) Sequences you download from a Database:
         i) browse Databases
         ii) query() and getSequence()
  • 9. FASTA files example: The FASTA format is very simple and widely used for simple import of biological sequences. It begins with a single-line description starting with the character '>', followed by lines of sequence data of at most 80 characters each. Lines starting with a semicolon character ';' are comment lines. Example with DNA data: (4 different characters, normally) Example with Protein data: (20 different characters, normally) Check Wikipedia for: i) Sequence representation ii) Sequence identifiers iii) File extensions
  • 10. Read a file with read.fasta()
      #Read the sequence from a local directory
      > setwd("H:/Documents and Settings/Pau/Mis documentos/R_test")
      > dir()
      [1] "dengue_whole_sequence.fasta"
      #Use the read.fasta function (see next slide) to load the sequence
      > read.fasta(file="dengue_whole_sequence.fasta", seqtype="DNA")
      $`gi|9626685|ref|NC_001477.1|`
      [1] "a" "g" "t" "t" "g" "t" "t" "a" "g" "t" "c" "t" "a" "c" "g" "t" "g" "g"
      [19] "a" "c" "c" "g" "a" "c" "a" "a" "g" "a" "a" "c" "a" "g" "t" "t" "t" "c"
      [37] "g" "a" "a" "t" "c" "g" "g" "a" "a" "g" "c" "t" "t" "g" "c" "t" "t" "a"
      [55] "a" "c" "g" "t" "a" "g" "t" "t" "c" "t" "a" "a" "c" "a" "g" "t" "t" "t"
      [............................................................]
      [10711] "t" "g" "g" "t" "g" "c" "t" "g" "t" "t" "g" "a" "a" "t" "c" "a" "a" "c"
      [10729] "a" "g" "g" "t" "t" "c" "t"
      attr(,"name")
      [1] "gi|9626685|ref|NC_001477.1|"
      attr(,"Annot")
      [1] ">gi|9626685|ref|NC_001477.1| Dengue virus 1, complete genome"
      attr(,"class")
      [1] "SeqFastadna"
      > read.fasta(file="dengue_whole_sequence.fasta", seqtype="DNA", as.string=T, set.attributes=F)
      $`gi|9626685|ref|NC_001477.1|`
      [1] "agttgttagtctacgtggaccgacaagaacagtttcgaatcggaagcttgcttaacgtag...__truncated__......"
      > seq <- read.fasta(file="dengue_whole_sequence.fasta", seqtype="DNA", as.string=T, set.attributes=F)
      > str(seq)
      List of 1
       $ gi|9626685|ref|NC_001477.1|: chr "agttgttagtctacgtggaccgacaagaacagtttcgaatcggaagcttgcttaacgtag...__truncated__......"
  • 11. Usage:
      read.fasta(file, seqtype = c("DNA", "AA"), as.string = FALSE, forceDNAtolower = TRUE, set.attributes = TRUE, legacy.mode = TRUE, seqonly = FALSE, strip.desc = FALSE, bfa = FALSE, sizeof.longlong = .Machine$sizeof.longlong, endian = .Platform$endian, apply.mask = TRUE)
      Arguments:
      file - path to the FASTA file (taken relative to getwd() if no absolute path is given)
      seqtype - the nature of the sequence: DNA or AA
      as.string - if TRUE sequences are returned as a string instead of a vector of characters
      forceDNAtolower - lower- or upper-case
      set.attributes - whether sequence attributes should be set
      legacy.mode - if TRUE lines starting with a semicolon ';' are ignored
      seqonly - if TRUE, only sequences are returned (execution time is divided approximately by a factor of 3)
      strip.desc - if TRUE, removes the '>' at the beginning
      bfa - if TRUE the fasta file is in MAQ binary format
      sizeof.longlong
      endian - relative to MAQ files
      apply.mask - relative to MAQ files
      Value: a list of vectors of chars
  • 12. Basic manipulations:
      #Turn the sequence into characters and count how many there are
      > length(s2c(seq[[1]]))
      [1] 10735
      #Count the occurrences of each nucleotide
      > table(s2c(seq[[1]]))
         a    c    g    t
      3426 2240 2770 2299
      #Compute the fraction of G and C bases in the sequence
      > GC(s2c(seq[[1]]))
      [1] 0.4666977
      #Count all possible words in a sequence with a sliding window of size = wordsize
      > seq_2 <- "actg"
      > count(s2c(seq_2), wordsize=2)
      aa ac ag at ca cc cg ct ga gc gg gt ta tc tg tt
       0  1  0  0  0  0  0  1  0  0  0  0  0  0  1  0
      > count(s2c(seq_2), wordsize=2, by=2)
      aa ac ag at ca cc cg ct ga gc gg gt ta tc tg tt
       0  1  0  0  0  0  0  0  0  0  0  0  0  0  1  0
      #Translate the genetic sequence into amino acids.
      > c2s(translate(s2c(seq[[1]]), frame=0, sens='F', numcode=1))
      [1] "SC*STWTDKNSFESEACLT*F*QFFIREQISDEQPTEKDGSTVFQYAETREKPRVNCFTVGEEILK...-Truncated-... RIAFRPRTHEIGDGFYSIPKISSHTSNSRNFG*MGLIQEEWSDQGDPSQDTTQQRGPTPGEAVPWW*GLEVRGDPPHNNKQHIDAGRD QRSCCLYSIIPGTERQKMEWCC*INRF"
      #Note:
      # 1) frame=0 means that the first nucleotide is taken as the first position in the codon
      # 2) sens='F' means that the sequence is translated in the forward sense
      # 3) numcode=1 means the standard genetic code is used (in SeqinR, up to 23 variations)
  • 13. Write a FASTA file (and how to save time)
      > dir()
      [1] "dengue_whole_sequence.fasta" "ER.fasta"
      > system.time(seq <- read.fasta(file="ER.fasta", seqtype="DNA", as.string=T, set.attributes=F))
      user system elapsed
      3.28 0.00 3.33
      > length(seq)
      [1] 9984
      > seq[[1]]
      [1] "gtggtatcaacgcagagtacgcggggacgtttatatggacgctcctacaaggaaaccctagcctt..._truncated_...ctcata"
      > names(seq)[1:5]
      [1] "ER0101A01F" "ER0101A03F" "ER0101A05F" "ER0101A07F" "ER0101A09F"
      #Clean the sequences, keeping only the ones with at least 25 nucleotides
      > char_seq = rapply(seq, s2c, how="list") # create a list turning strings into characters
      > len_seq = rapply(char_seq, length, how="list") # count how many characters are in each element
      > bigger_25_list = c()
      > for (val in 1:length(len_seq)){
      +   if (len_seq[[val]] >= 25){
      +     bigger_25_list = c(bigger_25_list, val)
      +   }
      + }
      > seq_25 = seq[bigger_25_list] #indexing to get the desired list
      > length(seq_25)
      [1] 8928
      > seq_25[1]
      $ER0101A01F
      [1] "gtggtatcaacgcagagtacgcggggacgtttatatggacgctcctacaaggaaaccctagccttctcatacct...truncated..."
      #Write a FASTA file (see next slide)
      > write.fasta(seq_25, names = names(seq_25), file.out="clean_seq.fasta", open="w")
      > dir()
      [1] "clean_seq.fasta" "dengue_whole_sequence.fasta" "ER.fasta"
  • 14. Usage:
      write.fasta(sequences, names, file.out, open = "w", nbchar = 60)
      Arguments:
      sequences - A DNA or protein sequence (in the form of a vector of single characters) or a list of such sequences.
      names - The name(s) of the sequences.
      file.out - The name of the output file.
      open - How to open the output file: use "w" to write into a new file, use "a" to append at the end of an already existing file.
      nbchar - The number of characters per line (default: 60)
      Value: none in the R space. A FASTA formatted file is created
  • 15. Write a FASTA file (and how to save time)
      # Remember what we did before
      > system.time(seq <- read.fasta(file="ER.fasta", seqtype="DNA", as.string=T, set.attributes=F))
      user system elapsed
      3.28 0.00 3.33
      # After cleaning seq, we produced a file "clean_seq.fasta" that we read now
      > system.time(seq_1 <- read.fasta(file="clean_seq.fasta", seqtype="DNA", as.string=T, set.attributes=F))
      user system elapsed
      1.30 0.01 1.31
      #Time can be saved with the save function:
      > save(seq_25, file = "ER_CLEAN_seqs.RData")
      > system.time(load("ER_CLEAN_seqs.RData"))
      user system elapsed
      0.11 0.02 0.12
  • 16. Read an alignment (or create an alignment object to be aligned):
      #Create an alignment object by reading a FASTA file with two sequences (see next slide)
      > fasta <- read.alignment(file = system.file("sequences/Anouk.fasta", package = "seqinr"), format ="fasta")
      > fasta
      $nb
      [1] 2
      $nam
      [1] "LmjF01.0030" "LinJ01.0030"
      $seq
      $seq[[1]]
      [1] "atgatgtcggccgagccgccgtcgtcgcagccgtacatcagcgacgtgctgcggcggtaccagc...truncated..."
      $seq[[2]]
      [1] "atgatgtcggccgagccgccgtcgtcgcagccgtacatcagcgacgtgctgcggcggtaccagc...truncated..."
      $com
      [1] NA
      attr(,"class")
      [1] "alignment"
      # The consensus() function computes a consensus sequence from the sequences in the alignment object. With method="IUPAC", positions where the sequences differ are coded with IUPAC ambiguity symbols.
      > fixed_align = consensus(fasta, method="IUPAC")
      > table(fixed_align)
      fixed_align
        a   c   g   k   m   r   s   t   w   y
      411 636 595   3   5  20  13 293   2  20
  • 17. Usage:
      read.alignment(file, format, forceToLower = TRUE)
      Arguments:
      file - The name of the file which the aligned sequences are to be read from. If it does not contain an absolute or relative path, the file name is relative to the current working directory, getwd.
      format - A character string specifying the format of the file: mase, clustal, phylip, fasta or msf
      forceToLower - A logical defaulting to TRUE stating whether the returned characters in the sequence should be in lower case
      Value: An object is created of class alignment which is a list with the following components:
      nb -> the number of aligned sequences
      nam -> a vector of strings containing the names of the aligned sequences
      seq -> a vector of strings containing the aligned sequences
      com -> a vector of strings containing the commentaries for each sequence or NA if there are no comments
  • 18. Access a remote server:
      > choosebank()
      [1] "genbank" "embl" "emblwgs" "swissprot" "ensembl" "hogenom"
      [7] "hogenomdna" "hovergendna" "hovergen" "hogenom5" "hogenom5dna" "hogenom4"
      [13] "hogenom4dna" "homolens" "homolensdna" "hobacnucl" "hobacprot" "phever2"
      [19] "phever2dna" "refseq" "greviews" "bacterial" "protozoan" "ensbacteria"
      [25] "ensprotists" "ensfungi" "ensmetazoa" "ensplants" "mito" "polymorphix"
      [31] "emglib" "taxobacgen" "refseqViruses"
      #Access a bank and see complementary information:
      > choosebank(bank="genbank", infobank=T)
      > ls()
      [1] "banknameSocket"
      > str(banknameSocket)
      List of 9
       $ socket :Classes 'sockconn', 'connection' atomic [1:1] 3
       .. ..- attr(*, "conn_id")=<externalptr>
       $ bankname: chr "genbank"
       $ banktype: chr "GENBANK"
       $ totseqs : num 1.65e+08
       $ totspecs: num 968157
       $ totkeys : num 3.1e+07
       $ release : chr " GenBank Rel. 189 (15 April 2012) Last Updated: May 1, 2012"
       $ status :Class 'AsIs' chr "on"
       $ details : chr [1:4] " **** ACNUC Data Base Content **** " " GenBank Rel. 189 (15 April 2012) Last Updated: May 1, 2012" "139,677,722,280 bases; 152,280,170 sequences; 12,313,982 subseqs; 684,079 refers." "Software by M. Gouy, Lab. Biometrie et Biologie Evolutive, Universite Lyon I "
      > banknameSocket$details
      [1] " **** ACNUC Data Base Content **** "
      [2] " GenBank Rel. 189 (15 April 2012) Last Updated: May 1, 2012"
      [3] "139,677,722,280 bases; 152,280,170 sequences; 12,313,982 subseqs; 684,079 refers."
      [4] "Software by M. Gouy, Lab. Biometrie et Biologie Evolutive, Universite Lyon I "
  • 19. Make a query:
      # query (see next slide) all sequences that contain the words "virus" and "dengue" in the taxonomy field and
      # that are not partial sequences
      > system.time(query("All_Dengue_viruses_NOTpartial", '"sp=@virus@" AND "sp=@dengue@" AND NOT "k=partial"'))
      user system elapsed
      0.78 0.00 6.72
      > All_Dengue_viruses_NOTpartial[1:4]
      $call
      query(listname = "All_Dengue_viruses_NOTpartial", query = "\"sp=@virus@\" AND \"sp=@dengue@\" AND NOT \"k=partial\"")
      $name
      [1] "All_Dengue_viruses_NOTpartial"
      $nelem
      [1] 7741
      $typelist
      [1] "SQ"
      > All_Dengue_viruses_NOTpartial[[5]][[1]]
      name length frame ncbicg
      "A13666" "456" "0" "1"
      > myseq = getSequence(All_Dengue_viruses_NOTpartial[[5]][[1]])
      > myseq[1:20]
      [1] "a" "t" "g" "g" "c" "c" "a" "t" "g" "g" "a" "c" "c" "t" "t" "g" "g" "t" "g" "a"
      > closebank()
  • 20. Usage:
      query(listname, query, socket = autosocket(), invisible = T, verbose = F, virtual = F)
      Arguments:
      listname - The name of the list as a quoted string of chars
      query - A quoted string of chars containing the request with the syntax given in the details section
      socket - An object of class sockconn connecting to a remote ACNUC database (default is a socket to the last opened database).
      invisible - if FALSE, the result is returned visibly.
      verbose - if TRUE, verbose mode is on
      virtual - if TRUE, no attempt is made to retrieve the information about all the elements of the list. In this case, the req component of the list is set to NA.
      Value: The result is a list with the following 6 components:
      call - the original call
      name - the ACNUC list name
      nelem - the number of elements (for instance sequences) in the ACNUC list
      typelist - the type of the elements of the list. Could be SQ for a list of sequence names, KW for a list of keywords, SP for a list of species names.
      req - a list of sequence names that fit the required criteria, or NA when called with virtual = TRUE
      socket - the socket connection that was used
  • 21. Pau Corral Montañés RUGBCN 03/05/2012
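
A minimal base-R sketch, not part of the original slides, that revisits two steps from slides 12 and 13 without needing seqinr. It first recomputes the GC fraction from the nucleotide counts shown on slide 12, then applies the "at least 25 nucleotides" filter from slide 13 using nchar() and logical indexing instead of the explicit for loop. The list `seqs` and its names `toy1` to `toy3` are hypothetical stand-ins for the contents of ER.fasta, which is not available here.

```r
# Base-R sketch; toy data only, the real ER.fasta is not used.

# 1) GC fraction from the counts printed by table() on slide 12.
counts <- c(a = 3426, c = 2240, g = 2770, t = 2299)
gc_fraction <- sum(counts[c("g", "c")]) / sum(counts)
print(gc_fraction)  # should agree with the GC() result shown on slide 12

# 2) Length filter from slide 13, vectorised.
# Hypothetical sequences stored as strings (as with as.string = TRUE).
seqs <- list(
  toy1 = strrep("acgt", 10),   # 40 characters
  toy2 = "acgt",               # 4 characters
  toy3 = strrep("acgta", 5)    # 25 characters
)

# nchar() returns the length of each string, so no s2c()/rapply() step is needed.
seq_lengths <- nchar(unlist(seqs))

# Keep only the sequences with at least 25 nucleotides.
seqs_25 <- seqs[seq_lengths >= 25]
print(names(seqs_25))
```

With a list read by read.fasta(..., as.string=T, set.attributes=F), the same two lines (nchar() followed by logical indexing) should select the same elements as the loop on slide 13, assuming every element is a single string.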