SlideShare una empresa de Scribd logo
1 de 29
Biopython
     Karin Lagesen

karin.lagesen@bio.uio.no
ConcatFasta.py
Create a script that has the following:
  function get_fastafiles(dirname)
     gets all the files in the directory, checks if they are fasta
       files (end in .fsa), returns list of fasta files
     hint: you need os.path to create full relative file names
  function concat_fastafiles(filelist, outfile)
     takes a list of fasta files, opens and reads each of them,
       writes them to outfile
  if __name__ == “__main__”:
     do what needs to be done to run script
Remember imports!
Object oriented programming
Biopython is object-oriented
Some knowledge helps understand how
 biopython works
OOP is a way of organizing data and
 methods that work on them in a coherent
 package
OOP helps structure and organize the code
Classes and objects
A class:
  is a user defined type
  is a mold for creating objects
  specifies how an object can contain and
    process data
  represents an abstraction or a template for how
    an object of that class will behave
An object is an instance of a class
All objects have a type – shows which class
  they were made from
Attributes and methods
Classes specify two things:
  attributes – data holders
  methods – functions for this class
Attributes are variables that will contain the
 data that each object will have
Methods are functions that an object of that
 class will be able to perform
Class and object example
Class: MySeq
MySeq has:
   attribute length
   method translate
An object of the class MySeq is created like this:
   myseq = MySeq(“ATGGCCG”)
Get sequence length:
   myseq.length
Get translation:
   myseq.translate()
Summary
An object has to be instantiated, i.e.
 created, to exist
Every object has a certain type, i.e. is of a
 certain class
The class decides which attributes and
 methods an object has
Attributes and methods are accessed
 using . after the object variable name
Biopython
Package that assists with processing
 biological data
Consists of several modules – some with
 common operations, some more
 specialized
Website: biopython.org
Working with sequences
Biopython has many ways of working with
  sequence data
Components for today:
  Alphabet
  Seq
  SeqRecord
  SeqIO
Other useful classes for working with alignments,
 blast searches and results etc are also available,
 not covered today
Class Alphabet
Every sequence needs an alphabet
CCTTGGCC – DNA or protein?
Biopython contains several alphabets
  DNA
  RNA
  Protein
  the three above with IUPAC codes
  ...and others
Can all be found in Bio.Alphabet package
Alphabet example
Go to freebee
Do module load python (necessary to find biopython
  modules) – start python
  >>> import Bio.Alphabet
                                           NOTE: have to import
  >>> Bio.Alphabet.ThreeLetterProtein.letters
                                           Alphabets to use them
  ['Ala', 'Asx', 'Cys', 'Asp', 'Glu', 'Phe', 'Gly', 'His', 'Ile', 
  'Lys', 'Leu', 'Met', 'Asn', 'Pro', 'Gln', 'Arg', 'Ser', 'Thr', 
  'Sec', 'Val', 'Trp', 'Xaa', 'Tyr', 'Glx']
  >>> from Bio.Alphabet import IUPAC
  >>> IUPAC.IUPACProtein.letters
  'ACDEFGHIKLMNPQRSTVWY'
  >>> IUPAC.unambiguous_dna.letters
  'GATC'
  >>> 
Packages, modules and
              classes
What happens here?
>>> from Bio.Alphabet import IUPAC
   >>> IUPAC.IUPACProtein.letters


Bio and Alphabet are packages
    packages contain modules
IUPAC is a module
    a module is a file with python code
IUPAC module contains class IUPACProtein and
  other classes specifying alphabets
IUPACProtein has attribute letters
Seq
Represents one sequence with its alphabet
Methods:
  translate()
  transcribe()
  complement()
  reverse_complement()
  ...
Using Seq

>>> from Bio.Seq import Seq
>>> import Bio.Alphabet       Create object
>>> seq = Seq("CCGGGTT", Bio.Alphabet.IUPAC.unambiguous_dna)
>>> seq
Seq('CCGGGTT', IUPACUnambiguousDNA())
>>> seq.transcribe()
Seq('CCGGGUU', IUPACUnambiguousRNA()) Use methods
>>> seq.translate()
Seq('PG', IUPACProtein())
>>> seq = Seq("CCGGGUU", Bio.Alphabet.IUPAC.unambiguous_rna)
>>> seq.transcribe()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>        New object, different alphabet
  File "/site/VERSIONS/python­2.6.2/lib/python2.6/site­packages/Bio/Seq.py",
 line 830, in transcribe
    raise ValueError("RNA cannot be transcribed!")
ValueError: RNA cannot be transcribed!
>>> seq.translate()
Seq('PG', IUPACProtein())
>>> 
                                            Alphabet dictates which
                                            methods make sense
Seq as a string
Most string methods work on Seqs
If string is needed, do str(seq)
>>> seq = Seq('CCGGGTTAACGTA',Bio.Alphabet.IUPAC.unambiguous_dna)
>>> seq[:5]
Seq('CCGGG', IUPACUnambiguousDNA())
>>> len(seq)
13
>>> seq.lower()
Seq('ccgggttaacgta', DNAAlphabet())
>>> print seq
CCGGGTTAACGTA
>>> list(seq)
['C', 'C', 'G', 'G', 'G', 'T', 'T', 'A', 'A', 'C', 'G', 'T', 'A']
>>> mystring = str(seq)
>>> print mystring
CCGGGTTAACGTA
>>> type(seq)
<class 'Bio.Seq.Seq'>       How to check what class
>>> type(mystring)          or type an object is from
<type 'str'>
>>> 
MutableSeq
Seqs are immutable as strings are
If mutable string is needed, convert to MutableSeq
Allows in-place changes
  >>> seq[0] = 'T'
  Traceback (most recent call last):
    File "<stdin>", line 1, in <module>
  TypeError: 'Seq' object does not support item assignment
  >>> mut_seq = seq.tomutable()
  >>> seq
  Seq('CCGGGTTAACGTA', IUPACUnambiguousDNA())
  >>> seq[0] = 'T'
  Traceback (most recent call last):
    File "<stdin>", line 1, in <module>
  TypeError: 'Seq' object does not support item assignment
  >>> mut_seq = seq.tomutable()
  >>> mut_seq[0] = 'T'
  >>> mut_seq
  MutableSeq('TCGGGTTAACGTA', IUPACUnambiguousDNA())
  >>> mut_seq.complement()
  >>> mut_seq
  MutableSeq('AGCCCAATTGCAT', IUPACUnambiguousDNA())
  >>>                                Notice: object is changed!
SeqRecord
Seq contains the sequence and alphabet
But sequences often come with a lot more
SeqRecord = Seq + metadata
Main attributes:
   id – name or identifier
   seq – seq object containing the sequence
>>> seq   Existing sequence
Seq('CCGGGTTAACGTA', IUPACUnambiguousDNA()) SeqRecord is a class
>>> from Bio.SeqRecord import SeqRecord     found inside the
>>> seqRecord = SeqRecord(seq, id='001')
>>> seqRecord
                                            Bio.SeqRecord module
SeqRecord(seq=Seq('CCGGGTTAACGTA', IUPACUnambiguousDNA()), 
id='001', name='<unknown name>', description='<unknown description>', 
dbxrefs=[])
>>> 
SeqRecord attributes
From the biopython webpages:
Main attributes:

id - Identifier such as a locus tag (string)
seq - The sequence itself (Seq object or similar)

Additional attributes:

name - Sequence name, e.g. gene name (string)
description - Additional text (string)
dbxrefs - List of database cross references (list of strings)
features - Any (sub)features defined (list of SeqFeature objects)
annotations - Further information about the whole sequence (dictionary)
      Most entries are strings, or lists of strings.
letter_annotations - Per letter/symbol annotation (restricted dictionary). This holds
      Python sequences (lists, strings or tuples) whose length matches that of the
      sequence. A typical use would be to hold a list of integers representing
      sequencing quality scores, or a string representing the secondary structure.
SeqRecords in practice...
>>> from Bio.SeqRecord import SeqRecord
>>> from Bio.Seq import Seq
>>> from Bio.Alphabet import DNAAlphabet
>>> seqRecord = SeqRecord(Seq('GCAGCCTCAAACCCCAGCTG', 
… DNAAlphabet), id = 'NM_005368.2', name = 'NM_005368', 
… description = 'Myoglobin var 1',
… dbxrefs = ['GeneID:4151', 'HGNC:6915'])
>>> seqRecord.annotations['note'] = 'Information goes here'
>>> seqRecord
SeqRecord(seq=Seq('GCAGCCTCAAACCCCAGCTG',
 <class 'Bio.Alphabet.DNAAlphabet'>), id='NM_005368.2', 
name='NM_005368', description='Myoglobin var 1', 
dbxrefs=['GeneID:4151', 'HGNC:6915'])
>>> seqRecord.annotations
{'note': 'Information goes here'}
>>> 
SeqIO
How to get sequences in and out of files
Retrieves sequences as SeqRecords, can
 write SeqRecords to files
Reading:
  parse(filehandle, format)
  returns a generator that gives SeqRecords
Writing:
  write(SeqRecord(s), filehandle, format)

  NOTE: examples in this section from http://biopython.org/wiki/SeqIO
SeqIO formats
List: http://biopython.org/wiki/SeqIO
Some examples:
  fasta
  genbank
  several fastq-formats
  ace
Note: a format might be readable but not
 writable depending on biopython version
Reading a file
        from Bio import SeqIO
        handle = open("example.fasta", "r")
        for record in SeqIO.parse(handle,"fasta") :
            print record.id
        handle.close()




SeqIO.parse returns a SeqRecord iterator
An iterator will give you the next element the
 next time it is called – compare to
 readline()
Useful because if a file contains many
 records, we avoid putting all into memory
 all at once
Exercise
Use mb.gbk, found in Karins folder
Use the SeqIO methods to
   read in the file
   print the id of each of the records
   print the first 10 nucleotide of each record
 >>> from Bio import SeqIO
 >>> fh = open("mb.gbk", "r")
 >>> for record in SeqIO.parse(fh, "genbank"):
 ...     print record.id
 ...     print record.seq[:10]
 ... 
 NM_005368.2
 GCAGCCTCAA
 XM_001081975.2
 CCTCTCCCCA
 NM_001164047.1
 TAGCTGCCCA
 >>> 
SeqRecords lists and
           dictionaries
To get everything as a list:
  handle = open("example.fasta", "r")
     records = list(SeqIO.parse(handle, "fasta"))
     handle.close()


To get everything as a dictionary:
  handle = open("example.fasta", "r")
     record_dict = SeqIO.to_dict(SeqIO.parse(handle, "fasta"))
     handle.close()


But: avoid if at all possible
Writing files
                                      sequences are here a
           from Bio import SeqIO      list of SeqRecords
           sequences = ... # add code here
           output_handle = open("example.fasta", "w")
           SeqIO.write(sequences, output_handle, "fasta")
           output_handle.close()




Note: sequences is here a list
Can write any iterable containing
 SeqRecords to a file
Can also write a single sequence
seq_length.py
Write script that reads a file containing genbank
 sequences and writes out name and sequence
 length
Should have
  Function sequence_length(inputfile)
      Open file
      Per seqRecord in input:
           Print name, length of sequence
      Close file
  If __name__ == “__main__”:
      Get input from command line:
           inputfile
Modifications
Figure out how to:
  print the description of each genbank entry
  which annotations each entry has
  print the taxonomy for each entry
Description:
  seqRecord.description
Annotations:
  seqRecord.annotations.keys()
Taxonomy:
  seqRecord.annotations['taxonomy']
tag_fasta.py
Create script that takes a file containing fasta
  sequences, adds a tag at the front of the name and
  writes it out to a new file
Should have
  Function change_name(seqRecord, tag)
     Change name, return seqRecord
  Function read_write_fasta(tag, input, output)
     Per seqRecord in input:
         Change name
         Write to output
  If __name__ == “__main__”:
     Get input from command line:
         Tag, input file, output file
Optional homework
           convert.py
Create a script that:
  takes input filename, input file type, output
    filename and output file type
  converts input file to output file type and writes it
    to output file

Más contenido relacionado

La actualidad más candente

Tools of bioinforformatics by kk
Tools of bioinforformatics by kkTools of bioinforformatics by kk
Tools of bioinforformatics by kkKAUSHAL SAHU
 
Comparative genomics and proteomics
Comparative genomics and proteomicsComparative genomics and proteomics
Comparative genomics and proteomicsNikhil Aggarwal
 
Comparative genomics
Comparative genomicsComparative genomics
Comparative genomicsprateek kumar
 
Protein Structure, Databases and Structural Alignment
Protein Structure, Databases and Structural AlignmentProtein Structure, Databases and Structural Alignment
Protein Structure, Databases and Structural AlignmentSaramita De Chakravarti
 
Distance based method
Distance based method Distance based method
Distance based method Adhena Lulli
 
Sequence Submission Tools
Sequence Submission ToolsSequence Submission Tools
Sequence Submission ToolsRishikaMaji
 
Applications of bioinformatics
Applications of bioinformaticsApplications of bioinformatics
Applications of bioinformaticsSudha Rameshwari
 
Sequence alignment global vs. local
Sequence alignment  global vs. localSequence alignment  global vs. local
Sequence alignment global vs. localbenazeer fathima
 
Sequence alig Sequence Alignment Pairwise alignment:-
Sequence alig Sequence Alignment Pairwise alignment:-Sequence alig Sequence Alignment Pairwise alignment:-
Sequence alig Sequence Alignment Pairwise alignment:-naveed ul mushtaq
 
Basics of bioinformatics
Basics of bioinformaticsBasics of bioinformatics
Basics of bioinformaticsAbhishek Vatsa
 

La actualidad más candente (20)

Motif andpatterndatabase
Motif andpatterndatabaseMotif andpatterndatabase
Motif andpatterndatabase
 
Genome annotation 2013
Genome annotation 2013Genome annotation 2013
Genome annotation 2013
 
Mega
MegaMega
Mega
 
Tools of bioinforformatics by kk
Tools of bioinforformatics by kkTools of bioinforformatics by kk
Tools of bioinforformatics by kk
 
Comparative genomics and proteomics
Comparative genomics and proteomicsComparative genomics and proteomics
Comparative genomics and proteomics
 
SWISS-PROT
SWISS-PROTSWISS-PROT
SWISS-PROT
 
Comparative genomics
Comparative genomicsComparative genomics
Comparative genomics
 
Protein Structure, Databases and Structural Alignment
Protein Structure, Databases and Structural AlignmentProtein Structure, Databases and Structural Alignment
Protein Structure, Databases and Structural Alignment
 
Distance based method
Distance based method Distance based method
Distance based method
 
Scop database
Scop databaseScop database
Scop database
 
Composite and Specialized databases
Composite and Specialized databasesComposite and Specialized databases
Composite and Specialized databases
 
Sequence Submission Tools
Sequence Submission ToolsSequence Submission Tools
Sequence Submission Tools
 
Applications of bioinformatics
Applications of bioinformaticsApplications of bioinformatics
Applications of bioinformatics
 
Overview of Next Gen Sequencing Data Analysis
Overview of Next Gen Sequencing Data AnalysisOverview of Next Gen Sequencing Data Analysis
Overview of Next Gen Sequencing Data Analysis
 
Bioinformatica t3-scoring matrices
Bioinformatica t3-scoring matricesBioinformatica t3-scoring matrices
Bioinformatica t3-scoring matrices
 
Sequence alignment global vs. local
Sequence alignment  global vs. localSequence alignment  global vs. local
Sequence alignment global vs. local
 
ProCheck
ProCheckProCheck
ProCheck
 
Entrez databases
Entrez databasesEntrez databases
Entrez databases
 
Sequence alig Sequence Alignment Pairwise alignment:-
Sequence alig Sequence Alignment Pairwise alignment:-Sequence alig Sequence Alignment Pairwise alignment:-
Sequence alig Sequence Alignment Pairwise alignment:-
 
Basics of bioinformatics
Basics of bioinformaticsBasics of bioinformatics
Basics of bioinformatics
 

Similar a Biopython

Similar a Biopython (20)

Java 7 Features and Enhancements
Java 7 Features and EnhancementsJava 7 Features and Enhancements
Java 7 Features and Enhancements
 
Biopython: Overview, State of the Art and Outlook
Biopython: Overview, State of the Art and OutlookBiopython: Overview, State of the Art and Outlook
Biopython: Overview, State of the Art and Outlook
 
Java7 New Features and Code Examples
Java7 New Features and Code ExamplesJava7 New Features and Code Examples
Java7 New Features and Code Examples
 
Implementing jsp tag extensions
Implementing jsp tag extensionsImplementing jsp tag extensions
Implementing jsp tag extensions
 
File Handling in Java.pdf
File Handling in Java.pdfFile Handling in Java.pdf
File Handling in Java.pdf
 
Functions and modules in python
Functions and modules in pythonFunctions and modules in python
Functions and modules in python
 
15. text files
15. text files15. text files
15. text files
 
Advance Java
Advance JavaAdvance Java
Advance Java
 
Learning Java 1 – Introduction
Learning Java 1 – IntroductionLearning Java 1 – Introduction
Learning Java 1 – Introduction
 
Struts2 - 101
Struts2 - 101Struts2 - 101
Struts2 - 101
 
Python and You Series
Python and You SeriesPython and You Series
Python and You Series
 
2016 bioinformatics i_bio_python_wimvancriekinge
2016 bioinformatics i_bio_python_wimvancriekinge2016 bioinformatics i_bio_python_wimvancriekinge
2016 bioinformatics i_bio_python_wimvancriekinge
 
Dynamic Python
Dynamic PythonDynamic Python
Dynamic Python
 
Java I/O
Java I/OJava I/O
Java I/O
 
Jug java7
Jug java7Jug java7
Jug java7
 
What`s new in Java 7
What`s new in Java 7What`s new in Java 7
What`s new in Java 7
 
JDK1.7 features
JDK1.7 featuresJDK1.7 features
JDK1.7 features
 
BioMake BOSC 2004
BioMake BOSC 2004BioMake BOSC 2004
BioMake BOSC 2004
 
WhatsNewNIO2.pdf
WhatsNewNIO2.pdfWhatsNewNIO2.pdf
WhatsNewNIO2.pdf
 
Taming Core Data by Arek Holko, Macoscope
Taming Core Data by Arek Holko, MacoscopeTaming Core Data by Arek Holko, Macoscope
Taming Core Data by Arek Holko, Macoscope
 

Último

Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Scriptwesley chun
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUK Journal
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)Gabriella Davis
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024Results
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slidevu2urc
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonetsnaman860154
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxMalak Abu Hammad
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsJoaquim Jorge
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonAnna Loughnan Colquhoun
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Miguel Araújo
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...Neo4j
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxKatpro Technologies
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking MenDelhi Call girls
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationMichael W. Hawkins
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?Antenna Manufacturer Coco
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfEnterprise Knowledge
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?Igalia
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slidespraypatel2
 

Último (20)

Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptx
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slides
 

Biopython

  • 1. Biopython Karin Lagesen karin.lagesen@bio.uio.no
  • 2. ConcatFasta.py Create a script that has the following: function get_fastafiles(dirname) gets all the files in the directory, checks if they are fasta files (end in .fsa), returns list of fasta files hint: you need os.path to create full relative file names function concat_fastafiles(filelist, outfile) takes a list of fasta files, opens and reads each of them, writes them to outfile if __name__ == “__main__”: do what needs to be done to run script Remember imports!
  • 3. Object oriented programming Biopython is object-oriented Some knowledge helps understand how biopython works OOP is a way of organizing data and methods that work on them in a coherent package OOP helps structure and organize the code
  • 4. Classes and objects A class: is a user defined type is a mold for creating objects specifies how an object can contain and process data represents an abstraction or a template for how an object of that class will behave An object is an instance of a class All objects have a type – shows which class they were made from
  • 5. Attributes and methods Classes specify two things: attributes – data holders methods – functions for this class Attributes are variables that will contain the data that each object will have Methods are functions that an object of that class will be able to perform
  • 6. Class and object example Class: MySeq MySeq has: attribute length method translate An object of the class MySeq is created like this: myseq = MySeq(“ATGGCCG”) Get sequence length: myseq.length Get translation: myseq.translate()
  • 7. Summary An object has to be instantiated, i.e. created, to exist Every object has a certain type, i.e. is of a certain class The class decides which attributes and methods an object has Attributes and methods are accessed using . after the object variable name
  • 8. Biopython Package that assists with processing biological data Consists of several modules – some with common operations, some more specialized Website: biopython.org
  • 9. Working with sequences Biopython has many ways of working with sequence data Components for today: Alphabet Seq SeqRecord SeqIO Other useful classes for working with alignments, blast searches and results etc are also available, not covered today
  • 10. Class Alphabet Every sequence needs an alphabet CCTTGGCC – DNA or protein? Biopython contains several alphabets DNA RNA Protein the three above with IUPAC codes ...and others Can all be found in Bio.Alphabet package
  • 11. Alphabet example Go to freebee Do module load python (necessary to find biopython modules) – start python >>> import Bio.Alphabet NOTE: have to import >>> Bio.Alphabet.ThreeLetterProtein.letters Alphabets to use them ['Ala', 'Asx', 'Cys', 'Asp', 'Glu', 'Phe', 'Gly', 'His', 'Ile',  'Lys', 'Leu', 'Met', 'Asn', 'Pro', 'Gln', 'Arg', 'Ser', 'Thr',  'Sec', 'Val', 'Trp', 'Xaa', 'Tyr', 'Glx'] >>> from Bio.Alphabet import IUPAC >>> IUPAC.IUPACProtein.letters 'ACDEFGHIKLMNPQRSTVWY' >>> IUPAC.unambiguous_dna.letters 'GATC' >>> 
  • 12. Packages, modules and classes What happens here? >>> from Bio.Alphabet import IUPAC >>> IUPAC.IUPACProtein.letters Bio and Alphabet are packages packages contain modules IUPAC is a module a module is a file with python code IUPAC module contains class IUPACProtein and other classes specifying alphabets IUPACProtein has attribute letters
  • 13. Seq Represents one sequence with its alphabet Methods: translate() transcribe() complement() reverse_complement() ...
  • 14. Using Seq >>> from Bio.Seq import Seq >>> import Bio.Alphabet Create object >>> seq = Seq("CCGGGTT", Bio.Alphabet.IUPAC.unambiguous_dna) >>> seq Seq('CCGGGTT', IUPACUnambiguousDNA()) >>> seq.transcribe() Seq('CCGGGUU', IUPACUnambiguousRNA()) Use methods >>> seq.translate() Seq('PG', IUPACProtein()) >>> seq = Seq("CCGGGUU", Bio.Alphabet.IUPAC.unambiguous_rna) >>> seq.transcribe() Traceback (most recent call last):   File "<stdin>", line 1, in <module> New object, different alphabet   File "/site/VERSIONS/python­2.6.2/lib/python2.6/site­packages/Bio/Seq.py",  line 830, in transcribe     raise ValueError("RNA cannot be transcribed!") ValueError: RNA cannot be transcribed! >>> seq.translate() Seq('PG', IUPACProtein()) >>>  Alphabet dictates which methods make sense
  • 15. Seq as a string Most string methods work on Seqs If string is needed, do str(seq) >>> seq = Seq('CCGGGTTAACGTA',Bio.Alphabet.IUPAC.unambiguous_dna) >>> seq[:5] Seq('CCGGG', IUPACUnambiguousDNA()) >>> len(seq) 13 >>> seq.lower() Seq('ccgggttaacgta', DNAAlphabet()) >>> print seq CCGGGTTAACGTA >>> list(seq) ['C', 'C', 'G', 'G', 'G', 'T', 'T', 'A', 'A', 'C', 'G', 'T', 'A'] >>> mystring = str(seq) >>> print mystring CCGGGTTAACGTA >>> type(seq) <class 'Bio.Seq.Seq'> How to check what class >>> type(mystring) or type an object is from <type 'str'> >>> 
  • 16. MutableSeq Seqs are immutable as strings are If mutable string is needed, convert to MutableSeq Allows in-place changes >>> seq[0] = 'T' Traceback (most recent call last):   File "<stdin>", line 1, in <module> TypeError: 'Seq' object does not support item assignment >>> mut_seq = seq.tomutable() >>> seq Seq('CCGGGTTAACGTA', IUPACUnambiguousDNA()) >>> seq[0] = 'T' Traceback (most recent call last):   File "<stdin>", line 1, in <module> TypeError: 'Seq' object does not support item assignment >>> mut_seq = seq.tomutable() >>> mut_seq[0] = 'T' >>> mut_seq MutableSeq('TCGGGTTAACGTA', IUPACUnambiguousDNA()) >>> mut_seq.complement() >>> mut_seq MutableSeq('AGCCCAATTGCAT', IUPACUnambiguousDNA()) >>>  Notice: object is changed!
  • 17. SeqRecord Seq contains the sequence and alphabet But sequences often come with a lot more SeqRecord = Seq + metadata Main attributes: id – name or identifier seq – seq object containing the sequence >>> seq Existing sequence Seq('CCGGGTTAACGTA', IUPACUnambiguousDNA()) SeqRecord is a class >>> from Bio.SeqRecord import SeqRecord found inside the >>> seqRecord = SeqRecord(seq, id='001') >>> seqRecord Bio.SeqRecord module SeqRecord(seq=Seq('CCGGGTTAACGTA', IUPACUnambiguousDNA()),  id='001', name='<unknown name>', description='<unknown description>',  dbxrefs=[]) >>> 
  • 18. SeqRecord attributes From the biopython webpages: Main attributes: id - Identifier such as a locus tag (string) seq - The sequence itself (Seq object or similar) Additional attributes: name - Sequence name, e.g. gene name (string) description - Additional text (string) dbxrefs - List of database cross references (list of strings) features - Any (sub)features defined (list of SeqFeature objects) annotations - Further information about the whole sequence (dictionary) Most entries are strings, or lists of strings. letter_annotations - Per letter/symbol annotation (restricted dictionary). This holds Python sequences (lists, strings or tuples) whose length matches that of the sequence. A typical use would be to hold a list of integers representing sequencing quality scores, or a string representing the secondary structure.
  • 20. SeqIO How to get sequences in and out of files Retrieves sequences as SeqRecords, can write SeqRecords to files Reading: parse(filehandle, format) returns a generator that gives SeqRecords Writing: write(SeqRecord(s), filehandle, format) NOTE: examples in this section from http://biopython.org/wiki/SeqIO
  • 21. SeqIO formats List: http://biopython.org/wiki/SeqIO Some examples: fasta genbank several fastq-formats ace Note: a format might be readable but not writable depending on biopython version
  • 22. Reading a file from Bio import SeqIO handle = open("example.fasta", "r") for record in SeqIO.parse(handle,"fasta") :     print record.id handle.close() SeqIO.parse returns a SeqRecord iterator An iterator will give you the next element the next time it is called – compare to readline() Useful because if a file contains many records, we avoid putting all into memory all at once
  • 23. Exercise Use mb.gbk, found in Karins folder Use the SeqIO methods to read in the file print the id of each of the records print the first 10 nucleotide of each record >>> from Bio import SeqIO >>> fh = open("mb.gbk", "r") >>> for record in SeqIO.parse(fh, "genbank"): ...     print record.id ...     print record.seq[:10] ...  NM_005368.2 GCAGCCTCAA XM_001081975.2 CCTCTCCCCA NM_001164047.1 TAGCTGCCCA >>> 
  • 24. SeqRecords lists and dictionaries To get everything as a list: handle = open("example.fasta", "r") records = list(SeqIO.parse(handle, "fasta")) handle.close() To get everything as a dictionary: handle = open("example.fasta", "r") record_dict = SeqIO.to_dict(SeqIO.parse(handle, "fasta")) handle.close() But: avoid if at all possible
  • 25. Writing files sequences are here a from Bio import SeqIO list of SeqRecords sequences = ... # add code here output_handle = open("example.fasta", "w") SeqIO.write(sequences, output_handle, "fasta") output_handle.close() Note: sequences is here a list Can write any iterable containing SeqRecords to a file Can also write a single sequence
  • 26. seq_length.py Write script that reads a file containing genbank sequences and writes out name and sequence length Should have Function sequence_length(inputfile) Open file Per seqRecord in input: Print name, length of sequence Close file If __name__ == “__main__”: Get input from command line: inputfile
  • 27. Modifications Figure out how to: print the description of each genbank entry which annotations each entry has print the taxonomy for each entry Description: seqRecord.description Annotations: seqRecord.annotations.keys() Taxonomy: seqRecord.annotations['taxonomy']
  • 28. tag_fasta.py Create script that takes a file containing fasta sequences, adds a tag at the front of the name and writes it out to a new file Should have Function change_name(seqRecord, tag) Change name, return seqRecord Function read_write_fasta(tag, input, output) Per seqRecord in input: Change name Write to output If __name__ == “__main__”: Get input from command line: Tag, input file, output file
  • 29. Optional homework convert.py Create a script that: takes input filename, input file type, output filename and output file type converts input file to output file type and writes it to output file

Notas del editor

  1. Show webpage for alphabets. Each alphabet is a separate class