
Genome_annotation@BioDec: Python all over the place


  1. 1. Genome_annotation@BioDec: Python all over the place. Ivan Rossi ivan@biodec.com @rouge2507
  2. 2. Hello ● BioDec has been doing bioinformatics since 2002 ● Bioinformatics software development ● Bioinformation management system, BioDecoders ● Bioinformatics consulting ● Development, engineering and integration of custom solutions ● Annotated databases of biosequences (e.g. genomes) ● Our forte ● Protein-sequence analysis ● Transmembrane proteins ● Machine learning ● Python is everywhere
  3. 3. The Challenge: from Sequence to Function
     >BGAL_SULSO BETA-GALACTOSIDASE Sulfolobus solfataricus.
     MYSFPNSFRFGWSQAGFQSEMGTPGSEDPNTDWYKWVHDPENMAAGLVSG
     DLPENGPGYWGNYKTFHDNAQKMGLKIARLNVEWSRIFPNPLPRPQNFDE
     SKQDVTEVEINENELKRLDEYANKDALNHYREIFKDLKSRGLYFILNMYH
     WPLPLWLHDPIRVRRGDFTGPSGWLSTRTVYEFARFSAYIAWKFDDLVDE
     YSTMNEPNVVGGLGYVGVKSGFPPGYLSFELSRRHMYNIIQAHARAYDGI
     KSVSKKPVGIIYANSSFQPLTDKDMEAVEMAENDNRWWFFDAIIRGEITR
     GNEKIVRDDLKGRLDWIGVNYYTRTVVKRTEKGYVSLGGYGHGCERNSVS
     LAGLPTSDFGWEFFPEGLYDVLTKYWNRYHLYMYVTENGIADDADYQRPY
     YLVSHVYQVHRAINSGADVRGYLHWSLADNYEWASGFSMRFGLLKVDYNT
     KRLYWRPSALVYREIATNGAITDEIEHLNSVPPVKPLRH
     [Figure labels: Protein Function, Gene Sequence, Protein Sequence (~10^7), Protein Structure (10^5)]
  4. 4. Problems in Sequence Analysis ● Information overflow: very large sets of data are available ● High throughput: new data must be processed at high speed (volume of data, time constraints) ● Open problems: it is difficult to provide a simple first-principles or model-based solution
  5. 5. Alignments
     OmpA  APKDNTWYTGAKLGWS QYHDTGLINNNGPTHEN KLGAGAFGGYQV NPYVGFEMGYDWLGR
     OEP21 IDTNTFFQVRGGLD TKT---------------GQPS SGSALIRHF YPNFSATLGVGVRYD
     OmpA  MPYKGSVENGA YKAQGVQLTAKLGYP ITDDLDIYTRLGGMVWRADT YSNVYGKN HDTGVS
     OEP21 KQDSVGVRYAKND KLRYTVLAKKT FPVTNDGLVNFKIK GGCDVDQD-------FKE WKSR
     OmpA  PVFAGGVEYA I-TPEIATRLEYQW TNNIGDAHTIGTRPDNG MLSLGVSYRF G-----
     OEP21 GGAEFSWNVF NFQKDQDVRLRIGYE AFEQV-PYLQIRE NNWTFNADYKGRWNVRYD L
     Alignments of some kind are the main tool for sequence comparison and database search.
     OmpA: PDB 1BXW, SwissProt OMPA_ECOLI. OEP21: Transmembrane Domain (24-177).
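To make the idea of a pairwise alignment concrete, here is a minimal sketch of global alignment by dynamic programming (Needleman-Wunsch). The scoring values (match +1, mismatch -1, gap -1) and the function name are assumptions chosen for illustration; the slides do not say which scoring scheme or tool produced the OmpA/OEP21 alignment, and real protein aligners use substitution matrices and affine gap penalties.

```python
def global_alignment(a, b, match=1, mismatch=-1, gap=-1):
    """Return (score, aligned_a, aligned_b) for a global alignment of a and b."""
    rows, cols = len(a) + 1, len(b) + 1
    # score[i][j] holds the best score for aligning a[:i] with b[:j].
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap

    def sub(i, j):
        # Substitution score for pairing a[i-1] with b[j-1].
        return match if a[i - 1] == b[j - 1] else mismatch

    for i in range(1, rows):
        for j in range(1, cols):
            score[i][j] = max(
                score[i - 1][j - 1] + sub(i, j),  # pair the two residues
                score[i - 1][j] + gap,            # gap in b
                score[i][j - 1] + gap,            # gap in a
            )

    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = len(a), len(b)
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return score[-1][-1], "".join(reversed(out_a)), "".join(reversed(out_b))


result = global_alignment("ACGT", "AGT")
```

The two returned strings always have the same length, with "-" marking gaps, which is the same convention the slide uses.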
  6. 6. Tools from machine learning ● Artificial Neural Networks (ANNs) ● Hidden Markov Models (HMMs) ● Support Vector Machines (SVMs) [Figure labels: Known sequences (DB subsets), Known structures, Known mapping, ANN, HMM, SVM, General Rules, New sequence, Prediction; example sequence TTCCPSIVARSNFNVCRLPGTPEAICATYTGCIIIPGATCPGDYAN]
  7. 7. Evolutionary Information: the sequence profile
     Given a Multiple Sequence Alignment (MSA) of similar sequences, associate to each position a 20-valued vector containing the relative amino acid composition of the aligned sequences.

     MSA (rows: aligned sequences; columns: sequence position)
      1  Y K D Y H S - D K K K G E L - -
      2  Y R D Y Q T - D Q K K G D L - -
      3  Y R D Y Q S - D H K K G E L - -
      4  Y R D Y V S - D H K K G E L - -
      5  Y R D Y Q F - D Q K K G S L - -
      6  Y K D Y N T - H Q K K N E S - -
      7  Y R D Y Q T - D H K K A D L - -
      8  G Y G F G - - L I K N T E T T K
      9  T K G Y G F G L I K N T E T T K
     10  T K G Y G F G L I K N T E T T K

     Seq. Profile (rows: amino acids; columns: sequence position; values in %)
     A   0   0   0   0   0   0   0   0   0   0   0  10   0   0   0   0
     C   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
     D   0   0  70   0   0   0   0  60   0   0   0   0  20   0   0   0
     E   0   0   0   0   0   0   0   0   0   0   0   0  70   0   0   0
     F   0   0   0  10   0  33   0   0   0   0   0   0   0   0   0   0
     G  10   0  30   0  30   0 100   0   0   0   0  50   0   0   0   0
     H   0   0   0   0  10   0   0  10  30   0   0   0   0   0   0   0
     K   0  40   0   0   0   0   0   0  10 100  70   0   0   0   0 100
     I   0   0   0   0   0   0   0   0  30   0   0   0   0   0   0   0
     L   0   0   0   0   0   0   0  30   0   0   0   0   0   0   0   0
     M   0   0   0   0   0   0   0   0   0   0   0   0   0  60   0   0
     N   0   0   0   0  10   0   0   0   0   0  30  10   0   0   0   0
     P   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
     Q   0   0   0   0  40   0   0   0  30   0   0   0   0   0   0   0
     R   0  50   0   0   0   0   0   0   0   0   0   0   0   0   0   0
     S   0   0   0   0   0  33   0   0   0   0   0   0  10  10   0   0
     T  20   0   0   0   0  33   0   0   0   0   0  30   0  30 100   0
     V   0   0   0   0  10   0   0   0   0   0   0   0   0   0   0   0
     W   0  10   0   0   0   0   0   0   0   0   0   0   0   0   0   0
     Y  70   0   0  90   0   0   0   0   0   0   0   0   0   0   0   0
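The profile definition on this slide can be sketched in a few lines of plain Python. The function name `sequence_profile` is hypothetical, and one detail is an assumption: gap characters are left out of the denominator, so each column is normalised over the residues actually present. That choice is consistent with the 33/33/33 and 100 columns in the slide's matrix, but the slides do not state it.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def sequence_profile(msa, gap="-"):
    """Return one {amino acid: percentage} dict per alignment column."""
    length = len(msa[0])
    if any(len(row) != length for row in msa):
        raise ValueError("all aligned sequences must have the same length")
    profile = []
    for column in zip(*msa):
        # Count residues only; gaps do not contribute (assumption, see above).
        counts = Counter(res for res in column if res != gap)
        total = sum(counts.values())
        profile.append(
            {aa: (100.0 * counts[aa] / total if total else 0.0) for aa in AMINO_ACIDS}
        )
    return profile


# The ten aligned sequences shown on the slide.
msa = [
    "YKDYHS-DKKKGEL--",
    "YRDYQT-DQKKGDL--",
    "YRDYQS-DHKKGEL--",
    "YRDYVS-DHKKGEL--",
    "YRDYQF-DQKKGSL--",
    "YKDYNT-HQKKNES--",
    "YRDYQT-DHKKADL--",
    "GYGFG--LIKNTETTK",
    "TKGYGFGLIKNTETTK",
    "TKGYGFGLIKNTETTK",
]
profile = sequence_profile(msa)
```

Each entry of `profile` is the 20-valued vector for one alignment position; such vectors are a common input encoding for the machine-learning predictors mentioned earlier in the talk.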
  8. 8. Why Python? (2.1.x, in 2002) ● Common ground, easy to pick up ● Expressive: productive, fast prototyping ● Maintainable: readable after months ● Useful tools and libs (e.g. BioPython) ● Retrospective: We were f...ing RIGHT!
  9. 9. Hidden Markov Models Very powerful tools when: ● The system can be modeled in probabilistic terms ● There is a "grammar of the problem" ● There is a "limited sequential dependency" that can model the problem (at least as a rough approximation) [Figure: two-state model with states N and T; transition probabilities 0.99 and 0.01]
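As an illustration of the two-state model in the figure, the sketch below decodes the most probable state path with the Viterbi algorithm. Only the state names N and T and the values 0.99 and 0.01 come from the slide; reading 0.99 as the self-transition and 0.01 as the switch probability is an assumption, and the start and emission probabilities over a toy two-letter alphabet ("h" for hydrophobic, "p" for polar) are invented for the example.

```python
import math

STATES = ("N", "T")
START = {"N": 0.5, "T": 0.5}                      # assumed, not on the slide
TRANS = {"N": {"N": 0.99, "T": 0.01},             # 0.99 read as "stay"
         "T": {"N": 0.01, "T": 0.99}}             # 0.01 read as "switch"
EMIT = {"N": {"h": 0.3, "p": 0.7},                # hypothetical emissions
        "T": {"h": 0.8, "p": 0.2}}


def viterbi(observations):
    """Return the most probable state path for the observations, as a string."""
    if not observations:
        return ""
    # best[s]: log-probability of the best path that ends in state s.
    best = {s: math.log(START[s]) + math.log(EMIT[s][observations[0]])
            for s in STATES}
    back = []  # one {state: best predecessor} dict per step after the first
    for symbol in observations[1:]:
        new_best, pointers = {}, {}
        for s in STATES:
            prev = max(STATES, key=lambda r: best[r] + math.log(TRANS[r][s]))
            new_best[s] = (best[prev] + math.log(TRANS[prev][s])
                           + math.log(EMIT[s][symbol]))
            pointers[s] = prev
        best = new_best
        back.append(pointers)
    # Follow the back-pointers from the best final state.
    state = max(STATES, key=lambda s: best[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return "".join(reversed(path))


decoded = viterbi("p" * 10 + "h" * 20 + "p" * 10)
```

With these numbers a long hydrophobic run is worth the cost of two rare transitions, so it is labelled T, while short fluctuations are absorbed by the "sticky" 0.99 self-transitions. That is the "limited sequential dependency" the slide refers to.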
  10. 10. 99HMMers [Figure: model states Start, Signal Peptide, TM1, TM2, TM3, TM4, TM5, TM6, TM7, Insertion loop, Inside loop, Outside loop, End] Profile-HMM, based on: http://www.biocomp.unibo.it/piero/PHMM
  11. 11. BioPython BioPython (http://biopython.org) is a community-developed (O|B|F) set of Python libraries and tools for bioinformatics. ● Parsers for file formats and applications (vital) ● The Sequence objects ● Bio.SeqIO, Bio.AlignIO, Bio.PDB ● Specialized external-application wrappers ● BioSQL interface
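To show the kind of work a format parser takes off your hands, here is a minimal FASTA reader in plain Python. It is not Biopython code and the function name `parse_fasta` is hypothetical; with Biopython installed, `Bio.SeqIO.parse(handle, "fasta")` does the same job and yields full sequence record objects. The example input is the header and first two lines of the BGAL_SULSO entry from the earlier slide.

```python
def parse_fasta(lines):
    """Yield (header, sequence) tuples from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


text = """>BGAL_SULSO BETA-GALACTOSIDASE Sulfolobus solfataricus.
MYSFPNSFRFGWSQAGFQSEMGTPGSEDPNTDWYKWVHDPENMAAGLVSG
DLPENGPGYWGNYKTFHDNAQKMGLKIARLNVEWSRIFPNPLPRPQNFDE
"""
records = list(parse_fasta(text.splitlines()))
```

A hand-rolled parser like this ignores the many edge cases of real-world files, which is the reason the slide calls the library parsers vital.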
  12. 12. BioSQL BioSQL (http://www.biosql.org) is a generic relational model (a schema) covering sequences, features, sequence and feature annotation, a reference taxonomy, and ontologies. ● Works with all O|B|F Bio* projects ● We extended it to suit our special needs
  13. 13. Ruffus Ruffus (http://www.ruffus.org.uk/) is a computation pipeline library for Python, designed to allow easy analysis automation. ● Acts like a pythonic Make on steroids ● Write your Python functions and decorate them – @originate, @transform, @merge and more ● Pipeline handling – Run pipelines make-style (pipeline_run) – Schedule pipelines on SGE compute clusters (run_job)
  14. 14. Angler pipeline Angler annotates and classifies protein sequences. [Figure: Proteome → Generate profiles → Predictions (Signal peptides, Beta-barrels, Alpha-helical TMP, Fold recognition, Coiled coils, Disordered regions, Sub-cellular localization) → Classify → Proteome Atlas (a DB)]
  15. 15. ZenDock ● Analyzes the protein solvent-exposed surface for putative "interactor" residues, returning a "fuzzy" (probabilistic) answer ● Interactors are correlated and grouped into patches ● Results are mapped onto the protein 3D structure and made available through a web interface [Figure labels: Contact-shell profile, Int, non-Int]
  16. 16. If you can't outrun them... The Problem ● Full profile building is the slow step – It takes 30 seconds to 5 minutes for a three-pass PsiBlast run (uniref90) – Repeat for ~10^5 … CPU weeks per genome ● Major genomes are updated every 3 months ● Micro-SME: limited resources
  17. 17. … try to outsmart them. ● Sequence space is redundant – Both intra-genome and inter-genome ● Profiles are built incrementally – PsiBlast is an iterative algorithm ● PsiBlast is deterministic – Given the same sequence, database, and number of iterations you get the same profile
  18. 18. Our accelerator: the PyBlastCache 1) Hash the sequence 2) Version the reference protein database 3) Store computed profiles in a key-value store – Key: a combination of sequence hash and DB version 4) Compute ● If full_key_match: skip_and_copy() ● If seq_key_match: update_profile(seq, itn=1) ● If no_key: create_profile(seq, itn=3)
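The four steps above can be sketched as a small cache class. This is an illustration under stated assumptions, not the actual PyBlastCache code: the class name `ProfileCache`, the use of SHA-256, the in-memory dict standing in for the key-value store, and the extra `db_version` and `old_profile` parameters are all hypothetical. The two callbacks play the roles of `create_profile` and `update_profile` from the slide and would wrap real PsiBlast runs.

```python
import hashlib


class ProfileCache:
    """Cache sequence profiles under a (sequence hash, DB version) key."""

    def __init__(self, create_profile, update_profile):
        self._store = {}    # (seq_hash, db_version) -> profile
        self._latest = {}   # seq_hash -> DB version of the last stored profile
        self._create = create_profile
        self._update = update_profile

    @staticmethod
    def seq_hash(sequence):
        # Step 1: hash the sequence (SHA-256 is an assumption).
        return hashlib.sha256(sequence.upper().encode("ascii")).hexdigest()

    def get_profile(self, sequence, db_version):
        h = self.seq_hash(sequence)
        key = (h, db_version)   # step 3: key = sequence hash + DB version
        if key in self._store:
            # Full key match: reuse the stored profile, no computation.
            return self._store[key]
        if h in self._latest:
            # Sequence known, DB changed: one more iteration from the old profile.
            old = self._store[(h, self._latest[h])]
            profile = self._update(sequence, old, db_version, itn=1)
        else:
            # Never seen before: full three-iteration run.
            profile = self._create(sequence, db_version, itn=3)
        self._store[key] = profile
        self._latest[h] = db_version
        return profile


# Stand-in callbacks that only record what would have been computed.
calls = []


def create_profile(seq, db_version, itn):
    calls.append(("create", itn))
    return {"db": db_version, "iterations": itn}


def update_profile(seq, old_profile, db_version, itn):
    calls.append(("update", itn))
    return {"db": db_version, "iterations": old_profile["iterations"] + itn}


cache = ProfileCache(create_profile, update_profile)
first = cache.get_profile("MYSFPNSF", "uniref90-v1")    # new sequence
again = cache.get_profile("MYSFPNSF", "uniref90-v1")    # full key match
updated = cache.get_profile("MYSFPNSF", "uniref90-v2")  # same sequence, new DB
```

The version labels `uniref90-v1` and `uniref90-v2` are made up. The design relies on the determinism noted on the previous slide: because the same sequence, database and iteration count always give the same profile, a stored result can safely replace a recomputation.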
  19. 19. The (Python) front-ends ● Plone: a CMS – https://plone.org ● Web2py: an MVC framework – http://www.web2py.com ● Galaxy: web interface + workflow engine – Focus on reproducible research – https://wiki.galaxyproject.org/ – SaaS: https://usegalaxy.org
  20. 20. Plone4Bio ● A BioSQL browser, based on Plone, to search and display data and metadata (annotations) from biosequence databases. It could integrate predictors; ● We publicly released the base version as open-source software at http://plone4bio.org; ● It used to be the base for some commercial software we sold to clients.
  21. 21. Plone4Bio screenshots
  22. 22. Bologna, 21/1/2010 LIMS features
  23. 23. Galaxy Galaxy is an open, web-based platform for accessible, reproducible, and transparent computational biomedical research. – Users without programming experience can easily specify parameters and run tools and workflows. – Galaxy captures information in order to allow complete repeats of a computational analysis. – Users share and publish analyses via the web and create Pages, interactive, web-based documents that describe a complete analysis. ● Accepted as material by peer-reviewed journals
  24. 24. Galaxy highlights Galaxy is useful to both end users and bioinformatics developers. ● Get data directly from online DBs (UCSC, BioMart,...) ● Handling of data from lab instrumentation (e.g. NGS seqs) ● Map calculated data on online viewers (e.g. genome viewer) ● Easily extensible: wrapping a foreign tool is as simple as writing an XML file. ● Data sharing (workflows, libraries, tools...) ● The community!
  25. 25. Snapshots From https://usegalaxy.org
  26. 26. Visual programming
  27. 27. Thou Shalt Care For The DATA ● So much junk in the literature!! – Both for features and data sets ● Use training, testing and validation sets ● The sets should always be disjoint – Below 25% sequence identity ● Redundancy is THE ENEMY ● Avoid feature bloat, use feature selection ● Always compare results with a nearest-neighbor method – Good ones are really hard to beat
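The "disjoint below 25% sequence identity" rule can be checked mechanically. The sketch below is an illustration, not the authors' procedure: the function names are hypothetical, and it assumes every pair of sequences is already aligned to the same length (for example, taken from one multiple alignment), with identity measured over the columns where both sequences have a residue. In practice the pairwise identities would come from an alignment or clustering tool.

```python
def percent_identity(a, b, gap="-"):
    """Percent identity over columns where both aligned sequences have a residue."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    pairs = [(x, y) for x, y in zip(a, b) if x != gap and y != gap]
    if not pairs:
        return 0.0
    return 100.0 * sum(x == y for x, y in pairs) / len(pairs)


def drop_redundant(candidates, reference, max_identity=25.0):
    """Keep candidates that stay below max_identity against every reference sequence."""
    return [
        c for c in candidates
        if all(percent_identity(c, r) < max_identity for r in reference)
    ]


training = ["ACDEFGHI"]
testing = ["ACDEFGHK", "LMNPQRST"]
clean_testing = drop_redundant(testing, training)
```

Here the first test sequence is 87.5% identical to the training sequence and is dropped; leaving it in would let a nearest-neighbor lookup score well without the model having learned anything general.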
  28. 28. No Free Lunch ● There is no killer method – Choose the method that best models your domain (e.g. sequences → HMMs) – Data curation is always more important ● Be humble, be honest! Meditation hint: http://www.no-free-lunch.org/
  29. 29. The community is your friend. Give back to the community.
