SlideShare una empresa de Scribd logo

Genome_annotation@BioDec: Python all over the place

BioDec
BioDec
BioDecICT Systems Engineer and Software Developer en BioDec

A retrospective on the use of Python for bioinformatics at BioDec. Presentation given at Pycon Sei, Firenze 2015-04-18

Genome_annotation@BioDec: Python all over the place

1 de 29
Descargar para leer sin conexión
Genome_annotation@BioDec:
Python all over the place.
Ivan Rossi
ivan@biodec.com
@rouge2507
Hello
● BioDec does bioinformatics since 2002
● Bioinformatics software development
● Bioinformation management system, BioDecoders
● Bioinformatics Consulting
● Development, engineering and integration of custom solutions
● Annotated databases of biosequences (e.g. genomes)
● Our Forte
● Protein-sequence analysis
● Trans-membrane proteins
● Machine-learning
● Python is everywhere
The Challenge:
from Sequence to Function
>BGAL_SULSO BETA-GALACTOSIDASE Sulfolobus solfataricus.
MYSFPNSFRFGWSQAGFQSEMGTPGSEDPNTDWYKWVHDPENMAAGLVSG
DLPENGPGYWGNYKTFHDNAQKMGLKIARLNVEWSRIFPNPLPRPQNFDE
SKQDVTEVEINENELKRLDEYANKDALNHYREIFKDLKSRGLYFILNMYH
WPLPLWLHDPIRVRRGDFTGPSGWLSTRTVYEFARFSAYIAWKFDDLVDE
YSTMNEPNVVGGLGYVGVKSGFPPGYLSFELSRRHMYNIIQAHARAYDGI
KSVSKKPVGIIYANSSFQPLTDKDMEAVEMAENDNRWWFFDAIIRGEITR
GNEKIVRDDLKGRLDWIGVNYYTRTVVKRTEKGYVSLGGYGHGCERNSVS
LAGLPTSDFGWEFFPEGLYDVLTKYWNRYHLYMYVTENGIADDADYQRPY
YLVSHVYQVHRAINSGADVRGYLHWSLADNYEWASGFSMRFGLLKVDYNT
KRLYWRPSALVYREIATNGAITDEIEHLNSVPPVKPLRH
Protein Function
Gene Sequence
Protein Sequence (~10^7)
Protein Structure (10^5)
Problems in Sequence Analysis
Information Overflow:
very large sets of data available
High Throughput:
New data must be processed at high speed
(volume of data, time constraints)
Open Problems:
difficult to provide a simple first-principle or a
model-based solution
Alignments
OmpA APKDNTWYTGAKLGWS QYHDTGLINNNGPTHEN KLGAGAFGGYQV NPYVGFEMGYDWLGR
OEP21 IDTNTFFQVRGGLD TKT---------------GQPS SGSALIRHF YPNFSATLGVGVRYD
OmpA MPYKGSVENGA YKAQGVQLTAKLGYP ITDDLDIYTRLGGMVWRADT YSNVYGKN HDTGVS
OEP21 KQDSVGVRYAKND KLRYTVLAKKT FPVTNDGLVNFKIK GGCDVDQD-------FKE WKSR
OmpA PVFAGGVEYA I-TPEIATRLEYQW TNNIGDAHTIGTRPDNG MLSLGVSYRF G-----
OEP21 GGAEFSWNVF NFQKDQDVRLRIGYE AFEQV-PYLQIRE NNWTFNADYKGRWNVRYD L
Alignments of some kind are the main tool for
sequence comparison and database search
OmpA: PDB 1BXW, SwissProt OMPA_ECOLI
OEP21: Transmembrane Domain (24-177)
Tools from machine learning
Prediction
Known sequences (DB subsets)
TTCCPSIVARSNFNVCRLPGTPEAICATYTGCIIIPGATCPGDYAN
ANN,
HMM,
SVM
ANN,
HMM,
SVM
Known mapping
General
Rules
Known
structures
Artificial Neural Networks (ANNs)
Hidden Markov Models (HMMs)
Support Vector Machines (SVMs)
New sequence
Publicidad

Recomendados

BioDec Srl Company Profile
BioDec Srl Company ProfileBioDec Srl Company Profile
BioDec Srl Company ProfileBioDec
 
The RNA workbench - Galaxy User Conference 2018
The RNA workbench - Galaxy User Conference 2018The RNA workbench - Galaxy User Conference 2018
The RNA workbench - Galaxy User Conference 2018Florian Eggenhofer
 
Slideshare Powerpoint presentation
Slideshare Powerpoint presentationSlideshare Powerpoint presentation
Slideshare Powerpoint presentationelliehood
 
Reproducible Workflow with Cytoscape and Jupyter Notebook
Reproducible Workflow with Cytoscape and Jupyter NotebookReproducible Workflow with Cytoscape and Jupyter Notebook
Reproducible Workflow with Cytoscape and Jupyter NotebookKeiichiro Ono
 
Cool Informatics Tools and Services for Biomedical Research
Cool Informatics Tools and Services for Biomedical ResearchCool Informatics Tools and Services for Biomedical Research
Cool Informatics Tools and Services for Biomedical ResearchDavid Ruau
 
Portable and reproducible bioinformatic analysis. Neoantigen discovery.
Portable and reproducible bioinformatic analysis. Neoantigen discovery.Portable and reproducible bioinformatic analysis. Neoantigen discovery.
Portable and reproducible bioinformatic analysis. Neoantigen discovery.Vladimir Kovacevic
 
Spring Boot to Quarkus: A real app migration experience | DevNation Tech Talk
Spring Boot to Quarkus: A real app migration experience | DevNation Tech TalkSpring Boot to Quarkus: A real app migration experience | DevNation Tech Talk
Spring Boot to Quarkus: A real app migration experience | DevNation Tech TalkRed Hat Developers
 

Más contenido relacionado

Similar a Genome_annotation@BioDec: Python all over the place

WSO2 Machine Learner - Product Overview
WSO2 Machine Learner - Product OverviewWSO2 Machine Learner - Product Overview
WSO2 Machine Learner - Product OverviewWSO2
 
ICWE2017 BigDataEurope
ICWE2017 BigDataEuropeICWE2017 BigDataEurope
ICWE2017 BigDataEuropeBigData_Europe
 
ICAR 2015 Workshop - Nick Provart
ICAR 2015 Workshop - Nick ProvartICAR 2015 Workshop - Nick Provart
ICAR 2015 Workshop - Nick ProvartAraport
 
Python for Chemistry
Python for ChemistryPython for Chemistry
Python for Chemistryguest5929fa7
 
Python for Chemistry
Python for ChemistryPython for Chemistry
Python for Chemistrybaoilleach
 
#Fstoco - Monitoring and Instrumentation, why Tracing is Key
#Fstoco  - Monitoring and Instrumentation, why Tracing is Key#Fstoco  - Monitoring and Instrumentation, why Tracing is Key
#Fstoco - Monitoring and Instrumentation, why Tracing is KeyJonah Kowall
 
Tech Talk: ONOS- A Distributed SDN Network Operating System
Tech Talk: ONOS- A Distributed SDN Network Operating SystemTech Talk: ONOS- A Distributed SDN Network Operating System
Tech Talk: ONOS- A Distributed SDN Network Operating Systemnvirters
 
ExSchema - ICSM'13
ExSchema - ICSM'13ExSchema - ICSM'13
ExSchema - ICSM'13jccastrejon
 
Ontology-based data access: why it is so cool!
Ontology-based data access: why it is so cool!Ontology-based data access: why it is so cool!
Ontology-based data access: why it is so cool!Josef Hardi
 
Open-source from/in the enterprise: the RDKit
Open-source from/in the enterprise: the RDKitOpen-source from/in the enterprise: the RDKit
Open-source from/in the enterprise: the RDKitGreg Landrum
 
IPK - Reproducible research - To infinity
IPK - Reproducible research - To infinityIPK - Reproducible research - To infinity
IPK - Reproducible research - To infinityPeterMorrell4
 
tranSMART Community Meeting 5-7 Nov 13 - Session 1: Chilly-Mazarin Meeting Ob...
tranSMART Community Meeting 5-7 Nov 13 - Session 1: Chilly-Mazarin Meeting Ob...tranSMART Community Meeting 5-7 Nov 13 - Session 1: Chilly-Mazarin Meeting Ob...
tranSMART Community Meeting 5-7 Nov 13 - Session 1: Chilly-Mazarin Meeting Ob...David Peyruc
 
Ontologies and semantic web
Ontologies and semantic webOntologies and semantic web
Ontologies and semantic webStanley Wang
 
EVOLUTION OF ONTOLOGY-BASED MAPPINGS
EVOLUTION OF ONTOLOGY-BASED MAPPINGSEVOLUTION OF ONTOLOGY-BASED MAPPINGS
EVOLUTION OF ONTOLOGY-BASED MAPPINGSAksw Group
 
Using PostgreSQL with Bibliographic Data
Using PostgreSQL with Bibliographic DataUsing PostgreSQL with Bibliographic Data
Using PostgreSQL with Bibliographic DataJimmy Angelakos
 
Antao Biopython Bosc2008
Antao Biopython Bosc2008Antao Biopython Bosc2008
Antao Biopython Bosc2008bosc_2008
 
Ai dev world utilizing apache pulsar, apache ni fi and minifi for edgeai io...
Ai dev world   utilizing apache pulsar, apache ni fi and minifi for edgeai io...Ai dev world   utilizing apache pulsar, apache ni fi and minifi for edgeai io...
Ai dev world utilizing apache pulsar, apache ni fi and minifi for edgeai io...Timothy Spann
 

Similar a Genome_annotation@BioDec: Python all over the place (20)

WSO2 Machine Learner - Product Overview
WSO2 Machine Learner - Product OverviewWSO2 Machine Learner - Product Overview
WSO2 Machine Learner - Product Overview
 
ICWE2017 BigDataEurope
ICWE2017 BigDataEuropeICWE2017 BigDataEurope
ICWE2017 BigDataEurope
 
ICAR 2015 Workshop - Nick Provart
ICAR 2015 Workshop - Nick ProvartICAR 2015 Workshop - Nick Provart
ICAR 2015 Workshop - Nick Provart
 
Python for Chemistry
Python for ChemistryPython for Chemistry
Python for Chemistry
 
Python for Chemistry
Python for ChemistryPython for Chemistry
Python for Chemistry
 
#Fstoco - Monitoring and Instrumentation, why Tracing is Key
#Fstoco  - Monitoring and Instrumentation, why Tracing is Key#Fstoco  - Monitoring and Instrumentation, why Tracing is Key
#Fstoco - Monitoring and Instrumentation, why Tracing is Key
 
Tech Talk: ONOS- A Distributed SDN Network Operating System
Tech Talk: ONOS- A Distributed SDN Network Operating SystemTech Talk: ONOS- A Distributed SDN Network Operating System
Tech Talk: ONOS- A Distributed SDN Network Operating System
 
Bio2RDF@BH2010
Bio2RDF@BH2010Bio2RDF@BH2010
Bio2RDF@BH2010
 
ExSchema - ICSM'13
ExSchema - ICSM'13ExSchema - ICSM'13
ExSchema - ICSM'13
 
Ontology-based data access: why it is so cool!
Ontology-based data access: why it is so cool!Ontology-based data access: why it is so cool!
Ontology-based data access: why it is so cool!
 
Open-source from/in the enterprise: the RDKit
Open-source from/in the enterprise: the RDKitOpen-source from/in the enterprise: the RDKit
Open-source from/in the enterprise: the RDKit
 
IPK - Reproducible research - To infinity
IPK - Reproducible research - To infinityIPK - Reproducible research - To infinity
IPK - Reproducible research - To infinity
 
Berlin OpenStack Summit'18
Berlin OpenStack Summit'18Berlin OpenStack Summit'18
Berlin OpenStack Summit'18
 
tranSMART Community Meeting 5-7 Nov 13 - Session 1: Chilly-Mazarin Meeting Ob...
tranSMART Community Meeting 5-7 Nov 13 - Session 1: Chilly-Mazarin Meeting Ob...tranSMART Community Meeting 5-7 Nov 13 - Session 1: Chilly-Mazarin Meeting Ob...
tranSMART Community Meeting 5-7 Nov 13 - Session 1: Chilly-Mazarin Meeting Ob...
 
Ontologies and semantic web
Ontologies and semantic webOntologies and semantic web
Ontologies and semantic web
 
EVOLUTION OF ONTOLOGY-BASED MAPPINGS
EVOLUTION OF ONTOLOGY-BASED MAPPINGSEVOLUTION OF ONTOLOGY-BASED MAPPINGS
EVOLUTION OF ONTOLOGY-BASED MAPPINGS
 
Using PostgreSQL with Bibliographic Data
Using PostgreSQL with Bibliographic DataUsing PostgreSQL with Bibliographic Data
Using PostgreSQL with Bibliographic Data
 
Antao Biopython Bosc2008
Antao Biopython Bosc2008Antao Biopython Bosc2008
Antao Biopython Bosc2008
 
Ai dev world utilizing apache pulsar, apache ni fi and minifi for edgeai io...
Ai dev world   utilizing apache pulsar, apache ni fi and minifi for edgeai io...Ai dev world   utilizing apache pulsar, apache ni fi and minifi for edgeai io...
Ai dev world utilizing apache pulsar, apache ni fi and minifi for edgeai io...
 
Resume_Srivatsa
Resume_SrivatsaResume_Srivatsa
Resume_Srivatsa
 

Último

Zi-Stick UBS Dongle ZIgbee from Aeotec manual
Zi-Stick UBS Dongle ZIgbee from  Aeotec manualZi-Stick UBS Dongle ZIgbee from  Aeotec manual
Zi-Stick UBS Dongle ZIgbee from Aeotec manualDomotica daVinci
 
Z-Wave Fan coil Thermostat Heltun_HE-HT01_User_Manual.pdf
Z-Wave Fan coil Thermostat Heltun_HE-HT01_User_Manual.pdfZ-Wave Fan coil Thermostat Heltun_HE-HT01_User_Manual.pdf
Z-Wave Fan coil Thermostat Heltun_HE-HT01_User_Manual.pdfDomotica daVinci
 
5 Things You Shouldn’t Do at Salesforce World Tour Sydney 2024!
5 Things You Shouldn’t Do at Salesforce World Tour Sydney 2024!5 Things You Shouldn’t Do at Salesforce World Tour Sydney 2024!
5 Things You Shouldn’t Do at Salesforce World Tour Sydney 2024!XfilesPro
 
Breaking Barriers & Leveraging the Latest Developments in AI Technology
Breaking Barriers & Leveraging the Latest Developments in AI TechnologyBreaking Barriers & Leveraging the Latest Developments in AI Technology
Breaking Barriers & Leveraging the Latest Developments in AI TechnologySafe Software
 
Early Tech Adoption: Foolish or Pragmatic? - 17th ISACA South Florida WOW Con...
Early Tech Adoption: Foolish or Pragmatic? - 17th ISACA South Florida WOW Con...Early Tech Adoption: Foolish or Pragmatic? - 17th ISACA South Florida WOW Con...
Early Tech Adoption: Foolish or Pragmatic? - 17th ISACA South Florida WOW Con...Adrian Sanabria
 
Manual Eurotronic Thermostatic Valve Comry Z-Wave
Manual Eurotronic Thermostatic Valve Comry Z-WaveManual Eurotronic Thermostatic Valve Comry Z-Wave
Manual Eurotronic Thermostatic Valve Comry Z-WaveDomotica daVinci
 
"The Transformative Power of AI and Open Challenges" by Dr. Manish Gupta, Google
"The Transformative Power of AI and Open Challenges" by Dr. Manish Gupta, Google"The Transformative Power of AI and Open Challenges" by Dr. Manish Gupta, Google
"The Transformative Power of AI and Open Challenges" by Dr. Manish Gupta, GoogleISPMAIndia
 
AWS reInvent 2023 recaps from Chicago AWS user group
AWS reInvent 2023 recaps from Chicago AWS user groupAWS reInvent 2023 recaps from Chicago AWS user group
AWS reInvent 2023 recaps from Chicago AWS user groupAWS Chicago
 
Enhancing Productivity and Insight A Tour of JDK Tools Progress Beyond Java 17
Enhancing Productivity and Insight  A Tour of JDK Tools Progress Beyond Java 17Enhancing Productivity and Insight  A Tour of JDK Tools Progress Beyond Java 17
Enhancing Productivity and Insight A Tour of JDK Tools Progress Beyond Java 17Ana-Maria Mihalceanu
 
Q1 Memory Fabric Forum: Intel Enabling Compute Express Link (CXL)
Q1 Memory Fabric Forum: Intel Enabling Compute Express Link (CXL)Q1 Memory Fabric Forum: Intel Enabling Compute Express Link (CXL)
Q1 Memory Fabric Forum: Intel Enabling Compute Express Link (CXL)Memory Fabric Forum
 
My sample product research idea for you!
My sample product research idea for you!My sample product research idea for you!
My sample product research idea for you!KivenRaySarsaba
 
Q1 Memory Fabric Forum: SMART CXL Product Lineup
Q1 Memory Fabric Forum: SMART CXL Product LineupQ1 Memory Fabric Forum: SMART CXL Product Lineup
Q1 Memory Fabric Forum: SMART CXL Product LineupMemory Fabric Forum
 
From eSIMs to iSIMs: It’s Inside the Manufacturing
From eSIMs to iSIMs: It’s Inside the ManufacturingFrom eSIMs to iSIMs: It’s Inside the Manufacturing
From eSIMs to iSIMs: It’s Inside the ManufacturingSoracom Global, Inc.
 
zigbee motion sensor user manual NAS-PD07B2.pdf
zigbee motion sensor user manual NAS-PD07B2.pdfzigbee motion sensor user manual NAS-PD07B2.pdf
zigbee motion sensor user manual NAS-PD07B2.pdfDomotica daVinci
 
Dynamical systems simulation in Python for science and engineering
Dynamical systems simulation in Python for science and engineeringDynamical systems simulation in Python for science and engineering
Dynamical systems simulation in Python for science and engineeringMassimo Talia
 
Microsoft Azure News - Feb 2024
Microsoft Azure News - Feb 2024Microsoft Azure News - Feb 2024
Microsoft Azure News - Feb 2024Daniel Toomey
 
2024 February Patch Tuesday
2024 February Patch Tuesday2024 February Patch Tuesday
2024 February Patch TuesdayIvanti
 
Introduction to Serverless with AWS Lambda in C#.pptx
Introduction to Serverless with AWS Lambda in C#.pptxIntroduction to Serverless with AWS Lambda in C#.pptx
Introduction to Serverless with AWS Lambda in C#.pptxBrandon Minnick, MBA
 
Bringing nullability into existing code - dammit is not the answer.pptx
Bringing nullability into existing code - dammit is not the answer.pptxBringing nullability into existing code - dammit is not the answer.pptx
Bringing nullability into existing code - dammit is not the answer.pptxMaarten Balliauw
 
Evolution of Chatbots: From Custom AI Chatbots and AI Chatbots for Websites.pptx
Evolution of Chatbots: From Custom AI Chatbots and AI Chatbots for Websites.pptxEvolution of Chatbots: From Custom AI Chatbots and AI Chatbots for Websites.pptx
Evolution of Chatbots: From Custom AI Chatbots and AI Chatbots for Websites.pptxKyle Willson
 

Último (20)

Zi-Stick UBS Dongle ZIgbee from Aeotec manual
Zi-Stick UBS Dongle ZIgbee from  Aeotec manualZi-Stick UBS Dongle ZIgbee from  Aeotec manual
Zi-Stick UBS Dongle ZIgbee from Aeotec manual
 
Z-Wave Fan coil Thermostat Heltun_HE-HT01_User_Manual.pdf
Z-Wave Fan coil Thermostat Heltun_HE-HT01_User_Manual.pdfZ-Wave Fan coil Thermostat Heltun_HE-HT01_User_Manual.pdf
Z-Wave Fan coil Thermostat Heltun_HE-HT01_User_Manual.pdf
 
5 Things You Shouldn’t Do at Salesforce World Tour Sydney 2024!
5 Things You Shouldn’t Do at Salesforce World Tour Sydney 2024!5 Things You Shouldn’t Do at Salesforce World Tour Sydney 2024!
5 Things You Shouldn’t Do at Salesforce World Tour Sydney 2024!
 
Breaking Barriers & Leveraging the Latest Developments in AI Technology
Breaking Barriers & Leveraging the Latest Developments in AI TechnologyBreaking Barriers & Leveraging the Latest Developments in AI Technology
Breaking Barriers & Leveraging the Latest Developments in AI Technology
 
Early Tech Adoption: Foolish or Pragmatic? - 17th ISACA South Florida WOW Con...
Early Tech Adoption: Foolish or Pragmatic? - 17th ISACA South Florida WOW Con...Early Tech Adoption: Foolish or Pragmatic? - 17th ISACA South Florida WOW Con...
Early Tech Adoption: Foolish or Pragmatic? - 17th ISACA South Florida WOW Con...
 
Manual Eurotronic Thermostatic Valve Comry Z-Wave
Manual Eurotronic Thermostatic Valve Comry Z-WaveManual Eurotronic Thermostatic Valve Comry Z-Wave
Manual Eurotronic Thermostatic Valve Comry Z-Wave
 
"The Transformative Power of AI and Open Challenges" by Dr. Manish Gupta, Google
"The Transformative Power of AI and Open Challenges" by Dr. Manish Gupta, Google"The Transformative Power of AI and Open Challenges" by Dr. Manish Gupta, Google
"The Transformative Power of AI and Open Challenges" by Dr. Manish Gupta, Google
 
AWS reInvent 2023 recaps from Chicago AWS user group
AWS reInvent 2023 recaps from Chicago AWS user groupAWS reInvent 2023 recaps from Chicago AWS user group
AWS reInvent 2023 recaps from Chicago AWS user group
 
Enhancing Productivity and Insight A Tour of JDK Tools Progress Beyond Java 17
Enhancing Productivity and Insight  A Tour of JDK Tools Progress Beyond Java 17Enhancing Productivity and Insight  A Tour of JDK Tools Progress Beyond Java 17
Enhancing Productivity and Insight A Tour of JDK Tools Progress Beyond Java 17
 
Q1 Memory Fabric Forum: Intel Enabling Compute Express Link (CXL)
Q1 Memory Fabric Forum: Intel Enabling Compute Express Link (CXL)Q1 Memory Fabric Forum: Intel Enabling Compute Express Link (CXL)
Q1 Memory Fabric Forum: Intel Enabling Compute Express Link (CXL)
 
My sample product research idea for you!
My sample product research idea for you!My sample product research idea for you!
My sample product research idea for you!
 
Q1 Memory Fabric Forum: SMART CXL Product Lineup
Q1 Memory Fabric Forum: SMART CXL Product LineupQ1 Memory Fabric Forum: SMART CXL Product Lineup
Q1 Memory Fabric Forum: SMART CXL Product Lineup
 
From eSIMs to iSIMs: It’s Inside the Manufacturing
From eSIMs to iSIMs: It’s Inside the ManufacturingFrom eSIMs to iSIMs: It’s Inside the Manufacturing
From eSIMs to iSIMs: It’s Inside the Manufacturing
 
zigbee motion sensor user manual NAS-PD07B2.pdf
zigbee motion sensor user manual NAS-PD07B2.pdfzigbee motion sensor user manual NAS-PD07B2.pdf
zigbee motion sensor user manual NAS-PD07B2.pdf
 
Dynamical systems simulation in Python for science and engineering
Dynamical systems simulation in Python for science and engineeringDynamical systems simulation in Python for science and engineering
Dynamical systems simulation in Python for science and engineering
 
Microsoft Azure News - Feb 2024
Microsoft Azure News - Feb 2024Microsoft Azure News - Feb 2024
Microsoft Azure News - Feb 2024
 
2024 February Patch Tuesday
2024 February Patch Tuesday2024 February Patch Tuesday
2024 February Patch Tuesday
 
Introduction to Serverless with AWS Lambda in C#.pptx
Introduction to Serverless with AWS Lambda in C#.pptxIntroduction to Serverless with AWS Lambda in C#.pptx
Introduction to Serverless with AWS Lambda in C#.pptx
 
Bringing nullability into existing code - dammit is not the answer.pptx
Bringing nullability into existing code - dammit is not the answer.pptxBringing nullability into existing code - dammit is not the answer.pptx
Bringing nullability into existing code - dammit is not the answer.pptx
 
Evolution of Chatbots: From Custom AI Chatbots and AI Chatbots for Websites.pptx
Evolution of Chatbots: From Custom AI Chatbots and AI Chatbots for Websites.pptxEvolution of Chatbots: From Custom AI Chatbots and AI Chatbots for Websites.pptx
Evolution of Chatbots: From Custom AI Chatbots and AI Chatbots for Websites.pptx
 

Genome_annotation@BioDec: Python all over the place

  • 1. Genome_annotation@BioDec: Python all over the place. Ivan Rossi ivan@biodec.com @rouge2507
  • 2. Hello ● BioDec does bioinformatics since 2002 ● Bioinformatics software development ● Bioinformation management system, BioDecoders ● Bioinformatics Consulting ● Development, engineering and integration of custom solutions ● Annotated databases of biosequences (e.g. genomes) ● Our Forte ● Protein-sequence analysis ● Trans-membrane proteins ● Machine-learning ● Python is everywhere
  • 3. The Challenge: from Sequence to Function >BGAL_SULSO BETA-GALACTOSIDASE Sulfolobus solfataricus. MYSFPNSFRFGWSQAGFQSEMGTPGSEDPNTDWYKWVHDPENMAAGLVSG DLPENGPGYWGNYKTFHDNAQKMGLKIARLNVEWSRIFPNPLPRPQNFDE SKQDVTEVEINENELKRLDEYANKDALNHYREIFKDLKSRGLYFILNMYH WPLPLWLHDPIRVRRGDFTGPSGWLSTRTVYEFARFSAYIAWKFDDLVDE YSTMNEPNVVGGLGYVGVKSGFPPGYLSFELSRRHMYNIIQAHARAYDGI KSVSKKPVGIIYANSSFQPLTDKDMEAVEMAENDNRWWFFDAIIRGEITR GNEKIVRDDLKGRLDWIGVNYYTRTVVKRTEKGYVSLGGYGHGCERNSVS LAGLPTSDFGWEFFPEGLYDVLTKYWNRYHLYMYVTENGIADDADYQRPY YLVSHVYQVHRAINSGADVRGYLHWSLADNYEWASGFSMRFGLLKVDYNT KRLYWRPSALVYREIATNGAITDEIEHLNSVPPVKPLRH Protein Function Gene Sequence Protein Sequence (~10^7) Protein Structure (10^5)
  • 4. Problems in Sequence Analysis Information Overflow: very large sets of data available High Throughput: New data must be processed at high speed (volume of data, time constraints) Open Problems: difficult to provide a simple first-principle or a model-based solution
  • 5. Alignments OmpA APKDNTWYTGAKLGWS QYHDTGLINNNGPTHEN KLGAGAFGGYQV NPYVGFEMGYDWLGR OEP21 IDTNTFFQVRGGLD TKT---------------GQPS SGSALIRHF YPNFSATLGVGVRYD OmpA MPYKGSVENGA YKAQGVQLTAKLGYP ITDDLDIYTRLGGMVWRADT YSNVYGKN HDTGVS OEP21 KQDSVGVRYAKND KLRYTVLAKKT FPVTNDGLVNFKIK GGCDVDQD-------FKE WKSR OmpA PVFAGGVEYA I-TPEIATRLEYQW TNNIGDAHTIGTRPDNG MLSLGVSYRF G----- OEP21 GGAEFSWNVF NFQKDQDVRLRIGYE AFEQV-PYLQIRE NNWTFNADYKGRWNVRYD L Alignments of some kind are the main tool for sequence comparison and database search OmpA: PDB 1BXW, SwissProt OMPA_ECOLI OEP21: Transmembrane Domain (24-177)
  • 6. Tools from machine learning Prediction Known sequences (DB subsets) TTCCPSIVARSNFNVCRLPGTPEAICATYTGCIIIPGATCPGDYAN ANN, HMM, SVM ANN, HMM, SVM Known mapping General Rules Known structures Artificial Neural Networks (ANNs) Hidden Markov Models (HMMs) Support Vector Machines (SVMs) New sequence
  • 7. A 0 0 0 0 0 0 0 0 0 0 0 10 0 0 0 0 C 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 D 0 0 70 0 0 0 0 60 0 0 0 0 20 0 0 0 E 0 0 0 0 0 0 0 0 0 0 0 0 70 0 0 0 F 0 0 0 10 0 33 0 0 0 0 0 0 0 0 0 0 G 10 0 30 0 30 0 100 0 0 0 0 50 0 0 0 0 H 0 0 0 0 10 0 0 10 30 0 0 0 0 0 0 0 K 0 40 0 0 0 0 0 0 10 100 70 0 0 0 0 100 I 0 0 0 0 0 0 0 0 30 0 0 0 0 0 0 0 L 0 0 0 0 0 0 0 30 0 0 0 0 0 0 0 0 M 0 0 0 0 0 0 0 0 0 0 0 0 0 60 0 0 N 0 0 0 0 10 0 0 0 0 0 30 10 0 0 0 0 P 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 Q 0 0 0 0 40 0 0 0 30 0 0 0 0 0 0 0 R 0 50 0 0 0 0 0 0 0 0 0 0 0 0 0 0 S 0 0 0 0 0 33 0 0 0 0 0 0 10 10 0 0 T 20 0 0 0 0 33 0 0 0 0 0 30 0 30 100 0 V 0 0 0 0 10 0 0 0 0 0 0 0 0 0 0 0 W 0 10 0 0 0 0 0 0 0 0 0 0 0 0 0 0 Y 70 0 0 90 0 0 0 0 0 0 0 0 0 0 0 0 Evolutionary Information 1 Y K D Y H S - D K K K G E L - - 2 Y R D Y Q T - D Q K K G D L - - 3 Y R D Y Q S - D H K K G E L - - 4 Y R D Y V S - D H K K G E L - - 5 Y R D Y Q F - D Q K K G S L - - 6 Y K D Y N T - H Q K K N E S - - 7 Y R D Y Q T - D H K K A D L - - 8 G Y G F G - - L I K N T E T T K 9 T K G Y G F G L I K N T E T T K 10 T K G Y G F G L I K N T E T T K Sequence position MSA Seq. Profile Sequence profile Given a Multiple Sequence Alignment (MSA) of similar sequences, associate to each position a 20-valued vector containing the relative aminoacidic composition of the aligned sequences.
  • 8. Why Python? (2.1.x, in 2002) ● Common ground, easy to pick up ● Expressive: productive, fast prototyping ● Mantainable: readable after months ● Useful tools and libs (e.g. BioPython) ● Retrospective: We were f...ing RIGHT!
  • 9. Hidden Markov Models Very powerful tools when: ● The system can be modeled in probabilistic terms. ● There is a ‘grammar of the problem’ ● There is a “limited sequential dependency” that can model the problem (at least to a rough approx) N T 0.01 0.01 0.99 0.99
  • 10. 99HMMers End Start Signal Peptide TM1 TM2 TM3 TM4 TM5 TM6 TM7 Insertion loop Inside loop Outside loop Profile-HMM, based on: http://www.biocomp.unibo.it/piero/PHMM
  • 11. BioPython BioPython (http://biopython.org) is a community- developed (O|B|F) set of Python libraries and tools for bioinformatics. ● The Parsers for formats and application (vital) ● The Sequence objects ● Bio.SeqIO, Bio.AlignIO, Bio.PDB ● Specialized External-application wrappers ● BioSQL interface
  • 12. BioSQL BioSQL (http://www.biosql.org) is a generic relational model (a schema) covering sequences, features, sequence and feature annotation, a reference taxonomy, and ontologies. ● Works with all O|B|F Bio* projects ● We extended it to suit our special need
  • 13. Ruffus Ruffus (http://www.ruffus.org.uk/) is a Computation Pipeline library for Python, designed to allow easy analysis automation. ● Acts like a pythonic Make on steroids ● Write your Python functions and decorate them – @originate, @transform, @merge an more ● Pipeline handling – Run pipelines make-style (run_pipeline) – Schedule pipelines on SGE compute clusters (run_job)
  • 14. Angler pipeline Proteome Generate profiles Predictions: Signal peptides Betabarrels Alpha-helical TMP Fold recognition Coiled coils Disordered regions Sub-cellular localization Classify Proteome Atlas (a DB) Angler annotates and classifies Protein sequences
  • 15. ZenDock Analyzes protein solvent- exposed surface for putative “interactor” residues, returning a “fuzzy” (probabilistic) answer. Interactors are correlated and grouped into patches Results are mapped on the protein 3D structure and made available through a web interface Contact-shell profile Int non-Int
  • 16. If you can't outrun them... The Problem ● Full Profile building is the slow step – It takes 30” to 5' for a 3-passes PsiBlast run (uniref90) – Repeat for ~10^5 … CPU weeks for genome. ● Major genomes updated every 3 months ● Micro-SME: limited resources
  • 17. … try to outsmart them. ● Sequence space is redundant – Both intra-genome and inter-genome ● Profiles are built incrementally – PsiBlast is an iterative algorithm ● PsiBlast is deterministic – Given the same sequence, database, and number of iterations you get the same profile
  • 18. Our accelerator: the PyBlastCache 1) Hash the sequence 2) version the reference protein database 3) store computed profiles in a key-value store 1) Key as a combination of seq. hash and DB version 4) Compute ● If full_key_match: skip_and_copy() ● If seq_key_match: update_profile( seq, itn=1) ● If no_key: create_profile(seq, itn=3)
  • 19. The (Python) front-ends ● Plone: a CMS – https://plone.org ● Web2py: a MVC framework – http://www.web2py.com ● Galaxy: web interface + workflow engine – Focus on reproducible research – https://wiki.galaxyproject.org/ – Saas: https://usegalaxy.org
  • 20. ● A BiOSQL browser, based on Plone, to search and display data and metadata (annotations) from biosequence databases. Could integrate predictors; ● We publicly released the base version open-source software at http://plone4bio.org; ● Used to be the la base for some commercial software we sold to clients. Plone4Bio
  • 23. Galaxy Galaxy is an open, web-based platform for accessible, reproducible, and transparent computational biomedical research. – Users without programming experience can easily specify parameters and run tools and workflows. – Galaxy captures information in order to allow complete repeats of a computational analysis. – Users share and publish analyses via the web and create Pages, interactive, web-based documents that describe a complete analysis. ● Accepted as material by peer reviewed journals
  • 24. Galaxy highlights Galaxy is useful to both end user and bioinformatic devs. ● Get data directly from online DBs (USCS, Biomart,...) ● Handling of data from lab instrumentetion (e.g NGS seqs) ● Map calculated data on online viewers (e.g. genome viewer) ● Easily extensible: wrapping a foreign tools is as simple as by writing an XML file. ● Data sharing (workflows, libraries, tools...) ● The community!
  • 27. Thou Shalt Care For The DATA ● So much junk in the literature!! – Both for features and data sets ● Use training, testing and validation sets ● The sets should always be disjoint – Below 25% seq ID ● Redundancy is THE ENEMY ● Avoid feature bloat, use feature selection ● Always compare results with a nearest-neighbor method – Good ones are really hard to beat
  • 28. No Free Lunch ● There is no killer method – Choose method that better models your domain (e.g. sequences → HMMs) – Data curation is always more important ● Be Humble, be Honest! Meditation hint: http://www.no-free-lunch.org/
  • 29. The community is your friend. Give back to the community.