SlideShare una empresa de Scribd logo
1 de 47
Descargar para leer sin conexión
Bioperl:
Developing Open
    Source
   Software
     Jason Stajich
    Duke University



                      BOSC 2005
Why Talk About Bioperl?


• Successful open-source project
• Bioinformatics is a difficult field to
  straddle
• Toolkit still has (many) flaws
• An evolving project


                                         BOSC 2005
How Did We Get Here?


• A Brief History
 • Development Stats
 • People
• Bioperl in Action



                        BOSC 2005
Bioperl Development
         Goals

• Provide a useable Perl toolkit for life
  science data
• Solve problems that developers need
• Avoid the 1-off “everyone write a
  BLAST|GenBank|FASTA parser”
• Continuity of solutions

                                            BOSC 2005
A Brief History
                                                                             2002
            1997-1998                         2000                        Hackathons
                                                                          AZ to ZA
          Poster at ISMB                                                                                          2004
                                     Bio::SeqIO, Bio::DB
        OMG Bio-Objects
                                                                       “Core” Founded
       bio.perl.org website                                                                          Bioperl Bootcamp at UMontreal
                                     EnsEMBL launched                 Bioperl 1.0 release
            Bio::PreSeq                 bioperl-ext                      Bioperl paper
           Bio::UnivAln                                                                                        BOSC2004
                                                                        Gbrowse paper
            Bio::Struct                 Bioperl 0.07
         BLAST modules
                                                                      HOWTOs started
                                        BOSC2000                        bioperl-run
        Bioperl 0.01-0.04
                                                                        BOSC2002


                              1999                                                  2003
     1996                                                  2001                                                   2005
                                                                            Singapore Hackathon            Bioperl 1.5 released
                     Ewan Birney becomes
   VSNS-BCD                                          bptutorial written
                   project leader after Steve                               Bioperl 1.2.3 released
    Course                                              Bioperl-db                                        100+ Paper references
                            Chervitz                                       BioPipe paper published
Steven Brenner                                                                                           bioperl-db to have official
                                                                             Bioperl 1.4 released
Steve Chervitz *                                        BioCORBA                                                  release
                     Bio::SeqIO, Bio::Index
Chris Dagdigian                                         BOSC2001                                                BOSC2005
                                                                              bioperl-microarray
Richard Resnick
                      Releases 0.05-0.06                                           started
  Lew Gramer
                             BioP                                                BOSC2003
                      BioJava, BioPython
                           launched
                       Bioperl-99 mtg

                                                                                                                         BOSC 2005
Defining Points
• Vision of module(s) set by code owner
• She/He who writes it, “wins” argument
• Project leader/core developers
 • can remove code before a release to
   insure a working release
 • Enforce code and module continuity
• Preserve API as much as possible
                                         BOSC 2005
Project Mechanics


• Code is on a shared server (O|B|F)
• Publicly accessible
• Write-access to repository granted by
  main developers
• Discussion of code, ideas on-list


                                          BOSC 2005
Bioperl Mailing List
                              Traffic
1400
                                                                                                        ’full_dates.dat’ using 1:2
                    bioperl-guts 266 subscribers (20-June-2005)                                  ’full_guts_dates.dat’ using 1:2
                    bioperl-l    1500 subscribers (20-June-2005)
1200



1000



800



600



400



200



  0
       01/01/1996



                          01/01/1997



                                       01/01/1998



                                                    01/01/1999



                                                                 01/01/2000



                                                                              01/01/2001



                                                                                           01/01/2002



                                                                                                          01/01/2003



                                                                                                                       01/01/2004



                                                                                                                                    01/01/2005



                                                                                                                                                 01/01/2006
                                                                                                                                                          BOSC 2005
Code Growth
800                                                                350000
                                      ’plot.dat’ using 1:2                                                   ’plot.dat’ using 1:3

700
                                                                   300000


600
                                                                   250000

500

                                                                   200000

400

                                                                   150000
300


                                                                   100000
200


                                                                    50000
100



 0                                                                     0
      0   0.2   0.4   0.6   0.8   1    1.2           1.4     1.6            0   0.2    0.4   0.6   0.8   1     1.2           1.4    1.6


          Modules per release                                                         Lines per release



                                                                                                             BOSC 2005
CVS Commits
300
                                                           ’ChangeLog.sum.dat’ using 1:2



250




200




150




100




 50




   0
01/01/1999   01/01/2000   01/01/2001   01/01/2002   01/01/2003   01/01/2004   01/01/2005   01/01/2006
                                                                                                  BOSC 2005
Bioperl in Action
# Convert sequence formats
use Bio::SeqIO;
my $in = Bio::SeqIO->new(-format => ‘genbank’, -file=> ‘file.gbk’);
my $out = Bio::SeqIO->new(-format =>’fasta’,-file =>’>file.fas’);

while( my $seq = $in->next_seq ) {
  $out->write_seq($seq);
}

----------------------------------

use Bio::SearchIO;
my $in = Bio::SearchIO->new(-format => ‘blast’,
                            -file=> ‘result.blastp’);
while( my $r = $in->next_result ) {
  while( my $h = $r->next_hit ) {
    while( my $hsp = $h->next_hsp ) {
         print $hsp->query->start, “..”,$hsp->query->end,”
”;
    }
  }
                                                           BOSC 2005
}
Bioperl in Action
# Convert sequence formats
use Bio::SeqIO;
my $in = Bio::SeqIO->new(-format => ‘embl’, -file=> ‘file.embl’);
my $out = Bio::SeqIO->new(-format =>’gcg’,-file =>’>file.gcg’);

while( my $seq = $in->next_seq ) {
  $out->write_seq($seq);
}

----------------------------------

use Bio::SearchIO;
my $in = Bio::SearchIO->new(-format => ‘fasta’,
                            -file   => ‘result.fastp’);
while( my $r = $in->next_result ) {
  while( my $h = $r->next_hit ) {
    while( my $hsp = $h->next_hsp ) {
         print $hsp->query->start, “..”,$hsp->query->end,”
”;
    }
  }
                                                            BOSC 2005
}
Measuring sucess
• 2002 Paper has 100+ Citations
• Bits and pieces used in many
  informatics tools
 • Bio::SearchIO (Rfam, In-Paranoid)
 • Generic Genome Browser (Stein et
   al, 2002)
 • Comparative Genomics Library
   (Yandell, Mungall, et al)
 • [EnsEMBL]
                                       BOSC 2005
Bioperl Taught At
Tutorials and Courses




                   CSHL Bioinformatics
                    Software Courses

                 Various mini courses
Bootcamp 2004
                MIT, Duke, EBI, Pasteur
  (Montreal)
                                 BOSC 2005
Knowledge of Bioperl is
                      A Marketable Skill
                                                                              http://bioag.byu.edu/botany/homepage/botweb/jobs_Ph.D.htm
                                                                              
     Academic Facilities Coordinator II (Facilities)
http://www.ce.com/education/Bioinformatics-Certificate-
                                                                              
     Bioinformatics Position at the UCR Genomics Institute
Program-10082109.htm
                                                                              
     UC, Riverside
Track 1: Suggested electives for computer scientists and IT professionals
   * Advanced Sequence Analysis in Bioinformatics (2 units)
                                                                              The Center for Plant Cell Biology (CEPCEB) in the Genomics Institute of the University of California, Riverside,
   * Gene Expression and Pathways (2 units)
                                                                              invites applicants for an Academic Facilities Coordinator II position, an academic-track 11-month appointment.
   * Protein Structure Analysis in Bioinformatics (2 units)
                                                                              Salary for the position is commensurate with education and experience.
   * Design and Implementation of Bioinformatics Infrastructures (3 units)
   * DNA Microarrays - Principles, Applications and Data Analysis (2 Units)
                                                                              The successful applicant will be expected to organize a small bioinformatics team to provide support to The Center
   * Parallel and Distributed Computing for Bioinformatics (2 units)
                                                                              for Plant Cell Biology. This team will implement currently available bioinformatics tools including relational database
Track 2: Suggested electives for molecular biologists and other scientists
                                                                              support and will develop user-specific data-mining tools. The appointee will be expected to develop research
   * Introduction to Programming for Bioinformatics II*** (3 units)
                                                                              collaborations with the faculty and teach or organize short courses that will inform that will inform the local
   * Introduction to Programming for Bioinformatics III (3 units)
                                                                              community about the available bioinformatics resources. Applicants must have a Ph.D. in the Biological Sciences
      BioPerl for Bioinformatics (2 units)
  *                                                                           (Plant Biology is preferred). Applicants with experience in leading a bioinformatics group will be given preference.
  * Design and Implementation of Bioinformatics Infrastructures (3 units)     Additionally, the applicant must be proficient in one or more programming languages (PERL, PYTHON, JAVA, C++)
  * Gene Expression and Pathways (2 units)                                    and have a good understanding of database design and implementation. In addition, applicants should have
  * Protein Structure Analysis in Bioinformatics (2 units)                    experience using one of the open source bioinformatics frameworks such as BIOJAVA or
  * DNA Microarrays - Principles, Applications and Data Analysis (2 units)
                                                                              BIOPERL. The applicant should have experience with software collaboration tools such as CVS. The applicant
                                                                              will be expected to oversee the purchase, installation, and management of the necessary computer hardware and
 http://www.foothill.fhda.edu/bio/programs/bioinfo/curric.shtml               software required to provide bioinformatics support to several users. A good understanding of the UNIX operating
                                                                              system and systems administration is also an important qualification.
 CAREER CERTIFICATE REQUIREMENTS (49 units)*
                                                                    http://careers.psgs.com/CareerOppLocation2.asp?location=BETHESDA
 Biotechnology Core Courses (14 units)                                 CF-046: Bioinformatics Specialist
 BTEC 51A Cell Biology for Biotechnology (3 units)
 BTEC 52A Molecular Biology for Biotechnology (3 units)             The NIAID Office of Technology and Information Systems (OTIS)/Bioinformatics and Scientific IT Program (BSIP) is seek-ing
 BTEC 65 DNA Electrophoretic Systems (1 unit)                       a bioinformatics specialist. The position includes managing our bioinformatics services and applications, training NIH
 BTEC 68 Polymerase Chain Reaction (1 unit)                         scientists, creating and modifying simple bioinformatics software, and collaborating with NIH scientists on specific projects.
 BTEC 71 DNA Sequencing & Bioinformatics (1 unit)                   [snip]
 BTEC 76 Introduction to Microarray Data Analysis (2 unit)
 BTEC 64 Protein Electrophoretic Systems (1 unit)                   The qualified candidate must hold a Master's Degree (or equivalent) and three years of experience or a Ph. D in life science
 BTEC 66 HPLC (2 units)                                             or computer science. The candidate must have strong interpersonal, written and oral communication skills and be a lateral
                                                                    thinker. Must be able to communicate current bioinformatics technology in a clear and precise manner and to discuss
 Computer Science Core Courses (30 units)                           projects with scientists and advise what relevant tools may be used or implemented, have the ability to locate relevant data/
 CIS 52A Introduction to Data Management Systems (5 units)          information and put this into the context of projects they work on. Candidate must have expert knowledge of UNIX (Mac OS
 CIS 52B2 Introduction to Oracle SQL (5 units)                      X Darwin a plus), with the ability to install and configure command-line-based applications and services. These include: UNIX
 CIS 68A Introduction to UNIX (5 units)
                                                                    windowing systems (X11, FreeX86), UNIX-based open source scientific applications, Perl and BioPerl scripts,
 CIS 68E Introduction to PERL (5 units)
                                                                    HTML, XML. Must have experience with one or more of the following: · Web services (WSDL, SOAP) · Genomics and
 CIS 68H Introduction to BioPerl (5 units)                          proteomics · Protein structure prediction/visualization. · Sequence analysis, alignment and database searches. · Regular
                                                                                                                                                                        BOSC 2005
 COIN81 Bioinformatics Tools & Databases (5 units)                  expressions and relational database queries. · Data mining, text mining · Biostatistics
Why has the Project
       Worked?


• A critical mass of individuals who
  wanted to solve a common problem
• Infrastructure that works
• Sense of community
• Sharing information, tutorials


                                       BOSC 2005
Some Key Developers
Not Pictured
      Richard Adams
         Scott Cain
        Sean Davis
       Stefan Kirov
      Nathan Haigh
       Marc Logghe
                                                            Chad




Heikki                Aaron     Steve      Jason   Shawn     Allen




 Ewan                 Chris M   Hilmar   Lincoln   Brian    Chris D
                                                           BOSC 2005
Infrastructure

• Easily accessible CVS, web service
• Reliable
• Machines that we own
 • Additional services can be added
• Moving to faster machines this summer


                                       BOSC 2005
Collaborative Nature
    of the Project


• Folks working on similar problems
• Pooled development resources
• Friendly & interactive way to do
  science
• Shared user support load...


                                      BOSC 2005
Open Source is Good!

• Fix things faster
  • (If you really care about the fix)
• Everyone can see solution, audit code
• Contribution can be modular
• “Give a little, get a lot”
• More developers for less resources

                                          BOSC 2005
Open Source Doesn’t
   Solve Everything
• Best laid plans
 • Decision by consensus ...
 • ... Free-for-all committing
• Things don’t “just happen”
 • Need strong leadership, vision, and
   commitment
• Documentation tends to lag
• Interesting, needed code trumps boring
                                         BOSC 2005
Great Power ==
  Great Responsibility

• Making releases
 • Point releases maintain API
 • Attempt to maintain backwards
   compatibility between stable releases
• Many people depend on the code
• Additional modules should maintain
  continuity
                                       BOSC 2005
We Know how to have
       fun!




                      BOSC 2005
Bioperl: Present Day
• Parsers for
 • Sequence & alignment parsing
 • Gene Prediction output
 • Phylogenetic trees
 • Weight Matricies (TFBS)
 • Ontologies
 • Structure data
 • Various CompBio, MolEvol apps

                                   BOSC 2005
Bioperl: Present Day

• Tools for
 • Sequence manipulation (I/O, manip)
 • Graphical sequence rendering
 • Sequence & alignment stats, popgen
   tests, molevo tests
 • Access to Local & Remote Sequence
   Databases
                                       BOSC 2005
Present Day Stats

• 780+ modules
• 336,000 lines of code
• ~9,000 tests in unit test ‘t’ dir
• 57 utility scripts, 60 example scripts
• 11 HOWTO docs
• 44 Q&A in FAQ

                                           BOSC 2005
Bioperl Magic


• Using interfaces to hide specifics
 • Bio::DB::GenBank & Bio::Index::GenBank
• Objects can masquerade like other objects
 • Bio::SeqFeature::Generic, Bio::Location,
   DB::GFF::RelSegment, DB::GFF::Feature


                                              BOSC 2005
Bioperl Magic
use Bio::DB::GenBank;
use Bio::DB::Fasta;

my $db = Bio::DB::GenBank->new;
my $idx = Bio::DB::Fasta->new($fafile_dir);

my $seq1 = $db->get_Seq_by_acc($acc);
my $seq2 = $idx->get_Seq_by_acc($acc);


use Bio::DB::FileCache;
my $cachedb = Bio::DB::FileCache->new($db);
my $acc = $cachedb->get_Seq_by_acc($acc);

                                         BOSC 2005
...The Problems

• Interfaces system is cluttered
 • Learning curve is steep
 • Nonsensical to some people
• OO Code is too slow
• Developer community needs growth


                                     BOSC 2005
Why is Bioperl Slow
   (in some places)?


1. Object-Oriented Perl is a Hack!
 1. $class->SUPER::new(@_);
 2. $class->SUPER::_rearrange(@_);
 3. Object overhead can be high.



                                     BOSC 2005
Research with Bioperl


• Genome annotation pipeline
• Genome Browsing
• Comparative genomics
• Phylogenomics



                               BOSC 2005
S.cerevisiae
     Existing Genome                                                  C.glabrata
       Annotation                                                K.lactis
         New                                                   K.waltii
       Annotation                                             A.gossypii
                                                              D.hansenii
                                                                C.guilliermondii
                                                             C.lusitaniae
Hemiascomycota
                                                       Y.lipolytica
                                                            P.anserina
                                                            C.globsum
                                                          N.crassa
                                                          M.grisea
  Euascomycota                                          F.graminearum
                                                        A.fumigatus
                                                        A.nidulans
                                                         H.capsulatum
                                                         C.immitis
Archiascomycota                                       S.pombe
                                                      C.cinereus
                                                      P.chrysosporium
  Basidiomycota                                       C.neoformans
                                                U.maydis
  Zygomycota                               R.oryzae
                                         H.sapiens

                                         Protein ML tree based on 100 random orthologs
                            A.thaliana
                       10
Genome Annotation
                      Pipeline
               Rfam
             tRNAscan

                            ZFF to GFF3                     GFF to AA
              SNAP                                                                      FASTA
                                                                          predicted
                                                                                                         Proteins
                                                                           proteins
             Twinscan                                                                  all-vs-all
                            GFF2 to GFF3




                                             Bio::DB::GFF
                            Tools::Glimmer
             Glimmer                                                                         SearchIO
                                                             protein to
             Genscan
Genome                                                        genome
                            Tools::Genscan
                                                            coordinates
                                                                                       homologs
                                                                           HMMER                        MCL
                              SearchIO                       SearchIO                     and
             BLASTZ
                                                                                       orthologs
             BLASTN
                                                             Tools::GFF     GLEAN
                            GFF2 to GFF3
            exonerate                                                     (combiner)
Proteins                                                                                                 Gene
           protein2genome
                                                                                                        families



            exonerate
 ESTs
            est2genome


                                                                                                    BOSC 2005
Genome Browsing




                  BOSC 2005
Comparative Genomics

• Find orthologs, align, build trees
 • Assess gene structure among
   orthologs
• Find gene families
 • Assess gene structure
 • Copy number among genomes

                                       BOSC 2005
Comparative Genomics:
          Gene Structure

                                                                                        Species
                                                                                       phylogeny


                                                                                              Bio::Tree/TreeIO
             Bio::DB
                                    Bio::AignIO                  Bio::SimpleAlign
            Bio::SeqIO
                         MUSCLE                   Alignments                          Score shared
orthologs
                                  Bio::Coordinate with Introns                      intorn positions
                                                                                        on Tree




                                                                                                 BOSC 2005
Phylogenomics

                                                                    Species
                                                                   phylogeny




                                                          Bio::TreeIO
   gene
                                                MrBayes
 families    Bio::DB
                                                 PhyML             Classify tree
            Bio::SeqIO            Bio::AignIO
                         MUSCLE                                    differences
                                                ProtML
                                                PUZZLE             Bin genes by
orthologs                                                            topology




                                                                                   BOSC 2005
Bioperl - The Future




                       BOSC 2005
Where Should Bioperl
    be Headed?


• Concise set of functionality that just
  works?
• Supporting leading edge of research?
• Beginner level tools?
• Advanced tools?


                                           BOSC 2005
Challenges

• Logistics
 • Building and maintaining large
   projects takes a lot of time
 • Project leaders with vision (and time)
 • Face-to-face time invaluable for
   coordination
 • Need more dedicated developers
                                       BOSC 2005
Starting Over?

• Years of organic development don’t
  make a wholly consistent toolkit
• Keeping POD up to date vs API
  maintenance
 • Autodoc-ing?
• Perl6 rewrite?
• Do we start a Bioperl 2, from scratch?
                                           BOSC 2005
Supporting the Toolkit

• Consider applying for funding to
  support a developer?
• Finding companies to provide more
  dedicated user support?
• Gatherings (hackathons)
 • DocFests to increase documentation
• Better incentives for bug fixes
                                        BOSC 2005
My Priority Queue


• Simplified API, hidden interfaces
• Overview docs point functionality to
  modules
• Speed
• Extensibility


                                         BOSC 2005
A Plan
• Continue Bioperl 1.x
 • Try and fix interface overload
 • Work out speed issues (maybe)
• Encourage dabbling into a 2.0 codebase
 • Outline some key principals in new
   codebase.
 • Consider prototypes & Perl6
                                        BOSC 2005
We’ve Grown Up


• Over the course of the project we have
 • Gotten married
 • Had kids
 • Changed continents
 • Changed jobs



                                      BOSC 2005
BOSC 2005
Thanks

• Developers - past and present
• Ewan Birney
• Lincoln Stein
• Chris Dagdigian
• Brian Osborne


                                  BOSC 2005

Más contenido relacionado

Más de Jason Stajich

Fungi DB at Keystone meeting
Fungi DB at Keystone meetingFungi DB at Keystone meeting
Fungi DB at Keystone meetingJason Stajich
 
MOMY FungiDB tutorial
MOMY FungiDB tutorialMOMY FungiDB tutorial
MOMY FungiDB tutorialJason Stajich
 
Stajich phyloseminar 2011 06-29
Stajich phyloseminar 2011 06-29Stajich phyloseminar 2011 06-29
Stajich phyloseminar 2011 06-29Jason Stajich
 
2011 05-02 fungi db-iccc
2011 05-02 fungi db-iccc2011 05-02 fungi db-iccc
2011 05-02 fungi db-icccJason Stajich
 
2009 11 09 UCLA Bioinformatics Talk
2009 11 09 UCLA Bioinformatics Talk2009 11 09 UCLA Bioinformatics Talk
2009 11 09 UCLA Bioinformatics TalkJason Stajich
 
2009 11 16 UCR Comp Sci
2009 11 16 UCR Comp Sci2009 11 16 UCR Comp Sci
2009 11 16 UCR Comp SciJason Stajich
 
Evolution and exploration of the transcriptional landscape in two filamentous...
Evolution and exploration of the transcriptional landscape in two filamentous...Evolution and exploration of the transcriptional landscape in two filamentous...
Evolution and exploration of the transcriptional landscape in two filamentous...Jason Stajich
 
Connecting ultrastructure to molecules
Connecting ultrastructure to moleculesConnecting ultrastructure to molecules
Connecting ultrastructure to moleculesJason Stajich
 
Connecting ultrastructure to molecules: Studying evolution of molecular comp...
Connecting ultrastructure to molecules: Studying evolution  of molecular comp...Connecting ultrastructure to molecules: Studying evolution  of molecular comp...
Connecting ultrastructure to molecules: Studying evolution of molecular comp...Jason Stajich
 
Comparative genomics of the fungal kingdom
Comparative genomics of the fungal kingdomComparative genomics of the fungal kingdom
Comparative genomics of the fungal kingdomJason Stajich
 
Comparative genomics of fung: studying adaptation through gene family evolution
Comparative genomics of fung: studying adaptation through gene family evolutionComparative genomics of fung: studying adaptation through gene family evolution
Comparative genomics of fung: studying adaptation through gene family evolutionJason Stajich
 
Evolution of gene family size change in fungi
Evolution of gene family size change in fungiEvolution of gene family size change in fungi
Evolution of gene family size change in fungiJason Stajich
 
Comparative Genomics with GMOD and BioPerl
Comparative Genomics with GMOD and BioPerlComparative Genomics with GMOD and BioPerl
Comparative Genomics with GMOD and BioPerlJason Stajich
 

Más de Jason Stajich (13)

Fungi DB at Keystone meeting
Fungi DB at Keystone meetingFungi DB at Keystone meeting
Fungi DB at Keystone meeting
 
MOMY FungiDB tutorial
MOMY FungiDB tutorialMOMY FungiDB tutorial
MOMY FungiDB tutorial
 
Stajich phyloseminar 2011 06-29
Stajich phyloseminar 2011 06-29Stajich phyloseminar 2011 06-29
Stajich phyloseminar 2011 06-29
 
2011 05-02 fungi db-iccc
2011 05-02 fungi db-iccc2011 05-02 fungi db-iccc
2011 05-02 fungi db-iccc
 
2009 11 09 UCLA Bioinformatics Talk
2009 11 09 UCLA Bioinformatics Talk2009 11 09 UCLA Bioinformatics Talk
2009 11 09 UCLA Bioinformatics Talk
 
2009 11 16 UCR Comp Sci
2009 11 16 UCR Comp Sci2009 11 16 UCR Comp Sci
2009 11 16 UCR Comp Sci
 
Evolution and exploration of the transcriptional landscape in two filamentous...
Evolution and exploration of the transcriptional landscape in two filamentous...Evolution and exploration of the transcriptional landscape in two filamentous...
Evolution and exploration of the transcriptional landscape in two filamentous...
 
Connecting ultrastructure to molecules
Connecting ultrastructure to moleculesConnecting ultrastructure to molecules
Connecting ultrastructure to molecules
 
Connecting ultrastructure to molecules: Studying evolution of molecular comp...
Connecting ultrastructure to molecules: Studying evolution  of molecular comp...Connecting ultrastructure to molecules: Studying evolution  of molecular comp...
Connecting ultrastructure to molecules: Studying evolution of molecular comp...
 
Comparative genomics of the fungal kingdom
Comparative genomics of the fungal kingdomComparative genomics of the fungal kingdom
Comparative genomics of the fungal kingdom
 
Comparative genomics of fung: studying adaptation through gene family evolution
Comparative genomics of fung: studying adaptation through gene family evolutionComparative genomics of fung: studying adaptation through gene family evolution
Comparative genomics of fung: studying adaptation through gene family evolution
 
Evolution of gene family size change in fungi
Evolution of gene family size change in fungiEvolution of gene family size change in fungi
Evolution of gene family size change in fungi
 
Comparative Genomics with GMOD and BioPerl
Comparative Genomics with GMOD and BioPerlComparative Genomics with GMOD and BioPerl
Comparative Genomics with GMOD and BioPerl
 

Último

From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationSafe Software
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Miguel Araújo
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)Gabriella Davis
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationRadu Cotescu
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...Martijn de Jong
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityPrincipled Technologies
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdfhans926745
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Igalia
 
Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Paola De la Torre
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure servicePooja Nehwal
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking MenDelhi Call girls
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024Rafal Los
 
Top 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live StreamsTop 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live StreamsRoshan Dwivedi
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonetsnaman860154
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024The Digital Insurer
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking MenDelhi Call girls
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Allon Mureinik
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonAnna Loughnan Colquhoun
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Scriptwesley chun
 

Último (20)

From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
 
Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
Top 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live StreamsTop 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live Streams
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 

BioPerl: Developming Open Source Software

  • 1. Bioperl: Developing Open Source Software Jason Stajich Duke University BOSC 2005
  • 2. Why Talk About Bioperl? • Successful open-source project • Bioinformatics is a difficult field to straddle • Toolkit still has (many) flaws • An evolving project BOSC 2005
  • 3. How Did We Get Here? • A Brief History • Development Stats • People • Bioperl in Action BOSC 2005
  • 4. Bioperl Development Goals • Provide a useable Perl toolkit for life science data • Solve problems that developers need • Avoid the 1-off “everyone write a BLAST|GenBank|FASTA parser” • Continuity of solutions BOSC 2005
  • 5. A Brief History 2002 1997-1998 2000 Hackathons AZ to ZA Poster at ISMB 2004 Bio::SeqIO, Bio::DB OMG Bio-Objects “Core” Founded bio.perl.org website Bioperl Bootcamp at UMontreal EnsEMBL launched Bioperl 1.0 release Bio::PreSeq bioperl-ext Bioperl paper Bio::UnivAln BOSC2004 Gbrowse paper Bio::Struct Bioperl 0.07 BLAST modules HOWTOs started BOSC2000 bioperl-run Bioperl 0.01-0.04 BOSC2002 1999 2003 1996 2001 2005 Singapore Hackathon Bioperl 1.5 released Ewan Birney becomes VSNS-BCD bptutorial written project leader after Steve Bioperl 1.2.3 released Course Bioperl-db 100+ Paper references Chervitz BioPipe paper published Steven Brenner bioperl-db to have official Bioperl 1.4 released Steve Chervitz * BioCORBA release Bio::SeqIO, Bio::Index Chris Dagdigian BOSC2001 BOSC2005 bioperl-microarray Richard Resnick Releases 0.05-0.06 started Lew Gramer BioP BOSC2003 BioJava, BioPython launched Bioperl-99 mtg BOSC 2005
  • 6. Defining Points • Vision of module(s) set by code owner • She/He who writes it, “wins” argument • Project leader/core developers • can remove code before a release to insure a working release • Enforce code and module continuity • Preserve API as much as possible BOSC 2005
  • 7. Project Mechanics • Code is on a shared server (O|B|F) • Publicly accessible • Write-access to repository granted by main developers • Discussion of code, ideas on-list BOSC 2005
  • 8. Bioperl Mailing List Traffic 1400 ’full_dates.dat’ using 1:2 bioperl-guts 266 subscribers (20-June-2005) ’full_guts_dates.dat’ using 1:2 bioperl-l 1500 subscribers (20-June-2005) 1200 1000 800 600 400 200 0 01/01/1996 01/01/1997 01/01/1998 01/01/1999 01/01/2000 01/01/2001 01/01/2002 01/01/2003 01/01/2004 01/01/2005 01/01/2006 BOSC 2005
  • 9. Code Growth 800 350000 ’plot.dat’ using 1:2 ’plot.dat’ using 1:3 700 300000 600 250000 500 200000 400 150000 300 100000 200 50000 100 0 0 0 0.2 0.4 0.6 0.8 1 1.2 1.4 1.6 0 0.2 0.4 0.6 0.8 1 1.2 1.4 1.6 Modules per release Lines per release BOSC 2005
  • 10. CVS Commits 300 ’ChangeLog.sum.dat’ using 1:2 250 200 150 100 50 0 01/01/1999 01/01/2000 01/01/2001 01/01/2002 01/01/2003 01/01/2004 01/01/2005 01/01/2006 BOSC 2005
  • 11. Bioperl in Action # Convert sequence formats use Bio::SeqIO; my $in = Bio::SeqIO->new(-format => ‘genbank’, -file=> ‘file.gbk’); my $out = Bio::SeqIO->new(-format =>’fasta’,-file =>’>file.fas’); while( my $seq = $in->next_seq ) { $out->write_seq($seq); } ---------------------------------- use Bio::SearchIO; my $in = Bio::SearchIO->new(-format => ‘blast’, -file=> ‘result.blastp’); while( my $r = $in->next_result ) { while( my $h = $r->next_hit ) { while( my $hsp = $h->next_hsp ) { print $hsp->query->start, “..”,$hsp->query->end,” ”; } } BOSC 2005 }
  • 12. Bioperl in Action # Convert sequence formats use Bio::SeqIO; my $in = Bio::SeqIO->new(-format => ‘embl’, -file=> ‘file.embl’); my $out = Bio::SeqIO->new(-format =>’gcg’,-file =>’>file.gcg’); while( my $seq = $in->next_seq ) { $out->write_seq($seq); } ---------------------------------- use Bio::SearchIO; my $in = Bio::SearchIO->new(-format => ‘fasta’, -file => ‘result.fastp’); while( my $r = $in->next_result ) { while( my $h = $r->next_hit ) { while( my $hsp = $h->next_hsp ) { print $hsp->query->start, “..”,$hsp->query->end,” ”; } } BOSC 2005 }
  • 13. Measuring sucess • 2002 Paper has 100+ Citations • Bits and pieces used in many informatics tools • Bio::SearchIO (Rfam, In-Paranoid) • Generic Genome Browser (Stein et al, 2002) • Comparative Genomics Library (Yandell, Mungall, et al) • [EnsEMBL] BOSC 2005
  • 14. Bioperl Taught At Tutorials and Courses CSHL Bioinformatics Software Courses Various mini courses Bootcamp 2004 MIT, Duke, EBI, Pasteur (Montreal) BOSC 2005
  • 15. Knowledge of Bioperl is A Marketable Skill http://bioag.byu.edu/botany/homepage/botweb/jobs_Ph.D.htm Academic Facilities Coordinator II (Facilities) http://www.ce.com/education/Bioinformatics-Certificate- Bioinformatics Position at the UCR Genomics Institute Program-10082109.htm UC, Riverside Track 1: Suggested electives for computer scientists and IT professionals * Advanced Sequence Analysis in Bioinformatics (2 units) The Center for Plant Cell Biology (CEPCEB) in the Genomics Institute of the University of California, Riverside, * Gene Expression and Pathways (2 units) invites applicants for an Academic Facilities Coordinator II position, an academic-track 11-month appointment. * Protein Structure Analysis in Bioinformatics (2 units) Salary for the position is commensurate with education and experience. * Design and Implementation of Bioinformatics Infrastructures (3 units) * DNA Microarrays - Principles, Applications and Data Analysis (2 Units) The successful applicant will be expected to organize a small bioinformatics team to provide support to The Center * Parallel and Distributed Computing for Bioinformatics (2 units) for Plant Cell Biology. This team will implement currently available bioinformatics tools including relational database Track 2: Suggested electives for molecular biologists and other scientists support and will develop user-specific data-mining tools. The appointee will be expected to develop research * Introduction to Programming for Bioinformatics II*** (3 units) collaborations with the faculty and teach or organize short courses that will inform that will inform the local * Introduction to Programming for Bioinformatics III (3 units) community about the available bioinformatics resources. Applicants must have a Ph.D. in the Biological Sciences BioPerl for Bioinformatics (2 units) * (Plant Biology is preferred). Applicants with experience in leading a bioinformatics group will be given preference. * Design and Implementation of Bioinformatics Infrastructures (3 units) Additionally, the applicant must be proficient in one or more programming languages (PERL, PYTHON, JAVA, C++) * Gene Expression and Pathways (2 units) and have a good understanding of database design and implementation. In addition, applicants should have * Protein Structure Analysis in Bioinformatics (2 units) experience using one of the open source bioinformatics frameworks such as BIOJAVA or * DNA Microarrays - Principles, Applications and Data Analysis (2 units) BIOPERL. The applicant should have experience with software collaboration tools such as CVS. The applicant will be expected to oversee the purchase, installation, and management of the necessary computer hardware and http://www.foothill.fhda.edu/bio/programs/bioinfo/curric.shtml software required to provide bioinformatics support to several users. A good understanding of the UNIX operating system and systems administration is also an important qualification. CAREER CERTIFICATE REQUIREMENTS (49 units)* http://careers.psgs.com/CareerOppLocation2.asp?location=BETHESDA Biotechnology Core Courses (14 units) CF-046: Bioinformatics Specialist BTEC 51A Cell Biology for Biotechnology (3 units) BTEC 52A Molecular Biology for Biotechnology (3 units) The NIAID Office of Technology and Information Systems (OTIS)/Bioinformatics and Scientific IT Program (BSIP) is seek-ing BTEC 65 DNA Electrophoretic Systems (1 unit) a bioinformatics specialist. The position includes managing our bioinformatics services and applications, training NIH BTEC 68 Polymerase Chain Reaction (1 unit) scientists, creating and modifying simple bioinformatics software, and collaborating with NIH scientists on specific projects. BTEC 71 DNA Sequencing & Bioinformatics (1 unit) [snip] BTEC 76 Introduction to Microarray Data Analysis (2 unit) BTEC 64 Protein Electrophoretic Systems (1 unit) The qualified candidate must hold a Master's Degree (or equivalent) and three years of experience or a Ph. D in life science BTEC 66 HPLC (2 units) or computer science. The candidate must have strong interpersonal, written and oral communication skills and be a lateral thinker. Must be able to communicate current bioinformatics technology in a clear and precise manner and to discuss Computer Science Core Courses (30 units) projects with scientists and advise what relevant tools may be used or implemented, have the ability to locate relevant data/ CIS 52A Introduction to Data Management Systems (5 units) information and put this into the context of projects they work on. Candidate must have expert knowledge of UNIX (Mac OS CIS 52B2 Introduction to Oracle SQL (5 units) X Darwin a plus), with the ability to install and configure command-line-based applications and services. These include: UNIX CIS 68A Introduction to UNIX (5 units) windowing systems (X11, FreeX86), UNIX-based open source scientific applications, Perl and BioPerl scripts, CIS 68E Introduction to PERL (5 units) HTML, XML. Must have experience with one or more of the following: · Web services (WSDL, SOAP) · Genomics and CIS 68H Introduction to BioPerl (5 units) proteomics · Protein structure prediction/visualization. · Sequence analysis, alignment and database searches. · Regular BOSC 2005 COIN81 Bioinformatics Tools & Databases (5 units) expressions and relational database queries. · Data mining, text mining · Biostatistics
  • 16. Why has the Project Worked? • A critical mass of individuals who wanted to solve a common problem • Infrastructure that works • Sense of community • Sharing information, tutorials BOSC 2005
  • 17. Some Key Developers Not Pictured Richard Adams Scott Cain Sean Davis Stefan Kirov Nathan Haigh Marc Logghe Chad Heikki Aaron Steve Jason Shawn Allen Ewan Chris M Hilmar Lincoln Brian Chris D BOSC 2005
  • 18. Infrastructure • Easily accessible CVS, web service • Reliable • Machines that we own • Additional services can be added • Moving to faster machines this summer BOSC 2005
  • 19. Collaborative Nature of the Project • Folks working on similar problems • Pooled development resources • Friendly & interactive way to do science • Shared user support load... BOSC 2005
  • 20. Open Source is Good! • Fix things faster • (If you really care about the fix) • Everyone can see solution, audit code • Contribution can be modular • “Give a little, get a lot” • More developers for less resources BOSC 2005
  • 21. Open Source Doesn’t Solve Everything • Best laid plans • Decision by consensus ... • ... Free-for-all committing • Things don’t “just happen” • Need strong leadership, vision, and commitment • Documentation tends to lag • Interesting, needed code trumps boring BOSC 2005
  • 22. Great Power == Great Responsibility • Making releases • Point releases maintain API • Attempt to maintain backwards compatibility between stable releases • Many people depend on the code • Additional modules should maintain continuity BOSC 2005
  • 23. We Know how to have fun! BOSC 2005
  • 24. Bioperl: Present Day • Parsers for • Sequence & alignment parsing • Gene Prediction output • Phylogenetic trees • Weight Matricies (TFBS) • Ontologies • Structure data • Various CompBio, MolEvol apps BOSC 2005
  • 25. Bioperl: Present Day • Tools for • Sequence manipulation (I/O, manip) • Graphical sequence rendering • Sequence & alignment stats, popgen tests, molevo tests • Access to Local & Remote Sequence Databases BOSC 2005
  • 26. Present Day Stats • 780+ modules • 336,000 lines of code • ~9,000 tests in unit test ‘t’ dir • 57 utility scripts, 60 example scripts • 11 HOWTO docs • 44 Q&A in FAQ BOSC 2005
  • 27. Bioperl Magic • Using interfaces to hide specifics • Bio::DB::GenBank & Bio::Index::GenBank • Objects can masquerade like other objects • Bio::SeqFeature::Generic, Bio::Location, DB::GFF::RelSegment, DB::GFF::Feature BOSC 2005
  • 28. Bioperl Magic use Bio::DB::GenBank; use Bio::DB::Fasta; my $db = Bio::DB::GenBank->new; my $idx = Bio::DB::Fasta->new($fafile_dir); my $seq1 = $db->get_Seq_by_acc($acc); my $seq2 = $idx->get_Seq_by_acc($acc); use Bio::DB::FileCache; my $cachedb = Bio::DB::FileCache->new($db); my $acc = $cachedb->get_Seq_by_acc($acc); BOSC 2005
  • 29. ...The Problems • Interfaces system is cluttered • Learning curve is steep • Nonsensical to some people • OO Code is too slow • Developer community needs growth BOSC 2005
  • 30. Why is Bioperl Slow (in some places)? 1. Object-Oriented Perl is a Hack! 1. $class->SUPER::new(@_); 2. $class->SUPER::_rearrange(@_); 3. Object overhead can be high. BOSC 2005
  • 31. Research with Bioperl • Genome annotation pipeline • Genome Browsing • Comparative genomics • Phylogenomics BOSC 2005
  • 32. S.cerevisiae Existing Genome C.glabrata Annotation K.lactis New K.waltii Annotation A.gossypii D.hansenii C.guilliermondii C.lusitaniae Hemiascomycota Y.lipolytica P.anserina C.globsum N.crassa M.grisea Euascomycota F.graminearum A.fumigatus A.nidulans H.capsulatum C.immitis Archiascomycota S.pombe C.cinereus P.chrysosporium Basidiomycota C.neoformans U.maydis Zygomycota R.oryzae H.sapiens Protein ML tree based on 100 random orthologs A.thaliana 10
  • 33. Genome Annotation Pipeline Rfam tRNAscan ZFF to GFF3 GFF to AA SNAP FASTA predicted Proteins proteins Twinscan all-vs-all GFF2 to GFF3 Bio::DB::GFF Tools::Glimmer Glimmer SearchIO protein to Genscan Genome genome Tools::Genscan coordinates homologs HMMER MCL SearchIO SearchIO and BLASTZ orthologs BLASTN Tools::GFF GLEAN GFF2 to GFF3 exonerate (combiner) Proteins Gene protein2genome families exonerate ESTs est2genome BOSC 2005
  • 34. Genome Browsing BOSC 2005
  • 35. Comparative Genomics • Find orthologs, align, build trees • Assess gene structure among orthologs • Find gene families • Assess gene structure • Copy number among genomes BOSC 2005
  • 36. Comparative Genomics: Gene Structure Species phylogeny Bio::Tree/TreeIO Bio::DB Bio::AignIO Bio::SimpleAlign Bio::SeqIO MUSCLE Alignments Score shared orthologs Bio::Coordinate with Introns intorn positions on Tree BOSC 2005
  • 37. Phylogenomics Species phylogeny Bio::TreeIO gene MrBayes families Bio::DB PhyML Classify tree Bio::SeqIO Bio::AignIO MUSCLE differences ProtML PUZZLE Bin genes by orthologs topology BOSC 2005
  • 38. Bioperl - The Future BOSC 2005
  • 39. Where Should Bioperl be Headed? • Concise set of functionality that just works? • Supporting leading edge of research? • Beginner level tools? • Advanced tools? BOSC 2005
  • 40. Challenges • Logistics • Building and maintaining large projects takes a lot of time • Project leaders with vision (and time) • Face-to-face time invaluable for coordination • Need more dedicated developers BOSC 2005
  • 41. Starting Over? • Years of organic development don’t make a wholly consistent toolkit • Keeping POD up to date vs API maintenance • Autodoc-ing? • Perl6 rewrite? • Do we start a Bioperl 2, from scratch? BOSC 2005
  • 42. Supporting the Toolkit • Consider applying for funding to support a developer? • Finding companies to provide more dedicated user support? • Gatherings (hackathons) • DocFests to increase documentation • Better incentives for bug fixes BOSC 2005
  • 43. My Priority Queue • Simplified API, hidden interfaces • Overview docs point functionality to modules • Speed • Extensibility BOSC 2005
  • 44. A Plan • Continue Bioperl 1.x • Try and fix interface overload • Work out speed issues (maybe) • Encourage dabbling into a 2.0 codebase • Outline some key principals in new codebase. • Consider prototypes & Perl6 BOSC 2005
  • 45. We’ve Grown Up • Over the course of the project we have • Gotten married • Had kids • Changed continents • Changed jobs BOSC 2005
  • 47. Thanks • Developers - past and present • Ewan Birney • Lincoln Stein • Chris Dagdigian • Brian Osborne BOSC 2005