BioPerl at 15: New Features, New Directions

Christopher J. Fields, University of Illinois, cjfields@uiuc.edu
*Mark A. Jensen, Fortinbras Research and SRA International, mark_jensen@sra.com
Jason E. Stajich, University of California at Riverside, jason.stajich@ucr.edu

The BioPerl Project, an open-source Perl toolkit for bioinformatics, was initiated in 1995 and became instrumental in the automated organization and analysis of original Human Genome Project data. Since then, BioPerl has become a complete object-oriented Perl environment for bioinformatics development, with modules to perform a wide range of bioinformatics functions, including multi-format parsing and translation, object-relational model databasing, EMBL and NCBI web service access, and external program execution.

The BioPerl developer community is actively responding to the far-reaching changes in the field that have taken place over the last several years. Major goals are: (1) to provide new functionality useful to researchers at the cutting edge of bioinformatics, (2) to reorganize BioPerl into smaller application-oriented packages, (3) to deprecate older modules whose utility has declined substantially, and (4) to continue to expand and improve documentation, so that BioPerl remains useful and relevant in the years ahead.

Google Summer of Code

BioPerl has provided mentorship for GSoC projects for the past three years. These projects have resulted in material additions to the codebase and have focused on expanding BioPerl's capabilities in format parsing and large file processing.

The BioPerl wiki (http://bioperl.org)

The wiki is now the central location for all BioPerl documentation: installation, module POD, HOWTO articles, code snippets, and personnel descriptions. It has played an important role as the new face of BioPerl and as a landing place for the developer discussions that are taking BioPerl forward.

BioPerl on GitHub (http://github.com/bioperl)

BioPerl recently migrated all active repositories to GitHub from OBF-hosted Subversion. With the move to git comes decentralization and more fluid, independent development. We expect this to improve BioPerl's response time both to bugs and to new developments in the field, as well as to increase new developer recruitment and community participation.

Poster sections: Community participation and development; New features; New directions

Next-gen sequencing support

Bringing BioPerl up to speed for next-gen sequence data handling has led to efforts along three lines: file format standardization, common command-line tool wrapping, and BioPerl object system I/O integration tailored to next-gen data.

Formats

BioPerl and other Bio* projects recently published a collaborative effort to standardize FASTQ formats, including variants for Illumina and Solexa platforms. These formats are now in use across BioPerl and the Bio* projects. Support for important binary formats (BAM, BigWig) is provided by wrappers for command-line tools and by the integration of fast XS-based Perl modules such as Lincoln Stein's Bio-SamTools and Bio-BigFile CPAN packages.

Wrappers

Enhancements to the Bio::Tools::Run::WrapperBase system have made it easier to add BioPerl wrapper modules for external programs, and to integrate these into other modules that implement pipelines using BioPerl sequence and alignment objects as I/O.

Tracking NCBI developments

In the past year, NCBI has released a fully updated BLAST toolkit, blast+ †, and has been encouraging a move from their EUtilities RESTful interface to a newer SOAP interface ‡. BioPerl has responded with Bio::Tools::Run::StandAloneBlastPlus and Bio::DB::SoapEUtilities.
These were designed not only to update the API, but also to add I/O layers that accept and parse messages into familiar BioPerl objects, and to build in straightforward methods for creating pipelines of blast+ program analyses or EUtilities fetches.

† ftp://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST
‡ http://eutils.ncbi.nlm.nih.gov/entrez/eutils/soap/v2.0/DOC/esoap_help.html

Wrapped tools: bedtools, bowtie, bwa, minimo, newbler, samtools

BioPerl object support: Bio::Assembly

The Bio::Assembly system has been extensively updated to include reading and/or writing assemblies in MAQ, BAM, SAM, BWA, and other formats. Assembly object support is integrated into run wrappers for bwa, bedtools, maq, and samtools. Future work will incorporate new sequence objects that are optimized for large files (through the work of GSoC student Jun Yin).

    use Bio::Tools::Run::Maq;
    # Create the wrapper, then run maq on paired reads against a reference
    my $maq = Bio::Tools::Run::Maq->new();
    my $assy_obj = $maq->run('read1.fastq', 'refseq.fas', 'read2.fastq');

Timeline

BioPerl has grown in its user and developer base since those early days. New developers and collaborations have contributed not only key modules, but also important design methodologies and refactoring over the years that have helped BioPerl to maintain its usefulness and relevance. Discontinuities followed by increases in lines of code over time reflect a high level of community flexibility and dedication in pursuit of DTWT.

General wrapper facility

A set of modules (Bio::Tools::WrapperMaker) is under development that will increase the responsiveness of BioPerl development by providing an XML-based way for users themselves to specify the interface for their favorite command-line programs, while at the same time creating a common, consistent API for executing those programs and accessing their output.

Intermediate layers for large file handling and generic parsing

BioPerl parsers generally take raw data to Perl objects with no intermediate layer.
This induces prohibitive overhead when parsing large files, and can also limit user flexibility: parsing may be desired, but not the BioPerl objects. The first problem is being tackled by attaching backend handlers to container class constructors that are able to persist records of large files efficiently, creating BioPerl objects only as needed or desired. The second problem has led to experiments in generic parsing: data file records are parsed into a simple stream of hashes, which can then be directed wherever the user desires: into the creation of BioPerl objects as usual, or elsewhere.

Biome and BioPerl 6

BioPerl has been object-oriented from the beginning, but suffers the weaknesses of Perl 5 objects: very high overhead, loose encapsulation, limited object introspection, and the lack of built-in interfaces and roles, among other things. These issues are being addressed in two ways: in Perl 5 through the Moose classes and dependencies, and in the creation of Perl 6. BioPerl is exploring both paths to true objects with the experimental Biome (BioPerl with Metaobject Extensions) and BioPerl 6 projects.

Figure: Biome role as interface

Shattering the Monolith

BioPerl continues to be distributed as just a handful of packages. The core package in particular has grown to 341 files, comprising 874 classes with 23,146 tests. Maintenance and installation issues are barriers to developers and users alike. We are in the process of splitting the core into reasonable, application-related chunks. This, together with the git migration, should significantly improve BioPerl management.

The BioPerl Core Development Team is Sendu Bala, Rob Buels, Christopher Fields, Mark Jensen, Hilmar Lapp, Heikki Lehväslaiho, Aaron Mackey, Dave Messina, Brian Osborne, Jason Stajich, and Lincoln Stein. Key support is provided by Chris Dagdigian and Mauricio Herrera Cuadra. Florent Angly and Dan Kortschak are lead developers of projects discussed here.
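
The generic-parsing idea described under "Intermediate layers for large file handling and generic parsing" (records parsed into a simple stream of hashes that the user can direct anywhere) might look roughly like the sketch below. This is not BioPerl code: the function name fastq_record_stream and the hash keys are hypothetical, chosen only for illustration, and the sketch assumes plain four-line FASTQ records with no line wrapping. It uses core Perl only.

```perl
use strict;
use warnings;

# Hypothetical helper (not a BioPerl API): returns an iterator that yields
# one plain hash reference per FASTQ record, or undef at end of input.
# Assumption: each record is exactly four lines (header, sequence, '+', quality).
sub fastq_record_stream {
    my ($fh) = @_;
    return sub {
        defined( my $header = <$fh> ) or return;    # end of input
        my $seq  = <$fh>;
        my $plus = <$fh>;                            # separator line, ignored
        my $qual = <$fh>;
        die "Truncated FASTQ record\n" unless defined $qual;
        chomp( $header, $seq, $qual );
        $header =~ s/^@// or die "Bad FASTQ header: $header\n";
        my ( $id, $desc ) = split /\s+/, $header, 2;
        return { id => $id, desc => $desc // '', seq => $seq, qual => $qual };
    };
}

# Usage: read two records from an in-memory file handle.
my $data = "\@read1 sample\nACGT\n+\nIIII\n\@read2\nGGCC\n+\n!!!!\n";
open my $fh, '<', \$data or die "Cannot open in-memory file: $!";
my $next = fastq_record_stream($fh);
while ( my $rec = $next->() ) {
    # The consumer decides what happens to each hash: report it, filter it,
    # or pass it to an object constructor only when an object is needed.
    printf "%s\t%d bp\n", $rec->{id}, length $rec->{seq};
}
```

Because each record is an ordinary hash, the cost of building full sequence objects is paid only by consumers that actually want them.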

Google Summer of Code projects

Year  Sponsoring Institution  Student       Project                          Example Module
2008  NESCent                 Mira Han      PhyloXML parsing                 Bio::TreeIO::phyloxml
2009  NESCent                 Chase Miller  NeXML parsing                    Bio::Nexml
2010  OBF                     Jun Yin       Alignment subsystem refactoring  in progress

Timeline figure source: http://www.ohloh.net/p/bioperl

Figure: maq assembly pipeline. Stages: convert plain text sequence; map reads to reference sequence; assemble map into consensus; extract info from consensus. maq commands shown: fasta2bfa, fastq2bfq, map, mapmerge, assemble, mapview, cns2fq.

Biome role figure labels: class consumes role; Class; Role; must instantiate required abstract method; consuming class possesses role members; instance possesses concrete role methods.
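
As a rough illustration of the maq assembly pipeline stages listed in the figure, the sketch below builds the corresponding command lines for a single read set without running them. The helper name maq_pipeline_commands, the file names, and the output prefix are hypothetical; the argument order reflects our reading of the maq manual and should be checked against the usage message of the installed maq version. mapmerge is omitted because it only applies when several map files are combined.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of BioPerl or maq): returns the maq command
# lines for a single-read-set assembly as array references. Nothing is run.
sub maq_pipeline_commands {
    my (%f) = @_;    # keys: ref_fasta, reads_fastq, prefix (all caller-supplied)
    my $p = $f{prefix};
    return (
        # Convert plain text sequence to maq's binary formats
        [ 'maq', 'fasta2bfa', $f{ref_fasta},   "$p.ref.bfa" ],
        [ 'maq', 'fastq2bfq', $f{reads_fastq}, "$p.reads.bfq" ],
        # Map reads to the reference sequence
        [ 'maq', 'map', "$p.map", "$p.ref.bfa", "$p.reads.bfq" ],
        # Assemble the map into a consensus
        [ 'maq', 'assemble', "$p.cns", "$p.ref.bfa", "$p.map" ],
        # Extract information (both commands write to standard output)
        [ 'maq', 'mapview', "$p.map" ],
        [ 'maq', 'cns2fq',  "$p.cns" ],
    );
}

# Usage: print the planned commands for inspection.
my @cmds = maq_pipeline_commands(
    ref_fasta   => 'refseq.fas',
    reads_fastq => 'read1.fastq',
    prefix      => 'demo',
);
print join( ' ', @$_ ), "\n" for @cmds;
```

Keeping each command as a list rather than a single string means it could later be passed to system() in list form without shell quoting concerns. The Bio::Tools::Run::Maq wrapper mentioned on the poster hides these steps behind a single run() call.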

BioPerl (Poster T02, ISMB 2010)
