This document discusses an open platform approach to cyberinfrastructure for scientific research. It argues that rather than a single unified pipeline, science needs flexible reusable components that can be mixed and matched. Open source development that involves domain experts can help ensure software stays useful. Too often software is not reusable even in theory. Academic software development is difficult, requiring consideration of science, computing resources, user interfaces, legal issues, and integration. An "ecology" of competing pipeline components is needed.
1. An open platform approach to cyberinfrastructure
C. Titus Brown
ctb@msu.edu
Asst Professor, Michigan State University
(Microbiology, Computer Science, and BEACON)
2. khmer software
An efficient, sensitive, and specific pipeline component for extremely
scalable shotgun sequencing analysis
github.com/ged-lab/khmer
4. Academic software development is really, really hard!
Considerations of “remixing” are in addition to:
• Interesting science
• Sufficient compute
• User interface
• Liability and other legal issues
• Integration
5. Towards an “ecology” of components
• We don’t need “one true pipeline.”
• We need flexible, reusable, and competing pipeline components.
• Proliferation of competing components (http://xkcd.com/927/) is not a concern:
• It’s how science works!
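One way to picture "flexible, reusable, and competing pipeline components" is as small functions that share a stream-in, stream-out interface and can be swapped freely. The sketch below is illustrative only: the `Component` interface and every function name in it are hypothetical, and none of it is khmer's API.

```python
from functools import reduce
from typing import Callable, Iterable, Iterator

# Assumption for illustration: a "component" is any callable that maps
# a stream of reads to a stream of reads. Reads are plain strings here.
Read = str
Component = Callable[[Iterable[Read]], Iterator[Read]]


def drop_short(min_length: int) -> Component:
    """Build a component that discards reads shorter than min_length."""
    def run(reads: Iterable[Read]) -> Iterator[Read]:
        return (r for r in reads if len(r) >= min_length)
    return run


def drop_ambiguous(reads: Iterable[Read]) -> Iterator[Read]:
    """Discard any read containing an ambiguous base call ('N')."""
    return (r for r in reads if "N" not in r)


def trim_at_ambiguous(reads: Iterable[Read]) -> Iterator[Read]:
    """A competing alternative: keep the prefix before the first 'N'."""
    return (r.split("N", 1)[0] for r in reads)


def compose(*components: Component) -> Component:
    """Chain components left to right into a single component."""
    def run(reads: Iterable[Read]) -> Iterator[Read]:
        return iter(reduce(lambda stream, c: c(stream), components, reads))
    return run


reads = ["ACGTACGT", "ACGTANGT", "ACG"]

# Two pipelines that differ in a single, interchangeable component.
strict = compose(drop_ambiguous, drop_short(4))
lenient = compose(trim_at_ambiguous, drop_short(4))

print(list(strict(reads)))
print(list(lenient(reads)))
```

Because both variants honor the same interface, a research group can replace one step with a competing implementation and compare results without rebuilding the rest of the pipeline.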
6. • Want flexible, sustainable CI? Build open platforms, openly, with open source approaches.
– The OSS community has lots of experience in doing this, & working within incentive structures.
– Note, traditional academic incentives don’t align well.
• Agile methodologies (iterative, use-case driven, organic) keep software from going too far astray; development must directly involve (& be driven by) domain research groups.
• Too much of the software that is produced is not even reusable in theory, much less in practice. This needs to change!!!
Blog post will be at: http://ivory.idyll.org/blog/2013-gbmf-mmi.html
7. Other things I’m doing
• Scalable/sensitive/specific algorithms for shotgunomics.
• Benchmarking shotgun metagenome assembly.
• CI education (NIH/ngs; NSF/data + compute; Sloan/Software Carpentry; BEACON/intro computing for grad students)
• Hobbies/windmills:
– Open science and open data.
– Replication and reproducible research.
– Changing publication and peer review culture in biology.
Mention KBase. Want to make sequence easy again. Develop in close contact with specific biology projects.
Want to be able to do this without talking to anyone!!
I do not buy into the idea that we can project data analysis and software needs very far into the future.
We are also attacking many other things, including education and training, reproducibility, etc. Also, please stop “developing software for researchers”. We need a more bottom-up approach. Maybe mention caBIG.