Software plays an increasingly large part in scientific research, but in most cases its growth is organic. The lifetime of research software is often as short as a postdoctoral contract: once the researcher moves on, custom-written niche code is frequently poorly documented, its components are not reusable, and the overall development effort is likely lost.
This is a case study of how research software for genomics has evolved within my research group at the Department of Genetics at Cambridge University. As our research questions changed over the past decade, we moved from Perl code and regular expressions to R and statistical analysis, and from there to agent-based simulations in Java. I will discuss the languages and tools used, as well as the processes and how they have evolved over the years. I will also cover the factors that influence the nature of the growth, such as funding, and how 'open source' as a default has changed our development work. Finally, I will look ahead at how we expect software usage to grow.
In presenting the problems and discussing possible solutions, this talk will also look at the role institutions play in helping to address these issues. In particular, the Software Sustainability Institute (SSI, http://software.ac.uk/) works in the UK to promote the development, maintenance and (re)use of research software.
The Eclipse Foundation, with the Science Working Group, works to facilitate software sharing and reuse. How can organisations like the SSI and Eclipse align their strategies and activities for maximum effect?
2. brief bio & experience
since 2015: Fellow of the SSI
since 2013: IoT entrepreneur
2008-2016: Royal Society research group leader at University of Cambridge
2011-2015: Scientific advisor to FlyBase
2012-2015: MPhil Director for Computational Biology
3. ‣ a UK government-funded “virtual institute” for
building better, sustainable software
‣ primarily focussed on academic software but very
inclusive to industry partners
‣ distributed team with a few members at universities
in Southampton, Oxford, Manchester and Edinburgh
plus a vast network of independent fellows “in the
field”
Software
Sustainability
Institute
http://www.software.ac.uk
@SoftwareSaved
4. software
‣ good, reusable code
‣ well documented
people
‣ recognition and reward
‣ career paths
values
‣ reproducibility
‣ openness
policy
‣ raise awareness
‣ establish facts
6. ‣ Software reaches boundaries that prevent
improvement, growth and adoption
‣ Providing the expertise and services needed to
negotiate to the next stage:
✓ software reviews and refactoring
✓ collaborations between stakeholders (Hi, Eclipse!)
✓ guidance and best practice on software development
✓ training (e.g. Software Carpentry)
✓ project management
✓ community building
✓ publicity etc…
10. ‣unsupervised undergraduate
project
‣inspired by the need of a PhD
student
‣no software manual or help
‣requests for code: 0
‣URL is long dead, no idea about
the whereabouts of code
very generous
for the time!
11. ‣addressed my own needs as a
biologist (“got the job done”)
‣horrible mix of object-oriented
and spaghetti code
‣required complex manipulations
in the source to update information
that quickly went out of date
‣requests for code: many; but too
embarrassed to put it on
SourceForge
“If you would like to adapt GO-Cluster to your personal
needs and want the source code (only fairly commented),
please contact my group leader Dr. Reinhard Schuh.”
13. BAD SCIENCE
“All other data analyses were performed
using custom-written Perl scripts or
publicly available websites.”
“All downstream analyses were performed
with custom-made Perl scripts.”
“All data analysis was performed using custom-written
Perl scripts and statistical tests were performed with R.”
Embarrassingly unscientific quotes from a few of
my data analysis papers from 2005 to 2008
i.e.: “f$@k you, I can’t be asked telling you what I did!” in
combination with
mostly uncommented write-only and execute-once type scripts
14. OPEN DATA, OPEN SOURCE, OPEN ACCESS, OPEN SCIENCE
since the early 2010s: increased pressure in the
community not only to release data, but also tools
‣sometimes requested by journals
‣often required to appease reviewers
‣frequent naming and shaming on Twitter
15. simple Perl CGI script with
MySQL backend
‣easy to update content :-)
‣no analytical capability :-(
using the InterMine framework,
based on Java, JSP, Ajax and
PostgreSQL
‣fancy features and looks :-)
‣requires a specialist to do any
update :-(
FlyTF is a gold standard,
but has never been funded!
Technical upgrade (feature-rrhea)
was motivated because content-
only updates are hard to publish.
16. ‣Java
‣hardware- and OS-independent
‣GUI and config files
‣extensive documentation for
end-users and programmers
‣code refactored regularly to
ease readability for novices
‣all source on GitHub
17. Issues with (academic) software development
‣ typically little or no dedicated budget for software
development on scientific grants
‣ even if funded, resources are often too little to
adhere to best practices (e.g. lack of a planning
phase)
‣ often very ad-hoc with a focus on getting ‘one job
done’, not with reuse and sustainability in mind
‣ there’s no credit for writing good software
‣ code generated by ‘amateurs’ with a high turnover
of people with skills
‣ academic salaries are poor compared to industry
salaries - it’s hard to get professional software
developers
18. Software
Sustainability
Institute
Work better. Together.
This presentation is on Slideshare:
http://www.slideshare.net/BorisAdryan
For the community. Driven by individuals. Us.