A recent issue of Nature contained a series of articles about the computer programs, or “code”, used as part of the scientific workflow.
In a nutshell:
- Nick Barnes encouraged scientists to share their code, making it a part of the scientific record, available to others to use, scrutinize, and (often) improve
- Zeeya Merali struck a more ominous tone, discussing the (apparently) frequent shortcomings of academic scientists who have, willingly or unwillingly, become computer programmers
Although I disagree with the overwhelmingly negative tone of the second article, both pieces address a subject of fundamental importance: the increasing reliance of scientists on “custom” computer code, and the establishment of best practices for archiving, documenting, testing, maintaining, receiving credit for, and making available these custom programs or snippets of code.
Establishing best practices is, unfortunately, a terribly complex endeavor, with commercial, legal, and ethical implications, and best practices are often not one-size-fits-all: one person’s Shangri-La is often another’s Waterloo. Part of this complexity arises from the fact that computer code frequently is, and often should be, mutable: programs are living documents that change as new features are added, bugs are fixed, documentation improves, and so on. Some would argue that access to these changes should be as open as possible, while others may argue that only vetted “snapshots” of computer code, taken at particular times, are appropriate for end users to view and/or use. Still others would claim that computer code is only valid for end use when written by “professionals”, although “bugs” of all sorts are present in commercial and non-commercial (e.g., scientific) computer programs alike.
Because computer code is (often? occasionally?) mutable, it is also not as easy to archive as, say, data – which, in theory, are immutable once collected. Nor are online data archives (GenBank, Dryad, TreeBASE), in their current form, adequate or entirely appropriate for providing access to computer code.
Version control systems, used as part of the software development process, solve part of this problem by allowing us to track changes to, and make snapshots of, computer programs. But formal means of explicitly linking a snapshot or version of code to the particular data set it was applied to do not exist. In essence, we have started to provide and explicitly track data (via DOI, GenBank accession number, etc.), results (via manuscript), and laboratory methods (via manuscript), but we do NOT explicitly track the software used to process it all. With respect to creating repeatable scientific work, this is a fundamental oversight.
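To make this concrete, here is a minimal sketch in Python of what such tracking could look like: recording the current git commit hash of the analysis code in a small “provenance” file that travels with the data. The file names and metadata fields are invented for the example, not a proposed standard.

```python
import json
import subprocess
from datetime import datetime, timezone


def current_commit_hash(repo_dir="."):
    """Return the git commit hash of the code as it exists right now."""
    return subprocess.check_output(
        ["git", "rev-parse", "HEAD"], cwd=repo_dir, text=True
    ).strip()


def record_provenance(data_file, script, repo_dir="."):
    """Write a small JSON sidecar linking a data set to the exact code version."""
    record = {
        "data_file": data_file,
        "analysis_script": script,
        "code_commit": current_commit_hash(repo_dir),
        "run_at": datetime.now(timezone.utc).isoformat(),
    }
    with open(data_file + ".provenance.json", "w") as fh:
        json.dump(record, fh, indent=2)
    return record


# Hypothetical usage: link 'alignment.fasta' to the script that produced it.
if __name__ == "__main__":
    print(record_provenance("alignment.fasta", "build_alignment.py"))
```

An archive could then store such a sidecar file with the deposited data, so that retrieving the data also identifies the exact revision of the code that produced it.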
Again, though, we return to the fact that the problem is complex – do we mandate that everyone release their code prior to publication? Do we mandate that everyone place their code in a version control system? Do we only accept code with unit tests and documentation? What do we do about the contentious issue of licensing? How do we encourage and reward well-documented and well-tested code? Can we convince data-sharing sites to accept and associate a “snapshot” of computer code (e.g., a hash associated with a particular commit) with the data analyzed with that particular bit(s) of code?
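On that last question, a workable “snapshot” need not even come from a version control system: a cryptographic hash of the exact code file gives a compact, verifiable fingerprint that a data-sharing site could store alongside the data. A minimal sketch (the script name is hypothetical):

```python
import hashlib


def code_fingerprint(path):
    """Return the SHA-256 digest of a code file; identical code gives an identical digest."""
    with open(path, "rb") as fh:
        return hashlib.sha256(fh.read()).hexdigest()


# Hypothetical: an archive stores this digest alongside the analyzed data, so
# anyone can later verify that they hold the exact script that was used.
print(code_fingerprint("summarize_fst.py"))
```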
On the flip side, one could also argue that this “problem” is, to some degree, a manufactured one… There are any number of scientific programs that are open-source, well-documented, well-tested, and in common use. There are also any number of individuals and research groups making their code available to the general public, and my opinion is that this trend is on the upswing – influenced largely by the concomitant focus on open data. Changing attitudes are also fueled by the availability of several code-sharing sites. Yet we still lack the means to connect code with data, and few would argue that there is no room for improvement.
As part of the Molecular Ecology/Molecular Ecology Resources web presence, we are planning to create an additional site for the archiving and discussion of computer code, particularly the smaller code fragments that all of us use and love (?). Essentially, the site is meant to address some of the problems discussed above – primary among them keeping these code fragments alive and available to everyone. We’ve had a bit of a hard time deciding how best to do this, for many of the reasons listed above (and a few extras; a rough sketch of one possible design follows the list):
- we want to archive a changeable object
- we would like to provide a persistent link to the object
- we want to maintain some form of version control for submitted objects
- we want to allow and encourage discussion about posted objects
- we want to distribute the burden of applying these metadata to a community of users
- we want access to be easy and pain free
- we want to avoid spam to the degree possible
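To make this wish list a little more tangible, here is one rough sketch, in Python, of a data model that could cover the first few points: a changeable object behind a persistent identifier, with a version history and attached discussion. Every name here is an assumption about a design we have not yet settled on:

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Version:
    """One immutable snapshot of a submitted code fragment."""
    content_hash: str   # e.g., SHA-256 of the code at submission time
    submitted_at: str   # ISO 8601 timestamp
    change_note: str    # what changed relative to the previous version


@dataclass
class Comment:
    """A single entry in the discussion attached to a code object."""
    author: str
    text: str


@dataclass
class CodeObject:
    """A changeable object reachable through a persistent identifier."""
    persistent_id: str  # stable link target; never changes as the code evolves
    title: str
    versions: List[Version] = field(default_factory=list)  # full version history
    comments: List[Comment] = field(default_factory=list)  # community discussion

    def latest(self):
        """Return the most recent snapshot, if any version has been submitted."""
        return self.versions[-1] if self.versions else None
```

The key design choice in this sketch is that the persistent identifier names the object, not any single version of it, so links keep working as the code changes.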
Whatever workable solution we arrive at, we hope it will address a majority of these points and provide a useful resource to the community. Yet there remains significant room for discussion and improvement: one issue, of several that remain, is how best to establish the link between data and the computer code used to analyze those data – a question that deserves attention within the scientific community, and one that the existing data archives may help to address.
*The title of this post is a reference to the Rongorongo inscriptions – a famous set of undeciphered writings – which essentially represent a problem that has proven very difficult to solve.