Tag Archives: open access

Why scientists should code in the open

All too often, I encounter published papers in which the code is “available upon request”, or “available in the supplementary materials” (as a zip file). This is not just poor form. It also hurts your software’s future. (And, in my opinion, when results depend on software, it is inexcusable.)

Given the numerous options for posting code online, there’s just no excuse to give code in a less-than-convenient format, upon publication. When you publish, put your code on Github or Bitbucket.

In this piece, I’ll go even further: put your code there from the beginning. Put your code there as soon as you finish reading this article. Here’s why:

No, you won’t get scooped

Reading code is hard. Ask any experienced programmer: most have trouble reading code they themselves wrote a few months ago, let alone someone else’s code. It’s extremely unlikely that someone will browse your code looking for a scoop. That time is better spent doing research.

It’s never going to be ready

Another thing I hear is that they want to post their code, but they want to clean it up first, and remove all the “embarrassing” bits. Unfortunately, science doesn’t reward time spent “cleaning up” your code, at least not yet. So the sad reality is that you probably will never actually get to the point where you are happy to post your code online.

But here’s the secret: everybody is in that boat with you. That’s why this document exists. I recommend you read it in full, but this segment is particularly important:

When it comes time to empirically evaluate new research with respect to someone else’s prior work, even a rickety implementation of their work can save grad-student-months, or even grad-student-years, of time.

Matt Might himself is as thorough and high-profile as you get in computer science, and yet, he has this to say about code clean-up:

I kept telling myself that I’d clean it all up and release it some day.

I have to be honest with myself: this clean-up is never going to happen.

Your code might not meet your standards, but, believe it or not, your code will help others, and the sooner it’s out there, the sooner they can be helped.

You will gain collaborators and citations

If anyone is going to be rifling through your code, they will probably end up asking for your help. This happens with even the best projects: have a look at the activity on the mailing lists for scikit-learn or NumPy, two of the best-maintained open-source projects out there.

When you have to go back and explain how a piece of code worked, that’s when you will actually take the time and clean it up. In the process, the person you help will be more likely to contribute to your project, either in code or in bug reports, improvement suggestions, or even citations.

In the case of my own gala project, I guess that about half of the citations it received happened because of its open-source code and open mailing list.

Your coding ability will automagically improve

I first heard this one from Albert Cardona. They say sunlight is the best disinfectant, and this is certainly true of code. Just the very idea that anyone can easily read their code will make most people more careful when programming. Over time, this care will become second nature, and you will develop a taste for nice, easy-to-read code.

In short, the alleged downsides of code-sharing are, at best, longshots, while there are many tangible upsides. Put your code out there. (And use a liberal open-source license!)

Why PLOS ONE is no longer my default journal

Time-to-publication at the world’s biggest scientific journal has grown dramatically, but the nail in the coffin was its poor production policies.

When PLOS ONE was announced in 2006, its charter immediately resonated with me. This would be the first journal where only scientific accuracy mattered. Judgments of “impact” and “interest” would be left to posterity, which is the right strategy when publishing is cheap and searching and filtering are easy. The whole endeavour would be a huge boon to “in-between” scientists straddling established fields — such as bioinformaticians.

My first first-author paper, Joint Genome-Wide Profiling of miRNA and mRNA Expression in Alzheimer’s Disease Cortex Reveals Altered miRNA Regulation, went through a fairly standard journal loop. We first submitted it to Genome Biology, which (editorially) deemed it uninteresting to a sufficiently broad readership; then to RNA, which (editorially) decided that our sample size was too small; and finally to PLOS ONE, where it went out to review. After a single revision loop, it was accepted for publication. It’s been cited more than 15 times a year, which is modest but above the Journal Impact Factor for Genome Biology — which means that the editors made a bad call rejecting it outright. (I’m not bitter!)

Overall, it was a very positive first experience at PLOS. Time to acceptance was under 3 months, time to publication under 4. The reviewers were no less harsh than in my previous experiences, so I felt (and still feel) that the reputation of PLOS ONE as a “junk” journal was (is) highly undeserved. (Update: There’s been a big hullabaloo about a recent sting targeting open access journals with a fake paper. PLOS ONE came away unscathed. See also the take of Mike Eisen, co-founder of PLOS.) And the number of citations certainly vindicated PLOS ONE’s approach of ignoring apparent impact.

So, when looking for a home for my equally-awkward postdoc paper (not quite computer vision, not quite neuroscience), PLOS ONE was a natural first choice.

The first thing to go wrong was the time to publication, about 6 months. Still better than many top-tier journals, but no longer a crushing advantage. And it’s not just me: there’s been plenty of discussion about time-to-publication steadily increasing at PLOS ONE. But I was not too worried about the publication time, since I’d put my paper up on the arXiv (and revised it at each round of peer-review, so you can see the revision history there — but not on PLOS ONE).

But, after multiple rounds of review, the time came for production, at which point they messed up two things: they did not include my present address; and they messed up Figure 1, which is supposed to be a small, single-column, illustrative figure, and which they made page-width. The effect is almost comical, and my first impression seeing page 2 would be to think that the authors are trying to mask their incompetence with giant pictures. (We’re not, I swear!)

Figure 1 of my paper on arXiv (left) and PLOS ONE (right)
Figure 1 of our paper on arXiv (left) and PLOS ONE (right)

Both of these mistakes could have been avoided if PLOS ONE did not have a policy of not letting you see the camera-ready pdf before it is published, and of not allowing corrections to papers unless they are technical or scientific, regardless of fault. Not to mention they could have, you know, actually looked at the dimensions embedded in the submitted TIFFs. With a $1,300 publication fee, PLOS could afford to take a little bit of extra care with production. Both of the above policies are utterly unnecessary — the added cost of sending authors a production proof is close to nil, and keeping track of revisions on online publications is also trivial (see the 22 year old arXiv for an example).

We scientists live and die by our papers. We don’t want the culmination of years of work to be marred by a silly, easily-fixed formatting error, ossified by an unwieldy bureaucracy. I’ve been an avid promoter of PLOS (and PLOS ONE in particular) over the past few years, but I’m sad to say that’s not where my next paper will end up.

Ultimately, PLOS ONE’s model, groundbreaking though it was, is already being supplanted by newcomers. PeerJ offers everything PLOS ONE does at a fraction of the cost, and further includes a preprint service and open peer-review. Ditto for F1000 Research, which in addition offers unlimited revisions (a topic close to my heart ;). And both use the excellent MathJAX to render mathematical formulas, unlike PLOS’s archaic use of embedded images. They get my vote for the journals of the future.

[Note: the views expressed herein are mine alone — no co-authors were harmed consulted in the writing of this blog post.]

References

Nunez-Iglesias J, Liu CC, Morgan TE, Finch CE, & Zhou XJ (2010). Joint genome-wide profiling of miRNA and mRNA expression in Alzheimer’s disease cortex reveals altered miRNA regulation. PloS one, 5 (2) PMID: 20126538

Kravitz DJ, & Baker CI (2011). Toward a new model of scientific publishing: discussion and a proposal. Frontiers in computational neuroscience, 5 PMID: 22164143

Juan Nunez-Iglesias, Ryan Kennedy, Toufiq Parag, Jianbo Shi, & Dmitri B. Chklovskii (2013). Machine learning of hierarchical clustering to segment 2D and 3D images arXiv arXiv: 1303.6163v3

Nunez-Iglesias J, Kennedy R, Parag T, Shi J, & Chklovskii DB (2013). Machine Learning of Hierarchical Clustering to Segment 2D and 3D Images. PloS one, 8 (8) PMID: 23977123