Skip to main content

Why scientists should code in the open

All too often, I encounter published papers in which the code is "available upon request", or "available in the supplementary materials" (as a zip file). This is not just poor form. It also hurts your software's future. (And, in my opinion, when results depend on software, it is inexcusable.)

Given the numerous options for posting code online, there's just no excuse to give code in a less-than-convenient format, upon publication. When you publish, put your code on Github or Bitbucket.

In this piece, I'll go even further: put your code there from the beginning. Put your code there as soon as you finish reading this article. Here's why:

No, you won't get scooped

Reading code is hard. Ask any experienced programmer: most have trouble reading code they themselves wrote a few months ago, let alone someone else's code. It's extremely unlikely that someone will browse your code looking for a scoop. That time is better spent doing research.

It's never going to be ready

Another thing I hear is that they want to post their code, but they want to clean it up first, and remove all the "embarrassing" bits. Unfortunately, science doesn't reward time spent "cleaning up" your code, at least not yet. So the sad reality is that you probably will never actually get to the point where you are happy to post your code online.

But here's the secret: everybody is in that boat with you. That's why this document exists. I recommend you read it in full, but this segment is particularly important:

When it comes time to empirically evaluate new research with respect to someone else's prior work, even a rickety implementation of their work can save grad-student-months, or even grad-student-years, of time.

Matt Might himself is as thorough and high-profile as you get in computer science, and yet, he has this to say about code clean-up:

I kept telling myself that I'd clean it all up and release it some day. I have to be honest with myself: this clean-up is never going to happen.

Your code might not meet your standards, but, believe it or not, your code will help others, and the sooner it's out there, the sooner they can be helped.

You will gain collaborators and citations

If anyone is going to be rifling through your code, they will probably end up asking for your help. This happens with even the best projects: have a look at the activity on the mailing lists for scikit-learn or NumPy, two of the best-maintained open-source projects out there.

When you have to go back and explain how a piece of code worked, that's when you will actually take the time and clean it up. In the process, the person you help will be more likely to contribute to your project, either in code or in bug reports, improvement suggestions, or even citations.

In the case of my own gala project, I guess that about half of the citations it received happened because of its open-source code and open mailing list.

Your coding ability will automagically improve

I first heard this one from Albert Cardona. They say sunlight is the best disinfectant, and this is certainly true of code. Just the very idea that anyone can easily read their code will make most people more careful when programming. Over time, this care will become second nature, and you will develop a taste for nice, easy-to-read code.

In short, the alleged downsides of code-sharing are, at best, longshots, while there are many tangible upsides. Put your code out there. (And use a liberal open-source license!)

Comments

Comments powered by Disqus