Tag Archives: science

What do scientists know about open source?

A friend recently pointed out this great talk by Matt Bernius, What students know and don’t know about open source. If you have even a minor interest in open source it’s worth a watch, but the gist is: in the US alone, there are about 200,000 students enrolled in a computer science major. Open source communities are a great space to learn real-world programming, so why don’t these numbers translate into massive contributions to open source?

At the core of the issue, Matt identifies two main problems: (1) colleges and universities simply don’t teach open source, or even collaborative coding; and (2) many open source communities make newcomers feel unwelcome in a variety of ways.

I want to comment on this in the context of programming in science. That is, programming where the code is not the main product, but rather a useful tool to obtain a scientific result, for example in biology or physics. Here, we still see relatively little contribution to open source, because of related but distinct cultural issues.

I’ve sent my scientists should code in the open post to a few people and the response from most remains sceptical. I hope this post will address some of their concerns.

Scientific culture is ridiculously secretive

The most common objection is to my assertion that people won’t scoop you by looking at your code. I remember a tweet (that I sadly can’t find now) that really captured the gist of the problem. It went something like this:

Someone in science having a new idea: “Ooh, I hope I don’t get scooped!”
Someone in open source having a new idea: “Ooh, I hope someone has implemented this already!”

Update: I found the source! It’s this tweet by Elizabeth Seiver.

This is a huge gap in culture that won’t soon go away, but there are encouraging steps towards narrowing it. For example, PLOS Biology, a leading journal, recently announced that it would consider “scooped” studies for publication within six months of the “scooping”. That goes some way towards re-aligning incentives with collaborative and open science.

I’ve come across many collaborations that have started because of open source. I have not heard of someone getting scooped because of open source, though of course that sort of information would be hard to come by. Several people did write to me that they were concerned about very specific groups rifling through their code expressly for the purpose of scooping them. For me it’s hard to imagine someone even having that attitude, and my advice is that if you do face such a toxic community, it might be wise to change your chosen field of study.

Nevertheless, I want to emphasise here that open source programming can take many forms, with the zip file attached to the paper being the lowest, coding in the open being the highest, and several other models in between. Any steps you can take towards the higher models will ultimately help you. My preferred mode for code that really does have to be private is to use a private GitHub repository, and just make that repo public once the paper is accepted.

A lot of people prefer the “code dump with no revision history” model of post-publication sharing, but this tosses out a lot of valuable information for people coming after you: what have you tried that didn’t work? What issues did you have with the code? Have you considered coding in one style or another? The code dump model also makes you less likely to use GitHub in the first place, depriving you of an opportunity to learn some valuable real-world skills.

For coding, scientists have even more severe impostor syndrome

As I mentioned in my original post, and this I find completely incontestable, publishing shitty code is not a bad thing. Everybody writes bad code, and nearly everybody knows it. Here’s Hadley Wickham, creator of dplyr, ggplot2, and tidy data, among other things; in other words, someone who knows a thing or two about elegant code, and about as close as one gets to coding royalty in science:

The only way to write great code is to write lots of shitty code first.

Publishing your raw code is a good thing and will absolutely not be a black mark on your career. Indeed, in open source circles, it is often a bare GitHub contribution history that is a black mark. (And this is another problem, but in my opinion a better one.)

Scientists don’t know about open source

If knowledge of open source is lacking in computer science, what chance does it have in other fields? The truth is that outreach and education need to become a massive part of open source culture, especially in science.

I credit Stéfan van der Walt for my life in open source. After I gave a talk at SciPy 2012, he invited me to join the scikit-image sprint at the end of the conference. If it hadn’t been for that, I probably would have just wandered around the hall, too shy to join any sprint (see “impostor syndrome”, above), and my life would be very different right now.

Anyway, at that point I’d made my code “open source”, which meant it was on GitHub. I had only added a license to submit to the conference. (As a reminder, unlicensed code doesn’t count as open source.) But I had never really collaborated in open source. My idea of collaboration was my workflow with my colleague: a single branch (master), from which we both pulled and to which we both pushed. When I sat down with Stéfan and Tony Yu, and had figured out what I wanted to work on, I asked: “So, should I just push to master, or what?” I still remember, with some embarrassment, the dubious look Stéfan and Tony exchanged, as they silently figured out which of them would introduce this newbie to pull requests.

But that’s the thing: I shouldn’t feel embarrassment. Scientists for the most part don’t get introduced to coding in their education, much less to open source.

What can scientists in open source do?

A lesson from my continued contributions to the SciPy ecosystem, I hope, is that some light mentorship can yield enormous dividends later on. Stéfan and Tony took the time to walk me through the open source contribution process, when they could have dismissively sent me a link to some page explaining it. I’m a big fan of writing good documents for newcomers, but nothing beats a good hand-holding. It’s very easy for me to imagine an alternate reality where I had not felt welcome or rewarded by the scikit-image project and my life had not taken this productive turn.

Continuing on the theme of imagined realities: it is only slightly less plausible to imagine an open source scientific world awash with new contributors at every level of science. How do we turn this dream into a reality?

If you are a scientist and this post is among your first encounters with the term “open source”, and you think you might be interested in learning more, here are a few things I recommend, in order from easiest to hardest:

  • Read the preface and epilogue of my book with Stéfan and Harriet Dashnow. (Free online!) I feel a bit icky recommending my own book, but why repeat myself? In those chapters I tried to distill my thoughts on joining the SciPy community, which is a fantastic, rewarding space in which to do open source programming as a scientist. I expect many things we wrote generalise well to e.g. the tidyverse.
  • Look for upcoming software carpentry workshops near you. These are free two-day programming boot camps to introduce you to computational thinking, and, crucially, to version control with git.
  • Go to a SciPy conference. I know of SciPy, EuroSciPy, and SciPy India, but I have a vague memory of offshoots in Africa and South America.

If you are in a boat similar to mine (intermediate/advanced open source contributor in science), and you feel like you would like your work to feel a bit more crowded, I can tell you what I’m going to be doing in response to this talk:

  • Sign up to deliver (more) software carpentry training (or similar). Getting the word out is the number one thing.
  • In software carpentry, emphasise the role of git in collaboration. (I think the official program does not go far enough in this direction, and focuses instead on the initial linear history.)
  • If you are located in a university, talk to your CS department to see whether they have any courses in open source development. If not, see whether you can guest lecture in a suitable course to make students aware of the open source opportunities out there.
  • Similarly, follow up software carpentry with more advanced sessions on open source collaboration. I gained an enormous fraction of my programming skills from collaborating on open source. I really think there is no better tool for long-term learning in this space. An idea that I’d like to try out is to curate a bunch of open issues on prominent repos and get SWC students to sprint on them for a day¹. I know about the “good first issue” tag on GitHub. Unfortunately, my experience with it is mixed. I think many repos are overly optimistic with their use of it (this includes scikit-image), and, furthermore, a large proportion of these tagged issues get “claimed” quickly, and often half-heartedly!
  • Write, write, write! Did you get a cool PR merged? Write a blog post about it! Or at least tweet! We need to get the message out that writing PRs is for everyone. =)

If you have any further ideas, I’d love to hear them.


  1. Actually I drafted this post a while back, and tried this yesterday, with mixed success. I’ll write about that experience soon. ;) 

1st ASPP Asia Pacific evaluation survey

In January of 2018, we had the first ASPP summer school outside of Europe. (This was a parallel workshop to the European one, which will be held in Italy in September 2018.) In general, it was a great success, with some caveats that we will elaborate on below.

First we want to note that this school was a bit different from the European ones, in that we only had attendees from Australian institutions, whereas the European school has broad international representation, including some attendees from outside Europe. This was in some ways inevitable, as it is more expensive to travel to Australia from almost anywhere than it is to travel within Europe. On the other hand, we advertised relatively late, and we were unable to secure travel grants during the advertising period, so there is hope that a future edition will be able to attract a more international crowd from the Asia Pacific region.

Given all this, there was a question as to whether we would be able to capture the atmosphere of the school, which normally sees the students living together and socialising for basically the whole week. In this case, most students simply went home after classes were finished. But although some of that atmosphere was missing, by the end of the week we did manage to forge some close links among the students and the faculty. The evaluations below show that most of the value of the school was preserved.

We note that 100% of the respondents (29/30 of the students) would recommend the course to their peers. So, although some lectures were better received than others, and although the programming project was not universally loved, we managed to provide value for everyone. All of this is in line with the evaluations at previous schools (available at https://python.g-node.org/wiki/archives.html).

The project, which consists of programming a videogame bot, is controversial every year, but, consistently, more people like it than don’t, and people get to practice git, pair programming, and programming as a team, which is the single most difficult skill to practice when programming for science. Indeed, when we walk around during the project programming sessions, we see people extremely engaged in what they are coding. It’s difficult to imagine a scientific problem engaging such diverse people as the school’s attendees (who come from very disparate scientific fields).

Of all the feedback, two particular statements, we hope from people in the same project group, broke our hearts. We decided not to include them in this report, because they might be easy to de-anonymise by group members, but they boil down to the following: a group member, by being combative and rude to others in their team, and deciding to essentially complete the project by themselves, ruined the programming project for all of their team members, with some even feeling that they were not good enough to contribute. This is tragic, because we want everyone in the school to feel empowered to do anything at all in Python.

Absolutely every student has something to offer in this project. Here, as in life, teams are composed of members of varying skill. But we know from our selection that everyone has the skills to contribute (and this is confirmed by the fact that most attendees, for most lectures, felt that the difficulty level was “just right”). So if a student felt inadequate, it can only be because of the toxic team member.

Ned Batchelder recently wrote an excellent blog post about what he calls “Toxic experts” and what Tiziano Zito calls, somewhat more bluntly, “Arrogant assholes”. (In discussions about this post, Tiziano and others noted that one does not have to be an expert to be toxic, or arrogant, or an asshole. No matter: the points below apply equally to anyone meeting any of the above characteristics regardless of expertise.)

The feedback we received should serve as a warning to selection committees and hiring managers everywhere about how damaging it is to allow such a person into your ranks. Due to the anonymous nature of the survey, we can’t tell whether there were one or two toxic experts in our midst, but if there was one, they soured the school for five other people. If there were two, then that’s ten people, a third of the school, who might have had a terrible experience. The problem with toxic experts is that they can so quickly cause damage to so many others. Thus, even if they are a mythical “10x engineer”, they are not worth it.

Literally nothing that the above-described team member could have done, coding-wise, could make up for the damage they caused. Despite their strong opinions, they missed the entire point of the programming project, which is not to win a medal, but to learn about working in a team.

We try to avoid toxic experts in our selection process for the school, but they slip through every so often. In response to this feedback, we will aim to be even more vigilant in our selection, and also make the aims of the project as a learning exercise more explicit during its introduction. We will also make sure to be more aware of group interactions during the actual school; we apologise to the students involved that we did not catch this behaviour this time. We are truly sorry.

If you are in the position of being an expert during a school or workshop, don’t go it alone. That is a waste of your time, because you can do a programming project on your own whenever you damn well please. Slow down, and think instead about practicing your teaching and mentoring skills. They are also important in life, and, in many contexts, they are your responsibility.

You can access the full survey results here.

— Juan, and the Organisers.

SciPy’s new LowLevelCallable is a game-changer

… and combines rather well with that other game-changing library I like, Numba.

I’ve lamented before that function calls are expensive in Python, and that this severely hampers many functions that should be insanely useful, such as SciPy’s ndimage.generic_filter.

To illustrate this, let’s look at image erosion, which is the replacement of each pixel in an image by the minimum of its neighbourhood. ndimage has a fast C implementation, which serves as a perfect benchmark against the generic version, using a generic filter with min as the operator. Let’s start with a 2048 x 2048 random image:

>>> import numpy as np
>>> image = np.random.random((2048, 2048))

and a neighbourhood “footprint” that picks out the pixels to the left and right, and above and below, the centre pixel:

>>> footprint = np.array([[0, 1, 0],
...                       [1, 1, 1],
...                       [0, 1, 0]], dtype=bool)
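
To build some intuition for what erosion does, here’s a quick aside on a tiny toy image (the array values here are made up purely for illustration). With our plus-shaped footprint, the low value at the centre spreads to the pixels above, below, left, and right of it:

>>> from scipy import ndimage as ndi
>>> tiny = np.array([[5., 5., 5.],
...                  [5., 1., 5.],
...                  [5., 5., 5.]])
>>> ndi.grey_erosion(tiny, footprint=footprint)
array([[5., 1., 5.],
       [1., 1., 1.],
       [5., 1., 5.]])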

Now, we measure the speed of grey_erosion and generic_filter. Spoiler alert: it’s not pretty.

>>> from scipy import ndimage as ndi
>>> %timeit ndi.grey_erosion(image, footprint=footprint)
10 loops, best of 3: 118 ms per loop
>>> %timeit ndi.generic_filter(image, np.min, footprint=footprint)
1 loop, best of 3: 27 s per loop

As you can see, with Python functions, generic_filter is unusable for anything but the tiniest of images.

A few months ago, I was trying to get around this by using Numba-compiled functions, but the way to feed C functions to SciPy was different depending on which part of the library you were using. scipy.integrate used ctypes, while scipy.ndimage used PyCObjects or PyCapsules, depending on your Python version, and Numba only supported the former method at the time. (Plus, this topic starts to stretch my understanding of low-level Python, so I felt there wasn’t much I could do about it.)

Enter this pull request to SciPy from Pauli Virtanen, which is live in the most recent SciPy version, 0.19. It unifies all C-function interfaces within SciPy, and Numba already supports this format. It takes a bit of gymnastics, but it works! It really works!

(By the way, the release is full of little gold nuggets. If you use SciPy at all, the release notes are well worth a read.)

First, we need to define a C function of the appropriate signature. Now, you might think this is the same as the Python signature, taking in an array of values and returning a single value, but that would be too easy! Instead, we have to go back to some C-style programming with pointers and array sizes. From the generic_filter documentation:

This function also accepts low-level callback functions with one of the following signatures and wrapped in scipy.LowLevelCallable:

int callback(double *buffer, npy_intp filter_size, 
             double *return_value, void *user_data)
int callback(double *buffer, intptr_t filter_size, 
             double *return_value, void *user_data)

The calling function iterates over the elements of the input and output arrays, calling the callback function at each element. The elements within the footprint of the filter at the current element are passed through the buffer parameter, and the number of elements within the footprint through filter_size. The calculated value is returned in return_value. user_data is the data pointer provided to scipy.LowLevelCallable as-is.

The callback function must return an integer error status that is zero if something went wrong and one otherwise.

(Let’s leave aside that crazy reversal of Unix convention of the past 50 years in the last paragraph, except to note that our function must return 1 or it will be killed.)

So, we need a Numba cfunc that takes in:

  • a double pointer pointing to the values within the footprint,
  • a pointer-sized integer that specifies the number of values in the footprint,
  • a double pointer for the result, and
  • a void pointer, which could point to additional parameters, but which we can ignore for now. (See the sketch near the end of this post for one way to use it.)

The Numba type names are listed on this page. Unfortunately, at the time of writing, there’s no mention there of how to make pointers, but finding such a reference was not too hard. (Incidentally, adding CPointer to the Numba types page would make a good contribution to Numba’s documentation.)

So, armed with all that documentation, and after much trial and error, I was finally ready to write that C callable:

>>> from numba import cfunc, carray
>>> from numba.types import intc, intp, float64, voidptr
>>> from numba.types import CPointer
>>>
>>> @cfunc(intc(CPointer(float64), intp,
...             CPointer(float64), voidptr))
... def nbmin(values_ptr, len_values, result, data):
...     # view the buffer of footprint values as a NumPy array
...     values = carray(values_ptr, (len_values,), dtype=float64)
...     # write the minimum through the result pointer
...     result[0] = np.inf
...     for v in values:
...         if v < result[0]:
...             result[0] = v
...     return 1  # success status, as required by generic_filter

The only other tricky bits I had to watch out for while writing that function were as follows:

  • remembering that there are two ways to de-reference a pointer in C: *ptr, which is not valid Python and thus not valid Numba, and ptr[0]. So, to place the result at the given double pointer, we use the latter syntax. (If you prefer to use Cython, the same rule applies.)
  • creating an array out of the values_ptr and len_values variables, using carray as shown above. That’s what enables the for v in values Python-style access to the array.

Ok, so now what you’ve been waiting for. How did we do? First, to recap, the original benchmarks:

>>> %timeit ndi.grey_erosion(image, footprint=footprint)
10 loops, best of 3: 118 ms per loop
>>> %timeit ndi.generic_filter(image, np.min, footprint=footprint)
1 loop, best of 3: 27 s per loop

And now, with our new Numba cfunc:

>>> from scipy import LowLevelCallable
>>> %timeit ndi.generic_filter(image, LowLevelCallable(nbmin.ctypes), footprint=footprint)
10 loops, best of 3: 113 ms per loop

That’s right: it’s even marginally faster than the pure C version! I almost cried when I ran that.
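
Of course, speed means nothing if the answers are wrong, so here’s a quick sanity check that the Numba version agrees with the C implementation; this should print True:

>>> np.allclose(ndi.grey_erosion(image, footprint=footprint),
...             ndi.generic_filter(image, LowLevelCallable(nbmin.ctypes),
...                                footprint=footprint))
True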
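
One more aside: we ignored the user_data void pointer above, but it is the hook for passing extra parameters to your callback without recompiling. Here is a minimal sketch of how that might look; the nbthresh function and the 0.5 threshold are invented for illustration, and the sketch relies on Numba’s carray accepting a void pointer when given an explicit dtype:

>>> import ctypes
>>>
>>> @cfunc(intc(CPointer(float64), intp,
...             CPointer(float64), voidptr))
... def nbthresh(values_ptr, len_values, result, data):
...     values = carray(values_ptr, (len_values,), dtype=float64)
...     # read a single double (our threshold) through the user_data pointer
...     thresh = carray(data, (1,), dtype=float64)[0]
...     # count the footprint values strictly above the threshold
...     result[0] = 0.0
...     for v in values:
...         if v > thresh:
...             result[0] += 1.0
...     return 1
>>>
>>> threshold = ctypes.c_double(0.5)
>>> ptr = ctypes.cast(ctypes.pointer(threshold), ctypes.c_void_p)
>>> counts = ndi.generic_filter(image, LowLevelCallable(nbthresh.ctypes, ptr),
...                             footprint=footprint)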


Higher-order functions, i.e. functions that take other functions as input, enable powerful, concise, elegant expressions of various algorithms. Unfortunately, their use in Python for large-scale data processing has been hampered by Python’s function call overhead. SciPy’s latest update goes a long way towards redressing this.

Brian Greene on the Colbert Report

I promise sometime soon I’ll write something not about someone else’s videos! But for now, enjoy theoretical physicist Brian Greene on the Colbert Report. Stephen conducts an excellent interview, as usual, and proves yet again that he either knows a good deal of science or does his homework before talking about it. As a result, science coverage on the Colbert Report is invariably excellent.