
Why you should cite open source tools

Every now and then, a moment or a sentence in a conversation sticks out at you, and lodges itself in the back of your brain for months or even years. In this case, the sentence is a tweet, and I fear that the only way to dislodge it is to talk about it publicly.

Last year, I complained on Twitter that a very prominent paper that was getting lots of attention used scikit-image, but failed to cite our paper. (Or the papers corresponding to many other open source packages.) I continued that scientists developing open source software depend on these citations to continue their work. (More on this in another post...) One response was that surely the developers of the open source scientific Python stack were not scientists per se, and that citations were not a priority for them.

I still sigh internally when I think of it.

That tweet manifests a pervasive perception that open source scientific software is written by God-like figures. These massively experienced software developers have easy access to funds for their work, and are at the service of all the other scientists, who are their users. I used to share this perception, but it is utterly false.

I certainly hadn't thought about the funding question, but I did think of packages like NumPy and SciPy as being written by "pros" (whatever that means), whose main job was to produce these amazing libraries.

My ideas started to change only after I attended the SciPy 2012 conference, and I was invited to the scikit-image sprint by its lead author, Stéfan van der Walt. I was totally starstruck, and even after meeting him I just assumed his job was "open source guru", or somesuch. It was only later that I learned that he was a postdoctoral researcher and lecturer in applied mathematics at Stellenbosch University, in South Africa. As with other academics, his main job was to produce research and to teach. Open source was something he produced on the side.

(Years later, as we were scrambling to finish the final chapter for Elegant SciPy, Stéfan revamped the build infrastructure for our book — the scripts and configuration files that converted the Markdown text we were writing to executed code and html. "You have poor prioritisation skills, you know?", I taunted. He responded: "Yeah. But a lot of the SciPy documentation toolchain exists because of that." And yes: the revamped scripts ended up being extremely useful during the editing phase with O'Reilly.)

After the conference I continued to contribute to scikit-image, eventually joining the core development team, but throughout the process I continued to feel like an impostor, someone who had somehow managed to gain entrance to this hallowed and otherworldly community despite his inferior skills and knowledge.

Only after years of interacting with this community did I internalise the fact that nearly all of this software stack has been produced by practising scientists who took the extra care and effort to ensure that their code was robust, well-tested, and easily accessible to all. Despite a recent influx of interest and contributions from industry, as far as I can tell, most contributions to the SciPy stack still come from practising scientists in academia.

I know this now, but until recently I was suffering from another fallacy: that if I knew it, clueless as I was, then surely everybody knew it. That tweet disabused me of that notion, and this is why you're reading this. I hope it is instructive, and that you'll find it worth sharing widely.

This idea, that the SciPy stack is made by active scientists, is important because it affects how this work can be supported. Sadly, neither university hierarchies nor national funding bodies recognise code as valuable output. (There are some exceptions, but this remains the norm.) By and large, the only things that count are papers, grants, and, to a lesser extent, teaching evaluation scores.

So, yes, many of the original contributors to the SciPy libraries are now in industry. But they were academics at the time that they contributed and were driven out because academia did not value their contributions. Academics should not have to sacrifice their careers to contribute to open source. Citations to software papers are an imperfect solution to this problem (again, more soon), but they sure as hell are better than nothing.

So, if you are a user of open source tools in the Scientific Python stack, I have two requests for you:

  1. When you publish your work, cite every library that you import. Most scientific software has a notice on its homepage or README file pointing to a paper you can cite. By definition, if you've imported a library, you've found it useful, and if you've found it useful, then you probably care about supporting its authors. This is a small way you can contribute to their success.
  2. You are good enough to contribute. If you have an issue with an open source package you are using, look at the source code. Submit an issue to the project's bug tracker (usually GitHub). And try your hand at fixing it. The software's authors will usually offer guidance on how to do this, and you will improve your own skills as a result. Good software development practice is one of the most transferable skills you can gain.
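
One way to act on the first request is to list what a script actually imports before you write your methods section. The following is a minimal sketch using only Python's standard `ast` module. The function name `imported_packages` and the example script are hypothetical, made up for illustration. It only finds static `import` statements (standard library modules included), and it cannot tell you which paper to cite: for that you still need each project's homepage or README.

```python
import ast


def imported_packages(source):
    """Return the sorted top-level package names imported by `source`."""
    packages = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            # "import a.b.c" -> keep only the top-level name "a".
            for alias in node.names:
                packages.add(alias.name.split(".")[0])
        elif isinstance(node, ast.ImportFrom):
            # Skip relative imports (level > 0): they point inside your own project.
            if node.level == 0 and node.module:
                packages.add(node.module.split(".")[0])
    return sorted(packages)


# Hypothetical analysis script, given here as a string for illustration.
example_script = """
import numpy as np
from scipy import ndimage as ndi
from skimage.filters import gaussian
from . import helpers
"""

print(imported_packages(example_script))  # prints ['numpy', 'scipy', 'skimage']
```

To check a real script, you could read its text from disk and pass that to the same function.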

Of course, citations alone will not solve the wider problem that open source software is chronically undervalued. I have many thoughts about how open source should be supported, especially in science, but I'll expand on that in an upcoming post.

Update: I'm adding a link to a related post: What do scientists know about open source?

Summer school announcement: 2nd Advanced Scientific Programming in Python (ASPP) Asia Pacific!

The Advanced Scientific Programming in Python (ASPP) summer school has had 10 successful iterations in Europe and one iteration here in Melbourne earlier this year. Another European iteration is starting next week in Camerino, Italy.

Now, thanks to the generous sponsorship of CSIRO, and the efforts of Benjamin Schwessinger and Genevieve Buckley, two alumni from the Melbourne school, and Kerensa McElroy, Agriculture Data School Coordinator at CSIRO, the Asia Pacific fork of ASPP gets its second iteration in Canberra, Jan 20-27, 2019.

Key details

  • The workshop runs January 20-27, 2019 at the Australian National University in Canberra, Australia.
  • Topics include git, contributing to open source software with GitHub, testing, debugging, profiling, advanced NumPy, Cython, and data visualisation.
  • Hands-on learning using pair programming
  • Free to attend (but students are responsible for travel, accommodation, and meals)
  • 30 student places, to be selected competitively
  • Application deadline is Oct 7, 2018, 23:59 Anywhere On Earth
  • website:
  • FAQ:
  • apply:


Three years ago, I had the privilege of teaching the 2015 ASPP school in Munich. It turned out to be a fantastic teaching experience (I have taught at two more since), and more importantly, it was a fantastic experience for the students. Students are selected for the school to fit a certain profile, neither too novice nor too advanced. As such, participants selected for the school are almost guaranteed to learn a great deal.

Indeed, almost every iteration of the school has been co-organised by former students. Sure enough, with the help of two students from the Melbourne instance, we will be able to have a new iteration in Canberra this January.

Course description

Scientists spend increasingly more time writing, maintaining, and debugging software. While techniques for doing this efficiently have evolved, only a few scientists have been trained to use them. As a result, instead of doing their research, they spend far too much time writing deficient code and reinventing the wheel. In this course we will present a selection of advanced programming techniques and best practices that are standard in industry, but especially tailored to the needs of a programming scientist. Lectures are devised to be interactive and to give the students enough time to acquire direct hands-on experience with the materials. Students will work in pairs throughout the school and will team up to practice the newly learned skills in a real programming project: an entertaining computer game.

We use the Python programming language for the entire course. Python works as a simple programming language for beginners, but more importantly, it also works great in scientific simulations and data analysis. We show how clean language design, ease of extensibility, and the great wealth of open source libraries for scientific computing and data visualization are driving Python to become a standard tool for scientists.

Who is eligible?

This school is targeted at Master/PhD students, postdocs, and academic staff and technicians from all areas of science. Competence in Python or in another language such as Java, C/C++, MATLAB, or Mathematica is absolutely required. Basic knowledge of Python and of a version control system such as git, subversion, mercurial, or bazaar is assumed. Participants without any prior experience with Python and/or git should work through the proposed introductory material before the course.

We have strived to get a pool of students that is international and gender-balanced, and have succeeded, with gender parity in the last five schools.

More questions

If you have any questions, contact [email protected].

Please circulate this announcement widely! And follow @scipyschool for further developments.


The road to scikit-image 1.0

This is the first in a series of posts about the joint scikit-image, scikit-learn, and dask sprint that took place at the Berkeley Institute for Data Science, May 28-Jun 1, 2018.

In addition to the dask and scikit-learn teams, the sprint brought together three core developers of scikit-image (Emmanuelle Gouillart, Stéfan van der Walt, and myself), and two newer contributors, Kira Evans and Mark Harfouche. Since we are rarely in the same timezone, let alone in the same room, we took the opportunity to discuss some high level goals using a framework suggested by Tracy Teal (via Chris Holdgraf): Vision, Mission, Values. I'll try to do Chris's explanation of these ideas justice:

  • Vision: what are we trying to achieve? What is the future that we are trying to bring about?
  • Mission: what are we going to do about it? This is the plan needed to make the vision a reality.
  • Values: what are we willing to do, and not willing to do, to complete our mission?

So, on the basis of this framework, I'd like to review where scikit-image is now, where I think it needs to go, and the ideas that Emma, Stéfan, and I came up with during the sprint to get scikit-image there.

I will point out, from the beginning, that one of our values is that we are community-driven, and this is not a wishy-washy concept. (More below.) Therefore this blog post constitutes only a preliminary document, designed to kick-start an official roadmap for scikit-image 1.0 with more than a blank canvas. The roadmap will be debated on GitHub and the mailing list, open to discussion by anyone, and when completed will appear on our webpage. This post is not the roadmap.

Part one: where we are

scikit-image is a tremendously successful project that I feel very proud to have been a part of until now. I still cherish the email I got from Stéfan inviting me to join the core team. (Five years ago now!)

Like many open source projects, though, we are threatened by our own success, with feature requests and bug reports piling on faster than we can get through them. And, because we grew organically, with no governance model, it is often difficult to resolve thorny questions about API design, what gets included in the library, and how to deprecate old functionality. Discussion usually stalls before any decision is taken, resulting in a process heavily biased towards inaction. Many issues and PRs languish for years, resulting in a double loss for the project: a smaller loss from losing the PR, and a bigger one from losing a potential contributor who has understandably lost interest.

Possibly the most impactful decision that we took at the BIDS sprint is that at least three core developers will meet by video call once a month to discuss stalled issues and PRs. (The logistics are still being worked out.) We hope that this sustained commitment will move many PRs and issues forward much faster than they have until now.

Part two: where we're going

Onto the framework. What are the vision, mission, and values of scikit-image? How will these help guide the decisions that we make daily and in our dev meetings?

Our vision

We want scikit-image to be the reference image processing and analysis library for science in Python. In one sense I think that we are already there, but there are more than enough remaining warts that they might cause the motivated user to go looking elsewhere. The vision, then, is to increase our customer satisfaction fraction in this space to something approaching 1.0.

Our mission

How do we get there? Here is our mission:

  • Our library must be easily re-usable. This means that we will be careful in adding new dependencies, and possibly cull some existing ones, or make them optional. We also want to remove some of the bigger test datasets from our package, which at 24MB is getting rather unwieldy! (By comparison, Python 3.7 is 16MB.) (Props to Josh Warner for noticing this.)
  • It also means providing a consistent API. This means that conceptually identical function arguments, such as images, label images, and arguments defining whether an input image is grayscale, should have the same name across the library. We've made great strides in this goal thanks to Egor Panfilov and Adrian Sieber, but we still have some way to go.
  • We want to ensure accuracy of our algorithms. This means comprehensive testing, even against external libraries, and engaging experts in relevant fields to audit our code. (Though this of course is a challenge!)
  • Show utmost care with users' data. Not that we haven't cared until now, but there are places in scikit-image where too much responsibility (in my view) rests with the user, with insufficient transparency from our functions for new users to predict what will happen to their data. For example, we are quite liberal with how we deal with input data: it gets rescaled whenever we need to change the type, for example from unsigned 8-bit integers (uint8) to floating point. Although we have good technical reasons for doing this, and rather extensive documentation about it, these conversions are the source of much user confusion. We are aiming to improve this in issue 3009. Likewise, we don't handle image metadata at all. What is the physical extent of the input image? What are the range and units of the data points in the image? What do the different channels represent? These are all important questions in scientific images, but until now we have completely abdicated responsibility for them and simply ignore any metadata. I don't think this is tenable for a scientific imaging library. We don't have a good answer for how we will do it, but I consider this a must-solve before we can call ourselves 1.0.
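
To make the data conversion point concrete, here is a minimal NumPy sketch of the difference between a plain cast and a cast that rescales. The helper `to_float_rescaled` is hypothetical, written for this illustration. It is not scikit-image's own code, and it only handles unsigned integer and floating point inputs.

```python
import numpy as np


def to_float_rescaled(image):
    """Convert an unsigned integer image to float64 values in [0, 1].

    Hypothetical helper for illustration only.
    """
    if np.issubdtype(image.dtype, np.floating):
        # Floating point input is passed through without rescaling.
        return image.astype(np.float64)
    if not np.issubdtype(image.dtype, np.unsignedinteger):
        raise TypeError(f"unsupported dtype: {image.dtype}")
    # Divide by the largest value the integer type can hold (255 for uint8).
    return image.astype(np.float64) / np.iinfo(image.dtype).max


image = np.array([[0, 51], [102, 255]], dtype=np.uint8)

plain_cast = image.astype(np.float64)  # values stay in the range 0..255
rescaled = to_float_rescaled(image)    # values now span 0..1

print(plain_cast.max(), rescaled.max())
```

The same uint8 image ends up with values that differ by a factor of 255 depending on which conversion a function applies, which is the kind of surprise described above.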

Our values

Finally, how do we solve the thorny questions of API design, whether to include algorithms, etc? Here are our values:

  • We used the word "reference" in our vision. This phrasing is significant. It means that we value elegant implementations, that are easy to understand for newcomers, over obtaining every last ounce of speed. This value is a useful guide in reviewing pull requests. We will prefer a 20% slowdown when it reduces the lines of code two-fold.
  • We also used the word science in our vision. This means our aim is to serve scientific applications, and not, for example, image editing in the vein of Photoshop or GIMP. Having said this, we value being part of diverse scientific fields. (One of the first citations of the scikit-image paper was a remote sensing paper, to our delight: none of the core developers work in that field!)
  • We are inclusive. From my first contributions to the project, I have received patient mentorship from Stéfan, Emmanuelle, Johannes Schönberger, Andy Mueller, and others. (Indeed, I am still learning from fellow contributors, as seen here, to show just one example.) We will continue to welcome and mentor newcomers to the Scientific Python ecosystem who are making their first contribution.
  • Both of the above points have a corollary: we require excellent documentation, in the form of usage examples, docstrings documenting the API of each function, and comments explaining tricky parts of the code. This requirement has stalled a few PRs in the past, but this is something that our monthly meetings will specifically address.
  • We don't do magic. We use NumPy arrays instead of fancy façade objects that mask their complexity. We prefer to educate our users over making decisions on their behalf (through quality documentation, particularly in docstrings).
  • We are community-driven, which means that decisions about the API and features will be driven by our users' requirements, and not the whims of the core team. (For example, I would happily curry all of our functions, but that would be confusing to most users, so I suffer in silence. =P)

I hope that the above values are uncontroversial in the scikit-image core team. (I myself used to fall heavily on the pro-magic side, but hard experience with this library has shown me the error of my ways.) I also hope, but more hesitantly, that our much wider community of users will also see these values as, well, valuable.

As I mentioned above, I hope this blog post will spawn a discussion involving both the core team and the wider community, and that this discussion can be distilled into a public roadmap for scikit-image.

Part three: scikit-image 1.0

I have deliberately left new features off the mission, except for metadata handling. The library will never be "feature complete". But we can develop a stable and consistent enough API that adding new features will almost never require breaking it.

For completeness, I'll compile my personal pet list of things I will attempt to work on or be particularly excited about other people working on. This is not part of the roadmap, it's part of my roadmap.

  • Near-complete support for n-dimensional data. I want 2D-only functions to become the exception in the library, maybe so much so that we are forced to add a _2d suffix to the function name.
  • Typing support. I never want to move from simple arrays as our base data type, but I want a way to systematically distinguish between plain images, label images, coordinate lists, and other types, in a way that is accessible to automatic tools.
  • Basic image registration functionality.
  • Evaluation algorithms for all parts of the library (such as segmentation, or keypoint matching).
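
As a rough illustration of the typing idea in the list above, the standard library's `typing.NewType` can give distinct names to arrays that play different roles while leaving them as plain NumPy arrays at runtime. The type names `Image`, `LabelImage`, and `Coordinates`, and the function `count_objects`, are hypothetical. This is one possible approach, not a design that the scikit-image team has adopted.

```python
from typing import NewType, get_type_hints

import numpy as np

# Distinct names for static checkers; at runtime each is still a plain ndarray.
Image = NewType("Image", np.ndarray)
LabelImage = NewType("LabelImage", np.ndarray)
Coordinates = NewType("Coordinates", np.ndarray)


def count_objects(labels: LabelImage) -> int:
    """Count distinct nonzero labels.

    Assumes the common convention that label 0 marks the background.
    """
    return int(np.unique(labels[labels > 0]).size)


labels = LabelImage(np.array([[0, 1, 1], [2, 0, 2]]))

print(type(labels).__name__)   # the wrapped value is an ordinary NumPy array
print(count_objects(labels))   # two distinct nonzero labels: 1 and 2
# The annotation can be read back by automatic tools:
print(get_type_hints(count_objects)["labels"] is LabelImage)
```

A static checker such as mypy can then flag a call that passes an `Image` where a `LabelImage` is expected, while the arrays themselves stay ordinary NumPy arrays.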

The human side

Along with articulating the way we see the project, another key part of getting to 1.0 is supporting existing maintainers, and onboarding new ones. It is clear that the project is currently straining under the weight of its popularity. While we solve one issue, three more are opened, along with two pull requests.

In the past, we have been too hesitant to invite new members to the core team, because it is difficult to tell whether a new contributor shares your vision. Our roadmap document is an important step towards rectifying this, because it clarifies where the library is going, and therefore the decision making process when it comes to accepting new contributions, for example.

In a followup to this post, I aim to propose a maintainer onboarding document, in a similar vein, to make sure that new maintainers all share the same process when evaluating new PRs and communicating with contributors. A governance model is also in the works, by which I mean that Stéfan has been wanting to establish one for years and now Emmanuelle and I are onboard with this plan, and I hope others will be too, and now we just need to decide on the damn thing.

I hope that all of these changes will allow us to reach the scikit-image 1.0 milestone sooner rather than later, and that everyone reading this is as excited about it as I was while we hashed this plan together.

As a reminder, this is not our final roadmap, nor our final vision/mission statement. Please comment on the corresponding GitHub issue for this post if you have thoughts and suggestions! (You can also use the mailing list, and we will soon provide a way to submit anonymous comments, too.) As a community, we will come together to create the library we all want to use and contribute to.

As a reminder, everything in this blog is CC0+BY, so feel free to reuse any or all of it in your own projects! And I want to thank BIDS, and specifically Nelle Varoquaux at BIDS, for making this discussion possible, among many other things that will be written up in upcoming posts.

Update: Anonymous comments are now open. To summarise, to comment on this proposal you can:

  • comment on the GitHub issue
  • submit a comment below
  • submit an anonymous comment

What do scientists know about open source?

A friend recently pointed out this great talk by Matt Bernius, What students know and don't know about open source. If you have even a minor interest in open source it's worth a watch, but the gist is: in the US alone, there are about 200,000 students enrolled in a computer science major. Open source communities are a great space to learn real-world programming, so why don't these numbers translate into massive contributions to open source?

At the core of the issue, Matt identifies two main problems: (1) colleges and universities simply don't teach open source, or even collaborative coding; and (2), many open source communities make newcomers feel unwelcome in a variety of ways.

I want to comment about this in the context of programming in science. That is, programming where the code is not the main product, but rather a useful tool to obtain a scientific result, for example in biology or physics. Here, we still see relatively little contribution to open source, for related but different cultural issues.

I've sent my "scientists should code in the open" post to a few people and the response from most remains sceptical. I hope this post will address some of their concerns.

Scientific culture is ridiculously secretive

The most common objection is to my assertion that people won't scoop you by looking at your code. I remember a tweet (that I sadly can't find now) that really got to the gist of the problem. It went something like this:

Someone in science having a new idea: "Ooh, I hope I don't get scooped!"
Someone in open source having a new idea: "Ooh, I hope someone has implemented this already!"

Update: I found the source! It's this tweet by Elizabeth Seiver.

This is a huge gap in culture that won't soon go away, but there are encouraging steps towards narrowing it. For example, PLOS Biology, a leading journal, recently announced that they would consider "scooped" studies for publication within six months of the "scooping". That goes some way towards re-aligning incentives towards collaborative and open science.

I've come across many collaborations that have started because of open source. I have not heard of someone getting scooped because of open source, but of course that sort of information would be hard to trace and come by. Several people did write to me that they were concerned about very specific groups rifling through their code expressly for the purpose of scooping them. For me it's hard to imagine someone even having that attitude, and my advice is that if you do face such a toxic community, it might be wise to change your chosen field of study.

Nevertheless, I want to emphasise here that open source programming can take many forms, with the zip file attached to the paper being the lowest, coding in the open being the highest, and several other models in between. Any steps you can take towards the higher models will ultimately help you. My preferred mode for code that really does have to be private is to use a private GitHub repository, and just make that repo public once the paper is accepted.

A lot of people prefer the "code dump with no revision history" model of post-publication sharing, but this tosses out a lot of valuable information for people coming after you: what have you tried that didn't work? What issues did you have with the code? Have you considered coding in one style or another? The code dump model also makes you less likely to use GitHub in the first place, depriving you of an opportunity to learn some valuable real-world skills.

For coding, scientists have even more severe impostor syndrome

As I mentioned in my original post, and this I find completely uncontestable, publishing shitty code is not a bad thing. Everybody writes bad code, and nearly everybody knows it. Here's Hadley Wickham, creator of dplyr, tidy data, ggplot2, among other things; in other words, someone who knows a thing or two about elegant code and about as close as one gets to coding royalty in science:

The only way to write great code is to write lots of shitty code first.

Publishing your raw code is a good thing and will absolutely not be a black mark on your career. Indeed, in open source circles, it is often a bare GitHub contribution history that is a black mark. (And this is another problem, but in my opinion a better one.)

Scientists don't know about open source

If knowledge of open source is lacking in computer science, what chance does it have in other fields? The truth is that outreach and education need to become a massive part of open source culture, especially in science.

I credit Stéfan van der Walt for my life in open source. After I gave a talk at SciPy 2012, he invited me to join the scikit-image sprint at the end of the conference. If it hadn't been for that, I probably would have just wandered around the hall, too shy to join any sprint (see "impostor syndrome", above), and my life would be very different right now.

Anyway, at that point I'd made my code "open source", which meant it was on GitHub. I had only added a license to submit to the conference. As a reminder, unlicensed code doesn't count as open source. But I had never really collaborated in open source. My idea of collaboration was my workflow with my colleague: a single branch (master), from which we both pulled and to which we both pushed. When I sat down with Stéfan and Tony Yu, and I figured out what I wanted to work on, I asked: "So, should I just push to master, or what?" I still remember, with some embarrassment, the dubious look Stéfan and Tony exchanged, as they silently figured out which of them would introduce this newbie to pull requests.

But that's the thing: I shouldn't feel embarrassment. Scientists for the most part don't get introduced to coding in their education, much less to open source.

What can scientists in open source do?

A lesson from my continued contributions to the SciPy ecosystem, I hope, is that some light mentorship can yield enormous dividends later on. Stéfan and Tony took the time to walk me through the open source contribution process, when they could have dismissively sent me a link to some page explaining it. I'm a big fan of writing good documents for newcomers, but nothing beats a good hand-holding. It's very easy for me to imagine an alternate reality where I had not felt welcome or rewarded by the scikit-image project and my life had not taken this productive turn.

Continuing on imaginary themes, it is only slightly less plausible that the open source scientific world should be awash with new contributors at every level of science. How do we turn this dream into a reality?

If you are a scientist and this post is among your first encounters with the term "open source", and you think you might be interested in learning more, here are a few things I recommend, in order of easiest to hardest:

  • Read the preface and epilogue of my book with Stéfan and Harriet Dashnow. (Free online!) I feel a bit icky recommending my own book, but why repeat myself? In those chapters I tried to distill my thoughts on joining the SciPy community, which is a fantastic, rewarding space in which to do open source programming as a scientist. I expect many things we wrote generalise well to e.g. the tidyverse.
  • Look for upcoming Software Carpentry workshops near you. These are free two-day programming boot camps to introduce you to computational thinking, and, crucially, to version control with git.
  • Go to a SciPy conference. I know of SciPy, EuroSciPy, and SciPy India, but I have a vague memory of offshoots in Africa and South America.

If you are in a boat similar to mine (intermediate/advanced open source contributor in science), and you feel like you would like your work to feel a bit more crowded, I can tell you what I'm going to be doing in response to this talk:

  • Sign up to deliver (more) Software Carpentry training (or similar). Getting the word out is the number one thing.
  • In Software Carpentry, emphasise the role of git in collaboration. (I think the official program does not go far enough in this direction, and focuses instead on the initial linear history.)
  • If you are located in a university, talk to your CS department to see whether they have any courses in open source development. If not, see whether you can guest lecture in a suitable course to make students aware of the open source opportunities out there.
  • Similarly, follow up Software Carpentry with more advanced sessions on open source collaboration. I gained an enormous fraction of my programming skills from collaborating on open source. I really think there is no better tool for long-term learning in this space. An idea that I'd like to try out is to curate a bunch of open issues on prominent repos and get SWC students to sprint on them for a day [1]. I know about the "good first issue" tag on GitHub. Unfortunately, my experience with it is mixed. I think many repos are overly optimistic with theirs (this includes scikit-image), and, furthermore, a large proportion of these tagged issues get "claimed" quickly, and often half-heartedly!
  • Write, write, write! Did you get a cool PR merged? Write a blog post about it! Or at least tweet! We need to get the message out that writing PRs is for everyone. =)

If you have any further ideas, I'd love to hear them.

  1. Actually I drafted this post a while back, and tried this yesterday, with mixed success. I'll write about that experience soon. ;) 

1st ASPP Asia Pacific evaluation survey

In January of 2018, we had the first ASPP summer school outside of Europe. (This was a parallel workshop to the European one, which will be held in Italy in September 2018.) In general, it was a great success, with some caveats that we will elaborate on below.

First we want to note that this school was a bit different from the European ones, in that we only had attendees from Australian institutions, where the European school has broad international representation, including some from outside Europe. This was in some ways inevitable, as it is more expensive to travel to Australia from almost anywhere than to travel within Europe. On the other hand, we advertised relatively late, and we were unable to secure travel grants during the advertising period, so there is hope that a future edition would be able to attract a more international crowd from the Asia Pacific region.

Given all this, there was a question as to whether we would be able to capture the atmosphere of the school, which normally sees the students living together and socialising for basically the whole week. In this case, most students just went home after classes were finished. But although some of that atmosphere was missing, by the end of the week we did manage to get some close links between all the students and the faculty. The evaluations below show that most of the value of the school was preserved.

We note that 100% of the respondents (29/30 of the students) would recommend the course to their peers. So, although some lectures were better received than others, and although the programming project was not universally loved, we managed to provide value for everyone. All of this is in line with the evaluations at previous schools (available at

The project, which consists of programming a videogame bot, is controversial every year, but, consistently, more people like it than don't, and people get to practice git, pair programming, and programming as a team, which is the single most difficult skill to practice when programming for science. Indeed, when we walk around during the project programming sessions, we see people extremely engaged in what they are coding. It's difficult to imagine a scientific problem engaging such diverse people as the school's attendees (who come from very disparate scientific fields).

Of all the feedback, two particular statements, we hope from people in the same project group, broke our hearts. We decided not to include them in this report, because they might be easy to de-anonymise by group members, but they boil down to the following: a group member, by being combative and rude to others in their team, and deciding to essentially complete the project by themselves, ruined the programming project for all of their team members, with some even feeling that they were not good enough to contribute. This is tragic, because we want everyone in the school to feel empowered to do anything at all in Python.

Absolutely every student has something to offer in this project. Here, as in life, teams are composed of members of varying skills. But we know from our selection that everyone has the skills to contribute (and this is confirmed by the fact that most attendees, for most lectures, felt that the difficulty level was "just right"). So if a student felt inadequate, it can only be because of the toxic team member.

Ned Batchelder recently wrote an excellent blog post about what he calls "Toxic experts" and what Tiziano Zito calls, somewhat more bluntly, "Arrogant assholes". (In discussions about this post, Tiziano and others noted that one does not have to be an expert to be toxic, or arrogant, or an asshole. No matter: the points below apply equally to anyone meeting any of the above characteristics regardless of expertise.)

The feedback we received should serve as a warning to selection committees and hiring managers everywhere about how damaging it is to allow such a person into your ranks. Due to the anonymous nature of the survey, we can't tell whether there were one or two toxic experts in our midst, but if it's one, they soured the school for five other people. If it's two, then that's ten people, a third of the school, that might have had a terrible experience. The problem with toxic experts is that they can so quickly cause damage to so many others. Thus, even if they are a mythical "10x engineer", they are not worth it.

Literally nothing that the above-described team member could have done, coding-wise, could make up for the damage they caused. Despite their strong opinions, they missed the entire point of the programming project, which is not to win a medal, but to learn about working in a team.

We try to avoid toxic experts in our selection process for the school, but they slip through every so often. In response to this feedback, we will aim to be even more vigilant in our selection, and also make the aims of the project as a learning exercise more explicit during its introduction. We will also make sure to be more aware of group interactions during the actual school; we apologise to the students involved that we did not catch this behaviour this time. We are truly sorry.

If you are in the position of being an expert during a school or workshop, don't go it alone. That is a waste of your time, because you can do a programming project on your own whenever you damn well please. Slow down, and think instead about practicing your teaching and mentoring skills. They are also important in life, and, in many contexts, they are your responsibility.

You can access the full survey results here.

-- Juan, and the Organisers.

Summer School Announcement: ASPP Asia-Pacific 2018

The Advanced Scientific Programming in Python (ASPP) summer school has had 10 extremely successful iterations in Europe. (You can find past materials, schedules, and student evaluations at Now, thanks to the INCF, we will be holding its first iteration in Australia, to cater to the Asia Pacific region. (Note: the original ASPP will still take place in Europe next Northern summer; this is a fork of that school.)

Key details

  • The workshop runs January 14-21 at the Melbourne Brain Centre, University of Melbourne, Australia
  • topics include: git, contributing to open source software with github, testing, debugging, profiling, advanced NumPy, Cython, data visualisation.
  • hands-on learning using pair programming
  • free to attend (but students are responsible for travel, accommodation, and meals)
  • 30 student places, to be selected competitively
  • application deadline is Oct 31, 2017, 23:59 UTC.
  • website:
  • apply: (make sure you read the FAQ on that page)


Two-and-a-bit years ago, Tiziano Zito asked me if I could join the faculty at the 2015 ASPP school in Munich (then in its 8th iteration). It turned out to be a fantastic teaching experience, and, more importantly, it was a fantastic experience for the students. Students selected for the school fit a certain profile, neither novice nor advanced. As such, you can be sure that if you participate in the school, you will learn a great deal. We teach tools that will immediately improve your scientific practice. I decided that I wanted to replicate the school in Australia. Now it is finally here!

Course outline

Scientists spend increasingly more time writing, maintaining, and debugging software. While techniques for doing this efficiently have evolved, only a few scientists have been trained to use them. As a result, instead of doing their research, they spend far too much time writing deficient code and reinventing the wheel. In this course we will present a selection of advanced programming techniques and best practices that are standard in industry, but especially tailored to the needs of a programming scientist. Lectures are devised to be interactive and to give the students enough time to acquire direct hands-on experience with the materials. Students will work in pairs throughout the school and will team up to practice the newly learned skills in a real programming project: an entertaining computer game.

We use the Python programming language for the entire course. Python works as a simple programming language for beginners, but more importantly, it also works great in scientific simulations and data analysis. We show how clean language design, ease of extensibility, and the great wealth of open source libraries for scientific computing and data visualization are driving Python to becoming a standard tool for scientists.

Who is eligible?

This school is targeted at Master/PhD students and postdocs from all areas of science. Competence in Python or in another language such as Java, C/C++, MATLAB, or Mathematica is absolutely required. Basic knowledge of Python and of a version control system such as git, subversion, mercurial, or bazaar is assumed. Participants without any prior experience with Python and/or git should work through the proposed introductory material before the course.

We have strived to get a pool of students that is international and gender-balanced, and have succeeded, with gender parity in the last four schools.

More questions

If you have any questions, contact [email protected].

Please circulate this announcement widely!



Prettier LowLevelCallables with Numba JIT and decorators

In my recent post, I extolled the virtues of SciPy 0.19's LowLevelCallable. I did lament, however, that for generic_filter, the LowLevelCallable interface is a good deal uglier than the standard function interface. In the latter, you merely need to provide a function that takes the values within a pixel neighbourhood, and outputs a single value — an arbitrary function of the input values. That is a Wholesome and Good filter function, the way God intended.

In contrast, a LowLevelCallable takes the following signature:

>>> from llc import jit_filter_function

The source code is on GitHub. Currently it only covers ndi.generic_filter's signature, and only with Numba, but I hope to gradually expand it to cover all the functions that take LowLevelCallables in SciPy, as well as support Cython. Pull requests are welcome!
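
For reference, here is a minimal sketch of the standard function interface described above: a plain Python function that receives the neighbourhood values as a 1D array and returns a single number. The function name `local_range` and the toy image are assumptions made for illustration, not code from the post.

```python
import numpy as np
from scipy import ndimage as ndi


def local_range(values):
    """Return max - min of the neighbourhood values (a 1D float array)."""
    return values.max() - values.min()


# A small toy image; each pixel is replaced by the range of its 3x3 neighbourhood.
image = np.arange(25, dtype=float).reshape(5, 5)
filtered = ndi.generic_filter(image, local_range, size=3)
print(filtered[2, 2])
```

This is the convenient but slow path: `local_range` is called once per pixel from Python, which is exactly the overhead a LowLevelCallable avoids.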

SciPy's new LowLevelCallable is a game-changer

... and combines rather well with that other game-changing library I like, Numba.

I've lamented before that function calls are expensive in Python, and that this severely hampers many functions that should be insanely useful, such as SciPy's ndimage.generic_filter.

To illustrate this, let's look at image erosion, which is the replacement of each pixel in an image by the minimum of its neighbourhood. ndimage has a fast C implementation, which serves as a perfect benchmark against the generic version, using a generic filter with min as the operator. Let's start with a 2048 x 2048 random image:


>>> %timeit ndi.generic_filter(image, LowLevelCallable(nbmin.ctypes), footprint=footprint)
10 loops, best of 3: 113 ms per loop


That's right: it's even marginally faster than the pure C version! I almost cried when I ran that.

Higher-order functions, i.e. functions that take other functions as input, enable powerful, concise, elegant expressions of various algorithms. Unfortunately, these have been hampered in Python for large-scale data processing because of Python's function call overhead. SciPy's latest update goes a long way towards redressing this.
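
The claim that erosion is just a generic filter with min as the operator can be checked on a small array. This is a minimal sketch: the 3x3 square footprint and the 64 x 64 image size are assumptions chosen to keep the slow generic version quick, not the values used in the benchmark above.

```python
import numpy as np
from scipy import ndimage as ndi

rng = np.random.default_rng(0)
image = rng.random((64, 64))

# Assumed footprint: a full 3x3 neighbourhood.
footprint = np.ones((3, 3), dtype=bool)

# Fast, compiled erosion from ndimage.
fast = ndi.grey_erosion(image, footprint=footprint)

# The same operation written as a generic filter with min as the operator.
generic = ndi.generic_filter(image, np.min, footprint=footprint)

same = np.array_equal(fast, generic)
print(same)
```

Both calls use the default `mode='reflect'` boundary handling, so the outputs agree pixel for pixel; only the speed differs.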

Numba in the real world

Numba is a just-in-time compiler (JIT) for Python code focused on NumPy arrays and scientific Python. I've seen various tutorials around the web and in conferences, but I have yet to see someone use Numba "in the wild". In the past few months, I've been using Numba in my own code, and I recently released my first real package using Numba, skan. The short version is that Numba is amazing and you should strongly consider it to speed up your scientific Python bottlenecks. Read on for the longer version.

Part 1: some toy examples

Let me illustrate what Numba is good for with the most basic example: adding two arrays together. You've probably seen similar examples around the web. We start by defining a pure Python function for iterating over a pair of arrays and adding them:
In [1]:
import numpy as np

def addarr(x, y):
    result = np.zeros_like(x)
    for i in range(x.size):
        result[i] = x[i] + y[i]
    return result
How long does this take in pure Python?
In [2]:
n = int(1e6)
a = np.random.rand(n)
b = np.random.rand(n)
In [3]:
%timeit -r 1 -n 1 addarr(a, b)
1 loop, best of 1: 721 ms per loop
About 0.7 seconds on my machine. Let's try with Numba using its JIT decorator:
In [4]:
import numba

addarr_nb = numba.jit(addarr)
In [5]:
%timeit -r 1 -n 1 addarr_nb(a, b)
1 loop, best of 1: 283 ms per loop
The first time it runs, it's only a little faster. That's because of the nature of JITs: they only compile code as it is being run, in order to use the type information of the objects passed into the function. (Note that, in Python, the arguments a and b to addarr could be anything: an array, as expected, but also a list, a tuple, even a Banana, if you've defined such a class, and the meaning of the function body is different for each of those types.) Let's see what happens the next time we run it:
In [6]:
%timeit -r 1 -n 1 addarr_nb(a, b)
1 loop, best of 1: 6.36 ms per loop
Whoa! Now the code takes about 6 ms, roughly 100 times faster than the pure Python version. And the NumPy equivalent?
In [7]:
%timeit -r 1 -n 1 a + b
1 loop, best of 1: 5.62 ms per loop
Only marginally faster than Numba, even though NumPy addition is implemented in highly optimised C code. And, for some data types, Numba even beats NumPy:
In [8]:
r = np.random.randint(0, 128, size=n).astype(np.uint8)
s = np.random.randint(0, 128, size=n).astype(np.uint8)
In [9]:
%timeit -r 1 -n 1 r + s
1 loop, best of 1: 2.92 ms per loop
In [10]:
%timeit -r 1 -n 1 addarr_nb(r, s)
1 loop, best of 1: 238 ms per loop
In [11]:
%timeit -r 1 -n 1 addarr_nb(r, s)
1 loop, best of 1: 234 µs per loop
WOW! For smaller data types, Numba beats NumPy by over 10x! I'm only speculating, but since my clock speed is about 1GHz (I'm writing this on a base Macbook with a 1.1GHz Core-m processor), I suspect that Numba is taking advantage of some SIMD capabilities of the processor, whereas NumPy is treating each array element as an individual arithmetic operation. (If any Numba or NumPy devs are reading this and have more concrete implementation details that explain this, please share them in the comments!)
So hopefully I've got your attention now. For years, NumPy has been the go-to library for performance Python in scientific computing. But, if you wanted to do something a little out of the ordinary, you were stuck. Now, Numba generally matches NumPy's speed for arbitrary code, and sometimes beats it handily! In this context, I decided to use Numba to do something a little less trivial, as part of my research.
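
To make the earlier remark about argument types concrete, here is a minimal sketch (the function name `add` is hypothetical) of how the same function body means different things depending on what is passed in. This is why a JIT has to wait until it sees the actual arguments before it can compile a specialised version.

```python
import numpy as np


def add(x, y):
    # The meaning of "+" is only known once the argument types are known.
    return x + y


arr_sum = add(np.array([1, 2]), np.array([3, 4]))  # elementwise addition
list_sum = add([1, 2], [3, 4])                     # list concatenation
str_sum = add("ab", "cd")                          # string concatenation

print(arr_sum, list_sum, str_sum)
```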

Part 2: Real Numba

I'll present below a slightly simplified version of the code present in my library, skan, which is currently available on PyPI and conda-forge. The task is to build a graph out of the pixels of a skeleton image, like this one:
In [12]:
%matplotlib inline
In [13]:
import matplotlib.pyplot as plt
plt.rcParams['image.cmap'] = 'gray'
plt.rcParams['image.interpolation'] = 'nearest'
In [14]:
skeleton = np.array([[0, 1, 0, 0, 0, 1, 1],
                     [0, 0, 1, 1, 1, 0, 0],
                     [0, 1, 0, 0, 0, 1, 0],
                     [0, 0, 1, 0, 1, 0, 0],
                     [1, 1, 0, 1, 0, 0, 0]], dtype=bool)
skeleton = np.pad(skeleton, pad_width=1, mode='constant')
In [15]:
fig, ax = plt.subplots(figsize=(5, 5))
ax.imshow(skeleton)  # display the padded skeleton image

Every white pixel in the image will be a node in our graph, and we place edges between nodes if the pixels are next to each other (counting diagonals). A natural way to represent a graph in the SciPy world is as a sparse matrix A: we number the nonzero pixels from 1 onwards (these are the rows of the matrix), and then place a 1 at entry A(i, j) when pixel i is adjacent to pixel j. SciPy's sparse.coo_matrix format makes it very easy to construct such a matrix: we just need an array with the row coordinates and another with the column coordinates. Because NumPy arrays are not dynamically resizable like Python lists, it helps to know ahead of time how many edges we are going to need to put in our row and column arrays. Thankfully, a well-known theorem of graph theory states that the number of edges of a graph is half the sum of the degrees. In our case, because we want to add the edges twice (once from i to j and once from j to i), we just need the sum of the degrees exactly. We can find this out with a convolution using scipy.ndimage:
In [16]:
from scipy import ndimage as ndi

neighbors = np.array([[1, 1, 1],
                      [1, 0, 1],
                      [1, 1, 1]])

degrees = ndi.convolve(skeleton.astype(int), neighbors) * skeleton
In [17]:
fig, ax = plt.subplots(figsize=(5, 5))
result = ax.imshow(degrees, cmap='magma')
ax.set_title('Skeleton, colored by node degree')
cbar = fig.colorbar(result, ax=ax, shrink=0.7)
cbar.set_ticks([0, 1, 2, 3])
There you can see "tips" of the skeleton, with only 1 neighbouring pixel, as purple, "paths", with 2 neighbours, as red, and "junctions", with 3 neighbours, as yellow. Now, consider the pixel at position (1, 6). It has two neighbours (as indicated by its colour): (2, 5) and (1, 7). If we number the nonzero pixels as 1, 2, ..., n from left to right and top to bottom, then this pixel has label 2, and its neighbours have labels 6 and 3. We therefore need to add edges (2, 3) and (2, 6) to the graph. Similarly, when we consider pixel 6, we will add edges (6, 2), (6, 5), and (6, 8).
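
The labels in this worked example can be checked directly. The following is a minimal sketch that rebuilds the padded skeleton, numbers its nonzero pixels in left-to-right, top-to-bottom order, and looks up the labels in a pixel's 3x3 window. The helper name `neighbour_labels` is hypothetical and is not part of skan.

```python
import numpy as np

skeleton = np.array([[0, 1, 0, 0, 0, 1, 1],
                     [0, 0, 1, 1, 1, 0, 0],
                     [0, 1, 0, 0, 0, 1, 0],
                     [0, 0, 1, 0, 1, 0, 0],
                     [1, 1, 0, 1, 0, 0, 0]], dtype=bool)
skeleton = np.pad(skeleton, pad_width=1, mode='constant')

# Boolean-mask assignment fills in C order: left to right, top to bottom.
labels = np.zeros(skeleton.shape, dtype=int)
labels[skeleton] = np.arange(1, skeleton.sum() + 1)


def neighbour_labels(labels, r, c):
    """Return the sorted labels of the nonzero neighbours of pixel (r, c)."""
    window = labels[r - 1:r + 2, c - 1:c + 2]
    found = set(window[window > 0].tolist())
    found.discard(int(labels[r, c]))  # drop the centre pixel itself
    return sorted(found)


print(labels[1, 6], neighbour_labels(labels, 1, 6))
print(labels[2, 5], neighbour_labels(labels, 2, 5))
```

The window slicing is safe here only because the one-pixel zero pad guarantees that no nonzero pixel sits on the image border.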
In [18]:
fig, ax = plt.subplots(figsize=(5, 5))
result = ax.imshow(degrees, cmap='magma')
cbar = fig.colorbar(result, ax=ax, shrink=0.7)
cbar.set_ticks([0, 1, 2, 3])

nnz = len(np.flatnonzero(degrees))
pixel_labels = np.arange(nnz) + 1
for lab, y, x in zip(pixel_labels, *np.nonzero(degrees)):
    ax.text(x, y, lab, horizontalalignment='center',
            verticalalignment='center')

ax.set_title('Skeleton, with pixel IDs')
Scanning over the whole image, we see that we need row and col arrays of length exactly np.sum(degrees).
In [19]:
n_edges = np.sum(degrees)
row = np.empty(n_edges, dtype=np.int32)  # type expected by scipy.sparse
col = np.empty(n_edges, dtype=np.int32)
The final piece of the puzzle is finding neighbours. For this, we need to know a little about how NumPy stores arrays. Even though our array is 2-dimensional (rows and columns), these are all arrayed in a giant line, each row placed one after the other. (This is called "C-order".) If we index into this linearised array ("raveled", in NumPy's language), we can make sure that our code works for 2D, 3D, and even higher-dimensional images. Using this indexing, neighbouring pixels to the left and right are accessed by subtracting or adding 1 to the current index. Neighbouring pixels above and below are accessed by subtracting or adding the length of a whole row. Finally, diagonal neighbours are found by combining these two. For simplicity, we only show the 2D version below:
In [20]:
def neighbour_steps(shape):
    step_sizes = np.cumprod((1,) + shape[-1:0:-1])
    axis_steps = np.array([[-1, -1],
                           [-1,  1],
                           [ 1, -1],
                           [ 1,  1]])
    diag = axis_steps @ step_sizes
    steps = np.concatenate((step_sizes, -step_sizes, diag))
    return steps
In [21]:
steps = neighbour_steps(degrees.shape)
print(steps)
[  1   9  -1  -9 -10   8  -8  10]
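
To see why these offsets work, here is a minimal sketch that converts a pixel coordinate to its raveled index, applies each step, and converts back. The chosen pixel (2, 5) is just an example; the shape (7, 9) is that of the padded skeleton above.

```python
import numpy as np

shape = (7, 9)  # shape of the padded skeleton
r, c = 2, 5     # an example interior pixel

# In C order, the raveled index of (r, c) is r * n_columns + c.
k = np.ravel_multi_index((r, c), shape)

steps = np.array([1, 9, -1, -9, -10, 8, -8, 10])

# Map each raveled neighbour index back to (row, column) coordinates.
neighbours = [tuple(int(v) for v in np.unravel_index(k + s, shape))
              for s in steps]
print(neighbours)
```

Each step lands on exactly one of the eight pixels surrounding (2, 5): ±1 moves along a row, ±9 moves between rows, and the remaining four values combine the two.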
Of course, if we use these steps near the right edge of the image, we'll wrap around, and mistakenly think that the first element of the next row is a neighbouring pixel! Our solution is to only process nonzero pixels, and make sure that we have a 1-pixel-wide "pad" of zero pixels — which we do, in the image above! Now, we iterate over image pixels, look at neighbors, and populate the row and column vectors.
In [22]:
def build_graph(labeled_pixels, steps_to_neighbours, row, col):
    start = np.max(steps_to_neighbours)
    end = len(labeled_pixels) - start
    elem = 0  # row/col index
    for k in range(start, end):
        i = labeled_pixels[k]
        if i != 0:
            for s in steps_to_neighbours:
                neighbour = k + s
                j = labeled_pixels[neighbour]
                if j != 0:
                    row[elem] = i
                    col[elem] = j
                    elem += 1
In [23]:
skeleton_int = np.ravel(skeleton.astype(np.int32))
skeleton_int[np.nonzero(skeleton_int)] = 1 + np.arange(nnz)
In [24]:
%timeit -r 1 -n 1 build_graph(skeleton_int, steps, row, col)
1 loop, best of 1: 917 µs per loop
Now we try the Numba version:
In [25]:
build_graph_nb = numba.jit(build_graph)
In [26]:
%timeit -r 1 -n 1 build_graph_nb(skeleton_int, steps, row, col)
1 loop, best of 1: 346 ms per loop
In [27]:
%timeit -r 1 -n 1 build_graph_nb(skeleton_int, steps, row, col)
1 loop, best of 1: 14.3 µs per loop
Nice! We get more than a 50-fold speedup using Numba, and this operation would have been difficult if not impossible to convert to a NumPy vectorized operation! We can now build our graph:
In [28]:
from scipy import sparse
G = sparse.coo_matrix((np.ones_like(row), (row, col))).tocsr()
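
As a sanity check on the edge-counting argument, the sketch below repeats the construction in plain Python (no Numba, and with lists instead of preallocated arrays, which is a simplification of the code above) and confirms that the number of stored entries equals the sum of the degrees and that the resulting matrix is symmetric.

```python
import numpy as np
from scipy import ndimage as ndi, sparse

skeleton = np.array([[0, 1, 0, 0, 0, 1, 1],
                     [0, 0, 1, 1, 1, 0, 0],
                     [0, 1, 0, 0, 0, 1, 0],
                     [0, 0, 1, 0, 1, 0, 0],
                     [1, 1, 0, 1, 0, 0, 0]], dtype=bool)
skeleton = np.pad(skeleton, pad_width=1, mode='constant')

neighbors = np.array([[1, 1, 1],
                      [1, 0, 1],
                      [1, 1, 1]])
degrees = ndi.convolve(skeleton.astype(int), neighbors) * skeleton
n_edges = int(degrees.sum())  # each undirected edge is counted twice

# Label the nonzero pixels 1..n in raveled (C) order.
labels = np.zeros(skeleton.size, dtype=np.int32)
labels[skeleton.ravel()] = np.arange(1, skeleton.sum() + 1)

ncols = skeleton.shape[1]
steps = [1, ncols, -1, -ncols, -ncols - 1, ncols - 1, -ncols + 1, ncols + 1]

row, col = [], []
for k in np.flatnonzero(labels):
    for s in steps:
        j = labels[k + s]  # safe because of the one-pixel zero pad
        if j != 0:
            row.append(labels[k])
            col.append(j)

data = np.ones(len(row), dtype=np.int32)
G = sparse.coo_matrix((data, (row, col))).tocsr()
print(len(row), n_edges, G.nnz)
```

Because every edge is added once in each direction, the matrix is symmetric and the number of undirected edges is `n_edges // 2`.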
As to what to do with said graph, I'll leave that for another post. (You can also peruse the skan source code.) In the meantime, though, you can visualize it with NetworkX:
In [29]:
import networkx as nx

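# Note: from_scipy_sparse_matrix was removed in NetworkX 3.0;
# newer versions provide nx.from_scipy_sparse_array instead.
Gnx = nx.from_scipy_sparse_matrix(G)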

nx.draw_spectral(Gnx, with_labels=True)
There's our pixel graph! Obviously, the speedup and n-d support are important for bigger, 3D volumes, not for this tiny graph. But they are important, and, thanks to Numba, easy to obtain.


I hope I've piqued your interest in Numba and encouraged you to use it in your own projects. I think the future success of Python in science heavily depends on JITs, and Numba is a strong contender to be the default JIT in this field. Note: This post was written using Jupyter Notebook. You can find the source notebook here.

Trump's win

Like many of you, I watched in horror two days ago as the night unfolded, and the unthinkable slowly came to pass. After a Netflix binge to try to numb the fear, I dived into a clickhole of social media posts and news articles to try to make sense of what had happened. I hope that writing a synthesis of that will let me get on with my life in this brave new world.

I am deeply, depressingly pessimistic about the future of the planet under Trump. Let's take the very best, most ludicrously optimistic scenario: that Trump swings to the center1 and doesn't make good on his many horrid promises. Even then, his election, and the Republicans' victory in the House and Senate, represent game over in the fight to avoid climate change2.

Like many of us, I'd buried my head in the sand about this, even after I read Michael Moore's again-famous essay predicting Trump's victory3, which is well worth a read. After that, here are some choice quotes from essays I recommend about how this nightmare came to be. From Glenn Greenwald's The Ongoing, Dangerous Refusal to Learn the Lesson of Brexit4:

When a political party is demolished, the principle responsibility belongs to one entity: the party that got crushed. It’s the job of the party and the candidate, and nobody else, to persuade the citizenry to support them and find ways to do that. Last night, the Democrats failed, resoundingly, to do that, and any autopsy or liberal think piece or pro-Clinton pundit commentary that does not start and finish with their own behavior is one that is inherently worthless.

Democrats got complacent and forgot a huge bloc of voters, instead catering to us upper-middle-class city dwellers. Linked from that article, Vincent Bevins's post-Brexit Facebook post5 expresses the core of their failure well:

Both Brexit and Trumpism are the very, very, wrong answers to legitimate questions that urban elites have refused to ask for thirty years. Questions such as - Who are the losers of globalization, and how can we spread the benefits to them and ease the transition? Is it fair that the rich can capture almost all the gains of open borders and trade, or should the process be more equitable?

I also liked this paragraph from David Wong's Don't Panic6:

The truth is, most of Trump's voters voted for him despite the fact that he said/believes awful things, not because of it. That in no way excuses it, but I have to admit I've spent eight years quietly tuning out news stories about drone strikes blowing up weddings in Afghanistan. [...] [Trump supporters] look out their front door and see painkiller addicts and closed factories. They believe that nobody in Washington gives a shit about them, mainly because that's 100-percent correct.

Now for the really depressing part: First, I don't think Trump will fix these people's problems. Despite the anti-establishment rhetoric, what he'll deliver is more tax cuts for the rich. And second, I don't see a plausible way out for the US. Its electoral system is completely ridiculous, and it's going to get worse, much worse, before it gets better. As in 2000, the Democratic candidate (this time Hillary Clinton) won the popular vote, but the Electoral College vote went easily to Trump. The lack of preferential voting also meant that Hillary could have won in various battleground states where her loss margin was much smaller than the number of votes for third-party candidates. (Again, as in 2000.) Finally, gerrymandering has handed the House of Representatives to Republicans for the foreseeable future, despite more people voting Democrat than Republican in most elections.

But, what incentive do the people in power have to change the electoral rules?

Paul Krugman's dark closing for that night, Our Unknown Country7, summed it up best:

Is America a failed state and society? It looks truly possible.

Finally, many of you reading this will be in other countries thinking, it could not happen here. But it will, without action. Without massive change from the left-leaning parties of the world. Michael Moore's warning applies almost everywhere. Here in Australia, the Labor party has egregiously followed the right-wing Coalition in their repugnant asylum-seeker policy (our very own "build that wall" is "stop the boats"), and in their mass surveillance policies. Much like the Democrats, they have only themselves to blame for their loss in the last election, earlier this year.

The other day, the news reported on all the people who are losing their jobs here in my home town of Geelong as the Ford assembly line closes. These are the people we cannot forget.

I don't yet know what I can do about all this. But it's clear that writing snarky Facebook posts either being outraged or mocking the government will solve nothing. We have to do better, and we have to really be out there, not in here.

  1. Trump on Hillary Clinton in 2008.

  2. Scientists say that the next decade will require dramatic emissions reductions, worldwide, if we are to keep warming to a reasonable level. The United States accounts for an enormous proportion of these emissions, and will now spend 4-8 years, at an absolute minimum, doing absolutely nothing — indeed, probably helping the oil industry. One day after his election, Trump has already selected a climate sceptic to head the EPA:

  3. Michael Moore: Trump will win.

  4. Glenn Greenwald: Democrats, Trump, and the ongoing, dangerous refusal to learn the lesson of Brexit.

  5. Vincent Bevins on Brexit.

  6. David Wong: Don't Panic

  7. Paul Krugman: Our Unknown Country.