
Continuous integration in Python, 6: Show off your work

We’re finally ready to wrap up this topic. By now you can:

  • write automated tests for your Python code with pytest,
  • measure your test coverage with pytest-cov,
  • keep your test configuration in setup.cfg and .coveragerc files,
  • run your tests automatically on every push with Travis-CI, and
  • report your test coverage with Coveralls.

But, much as exercise is wasted if your bathroom scale doesn’t automatically tweet about it, all this effort is for naught if visitors to your GitHub page can’t see it!

Most high-profile open-source projects these days advertise their CI efforts. Above, I cheekily called this showing off, but it’s truly important: anyone who lands on your GitHub page is a potential user or contributor, and if they see evidence that your codebase is stable and well-tested, they are more likely to stick around.

Badging your README is easy. (You do have a README, don’t you?) In Travis, go to your latest build. Near the top right, click on the “build passing” badge:

Travis-CI badge

You’ll get an overlay with a pull-down menu offering the badge in a variety of formats. You can grab the bare image URL, the Markdown snippet to paste into a Markdown-formatted README, or one of several other formats, including RST:

Travis-CI badge URLs

Just copy and paste the appropriate code and add it to your README file wherever you please.

Meanwhile, on your repository’s Coveralls page, on the right-hand side, you will find another pull-down menu with the appropriate URLs:

Coveralls badge URLs

Again, just grab whichever URL is appropriate for your needs (I prefer Markdown-formatted READMEs), and add it to your README, usually next to the Travis badge.

And you’re done! You’ve got automated tests, tons of test coverage, you’re running everything correctly thanks to configuration files, and all this is getting run on demand thanks to Travis and Coveralls. And thanks to badging, the whole world knows about it:

Readme badges


Continuous integration in Python, 5: report test coverage using Coveralls

In this series of posts, we’ve covered:

  • writing automated tests with pytest,
  • measuring test coverage,
  • setting up your test configuration files, and
  • running your tests automatically with Travis-CI.

Today I will show you how to continuously check your test coverage using Coveralls.

Travis runs whatever commands you tell it to run in your .travis.yml file. Normally, that’s just installing your program and its requirements, and running your tests. If you wanted instead to launch some nuclear missiles, you could do that. (Assuming you were happy to put the launch keys in a public git repository… =P)

The Coveralls service, once again free for open-source repositories, takes advantage of this: you just need to install an extra piece of software from PyPI, and run it after your tests have passed. Do so by adding the line pip install coveralls to your before_install section, and just the coveralls command to a new after_success section:

language: python
python:
    - "2.7"
    - "3.4"
before_install:
    - pip install pytest pytest-cov
    - pip install coveralls
script:
    - py.test
after_success:
    - coveralls

This will push your coverage report to Coveralls every time Travis is run, i.e., every time you push to your GitHub repository. You will need to turn on the service by:

  • going to coveralls.io and signing in with your GitHub credentials
  • clicking on “Repositories”, and “Add Repo” (and perhaps on “Sync GitHub Repos”, top right)
  • switching your repository’s switch to “On”

Adding a repo to Coveralls

That’s it! With just two small changes to your .travis.yml and the flick of a switch, you will get continuous monitoring of your test coverage with each push, and also with each pull request. By default, Coveralls will even comment on your pull requests, so that you instantly know whether someone is not testing their submitted code!

Plus, you get a swanky web dashboard to monitor your coverage!

Coveralls dashboard

Tune in tomorrow for the final chapter in continuous integration in Python: showing off! No, it’s not a blog post about blogging about it. =P

(Addendum: the above .travis.yml will run Coveralls twice, once per Python version, so it will also comment twice for your PRs, notify you twice, etc. If you want to test more Python versions, the problem multiplies. Therefore, you can choose to call Coveralls only for a specific Python version, using the TRAVIS_PYTHON_VERSION environment variable that Travis sets for each job, like so:

after_success:
    - if [[ $TRAVIS_PYTHON_VERSION == 3.4* ]]; then
          coveralls;
      fi

This is the approach taken by the scikit-image project.)


Continuous integration in Python, 4: set up Travis-CI

This is the fourth post in my Continuous integration (CI) in Python series, and the one that puts the “continuous” in CI! For the introductory three posts, see:

  • 1: automated tests with pytest,
  • 2: measuring test coverage, and
  • 3: setting up your test configuration files.

Introduction to Travis-CI

Once you’ve set up your tests locally, it does you no good if you don’t remember to run them! Travis-CI makes this seamless, because it will check out your code and run your tests for each and every commit you push to GitHub! (This is even more important when you are receiving pull requests on GitHub: the tests will be run online, without you having to individually check out each PR and run the tests on your machine!)

This is what continuous integration is all about. Once upon a time, the common practice was to pile new features onto a codebase. Then, come release time, there would be a feature freeze, and some time would be spent cleaning up code and removing bugs. In continuous integration, instead, no new feature is allowed into the codebase until it is bug-free, as demonstrated by the test suite.

What to do

You need to first add a .travis.yml file to the root of your project. This tells Travis how to install your program’s dependencies, install your program, and run the tests. Here’s an example file to run tests and coverage on our maths.py sample project:

language: python
python:
    - "2.7"
    - "3.4"
before_install:
    - pip install pytest pytest-cov
script:
    - py.test

Pretty simple: tell Travis the language of your project, which Python versions to test against (one per line), how to install the test dependencies, and, finally, the command that runs your tests. (This can be anything, not just pytest or another testing framework, as long as a shell exit status of 0 means “success” and anything else means “failure”.)
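
For instance, here’s a purely hypothetical run_checks.py that could stand in for the script command; Travis only cares that the command exits with status 0 on success:

# run_checks.py -- a hypothetical stand-in for the `script` command.
# Travis only looks at the exit status: 0 means success.
import sys

from maths import square

def main():
    failures = 0
    if square(3) != 9:
        failures += 1
    return failures  # 0 on success, non-zero otherwise

if __name__ == '__main__':
    sys.exit(main())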

You can read more about the syntax and options for your .travis.yml in the Travis documentation. There are other sections you can add, such as “virtualenv” to set up Python virtual environments, “install” to add compilation and other installation steps for your library, before testing, and “after_success” to enable e.g. custom notifications. (We will use this last one in the next post.)

Once you have created .travis.yml for your project, you need to turn it on for your repo. This is, currently, a fairly painful process, and I can’t wait for the day that it’s enabled by default on GitHub. In the meantime though:

  • go to travis-ci.org and sign in with your GitHub credentials
  • open your Travis profile page (and hit “Sync” if your repository isn’t listed yet)
  • flip your repository’s switch to “On”

Voilà! Every push and pull-request to your repository will trigger a job on Travis’s servers, building your dependencies and your software, running your tests, and emailing you if anything went wrong! Amazingly, this is completely free for open source projects, so you really have no excuse for not using it!

Travis build page

Follow this blog to learn how to continuously check test coverage using Coveralls, coming in the next post!

Continuous integration in Python, 3: set up your test configuration files

This is the third post in my series on continuous integration (CI) in Python. For the first two posts, see 1: automated tests with pytest, and 2: measuring test coverage.

By now, you’ve got a bunch of tests and doctests set up in your project, which you run with the command:

$ py.test --doctest-modules --cov . --cov-report term-missing

If you happen to have a few script and setup files lying around that you don’t want to test, this might expand to

$ py.test --doctest-modules --cov . --cov-report term-missing --ignore script.py

As you can see, this is getting cumbersome. You can fix this by adding a setup.cfg file to the root of your project, containing configuration options for your testing. This follows standard INI file syntax.

[pytest]
addopts = --ignore script.py --doctest-modules --cov-report term-missing --cov .

Note that the “ignore” flag only tells pytest not to run that file during testing. You also have to tell the coverage module that you don’t want those lines counted for coverage. Do this in a .coveragerc file, with similar syntax:

[run]
omit = *script.py

Once you’ve done this, you can invoke your tests with a simple, unadorned call to py.test.

To find out more about your configuration options, see the pytest basic test configuration page, and Ned Batchelder’s excellent .coveragerc syntax guide.

That’s it for this entry of my CI series. Follow this blog for the next two entries, setting up Travis CI and Coveralls.

Continuous integration in Python, Volume 2: measuring test coverage

(Edit: I initially thought it would be cute to number from 0. But it turns out it becomes rather obnoxious to relate English (first, second, …) to 0-indexing. So this was formerly volume 1. But everything else remains the same.)

This is the second post in a series about setting up continuous integration for a Python project from scratch. For the first post, see Automated tests with pytest.

After you’ve written some test cases for a tiny project, it’s easy to check what code you have automatically tested. For even moderately big projects, you will need tools that automatically check what parts of your code are actually tested. The proportion of lines of code that are run at least once during your tests is called your test coverage.

For the same reasons that testing is important, measuring coverage is important. Pytest can measure coverage for you with the coverage plugin, found in the pytest-cov package. Once you’ve installed the extension, a test coverage measurement is just a command-line option away:

 ~/projects/maths $ py.test --doctest-modules --cov .
============================= test session starts ==============================
platform darwin -- Python 2.7.8 -- py-1.4.25 -- pytest-2.6.3
plugins: cov
collected 2 items 

maths.py .
test_maths.py .
--------------- coverage: platform darwin, python 2.7.8-final-0 ----------------
Name         Stmts   Miss  Cover
--------------------------------
maths            2      0   100%
test_maths       4      0   100%
--------------------------------
TOTAL            6      0   100%

=========================== 2 passed in 0.07 seconds ===========================

(The --cov option takes a directory as input, which I find obnoxious, given that py.test so naturally defaults to the current directory. But it is what it is.)

Now, if I add a function without a test, I’ll see my coverage drop:

def sqrt(x):
    """Return the square root of `x`."""
    return x * 0.5

(The typo is intentional.)

--------------- coverage: platform darwin, python 2.7.8-final-0 ----------------
Name         Stmts   Miss  Cover
--------------------------------
maths            4      1    75%
test_maths       4      0   100%
--------------------------------
TOTAL            8      1    88%

With one more option, --cov-report term-missing, I can see which lines I haven’t covered, so I can try to design tests specifically for that code:

--------------- coverage: platform darwin, python 2.7.8-final-0 ----------------
Name         Stmts   Miss  Cover   Missing
------------------------------------------
maths            4      1    75%   24
test_maths       4      0   100%   
------------------------------------------
TOTAL            8      1    88%   

Do note that 100% coverage does not ensure correctness. For example, suppose I test my sqrt function like so:

def sqrt(x):
    """Return the square root of `x`.

    Examples
    --------
    >>> sqrt(4.0)
    2.0
    """
    return x * 0.5

Even though my test is correct, and I now have 100% test coverage, I haven’t detected my mistake. Oops!
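
(A second, non-coincidental example value would expose the bug, since 4.0 just happens to satisfy 4.0 * 0.5 == sqrt(4.0). Here’s a minimal sketch of such a test, not part of the original maths project:)

from maths import sqrt

def test_sqrt():
    # 9.0 * 0.5 is 4.5, not 3.0, so the buggy implementation fails here.
    assert sqrt(9.0) == 3.0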

But, keeping that caveat in mind, full test coverage is a wonderful thing, and if you don’t test something, you’re guaranteed not to catch errors. Further, my example above is quite contrived, and in most situations full test coverage will spot most errors.

That’s it for part 2. Tune in next time to learn how to tidy up your test configuration, and after that, how to turn on Travis continuous integration for your GitHub projects!

Continuous integration in Python, Volume 1: automated tests with pytest

(Edit: I initially thought it would be cute to number from 0. But it turns out it becomes rather obnoxious to relate English (first, second, …) to 0-indexing. So this was formerly volume 0. But everything else remains the same.)

I just finished the process of setting up continuous integration from scratch for one of my projects, cellom2tif, a simple image file converter/liberator. I thought I would write a blog post about that process, but it has slowly mutated into a hefty document that I thought would work better as a series. I’ll cover automated testing, test coverage, and how to get these to run automatically for your project with Travis-CI and Coveralls.

Without further ado, here goes the first post: how to set up automated testing for your Python project using pytest.

Automated tests, and why you need them

Software engineering is hard, and it’s incredibly easy to mess things up, so you should write tests for all your functions, which ensure that nothing obviously stupid is going wrong. Tests can take a lot of different forms, but here’s a really basic example. Suppose this is a function in your file, maths.py:

def square(x):
    return x ** 2

Then, elsewhere in your package, you should have a file test_maths.py with the following function definition:

from maths import square
def test_square():
    x = 4
    assert square(x) == 16

This way, if someone (such as your future self) comes along and messes with the code in square(), test_square will tell you whether they broke it.

Testing in Python with pytest

A whole slew of testing frameworks, such as nose, will then traverse your project, look for files and functions whose names begin with test_, run them, and report any errors or assertion failures.

I’ve chosen to use pytest as my framework because:

  • it is a strict superset of both nose and Python’s built-in unittest, so that if you run it on projects set up with those, it’ll work out of the box;
  • its output is more readable than nose’s; and
  • its fixtures provide a very nice syntax for test setup and cleanup (see the minimal sketch just after this list).
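
Here’s a minimal sketch of that last point (my own toy example, not from the pytest docs): a fixture sets up a resource, hands it to any test that asks for it by name, and registers a cleanup step.

import pytest

@pytest.fixture
def numbers(request):
    data = [1, 2, 3]               # setup: runs before each test using `numbers`
    def cleanup():
        del data[:]                # teardown: runs after the test finishes
    request.addfinalizer(cleanup)
    return data

def test_sum(numbers):
    assert sum(numbers) == 6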

For a quick intro to pytest, you can watch Andreas Pelme’s talk at this year’s EuroPython conference:

But the basics are very simple: sprinkle files named test_something.py throughout your project, each containing one or more test_function() definitions; then type py.test on the command line (at your project root directory), and voilà! Pytest will traverse all your subdirectories, gather up all the test files and functions, and run your tests.

Here’s the output for the minimal maths project described above:

~/projects/maths $ py.test
============================= test session starts ==============================
platform darwin -- Python 2.7.8 -- py-1.4.20 -- pytest-2.5.2
collected 1 items

test_maths.py .

=========================== 1 passed in 0.03 seconds ===========================

In addition to the test functions described above, Python has a standard library module called doctest, which automatically runs properly formatted usage examples in your documentation as tests. Here's an example:

def square(x):
    """Return the square of a number `x`.

    [...]

    Examples
    --------
    >>> square(5)
    25
    """
    return x ** 2

(See my post on NumPy docstring conventions for what should go in the ellipsis above.)

Depending on the complexity of your code, doctests, test functions, or both will be the appropriate route. Pytest supports doctests with the --doctest-modules flag. (This runs both your standard tests and doctests.)

~/projects/maths $ py.test --doctest-modules
============================= test session starts ==============================
platform darwin -- Python 2.7.8 -- py-1.4.20 -- pytest-2.5.2
collected 2 items

maths.py .
test_maths.py .

=========================== 2 passed in 0.06 seconds ===========================

Test-driven development

That was easy! And yet most people, my past self included, neglect tests, thinking they’ll do them eventually, when the software is ready. This is backwards. You’ve probably heard the phrase “Test-driven development (TDD)”; this is what they’re talking about: writing your tests before you’ve written the functionality to pass them. It might initially seem like wasted effort, like you’re not making progress in what you actually want to do, which is write your software. But it’s not:

By spending a bit of extra effort to prevent bugs down the road, you will get to where you want to go faster.
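
As a tiny, hypothetical illustration of the workflow: the test for a cube() function is written first, fails (cube() doesn’t exist yet), and only then is the function written to make it pass.

# test_maths.py -- written first; py.test fails until cube() exists
from maths import cube

def test_cube():
    assert cube(3) == 27

# maths.py -- written second, with the sole aim of making the test pass
def cube(x):
    return x ** 3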

That’s it for volume 1! Watch out for the next post: ensuring your tests are thorough by measuring test coverage.

Use pmset to restore your Mac’s instant-wake after upgrading to Yosemite

I just upgraded to Mac OS X Yosemite Beta 3, and sure enough, Apple has once again broken my time-to-deep-sleep setting. After I upgraded, I noticed that my computer took a few seconds to wake from sleep, rather than the usual instant-on that defined OS X for so long. I checked pmset, Mac’s power management command-line utility:

$ pmset -g
Active Profiles:
Battery Power       -1*
AC Power        -1
Currently in use:
 standbydelay         4200
 standby              1
 halfdim              1
 hibernatefile        /var/vm/sleepimage
 gpuswitch            2
 darkwakes            0
 disksleep            10
 sleep                0
 autopoweroffdelay    14400
 hibernatemode        3
 autopoweroff         1
 ttyskeepawake        1
 displaysleep         10
 acwake               0
 lidwake              1

Sure enough, the standbydelay setting, the time until the computer switches from “sleep” to “deep sleep”, is back to 4,200 seconds (1 hour 10 minutes), not the 24 hours that I had set it to last year.

So, as before, if you want to speed up your wake-up time in everyday use of your laptop, run the following command in the Terminal to set standbydelay back to 86,400 seconds (24 hours):

sudo pmset -a standbydelay 86400
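
You can check that the new value has stuck by running pmset -g again and looking at the standbydelay line.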

Check in for a new post come next upgrade! (Mammoth?)

Read microscopy images to numpy arrays with python-bioformats

The python-bioformats library lets you seamlessly read microscopy images into numpy arrays from pretty much any file format.

I recently explained how to use Fiji’s Jython interpreter to open BioFormats images, do some processing, and save the result to a standard format such as TIFF. Since then, the CellProfiler team has released an incredible tool, the python-bioformats library. It uses the Java Native Interface (JNI) to access the Java BioFormats library and give you back a numpy array within Python. In other words, it’s magic.

Some of the stuff I was doing in the Jython interpreter was not going to fly for a 400GB image file produced by Leica (namely, setting the flag setOpenAllSeries(True)). This file contained 3D images of multiple zebrafish embryos, obtained every 30 minutes for three days. I needed to process each image sequentially.

The first problem was that even reading the metadata from the file resulted in a Java out-of-memory error! With the help of Lee Kamentsky, one of the creators of python-bioformats, I figured out that Java allocates a maximum memory footprint of just 256MB. With the raw metadata string occupying 27MB, this was not enough to contain the full structure of the parsed metadata tree. The solution was simply to set a much larger maximum memory allocation to the JVM:

import javabridge as jv, bioformats as bf
jv.start_vm(class_path=bf.JARS, max_heap_size='8G')

Once that was done, it was straightforward enough to get an XML string of metadata to get information about all the images contained in the file:

md = bf.get_omexml_metadata('Long time Gfap 260314.lif')

This was my first experience with XML, and I found the official schema extremely intimidating. Thankfully, as usual, Python makes common things easy, and parsing XML is a very common thing. Using the xml package from the Python standard library, it is easy enough to get the information I want from the XML string:

from xml.etree import ElementTree as ETree
mdroot = ETree.fromstring(md)

You might have gathered that the ElementTree object contains a rooted tree with all the information about our file. Each node has a tag, attributes, and children. The root contains information about each series:

>>> print(mdroot[300].tag, mdroot[300].attrib)
('{http://www.openmicroscopy.org/Schemas/OME/2011-06}Image', {'ID': 'Image:121', 'Name': '41h to 47.5 hpSCI/Pos021_S001'})

And each series contains information about its acquisition, physical measurements, and pixel measurements:

>>> for a in mdroot[300]:
...     print((a.tag, a.attrib))
...
('{http://www.openmicroscopy.org/Schemas/OME/2011-06}AcquiredDate', {}),
('{http://www.openmicroscopy.org/Schemas/OME/2011-06}InstrumentRef', {'ID': 'Instrument:121'}),
('{http://www.openmicroscopy.org/Schemas/OME/2011-06}ObjectiveSettings', {'RefractiveIndex': '1.33', 'ID': 'Objective:121:0'}),
('{http://www.openmicroscopy.org/Schemas/OME/2011-06}Pixels', {'SizeT': '14', 'DimensionOrder': 'XYCZT', 'PhysicalSizeY': '0.445197265625', 'PhysicalSizeX': '0.445197265625', 'PhysicalSizeZ': '1.9912714979001302', 'SizeX': '1024', 'SizeY': '1024', 'SizeZ': '108', 'SizeC': '2', 'Type': 'uint8', 'ID': 'Pixels:121'})

I only need a fraction of this metadata, so I wrote a function, parse_xml_metadata, to parse out the image names, their size in pixels, and their physical resolution.
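
I won’t reproduce the real function here, but a rough sketch of it, assuming the OME 2011-06 namespace and the attribute names visible in the output above, might look like this:

from xml.etree import ElementTree as ETree

OME = '{http://www.openmicroscopy.org/Schemas/OME/2011-06}'

def parse_xml_metadata(xml_string):
    """Sketch: return image names, sizes (T, Z, Y, X, C), and physical resolutions."""
    names, sizes, resolutions = [], [], []
    root = ETree.fromstring(xml_string)
    for image in root.findall(OME + 'Image'):
        names.append(image.attrib['Name'])
        pixels = image.find(OME + 'Pixels')
        sizes.append(tuple(int(pixels.attrib['Size' + d]) for d in 'TZYXC'))
        resolutions.append(tuple(float(pixels.attrib['PhysicalSize' + d])
                                 for d in 'ZYX'))
    return names, sizes, resolutions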

Armed with this knowledge, it is then straightforward to preallocate a numpy array for each image and read the image from disk:

from __future__ import division
import numpy as np
from matplotlib import pyplot as plt, cm
from lesion.lifio import parse_xml_metadata

filename = 'Long time Gfap 260314.lif'
rdr = bf.ImageReader(filename, perform_init=True)
names, sizes, resolutions = parse_xml_metadata(md)
idx = 50 # arbitrary series for demonstration
size = sizes[idx]
nt, nz = size[:2]
image5d = np.empty(size, np.uint8)
for t in range(nt):
    for z in range(nz):
        image5d[t, z] = rdr.read(z=z, t=t, series=idx, rescale=False)
plt.imshow(image5d[nt//2, nz//2, :, :, 0], cmap=cm.gray)
plt.show()

2D slice from 5D volume

Boom. Using Python BioFormats, I’ve read in a small(ish) part of a quasi-terabyte image file into a numpy array, ready for further processing.

Note: the dimension order here is time, z, y, x, channel, or TZYXC, which I think is the most efficient way to read these files in. My wrapper allows arbitrary dimension order, so it’ll be good to use it to figure out the fastest way to iterate through the volume.

In my case, I’m looking to extract statistics using scikit-image’s profile_line function, and plot their evolution over time. Here’s the min/max intensity profile along the embryo for a sample stack:

min/max vs time

I still need to clean up the code, in particular to detect bad images (no prizes for guessing which timepoint was bad here), but my point for now is that, thanks to Python BioFormats, doing your entire bioimage analysis in Python just got a heck of a lot easier.

SciPy 2014: an extremely useful conference with a diversity problem

I just got back home from the SciPy 2014 conference in Austin, TX. Here are my thoughts after this year’s conference.

About SciPy in general

Since my first SciPy in 2012, I’ve decided to prioritise programming conferences over scientific ones, because the value I get for my time is just that much higher. At a scientific conference, I certainly become aware of way-cool stuff going on in other research labs in my area. Once I get home, however, I go back to whatever I was doing. In contrast, at programming conferences, I become aware of new tools and practices that change the way I do science. In his keynote this year, Greg Wilson said of Software Carpentry, “We save researchers a day a week for the rest of their careers.” I feel the same way about SciPy in general.

In the 2012 sprints, I learned about GitHub Pull Requests and code review, the lingua franca of open source development today. I can’t express how useful that’s been. I also started my ongoing collaboration with the scikit-image project, which has enabled my research to reach far more users than I ever could have achieved on my own.

No scientific conference I’ve been to has had such an impact on my science output, nor can I imagine one doing so.

This year’s highlights

This year was no different. Without further ado, here are my top hits from this year’s conference:

  • Michael Droettboom talked about his continuous benchmarking project, Airspeed Velocity. It is hilariously named and incredibly useful. asv checks out code from your Git repo at regular intervals and runs benchmarks (which you define), and plots your code’s performance over time. It’s an incredible guard against feature creep slowing down your code base.
  • IPython recently unveiled their modal version 2.0 interface, sending vimmers worldwide rejoicing. But a few key bindings are just wrong from a vim perspective. Most egregiously, i, which should enter edit mode, interrupts the kernel when pressed twice! That’s just evil. Paul Ivanov goes all in with vim keybindings with his hilarious and life-changing IPython vimception talk. The title is more appropriate than I realised. Must-watch.
  • Damián Avila revealed (heh) his live IPython presentations with Reveal.js, forever changing how Python programmers present their work. I was actually aware of this before the conference and used it in my own talk, but you should definitely watch his talk and get the extension from his repo.
  • Min RK gave an excellent tutorial on IPython parallel (repo, videos 1, 2, 3). It’s remarkable what the IPython team have achieved thanks to their decoupling of the interactive shell and the computation “kernel”. (I still hate that word.)
  • Brian Granger and Jon Frederic gave an excellent tutorial on IPython interactive widgets (notebook here). They provide a simple and fast way to interactively explore your data. I’ve already started using these on my own problems.
  • Aaron Meurer gave the best introduction to the Python packaging problem that I’ve ever seen, and how Continuum’s conda project solves it. I think we still need an in-depth tutorial on how package distributors should use conda, but for users, conda is just awesome, and this talk explains why.
  • Matt Rocklin has a gift for crystal clear speaking, despite his incredible speed, and it was on full display in his (and Mark Wiebe’s) talk on Blaze, Continuum’s new array abstraction library. I’m not sure I’ll be using Blaze immediately but it’s certainly on my radar now!
  • Lightning talks are always fun: days 1, 2, 3. Watch out for Fernando Pérez’s announcement of Project Jupyter, the evolution of the IPython notebook, and for Damon McDougall’s riveting history of waffles. (You think I’m joking.)

Apologies if I’ve missed anyone: with three tracks, an unofficial fourth showing the World Cup matches ;) , and my own talk preparations, “overwhelming” does not begin to describe the conference! I will second Aaron Meurer’s assertion that there were no bad talks. Which brings us to…

On my to-watch

Jake Vanderplas recently wrote a series of blog posts (last one here, with links to earlier posts) comparing frequentist and Bayesian approaches to statistical analysis, in which he makes a compelling argument that we should all relegate frequentism to the dustbin of history. As such, I intend to go over Chris Fonnesbeck’s tutorial (2, 3) and talk about Bayesian analysis in Python using PyMC.

David Sanders also did a Julia tutorial (part 2) that was scheduled at the same time as my own scikit-image tutorial, but I’m interested to see how the Julia ecosystem is progressing, so that should be a good place to start. (Although I’m still bitter that they went with 1-based indexing!)

The reproducible science tutorial (part 2) generated quite a bit of buzz so I would be interested to go over that one as well.

For those interested in computing education or in geoscience, the conference had dedicated tracks for each of those, so you are bound to find something you like (not least, Lorena Barba’s and Greg Wilson’s keynotes). Have a look at the full listing of videos here. These might be easier to navigate by looking at the conference schedule.

The SciPy experience

I want to close this post with a few reflections on the conference itself.

SciPy is broken up into three “stages”: two days of tutorials, three days of talks, and two days of sprints. Above, I covered the tutorials and talks, but to me, the sprints are what truly distinguish SciPy. The spirit of collaboration is unparalleled, and an astonishing amount of value is generated in those two days, either in the shape of code, or in introducing newcomers to new projects and new ways to collaborate in programming.

My biggest regret of the conference was not giving a lightning talk urging people to come to the sprints. I repeatedly asked people whether they were coming to the sprints, and almost invariably the answer was that they didn’t feel they were good enough to contribute. To reiterate my previous statements: (1) when I attended my first sprint in 2012, I had never done a pull request; (2) sprints are an excellent way to introduce newcomers to projects and to the pull request development model. All the buzz around the sprints was how welcoming all of the teams were, but I think there is a massive number of missed opportunities because this is not made obvious to attendees before the sprints.

Lastly, a few notes on diversity. During the conference, April Wright, a student in evolutionary biology at UT Austin, wrote a heartbreaking blog post about how excluded she felt from a conference where only 15% of attendees were women. That particular incident was joyfully resolved, with plenty of SciPyers reaching out to April and inviting her along to sprints and other events. But it highlighted just how poorly we are doing in terms of diversity. Andy Terrel, one of the conference organisers, pointed out that 15% is much better than 2012’s three (women, not percent!), but (a) that is still extremely low, and (b) I was horrified to read this because I was there in 2012… And I did not notice that anything was wrong. How can it be, in 2012, that it can seem normal to be at a professional conference and have effectively zero women around? It doesn’t matter what one says about the background percentage of women in our industry and so on… Maybe SciPy is doing all it can about diversity. (Though I doubt it.) The point is that a scene like that should feel like one of those deserted cityscapes in post-apocalyptic movies. As long as it doesn’t, as long as SciPy feels normal, we will continue to have diversity problems. I hope my fellow SciPyers look at these numbers, feel appalled as I have, and try to improve.

… And on cue, while I was writing this post, Andy Terrel wrote a great post of his own about this very topic:

http://andy.terrel.us/blog/2014/07/17/

I still consider SciPy a fantastic conference. Jonathan Eisen (@phylogenomics), whom I admire, would undoubtedly boycott it because of the problems outlined above, but I am heartened that the organising committee is taking this as a serious problem and trying hard to fix it. I hope next time it is even better.

A clever use of SciPy’s ndimage.generic_filter for n-dimensional image processing

This year I am privileged to be a mentor in the Google Summer of Code for the scikit-image project, as part of the Python Software Foundation organisation. Our student, Vighnesh Birodkar, recently came up with a clever use of SciPy’s ndimage.generic_filter that is certainly worth sharing widely.

Vighnesh is tasked with implementing region adjacency graphs and graph based methods for image segmentation. He initially wrote specific functions for 2D and 3D images, and I suggested that he should merge them: either with n-dimensional code, or, at the very least, by making 2D a special case of 3D. He chose the former, and produced extremely elegant code. Three nested for loops and a large number of neighbour computations were replaced by a function call and a simple loop. Read on to find out how.

Iterating over an array of unknown dimension is not trivial a priori, but thankfully, someone else has already solved that problem: NumPy’s nditer and ndindex functions allow one to efficiently iterate through every point of an n-dimensional array. However, that still leaves the problem of finding neighbors, to determine which regions are adjacent to each other. Again, this is not trivial to do in nD.
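
As a small aside (my own example, not Vighnesh’s code), np.ndindex makes the dimension-agnostic iteration part almost trivial:

import numpy as np

def count_nonzero_positions(arr):
    # np.ndindex yields every index tuple for the given shape,
    # so the same loop works for 2D, 3D, or any-D arrays.
    return sum(1 for idx in np.ndindex(arr.shape) if arr[idx] != 0)

print(count_nonzero_positions(np.eye(3)))           # 3  (2D)
print(count_nonzero_positions(np.ones((2, 2, 2))))  # 8  (3D)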

scipy.ndimage provides a suitable function, generic_filter. Typically, a filter is used to iterate a “selector” (called a structuring element) over an array, compute some function of all the values covered by the structuring element, and replace the central value by the output of the function. For example, using the structuring element:


import numpy as np

fp = np.array([[0, 1, 0],
               [1, 1, 1],
               [0, 1, 0]], np.uint8)

and the function np.median on a 2D image produces a median filter over a pixel’s immediate neighbors. That is,


import functools
from scipy.ndimage import generic_filter

median_filter = functools.partial(generic_filter,
                                  function=np.median,
                                  footprint=fp)
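
The resulting median_filter can then be called like any other ndimage filter. A quick, made-up usage example:

import numpy as np

noisy = np.random.random((32, 32))   # made-up demo image
smooth = median_filter(noisy)        # `median_filter` as defined above
assert smooth.shape == noisy.shape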

Here, we don’t want to create an output array, but an output graph. What to do? As it turns out, Python’s pass-by-reference allowed Vighnesh to do this quite easily using the “extra_arguments” keyword to generic_filter: we can write a filter function that receives the graph and updates it when two distinct values are adjacent! generic_filter passes all values covered by a structuring element as a flat array, in the array order of the structuring element. So Vighnesh wrote the following function:


def _add_edge_filter(values, g):
    """Add an edge between first element in `values` and
    all other elements of `values` in the graph `g`.
    `values[0]` is expected to be the central value of
    the footprint used.

    Parameters
    ----------
    values : array
        The array to process.
    g : RAG
        The graph to add edges in.

    Returns
    -------
    0.0 : float
        Always returns 0.

    """
    values = values.astype(int)
    current = values[0]
    for value in values[1:]:
        g.add_edge(current, value)
    return 0.0

Then, using the footprint:


fp = np.array([[0, 0, 0],
               [0, 1, 1],
               [0, 1, 0]], np.uint8)

(or its n-dimensional analog), this filter is called as follows on labels, the image containing the region labels:


filters.generic_filter(labels,
                       function=_add_edge_filter,
                       footprint=fp,
                       mode='nearest',
                       extra_arguments=(g,))

This is a rather unconventional use of generic_filter, which is normally used for its output array. Note how the return value of the filter function, _add_edge_filter, is just 0! In our case, the output array contains all 0s, but we use the filter exclusively for its side-effect: adding an edge to the graph g when there is more than one unique value in the footprint. That’s cool.
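
To make this concrete, here is a small self-contained sketch (my own toy example, using networkx and the _add_edge_filter function above) that builds the adjacency graph for a tiny label image:

import networkx as nx
import numpy as np
from scipy.ndimage import filters

labels = np.array([[1, 1, 2],
                   [1, 2, 2],
                   [3, 3, 2]])

fp = np.array([[0, 0, 0],
               [0, 1, 1],
               [0, 1, 0]], np.uint8)

g = nx.Graph()
filters.generic_filter(labels,
                       function=_add_edge_filter,
                       footprint=fp,
                       mode='nearest',
                       extra_arguments=(g,))

# Same-label neighbours produce self-loops; drop them to see the adjacencies.
print(sorted((u, v) for u, v in g.edges() if u != v))
# [(1, 2), (1, 3), (2, 3)]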

Continuing, in this first RAG implementation, Vighnesh wanted to segment according to average color, so he further needed to iterate over each pixel/voxel/hypervoxel and keep a running total of the color and the pixel count. He used elements in the graph node dictionary for this and updated them using ndindex:


for index in np.ndindex(labels.shape):
    current = labels[index]
    g.node[current]['pixel count'] += 1
    g.node[current]['total color'] += image[index]

Thus, together, numpy’s nditer, ndindex, and scipy.ndimage’s generic_filter provide a powerful way to perform a large variety of operations on n-dimensional arrays… Much larger than I’d realised!

You can see Vighnesh’s complete pull request here and follow his blog here.

If you use NumPy arrays and their massive bag of tricks, please cite the paper below!

Stéfan van der Walt, S. Chris Colbert, and Gaël Varoquaux (2011). The NumPy array: a structure for efficient numerical computation. Computing in Science & Engineering, 13(2), 22–30. arXiv:1102.1523v1