Tag Archives: Python

Summer School Announcement: ASPP Asia-Pacific 2018

The Advanced Scientific Programming in Python (ASPP) summer school has had 10 extremely successful iterations in Europe. (You can find past materials, schedules, and student evaluations at https://python.g-node.org/archives.) Now, thanks to the INCF, we will be holding its first iteration in Australia, to cater to the Asia Pacific region. (Note: the original ASPP will still take place in Europe next Northern summer; this is a fork of that school.)

Key details

  • The workshop runs January 14-21 at the Melbourne Brain Centre, University of Melbourne, Australia
  • topics include: git, contributing to open source software with github, testing, debugging, profiling, advanced NumPy, Cython, data visualisation.
  • hands-on learning using pair programming
  • free to attend (but students are responsible for travel, accommodation, and meals)
  • 30 student places, to be selected competitively
  • application deadline is Oct 31, 2017, 23:59 UTC.
  • website: https://melbournebioinformatics.org.au/aspp-asia-pacific
  • apply: https://melbournebioinformatics.org.au/aspp-asia-pacific/applications (make sure you read the FAQ on that page)

Background

Two-and-a-bit years ago, Tiziano Zito asked me if I could join the faculty at the 2015 ASPP school in Munich (then in its 8th iteration). It turned out to be a fantastic teaching experience, and, more importantly, it was a fantastic experience for the students. Students selected for the school fit a certain profile, neither novice nor advanced. As such, you can be sure that if you participate in the school, you will learn a great deal. We teach tools that will immediately improve your scientific practice. I decided that I wanted to replicate the school in Australia. Now it is finally here!

Course outline

Scientists spend increasingly more time writing, maintaining, and debugging software. While techniques for doing this efficiently have evolved, only few scientists have been trained to use them. As a result, instead of doing their research, they spend far too much time writing deficient code and reinventing the wheel. In this course we will present a selection of advanced programming techniques and best practices that are standard in industry, but especially tailored to the needs of a programming scientist. Lectures are devised to be interactive and to give the students enough time to acquire direct hands-on experience with the materials. Students will work in pairs throughout the school and will team up to practice the newly learned skills in a real programming project — an entertaining computer game.

We use the Python programming language for the entire course. Python works as a simple programming language for beginners, but more importantly, it also works great in scientific simulations and data analysis. We show how clean language design, ease of extensibility, and the great wealth of open source libraries for scientific computing and data visualization are driving Python to becoming a standard tool for scientists.

Who is eligible?

This school is targeted at Master/PhD students and postdocs from all areas of science. Competence in Python or in another language such as Java, C/C++, MATLAB, or Mathematica is absolutely required. Basic knowledge of Python and of a version control system such as git, subversion, mercurial, or bazaar is assumed. Participants without any prior experience with Python and/or git should work through the proposed introductory material before the course.

We have strived to get a pool of students that is international and gender-balanced, and have succeeded, with gender parity in the last four schools.

More questions

If you have any questions, contact aspp@melbournebioinformatics.org.au.

Please circulate this announcement widely!

Thanks,

Juan.

Prettier LowLevelCallables with Numba JIT and decorators

In my recent post, I extolled the virtues of SciPy 0.19’s LowLevelCallable. I did lament, however, that for generic_filter, the LowLevelCallable interface is a good deal uglier than the standard function interface. In the latter, you merely need to provide a function that takes the values within a pixel neighbourhood, and outputs a single value — an arbitrary function of the input values. That is a Wholesome and Good filter function, the way God intended.

In contrast, a LowLevelCallable takes the following signature:

int callback(double *buffer, intptr_t filter_size, 
             double *return_value, void *user_data)

That’s not very Pythonic at all. In fact, it’s positively Conic (TM). For those that don’t know, pointers are evil, so let’s aim to avoid their use.

“But Juan!”, you are no doubt exclaiming. “Juan! Didn’t you just tell us how to use pointers in Numba cfuncs, and tell us how great it was because it was so fast?”

Indeed I did. But it left a bad taste in my mouth. Although I felt that the tradeoff was worth it for such a phenomenal speed boost (300x!), I was unsatisfied. So I started immediately to look for a tidier solution. One that would let me write proper filter functions while still taking advantage of LowLevelCallables.

It turns out Numba cfuncs can call Numba jitted functions, so, with a little bit of decorator magic, it’s now ludicrously easy to write performant callables for SciPy using just pure Python. (If you don’t know what Numba JIT is, read my earlier post.) As in the last post, let’s look at grey_erosion as a baseline benchmark:

>>> import numpy as np
>>> footprint = np.array([[0, 1, 0],
...                       [1, 1, 1],
...                       [0, 1, 0]], dtype=bool)
>>> from scipy import ndimage as ndi
>>>
>>> %timeit ndi.grey_erosion(image, footprint=fp)
1 loop, best of 3: 160 ms per loop

Now, we write a decorator that uses Numba jit and Numba cfunc to make a LowLevelCallable suitable for passing directly into generic_filter:

>>> import numba
>>> from numba import cfunc, carray
>>> from numba.types import intc, CPointer, float64, intp, voidptr
>>> from scipy import LowLevelCallable
>>>
>>> def jit_filter_function(filter_function):
...     jitted_function = numba.jit(filter_function, nopython=True)
...     @cfunc(intc(CPointer(float64), intp, CPointer(float64), voidptr))
...     def wrapped(values_ptr, len_values, result, data):
...         values = carray(values_ptr, (len_values,), dtype=float64)
...         result[0] = jitted_function(values)
...         return 1
...     return LowLevelCallable(wrapped.ctypes)

If you haven’t seen decorators before, read this primer from Real Python. To summarise, we’ve written a function that takes as input a Python function, and outputs a LowLevelCallable. Here’s how to use it:

>>> @jit_filter_function
... def fmin(values):
...     result = np.inf
...     for v in values:
...         if v < result:
...             result = v
...     return result

As you can see, the fmin function definition looks just like a normal Python function. All the magic happens when we attach our @jit_filter_function decorator to the top of the function. Let’s see it in action:

>>> %timeit ndi.generic_filter(image, fmin, footprint=fp)
10 loops, best of 3: 92.9 ms per loop

Wow! numba.jit is actually over 70% faster than grey_erosion or the plain cfunc approach!

In case you want to use this, I’ve made a package available on PyPI, so you can actually pip install it right now with pip install llc (for low-level callable), and then:

>>> from llc import jit_filter_function

The source code is on GitHub. Currently it only covers ndi.generic_filter‘s signature, and only with Numba, but I hope to gradually expand it to cover all the functions that take LowLevelCallables in SciPy, as well as support Cython. Pull requests are welcome!

SciPy’s new LowLevelCallable is a game-changer

… and combines rather well with that other game-changing library I like, Numba.

I’ve lamented before that function calls are expensive in Python, and that this severely hampers many functions that should be insanely useful, such as SciPy’s ndimage.generic_filter.

To illustrate this, let’s look at image erosion, which is the replacement of each pixel in an image by the minimum of its neighbourhood. ndimage has a fast C implementation, which serves as a perfect benchmark against the generic version, using a generic filter with min as the operator. Let’s start with a 2048 x 2048 random image:

>>> import numpy as np
>>> image = np.random.random((2048, 2048))

and a neighbourhood “footprint” that picks out the pixels to the left and right, and above and below, the centre pixel:

>>> footprint = np.array([[0, 1, 0],
...                       [1, 1, 1],
...                       [0, 1, 0]], dtype=bool)

Now, we measure the speed of grey_erosion and generic_filter. Spoiler alert: it’s not pretty.

>>> from scipy import ndimage as ndi
>>> %timeit ndi.grey_erosion(image, footprint=footprint)
10 loops, best of 3: 118 ms per loop
>>> %timeit ndi.generic_filter(image, np.min, footprint=footprint)
1 loop, best of 3: 27 s per loop

As you can see, with Python functions, generic_filter is unusable for anything but the tiniest of images.

A few months ago, I was trying to get around this by using Numba-compiled functions, but the way to feed C functions to SciPy was different depending on which part of the library you were using. scipy.integrate used ctypes, while scipy.ndimage used PyCObjects or PyCapsules, depending on your Python version, and Numba only supported the former method at the time. (Plus, this topic starts to stretch my understanding of low-level Python, so I felt there wasn’t much I could do about it.)

Enter this pull request to SciPy from Pauli Virtanen, which is live in the most recent SciPy version, 0.19. It unifies all C-function interfaces within SciPy, and Numba already supports this format. It takes a bit of gymnastics, but it works! It really works!

(By the way, the release is full of little gold nuggets. If you use SciPy at all, the release notes are well worth a read.)

First, we need to define a C function of the appropriate signature. Now, you might think this is the same as the Python signature, taking in an array of values and returning a single value, but that would be too easy! Instead, we have to go back to some C-style programming with pointers and array sizes. From the generic_filter documentation:

This function also accepts low-level callback functions with one of the following signatures and wrapped in scipy.LowLevelCallable:

int callback(double *buffer, npy_intp filter_size, 
             double *return_value, void *user_data)
int callback(double *buffer, intptr_t filter_size, 
             double *return_value, void *user_data)

The calling function iterates over the elements of the input and output arrays, calling the callback function at each element. The elements within the footprint of the filter at the current element are passed through the buffer parameter, and the number of elements within the footprint through filter_size. The calculated value is returned in return_value. user_data is the data pointer provided to scipy.LowLevelCallable as-is.

The callback function must return an integer error status that is zero if something went wrong and one otherwise.

(Let’s leave aside that crazy reversal of Unix convention of the past 50 years in the last paragraph, except to note that our function must return 1 or it will be killed.)

So, we need a Numba cfunc that takes in:

  • a double pointer pointing to the values within the footprint,
  • a pointer-sized integer that specifies the number of values in the footprint,
  • a double pointer for the result, and
  • a void pointer, which could point to additional parameters, but which we can ignore for now.

The Numba type names are listed in this page. Unfortunately, at the time of writing, there’s no mention of how to make pointers there, but finding such a reference was not too hard. (Incidentally, it would make a good contribution to Numba’s documentation to add CPointer to the Numba types page.)

So, armed with all that documentation, and after much trial and error, I was finally ready to write that C callable:

>>> from numba import cfunc, carray
>>> from numba.types import intc, intp, float64, voidptr
>>> from numba.types import CPointer
>>> 
>>> 
>>> @cfunc(intc(CPointer(float64), intp,
...             CPointer(float64), voidptr))
... def nbmin(values_ptr, len_values, result, data):
...     values = carray(values_ptr, (len_values,), dtype=float64)
...     result[0] = np.inf
...     for v in values:
...         if v < result[0]:
...             result[0] = v
...     return 1

The only other tricky bits I had to watch out for while writing that function were as follows:

  • remembering that there’s two ways to de-reference a pointer in C: *ptr, which is not valid Python and thus not valid Numba, and ptr[0]. So, to place the result at the given double pointer, we use the latter syntax. (If you prefer to use Cython, the same rule applies.)
  • Creating an array out of the values_ptr and len_values variables, as shown here. That’s what enables the for v in values Python-style access to the array.

Ok, so now what you’ve been waiting for. How did we do? First, to recap, the original benchmarks:

>>> %timeit ndi.grey_erosion(image, footprint=footprint)
10 loops, best of 3: 118 ms per loop
>>> %timeit ndi.generic_filter(image, np.min, footprint=footprint)
1 loop, best of 3: 27 s per loop

And now, with our new Numba cfunc:

>>> %timeit ndi.generic_filter(image, LowLevelCallable(nbmin.ctypes), footprint=footprint)
10 loops, best of 3: 113 ms per loop

That’s right: it’s even marginally faster than the pure C version! I almost cried when I ran that.


Higher-order functions, ie functions that take other functions as input, enable powerful, concise, elegant expressions of various algorithms. Unfortunately, these have been hampered in Python for large-scale data processing because of Python’s function call overhead. SciPy’s latest update goes a long way towards redressing this.

Numba in the real world

Numba is a just-in-time compiler (JIT) for Python code focused on NumPy arrays and scientific Python. I’ve seen various tutorials around the web and in conferences, but I have yet to see someone use Numba “in the wild”. In the past few months, I’ve been using Numba in my own code, and I recently released my first real package using Numba, skan. The short version is that Numba is amazing and you should strongly consider it to speed up your scientific Python bottlenecks. Read on for the longer version.

Part 1: some toy examples

Let me illustrate what Numba is good for with the most basic example: adding two arrays together. You’ve probably seen similar examples around the web.

We start by defining a pure Python function for iterating over a pair of arrays and adding them:

In [1]:
import numpy as np


def addarr(x, y):
    result = np.zeros_like(x)
    for i in range(x.size):
        result[i] = x[i] + y[i]
    return result
How long does this take in pure Python?

In [2]:
n = int(1e6)
a = np.random.rand(n)
b = np.random.rand(n)
In [3]:
%timeit -r 1 -n 1 addarr(a, b)
1 loop, best of 1: 721 ms per loop
About half a second on my machine. Let’s try with Numba using its JIT decorator:

In [4]:
import numba

addarr_nb = numba.jit(addarr)
In [5]:
%timeit -r 1 -n 1 addarr_nb(a, b)
1 loop, best of 1: 283 ms per loop
The first time it runs, it’s only a tiny bit faster. That’s because of the nature of JITs: they only compile code as it is being run, in order to use object type information of the objects passed into the function. (Note that, in Python, the arguments a and b to addarr could be anything: an array, as expected, but also a list, a tuple, even a Banana, if you’ve defined such a class, and the meaning of the function body is different for each of those types.)

Let’s see what happens the next time we run it:

In [6]:
%timeit -r 1 -n 1 addarr_nb(a, b)
1 loop, best of 1: 6.36 ms per loop
Whoa! Now the code takes 5ms, about 100 times faster than the pure Python version. And the NumPy equivalent?

In [7]:
%timeit -r 1 -n 1 a + b
1 loop, best of 1: 5.62 ms per loop
Only marginally faster than Numba, even though NumPy addition is implemented in highly optimised C code. And, for some data types, Numba even beats NumPy:

In [8]:
r = np.random.randint(0, 128, size=n).astype(np.uint8)
s = np.random.randint(0, 128, size=n).astype(np.uint8)
In [9]:
%timeit -r 1 -n 1 r + s
1 loop, best of 1: 2.92 ms per loop
In [10]:
%timeit -r 1 -n 1 addarr_nb(r, s)
1 loop, best of 1: 238 ms per loop
In [11]:
%timeit -r 1 -n 1 addarr_nb(r, s)
1 loop, best of 1: 234 µs per loop
WOW! For smaller data types, Numba beats NumPy by over 10x!

I’m only speculating, but since my clock speed is about 1GHz (I’m writing this on a base Macbook with a 1.1GHz Core-m processor), I suspect that Numba is taking advantage of some SIMD capabilities of the processor, whereas NumPy is treating each array element as an individual arithmetic operation. (If any Numba or NumPy devs are reading this and have more concrete implementation details that explain this, please share them in the comments!)

So hopefully I’ve got your attention now. For years, NumPy has been the go-to library for performance Python in scientific computing. But, if you wanted to do something a little out of the ordinary, you were stuck. Now, Numba generally matches that for arbitrary code and sometimes beats it handily!

In this context, I decided to use Numba to do something a little less trivial, as part of my research.

Part 2: Real Numba

I’ll present below a slightly simplified version of the code present in my library, skan, which is currently available on PyPI and conda-forge. The task is to build an graph out of the pixels of a skleton image, like this one:

In [12]:
%matplotlib inline
In [13]:
import matplotlib.pyplot as plt
plt.rcParams['image.cmap'] = 'gray'
plt.rcParams['image.interpolation'] = 'nearest'
In [14]:
skeleton = np.array([[0, 1, 0, 0, 0, 1, 1],
                     [0, 0, 1, 1, 1, 0, 0],
                     [0, 1, 0, 0, 0, 1, 0],
                     [0, 0, 1, 0, 1, 0, 0],
                     [1, 1, 0, 1, 0, 0, 0]], dtype=bool)
skeleton = np.pad(skeleton, pad_width=1, mode='constant')
In [15]:
fig, ax = plt.subplots(figsize=(5, 5))
ax.imshow(skeleton)
ax.set_title('Skeleton')
ax.set_xlabel('col')
ax.set_ylabel('row')
Out[15]:

Every white pixel in the image will be a node in our graph, and we place edges between nodes if the pixels are next to each other (counting diagonals). A natural way to represent a graph in the SciPy world is as a sparse matrix A: we number the nonzero pixels from 1 onwards — these are the rows of the matrix — and then place a 1 at entry A(i, j) when pixel i is adjacent to pixel j. SciPy’s sparse.coo_matrix format make it very easy to construct such a matrix: we just need an array with the row coordinates and another with the column coordinates.

Because NumPy arrays are not dynamically resizable like Python lists, it helps to know ahead of time how many edges we are going to need to put in our row and column arrays. Thankfully, a well-known theorem of graph theory states that the number of edges of a graph is half the sum of the degrees. In our case, because we want to add the edges twice (once from i to j and once from j to i, we just need the sum of the degrees exactly. We can find this out with a convolution using scipy.ndimage:

In [16]:
from scipy import ndimage as ndi

neighbors = np.array([[1, 1, 1],
                      [1, 0, 1],
                      [1, 1, 1]])

degrees = ndi.convolve(skeleton.astype(int), neighbors) * skeleton
In [17]:
fig, ax = plt.subplots(figsize=(5, 5))
result = ax.imshow(degrees, cmap='magma')
ax.set_title('Skeleton, colored by node degree')
ax.set_xlabel('col')
ax.set_ylabel('row')
cbar = fig.colorbar(result, ax=ax, shrink=0.7)
cbar.set_ticks([0, 1, 2, 3])
cbar.set_label('degree')
There you can see “tips” of the skeleton, with only 1 neighbouring pixel, as purple, “paths”, with 2 neighbours, as red, and “junctions”, with 3 neighbors, as yellow.

Now, consider the pixel at position (1, 6). It has two neighbours (as indicated by its colour): (2, 5) and (1, 7). If we number the nonzero pixels as 1, 2, …, n from left to right and top to bottom, then this pixel has label 2, and its neighbours have labels 6 and 3. We therefore need to add edges (2, 3) and (2, 6) to the graph. Similarly, when we consider pixel 6, we will add edges (6, 5), (6, 3), and (6, 8).

In [18]:
fig, ax = plt.subplots(figsize=(5, 5))
result = ax.imshow(degrees, cmap='magma')
cbar = fig.colorbar(result, ax=ax, shrink=0.7)
cbar.set_ticks([0, 1, 2, 3])

nnz = len(np.flatnonzero(degrees))
pixel_labels = np.arange(nnz) + 1
for lab, y, x in zip(pixel_labels, *np.nonzero(degrees)):
    ax.text(x, y, lab, horizontalalignment='center',
            verticalalignment='center')

ax.set_xlabel('col')
ax.set_ylabel('row')
ax.set_title('Skeleton, with pixel IDs')
cbar.set_label('degree')
Scanning over the whole image, we see that we need row and col arrays of length exactly np.sum(degrees).

In [19]:
n_edges = np.sum(degrees)
row = np.empty(n_edges, dtype=np.int32)  # type expected by scipy.sparse
col = np.empty(n_edges, dtype=np.int32)
The final piece of the puzzle is finding neighbours. For this, we need to know a little about how NumPy stores arrays. Even though our array is 2-dimensional (rows and columns), these are all arrayed in a giant line, each row placed one after the other. (This is called “C-order”.) If we index into this linearised array (“raveled”, in NumPy’s language), we can make sure that our code works for 2D, 3D, and even higher-dimensional images. Using this indexing, neighbouring pixels to the left and right are accessed by subtracting or adding 1 to the current index. Neighbouring pixels above and below are accessed by subtracting or adding the length of a whole row. Finally, diagonal neighbours are found by combining these two. For simplicity, we only show the 2D version below:

In [20]:
def neighbour_steps(shape):
    step_sizes = np.cumprod((1,) + shape[-1:0:-1])
    axis_steps = np.array([[-1, -1],
                           [-1,  1],
                           [ 1, -1],
                           [ 1,  1]])
    diag = axis_steps @ step_sizes
    steps = np.concatenate((step_sizes, -step_sizes, diag))
    return steps
In [21]:
steps = neighbour_steps(degrees.shape)
print(steps)
[  1   9  -1  -9 -10   8  -8  10]
Of course, if we use these steps near the right edge of the image, we’ll wrap around, and mistakenly think that the first element of the next row is a neighbouring pixel! Our solution is to only process nonzero pixels, and make sure that we have a 1-pixel-wide “pad” of zero pixels — which we do, in the image above!

Now, we iterate over image pixels, look at neighbors, and populate the row and column vectors.

In [22]:
def build_graph(labeled_pixels, steps_to_neighbours, row, col):
    start = np.max(steps_to_neighbours)
    end = len(labeled_pixels) - start
    elem = 0  # row/col index
    for k in range(start, end):
        i = labeled_pixels[k]
        if i != 0:
            for s in steps:
                neighbour = k + s
                j = labeled_pixels[neighbour]
                if j != 0:
                    row[elem] = i
                    col[elem] = j
                    elem += 1
In [23]:
skeleton_int = np.ravel(skeleton.astype(np.int32))
skeleton_int[np.nonzero(skeleton_int)] = 1 + np.arange(nnz)
In [24]:
%timeit -r 1 -n 1 build_graph(skeleton_int, steps, row, col)
1 loop, best of 1: 917 µs per loop
Now we try the Numba version:

In [25]:
build_graph_nb = numba.jit(build_graph)
In [26]:
%timeit -r 1 -n 1 build_graph_nb(skeleton_int, steps, row, col)
1 loop, best of 1: 346 ms per loop
In [27]:
%timeit -r 1 -n 1 build_graph_nb(skeleton_int, steps, row, col)
1 loop, best of 1: 14.3 µs per loop
Nice! We get more than a 50-fold speedup using Numba, and this operation would have been difficult if not impossible to convert to a NumPy vectorized operation! We can now build our graph:

In [28]:
from scipy import sparse
G = sparse.coo_matrix((np.ones_like(row), (row, col))).tocsr()
As to what to do with said graph, I’ll leave that for another post. (You can also peruse the skan source code.) In the meantime, though, you can visualize it with NetworkX:

In [29]:
import networkx as nx

Gnx = nx.from_scipy_sparse_matrix(G)
Gnx.remove_node(0)

nx.draw_spectral(Gnx, with_labels=True)
There’s our pixel graph! Obviously, the speedup and n-d support are important for bigger, 3D volumes, not for this tiny graph. But they are important, and, thanks to Numba, easy to obtain.

Conclusion

I hope I’ve piqued your interest in Numba and encouraged you to use it in your own projects. I think the future of success of Python in science heavily depends on JITs, and Numba is a strong contender to be the default JIT in this field.

Note:This post was written using Jupyter Notebook. You can find the source notebook here.

The cost of a Python function call

I’ve read in various places that the Python function call overhead is very high. As I was parroting this “fact” to Ed Schofield recently, he asked me what the cost of a function actually was. I had no idea. This prompted us to do a few quick benchmarks.

The short version is that it takes about 150ns to call a function in Python (on my laptop). This doesn’t sound like a lot, but it means that you can make at most 6.7 million calls per second, two to three orders of magnitude slower than your processor’s clock speed.

If you want your function to do something, such as, oh, I don’t know, receive an input argument, this goes up to 350ns, throttling you at 2.8 million calls per second.

Benchmarking function calls

I cleaned up Ed’s and my initial experiments to make a small module and timer to measure all these values. You can clone the repo and run python function-calls/timer.py to check the numbers on your machine.

The benchmarks are variations of comparing the execution time of:

for i in range(n):
    pass

and:

def f():
    pass

for i in range(n):
    f()

for some suitably large n. As I mentioned above, that comes out to an absolute minimum of 150ns per function call.

What this means

I’ve been making a fuss over the past year about the excellent Toolz and the way it enables elegant streaming data processing. (See my demo repo and my EuroSciPy talk.) You can read data from a modern SSD at speeds approaching 500MB/s. If you want to stream each byte through Python functions, you’ll instantly lose two orders of magnitude of speed. And, the more functions you use, the slower you’ll go, which discourages functional programming and modularity — the very things I was trying to promote!

In the DNA sequence processing I demo in the talk, I get a throughput of about 0.5MB/s. On one hand, this is kind of OK because we are using effectively zero RAM, so we can just let the code run over lunch. On the other, it’s starting to bug me that 99% of my processor time is spent on Python function calls, rather than on actual data crunching.

This is a problem for Python. To work on seriously big data, you need to drop into a library written in C, such as NumPy or Pandas. You need to do this on a high level: any per-byte or per-data-element processing cannot be in Python, if you don’t want to waste your processor’s cycles. Python’s ecosystem is Insanely Great, so this is mostly fine, but it does limit your ability to research or implement cool new methods using Python.

As an example, the generic_filter function in SciPy’s ndimage package has infinitely many cool uses, but using it to process a 100MB image (which is small in biology) would take 15 seconds in function call overhead alone. Lest you think this is reasonable, SciPy’s greyscale erosion, implemented in C, takes less than 4 seconds on an image that size. A lot of my once-lackadaisical attitude towards Python performance stemmed from not knowing how long things should take. A lot less than they do, it turns out.

What to do about it

As I mentioned, Python’s high performance libraries are many and great. Look hard for optimised libraries that already do what you want. Try to express what you want to do as combinations of functions from NumPy, SciPy, Pandas, scikit-image, scikit-learn, and so on. Minimise the amount of time spent in Python. This is advice that you learn early on in scientific Python programming, but I didn’t appreciate just how important it is.

At some point, that approach will fail, and you will want to do something cute and custom with your data points. Reach for Cython sooner rather than later. As a primer, I recommend Stefan Behnel’s excellent tutorial from EuroSciPy 2015.

There is also Continuum’s Numba, which is sometimes easier to use than Cython. I don’t have any experience with it so I can’t comment much here. However, I’d consider it a very valuable project to implement generic_filter in Numba.
In the long-run, these are all workarounds, and I hope that the Python interpreter itself becomes faster, though there are few signs of that happening.

If you have other ideas on how to get around Python’s function call cost, please let me know in the comments!

My first use of Python 3’s `yield from`!

I never really understood why yield from was useful. Last weekend, I wanted to use Python 3.5’s new os.scandir to explore a directory (and its subdirectories). Tragically, os.scandir is not recursive, and I find os.walks 3-tuple values obnoxious.
Lo and behold, while I was trying to implement a recursive version of scandir, a yield from use just popped right out!

import os
def rscandir(path):
    for entry in os.scandir(path):
        yield entry
        if entry.is_dir():
            yield from rscandir(entry.path)

That’s it! I have to admit that reads wonderfully. The Legacy Python (aka Python 2.x) alternative is quite a bit uglier:

import os
def rscandir(path):
    for p in os.listdir(path):
        yield p
        if os.path.isdir(p):
            for q in rscandir(p):
                yield q

Yuck. So, yet again: time to move away from Legacy Python! ;)

EuroSciPy 2015 debrief

The videos from EuroSciPy 2015 are up! This marks a good time to write up my thoughts on the conference.
I’ve mentioned before that the yearly SciPy conference is stunningly useful. This year I couldn’t make it to Austin, but I did attend EuroSciPy, the European version of the same conference, in Cambridge, UK. It was spectacular.

Useful talks

The talk of the conference, for me, goes to Robin Wilson for recipy, which one can describe as a logging utility, if one wishes to make it sound as uninspiring as possible. Recipy’s strength is in its mind-boggling simplicity. Here is the unabridged usage guide:

import recipy

With this single line, your script will now generate an entry in a database every time it is run. It logs the start and end time, the working directory, the script’s git hash, any differences between the working copy and the last git commit (!), and the names of any input and output files. (File hashes are coming soon, I’m assured).
I don’t know about you but I have definitely lost count of the times I’ve looked at a file and wondered what script I ran to get it, or the input data that went into it. This library solves that problem with absolutely minimal friction for the user.
I also enjoyed Nicolas Rougier’s talk on ReScience, a new journal dedicated to replicated (and replicable) scientific analyses. It’s a venue to publish all those efforts to replicate a result you read in a paper. Given recent findings about how poorly most papers replicate, I think this is a really important outlet.
The other remarkable thing about it is that all review is open and done in the spirit of open source, on GitHub. Submission is by pull request, of course. With just one paper out so far, it’s a bit early to tell whether it’ll take off, but I really hope it does. I’ll be looking for stuff of my own to publish there, for sure. (Oh and by the way, they are looking for reviewers and editors!)
Another great talk was Philipp Rudiger on HoloViews, an object-oriented plotting framework. They define an arithmetic on figures: A * B overlays figure B on A, while B + C creates two subplots out of B and C (and automatically labels them). Their example notebooks rely a lot on IPython magic, which I’m not happy about and means I haven’t fully grokked the API, but it seems like a genuinely useful way to think about plotting.
A final highlight from the main session was Martin Weigert on Spimagine, his GPU-accelerated, 5D image analysis and visualisation framework. It was stupidly impressive. Although it’s a long-term project, I’m inclined to try to incorporate many of its components into scikit-image.

Tutorials

The tutorials are a great asset of both EuroSciPy and SciPy. I learn something new every year. The highlight for me was the Cython tutorial, in which Stefan Behnel demonstrated how easy it is to provide Python access to C++ code using Cython. (I have used Cython quite extensively, but only to speed up Python code, rather than wrap C or C++ code.)

Sprints

I was feeling a bit hypocritical for missing the sprints this year, since I had to run off before the Sunday. Emmanuelle Gouillart, another scikit-image core dev, suggested having a small, unofficial sprint on Friday evening. It grew and grew into a group of about 30 people (including about 10 new to sprinting) who all gathered at the Enthought Cambridge office to work on scikit-image or the SciPy lecture notes. A brilliant experience.
scikit-image sprint at Enthought
(By the way, nothing creepy going on with that dude hunching over one of our sprinters — that’s just husband-and-wife team Olivia and Robin Wilson! ;)

Final thoughts

As usual, I learned heaps and had a blast at this SciPy conference (my fourth). I hope it will remain a yearly ritual, and I hope someone reading this will give it a try next year!

The SciPy ecosystem belongs to everyone

I use Twitter favourites almost exclusively to mark posts that I know will be useful in some not-too-distant future; kind of like a Twitter Evernote. Recently I was looking through my list in search of this excellent blog post detailing how to build cross-platform binary distributions for conda.

I came across two other tweets from the EuroSciPy 2014 conference: this one by Ian Ozsvald about his IPython memory usage profiler, right next to this one by Alexandre Chabot about Aaron O’Leary’s notedown. I’d forgotten that this was how I came across these two tools, but since then I have contributed code to both (1, 2). I’d met Ian at EuroSciPy 2013, but I’ve never met Aaron, yet nevertheless there is my code in the latest version of his notedown library.

How remarkable the open-source Python community has become. Talks from Python conferences are posted to YouTube, usually as the conference is happening. (Add to that plenty of live tweeting.) Thus, even when I can’t attend the conferences, I can keep up with the latest open source libraries, from the other side of the world. And then I can grab the source code on GitHub, fiddle with it to my heart’s content, and submit a pull request to the author with my changes. After a short while, code that I wrote for my own utility is available for anyone else to use through PyPI or conda.

My point is: join us! Make your code open source, and conversely, when you need some functionality, don’t reinvent the wheel. See if there’s a library that almost meets your needs, and contribute!

Go to SciPy 2015

SciPy is my favourite conference. My goal with this post is to convince someone to go who hasn’t had that chance yet.

Photo by Ian Rees (from the SciPy 2012 conference website)

Why SciPy?

Most scientists go to conferences in their own field: neuroscientists go to the monstrous Society for Neuroscience (SfN); Bioinformaticians go to RECOMB, ISMB, or PSB; and so on.

People go to these to keep up with the latest advances in their field, and often, to do a bit of networking.

SciPy is a different kind of conference. It changes the way you do science. You learn about the latest free and open source software to help you with your work. You learn to write functions and interfaces instead of scripts, and to write tests so you don’t break your code. You also learn to contribute these to bigger projects, maximising the reach and impact of your work (see “sprints”, below).

And you learn these things by doing them, with the people who are the best at this, rather than by reading books and blog posts. (Which maybe I shouldn’t knock, since I’m writing a book about all this and you are reading my blog!)

Attendees to SciPy have scientific software in common, but come from diverse fields, including physics, pure maths, data visualisation, geosciences, climatology, and yes, biology and bioinformatics. Mingling with such a diverse group is a fantastic way to get your creative juices flowing!

The conference lasts a full week and is broken up into three parts: tutorials, main conference, and sprints.

the tutorials

With a few exceptions, you won’t learn about your own field. But you will learn an enormous amount about tools that will help you be a better scientist. If you are a newbie to Python, you can go to the beginner tutorials track and learn about the fantastic scientific libraries available in Python. If you already use NumPy, SciPy, pandas, and the IPython notebook, you can go to the intermediate or advanced tracks, and learn new things about those. Even as an advanced SciPy user I still get tons of value from the tutorials. (Last year Min RK gave a wild demo of IPython parallel’s capabilities by crawling wikipedia remotely while building up a graph visualisation on his live notebook.) (Fast-forward to the 1h mark to see just the payoff.) Here’s last year’s tutorial schedule for an idea of what to expect.

the main conference track

You will also hear about the latest advances in the scientific libraries you know and use, and about libraries you didn’t know about but will find useful (such as scikit-bio, yt or epipy). The main conference track features software advances written by presenters from many different fields. Hearing about these from the authors of the software lets you ask much different questions compared to hearing someone say, for example, “we used the Matlab image processing toolbox”. If you ever had a feature request for your favourite library, or you wondered why they do something in a particular way, there’s no better opportunity to get some closure.

The crosstalk between different fields is phenomenal. Hearing how a diverse set of people deal with their software problems really opens your mind to completely different approaches to what you had previously considered.

the sprints

Finally, there’s two days of coding sprints. Even if you are a complete beginner in software development, do yourself a favour and participate in one of these.

Two and a half years after my first SciPy in 2012, I’m writing a scientific Python book for O’Reilly, and I can 100% trace it to participating in the scikit-image sprint that year. With their guidance, I wrote my first ever GitHub pull request and my first ever unit test. Both were tiny and cute, and I would consider them trivial now, but that seed grew into massive improvements in my code-writing practice and many more contributions to open source projects.

And this is huge: now, instead of giving up when a software package doesn’t do what I need it to do, I just look at the source code and figure out how I can add what I want. Someone else probably wants that functionality, and by putting it into a major software library instead of in my own code, I get it into the hands of many more users. It’s a bit counterintuitive but there is nothing more gratifying than having some random person you’ve never met complain when you break something! This never happens when all your code is in your one little specialised repository containing functionality for your one paper.

How SciPy

The SciPy calls for tutorials, talks, posters, and its plotting contest are all out. There’s specialised tracks and most of you reading this are probably interested in the computational biology and medicine track. It’s taken me a while to write this post, so there’s just one week left to submit something: the deadline is April 1st Update: the deadline for talks and posters has been extended to April 10th!

Even if you don’t get something in, I encourage you to participate. Everything I said above still applies if you’re not presenting. You might have a bit more trouble convincing your funders to pay for your travels, but if that’s the case I encourage you to apply for financial assistance from the conference.

I’ve written about SciPy’s diversity problem before, so I’m happy to report that this year there’s specific scholarships for women and minorities. (This may have been true last year, I forget.) And, awesomely, Matt Davis has offered to help first-time submitters with writing their proposals.

Give SciPy a try: submit here and/or register here. And feel free to email me or comment below if you have questions!

Update: A colleague pointed out that I should also mention the awesomeness of the conference venue, so here goes: Austin in July is awesome. If you love the heat like I do, well, it doesn’t get any better. If you don’t, don’t worry: the AT&T Conference Center AC is on friggin overdrive the whole time. Plus, there’s some nearby cold springs to swim in. The center itself is an excellent hotel and the conference organises massive discounts for attendees. There’s a couple of great restaurants on-site; and the Mexican and Texas BBQ in the area are incredible — follow some Enthought and Continuum folks around to experience amazing food. Finally, Austin is a great city to bike in: last time I rented a road bike for the whole week from Mellow Johnny’s, and enjoyed quite a few lunchtime and evening rides.

Experiences porting a medium-sized library from Python 2 to 3

Prompted in part by some discussions with Ed Schofield, creator of python-future.org, I’ve been going on a bit of a porting spree to Python 3. I just finished with my gala segmentation library. (Find it on GitHub and ReadTheDocs.) Overall, the process is nowhere near as onerous as you might think it is. Getting started really is the hardest part. If you have more than yourself as a user, you should definitely just get on with it and port.

The second hardest part is the testing. In particular, you will need to be careful with dictionary iteration, pickled objects, and file persistence in general. I’ll go through these gotchas in more detail below.

Reminder: the order of dictionary items is undefined

This is one of those duh things that I forget over and over and over. In my porting, some tests that depended on a scikit-learn RandomForest object were failing. I assumed that there was some difference between the random seeding in Python 2 and Python 3, leading to slightly different models between the two versions of the random forest.

This was a massive red herring that took me forever to figure out. In actuality, the seeding was completely fine. However, gala uses networkx as its graph backend, which itself uses an adjacency dictionary to store edges. So when I asked for graph.edges() to get a set of training examples, I was getting the edges in a random order that was deterministic within Python 2.7: the edges returned were always in the same shuffled order. This went out the window when switching to Python 3.4, with the training examples now in a different order, resulting in a different random forest and thus a different learning outcome… And finally a failed test.

The solution should have been to use a classifier that is not sensitive to ordering of the training data. However, although many classifiers satisfy this property, in practice they suffer from slight numerical instability which is sufficient to throw the test results off between shufflings of the training data.

So I’ve trained a Naive Bayes classifier in Python 2.7, and which I then load up in Python 3.4 and check whether the parameters are close to a newly trained one. The actual classification results can differ slightly, and this becomes much worse in gala, where classification tasks are sequential, so a single misstep can throw off everything that comes after it.

When pickling, remember to open files in binary mode

I’ve always felt that the pickle module was deficient for not accepting filenames as input to dump. Instead, it takes an open, writeable file. This is all well and good but it turns out that you should always open files in binary mode when using pickle! I got this far without knowing that, surely an indictment of pickle’s API!

Additionally, you’ll have specify a encoding='bytes' when loading a Python 2 saved file in the Python 3 version of pickle.

Even when you do, objects may not map cleanly between Python 2 and 3 (for some libraries)

In Python 2:

>>> from sklearn.ensemble import RandomForestClassifier as RF
>>> rf = RF()
>>> from sklearn.datasets import load_iris
>>> iris = load_iris()
>>> rf = rf.fit(iris.data, iris.target)
>>> with open('rf', 'wb') as fout:
...     pck.dump(r, fout, protocol=2)

Then, in Python 3:

>>> with open('rf', 'rb') as fin:
...     rf = pck.load(fin, encoding='bytes')
... 
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-9-674ee92b354d> in <module>()
      1 with open('rf', 'rb') as fin:
----> 2     rf = pck.load(fin, encoding='bytes')
      3 

/Users/nuneziglesiasj/anaconda/envs/py3k/lib/python3.4/site-packages/sklearn/tree/_tree.so in sklearn.tree._tree.Tree.__setstate__ (sklearn/tree/_tree.c:18115)()

KeyError: 'node_count'

When all is said and done, your code will probably run slower on Python 3

I have to admit: this just makes me angry. After a lot of hard work ironing out all of the above kinks, gala’s tests run about 2x slower in Python 3.4 than in 2.7. I’d heard quite a few times that Python 3 is slower than 2, but that’s just ridiculous.

Nick Coghlan’s enormous Q&A has been cited as required reading before complaining about Python 3. Well, I’ve read it (which took days), and I’m still angry that the CPython core development team are generally dismissive of anyone wanting faster Python. Meanwhile, Google autocompletes “why is Python” with “so slow”. And although Nick asserts that those of us complaining about this “misunderstand the perspective of conservative users”, community surveys show a whopping 40% of Python 2 users citing “no incentive” as the reason they don’t switch.

In conclusion…

In the end, I’m glad I ported my code. I learned a few things, and I feel like a better Python “citizen” for having done it. But that’s the point: those are pretty weak reasons. Most people just want to get their work done and move on. Why would they bother porting their code if it’s not going to help them do that?