All posts by Juan Nunez-Iglesias

Summer School Announcement: ASPP Asia-Pacific 2018

The Advanced Scientific Programming in Python (ASPP) summer school has had 10 extremely successful iterations in Europe. (You can find past materials, schedules, and student evaluations at https://python.g-node.org/archives.) Now, thanks to the INCF, we will be holding its first iteration in Australia, to cater to the Asia Pacific region. (Note: the original ASPP will still take place in Europe next Northern summer; this is a fork of that school.)

Key details

  • The workshop runs January 14-21, 2018, at the Melbourne Brain Centre, University of Melbourne, Australia
  • Topics include: git, contributing to open source software with GitHub, testing, debugging, profiling, advanced NumPy, Cython, and data visualisation
  • Hands-on learning using pair programming
  • Free to attend (but students are responsible for travel, accommodation, and meals)
  • 30 student places, to be selected competitively
  • Application deadline: Oct 31, 2017, 23:59 UTC
  • Website: https://melbournebioinformatics.org.au/aspp-asia-pacific
  • Apply: https://melbournebioinformatics.org.au/aspp-asia-pacific/applications (make sure you read the FAQ on that page)

Background

Two-and-a-bit years ago, Tiziano Zito asked me if I could join the faculty at the 2015 ASPP school in Munich (then in its 8th iteration). It turned out to be a fantastic teaching experience, and, more importantly, it was a fantastic experience for the students. Students selected for the school fit a certain profile, neither novice nor advanced. As such, you can be sure that if you participate in the school, you will learn a great deal. We teach tools that will immediately improve your scientific practice. I decided that I wanted to replicate the school in Australia. Now it is finally here!

Course outline

Scientists spend more and more time writing, maintaining, and debugging software. While techniques for doing this efficiently have evolved, few scientists have been trained to use them. As a result, instead of doing their research, they spend far too much time writing deficient code and reinventing the wheel. In this course we will present a selection of advanced programming techniques and best practices that are standard in industry, but especially tailored to the needs of a programming scientist. Lectures are devised to be interactive and to give the students enough time to acquire direct hands-on experience with the materials. Students will work in pairs throughout the school and will team up to practice the newly learned skills in a real programming project — an entertaining computer game.

We use the Python programming language for the entire course. Python is simple enough for beginners, but more importantly, it also works great for scientific simulations and data analysis. We show how clean language design, ease of extensibility, and the great wealth of open source libraries for scientific computing and data visualization are driving Python towards becoming a standard tool for scientists.

Who is eligible?

This school is targeted at Master/PhD students and postdocs from all areas of science. Competence in Python or in another language such as Java, C/C++, MATLAB, or Mathematica is absolutely required. Basic knowledge of Python and of a version control system such as git, subversion, mercurial, or bazaar is assumed. Participants without any prior experience with Python and/or git should work through the proposed introductory material before the course.

We have strived to get a pool of students that is international and gender-balanced, and have succeeded, with gender parity in the last four schools.

More questions

If you have any questions, contact aspp@melbournebioinformatics.org.au.

Please circulate this announcement widely!

Thanks,

Juan.

Prettier LowLevelCallables with Numba JIT and decorators

In my recent post, I extolled the virtues of SciPy 0.19’s LowLevelCallable. I did lament, however, that for generic_filter, the LowLevelCallable interface is a good deal uglier than the standard function interface. In the latter, you merely need to provide a function that takes the values within a pixel neighbourhood, and outputs a single value — an arbitrary function of the input values. That is a Wholesome and Good filter function, the way God intended.
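
For example, a bog-standard Python filter function and call might look like this (a toy sketch of my own, not from the original post; the array, function name, and window size are made up for illustration):

>>> import numpy as np
>>> from scipy import ndimage as ndi
>>> tiny = np.random.random((64, 64))
>>> def fmean(values):
...     # values is a 1D array of the pixels in the current window
...     return values.mean()
>>> smoothed = ndi.generic_filter(tiny, fmean, size=3)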

In contrast, a LowLevelCallable takes the following signature:

int callback(double *buffer, intptr_t filter_size, 
             double *return_value, void *user_data)

That’s not very Pythonic at all. In fact, it’s positively Conic (TM). For those that don’t know, pointers are evil, so let’s aim to avoid their use.

“But Juan!”, you are no doubt exclaiming. “Juan! Didn’t you just tell us how to use pointers in Numba cfuncs, and tell us how great it was because it was so fast?”

Indeed I did. But it left a bad taste in my mouth. Although I felt that the tradeoff was worth it for such a phenomenal speed boost (300x!), I was unsatisfied. So I started immediately to look for a tidier solution. One that would let me write proper filter functions while still taking advantage of LowLevelCallables.

It turns out Numba cfuncs can call Numba jitted functions, so, with a little bit of decorator magic, it’s now ludicrously easy to write performant callables for SciPy using just pure Python. (If you don’t know what Numba JIT is, read my earlier post.) As in the last post, let’s look at grey_erosion as a baseline benchmark:

>>> import numpy as np
>>> image = np.random.random((2048, 2048))
>>> fp = np.array([[0, 1, 0],
...                [1, 1, 1],
...                [0, 1, 0]], dtype=bool)
>>> from scipy import ndimage as ndi
>>>
>>> %timeit ndi.grey_erosion(image, footprint=fp)
1 loop, best of 3: 160 ms per loop

Now, we write a decorator that uses Numba jit and Numba cfunc to make a LowLevelCallable suitable for passing directly into generic_filter:

>>> import numba
>>> from numba import cfunc, carray
>>> from numba.types import intc, CPointer, float64, intp, voidptr
>>> from scipy import LowLevelCallable
>>>
>>> def jit_filter_function(filter_function):
...     jitted_function = numba.jit(filter_function, nopython=True)
...     @cfunc(intc(CPointer(float64), intp, CPointer(float64), voidptr))
...     def wrapped(values_ptr, len_values, result, data):
...         values = carray(values_ptr, (len_values,), dtype=float64)
...         result[0] = jitted_function(values)
...         return 1
...     return LowLevelCallable(wrapped.ctypes)

If you haven’t seen decorators before, read this primer from Real Python. To summarise, we’ve written a function that takes as input a Python function, and outputs a LowLevelCallable. Here’s how to use it:

>>> @jit_filter_function
... def fmin(values):
...     result = np.inf
...     for v in values:
...         if v < result:
...             result = v
...     return result

As you can see, the fmin function definition looks just like a normal Python function. All the magic happens when we attach our @jit_filter_function decorator to the top of the function. Let’s see it in action:

>>> %timeit ndi.generic_filter(image, fmin, footprint=fp)
10 loops, best of 3: 92.9 ms per loop

Wow! The numba.jit version is over 70% faster than grey_erosion, and faster still than the plain cfunc approach from my last post!

In case you want to use this, I’ve made a package available on PyPI, so you can actually pip install it right now with pip install llc (for low-level callable), and then:

>>> from llc import jit_filter_function

The source code is on GitHub. Currently it only covers ndi.generic_filter’s signature, and only with Numba, but I hope to gradually expand it to cover all the functions that take LowLevelCallables in SciPy, as well as support Cython. Pull requests are welcome!

SciPy’s new LowLevelCallable is a game-changer

… and combines rather well with that other game-changing library I like, Numba.

I’ve lamented before that function calls are expensive in Python, and that this severely hampers many functions that should be insanely useful, such as SciPy’s ndimage.generic_filter.

To illustrate this, let’s look at image erosion, which is the replacement of each pixel in an image by the minimum of its neighbourhood. ndimage has a fast C implementation, which serves as a perfect benchmark against the generic version, using a generic filter with min as the operator. Let’s start with a 2048 x 2048 random image:

>>> import numpy as np
>>> image = np.random.random((2048, 2048))

and a neighbourhood “footprint” that picks out the pixels to the left and right, and above and below, the centre pixel:

>>> footprint = np.array([[0, 1, 0],
...                       [1, 1, 1],
...                       [0, 1, 0]], dtype=bool)

Now, we measure the speed of grey_erosion and generic_filter. Spoiler alert: it’s not pretty.

>>> from scipy import ndimage as ndi
>>> %timeit ndi.grey_erosion(image, footprint=footprint)
10 loops, best of 3: 118 ms per loop
>>> %timeit ndi.generic_filter(image, np.min, footprint=footprint)
1 loop, best of 3: 27 s per loop

As you can see, with Python functions, generic_filter is unusable for anything but the tiniest of images.

A few months ago, I was trying to get around this by using Numba-compiled functions, but the way to feed C functions to SciPy was different depending on which part of the library you were using. scipy.integrate used ctypes, while scipy.ndimage used PyCObjects or PyCapsules, depending on your Python version, and Numba only supported the former method at the time. (Plus, this topic starts to stretch my understanding of low-level Python, so I felt there wasn’t much I could do about it.)

Enter this pull request to SciPy from Pauli Virtanen, which is live in the most recent SciPy version, 0.19. It unifies all C-function interfaces within SciPy, and Numba already supports this format. It takes a bit of gymnastics, but it works! It really works!

(By the way, the release is full of little gold nuggets. If you use SciPy at all, the release notes are well worth a read.)

First, we need to define a C function of the appropriate signature. Now, you might think this is the same as the Python signature, taking in an array of values and returning a single value, but that would be too easy! Instead, we have to go back to some C-style programming with pointers and array sizes. From the generic_filter documentation:

This function also accepts low-level callback functions with one of the following signatures and wrapped in scipy.LowLevelCallable:

int callback(double *buffer, npy_intp filter_size, 
             double *return_value, void *user_data)
int callback(double *buffer, intptr_t filter_size, 
             double *return_value, void *user_data)

The calling function iterates over the elements of the input and output arrays, calling the callback function at each element. The elements within the footprint of the filter at the current element are passed through the buffer parameter, and the number of elements within the footprint through filter_size. The calculated value is returned in return_value. user_data is the data pointer provided to scipy.LowLevelCallable as-is.

The callback function must return an integer error status that is zero if something went wrong and one otherwise.

(Let’s leave aside that crazy reversal of Unix convention of the past 50 years in the last paragraph, except to note that our function must return 1 or it will be killed.)

So, we need a Numba cfunc that takes in:

  • a double pointer pointing to the values within the footprint,
  • a pointer-sized integer that specifies the number of values in the footprint,
  • a double pointer for the result, and
  • a void pointer, which could point to additional parameters, but which we can ignore for now.

The Numba type names are listed on this page. Unfortunately, at the time of writing, there’s no mention of how to make pointers there, but finding such a reference was not too hard. (Incidentally, it would make a good contribution to Numba’s documentation to add CPointer to the Numba types page.)

So, armed with all that documentation, and after much trial and error, I was finally ready to write that C callable:

>>> from numba import cfunc, carray
>>> from numba.types import intc, intp, float64, voidptr
>>> from numba.types import CPointer
>>> 
>>> 
>>> @cfunc(intc(CPointer(float64), intp,
...             CPointer(float64), voidptr))
... def nbmin(values_ptr, len_values, result, data):
...     values = carray(values_ptr, (len_values,), dtype=float64)
...     result[0] = np.inf
...     for v in values:
...         if v < result[0]:
...             result[0] = v
...     return 1

The only other tricky bits I had to watch out for while writing that function were as follows:

  • remembering that there are two ways to dereference a pointer in C: *ptr, which is not valid Python and thus not valid Numba, and ptr[0]. So, to place the result at the given double pointer, we use the latter syntax. (If you prefer to use Cython, the same rule applies.)
  • creating an array out of the values_ptr and len_values variables, as shown here. That’s what enables the for v in values Python-style access to the array.

Ok, so now what you’ve been waiting for. How did we do? First, to recap, the original benchmarks:

>>> %timeit ndi.grey_erosion(image, footprint=footprint)
10 loops, best of 3: 118 ms per loop
>>> %timeit ndi.generic_filter(image, np.min, footprint=footprint)
1 loop, best of 3: 27 s per loop

And now, with our new Numba cfunc:

>>> %timeit ndi.generic_filter(image, LowLevelCallable(nbmin.ctypes), footprint=footprint)
10 loops, best of 3: 113 ms per loop

That’s right: it’s even marginally faster than the pure C version! I almost cried when I ran that.


Higher-order functions, i.e. functions that take other functions as input, enable powerful, concise, elegant expressions of various algorithms. Unfortunately, these have been hampered in Python for large-scale data processing because of Python’s function call overhead. SciPy’s latest update goes a long way towards redressing this.

Numba in the real world

Numba is a just-in-time compiler (JIT) for Python code focused on NumPy arrays and scientific Python. I’ve seen various tutorials around the web and in conferences, but I have yet to see someone use Numba “in the wild”. In the past few months, I’ve been using Numba in my own code, and I recently released my first real package using Numba, skan. The short version is that Numba is amazing and you should strongly consider it to speed up your scientific Python bottlenecks. Read on for the longer version.

Part 1: some toy examples

Let me illustrate what Numba is good for with the most basic example: adding two arrays together. You’ve probably seen similar examples around the web.

We start by defining a pure Python function for iterating over a pair of arrays and adding them:

In [1]:
import numpy as np


def addarr(x, y):
    result = np.zeros_like(x)
    for i in range(x.size):
        result[i] = x[i] + y[i]
    return result
How long does this take in pure Python?

In [2]:
n = int(1e6)
a = np.random.rand(n)
b = np.random.rand(n)
In [3]:
%timeit -r 1 -n 1 addarr(a, b)
1 loop, best of 1: 721 ms per loop
About 0.7 seconds on my machine. Let’s try with Numba using its JIT decorator:

In [4]:
import numba

addarr_nb = numba.jit(addarr)
In [5]:
%timeit -r 1 -n 1 addarr_nb(a, b)
1 loop, best of 1: 283 ms per loop
The first time it runs, it’s only a tiny bit faster. That’s because of the nature of JITs: they only compile code as it is being run, in order to use object type information of the objects passed into the function. (Note that, in Python, the arguments a and b to addarr could be anything: an array, as expected, but also a list, a tuple, even a Banana, if you’ve defined such a class, and the meaning of the function body is different for each of those types.)
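
As an aside, you can pay this compilation cost up front by giving Numba an explicit signature, which triggers eager compilation. Here is a minimal sketch of my own (not from the original notebook; the signature string assumes float64 input arrays):

# eager compilation: the compile happens here, at definition time,
# so even the first call with float64 arrays runs at full speed
addarr_eager = numba.jit('float64[:](float64[:], float64[:])',
                         nopython=True)(addarr)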

Let’s see what happens the next time we run it:

In [6]:
%timeit -r 1 -n 1 addarr_nb(a, b)
1 loop, best of 1: 6.36 ms per loop
Whoa! Now the code takes about 6 ms, over 100 times faster than the pure Python version. And the NumPy equivalent?

In [7]:
%timeit -r 1 -n 1 a + b
1 loop, best of 1: 5.62 ms per loop
Only marginally faster than Numba, even though NumPy addition is implemented in highly optimised C code. And, for some data types, Numba even beats NumPy:

In [8]:
r = np.random.randint(0, 128, size=n).astype(np.uint8)
s = np.random.randint(0, 128, size=n).astype(np.uint8)
In [9]:
%timeit -r 1 -n 1 r + s
1 loop, best of 1: 2.92 ms per loop
In [10]:
%timeit -r 1 -n 1 addarr_nb(r, s)
1 loop, best of 1: 238 ms per loop
In [11]:
%timeit -r 1 -n 1 addarr_nb(r, s)
1 loop, best of 1: 234 µs per loop
(Note that the first call in In [10] is slow again: Numba is compiling a new version of addarr for the uint8 input type. The second call shows the true speed.) WOW! For smaller data types, Numba beats NumPy by over 10x!

I’m only speculating, but since my clock speed is about 1GHz (I’m writing this on a base Macbook with a 1.1GHz Core-m processor), I suspect that Numba is taking advantage of some SIMD capabilities of the processor, whereas NumPy is treating each array element as an individual arithmetic operation. (If any Numba or NumPy devs are reading this and have more concrete implementation details that explain this, please share them in the comments!)

So hopefully I’ve got your attention now. For years, NumPy has been the go-to library for performance Python in scientific computing. But, if you wanted to do something a little out of the ordinary, you were stuck. Now, Numba generally matches NumPy’s performance for arbitrary code, and sometimes beats it handily!

In this context, I decided to use Numba to do something a little less trivial, as part of my research.

Part 2: Real Numba

I’ll present below a slightly simplified version of the code present in my library, skan, which is currently available on PyPI and conda-forge. The task is to build a graph out of the pixels of a skeleton image, like this one:

In [12]:
%matplotlib inline
In [13]:
import matplotlib.pyplot as plt
plt.rcParams['image.cmap'] = 'gray'
plt.rcParams['image.interpolation'] = 'nearest'
In [14]:
skeleton = np.array([[0, 1, 0, 0, 0, 1, 1],
                     [0, 0, 1, 1, 1, 0, 0],
                     [0, 1, 0, 0, 0, 1, 0],
                     [0, 0, 1, 0, 1, 0, 0],
                     [1, 1, 0, 1, 0, 0, 0]], dtype=bool)
skeleton = np.pad(skeleton, pad_width=1, mode='constant')
In [15]:
fig, ax = plt.subplots(figsize=(5, 5))
ax.imshow(skeleton)
ax.set_title('Skeleton')
ax.set_xlabel('col')
ax.set_ylabel('row')
Out[15]: [figure: the skeleton image]

Every white pixel in the image will be a node in our graph, and we place edges between nodes if the pixels are next to each other (counting diagonals). A natural way to represent a graph in the SciPy world is as a sparse matrix A: we number the nonzero pixels from 1 onwards — these are the rows of the matrix — and then place a 1 at entry A(i, j) when pixel i is adjacent to pixel j. SciPy’s sparse.coo_matrix format makes it very easy to construct such a matrix: we just need an array with the row coordinates and another with the column coordinates.
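
As a toy illustration (my own, not part of the original notebook), here is a three-pixel path graph, 1-2-3, with each edge entered in both directions:

from scipy import sparse

toy_row = np.array([1, 2, 2, 3])
toy_col = np.array([2, 1, 3, 2])
toy = sparse.coo_matrix((np.ones_like(toy_row), (toy_row, toy_col)))
print(toy.toarray())
# [[0 0 0 0]
#  [0 0 1 0]
#  [0 1 0 1]
#  [0 0 1 0]]

Row and column 0 are empty because the labels start at 1, which is also what we will do below.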

Because NumPy arrays are not dynamically resizable like Python lists, it helps to know ahead of time how many edges we are going to need to put in our row and column arrays. Thankfully, a well-known theorem of graph theory states that the number of edges of a graph is half the sum of the degrees. In our case, because we want to add the edges twice (once from i to j and once from j to i), we just need the sum of the degrees exactly. We can find this out with a convolution using scipy.ndimage:

In [16]:
from scipy import ndimage as ndi

neighbors = np.array([[1, 1, 1],
                      [1, 0, 1],
                      [1, 1, 1]])

degrees = ndi.convolve(skeleton.astype(int), neighbors) * skeleton
In [17]:
fig, ax = plt.subplots(figsize=(5, 5))
result = ax.imshow(degrees, cmap='magma')
ax.set_title('Skeleton, colored by node degree')
ax.set_xlabel('col')
ax.set_ylabel('row')
cbar = fig.colorbar(result, ax=ax, shrink=0.7)
cbar.set_ticks([0, 1, 2, 3])
cbar.set_label('degree')
There you can see “tips” of the skeleton, with only 1 neighbouring pixel, as purple, “paths”, with 2 neighbours, as red, and “junctions”, with 3 neighbors, as yellow.

Now, consider the pixel at position (1, 6). It has two neighbours (as indicated by its colour): (2, 5) and (1, 7). If we number the nonzero pixels as 1, 2, …, n from left to right and top to bottom, then this pixel has label 2, and its neighbours have labels 6 and 3. We therefore need to add edges (2, 3) and (2, 6) to the graph. Similarly, when we consider pixel 6, we will add edges (6, 2), (6, 5), and (6, 8).

In [18]:
fig, ax = plt.subplots(figsize=(5, 5))
result = ax.imshow(degrees, cmap='magma')
cbar = fig.colorbar(result, ax=ax, shrink=0.7)
cbar.set_ticks([0, 1, 2, 3])

nnz = len(np.flatnonzero(degrees))
pixel_labels = np.arange(nnz) + 1
for lab, y, x in zip(pixel_labels, *np.nonzero(degrees)):
    ax.text(x, y, lab, horizontalalignment='center',
            verticalalignment='center')

ax.set_xlabel('col')
ax.set_ylabel('row')
ax.set_title('Skeleton, with pixel IDs')
cbar.set_label('degree')
Scanning over the whole image, we see that we need row and col arrays of length exactly np.sum(degrees).

In [19]:
n_edges = np.sum(degrees)
row = np.empty(n_edges, dtype=np.int32)  # type expected by scipy.sparse
col = np.empty(n_edges, dtype=np.int32)
The final piece of the puzzle is finding neighbours. For this, we need to know a little about how NumPy stores arrays. Even though our array is 2-dimensional (rows and columns), these are all arrayed in a giant line, each row placed one after the other. (This is called “C-order”.) If we index into this linearised array (“raveled”, in NumPy’s language), we can make sure that our code works for 2D, 3D, and even higher-dimensional images. Using this indexing, neighbouring pixels to the left and right are accessed by subtracting or adding 1 to the current index. Neighbouring pixels above and below are accessed by subtracting or adding the length of a whole row. Finally, diagonal neighbours are found by combining these two. For simplicity, we only show the 2D version below:

In [20]:
def neighbour_steps(shape):
    step_sizes = np.cumprod((1,) + shape[-1:0:-1])
    axis_steps = np.array([[-1, -1],
                           [-1,  1],
                           [ 1, -1],
                           [ 1,  1]])
    diag = axis_steps @ step_sizes
    steps = np.concatenate((step_sizes, -step_sizes, diag))
    return steps
In [21]:
steps = neighbour_steps(degrees.shape)
print(steps)
[  1   9  -1  -9 -10   8  -8  10]
Of course, if we use these steps near the right edge of the image, we’ll wrap around, and mistakenly think that the first element of the next row is a neighbouring pixel! Our solution is to only process nonzero pixels, and make sure that we have a 1-pixel-wide “pad” of zero pixels — which we do, in the image above!

Now, we iterate over image pixels, look at neighbors, and populate the row and column vectors.

In [22]:
def build_graph(labeled_pixels, steps_to_neighbours, row, col):
    start = np.max(steps_to_neighbours)
    end = len(labeled_pixels) - start
    elem = 0  # row/col index
    for k in range(start, end):
        i = labeled_pixels[k]
        if i != 0:
            for s in steps_to_neighbours:
                neighbour = k + s
                j = labeled_pixels[neighbour]
                if j != 0:
                    row[elem] = i
                    col[elem] = j
                    elem += 1
In [23]:
skeleton_int = np.ravel(skeleton.astype(np.int32))
skeleton_int[np.nonzero(skeleton_int)] = 1 + np.arange(nnz)
In [24]:
%timeit -r 1 -n 1 build_graph(skeleton_int, steps, row, col)
1 loop, best of 1: 917 µs per loop
Now we try the Numba version:

In [25]:
build_graph_nb = numba.jit(build_graph)
In [26]:
%timeit -r 1 -n 1 build_graph_nb(skeleton_int, steps, row, col)
1 loop, best of 1: 346 ms per loop
In [27]:
%timeit -r 1 -n 1 build_graph_nb(skeleton_int, steps, row, col)
1 loop, best of 1: 14.3 µs per loop
Nice! We get more than a 50-fold speedup using Numba, and this operation would have been difficult if not impossible to convert to a NumPy vectorized operation! We can now build our graph:

In [28]:
from scipy import sparse
G = sparse.coo_matrix((np.ones_like(row), (row, col))).tocsr()
As to what to do with said graph, I’ll leave that for another post. (You can also peruse the skan source code.) In the meantime, though, you can visualize it with NetworkX:

In [29]:
import networkx as nx

Gnx = nx.from_scipy_sparse_matrix(G)
Gnx.remove_node(0)

nx.draw_spectral(Gnx, with_labels=True)
There’s our pixel graph! Obviously, the speedup and n-d support are important for bigger, 3D volumes, not for this tiny graph. But they are important, and, thanks to Numba, easy to obtain.

Conclusion

I hope I’ve piqued your interest in Numba and encouraged you to use it in your own projects. I think the future success of Python in science depends heavily on JITs, and Numba is a strong contender to be the default JIT in this field.

Note: This post was written using Jupyter Notebook. You can find the source notebook here.

Trump’s win

Like many of you, I watched in horror two days ago as the night unfolded, and the unthinkable slowly came to pass. After a Netflix binge to try to numb the fear, I dived into a clickhole of social media posts and news articles to try to make sense of what had happened. I hope that writing a synthesis of that will let me get on with my life in this brave new world.

I am deeply, depressingly pessimistic about the future of the planet under Trump. Let’s take the very best, most ludicrously optimistic scenario: that Trump swings to the center [1] and doesn’t make good on his many horrid promises. Even then, his election, and the Republicans’ victory in the House and Senate, represent game over in the fight to avoid climate change [2].

Like many of us, I’d buried my head in the sand about this, even after I read Michael Moore’s again-famous essay predicting Trump’s victory [3], which is well worth a read. After that, here are some choice quotes from essays I recommend about how this nightmare came to be. From Glenn Greenwald’s The Ongoing, Dangerous Refusal to Learn the Lesson of Brexit [4]:

When a political party is demolished, the principle responsibility belongs to one entity: the party that got crushed. It’s the job of the party and the candidate, and nobody else, to persuade the citizenry to support them and find ways to do that. Last night, the Democrats failed, resoundingly, to do that, and any autopsy or liberal think piece or pro-Clinton pundit commentary that does not start and finish with their own behavior is one that is inherently worthless.

Democrats got complacent and forgot a huge bloc of voters, instead catering to us upper-middle-class city dwellers. Linked from that article, Vincent Bevins’s post-Brexit Facebook post [5] expresses the core of their failure well:

Both Brexit and Trumpism are the very, very, wrong answers to legitimate questions that urban elites have refused to ask for thirty years.

Questions such as – Who are the losers of globalization, and how can we spread the benefits to them and ease the transition? Is it fair that the rich can capture almost all the gains of open borders and trade, or should the process be more equitable?

I also liked this paragraph from David Wong’s Don’t Panic [6]:

The truth is, most of Trump’s voters voted for him despite the fact that he said/believes awful things, not because of it. That in no way excuses it, but I have to admit I’ve spent eight years quietly tuning out news stories about drone strikes blowing up weddings in Afghanistan. […] [Trump supporters] look out their front door and see painkiller addicts and closed factories. They believe that nobody in Washington gives a shit about them, mainly because that’s 100-percent correct.

Now for the really depressing part: First, I don’t think Trump will fix these people’s problems. Despite the anti-establishment rhetoric, what he’ll deliver is more tax cuts for the rich. And second, I don’t see a plausible way out for the US. Its electoral system is completely ridiculous, and it’s going to get worse, much worse, before it gets better. As in 2000, the popular vote went to Hillary Clinton, but the Electoral College vote went easily to Trump. The lack of preferential voting also meant that Hillary could have won in various battleground states where her loss margin was much smaller than the number of votes for third-party candidates. (Again, as in 2000.) Finally, gerrymandering has handed the House of Representatives to Republicans for the foreseeable future, despite more people voting Democrat than Republican in most elections.

But, what incentive do the people in power have to change the electoral rules?

Paul Krugman’s dark closing for that night, Our Unknown Country [7], summed it up best:

Is America a failed state and society? It looks truly possible.

Finally, many of you reading this will be in other countries thinking, it could not happen here. But it will, without action. Without massive change from the left-leaning parties of the world. Michael Moore’s warning applies almost everywhere. Here in Australia, the Labor party has egregiously followed the right-wing Coalition in their repugnant asylum-seeker policy (our very own “build that wall” is “stop the boats”), and in their mass surveillance policies. Much like the Democrats, they have only themselves to blame for their loss in the last election, earlier this year.

The other day, the news reported on all the people who are losing their jobs here in my home town of Geelong as the Ford assembly line closes. These are the people we cannot forget.

I don’t yet know what I can do about all this. But it’s clear that writing snarky Facebook posts either being outraged or mocking the government will solve nothing. We have to do better, and we have to really be out there, not in here.

  1. Trump on Hillary Clinton in 2008.

  2. Scientists say that the next decade will require dramatic emissions reductions, worldwide, if we are to keep warming to a reasonable level. The United States accounts for an enormous proportion of these emissions, and will now spend 4-8 years, at an absolute minimum, doing absolutely nothing — indeed, probably helping the oil industry. One day after his election, Trump has already selected a climate sceptic to head the EPA: https://www.scientificamerican.com/article/trump-picks-top-climate-skeptic-to-lead-epa-transition/

  3. Michael Moore: Trump will win.

  4. Glenn Greenwald: Democrats, Trump, and the ongoing, dangerous refusal to learn the lesson of Brexit.

  5. Vincent Bevins on Brexit.

  6. David Wong: Don’t Panic

  7. Paul Krugman: Our Unknown Country.

Review: Sony Digital Paper

Three years ago I excitedly posted about Sony’s then-new writeable e-paper tablet, called Sony Digital Paper System (DPS-1).

Now it is finally mine and I love it.

Here’s what I wrote about it when Sony announced it:

the iPad (et al) sucks for some things. Three of those are: (1) taking handwritten notes, (2) reading (some) pdfs in full-page view, and (3) reading in full daylight. By the sound of it, Sony’s new tablet will excel at all three

Having had it for about a month, I can confidently say that it does indeed excel at those three things, beyond my wildest dreams. Even with the improved competition of the identically-priced iPad Pro (which can now handle points 1 and 2 with aplomb), I still prefer the Sony. Here’s why:

  • Despite a comparatively low resolution (1200 x 1600), e-ink is simply nicer on the eyes. To see why, look at this old post comparing an original iPad display and a Kindle e-ink display under a microscope. (More modern Retina displays are only marginally better, see here.) Here’s a zoomed-in shot of a paper on the DPS-1 [image: “Like paper”]: I can barely tell that it’s not just a slightly low-res print.
  • And of course, for reading outdoors, e-ink is just infinitely better. Try this on an iPad if you’re craving a good cry.
  • The Apple Pencil has received glowing reviews, but I’ve tried it, and it still feels decidedly like sliding on glass. The DPS’s stylus and matte screen combine to create friction that feels remarkably like pencil-on-paper.
  • In today’s distraction-filled digital world, disconnecting is an advantage. This will matter more or less depending on your work discipline, but for me it has been life-changing. The context switch that happens when I start to work on the DPS keeps me focused at a level I hadn’t experienced for years. You can certainly use “Do not disturb” on an iPad, but having distracting apps such as email a double-tap away is a definite downside.

The DPS is one of those rare products that does one thing and one thing only (well, two) really well: read and annotate pdfs, and take handwritten notes. It’s simply perfect for academics.

There’s one caveat and it’s the software. It is, in a word, amateurish. A few examples:

  • Cloud Sync works through WebDAV, a file transfer protocol with limited support from cloud storage providers (of the major players, only Box supports it as of this writing).
  • You can screen share with the DPS through a USB cable, which is great for giving pdf presentations, but it’s done through a companion Mac OS app distributed as a java archive, which doesn’t support full-screen.
  • You can make and delete files on the DPS, but you can’t move them to other folders.
  • And so on.

The funny thing is that it gets regular software updates, but none attains the level of polish you might expect from a company of Sony’s stature. I have a feeling that there’s this one engineer in charge of this thing at Sony, and they are just hammering away by themselves, unsupported, but trying their darnedest to make it better all the time.

In short, I think Sony’s development and marketing teams dropped the ball on this one. In its early days you couldn’t buy it at retail stores — you actually had to write to Sony to explain why you wanted one! I imagine they wanted to avoid negative press from consumers who didn’t know what they were getting into. And even now, retail availability is extremely limited. Just two stores carry it in the US (B&H Photo and CDW). In many countries you can’t buy it at all, except shipped from those US stores.

Sony really needs to put these babies on demo at every university bookshop in the rich world. (At $800 US, I’ll admit it’s a luxury.) It would sell like hotcakes.

In short, if you read a lot of scientific papers, or do a lot of handwriting (e.g. for math), you will love the Digital Paper. I second what my friend @gamesevolving said: I should have gotten it a long time ago.

Why scientists should code in the open

All too often, I encounter published papers in which the code is “available upon request”, or “available in the supplementary materials” (as a zip file). This is not just poor form. It also hurts your software’s future. (And, in my opinion, when results depend on software, it is inexcusable.)

Given the numerous options for posting code online, there’s just no excuse to provide code in a less-than-convenient format upon publication. When you publish, put your code on GitHub or Bitbucket.

In this piece, I’ll go even further: put your code there from the beginning. Put your code there as soon as you finish reading this article. Here’s why:

No, you won’t get scooped

Reading code is hard. Ask any experienced programmer: most have trouble reading code they themselves wrote a few months ago, let alone someone else’s code. It’s extremely unlikely that someone will browse your code looking for a scoop. That time is better spent doing research.

It’s never going to be ready

Another thing I hear is that people want to post their code, but they want to clean it up first and remove all the “embarrassing” bits. Unfortunately, science doesn’t reward time spent “cleaning up” your code, at least not yet. So the sad reality is that you probably will never actually get to the point where you are happy to post your code online.

But here’s the secret: everybody is in that boat with you. That’s why this document exists. I recommend you read it in full, but this segment is particularly important:

When it comes time to empirically evaluate new research with respect to someone else’s prior work, even a rickety implementation of their work can save grad-student-months, or even grad-student-years, of time.

Matt Might himself is as thorough and high-profile as you get in computer science, and yet, he has this to say about code clean-up:

I kept telling myself that I’d clean it all up and release it some day.

I have to be honest with myself: this clean-up is never going to happen.

Your code might not meet your standards, but, believe it or not, your code will help others, and the sooner it’s out there, the sooner they can be helped.

You will gain collaborators and citations

If anyone is going to be rifling through your code, they will probably end up asking for your help. This happens with even the best projects: have a look at the activity on the mailing lists for scikit-learn or NumPy, two of the best-maintained open-source projects out there.

When you have to go back and explain how a piece of code worked, that’s when you will actually take the time to clean it up. In the process, the person you help will be more likely to contribute to your project, whether in code, bug reports, improvement suggestions, or even citations.

In the case of my own gala project, I guess that about half of the citations it received happened because of its open-source code and open mailing list.

Your coding ability will automagically improve

I first heard this one from Albert Cardona. They say sunlight is the best disinfectant, and this is certainly true of code. Just the very idea that anyone can easily read their code will make most people more careful when programming. Over time, this care will become second nature, and you will develop a taste for nice, easy-to-read code.

In short, the alleged downsides of code-sharing are, at best, longshots, while there are many tangible upsides. Put your code out there. (And use a liberal open-source license!)

The cost of a Python function call

I’ve read in various places that the Python function call overhead is very high. As I was parroting this “fact” to Ed Schofield recently, he asked me what the cost of a function actually was. I had no idea. This prompted us to do a few quick benchmarks.

The short version is that it takes about 150ns to call a function in Python (on my laptop). This doesn’t sound like a lot, but it means that you can make at most 6.7 million calls per second, two to three orders of magnitude slower than your processor’s clock speed.

If you want your function to do something, such as, oh, I don’t know, receive an input argument, this goes up to 350ns, throttling you at 2.8 million calls per second.

Benchmarking function calls

I cleaned up Ed’s and my initial experiments to make a small module and timer to measure all these values. You can clone the repo and run python function-calls/timer.py to check the numbers on your machine.

The benchmarks are variations of comparing the execution time of:

for i in range(n):
    pass

and:

def f():
    pass

for i in range(n):
    f()

for some suitably large n. As I mentioned above, that comes out to an absolute minimum of 150ns per function call.
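
If you don’t want to clone the repo, here is a minimal sketch of the same measurement using the standard timeit module (my own approximation of the benchmark, not the repo’s code; the numbers will vary by machine):

import timeit

n = 10**7

def f():
    pass

t_loop = timeit.timeit('for i in range(n): pass',
                       globals={'n': n}, number=1)
t_call = timeit.timeit('for i in range(n): f()',
                       globals={'n': n, 'f': f}, number=1)
# subtract the bare loop to isolate the per-call overhead
print('{:.0f} ns per call'.format((t_call - t_loop) / n * 1e9))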

What this means

I’ve been making a fuss over the past year about the excellent Toolz and the way it enables elegant streaming data processing. (See my demo repo and my EuroSciPy talk.) You can read data from a modern SSD at speeds approaching 500MB/s. If you want to stream each byte through Python functions, you’ll instantly lose two orders of magnitude of speed. And, the more functions you use, the slower you’ll go, which discourages functional programming and modularity — the very things I was trying to promote!

In the DNA sequence processing I demo in the talk, I get a throughput of about 0.5MB/s. On one hand, this is kind of OK because we are using effectively zero RAM, so we can just let the code run over lunch. On the other, it’s starting to bug me that 99% of my processor time is spent on Python function calls, rather than on actual data crunching.

This is a problem for Python. To work on seriously big data, you need to drop into a library written in C, such as NumPy or Pandas. You need to do this on a high level: any per-byte or per-data-element processing cannot be in Python, if you don’t want to waste your processor’s cycles. Python’s ecosystem is Insanely Great, so this is mostly fine, but it does limit your ability to research or implement cool new methods using Python.

As an example, the generic_filter function in SciPy’s ndimage package has infinitely many cool uses, but using it to process a 100MB image (which is small in biology) would take 15 seconds in function call overhead alone (100 million pixels × 150 ns ≈ 15 s). Lest you think this is reasonable, SciPy’s greyscale erosion, implemented in C, takes less than 4 seconds on an image that size. A lot of my once-lackadaisical attitude towards Python performance stemmed from not knowing how long things should take. A lot less than they do, it turns out.

What to do about it

As I mentioned, Python’s high performance libraries are many and great. Look hard for optimised libraries that already do what you want. Try to express what you want to do as combinations of functions from NumPy, SciPy, Pandas, scikit-image, scikit-learn, and so on. Minimise the amount of time spent in Python. This is advice that you learn early on in scientific Python programming, but I didn’t appreciate just how important it is.

At some point, that approach will fail, and you will want to do something cute and custom with your data points. Reach for Cython sooner rather than later. As a primer, I recommend Stefan Behnel’s excellent tutorial from EuroSciPy 2015.

There is also Continuum’s Numba, which is sometimes easier to use than Cython. I don’t have any experience with it so I can’t comment much here. However, I’d consider it a very valuable project to implement generic_filter in Numba.

In the long run, these are all workarounds, and I hope that the Python interpreter itself becomes faster, though there are few signs of that happening.

If you have other ideas on how to get around Python’s function call cost, please let me know in the comments!

My first use of Python 3’s `yield from`!

I never really understood why yield from was useful. Last weekend, I wanted to use Python 3.5’s new os.scandir to explore a directory (and its subdirectories). Tragically, os.scandir is not recursive, and I find os.walk’s 3-tuple values obnoxious.
Lo and behold, while I was trying to implement a recursive version of scandir, a yield from use just popped right out!

import os
def rscandir(path):
    for entry in os.scandir(path):
        yield entry
        if entry.is_dir():
            yield from rscandir(entry.path)
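
Using it is a one-liner (a hypothetical quick check; it prints every path under the current directory):

for entry in rscandir('.'):
    print(entry.path)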

That’s it! I have to admit that reads wonderfully. The Legacy Python (aka Python 2.x) alternative is quite a bit uglier:

import os
def rscandir(path):
    for name in os.listdir(path):
        p = os.path.join(path, name)  # listdir yields bare names, not paths
        yield p
        if os.path.isdir(p):
            for q in rscandir(p):
                yield q

Yuck. So, yet again: time to move away from Legacy Python! ;)

EuroSciPy 2015 debrief

The videos from EuroSciPy 2015 are up! This marks a good time to write up my thoughts on the conference.

I’ve mentioned before that the yearly SciPy conference is stunningly useful. This year I couldn’t make it to Austin, but I did attend EuroSciPy, the European version of the same conference, in Cambridge, UK. It was spectacular.

Useful talks

The talk of the conference, for me, goes to Robin Wilson for recipy, which one can describe as a logging utility, if one wishes to make it sound as uninspiring as possible. Recipy’s strength is in its mind-boggling simplicity. Here is the unabridged usage guide:

import recipy

With this single line, your script will now generate an entry in a database every time it is run. It logs the start and end time, the working directory, the script’s git hash, any differences between the working copy and the last git commit (!), and the names of any input and output files. (File hashes are coming soon, I’m assured.)

I don’t know about you, but I have definitely lost count of the times I’ve looked at a file and wondered what script I ran to get it, or the input data that went into it. This library solves that problem with absolutely minimal friction for the user.

I also enjoyed Nicolas Rougier’s talk on ReScience, a new journal dedicated to replicated (and replicable) scientific analyses. It’s a venue to publish all those efforts to replicate a result you read in a paper. Given recent findings about how poorly most papers replicate, I think this is a really important outlet.

The other remarkable thing about it is that all review is open and done in the spirit of open source, on GitHub. Submission is by pull request, of course. With just one paper out so far, it’s a bit early to tell whether it’ll take off, but I really hope it does. I’ll be looking for stuff of my own to publish there, for sure. (Oh, and by the way, they are looking for reviewers and editors!)

Another great talk was Philipp Rudiger’s on HoloViews, an object-oriented plotting framework. They define an arithmetic on figures: A * B overlays figure B on A, while B + C creates two subplots out of B and C (and automatically labels them). Their example notebooks rely a lot on IPython magic, which I’m not happy about and which means I haven’t fully grokked the API, but it seems like a genuinely useful way to think about plotting.
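
To give a flavour of that arithmetic, here is a minimal sketch of my own (not from Philipp’s talk, and the exact incantations may differ between HoloViews versions):

import numpy as np
import holoviews as hv
hv.extension('matplotlib')  # choose a plotting backend

xs = np.linspace(0, 2 * np.pi, 100)
sine = hv.Curve((xs, np.sin(xs)), label='sine')
cosine = hv.Curve((xs, np.cos(xs)), label='cosine')

overlay = sine * cosine   # both curves on the same axes
layout = sine + cosine    # two side-by-side, automatically labelled subplots

In a notebook, displaying overlay or layout renders the corresponding figure.
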
A final highlight from the main session was Martin Weigert on Spimagine, his GPU-accelerated, 5D image analysis and visualisation framework. It was stupidly impressive. Although it’s a long-term project, I’m inclined to try to incorporate many of its components into scikit-image.

Tutorials

The tutorials are a great asset of both EuroSciPy and SciPy. I learn something new every year. The highlight for me was the Cython tutorial, in which Stefan Behnel demonstrated how easy it is to provide Python access to C++ code using Cython. (I have used Cython quite extensively, but only to speed up Python code, rather than wrap C or C++ code.)

Sprints

I was feeling a bit hypocritical for missing the sprints this year, since I had to run off before the Sunday. Emmanuelle Gouillart, another scikit-image core dev, suggested having a small, unofficial sprint on Friday evening. It grew and grew into a group of about 30 people (including about 10 new to sprinting) who all gathered at the Enthought Cambridge office to work on scikit-image or the SciPy lecture notes. A brilliant experience.

[Image: scikit-image sprint at Enthought]

(By the way, nothing creepy going on with that dude hunching over one of our sprinters — that’s just husband-and-wife team Olivia and Robin Wilson! ;)

Final thoughts

As usual, I learned heaps and had a blast at this SciPy conference (my fourth). I hope it will remain a yearly ritual, and I hope someone reading this will give it a try next year!