Continuous integration in Python, 7: some helper tools and final thoughts

It’s time to draw my “continuous integration in Python” series to a close. This final post ties the six previous posts together; it is the one to share more widely and to leave feedback on.

Almost everything I know about good Python development I’ve learned from Stéfan van der Walt, Tony Yu, and the rest of the scikit-image team. But a few weeks ago, I was trying to emulate the scikit-image CI process for my own project: cellom2tif, a tool to liberate images from a rather useless proprietary format. (I consider this parenthetical comment sufficient fanfare to announce the 0.2 release!) As I started copying and editing config files, I found that even from a complete template, getting started was not very straightforward. First, scikit-image has much more complicated requirements, so that a lot of the .travis.yml file was just noise for my purposes. And second, as detailed in the previous posts, a lot of the steps are not found or recorded anywhere in the repository, but rather must be navigated to on the webpages of GitHub, Travis, and Coveralls. I therefore decided to write this series as both a notetaking exercise and a guide for future CI novices. (Such as future me.)

To recap, here are my six steps to doing continuous integration in Python with pytest, Travis, and Coveralls:

If you do all of the above at the beginning of your projects, you’ll be in a really good place one, two, five years down the line, when many academic projects grow far beyond their original scope in unpredictable ways and end up with a lot of broken code. (See this wonderful editorial by Zeeya Merali for much more on this topic.)

Reducing the boilerplate with PyScaffold

But it’s a lot of stuff to do for every little project. I was about to make myself some minimal setup.cfg and .travis.yml template files so that I could have these ready for all new projects, when I remembered PyScaffold, which sets up a Python project’s basic structure automatically (setup.py, package_name/__init__.py, etc.). Sure enough, PyScaffold has a --with-travis option that implements all my recommendations, including pytest, Travis, and Coveralls. If you set up your projects with PyScaffold, you’ll just have to turn on Travis-CI on your GitHub repo admin and Coveralls on coveralls.io, and you’ll be good to go.

When Travises attack

I’ve made a fuss about how wonderful Travis-CI is, but it breaks more often than I’d like. You’ll make some changes locally, and ensure that the tests pass, but when you push them to GitHub, Travis fails. This can happen for various reasons:

  • your environment is different (e.g. NumPy versions differ between your local build and Travis’s VMs).
  • you’re testing a function that depends on random number generation and have failed to set the seed.
  • you depend on some web resource that was temporarily unavailable when you pushed.
  • Travis has updated its VMs in some incompatible way.
  • you have more memory/CPUs locally than Travis allows.
  • some other, not-yet-understood-by-me reason.

Of these, the first three are acceptable, because each has a ready fix. You can use conda to match your environments both locally and on Travis, and you should always set the seed for randomised tests. For network errors, Travis provides a special function, travis_retry, that you can prefix your commands with.
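As a minimal sketch of what “setting the seed” can look like in a pytest-style test: the function noisy_mean below is hypothetical (it is not from cellom2tif or scikit-image), and I use the standard library’s random module rather than NumPy only to keep the example self-contained. The idea is to pass in an explicitly seeded generator, so the test sees the same “random” numbers locally and on Travis.

```python
import random


def noisy_mean(values, n_samples, rng):
    """Mean of a random subsample of `values` (hypothetical example function).

    `rng` is a random.Random instance, so the caller controls the seed.
    """
    sample = [rng.choice(values) for _ in range(n_samples)]
    return sum(sample) / n_samples


def test_noisy_mean_is_reproducible():
    values = list(range(100))
    # Two generators with the same seed produce the same sequence,
    # so the result no longer depends on when or where the test runs.
    first = noisy_mean(values, 10, random.Random(42))
    second = noisy_mean(values, 10, random.Random(42))
    assert first == second
```

Passing the generator in as an argument, rather than seeding the global state, also keeps one test’s seed from leaking into another.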

Travis VM updates should theoretically be benign and not cause any problems, but, in recent months, they have been a significant source of pain for the scikit-image team: every monthly update by Travis broke our builds. That’s disappointing, to say the least. For simple builds, you really shouldn’t run into this. But for major projects, this is an unnecessary source of instability.

Further, Travis VMs don’t have unlimited memory and disk space for your builds (naturally), but the limits are not strictly defined (unnaturally). This means that builds requiring “some” memory or disk space randomly fail. Again, disappointing. Travis could, for example, guarantee some minimal specs that everyone could program against — and request additional space either as special exemptions or at a cost.

Finally, there are the weird failures. I don’t have any examples on hand, but I’ll just note that sometimes Travis builds fail while your local copy works fine every single time. Sometimes rebuilding fixes things, and other times you have to change some subtle but apparently inconsequential thing before the build is fixed. These would be mitigated if Travis allowed you to clone their VM images so you could run them on a local VM or on your own EC2 allocation.

[Figure: Heisenbug. A too-common Travis occurrence: randomly failing tests.]

In all though, Travis is a fantastic resource, and you shouldn’t let my caveats stop you from using it. They are just something to keep in mind before you pull all your hair out.

The missing test: performance benchmarks

Testing helps you maintain the correctness of your code. However, as Michael Droettboom eloquently argued at SciPy 2014, all projects are prone to feature creep, which can progressively slow code down. Airspeed Velocity is to benchmarks what pytest is to unit tests, and allows you to monitor your project’s speed over time. Unfortunately, benchmarks are a different beast to tests, because you need to keep the testing computer’s specs and load constant for each benchmark run. Therefore, a VM-based CI service such as Travis is out of the question.
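For a flavour of what a benchmark looks like, here is a minimal sketch of an Airspeed Velocity benchmark file. It relies on asv’s convention that methods whose names start with time_ are timed and that setup runs beforehand without being timed; the class name, the data, and the operation being timed are all made up for illustration.

```python
import random


class SortingSuite:
    """Hypothetical asv benchmark suite; asv discovers the time_* methods."""

    def setup(self):
        # Runs before each benchmark and is excluded from the timing.
        # A fixed seed keeps the workload identical from run to run.
        rng = random.Random(0)
        self.data = list(range(10000))
        rng.shuffle(self.data)

    def time_sort_reversed(self):
        # Only the body of this method is timed.
        sorted(self.data, reverse=True)
```

Nothing here imports asv itself: the tool finds and runs the suite, which is why the file reads like an ordinary module.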

If your project has any performance component, it may well be worth investing in a dedicated machine solely for running benchmarks. The machine could monitor your GitHub repo for changes and PRs, check them out when they come in, run the benchmarks, and report back. I have yet to do this for any of my projects, but I will seriously consider it in the future.

Some reservations about GitHub

The above tools all work great as part of GitHub’s pull request (PR) development model. It’s a model that is easy to grok, works well with new programmers, and has driven massive growth in the open-source community. Lately, I recommend it with a bit more trepidation than I used to, because it does have a few high-profile detractors, notably Linux and git creator Linus Torvalds, and OpenStack developer Julien Danjou. To paraphrase Julien, there are two core problems with GitHub’s chosen workflow, both of which are longstanding and neither of which shows any sign of improving.

First, comments on code diffs are buried by subsequent changes, whether the changes are a rebase or they simply change the diff. This makes it very difficult for an outside reviewer to assess what discussion, if any, resulted in the final/latest design of a PR. This could be a fairly trivial fix (colour-code outdated diffs, rather than hiding them), so I would love to see some comments from GitHub as to what is taking so long.

[Figure: GitHub's hidden PR comments. Expect to see a lot of these when using pull requests.]

Second, bisectability is broken by fixup commits. The GitHub development model is not only geared towards small, incremental commits being piled on to a history, but it actively encourages these with its per-commit badging of a user’s contribution calendar. Fixup commits make bug hunting with git bisect more difficult, because some commits will not be able to run a test suite at all. This could be alleviated by considering only commits merging GitHub PRs, whose commit messages start with Merge pull request #, but I don’t know how to get git to do this automatically (ideas welcome in the comments).
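As a partial step in that direction, here is a rough sketch that merely lists the commits whose subject starts with Merge pull request #. It does not drive git bisect; it only produces the candidate commits you might test by hand. The function names are my own, and I am assuming git is on the path and that each log line has the form “hash subject”, which is what git log --format="%H %s" prints.

```python
import subprocess

MERGE_PREFIX = "Merge pull request #"


def pr_merge_commits(log_lines):
    """Return the hashes of commits whose subject starts with MERGE_PREFIX.

    Each element of `log_lines` is expected to look like "<hash> <subject>".
    """
    hashes = []
    for line in log_lines:
        commit_hash, _, subject = line.partition(" ")
        if subject.startswith(MERGE_PREFIX):
            hashes.append(commit_hash)
    return hashes


def read_git_log(repo_path="."):
    """Read "<hash> <subject>" lines from the repository at `repo_path`."""
    result = subprocess.run(
        ["git", "log", "--format=%H %s"],
        cwd=repo_path,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.splitlines()
```

Calling pr_merge_commits(read_git_log()) inside a repository would then give the PR merge commits, newest first.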

I disagree with Julien that there is “no value in the social hype [GitHub] brings.” In fact, GitHub has dramatically improved my coding skills, and no doubt countless others’. For many, it is their first experience with code review. Give credit where it is due: GitHub is driving the current, enormous wave of open-source development. But there is no doubt it needs improvement, and it’s sad to see GitHub’s developers apparently ignoring their critics. I hope the latter will be loud enough soon that GitHub will have no choice but to take notice.

Final comments

This series, including this post, sums up my current thinking on CI in Python. It’s surely incomplete: I recently came across a curious “Health: 88%” badge on Mitchell Stanton-Cook’s BanzaiDB README. Clicking it took me to the project’s landscape.io page, which appears to do for coding style what Travis does for builds/tests and Coveralls does for coverage. How it measures “style” is not yet clear to me, but it might be another good CI tool to keep track of. Nevertheless, since it’s taken me a few years to get to this stage in my software development practice, I hope this series will help other scientists get there faster.

If any more experienced readers think any of my advice is rubbish, please speak up in the comments! I’ll update the post(s) accordingly. CI is a big rabbit hole and I’m still finding my way around.
