Skip to main content

Experiences porting a medium-sized library from Python 2 to 3

Update: Much of the information in this post is outdated (especially the part about Python 3 being slower — Python 3.7 is the fastest version of Python ever created.). Take everything you read here with a grain of salt.

Prompted in part by some discussions with Ed Schofield, creator of python-future.org, I've been going on a bit of a porting spree to Python 3. I just finished with my gala segmentation library. (Find it on GitHub and ReadTheDocs.) Overall, the process is nowhere near as onerous as you might think it is. Getting started really is the hardest part. If you have more than yourself as a user, you should definitely just get on with it and port.

The second hardest part is the testing. In particular, you will need to be careful with dictionary iteration, pickled objects, and file persistence in general. I'll go through these gotchas in more detail below.

Reminder: the order of dictionary items is undefined

This is one of those duh things that I forget over and over and over. In my porting, some tests that depended on a scikit-learn RandomForest object were failing. I assumed that there was some difference between the random seeding in Python 2 and Python 3, leading to slightly different models between the two versions of the random forest.

This was a massive red herring that took me forever to figure out. In actuality, the seeding was completely fine. However, gala uses networkx as its graph backend, which itself uses an adjacency dictionary to store edges. So when I asked for graph.edges() to get a set of training examples, I was getting the edges in a random order that was deterministic within Python 2.7: the edges returned were always in the same shuffled order. This went out the window when switching to Python 3.4, with the training examples now in a different order, resulting in a different random forest and thus a different learning outcome... And finally a failed test.

The solution should have been to use a classifier that is not sensitive to ordering of the training data. However, although many classifiers satisfy this property, in practice they suffer from slight numerical instability which is sufficient to throw the test results off between shufflings of the training data.

So I've trained a Naive Bayes classifier in Python 2.7, and which I then load up in Python 3.4 and check whether the parameters are close to a newly trained one. The actual classification results can differ slightly, and this becomes much worse in gala, where classification tasks are sequential, so a single misstep can throw off everything that comes after it.

When pickling, remember to open files in binary mode

I've always felt that the pickle module was deficient for not accepting filenames as input to dump. Instead, it takes an open, writeable file. This is all well and good but it turns out that you should always open files in binary mode when using pickle! I got this far without knowing that, surely an indictment of pickle's API!

Additionally, you'll have specify a encoding='bytes' when loading a Python 2 saved file in the Python 3 version of pickle.

Even when you do, objects may not map cleanly between Python 2 and 3 (for some libraries)

In Python 2:

>>> from sklearn.ensemble import RandomForestClassifier as RF
>>> rf = RF()
>>> from sklearn.datasets import load_iris
>>> iris = load_iris()
>>> rf = rf.fit(iris.data, iris.target)
>>> with open('rf', 'wb') as fout:
...     pck.dump(r, fout, protocol=2)

Then, in Python 3:

>>> with open('rf', 'rb') as fin:
...     rf = pck.load(fin, encoding='bytes')
... 
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-9-674ee92b354d> in <module>()
      1 with open('rf', 'rb') as fin:
----> 2     rf = pck.load(fin, encoding='bytes')
      3 

/Users/nuneziglesiasj/anaconda/envs/py3k/lib/python3.4/site-packages/sklearn/tree/_tree.so in sklearn.tree._tree.Tree.__setstate__ (sklearn/tree/_tree.c:18115)()

KeyError: 'node_count'

When all is said and done, your code will probably run slower on Python 3

I have to admit: this just makes me angry. After a lot of hard work ironing out all of the above kinks, gala's tests run about 2x slower in Python 3.4 than in 2.7. I'd heard quite a few times that Python 3 is slower than 2, but that's just ridiculous.

Nick Coghlan's enormous Q&A has been cited as required reading before complaining about Python 3. Well, I've read it (which took days), and I'm still angry that the CPython core development team are generally dismissive of anyone wanting faster Python. Meanwhile, Google autocompletes "why is Python" with "so slow". And although Nick asserts that those of us complaining about this "misunderstand the perspective of conservative users", community surveys show a whopping 40% of Python 2 users citing "no incentive" as the reason they don't switch.

In conclusion...

In the end, I'm glad I ported my code. I learned a few things, and I feel like a better Python "citizen" for having done it. But that's the point: those are pretty weak reasons. Most people just want to get their work done and move on. Why would they bother porting their code if it's not going to help them do that?

Comments

Comments powered by Disqus