NLTK Python 3 support ===================== The following text is not a general comprehensive Python 3.x porting guide; it provides some information about the approach using for NLTK Python 3 port. Porting Strategy ---------------- NLTK is being ported to Python 3 using single codebase strategy: NLTK should work from a single codebase in Python 2.x and 3.x. Python 2.5 compatibility is dropped in order to take advantage of new ``__future__`` imports, ``b`` bytestring marker, new ``except Exception as e`` syntax and better standard library compatibility. General notes ^^^^^^^^^^^^^ There are good existing guides for writing Python 2.x - 3.x compatible code, e.g. * http://docs.python.org/dev/howto/pyporting.html * http://python3porting.com/ * https://docs.djangoproject.com/en/dev/topics/python3/ Take a look at them to have an idea how the approach works and what is changed in Python 3. nltk.compat ^^^^^^^^^^^ ``nltk.compat`` module is loosely based on a great `six`_ library. It provides simple utilities for wrapping over differences between Python 2 and Python 3. Moved imports, removed/renamed builtins, type names differences and support functions are there. .. note:: We don't use `six`_ directly because it didn't work well bundled at the time the port was started (this was a bug in six that is fixed now), and NLTK needs extra custom 2+3 helpers anyway. .. _six: http://packages.python.org/six/ map vs imap, items vs iteritems, ... ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ A number of Python builtins and builtin methods returns lists under Python 2.x and iterators under Python 3.x. There are 3 possible ways to workaround this: 1) use non-iterator versions of functions and methods under Python 3.x (e.g. cast ``zip`` result to list); 2) convert Python 2.x code to iterator versions (e.g. replace ``zip`` with ``itertools.izip`` when possible); 3) let the code behave different under Python 2.x and Python 3.x. In this NLTK port (1) and (3) methods are used; (3) is preferred. This way there are no breaking interface changes for Python 2.x code and Python 3.x code remains idiomatic (it is surprising for a dict subclass ``items`` method to return a list under Python 3.x). Existing code that uses NLTK will have to be ported from Python 2.x to 3.x anyway so I think such interface changes are acceptable. Unicode support --------------- Fixing corpora readers ^^^^^^^^^^^^^^^^^^^^^^ Previously, many corpora readers returned byte strings. In a Python 3.x branch the correct encodings are provided for all corpora and all corpora readers are now returning unicode. ``__repr__`` and ``__str__`` ^^^^^^^^^^^^^^^^^^^^^^^^^^^^ Under Python 2.x ``obj.__repr__`` and ``obj.__str__`` must return byte strings, while under Python 3.x they must return unicode strings. To make things worse, terminals are tricky, and under Python 2.x there is no hassle-free encoding that ``obj.__repr__`` and ``obj.__str__`` may use except for plain 7 bit ASCII. .. Should I link a blog post (http://kmike.ru/python-with-strings-attached/) or extract some text from it to make the statement about encodings more reasoned? In NLTK most classes with custom ``__repr__`` and/or ``__str__`` should use ``nltk.compat.python_2_unicode_compatible`` decorator. It works this way: 1) Class should be defined with ``__repr__`` and ``__str__`` methods returning unicode (that's Python 3.x semantics); 2) under Python 2.x the decorator fixes ``__repr__`` and ``__str__`` to return bytestrings; 3) under both Python 2.x and 3.x the decorator creates ``__unicode__`` method (which is an original ``__str__``) and ``unicode_repr`` method (which is an original ``__repr__``). Under Python 2.x ``__repr__`` method returns an escaped version of the unicode value and ``__str__`` returns a transliterated version. For transliteration `Unidecode `_, `text-unidecode `_ or a basic "accent remover" may be used, depending on what packages are installed. In order to write unicode-aware ``__str__`` and ``__repr__``, the following approach may be used: 1) ``from __future__ import unicode_literals`` is added to a top of file; 2) ``str(something)`` should be replaced with ``"%s" % something`` when used (maybe indirectly) inside ``__str__`` or ``__repr__``; 3) ``repr(something)`` and ``"%r" % something`` should be replaced with ``unicode_repr(something)`` and ``"%s" % unicode_repr(something)`` when used (maybe indirectly) inside ``__str__`` or ``__repr__``. Doctests porting notes ---------------------- NLTK test suite is mostly doctest-based. Most usual rules apply to porting doctests code. But ther are some issues that make the process harder, so in order to make doctests work under Python 2.x and Python 3.x extra tricks are needed. ``__future__`` imports ^^^^^^^^^^^^^^^^^^^^^^ Python's doctest runner doesn't support ``__future__`` imports. They are executed but has no effect in doctests' code. These imports are quite useful for making code Python 2.x + 3.x compatible so there are some methods to overcome the limitation. * ``from __future__ import print_function``: it may seem the import works because ``print(foo)`` works under python 2.x; but it works only because (foo) == foo; ``print(foo, bar)`` prints tuple; ``print(foo, sep=' ')`` raises an exception. In order to make print() work this future import is injected to all doctests' globals within NLTK test suite (implementation: ``nltk.test.doctest_nose_plugin.DoctestPluginHelper``). So NLTK's doctests shouldn't import print_function but they should assume this import is in effect. * ``from __future__ import unicode_literals``: there is no sane way to use non-ascii constants in doctests under python 2.x (see http://bugs.python.org/issue1293741 ); doctests with non-ascii constants should be better rewritten as unittests or as doctests without non-ascii constants. Tests may use variables with unicode values though. In order to print such values and have the same output under python 2 and python 3 the following trick may be used:: >>> print(unicode_value.encode('unicode-escape').decode('ascii')) But it may be a better idea to avoid this trick and rewrite the test to unittest format instead. * ``from __future__ import division``: it is usually not hard to cast results to int or float to have the same semantics under python 2 and 3. Unicode strings __repr__ ^^^^^^^^^^^^^^^^^^^^^^^^ Representation of unicode strings is different in Python 2.x and Python 3.x even if they contain only ascii characters. Python 2.x:: >>> x = b'foo'.decode('ascii') >>> x u'foo' Python 3.x:: >>> x = b'foo'.decode('ascii') >>> x 'foo' (Note the missing 'u' in Python 3 example). In order to simplify things NLTK's custom doctest runner (see ``nltk.test.doctest_nose_plugin.DoctestPluginHelper``) doesn't take 'u''s into account; it considers u'foo' and 'foo' equal; developer is free to write u'foo' or 'foo'. This is not absolutely correct but if this distinction is important then doctest should be converted to unittest. There are other possible fixes for this issue but they all make doctests less readable. For example, for single variables ``print`` may be used. Python 2.x:: >>> print(x) foo Python 3.x:: >>> print(x) foo This won't help with container types. Python 2.x:: >>> print([x, x]) [u'foo', u'foo'] Possible fixes for lists are:: >>> for txt in [x, x]: ... print(x) foo foo or:: >>> print(", ".join([x, x])) foo, foo Float values representation ^^^^^^^^^^^^^^^^^^^^^^^^^^^ The exact representation of float values may vary across Python interpreters (this is not only a Python 3.x - specific issue). So instead of this:: >>> recall 0.8888888888889 write this:: >>> print(recall) 0.88888888888... Porting tools ------------- python-modernize ^^^^^^^^^^^^^^^^ `python-modernize `_ script was used for tedious parts of python3 porting. Take a look at the docs for more information. The process was: * Run NLTK test suite under Python 2.x; * fix one specific aspect of NLTK by running one of python-modernize fixers on NLTK source code; * take a look at changes python-modernize proposes, fix stupid things; * run NLTK test suite again under Python 2.x and make sure there are no regressions. After python-modernize code wouldn't be necessary Python 3.x compatible but further porting would be easier and there shouldn't be 2.x regressions. 2to3 ^^^^ Doctest porting may be tedious, there is a lot of search/replace work (e.g. ``print foo`` -> ``print(foo)`` or ``raise Exception, e`` -> ``raise Exception as e``). In order to overcome this 2to3 utility was used, e.g.:: $ 2to3 -d -f print nltk/test/*.doctest Fixers were applied one-by-one, test suite was executed before and after fixing.