NLTK Python 3 support
=====================

The following text is not a general, comprehensive Python 3.x porting guide;
it describes the approach used for the NLTK Python 3 port.

Porting Strategy
----------------

NLTK is being ported to Python 3 using a single-codebase strategy:
NLTK should work from a single codebase in Python 2.x and 3.x.

Python 2.5 compatibility is dropped in order to take advantage of
new ``__future__`` imports, the ``b`` bytestring marker, the new
``except Exception as e`` syntax and better standard library compatibility.

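As a small illustration of the features listed above, the following sketch
uses the ``b`` marker and the ``except ... as`` form and should run unchanged
on Python 2.6+ and 3.x. The function name ``parse_count`` is hypothetical and
not part of NLTK; the ``__future__`` imports are left out only because they
must be the first statement of a module.

```python
def parse_count(raw):
    """Parse an ASCII-encoded integer from a byte string; return None on bad input."""
    try:
        return int(raw.decode("ascii"))
    except (ValueError, UnicodeDecodeError) as e:  # this form is a syntax error on 2.5
        # ``e`` carries the reason; a real caller might log it.
        return None


good = parse_count(b"42")         # b"..." is bytes on 3.x and str on 2.x
bad = parse_count(b"forty-two")   # not a number, so None
```
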
General notes
^^^^^^^^^^^^^

There are good existing guides for writing Python 2.x - 3.x compatible
code, e.g.

* http://docs.python.org/dev/howto/pyporting.html
* http://python3porting.com/
* https://docs.djangoproject.com/en/dev/topics/python3/

Take a look at them to get an idea of how the approach works and what
has changed in Python 3.

nltk.compat
^^^^^^^^^^^

The ``nltk.compat`` module is loosely based on the great `six`_ library.
It provides simple utilities for wrapping over differences
between Python 2 and Python 3: moved imports, removed/renamed builtins,
type name differences and support functions.

.. note::

   We don't use `six`_ directly because it didn't work well
   bundled at the time the port was started (this was a bug in six that
   is fixed now), and NLTK needs extra custom 2+3 helpers anyway.

.. _six: http://packages.python.org/six/

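To give an idea of what such a compatibility module looks like, here is a
minimal sketch. The names (``PY3``, ``string_types``, ``text_type``, ``imap``,
``is_text``) are modeled on ``six`` conventions and are assumptions for
illustration; they are not a listing of what ``nltk.compat`` actually exports.

```python
import sys

PY3 = sys.version_info[0] >= 3

if PY3:
    from io import StringIO        # moved import
    string_types = (str,)          # type name differences
    text_type = str
    imap = map                     # renamed function
else:
    from StringIO import StringIO
    from itertools import imap
    string_types = (basestring,)   # noqa: F821 (exists on Python 2 only)
    text_type = unicode            # noqa: F821 (exists on Python 2 only)


def is_text(value):
    """Support function: True for any string type on the running interpreter."""
    return isinstance(value, string_types)
```

Client code then imports these names from one place instead of branching on
the interpreter version itself.
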
map vs imap, items vs iteritems, ...
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

A number of Python builtins and builtin methods return lists under
Python 2.x and iterators (or views) under Python 3.x. There are 3 possible
ways to work around this:

1) use non-iterator versions of functions and methods under Python 3.x
   (e.g. cast the ``zip`` result to a list);
2) convert Python 2.x code to iterator versions (e.g. replace ``zip``
   with ``itertools.izip`` when possible);
3) let the code behave differently under Python 2.x and Python 3.x.

In this NLTK port methods (1) and (3) are used; (3) is preferred.
This way there are no breaking interface changes for Python 2.x code
and Python 3.x code remains idiomatic (it would be surprising for a dict
subclass ``items`` method to return a list under Python 3.x).

Existing code that uses NLTK will have to be ported from
Python 2.x to 3.x anyway so I think such interface changes are acceptable.

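Methods (1) and (3) can be sketched as follows. ``tag_pairs`` and
``TagCounts`` are hypothetical names used only for this example.

```python
def tag_pairs(words, tags):
    # Method (1): force a list so callers get the same type on 2.x and 3.x.
    return list(zip(words, tags))


class TagCounts(dict):
    # Method (3): items() is inherited untouched, so it is a list on 2.x
    # and a view on 3.x; code that needs an ordering sorts explicitly.
    def by_frequency(self):
        return sorted(self.items(), key=lambda item: item[1], reverse=True)


pairs = tag_pairs(["the", "dog"], ["DT", "NN"])
counts = TagCounts(NN=3, DT=1)
```

Here ``pairs`` is a real list on both interpreters, while
``counts.items()`` keeps whatever type the running interpreter's ``dict``
returns.
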
Unicode support
---------------

Fixing corpora readers
^^^^^^^^^^^^^^^^^^^^^^

Previously, many corpora readers returned byte strings. In the Python 3.x
branch the correct encodings are provided for all corpora and all corpora
readers now return unicode.

``__repr__`` and ``__str__``
^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Under Python 2.x ``obj.__repr__`` and ``obj.__str__`` must return
byte strings, while under Python 3.x they must return unicode strings.

To make things worse, terminals are tricky, and under Python 2.x
there is no hassle-free encoding that ``obj.__repr__`` and ``obj.__str__``
may use except for plain 7 bit ASCII.

..

    Should I link a blog post
    (http://kmike.ru/python-with-strings-attached/) or extract
    some text from it to make the statement about encodings more reasoned?

In NLTK most classes with custom ``__repr__`` and/or ``__str__`` should use
the ``nltk.compat.python_2_unicode_compatible`` decorator. It works this way:

1) the class should be defined with ``__repr__`` and ``__str__`` methods
   returning unicode (that's Python 3.x semantics);
2) under Python 2.x the decorator fixes ``__repr__`` and ``__str__``
   to return bytestrings;
3) under both Python 2.x and 3.x the decorator creates a
   ``__unicode__`` method (which is the original ``__str__``)
   and a ``unicode_repr`` method (which is the original ``__repr__``).

Under Python 2.x the ``__repr__`` method returns an escaped version
of the unicode value and ``__str__`` returns a transliterated version.
For transliteration `Unidecode <http://pypi.python.org/pypi/Unidecode>`_,
`text-unidecode <http://pypi.python.org/pypi/text-unidecode/0.1>`_
or a basic "accent remover" may be used, depending on what
packages are installed.

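The three steps above can be sketched roughly like this. This is a simplified,
hypothetical decorator (``unicode_compatible_sketch``), not the actual
``nltk.compat`` implementation; in particular, ``encode("ascii", "replace")``
is only a crude stand-in for real transliteration.

```python
import sys

PY3 = sys.version_info[0] >= 3


def unicode_compatible_sketch(cls):
    """Hypothetical, simplified take on the decorator described above."""
    # Step 3: keep the unicode-returning originals under new names.
    cls.__unicode__ = cls.__str__
    cls.unicode_repr = cls.__repr__
    if not PY3:
        # Step 2 (Python 2 only): make __str__/__repr__ return byte strings.
        cls.__str__ = lambda self: self.__unicode__().encode("ascii", "replace")
        cls.__repr__ = lambda self: self.unicode_repr().encode("unicode-escape")
    return cls


@unicode_compatible_sketch
class Word(object):
    # Step 1: both methods are written to return unicode text.
    def __init__(self, text):
        self.text = text

    def __str__(self):
        return self.text

    def __repr__(self):
        return "Word(%s)" % self.text
```

Under Python 3.x the decorator only adds the two extra methods; ``__str__``
and ``__repr__`` are left as written.
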
In order to write unicode-aware ``__str__`` and ``__repr__``, the following
approach may be used:

1) ``from __future__ import unicode_literals`` is added to the top of the file;
2) ``str(something)`` should be replaced with ``"%s" % something``
   when used (maybe indirectly) inside ``__str__`` or ``__repr__``;
3) ``repr(something)`` and ``"%r" % something`` should be replaced with
   ``unicode_repr(something)`` and ``"%s" % unicode_repr(something)``
   when used (maybe indirectly) inside ``__str__`` or ``__repr__``.

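A class written according to rules (2) and (3) might look like the sketch
below. The ``unicode_repr`` function defined here is a hypothetical stand-in
so that the example is self-contained; the real helper may behave differently.
The ``unicode_literals`` import from rule (1) is omitted because it has to be
the first statement of a module.

```python
import sys

PY3 = sys.version_info[0] >= 3


def unicode_repr(obj):
    """Hypothetical stand-in: a repr without the Python 2 ``u`` prefix."""
    if hasattr(obj, "unicode_repr"):
        return obj.unicode_repr()
    text = repr(obj)
    if not PY3 and isinstance(obj, unicode):  # noqa: F821 (Python 2 only)
        return text[1:]  # drop the leading 'u'
    return text


class Tagged(object):
    def __init__(self, word, tag):
        self.word = word
        self.tag = tag

    def __str__(self):
        # Rule (2): "%s" % value instead of str(value).
        return "%s/%s" % (self.word, self.tag)

    def __repr__(self):
        # Rule (3): "%s" % unicode_repr(value) instead of "%r" % value.
        return "Tagged(%s, %s)" % (unicode_repr(self.word), unicode_repr(self.tag))


token = Tagged("dog", "NN")
```
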
Doctests porting notes
----------------------

The NLTK test suite is mostly doctest-based. Most of the usual rules apply to
porting doctest code, but there are some issues that make the
process harder, so extra tricks are needed to make doctests work under
both Python 2.x and Python 3.x.

``__future__`` imports
^^^^^^^^^^^^^^^^^^^^^^

Python's doctest runner doesn't support ``__future__`` imports.
They are executed but have no effect in the doctests' code.
These imports are quite useful for making code Python 2.x + 3.x
compatible, so there are some methods to overcome the limitation.

* ``from __future__ import print_function``: it may seem the import works
  because ``print(foo)`` works under Python 2.x, but it works only because
  ``(foo) == foo``; ``print(foo, bar)`` prints a tuple and
  ``print(foo, sep=' ')`` raises an exception. In order to make ``print()``
  work this future import is injected into all doctests' globals within the
  NLTK test suite
  (implementation: ``nltk.test.doctest_nose_plugin.DoctestPluginHelper``).
  So NLTK's doctests shouldn't import ``print_function`` but they should
  assume this import is in effect.

* ``from __future__ import unicode_literals``: there is no sane way to
  use non-ascii constants in doctests under Python 2.x
  (see http://bugs.python.org/issue1293741 ); doctests with non-ascii
  constants are better rewritten as unittests or as doctests
  without non-ascii constants.

  Tests may use variables with unicode values though. In order to print
  such values and have the same output under Python 2 and Python 3 the
  following trick may be used::

      >>> print(unicode_value.encode('unicode-escape').decode('ascii'))

  But it may be a better idea to avoid this trick and rewrite the test in
  unittest format instead.

* ``from __future__ import division``: it is usually not hard to cast
  results to int or float to have the same semantics under Python 2 and 3.

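The injection of ``print_function`` into doctest globals and the
``unicode-escape`` trick can both be tried with the standard ``doctest``
module alone. The sketch below is not the NLTK plugin code. It relies on the
way ``doctest`` derives compiler flags from ``__future__`` feature objects
found in a test's globals, which is an implementation detail of the standard
library. ``run_sample`` and ``SAMPLE`` are names made up for this example.

```python
import __future__
import doctest

# Raw string: the backslashes must reach the doctest parser untouched.
SAMPLE = r'''
>>> print("foo", "bar", sep="-")
foo-bar
>>> value = u"caf\u00e9"
>>> print(value.encode("unicode-escape").decode("ascii"))
caf\xe9
'''


def run_sample(text):
    """Run doctest text with print_function pre-injected into its globals."""
    # A __future__ feature object in the globals becomes a compiler flag
    # for every example of the test.
    globs = {"print_function": __future__.print_function}
    test = doctest.DocTestParser().get_doctest(text, globs, "sample", None, 0)
    runner = doctest.DocTestRunner(verbose=False)
    return runner.run(test)


results = run_sample(SAMPLE)
```

On Python 3.x the injected flag is a no-op, but the same doctest text is then
expected to pass on both interpreters without importing ``print_function``
itself.
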
Unicode strings __repr__
^^^^^^^^^^^^^^^^^^^^^^^^

The representation of unicode strings is different in Python 2.x and Python 3.x
even if they contain only ascii characters.

Python 2.x::

    >>> x = b'foo'.decode('ascii')
    >>> x
    u'foo'

Python 3.x::

    >>> x = b'foo'.decode('ascii')
    >>> x
    'foo'

(Note the missing 'u' in the Python 3 example.)

In order to simplify things NLTK's custom doctest runner
(see ``nltk.test.doctest_nose_plugin.DoctestPluginHelper``) doesn't
take the 'u' prefixes into account; it considers u'foo' and 'foo' equal,
so the developer is free to write u'foo' or 'foo'.

This is not absolutely correct, but if the distinction is important
then the doctest should be converted to a unittest.

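One way such a prefix-insensitive comparison could be implemented is a custom
``doctest.OutputChecker``. The sketch below is an assumption about the general
technique, not the code used by the NLTK plugin; the class name and the
regular expression are made up for this example, and the regular expression
is deliberately naive.

```python
import doctest
import re

# A 'u' at a word boundary that is directly followed by a quote character.
_U_PREFIX = re.compile(r"\bu(?=['\"])")


class IgnoreUnicodePrefixChecker(doctest.OutputChecker):
    """Treat u'foo' and 'foo' as equal when comparing doctest output."""

    def check_output(self, want, got, optionflags):
        base = doctest.OutputChecker.check_output
        if base(self, want, got, optionflags):
            return True
        # Retry with the prefixes stripped from both sides.
        return base(self, _U_PREFIX.sub("", want), _U_PREFIX.sub("", got),
                    optionflags)


checker = IgnoreUnicodePrefixChecker()
same = checker.check_output("[u'foo', u'foo']\n", "['foo', 'foo']\n", 0)
different = checker.check_output("'bar'\n", "'foo'\n", 0)
```

A checker like this can be passed to ``doctest.DocTestRunner(checker=...)``.
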
There are other possible fixes for the 'u' prefix difference but they
all make doctests less readable. For example, for single variables
``print`` may be used. Python 2.x::

    >>> print(x)
    foo

Python 3.x::

    >>> print(x)
    foo

This won't help with container types. Python 2.x::

    >>> print([x, x])
    [u'foo', u'foo']

Possible fixes for lists are::

    >>> for txt in [x, x]:
    ...     print(txt)
    foo
    foo

or::

    >>> print(", ".join([x, x]))
    foo, foo

Float values representation
^^^^^^^^^^^^^^^^^^^^^^^^^^^

The exact representation of float values may vary across Python interpreters
(this is not only a Python 3.x-specific issue). So instead of this::

    >>> recall
    0.8888888888889

write this::

    >>> print(recall)
    0.88888888888...

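The trailing ``...`` only matches if the doctest is run with the ``ELLIPSIS``
option enabled. The text above doesn't say how the NLTK test suite enables
it, so the sketch below simply passes the flag to a standard
``DocTestRunner``; the value of ``recall`` is made up for the example.

```python
import doctest

SAMPLE = '''
>>> recall = 8 / 9.0
>>> print(recall)
0.88888888888...
'''

test = doctest.DocTestParser().get_doctest(SAMPLE, {}, "floats", None, 0)
# ELLIPSIS lets "..." in the expected output match any remaining digits.
runner = doctest.DocTestRunner(verbose=False, optionflags=doctest.ELLIPSIS)
results = runner.run(test)
```
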
Porting tools
-------------

python-modernize
^^^^^^^^^^^^^^^^

The `python-modernize <https://github.com/mitsuhiko/python-modernize>`_ script
was used for the tedious parts of Python 3 porting. Take a look at its docs for
more information. The process was:

* run the NLTK test suite under Python 2.x;
* fix one specific aspect of NLTK by running one of the python-modernize fixers
  on the NLTK source code;
* take a look at the changes python-modernize proposes and fix anything it
  got wrong;
* run the NLTK test suite again under Python 2.x and make sure there are no
  regressions.

After python-modernize the code wouldn't necessarily be Python 3.x compatible,
but further porting would be easier and there shouldn't be any 2.x regressions.

2to3
^^^^

Doctest porting may be tedious: there is a lot of search/replace work
(e.g. ``print foo`` -> ``print(foo)`` or
``except Exception, e`` -> ``except Exception as e``). In order to overcome
this the 2to3 utility was used, e.g.::

    $ 2to3 -d -f print nltk/test/*.doctest

Fixers were applied one by one, and the test suite was executed before and
after each fix.