=========================================
NLTK Python 2.x - 3.x Compatibility Layer
=========================================

NLTK comes with a Python 2.x/3.x compatibility layer, ``nltk.compat``
(which is loosely based on the ``six`` library)::

    >>> from nltk import compat
    >>> compat.PY3
    False
    >>> # and so on

@python_2_unicode_compatible
----------------------------

Under Python 2.x, ``__str__`` and ``__repr__`` methods must return bytestrings.

The ``@python_2_unicode_compatible`` decorator allows writing these methods
in a way compatible with Python 3.x:

1) wrap a class with this decorator,
2) define ``__str__`` and ``__repr__`` methods returning unicode text
   (that's what they must return under Python 3.x),

and they will be fixed under Python 2.x to return byte strings::

    >>> from nltk.compat import python_2_unicode_compatible

    >>> @python_2_unicode_compatible
    ... class Foo(object):
    ...     def __str__(self):
    ...         return u'__str__ is called'
    ...     def __repr__(self):
    ...         return u'__repr__ is called'

    >>> foo = Foo()
    >>> foo.__str__().__class__
    <type 'str'>
    >>> foo.__repr__().__class__
    <type 'str'>
    >>> print(foo)
    __str__ is called
    >>> foo
    __repr__ is called

The original versions of ``__str__`` and ``__repr__`` are available as
``__unicode__`` and ``unicode_repr``::

    >>> foo.__unicode__().__class__
    <type 'unicode'>
    >>> foo.unicode_repr().__class__
    <type 'unicode'>
    >>> unicode(foo)
    u'__str__ is called'
    >>> foo.unicode_repr()
    u'__repr__ is called'

There is no need to wrap a subclass with ``@python_2_unicode_compatible``
if it doesn't override ``__str__`` and ``__repr__``::

    >>> class Bar(Foo):
    ...     pass
    >>> bar = Bar()
    >>> bar.__str__().__class__
    <type 'str'>

However, if a subclass overrides ``__str__`` or ``__repr__``,
wrap it again::

    >>> class BadBaz(Foo):
    ...     def __str__(self):
    ...         return u'Baz.__str__'
    >>> baz = BadBaz()
    >>> baz.__str__().__class__  # this is incorrect!
    <type 'unicode'>

    >>> @python_2_unicode_compatible
    ... class GoodBaz(Foo):
    ...     def __str__(self):
    ...         return u'Baz.__str__'
    >>> baz = GoodBaz()
    >>> baz.__str__().__class__
    <type 'str'>
    >>> baz.__unicode__().__class__
    <type 'unicode'>

Applying ``@python_2_unicode_compatible`` to a subclass
shouldn't break methods that were not overridden::

    >>> baz.__repr__().__class__
    <type 'str'>
    >>> baz.unicode_repr().__class__
    <type 'unicode'>

unicode_repr
------------

Under Python 3.x, ``repr(unicode_string)`` doesn't have a leading "u" letter.

The ``nltk.compat.unicode_repr`` function may be used instead of ``repr`` and
``"%r" % obj`` to make the output more consistent under Python 2.x and 3.x::

    >>> from nltk.compat import unicode_repr
    >>> print(repr(u"test"))
    u'test'
    >>> print(unicode_repr(u"test"))
    'test'

It may also be used to get the original unescaped repr (as unicode)
of objects whose class was fixed by the ``@python_2_unicode_compatible``
decorator::

    >>> @python_2_unicode_compatible
    ... class Foo(object):
    ...     def __repr__(self):
    ...         return u''

    >>> foo = Foo()
    >>> repr(foo)
    ''
    >>> unicode_repr(foo)
    u''

For other objects it returns the same value as ``repr``::

    >>> unicode_repr(5)
    '5'

It may be a good idea to use ``unicode_repr`` instead of the ``%r``
string formatting specifier inside ``__repr__`` or ``__str__``
methods of classes fixed by ``@python_2_unicode_compatible``
to make the output consistent between Python 2.x and 3.x.
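
To make the mechanics described above more concrete, here is a minimal, self-contained sketch of how a decorator and helper of this kind could be written. It is not NLTK's actual implementation: the names ``unicode_compatible_sketch`` and ``unicode_repr_sketch`` are hypothetical, and the choice of UTF-8 for ``__str__`` and ASCII with backslash escapes for ``__repr__`` is an assumption made only for illustration. The sketch runs under both Python 2.x and 3.x; under 3.x it merely records the original methods.

```python
import sys

PY3 = sys.version_info[0] >= 3
text_type = type(u"")  # ``unicode`` on Python 2, ``str`` on Python 3


def unicode_compatible_sketch(cls):
    """Hypothetical, simplified stand-in for the decorator (not NLTK's code)."""
    # Look only at methods defined directly on this class, so methods that
    # an already-decorated parent fixed earlier are not wrapped twice.
    if "__str__" in cls.__dict__:
        text_str = cls.__dict__["__str__"]
        cls.__unicode__ = text_str  # keep the text-returning original
        if not PY3:
            # Assumption: UTF-8 is an acceptable byte encoding for str().
            cls.__str__ = lambda self: text_str(self).encode("utf-8")
    if "__repr__" in cls.__dict__:
        text_repr = cls.__dict__["__repr__"]
        cls.unicode_repr = text_repr  # keep the text-returning original
        if not PY3:
            # Assumption: escape non-ASCII characters rather than fail.
            cls.__repr__ = lambda self: text_repr(self).encode(
                "ascii", "backslashreplace")
    return cls


def unicode_repr_sketch(obj):
    """Hypothetical helper following the behaviour described for unicode_repr."""
    if PY3:
        return repr(obj)
    if hasattr(obj, "unicode_repr"):
        return obj.unicode_repr()  # original text repr of a fixed class
    if isinstance(obj, text_type):
        return repr(obj)[1:]  # drop the leading "u"
    return repr(obj)


@unicode_compatible_sketch
class Foo(object):
    def __str__(self):
        return u"__str__ is called"

    def __repr__(self):
        return u"__repr__ is called"


@unicode_compatible_sketch
class GoodBaz(Foo):
    # Overrides __str__ only, so it is wrapped again; __repr__ is inherited.
    def __str__(self):
        return u"Baz.__str__"


baz = GoodBaz()
print(str(baz))
print(baz.unicode_repr())
print(unicode_repr_sketch(u"test"))
```

Checking ``cls.__dict__`` rather than using ``getattr`` is what lets a re-decorated subclass such as ``GoodBaz`` keep the parent's already fixed ``__repr__`` untouched, which corresponds to the subclass behaviour shown in the doctests.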