.. _NEP40:

================================================
NEP 40 — Legacy Datatype Implementation in NumPy
================================================

:title: Legacy Datatype Implementation in NumPy
:Author: Sebastian Berg
:Status: Draft
:Type: Informational
:Created: 2019-07-17

.. note::

    This NEP is part of a series of NEPs encompassing first information
    about the previous dtype implementation and issues with it in NEP 40
    (this document).
    :ref:`NEP 41 <NEP41>` then provides an overview and generic design choices
    for the refactor.
    Further NEPs 42 and 43 go into the technical details of the datatype
    and universal function related internal and external API changes.
    In some cases it may be necessary to consult the other NEPs for a full
    picture of the desired changes and why these changes are necessary.

Abstract
--------

In preparation for NumPy enhancement proposals 41, 42, and 43, this
NEP details the current status of NumPy datatypes as of NumPy 1.18.
It describes some of the technical aspects and concepts that
motivated the other proposals.
For more general information most readers should begin by reading
:ref:`NEP 41 <NEP41>` and use this document only as a reference or for
additional details.

Detailed Description
--------------------

This section describes some central concepts and provides a brief overview
of the current implementation of dtypes, as well as a discussion.
In many cases, subsections first describe the current implementation and
then follow with an "Issues and Discussion" subsection.

Parametric Datatypes
^^^^^^^^^^^^^^^^^^^^

Some datatypes are inherently *parametric*. All ``np.flexible`` scalar
types (string, bytes, and void) are attached to parametric datatypes;
the class ``np.flexible`` is the scalar superclass for these data types
of variable length.
This distinction is similarly exposed by the C macros
``PyDataType_ISFLEXIBLE`` and ``PyTypeNum_ISFLEXIBLE``.
This flexibility generalizes to the set of values which can be represented
inside the array.
For instance, ``"S8"`` can represent longer strings than ``"S4"``.
The parametric string datatype thus also limits the values inside the array
to a subset (or subtype) of all values which can be represented by string
scalars.
The basic numerical datatypes are not flexible (do not inherit from
``np.flexible``). ``float64``, ``float32``, etc. do have a byte order, but the described
values are unaffected by it, and it is always possible to cast them to the
native, canonical representation without any loss of information.
The concept of flexibility can be generalized to parametric datatypes.
For example the private ``PyArray_AdaptFlexibleDType`` function also accepts the
naive datetime dtype as input to find the correct time unit.
The datetime dtype is thus parametric not in the size of its storage,
but instead in what the stored value represents.
Currently ``np.can_cast("datetime64[s]", "datetime64[ms]", casting="safe")``
returns ``True``, although it is unclear that this is desired or that it
generalizes to possible future data types such as physical units.

Thus we have data types (mainly strings) with the properties that:

1. Casting is not always safe (``np.can_cast("S8", "S4")`` is ``False``).
2. Array coercion should be able to discover the exact dtype, such as for
   ``np.array(["str1", 12.34], dtype="S")``, where NumPy discovers the
   resulting dtype as ``"S5"``.
   (If the dtype argument is omitted, the behaviour is currently ill defined
   [gh-15327]_.)
   A form similar to ``dtype="S"`` is ``dtype="datetime64"``, which can
   discover the unit: ``np.array(["2017-02"], dtype="datetime64")``.
This notion highlights that some datatypes are more complex than the basic
numerical ones, which is evident in the complicated output type discovery
of universal functions.
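
These parametric properties can be observed directly (output shown as of
NumPy 1.18, which this NEP describes)::

    >>> np.can_cast("S8", "S4")  # casting to a shorter string is not safe
    False
    >>> np.array(["str1", 12.34], dtype="S").dtype  # str(12.34) needs 5 bytes
    dtype('S5')
    >>> np.array(["2017-02"], dtype="datetime64").dtype  # unit discovered as months
    dtype('<M8[M]')
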

Value Based Casting
^^^^^^^^^^^^^^^^^^^

Casting is typically defined between two types:
a type is considered to cast safely to a second type when the second type
can represent all values of the first without loss of information.
However, NumPy may also inspect the actual value to decide
whether casting is safe or not.
This is useful, for example, in expressions such as::

    arr = np.array([1, 2, 3], dtype="int8")
    result = arr + 5
    assert result.dtype == np.dtype("int8")
    # If the value is larger, the result will change however:
    result = arr + 500
    assert result.dtype == np.dtype("int16")

In this expression, the Python value (which originally has no datatype) is
represented as an ``int8`` or ``int16`` (the smallest possible data type).
NumPy currently does this even for NumPy scalars and zero-dimensional arrays,
so that replacing ``5`` with ``np.int64(5)`` or ``np.array(5, dtype="int64")``
in the above expression will lead to the same results, and thus ignores the
existing datatype. The same logic also applies to floating-point scalars,
which are allowed to lose precision.
The behavior is not used when both inputs are scalars, so that
``5 + np.int8(5)`` returns the default integer size (32 or 64-bit) and not
an ``np.int8``.
While the behaviour is defined in terms of casting and exposed by
``np.result_type``, it is mainly important for universal functions
(such as ``np.add`` in the above examples).
Universal functions currently rely on safe casting semantics to decide which
loop should be used, and thus what the output datatype will be.
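
This value-based behaviour can be observed through ``np.result_type``
(output as of NumPy 1.18)::

    >>> np.result_type(np.int8, 5)    # 5 fits into an int8
    dtype('int8')
    >>> np.result_type(np.int8, 500)  # 500 needs at least an int16
    dtype('int16')
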

Issues and Discussion
"""""""""""""""""""""

There appears to be some agreement that the current method is
not desirable for values that have a datatype,
but that it may be useful for pure Python integers or floats, as in the
first example.
However, any change of the datatype system and universal function dispatching
must initially fully support the current behavior.

A main difficulty is that, for example, the value ``156`` can be represented
by both ``np.uint8`` and ``np.int16``.
The result depends on the "minimal" representation in the context of the
conversion (for ufuncs the context may depend on the loop order).
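
The "minimal" dtype of a value is exposed by ``np.min_scalar_type``, where
the chosen representation differs for signed and unsigned values::

    >>> np.min_scalar_type(156)   # fits into an unsigned 8-bit integer
    dtype('uint8')
    >>> np.min_scalar_type(-156)  # requires a signed 16-bit integer
    dtype('int16')
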

The Object Datatype
^^^^^^^^^^^^^^^^^^^

The object datatype currently serves as a generic fallback for any value
which is not otherwise representable.
However, due to not having a well-defined type, it has some issues,
for example when an array is filled with Python sequences::

    >>> l = [1, [2]]
    >>> np.array(l, dtype=np.object_)
    array([1, list([2])], dtype=object)  # a 1d array
    >>> a = np.empty((), dtype=np.object_)
    >>> a[...] = l
    ValueError: assignment to 0-d array  # ???
    >>> a[()] = l
    >>> a
    array(list([1, [2]]), dtype=object)

Without a well-defined type, functions such as ``isnan()`` or ``conjugate()``
do not necessarily work for object arrays, although they can work when the
stored objects support the operation, as a :class:`decimal.Decimal` does.
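
For illustration (the ``conjugate`` ufunc falls back to calling the object's
own method for object arrays, while ``isnan`` has no object loop at all; the
error message is abbreviated here)::

    >>> import decimal
    >>> a = np.array([decimal.Decimal("1.5")], dtype=object)
    >>> np.conjugate(a)  # works, Decimal defines a conjugate() method
    array([Decimal('1.5')], dtype=object)
    >>> np.isnan(a)
    TypeError: ufunc 'isnan' not supported for the input types, ...
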
To improve this situation, it seems desirable to make it easy to create
``object`` dtypes that represent a specific Python datatype and store their
objects inside the array in the form of ``PyObject`` pointers.
Unlike most datatypes, Python objects require garbage collection.
This means that additional methods to handle references and
visit all objects must be defined.
In practice, for most use-cases it is sufficient to limit the creation of such
datatypes so that all functionality related to Python C-level references is
private to NumPy.

Creating NumPy datatypes that match builtin Python objects also creates a
few problems that require more thought and discussion.
These issues do not need to be solved right away:

* NumPy currently returns *scalars* even for array input in some cases; in
  most cases this works seamlessly. However, this is only true because the
  NumPy scalars behave much like NumPy arrays, a feature that general Python
  objects do not have.
* Seamless integration probably requires that ``np.array(scalar)`` finds the
  correct DType automatically, since some operations (such as indexing) return
  the scalar instead of a 0-D array.
  This is problematic if multiple users independently decide to implement,
  for example, a DType for ``decimal.Decimal``.

Current ``dtype`` Implementation
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Currently ``np.dtype`` is a Python class whose instances are, for example,
``np.dtype(">float64")``.
To set the actual behaviour of these instances, a prototype instance is stored
globally and looked up based on the ``dtype.typenum``. This singleton is used
where possible; where required, it is copied and modified, for instance to
change endianness.
Parametric datatypes (strings, void, datetime, and timedelta) must store
additional information such as string lengths, fields, or datetime units --
new instances of these types are created instead of relying on a singleton.

All current datatypes within NumPy further support setting a ``metadata``
field during creation, which can be set to an arbitrary dictionary value but
seems rarely used in practice (one recent and prominent user is h5py).
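
The singleton behaviour and the metadata field can be observed from Python
(shown here on a little-endian machine)::

    >>> np.dtype("float64") is np.dtype("float64")   # native singleton is reused
    True
    >>> np.dtype(">float64") is np.dtype("float64")  # byte-swapped copy is a new instance
    False
    >>> dt = np.dtype("float64", metadata={"unit": "m"})
    >>> dt.metadata["unit"]
    'm'
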
Many datatype-specific functions are defined within a C structure called
:c:type:`PyArray_ArrFuncs`, which is part of each ``dtype`` instance and
is similar in spirit to Python's ``PyNumberMethods``.
For user-defined datatypes this structure is exposed to the user, making
ABI-compatible changes impossible.
This structure holds important information such as how to copy or cast,
and provides space for pointers to functions, such as comparing elements,
converting to bool, or sorting.
Since some of these functions are vectorized operations, operating on more than
one element, they fit the model of ufuncs and do not need to be defined on the
datatype in the future.
For example the ``np.clip`` function was previously implemented using
``PyArray_ArrFuncs`` and is now implemented as a ufunc.

Discussion and Issues
"""""""""""""""""""""

A further issue with the current implementation of the functions on the
dtype is that, unlike methods, they are not passed an instance of the dtype
when called.
Instead, in many cases, the array which is being operated on is passed in
and typically only used to extract the datatype again.
A future API should likely stop passing in the full array object.
Since it will be necessary to fall back to the old definitions for
backward compatibility, the array object may not be available.
However, passing a "fake" array in which mainly the datatype is defined
is probably a sufficient workaround
(see backward compatibility; alignment information may sometimes also be desired).

Although not extensively used outside of NumPy itself, ``PyArray_Descr`` is
currently a public structure.
This is also true for the ``PyArray_ArrFuncs`` structure stored in
its ``f`` field.
For compatibility, they may need to remain supported for a very long time,
with the possibility of replacing them by functions that dispatch to a newer
API.
However, in the long run, access to these structures will probably have to
be deprecated.

NumPy Scalars and Type Hierarchy
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

As a side note to the above datatype implementation: unlike the datatypes,
the NumPy scalars currently **do** provide a type hierarchy, consisting of
abstract types such as ``np.inexact`` (see figure below).
In fact, some control flow within NumPy currently uses
``issubclass(a.dtype.type, np.inexact)``.
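
For example::

    >>> issubclass(np.dtype("float64").type, np.inexact)
    True
    >>> issubclass(np.float32, np.floating)
    True
    >>> issubclass(np.int64, np.inexact)
    False
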
.. figure:: _static/nep-0040_dtype-hierarchy.png

   **Figure:** Hierarchy of NumPy scalar types reproduced from the reference
   documentation. Some aliases such as ``np.intp`` are excluded. Datetime
   and timedelta are not shown.

NumPy scalars try to mimic zero-dimensional arrays with a fixed datatype.
For the numerical (and unicode) datatypes, they are further limited to
native byte order.

Current Implementation of Casting
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

One of the main features which datatypes need to support is casting from one
dtype to another using ``arr.astype(new_dtype, casting="unsafe")``, or during
the execution of ufuncs with different types (such as adding integer and
floating point numbers).
Casting tables determine whether it is possible to cast from one specific type to another.
However, generic casting rules cannot handle the parametric dtypes such as strings.
The logic for parametric datatypes is defined mainly in ``PyArray_CanCastTo``
and currently cannot be customized for user defined datatypes.

The actual casting has two distinct parts:

1. ``copyswap``/``copyswapn`` are defined for each dtype and can handle
   byte-swapping for non-native byte orders as well as unaligned memory.
2. The generic casting code is provided by C functions which know how to
   cast aligned and contiguous memory from one dtype to another
   (both in native byte order).
   These C-level cast functions may be provided with both array objects
   (although the parameter is sometimes ``NULL`` for scalars), and
   NumPy will ensure that they receive native byte order input.

The current implementation stores the cast functions either in a C-array
on the datatype which is cast, or in a dictionary when casting to a user
defined datatype.
Generally NumPy will thus perform casting as a chain of the three functions
``in_copyswapn -> castfunc -> out_copyswapn``, using (small) buffers between
these steps.

These functions are wrapped into a single function (with metadata) that
handles the cast and is used, for example, during the buffered iteration
employed by ufuncs.
This is the mechanism that is always used for user defined datatypes.
For most dtypes defined within NumPy itself, more specialized code is used to
find a function to do the actual cast
(defined by the private ``PyArray_GetDTypeTransferFunction``).
This mechanism replaces most of the above machinery and provides much faster
casts, for example when the inputs are not contiguous in memory.
However, it cannot be extended by user defined datatypes.

Related to casting, we currently have a ``PyArray_EquivTypes`` function which
indicates that a *view* is sufficient (and thus that no cast is necessary).
This function is used in multiple places and should probably be part of
a redesigned casting API.
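
From Python, the equivalent-types notion can be observed in that ``astype``
with ``copy=False`` returns the input array itself when no cast is necessary::

    >>> a = np.arange(3, dtype="int64")
    >>> a.astype("int64", copy=False) is a  # equivalent dtype, no cast needed
    True
    >>> a.astype("int32", copy=False) is a  # an actual cast is required
    False
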

DType handling in Universal functions
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Universal functions are implemented as instances of the ``numpy.ufunc`` class
with an ordered list of datatype-specific implementations
(based on the dtype typecode character, not datatype instances),
each with a signature and a function pointer.
This list of implementations can be seen with ``ufunc.types``, where
all implementations are listed with their C-style typecode signatures.
For example::

    >>> np.add.types
    [...,
     'll->l',
     ...,
     'dd->d',
     ...]

Each of these signatures is associated with a single inner-loop function defined
in C, which does the actual calculation, and may be called multiple times.
The main step in finding the correct inner-loop function is to call a
:c:type:`PyUFunc_TypeResolutionFunc` which retrieves the input dtypes from
the provided input arrays
and will determine the full type signature (including output dtype) to be executed.
By default the ``TypeResolver`` is implemented by searching all of the implementations
listed in ``ufunc.types`` in order and stopping if all inputs can be safely
cast to fit the signature.
This means that if long (``l``) and double (``d``) arrays are added,
NumPy will find that the ``'dd->d'`` definition works
(long can be cast safely to double) and use it.
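
For example (typecodes may differ between platforms)::

    >>> l = np.array([1, 2])      # typecode 'l' on most platforms
    >>> d = np.array([1.0, 2.0])  # typecode 'd'
    >>> (l + d).dtype             # no 'ld' loop; inputs cast safely to 'dd->d'
    dtype('float64')
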
In some cases this is not desirable. For example the ``np.isnat`` universal
function has a ``TypeResolver`` which rejects integer inputs instead of
allowing them to be cast to float.
In principle, downstream projects can currently use their own non-default
``TypeResolver``, since the corresponding C-structure necessary to do this
is public.
The only project known to do this is Astropy, which is willing to switch to
a new API if NumPy were to remove the ability to replace the ``TypeResolver``.
For user defined datatypes, the dispatching logic is similar,
although separately implemented and limited (see discussion below).

Issues and Discussion
"""""""""""""""""""""

It is currently only possible for user defined functions to be found/resolved
if any of the inputs (or the outputs) has the user datatype, since it uses the
``OO->O`` signature.
For example, given that a ufunc loop to implement ``fraction_divide(int, int)
-> Fraction`` has been implemented,
the call ``fraction_divide(4, 5)`` (with no specific output dtype) will fail
because the loop that
includes the user datatype ``Fraction`` (as output) can only be found if any of
the inputs is already a ``Fraction``.
``fraction_divide(4, 5, dtype=Fraction)`` can be made to work, but is inconvenient.
Typically, dispatching is done by finding the first loop that matches. A match
is defined as: all inputs (and possibly outputs) can
be cast safely to the signature typechars (see also the current implementation
section).
However, in some cases safe casting is problematic and thus explicitly not
allowed.
For example the ``np.isnat`` function is currently only defined for
datetime and timedelta,
even though integers are defined to be safely castable to timedelta.
If this were not the case, calling
``np.isnat(np.array("NaT", "timedelta64").astype("int64"))`` would currently
return ``True``, although the integer input array has no notion of "not a time".
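
The stricter ``TypeResolver`` is visible in that integer input is rejected
rather than cast (the exact error message may vary between versions)::

    >>> np.isnat(np.array(["NaT"], dtype="timedelta64"))
    array([ True])
    >>> np.isnat(np.array([1], dtype="int64"))
    TypeError: ufunc 'isnat' is only defined for datetime and timedelta.
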
If a universal function, such as most functions in ``scipy.special``, is only
defined for ``float32`` and ``float64`` it will currently automatically
cast a ``float16`` silently to ``float32`` (similarly for any integer input).
This ensures successful execution, but may lead to a change in the output dtype
when support for new data types is added to a ufunc.
When a ``float16`` loop is added, the output datatype will currently change
from ``float32`` to ``float16`` without a warning.

In general, the order in which loops are registered is important.
However, this is only reliable if all loops are added when the ufunc is first
defined.
Additional loops added when a new user datatype is imported
must not be sensitive to the order in which imports occur.

There are two main approaches to better define the type resolution for user
defined types:

1. Allow user dtypes to directly influence the loop selection.
   For example, they may provide a function which returns/selects a loop
   when there is no exactly matching loop available.
2. Define a total ordering of all implementations/loops, probably based on
   "safe casting" semantics, or semantics similar to it.

While option 2 may be less complex to reason about, it remains to be seen
whether it is sufficient for all (or most) use cases.

Adjustment of Parametric output DTypes in UFuncs
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

A second step necessary for parametric dtypes is currently performed within
the ``TypeResolver``:
the datetime and timedelta datatypes have to decide on the correct parameter
for the operation and output array.
This step also needs to double check that all casts can be performed safely,
which by default means that they are "same kind" casts.
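
For example, adding a timedelta in hours to a datetime stored in days
requires adjusting the unit of the output::

    >>> a = np.array(["2020-01-01"], dtype="datetime64[D]")
    >>> (a + np.timedelta64(3, "h")).dtype  # output parameter adjusted to hours
    dtype('<M8[h]')
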
Issues and Discussion
"""""""""""""""""""""
Fixing the correct output dtype is currently part of the type resolution.
However, it is a distinct step and should probably be handled as such after
the actual type/loop resolution has occurred.
As such this step may move from the dispatching step (described above) to
the implementation-specific code described below.
DType-specific Implementation of the UFunc
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Once the correct implementation/loop is found, UFuncs currently call
a single *inner-loop function* which is written in C.
This may be called multiple times to do the full calculation and it has
little or no information about the current context. It also has a void
return value.
Issues and Discussion
"""""""""""""""""""""
Parametric datatypes may require passing
additional information to the inner-loop function to decide how to interpret
the data.
This is the reason why currently no universal functions for ``string`` dtypes
exist (although technically possible within NumPy itself).
Note that it is currently possible to pass in the input array objects
(which in turn hold the datatypes when no casting is necessary).
However, the full array information should not be required and currently the
arrays are passed in before any casting occurs.
The feature is unused within NumPy and no known user exists.

Another issue is the error reporting from within the inner-loop function.
There currently exist two ways to do this:

1. by setting a Python exception, or
2. by using the CPU floating point error flags.

Both of these are checked before returning to the user.
However, many integer functions currently can set neither of these errors,
so that checking the floating point error flags is unnecessary overhead.
On the other hand, there is no way to stop the iteration or to pass out error
information which does not either use the floating point flags or require
holding the Python global interpreter lock (GIL).
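
The floating point error flags are what surface, for example, as the warnings
and errors controlled by ``np.errstate``::

    >>> with np.errstate(over="raise"):
    ...     np.array([1e38], dtype="float32") * 10
    ...
    FloatingPointError: overflow encountered in multiply
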
It seems necessary to provide more control to authors of inner loop functions.
This means allowing users to pass in and out information from the inner-loop
function more easily, while *not* providing the input array objects.

Most likely this will involve:

* Allowing the execution of additional code before the first and after
  the last inner-loop call.
* Returning an integer value from the inner-loop to allow stopping the
  iteration early and possibly propagating error information.
* Possibly, allowing specialized inner-loop selections. For example, currently
  ``matmul`` and many reductions will execute optimized code for certain inputs.
  It may make sense to allow selecting such optimized loops beforehand.
Allowing this may also help to bring casting (which uses this heavily) and
ufunc implementations closer.

The issues surrounding the inner-loop functions have been discussed in some
detail in the GitHub issue gh-12518_.
Reductions use an "identity" value.
This is currently defined once per ufunc, regardless of the ufunc dtype signature.
For example ``0`` is used for ``sum``, or ``math.inf`` for ``min``.
This works well for numerical datatypes, but is not always appropriate for other dtypes.
In general it should be possible to provide a dtype-specific identity to the
ufunc reduction.
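
The identity is exposed as an attribute of the ufunc::

    >>> np.add.identity       # used as the starting value for np.sum
    0
    >>> np.multiply.identity  # used as the starting value for np.prod
    1
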

Datatype Discovery during Array Coercion
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

When calling ``np.array(...)`` to coerce a general Python object to a NumPy
array, all objects need to be inspected to find the correct dtype.
The inputs to ``np.array()`` are potentially nested Python sequences which
hold the final elements as generic Python objects.
NumPy has to unpack all the nested sequences and then inspect the elements.

The final datatype is found by iterating over all elements which will end up
in the array and:

1. discovering the dtype of the single element:

   * from an array (or array-like) or NumPy scalar using ``element.dtype``,
   * using ``isinstance(..., float)`` for known Python types
     (note that these rules mean that subclasses are *currently* valid),
   * a special rule for void datatypes to coerce tuples;

2. promoting the current dtype with the next element's dtype using
   ``np.promote_types``;
3. if strings are found, the whole process is restarted (see also [gh-15327]_),
   in a similar manner as if ``dtype="S"`` was given (see below).
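
This stepwise promotion can be observed directly (output as of NumPy 1.18)::

    >>> np.promote_types(np.int64, np.float64)  # step 2 for np.array([1, 2.0])
    dtype('float64')
    >>> np.array([1, 2.0]).dtype
    dtype('float64')
    >>> np.array([1, "a string"]).dtype  # a string is found, process restarts
    dtype('<U8')
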

If ``dtype=...`` is given, this dtype is used unmodified, unless it is an
unspecific *parametric dtype instance*, meaning ``"S0"``, ``"V0"``, ``"U0"``,
``"datetime64"``, and ``"timedelta64"``.
These are flexible datatypes whose length of 0 is considered to mean unsized,
and datetimes or timedeltas without a unit attached ("generic unit").
In a future DType class hierarchy, these may be represented by the class
rather than a special instance, since these special instances should not
normally be attached to an array.

If such a *parametric dtype instance* is provided, for example using
``dtype="S"``, ``PyArray_AdaptFlexibleDType`` is called and effectively
inspects all values using DType-specific logic.
That is:

* strings will use ``str(element)`` to find the length of most elements;
* datetime64 is capable of coercing from strings and guessing the correct unit.
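
For example (output as of NumPy 1.18)::

    >>> np.array([123, "hello"], dtype="S").dtype  # str(123) has length 3
    dtype('S5')
    >>> np.array(["2017-02-23 12:00"], dtype="datetime64").dtype  # unit guessed
    dtype('<M8[m]')
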

Discussion and Issues
"""""""""""""""""""""

It seems probable that during normal discovery, the ``isinstance`` checks
should rather be strict ``type(element) is desired_type`` checks.
Further, the current ``AdaptFlexibleDType`` logic should be made available to
user DTypes and should not be a secondary step, but instead replace, or be
part of, the normal discovery.

Related Issues
--------------

``np.save`` currently translates all user-defined dtypes to void dtypes.
This means they cannot be stored using the ``npy`` format.
This is not an issue for the Python pickle protocol, although it may require
some thought if we wish to ensure that such files can be loaded securely,
without the possibility of executing malicious code
(i.e. without the ``allow_pickle=True`` keyword argument).
The additional existence of masked arrays and especially masked datatypes
within Pandas has interesting implications for interoperability.
Since mask information is often stored separately, its handling requires
support by the container (array) object.
NumPy itself does not provide such support, and is not expected to add it
in the foreseeable future.
However, if such additions to the datatypes within NumPy would improve
interoperability, they could be considered even if
they are not used by NumPy itself.

Related Work
------------

* Julia types are an interesting blueprint for a type hierarchy, and define
  abstract and concrete types [julia-types]_.
* In Julia, promotion can occur based on abstract types. If a promoter is
  defined, it will cast the inputs, and Julia can then retry to find
  an implementation with the new values [julia-promotion]_.
* ``xnd-project`` (https://github.com/xnd-project) with ndtypes and gumath.
  The ``xnd-project`` is similar to NumPy and defines data types as well
  as the possibility to extend them. A major difference is that it does
  not use promotion/casting within the ufuncs, but instead requires explicit
  definition of ``int32 + float64 -> float64`` loops.

Discussion
----------

There have been many discussions about the current state and what a future
datatype system may look like.
The full list of these discussions is long and some are lost to time;
the following provides a subset of the more recent ones:

* Draft NEP by Stephan Hoyer after a developer meeting (updated after the
  next developer meeting): https://hackmd.io/6YmDt_PgSVORRNRxHyPaNQ
* List of related documents gathered previously at
  https://hackmd.io/UVOtgj1wRZSsoNQCjkhq1g (TODO: reduce to the most
  important ones):

  * https://github.com/numpy/numpy/pull/12630
    Matti Picus' draft NEP, which discusses the technical side of subclassing
    more from the side of ``ArrFunctions``.
  * https://hackmd.io/ok21UoAQQmOtSVk6keaJhw and https://hackmd.io/s/ryTFaOPHE
    (2019-04-30): proposals for a subclassing implementation approach.

* Discussion about the calling convention of ufuncs and the need for more
  powerful UFuncs: https://github.com/numpy/numpy/issues/12518
* 2018-11-30 developer meeting notes:
  https://github.com/BIDS-numpy/docs/blob/master/meetings/2018-11-30-dev-meeting.md
  and the subsequent draft for an NEP: https://hackmd.io/6YmDt_PgSVORRNRxHyPaNQ
  (BIDS meeting on November 30, 2018, and a document by Stephan Hoyer about
  what NumPy should provide and thoughts on how to get there; meeting with
  Eric Wieser, Matti Picus, Charles Harris, Tyler Reddy, Stéfan van der
  Walt, and Travis Oliphant).
* SciPy 2018 brainstorming session with summaries of use cases:
  https://github.com/numpy/numpy/wiki/Dtype-Brainstorming
  Also lists some requirements and some ideas on implementations.

References
----------

.. _gh-12518: https://github.com/numpy/numpy/issues/12518

.. [gh-15327] https://github.com/numpy/numpy/issues/15327

.. [julia-types] https://docs.julialang.org/en/v1/manual/types/index.html#Abstract-Types-1

.. [julia-promotion] https://docs.julialang.org/en/v1/manual/conversion-and-promotion/

Copyright
---------

This document has been placed in the public domain.