.. _NEP40:

================================================
NEP 40 — Legacy Datatype Implementation in NumPy
================================================

:title: Legacy Datatype Implementation in NumPy
:Author: Sebastian Berg
:Status: Draft
:Type: Informational
:Created: 2019-07-17

.. note::

    This NEP is part of a series of NEPs encompassing first information
    about the previous dtype implementation and its issues in NEP 40
    (this document).
    :ref:`NEP 41 <NEP41>` then provides an overview and generic design choices
    for the refactor.
    Further NEPs 42 and 43 go into the technical details of the datatype
    and universal function related internal and external API changes.
    In some cases it may be necessary to consult the other NEPs for a full
    picture of the desired changes and why these changes are necessary.


Abstract
--------

In preparation for the further NumPy enhancement proposals 41, 42, and 43,
this NEP details the status of NumPy datatypes as of NumPy 1.18.
It describes some of the technical aspects and concepts that
motivated the other proposals.
For more general information most readers should begin by reading
:ref:`NEP 41 <NEP41>` and use this document only as a reference or for
additional details.

Detailed Description
--------------------

This section describes some central concepts and provides a brief overview
of the current implementation of dtypes, as well as a discussion.
In many cases subsections will be split roughly so as to first describe the
current implementation and then follow with an "Issues and Discussion" section.

Parametric Datatypes
^^^^^^^^^^^^^^^^^^^^

Some datatypes are inherently *parametric*. All ``np.flexible`` scalar
types are attached to parametric datatypes (string, bytes, and void);
the scalar class ``np.flexible`` is the superclass for these
variable-length data types.
This distinction is similarly exposed by the C macros
``PyDataType_ISFLEXIBLE`` and ``PyTypeNum_ISFLEXIBLE``.
The flexibility generalizes to the set of values which can be represented
inside the array.
For instance, ``"S8"`` can represent longer strings than ``"S4"``.
The parametric string datatype thus also limits the values inside the array
to a subset (or subtype) of all values which can be represented by string
scalars.

The basic numerical datatypes are not flexible (they do not inherit from
``np.flexible``). ``float64``, ``float32``, etc. do have a byte order, but the
described values are unaffected by it, and it is always possible to cast them
to the native, canonical representation without any loss of information.

The concept of flexibility can be generalized to parametric datatypes.
For example the private ``PyArray_AdaptFlexibleDType`` function also accepts the
naive datetime dtype as input to find the correct time unit.
The datetime dtype is thus parametric not in the size of its storage,
but instead in what the stored value represents.
Currently ``np.can_cast("datetime64[s]", "datetime64[ms]", casting="safe")``
returns ``True``, although it is unclear that this is desired or generalizes
to possible future data types such as physical units.

Thus we have data types (mainly strings) with the properties that:

1. Casting is not always safe (``np.can_cast("S8", "S4")``).
2. Array coercion should be able to discover the exact dtype, such as for
   ``np.array(["str1", 12.34], dtype="S")`` where NumPy discovers the
   resulting dtype as ``"S5"``.
   (If the dtype argument is omitted the behaviour is currently ill defined
   [gh-15327]_.)
   A form similar to ``dtype="S"`` is ``dtype="datetime64"`` which can
   discover the unit: ``np.array(["2017-02"], dtype="datetime64")``.
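These properties can be observed directly; the following is a short sketch
assuming current NumPy behaviour:

```python
import numpy as np

# Casting between string dtypes is parametric: "S8" holds values "S4" cannot.
assert not np.can_cast("S8", "S4", casting="safe")
assert np.can_cast("S4", "S8", casting="safe")

# Array coercion discovers the minimal string length from the values:
arr = np.array(["str1", 12.34], dtype="S")
assert arr.dtype == np.dtype("S5")  # b"12.34" needs 5 bytes

# The generic datetime64 dtype similarly discovers its unit from the data:
dates = np.array(["2017-02"], dtype="datetime64")
assert dates.dtype == np.dtype("datetime64[M]")
```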

This notion highlights that some datatypes are more complex than the basic
numerical ones, which is evident in the complicated output type discovery
of universal functions.

Value Based Casting
^^^^^^^^^^^^^^^^^^^

Casting is typically defined between two types:
a type is considered to cast safely to a second type when the second type
can represent all values of the first without loss of information.
However, NumPy may also inspect the actual value to decide
whether casting is safe or not.

This is useful for example in expressions such as::

    arr = np.array([1, 2, 3], dtype="int8")
    result = arr + 5
    assert result.dtype == np.dtype("int8")
    # If the value is larger, the result will change however:
    result = arr + 500
    assert result.dtype == np.dtype("int16")

In this expression, the Python value (which originally has no datatype) is
represented as an ``int8`` or ``int16`` (the smallest possible data type).

NumPy currently does this even for NumPy scalars and zero-dimensional arrays,
so that replacing ``5`` with ``np.int64(5)`` or ``np.array(5, dtype="int64")``
in the above expression will lead to the same results, and thus ignores the
existing datatype. The same logic also applies to floating-point scalars,
which are allowed to lose precision.
The behavior is not used when both inputs are scalars, so that
``5 + np.int8(5)`` returns the default integer size (32 or 64 bit) and not
an ``np.int8``.

While the behaviour is defined in terms of casting and exposed by
``np.result_type``, it is mainly important for universal functions
(such as ``np.add`` in the above examples).
Universal functions currently rely on safe-casting semantics to decide which
loop should be used, and thus what the output datatype will be.


Issues and Discussion
"""""""""""""""""""""

There appears to be some agreement that the current method is
not desirable for values that have a datatype,
but may be useful for pure Python integers or floats as in the first
example.
However, any change of the datatype system and universal function dispatching
must initially fully support the current behavior.
A main difficulty is that, for example, the value ``156`` can be represented
by both ``np.uint8`` and ``np.int16``.
The result depends on the "minimal" representation in the context of the
conversion (for ufuncs the context may depend on the loop order).

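The "minimal" representation mentioned above is exposed through
``np.min_scalar_type``; a brief illustration:

```python
import numpy as np

# np.min_scalar_type reports the "minimal" dtype assigned to a Python value:
assert np.min_scalar_type(156) == np.dtype("uint8")      # fits in uint8
assert np.min_scalar_type(-156) == np.dtype("int16")     # signed, needs 16 bits
assert np.min_scalar_type(-(2**40)) == np.dtype("int64")
```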
The Object Datatype
^^^^^^^^^^^^^^^^^^^

The object datatype currently serves as a generic fallback for any value
which is not otherwise representable.
However, due to not having a well-defined type, it has some issues,
for example when an array is filled with Python sequences::

    >>> l = [1, [2]]
    >>> np.array(l, dtype=np.object_)
    array([1, list([2])], dtype=object)  # a 1d array

    >>> a = np.empty((), dtype=np.object_)
    >>> a[...] = l
    ValueError: assignment to 0-d array  # ???
    >>> a[()] = l
    >>> a
    array(list([1, [2]]), dtype=object)

Without a well-defined type, functions such as ``isnan()`` or ``conjugate()``
do not necessarily work, although they can work for objects such as
:class:`decimal.Decimal`.
To improve this situation it seems desirable to make it easy to create
``object`` dtypes that represent a specific Python datatype and store its
objects inside the array in the form of pointers to Python ``PyObject``\ s.
Unlike most datatypes, Python objects require garbage collection.
This means that additional methods to handle references and
visit all objects must be defined.
In practice, for most use cases it is sufficient to limit the creation of such
datatypes so that all functionality related to Python C-level references is
private to NumPy.

Creating NumPy datatypes that match builtin Python objects also creates a few
problems that require more thought and discussion.
These issues do not need to be solved right away:

* NumPy currently returns *scalars* even for array input in some cases; in most
  cases this works seamlessly. However, this is only true because the NumPy
  scalars behave much like NumPy arrays, a feature that general Python objects
  do not have.
* Seamless integration probably requires that ``np.array(scalar)`` finds the
  correct DType automatically, since some operations (such as indexing) return
  the scalar instead of a 0-D array.
  This is problematic if multiple users independently decide to implement,
  for example, a DType for ``decimal.Decimal``.

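The current object-dtype behaviour can be sketched with
:class:`decimal.Decimal` (a sketch; which functions succeed depends on the
methods the stored objects happen to define):

```python
import numpy as np
from decimal import Decimal

# An object array stores pointers to the Python objects themselves:
arr = np.array([Decimal("1.5"), Decimal("-2.5")], dtype=object)

# Object ufunc loops defer to methods of the stored objects, so this
# works because Decimal defines a .conjugate() method:
assert np.conjugate(arr)[0] == Decimal("1.5")

# Functions that need a well-defined numeric type do not work:
try:
    np.isnan(arr)
except TypeError:
    pass  # no isnan loop exists for arbitrary Python objects
```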

Current ``dtype`` Implementation
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Currently ``np.dtype`` is a Python class with its instances being the
``np.dtype(">float64")``, etc. instances.
To set the actual behaviour of these instances, a prototype instance is stored
globally and looked up based on the ``dtype.typenum``. The singleton is used
where possible. Where required it is copied and modified, for instance to
change endianness.
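The singleton lookup and the metadata field described below are observable
from Python; a minimal sketch (the ``"unit"`` key is an arbitrary example):

```python
import numpy as np

# The builtin numeric dtypes are singletons looked up by type number:
assert np.dtype("float64") is np.dtype(np.float64)

# Changing the byte order yields a modified copy, not the singleton:
swapped = np.dtype("float64").newbyteorder()
assert swapped is not np.dtype("float64")

# Any dtype can additionally carry an arbitrary metadata dictionary:
with_meta = np.dtype(np.float64, metadata={"unit": "m"})
assert with_meta.metadata["unit"] == "m"
```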

Parametric datatypes (strings, void, datetime, and timedelta) must store
additional information such as string lengths, fields, or datetime units;
new instances of these types are created instead of relying on a singleton.
All current datatypes within NumPy further support setting a metadata field
during creation, which can be set to an arbitrary dictionary value but seems
rarely used in practice (one recent and prominent user is h5py).

Many datatype-specific functions are defined within a C structure called
:c:type:`PyArray_ArrFuncs`, which is part of each ``dtype`` instance and
is similar in spirit to Python's ``PyNumberMethods``.
For user-defined datatypes this structure is exposed to the user, making
ABI-compatible changes impossible.
This structure holds important information such as how to copy or cast,
and provides space for pointers to functions, such as comparing elements,
converting to bool, or sorting.
Since some of these functions are vectorized operations, operating on more than
one element, they fit the model of ufuncs and do not need to be defined on the
datatype in the future.
For example the ``np.clip`` function was previously implemented using
``PyArray_ArrFuncs`` and is now implemented as a ufunc.

Discussion and Issues
"""""""""""""""""""""

A further issue with the current implementation of the functions on the dtype
is that, unlike methods,
they are not passed an instance of the dtype when called.
Instead, in many cases, the array which is being operated on is passed in
and typically only used to extract the datatype again.
A future API should likely stop passing in the full array object.
Since it will be necessary to fall back to the old definitions for
backward compatibility, the array object may not be available.
However, passing a "fake" array in which mainly the datatype is defined
is probably a sufficient workaround
(see backward compatibility; alignment information may sometimes also be
desired).

Although not extensively used outside of NumPy itself, ``PyArray_Descr``
is currently a public structure.
This is especially also true for the ``PyArray_ArrFuncs`` structure stored in
its ``f`` field.
Due to compatibility they may need to remain supported for a very long time,
with the possibility of replacing them by functions that dispatch to a newer
API.

However, in the long run access to these structures will probably have to
be deprecated.

NumPy Scalars and Type Hierarchy
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

As a side note to the above datatype implementation: unlike the datatypes,
the NumPy scalars currently **do** provide a type hierarchy, consisting of
abstract types such as ``np.inexact`` (see figure below).
In fact, some control flow within NumPy currently uses
``issubclass(a.dtype.type, np.inexact)``.

.. figure:: _static/nep-0040_dtype-hierarchy.png

    **Figure:** Hierarchy of NumPy scalar types reproduced from the reference
    documentation. Some aliases such as ``np.intp`` are excluded. Datetime
    and timedelta are not shown.

NumPy scalars try to mimic zero-dimensional arrays with a fixed datatype.
For the numerical (and unicode) datatypes, they are further limited to
native byte order.
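The scalar hierarchy can be queried directly with ``issubclass``, exactly as
NumPy does internally:

```python
import numpy as np

# The scalar types form a class hierarchy with abstract types:
assert issubclass(np.float64, np.inexact)
assert issubclass(np.complex128, np.inexact)
assert issubclass(np.int32, np.integer)
assert not issubclass(np.int32, np.inexact)

# Control flow of this kind is used within NumPy itself:
a = np.array([1.0])
assert issubclass(a.dtype.type, np.inexact)
```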

Current Implementation of Casting
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

One of the main features which datatypes need to support is casting between one
another using ``arr.astype(new_dtype, casting="unsafe")``, or during execution
of ufuncs with different types (such as adding integer and floating point
numbers).

Casting tables determine whether it is possible to cast from one specific type
to another.
However, generic casting rules cannot handle parametric dtypes such as strings.
The logic for parametric datatypes is defined mainly in ``PyArray_CanCastTo``
and currently cannot be customized for user-defined datatypes.

The actual casting has two distinct parts:

1. ``copyswap``/``copyswapn`` are defined for each dtype and can handle
   byte-swapping for non-native byte orders as well as unaligned memory.
2. The generic casting code is provided by C functions which know how to
   cast aligned and contiguous memory from one dtype to another
   (both in native byte order).
   These C-level functions can be registered to cast aligned and contiguous
   memory from one dtype to another.
   The function may be provided with both arrays (although the parameter
   is sometimes ``NULL`` for scalars).
   NumPy will ensure that these functions receive native byte order input.
   The current implementation stores the functions either in a C array
   on the datatype which is cast, or in a dictionary when casting to a
   user-defined datatype.
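From Python, both parts appear as a single step; the following sketch shows a
cast that internally combines a byte swap with the actual conversion, and the
casting-table queries that gate it:

```python
import numpy as np

# Casting a non-native byte order array chains a byte swap with the cast;
# from the user's perspective this is a single astype call:
big_endian = np.arange(3, dtype=">i4")
result = big_endian.astype("<f8")
assert result.dtype == np.dtype("<f8")
assert result.tolist() == [0.0, 1.0, 2.0]

# The casting tables answer whether a given cast counts as safe:
assert np.can_cast("int32", "float64", casting="safe")
assert not np.can_cast("int64", "float32", casting="safe")
```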

Generally NumPy will thus perform casting as a chain of the three functions
``in_copyswapn -> castfunc -> out_copyswapn``, using (small) buffers between
these steps.

These multiple functions are wrapped into a single function (with metadata)
that handles the cast and is used for example during the buffered iteration
used by ufuncs.
This is the mechanism that is always used for user-defined datatypes.
For most dtypes defined within NumPy itself, more specialized code is used to
find a function to do the actual cast
(defined by the private ``PyArray_GetDTypeTransferFunction``).
This mechanism replaces most of the above mechanism and provides much faster
casts, for example when the inputs are not contiguous in memory.
However, it cannot be extended by user-defined datatypes.

Related to casting, we currently have a ``PyArray_EquivTypes`` function which
indicates whether a *view* is sufficient (and thus no cast is necessary).
This function is used in multiple places and should probably be part of
a redesigned casting API.


DType handling in Universal functions
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Universal functions are implemented as instances of the ``numpy.UFunc`` class
with an ordered list of datatype-specific
(based on the dtype typecode character, not datatype instances) implementations,
each with a signature and a function pointer.
This list of implementations can be seen with ``ufunc.types``, where
all implementations are listed with their C-style typecode signatures.
For example::

    >>> np.add.types
    [...,
     'll->l',
     ...,
     'dd->d',
     ...]
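The ordered list is accessible at runtime; a small check:

```python
import numpy as np

# Every ufunc carries an ordered list of typecode signatures:
assert "dd->d" in np.add.types   # double + double -> double
assert "ll->l" in np.add.types   # long + long -> long

# The order matters for dispatch: the float32 loop precedes float64:
assert np.add.types.index("ff->f") < np.add.types.index("dd->d")
```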

Each of these signatures is associated with a single inner-loop function defined
in C, which does the actual calculation, and may be called multiple times.

The main step in finding the correct inner-loop function is to call a
:c:type:`PyUFunc_TypeResolutionFunc`, which retrieves the input dtypes from
the provided input arrays
and determines the full type signature (including output dtype) to be executed.

By default the ``TypeResolver`` is implemented by searching all of the
implementations listed in ``ufunc.types`` in order and stopping if all inputs
can be safely cast to fit the signature.
This means that if long (``l``) and double (``d``) arrays are added,
NumPy will find that the ``'dd->d'`` definition works
(long can safely cast to double) and use that.
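This default resolution is visible in ordinary arithmetic (a sketch, assuming
default integer arrays):

```python
import numpy as np

# No "integer + double" loop exists; the integers cast safely to double,
# so the 'dd->d' loop is selected and the result is a float64 array:
res = np.arange(3) + np.arange(3.0)
assert res.dtype == np.dtype("float64")
assert res.tolist() == [0.0, 2.0, 4.0]
```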

In some cases this is not desirable. For example the ``np.isnat`` universal
function has a ``TypeResolver`` which rejects integer inputs instead of
allowing them to be cast to float.
In principle, downstream projects can currently use their own non-default
``TypeResolver``, since the corresponding C structure necessary to do this
is public.
The only project known to do this is Astropy, which is willing to switch to
a new API if NumPy were to remove the possibility to replace the TypeResolver.

For user-defined datatypes, the dispatching logic is similar,
although separately implemented and limited (see discussion below).


Issues and Discussion
"""""""""""""""""""""

It is currently only possible for user-defined functions to be found/resolved
if any of the inputs (or the outputs) has the user datatype, since it uses the
``OO->O`` signature.
For example, given that a ufunc loop to implement ``fraction_divide(int, int)
-> Fraction`` has been implemented,
the call ``fraction_divide(4, 5)`` (with no specific output dtype) will fail
because the loop that
includes the user datatype ``Fraction`` (as output) can only be found if any of
the inputs is already a ``Fraction``.
``fraction_divide(4, 5, dtype=Fraction)`` can be made to work, but is
inconvenient.

Typically, dispatching is done by finding the first loop that matches. A match
is defined as: all inputs (and possibly outputs) can
be cast safely to the signature typechars (see also the current implementation
section).
However, in some cases safe casting is problematic and thus explicitly not
allowed.
For example the ``np.isnat`` function is currently only defined for
datetime and timedelta,
even though integers are defined to be safely castable to timedelta.
If this were not the case, calling
``np.isnat(np.array("NaT", "timedelta64").astype("int64"))`` would currently
return ``True``, although the integer input array has no notion of "not a
time".
If a universal function, such as most functions in ``scipy.special``, is only
defined for ``float32`` and ``float64``, it will currently silently
cast a ``float16`` to ``float32`` (similarly for any integer input).
This ensures successful execution, but may lead to a change in the output dtype
when support for new data types is added to a ufunc.
When a ``float16`` loop is added, the output datatype will currently change
from ``float32`` to ``float16`` without a warning.
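This silent upcasting can be seen within NumPy itself: ``np.sqrt`` defines no
integer loops, so integer input is cast to the first floating-point loop it
fits into safely (a sketch of current behaviour):

```python
import numpy as np

# int8 casts safely to float16, so the 'e->e' loop is picked:
assert np.sqrt(np.ones(3, dtype=np.int8)).dtype == np.dtype("float16")

# int64 only casts safely to float64, so the 'd->d' loop is picked:
assert np.sqrt(np.ones(3, dtype=np.int64)).dtype == np.dtype("float64")
```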

In general the order in which loops are registered is important.
However, this is only reliable if all loops are added when the ufunc is first
defined.
Additional loops added when a new user datatype is imported
must not be sensitive to the order in which imports occur.

There are two main approaches to better define the type resolution for
user-defined types:

1. Allow user dtypes to directly influence the loop selection.
   For example, they may provide a function which returns/selects a loop
   when there is no exactly matching loop available.
2. Define a total ordering of all implementations/loops, probably based on
   "safe casting" semantics, or semantics similar to it.

While option 2 may be less complex to reason about, it remains to be seen
whether it is sufficient for all (or most) use cases.


Adjustment of Parametric output DTypes in UFuncs
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

A second step necessary for parametric dtypes is currently performed within
the ``TypeResolver``:
the datetime and timedelta datatypes have to decide on the correct parameter
for the operation and output array.
This step also needs to double check that all casts can be performed safely,
which by default means that they are "same kind" casts.

Issues and Discussion
"""""""""""""""""""""

Fixing the correct output dtype is currently part of the type resolution.
However, it is a distinct step and should probably be handled as such after
the actual type/loop resolution has occurred.

As such this step may move from the dispatching step (described above) to
the implementation-specific code described below.


DType-specific Implementation of the UFunc
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Once the correct implementation/loop is found, UFuncs currently call
a single *inner-loop function* which is written in C.
This may be called multiple times to do the full calculation, and it has
little or no information about the current context. It also has a ``void``
return value.

Issues and Discussion
"""""""""""""""""""""

Parametric datatypes may require passing
additional information to the inner-loop function to decide how to interpret
the data.
This is the reason why currently no universal functions for ``string`` dtypes
exist (although they are technically possible within NumPy itself).
Note that it is currently possible to pass in the input array objects
(which in turn hold the datatypes when no casting is necessary).
However, the full array information should not be required, and currently the
arrays are passed in before any casting occurs.
The feature is unused within NumPy and no known user exists.

Another issue is the error reporting from within the inner-loop function.
There currently exist two ways to do this:

1. by setting a Python exception, or
2. using the CPU floating point error flags.

Both of these are checked before returning to the user.
However, many integer functions currently can set neither of these errors,
so that checking the floating point error flags is unnecessary overhead.
On the other hand, there is no way to stop the iteration or pass out error
information which does not use the floating point flags or require holding
the Python global interpreter lock (GIL).

It seems necessary to provide more control to authors of inner-loop functions.
This means allowing users to pass information in and out of the inner-loop
function more easily, while *not* providing the input array objects.
Most likely this will involve:

* Allowing the execution of additional code before the first and after
  the last inner-loop call.
* Returning an integer value from the inner-loop to allow stopping the
  iteration early and possibly propagating error information.
* Possibly, allowing specialized inner-loop selections. For example currently
  ``matmul`` and many reductions will execute optimized code for certain
  inputs. It may make sense to allow selecting such optimized loops beforehand.
  Allowing this may also help to bring casting (which uses this heavily) and
  ufunc implementations closer together.

The issues surrounding the inner-loop functions have been discussed in some
detail in the GitHub issue gh-12518_.

Reductions use an "identity" value.
This is currently defined once per ufunc, regardless of the ufunc dtype
signature.
For example ``0`` is used for ``sum``, and ``math.inf`` for ``min``.
This works well for numerical datatypes, but is not always appropriate for
other dtypes.
In general it should be possible to provide a dtype-specific identity to the
ufunc reduction.
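The per-ufunc (not per-dtype) identity is exposed as the ``identity``
attribute; a short sketch:

```python
import numpy as np

# The identity is a single per-ufunc value, independent of the dtype:
assert np.add.identity == 0
assert np.multiply.identity == 1
assert np.minimum.identity is None  # minimum exposes no identity

# Empty reductions fall back to the identity when one exists:
assert np.add.reduce(np.array([], dtype=np.int64)) == 0
```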


Datatype Discovery during Array Coercion
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

When calling ``np.array(...)`` to coerce a general Python object to a NumPy
array, all objects need to be inspected to find the correct dtype.
The inputs to ``np.array()`` are potentially nested Python sequences which hold
the final elements as generic Python objects.
NumPy has to unpack all the nested sequences and then inspect the elements.
The final datatype is found by iterating over all elements which will end up
in the array and:

1. discovering the dtype of the single element:

   * from an array (or array-like) or NumPy scalar using ``element.dtype``,
   * using ``isinstance(..., float)`` for known Python types
     (note that these rules mean that subclasses are *currently* valid),
   * a special rule for void datatypes to coerce tuples;

2. promoting the current dtype with the next element's dtype using
   ``np.promote_types``;
3. if strings are found, restarting the whole process (see also [gh-15327]_),
   in a similar manner as if ``dtype="S"`` was given (see below).
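The promotion step and the ``isinstance``-based rules can be observed directly
(``MyFloat`` is a hypothetical subclass used only for illustration):

```python
import numpy as np

# Pairwise promotion over the discovered element dtypes:
assert np.promote_types("int8", "uint8") == np.dtype("int16")
assert np.array([1, 2.0]).dtype == np.dtype("float64")

# The isinstance-based rules mean subclasses of Python types are accepted:
class MyFloat(float):
    pass

assert np.array([MyFloat(1.5)]).dtype == np.dtype("float64")
```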

If ``dtype=...`` is given, this dtype is used unmodified, unless
it is an unspecific *parametric dtype instance*, which means ``"S0"``,
``"V0"``, ``"U0"``, ``"datetime64"``, and ``"timedelta64"``.
These are flexible datatypes with length 0 – considered to be unsized –
and datetimes or timedeltas without a unit attached ("generic unit").

In a future DType class hierarchy, these may be represented by the class rather
than a special instance, since these special instances should not normally be
attached to an array.

If such a *parametric dtype instance* is provided, for example using
``dtype="S"``, ``PyArray_AdaptFlexibleDType`` is called and effectively
inspects all values using DType-specific logic.
That is:

* strings will use ``str(element)`` to find the length of most elements;
* datetime64 is capable of coercing from strings and guessing the correct unit.


Discussion and Issues
"""""""""""""""""""""

It seems probable that during normal discovery, the ``isinstance`` checks
should rather be strict ``type(element) is desired_type`` checks.
Further, the current ``AdaptFlexibleDType`` logic should be made available to
user DTypes and not be a secondary step, but instead replace, or be part of,
the normal discovery.

Related Issues
--------------

``np.save`` currently translates all user-defined dtypes to void dtypes.
This means they cannot be stored using the ``npy`` format.
This is not an issue for the Python pickle protocol, although it may require
some thought if we wish to ensure that such files can be loaded securely
without the possibility of executing malicious code
(i.e. without the ``allow_pickle=True`` keyword argument).

The additional existence of masked arrays, and especially masked datatypes
within Pandas, has interesting implications for interoperability.
Since mask information is often stored separately, its handling requires
support by the container (array) object.
NumPy itself does not provide such support, and is not expected to add it
in the foreseeable future.
However, if such additions to the datatypes within NumPy would improve
interoperability, they could be considered even if
they are not used by NumPy itself.

Related Work
------------

* Julia types are an interesting blueprint for a type hierarchy, and define
  abstract and concrete types [julia-types]_.

* In Julia, promotion can occur based on abstract types. If a promoter is
  defined, it will cast the inputs and Julia can then retry to find
  an implementation with the new values [julia-promotion]_.

* The ``xnd-project`` (https://github.com/xnd-project), with ndtypes and
  gumath, is similar to NumPy and defines data types as well
  as the possibility to extend them. A major difference is that it does
  not use promotion/casting within the ufuncs, but instead requires explicit
  definition of ``int32 + float64 -> float64`` loops.


Discussion
----------

There have been many discussions about the current state and what a future
datatype system may look like.
The full list of these discussions is long and some are lost to time;
the following provides a subset of the more recent ones:

* Draft NEP by Stephan Hoyer after a developer meeting (was updated at the
  next developer meeting): https://hackmd.io/6YmDt_PgSVORRNRxHyPaNQ

* List of related documents gathered previously here:
  https://hackmd.io/UVOtgj1wRZSsoNQCjkhq1g (TODO: Reduce to the most important
  ones):

  * https://github.com/numpy/numpy/pull/12630
    Matti Picus' draft NEP, which discusses the technical side of subclassing
    more from the side of ``ArrFunctions``.

  * https://hackmd.io/ok21UoAQQmOtSVk6keaJhw and https://hackmd.io/s/ryTFaOPHE
    (2019-04-30): proposals for a subclassing implementation approach.

  * Discussion about the calling convention of ufuncs and the need for more
    powerful UFuncs: https://github.com/numpy/numpy/issues/12518

  * 2018-11-30 developer meeting notes:
    https://github.com/BIDS-numpy/docs/blob/master/meetings/2018-11-30-dev-meeting.md
    and subsequent draft for an NEP: https://hackmd.io/6YmDt_PgSVORRNRxHyPaNQ

    BIDS meeting on November 30, 2018, and document by Stephan Hoyer about
    what NumPy should provide and thoughts on how to get there. Meeting with
    Eric Wieser, Matti Picus, Charles Harris, Tyler Reddy, Stéfan van der
    Walt, and Travis Oliphant.

  * SciPy 2018 brainstorming session with summaries of use cases:
    https://github.com/numpy/numpy/wiki/Dtype-Brainstorming

    Also lists some requirements and some ideas on implementations.

References
----------

.. _gh-12518: https://github.com/numpy/numpy/issues/12518

.. [gh-15327] https://github.com/numpy/numpy/issues/15327

.. [julia-types] https://docs.julialang.org/en/v1/manual/types/index.html#Abstract-Types-1

.. [julia-promotion] https://docs.julialang.org/en/v1/manual/conversion-and-promotion/


Copyright
---------

This document has been placed in the public domain.