:Resolution: https://mail.python.org/pipermail/numpy-discussion/2020-April/080573.html and https://mail.python.org/pipermail/numpy-discussion/2020-March/080495.html
.. note::
This NEP is part of a series of NEPs.
:ref:`NEP 40 <NEP40>` first describes the previous dtype implementation
and its issues.
NEP 41 (this document) then provides an overview and generic design
choices for the refactor.
Further NEPs 42 and 43 go into the technical details of the datatype
and universal function related internal and external API changes.
In some cases it may be necessary to consult the other NEPs for a full
picture of the desired changes and why these changes are necessary.
Abstract
--------
:ref:`Datatypes <arrays.dtypes>` in NumPy describe how to interpret each
element in arrays. NumPy provides ``int``, ``float``, and ``complex`` numerical
types, as well as string, datetime, and structured datatype capabilities.
The growing Python community, however, has need for more diverse datatypes.
Examples are datatypes with unit information attached (such as meters) or
categorical datatypes (fixed set of possible values).
However, the current NumPy datatype API is too limited to allow the creation
of these.
This NEP is the first step to enable such growth; it will lead to
a simpler development path for new datatypes.
In the long run the new datatype system will also support the creation
of datatypes directly from Python rather than C.
Refactoring the datatype API will improve maintainability and facilitate
development of both user-defined external datatypes,
as well as new features for existing datatypes internal to NumPy.
Motivation and Scope
--------------------
.. seealso::
The user impact section includes examples of what kind of new datatypes
will be enabled by the proposed changes in the long run.
It may thus help to read these sections out of order.
Motivation
^^^^^^^^^^
One of the main issues with the current API is the definition of typical
functions such as addition and multiplication for parametric datatypes
(see also :ref:`NEP 40 <NEP40>`)
which require additional steps to determine the output type.
For example, when adding two strings of length 4, the result is a string
of length 8, which is different from the input.
Similarly, a datatype which embeds a physical unit must calculate the new unit
information: dividing a distance by a time results in a speed.
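The string case can be observed with the existing API. The following is a minimal sketch using byte strings, chosen here only because their length in characters equals their itemsize; the variable names are illustrative.

```python
import numpy as np

# Two arrays of byte strings, each element of length 4 (dtype "S4").
a = np.array([b"abcd", b"wxyz"])
b = np.array([b"efgh", b"1234"])

# Element-wise concatenation. The output dtype cannot simply be copied
# from the inputs: its length has to be computed from both operands.
result = np.char.add(a, b)

print(a.dtype, b.dtype, result.dtype)
```

The output dtype is a different instance of the same kind of datatype as the inputs, which is the extra step that parametric datatypes require.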
A related difficulty is that the :ref:`current casting rules <ufuncs.casting>`
-- the conversion between different datatypes --
cannot describe casting for such parametric datatypes implemented outside of NumPy.
This additional functionality for supporting parametric datatypes introduces
increased complexity within NumPy itself,
and furthermore is not available to external user-defined datatypes.
In general the concerns of different datatypes are not well-encapsulated.
This burden is exacerbated by the exposure of internal C structures,
limiting the addition of new fields
(for example to support new sorting methods [new_sort]_).
Currently there are many factors which limit the creation of new user-defined
datatypes:
* Creating casting rules for parametric user-defined dtypes is either impossible
or so complex that it has never been attempted.
* Type promotion, e.g. the operation deciding that adding float and integer
values should return a float value, is very valuable for numeric datatypes
but is limited in scope for user-defined and especially parametric datatypes.
* Much of the logic (e.g. promotion) is written in single functions
instead of being split as methods on the datatype itself.
* In the current design datatypes cannot have methods that do not generalize
to other datatypes. For example a unit datatype cannot have a ``.to_si()`` method to
easily find the datatype which would represent the same values in SI units.
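For the built-in datatypes, casting and promotion are already exposed through public functions. The sketch below only illustrates what these terms mean for existing types; it does not show how a user-defined datatype would take part in them.

```python
import numpy as np

# Promotion: adding integer and float values yields a float result.
numeric = np.result_type(np.int64, np.float64)

# For a parametric datatype the result depends on the instances,
# here on the string lengths.
strings = np.promote_types("S4", "S8")

# Casting rules: with the default casting="safe", widening is allowed
# while a conversion that may lose information is not.
widening = np.can_cast(np.int32, np.float64)
narrowing = np.can_cast(np.float64, np.int32)
truncating = np.can_cast("S8", "S4")

print(numeric, strings, widening, narrowing, truncating)
```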
The pressing need to solve these issues has driven the scientific community
to create workarounds: multiple projects implement physical units as an
array-like class instead of as a datatype, even though a datatype would
generalize better across multiple array-likes (Dask, pandas, etc.).
Already, Pandas has made a push in the same direction with its
extension arrays [pandas_extension_arrays]_ and undoubtedly
the community would be best served if such new features could be common
between NumPy, Pandas, and other projects.
Scope
^^^^^
The proposed refactoring of the datatype system is a large undertaking and
thus is proposed to be split into various phases, roughly:
* Phase I: Restructure and extend the datatype infrastructure (This NEP 41)
* Phase II: Incrementally define or rework API (Detailed largely in NEPs 42/43)
* Phase III: Growth of NumPy and Scientific Python Ecosystem capabilities.
For a more detailed accounting of the various phases, see
"Plan to Approach the Full Refactor" in the Implementation section below.
This NEP proposes to move ahead with the necessary creation of new dtype
subclasses (Phase I),
and start working on implementing current functionality.
Within the context of this NEP all development will be fully private API or
use preliminary underscored names which must be changed in the future.
Most of the internal and public API choices are part of a second Phase
and will be discussed in more detail in the following NEPs 42 and 43.
The initial implementation of this NEP will have little or no effect on users,
but provides the necessary ground work for incrementally addressing the
full rework.
The implementation of this NEP and the following, implied large rework of how
datatypes are defined in NumPy is expected to create small incompatibilities
(see backward compatibility section).
However, a transition requiring large code adaptation is not anticipated and not
within scope.
Specifically, this NEP makes the following design choices which are discussed
in more detail in the detailed description section:
1. Each datatype will be an instance of a subclass of ``np.dtype``, with most of the
datatype-specific logic being implemented
as special methods on the class. In the C-API, these correspond to specific
slots. In short, for ``f = np.dtype("f8")``, ``isinstance(f, np.dtype)`` will remain true,
but ``type(f)`` will be a subclass of ``np.dtype`` rather than just ``np.dtype`` itself.
The ``PyArray_ArrFuncs`` which are currently stored as a pointer on the instance (as ``PyArray_Descr->f``),
should instead be stored on the class as typically done in Python.
In the future these may correspond to Python-side dunder methods.
Storage information such as itemsize and byteorder can differ between
different dtype instances (e.g. "S3" vs. "S8") and will remain part of the instance.
This means that in the long run the current low-level access to dtype methods
will be removed (see ``PyArray_ArrFuncs`` in
:ref:`NEP 40 <NEP40>`).
2. The current NumPy scalars will *not* change; they will not be instances of
datatypes. This will also be true for new datatypes: scalars will not be
instances of a dtype (although ``isinstance(scalar, dtype)`` may be made
to return ``True`` when appropriate).
Detailed technical decisions to follow in NEP 42.
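The two choices above can be checked from Python. The following sketch assumes a NumPy release in which this design has already been implemented (to our knowledge 1.20 or later); on older releases ``type(f)`` is plain ``np.dtype``.

```python
import numpy as np

f = np.dtype("f8")

# Choice 1: the instance check keeps working, but the concrete class
# is a subclass of np.dtype rather than np.dtype itself.
still_a_dtype = isinstance(f, np.dtype)
is_subclass = type(f) is not np.dtype and issubclass(type(f), np.dtype)

# Storage information stays on the instance: "S3" and "S8" share one
# class but differ in itemsize.
s3, s8 = np.dtype("S3"), np.dtype("S8")
same_class = type(s3) is type(s8)
itemsizes = (s3.itemsize, s8.itemsize)

# Choice 2: scalars are not instances of a datatype.
scalar = np.float64(1.0)
scalar_is_dtype = isinstance(scalar, np.dtype)

print(type(f), still_a_dtype, is_subclass, same_class, itemsizes, scalar_is_dtype)
```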
Further, the public API will be designed in a way that is extensible in the future:
3. All new C-API functions provided to the user will hide implementation details
as much as possible. The public API should be an identical, but limited,
version of the C-API used for the internal NumPy datatypes.
The datatype system may be targeted to work with NumPy arrays,
for example by providing strided-loops, but should avoid direct
interactions with the array-object (typically `np.ndarray` instances).
Instead, the design principle will be that the array-object is a consumer
of the datatype.
While only a guiding principle, this may allow splitting the datatype system
or even the NumPy datatypes into their own project which NumPy depends on.
The changes to the datatype system in Phase II must include a large refactor of the
UFunc machinery, which will be further defined in NEP 43:
4. To enable all of the desired functionality for new user-defined datatypes,
the UFunc machinery will be changed to replace the current dispatching
and type resolution system.
The old system should be *mostly* supported as a legacy version for some time.
Additionally, as a general design principle, the addition of new user-defined
datatypes will *not* change the behaviour of programs.
For example ``common_dtype(a, b)`` must not be ``c`` unless ``a`` or ``b`` know
that ``c`` exists.
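One way to picture this principle is a promotion that consults only the two datatypes involved and never a global registry. The sketch below is purely illustrative: ``ToyDType``, ``common_with``, and this ``common_dtype`` function are hypothetical names and not part of NumPy or of the API proposed in NEPs 42 and 43.

```python
class ToyDType:
    """Hypothetical stand-in for a datatype; not part of NumPy."""

    def __init__(self, name, promotions=None):
        self.name = name
        # Maps the name of another datatype to the resulting datatype.
        self.promotions = promotions or {}

    def common_with(self, other):
        if other.name == self.name:
            return self
        return self.promotions.get(other.name, NotImplemented)


def common_dtype(a, b):
    """Ask only ``a`` and ``b``; no global registry is consulted."""
    for first, second in ((a, b), (b, a)):
        result = first.common_with(second)
        if result is not NotImplemented:
            return result
    raise TypeError(f"no common dtype for {a.name} and {b.name}")


float64 = ToyDType("float64")
int64 = ToyDType("int64", {"float64": float64})
before = common_dtype(int64, float64)

# A new datatype that knows about the existing ones is defined later.
bigfloat = ToyDType("bigfloat")
bigfloat.promotions = {"int64": bigfloat, "float64": bigfloat}

# The result for the existing pair is unchanged, because neither
# int64 nor float64 knows that bigfloat exists.
after = common_dtype(int64, float64)
mixed = common_dtype(int64, bigfloat)
print(before.name, after.name, mixed.name)
```

Because the lookup is limited to the two operands, defining ``bigfloat`` cannot alter the outcome for ``int64`` and ``float64``.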
User Impact
-----------
The current ecosystem has very few user-defined datatypes using NumPy, the
two most prominent being ``rational`` and ``quaternion``.
These represent fairly simple datatypes which are not strongly impacted
by the current limitations.
However, we have identified a need for datatypes such as:
* bfloat16, used in deep learning
* categorical types
* physical units (such as meters)
* datatypes for tracing/automatic differentiation
* high, fixed precision math
* specialized integer types such as int2, int24
* new, better datetime representations
* extending e.g. integer dtypes to have a sentinel NA value
* geometrical objects [pygeos]_
Some of these are partially solved; for example unit capability is provided
in ``astropy.units``, ``unyt``, or ``pint``, as `numpy.ndarray` subclasses.
Most of these datatypes, however, simply cannot be reasonably defined
right now.
An advantage of having such datatypes in NumPy is that they should integrate
seamlessly with other array or array-like packages such as Pandas,
``xarray`` [xarray_dtype_issue]_, or ``Dask``.
The long term user impact of implementing this NEP will be to allow both
the growth of the whole ecosystem by having such new datatypes, as well as
consolidating implementation of such datatypes within NumPy to achieve
better interoperability.
Examples
^^^^^^^^
The following examples represent future user-defined datatypes we wish to enable.
These datatypes are not part of the NEP and choices (e.g. choice of casting rules)
are possibilities we wish to enable and do not represent recommendations.
Simple Numerical Types
""""""""""""""""""""""
Mainly used where memory is a consideration, lower-precision numeric types
such as `bfloat16 <https://en.wikipedia.org/wiki/Bfloat16_floating-point_format>`_
are common in other computational frameworks.
For these types the definitions of things such as ``np.common_type`` and
``np.can_cast`` are some of the most important interfaces. Once they
support ``np.common_type``, it is (for the most part) possible to find
the correct ufunc loop to call, since most ufuncs -- such as add -- effectively