353 lines
17 KiB
ReStructuredText
353 lines
17 KiB
ReStructuredText
|
===========================================================
|
|||
|
NEP 22 — Duck typing for NumPy arrays – high level overview
|
|||
|
===========================================================
|
|||
|
|
|||
|
:Author: Stephan Hoyer <shoyer@google.com>, Nathaniel J. Smith <njs@pobox.com>
|
|||
|
:Status: Final
|
|||
|
:Type: Informational
|
|||
|
:Created: 2018-03-22
|
|||
|
:Resolution: https://mail.python.org/pipermail/numpy-discussion/2018-September/078752.html
|
|||
|
|
|||
|
Abstract
|
|||
|
--------
|
|||
|
|
|||
|
We outline a high-level vision for how NumPy will approach handling
|
|||
|
“duck arrays”. This is an Informational-class NEP; it doesn’t
|
|||
|
prescribe full details for any particular implementation. In brief, we
|
|||
|
propose developing a number of new protocols for defining
|
|||
|
implementations of multi-dimensional arrays with high-level APIs
|
|||
|
matching NumPy.
|
|||
|
|
|||
|
|
|||
|
Detailed description
|
|||
|
--------------------
|
|||
|
|
|||
|
Traditionally, NumPy’s ``ndarray`` objects have provided two things: a
|
|||
|
high level API for expression operations on homogenously-typed,
|
|||
|
arbitrary-dimensional, array-structured data, and a concrete
|
|||
|
implementation of the API based on strided in-RAM storage. The API is
|
|||
|
powerful, fairly general, and used ubiquitously across the scientific
|
|||
|
Python stack. The concrete implementation, on the other hand, is
|
|||
|
suitable for a wide range of uses, but has limitations: as data sets
|
|||
|
grow and NumPy becomes used in a variety of new environments, there
|
|||
|
are increasingly cases where the strided in-RAM storage strategy is
|
|||
|
inappropriate, and users find they need sparse arrays, lazily
|
|||
|
evaluated arrays (as in dask), compressed arrays (as in blosc), arrays
|
|||
|
stored in GPU memory, arrays stored in alternative formats such as
|
|||
|
Arrow, and so forth – yet users still want to work with these arrays
|
|||
|
using the familiar NumPy APIs, and re-use existing code with minimal
|
|||
|
(ideally zero) porting overhead. As a working shorthand, we call these
|
|||
|
“duck arrays”, by analogy with Python’s “duck typing”: a “duck array”
|
|||
|
is a Python object which “quacks like” a numpy array in the sense that
|
|||
|
it has the same or similar Python API, but doesn’t share the C-level
|
|||
|
implementation.
|
|||
|
|
|||
|
This NEP doesn’t propose any specific changes to NumPy or other
|
|||
|
projects; instead, it gives an overview of how we hope to extend NumPy
|
|||
|
to support a robust ecosystem of projects implementing and relying
|
|||
|
upon its high level API.
|
|||
|
|
|||
|
Terminology
|
|||
|
~~~~~~~~~~~
|
|||
|
|
|||
|
“Duck array” works fine as a placeholder for now, but it’s pretty
|
|||
|
jargony and may confuse new users, so we may want to pick something
|
|||
|
else for the actual API functions. Unfortunately, “array-like” is
|
|||
|
already taken for the concept of “anything that can be coerced into an
|
|||
|
array” (including e.g. list objects), and “anyarray” is already taken
|
|||
|
for the concept of “something that shares ndarray’s implementation,
|
|||
|
but has different semantics”, which is the opposite of a duck array
|
|||
|
(e.g., np.matrix is an “anyarray”, but is not a “duck array”). This is
|
|||
|
a classic bike-shed so for now we’re just using “duck array”. Some
|
|||
|
possible options though include: arrayish, pseudoarray, nominalarray,
|
|||
|
ersatzarray, arraymimic, ...
|
|||
|
|
|||
|
|
|||
|
General approach
|
|||
|
~~~~~~~~~~~~~~~~
|
|||
|
|
|||
|
At a high level, duck array support requires working through each of
|
|||
|
the API functions provided by NumPy, and figuring out how it can be
|
|||
|
extended to work with duck array objects. In some cases this is easy
|
|||
|
(e.g., methods/attributes on ndarray itself); in other cases it’s more
|
|||
|
difficult. Here are some principles we’ve found useful so far:
|
|||
|
|
|||
|
|
|||
|
Principle 1: Focus on “full” duck arrays, but don’t rule out “partial” duck arrays
|
|||
|
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|
|||
|
|
|||
|
We can distinguish between two classes:
|
|||
|
|
|||
|
* “full” duck arrays, which aspire to fully implement np.ndarray’s
|
|||
|
Python-level APIs and work essentially anywhere that np.ndarray
|
|||
|
works
|
|||
|
|
|||
|
* “partial” duck arrays, which intentionally implement only a subset
|
|||
|
of np.ndarray’s API.
|
|||
|
|
|||
|
Full duck arrays are, well, kind of boring. They have exactly the same
|
|||
|
semantics as ndarray, with differences being restricted to
|
|||
|
under-the-hood decisions about how the data is actually stored. The
|
|||
|
kind of people that are excited about making numpy more extensible are
|
|||
|
also, unsurprisingly, excited about changing or extending numpy’s
|
|||
|
semantics. So there’s been a lot of discussion of how to best support
|
|||
|
partial duck arrays. We've been guilty of this ourself.
|
|||
|
|
|||
|
At this point though, we think the best general strategy is to focus
|
|||
|
our efforts primarily on supporting full duck arrays, and only worry
|
|||
|
about partial duck arrays as much as we need to to make sure we don't
|
|||
|
accidentally rule them out for no reason.
|
|||
|
|
|||
|
Why focus on full duck arrays? Several reasons:
|
|||
|
|
|||
|
First, there are lots of very clear use cases. Potential consumers of
|
|||
|
the full duck array interface include almost every package that uses
|
|||
|
numpy (scipy, sklearn, astropy, ...), and in particular packages that
|
|||
|
provide array-wrapping-classes that handle multiple types of arrays,
|
|||
|
such as xarray and dask.array. Potential implementers of the full duck
|
|||
|
array interface include: distributed arrays, sparse arrays, masked
|
|||
|
arrays, arrays with units (unless they switch to using dtypes),
|
|||
|
labeled arrays, and so forth. Clear use cases lead to good and
|
|||
|
relevant APIs.
|
|||
|
|
|||
|
Second, the Anna Karenina principle applies here: full duck arrays are
|
|||
|
all alike, but every partial duck array is partial in its own way:
|
|||
|
|
|||
|
* ``xarray.DataArray`` is mostly a duck array, but has incompatible
|
|||
|
broadcasting semantics.
|
|||
|
* ``xarray.Dataset`` wraps multiple arrays in one object; it still
|
|||
|
implements some array interfaces like ``__array_ufunc__``, but
|
|||
|
certainly not all of them.
|
|||
|
* ``pandas.Series`` has methods with similar behavior to numpy, but
|
|||
|
unique null-skipping behavior.
|
|||
|
* scipy’s ``LinearOperator``\s support matrix multiplication and nothing else
|
|||
|
* h5py and similar libraries for accessing array storage have objects
|
|||
|
that support numpy-like slicing and conversion into a full array,
|
|||
|
but not computation.
|
|||
|
* Some classes may be similar to ndarray, but without supporting the
|
|||
|
full indexing semantics.
|
|||
|
|
|||
|
And so forth.
|
|||
|
|
|||
|
Despite our best attempts, we haven't found any clear, unique way of
|
|||
|
slicing up the ndarray API into a hierarchy of related types that
|
|||
|
captures these distinctions; in fact, it’s unlikely that any single
|
|||
|
person even understands all the distinctions. And this is important,
|
|||
|
because we have a *lot* of APIs that we need to add duck array support
|
|||
|
to (both in numpy and in all the projects that depend on numpy!). By
|
|||
|
definition, these already work for ``ndarray``, so hopefully getting
|
|||
|
them to work for full duck arrays shouldn’t be so hard, since by
|
|||
|
definition full duck arrays act like ``ndarray``. It’d be very
|
|||
|
cumbersome to have to go through each function and identify the exact
|
|||
|
subset of the ndarray API that it needs, then figure out which partial
|
|||
|
array types can/should support it. Once we have things working for
|
|||
|
full duck arrays, we can go back later and refine the APIs needed
|
|||
|
further as needed. Focusing on full duck arrays allows us to start
|
|||
|
making progress immediately.
|
|||
|
|
|||
|
In the future, it might be useful to identify specific use cases for
|
|||
|
duck arrays and standardize narrower interfaces targeted just at those
|
|||
|
use cases. For example, it might make sense to have a standard “array
|
|||
|
loader” interface that file access libraries like h5py, netcdf, pydap,
|
|||
|
zarr, ... all implement, to make it easy to switch between these
|
|||
|
libraries. But that’s something that we can do as we go, and it
|
|||
|
doesn’t necessarily have to involve the NumPy devs at all. For an
|
|||
|
example of what this might look like, see the documentation for
|
|||
|
`dask.array.from_array
|
|||
|
<http://dask.pydata.org/en/latest/array-api.html#dask.array.from_array>`__.
|
|||
|
|
|||
|
|
|||
|
Principle 2: Take advantage of duck typing
|
|||
|
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|
|||
|
|
|||
|
``ndarray`` has a very large API surface area::
|
|||
|
|
|||
|
In [1]: len(set(dir(np.ndarray)) - set(dir(object)))
|
|||
|
Out[1]: 138
|
|||
|
|
|||
|
And this is a huge **under**\estimate, because there are also many
|
|||
|
free-standing functions in NumPy and other libraries which currently
|
|||
|
use the NumPy C API and thus only work on ``ndarray`` objects. In type
|
|||
|
theory, a type is defined by the operations you can perform on an
|
|||
|
object; thus, the actual type of ``ndarray`` includes not just its
|
|||
|
methods and attributes, but *all* of these functions. For duck arrays
|
|||
|
to be successful, they’ll need to implement a large proportion of the
|
|||
|
``ndarray`` API – but not all of it. (For example,
|
|||
|
``dask.array.Array`` does not provide an equivalent to the
|
|||
|
``ndarray.ptp`` method, presumably because no-one has ever noticed or
|
|||
|
cared about its absence. But this doesn’t seem to have stopped people
|
|||
|
from using dask.)
|
|||
|
|
|||
|
This means that realistically, we can’t hope to define the whole duck
|
|||
|
array API up front, or that anyone will be able to implement it all in
|
|||
|
one go; this will be an incremental process. It also means that even
|
|||
|
the so-called “full” duck array interface is somewhat fuzzily defined
|
|||
|
at the borders; there are parts of the ``np.ndarray`` API that duck
|
|||
|
arrays won’t have to implement, but we aren’t entirely sure what those
|
|||
|
are.
|
|||
|
|
|||
|
And ultimately, it isn’t really up to the NumPy developers to define
|
|||
|
what does or doesn’t qualify as a duck array. If we want scikit-learn
|
|||
|
functions to work on dask arrays (for example), then that’s going to
|
|||
|
require negotiation between those two projects to discover
|
|||
|
incompatibilities, and when an incompatibility is discovered it will
|
|||
|
be up to them to negotiate who should change and how. The NumPy
|
|||
|
project can provide technical tools and general advice to help resolve
|
|||
|
these disagreements, but we can’t force one group or another to take
|
|||
|
responsibility for any given bug.
|
|||
|
|
|||
|
Therefore, even though we’re focusing on “full” duck arrays, we
|
|||
|
*don’t* attempt to define a normative “array ABC” – maybe this will be
|
|||
|
useful someday, but right now, it’s not. And as a convenient
|
|||
|
side-effect, the lack of a normative definition leaves partial duck
|
|||
|
arrays room to experiment.
|
|||
|
|
|||
|
But, we do provide some more detailed advice for duck array
|
|||
|
implementers and consumers below.
|
|||
|
|
|||
|
Principle 3: Focus on protocols
|
|||
|
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|
|||
|
|
|||
|
Historically, numpy has had lots of success at interoperating with
|
|||
|
third-party objects by defining *protocols*, like ``__array__`` (asks
|
|||
|
an arbitrary object to convert itself into an array),
|
|||
|
``__array_interface__`` (a precursor to Python’s buffer protocol), and
|
|||
|
``__array_ufunc__`` (allows third-party objects to support ufuncs like
|
|||
|
``np.exp``).
|
|||
|
|
|||
|
`NEP 16 <https://github.com/numpy/numpy/pull/10706>`_ took a
|
|||
|
different approach: we need a duck-array equivalent of
|
|||
|
``asarray``, and it proposed to do this by defining a version of
|
|||
|
``asarray`` that would let through objects which implemented a new
|
|||
|
AbstractArray ABC. As noted above, we now think that trying to define
|
|||
|
an ABC is a bad idea for other reasons. But when this NEP was
|
|||
|
discussed on the mailing list, we realized that even on its own
|
|||
|
merits, this idea is not so great. A better approach is to define a
|
|||
|
*method* that can be called on an arbitrary object to ask it to
|
|||
|
convert itself into a duck array, and then define a version of
|
|||
|
``asarray`` that calls this method.
|
|||
|
|
|||
|
This is strictly more powerful: if an object is already a duck array,
|
|||
|
it can simply ``return self``. It allows more correct semantics: NEP
|
|||
|
16 assumed that ``asarray(obj, dtype=X)`` is the same as
|
|||
|
``asarray(obj).astype(X)``, but this isn’t true. And it supports more
|
|||
|
use cases: if h5py supported sparse arrays, it might want to provide
|
|||
|
an object which is not itself a sparse array, but which can be
|
|||
|
automatically converted into a sparse array. See NEP <XX, to be
|
|||
|
written> for full details.
|
|||
|
|
|||
|
The protocol approach is also more consistent with core Python
|
|||
|
conventions: for example, see the ``__iter__`` method for coercing
|
|||
|
objects to iterators, or the ``__index__`` protocol for safe integer
|
|||
|
coercion. And finally, focusing on protocols leaves the door open for
|
|||
|
partial duck arrays, which can pick and choose which subset of the
|
|||
|
protocols they want to participate in, each of which have well-defined
|
|||
|
semantics.
|
|||
|
|
|||
|
Conclusion: protocols are one honking great idea – let’s do more of
|
|||
|
those.
|
|||
|
|
|||
|
Principle 4: Reuse existing methods when possible
|
|||
|
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|
|||
|
|
|||
|
It’s tempting to try to define cleaned up versions of ndarray methods
|
|||
|
with a more minimal interface to allow for easier implementation. For
|
|||
|
example, ``__array_reshape__`` could drop some of the strange
|
|||
|
arguments accepted by ``reshape`` and ``__array_basic_getitem__``
|
|||
|
could drop all the `strange edge cases
|
|||
|
<http://www.numpy.org/neps/nep-0021-advanced-indexing.html>`__ of
|
|||
|
NumPy’s advanced indexing.
|
|||
|
|
|||
|
But as discussed above, we don’t really know what APIs we need for
|
|||
|
duck-typing ndarray. We would inevitably end up with a very long list
|
|||
|
of new special methods. In contrast, existing methods like ``reshape``
|
|||
|
and ``__getitem__`` have the advantage of already being widely
|
|||
|
used/exercised by libraries that use duck arrays, and in practice, any
|
|||
|
serious duck array type is going to have to implement them anyway.
|
|||
|
|
|||
|
Principle 5: Make it easy to do the right thing
|
|||
|
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|
|||
|
|
|||
|
Making duck arrays work well is going to be a community effort.
|
|||
|
Documentation helps, but only goes so far. We want to make it easy to
|
|||
|
implement duck arrays that do the right thing.
|
|||
|
|
|||
|
One way NumPy can help is by providing mixin classes for implementing
|
|||
|
large groups of related functionality at once.
|
|||
|
``NDArrayOperatorsMixin`` is a good example: it allows for
|
|||
|
implementing arithmetic operators implicitly via the
|
|||
|
``__array_ufunc__`` method. It’s not complete, and we’ll want more
|
|||
|
helpers like that (e.g. for reductions).
|
|||
|
|
|||
|
(We initially thought that the importance of these mixins might be an
|
|||
|
argument for providing an array ABC, since that’s the standard way to
|
|||
|
do mixins in modern Python. But in discussion around NEP 16 we
|
|||
|
realized that partial duck arrays also wanted to take advantage of
|
|||
|
these mixins in some cases, so even if we did have an array ABC then
|
|||
|
the mixins would still need some sort of separate existence. So never
|
|||
|
mind that argument.)
|
|||
|
|
|||
|
Tentative duck array guidelines
|
|||
|
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|||
|
|
|||
|
As a general rule, libraries using duck arrays should insist upon the
|
|||
|
minimum possible requirements, and libraries implementing duck arrays
|
|||
|
should provide as complete of an API as possible. This will ensure
|
|||
|
maximum compatibility. For example, users should prefer to rely on
|
|||
|
``.transpose()`` rather than ``.swapaxes()`` (which can be implemented
|
|||
|
in terms of transpose), but duck array authors should ideally
|
|||
|
implement both.
|
|||
|
|
|||
|
If you are trying to implement a duck array, then you should strive to
|
|||
|
implement everything. You certainly need ``.shape``, ``.ndim`` and
|
|||
|
``.dtype``, but also your dtype attribute should actually be a
|
|||
|
``numpy.dtype`` object, weird fancy indexing edge cases should ideally
|
|||
|
work, etc. Only details related to NumPy’s specific ``np.ndarray``
|
|||
|
implementation (e.g., ``strides``, ``data``, ``view``) are explicitly
|
|||
|
out of scope.
|
|||
|
|
|||
|
A (very) rough sketch of future plans
|
|||
|
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|||
|
|
|||
|
The proposals discussed so far – ``__array_ufunc__`` and some kind of
|
|||
|
``asarray`` protocol – are clearly necessary but not sufficient for
|
|||
|
full duck typing support. We expect the need for additional protocols
|
|||
|
to support (at least) these features:
|
|||
|
|
|||
|
* **Concatenating** duck arrays, which would be used internally by other
|
|||
|
array combining methods like stack/vstack/hstack. The implementation
|
|||
|
of concatenate will need to be negotiated among the list of array
|
|||
|
arguments. We expect to use an ``__array_concatenate__`` protocol
|
|||
|
like ``__array_ufunc__`` instead of multiple dispatch.
|
|||
|
* **Ufunc-like functions** that currently aren’t ufuncs. Many NumPy
|
|||
|
functions like median, percentile, sort, where and clip could be
|
|||
|
written as generalized ufuncs but currently aren’t. Either these
|
|||
|
functions should be written as ufuncs, or we should consider adding
|
|||
|
another generic wrapper mechanism that works similarly to ufuncs but
|
|||
|
makes fewer guarantees about how the implementation is done.
|
|||
|
* **Random number generation** with duck arrays, e.g.,
|
|||
|
``np.random.randn()``. For example, we might want to add new APIs
|
|||
|
like ``random_like()`` for generating new arrays with a matching
|
|||
|
shape *and* type – though we'll need to look at some real examples
|
|||
|
of how these functions are used to figure out what would be helpful.
|
|||
|
* **Miscellaneous other functions** such as ``np.einsum``,
|
|||
|
``np.zeros_like``, and ``np.broadcast_to`` that don’t fall into any
|
|||
|
of the above categories.
|
|||
|
* **Checking mutability** on duck arrays, which would imply that they
|
|||
|
support assignment with ``__setitem__`` and the out argument to
|
|||
|
ufuncs. Many otherwise fine duck arrays are not easily mutable (for
|
|||
|
example, because they use some kinds of sparse or compressed
|
|||
|
storage, or are in read-only shared memory), and it turns out that
|
|||
|
frequently-used code like the default implementation of ``np.mean``
|
|||
|
needs to check this (to decide whether it can re-use temporary
|
|||
|
arrays).
|
|||
|
|
|||
|
We intentionally do not describe exactly how to add support for these
|
|||
|
types of duck arrays here. These will be the subject of future NEPs.
|
|||
|
|
|||
|
|
|||
|
Copyright
|
|||
|
---------
|
|||
|
|
|||
|
This document has been placed in the public domain.
|