Conversation
- Use start, stop, step terms
- Make RangeIndex.__init__ private and more flexible, add RangeIndex.arange and RangeIndex.linspace public factories
- General support of RangeIndex slicing
- RangeIndex.isel with arbitrary 1D values: convert to PandasIndex
- Add RangeIndex.to_pandas_index
... when check_default_indexes=False.
I've made further progress on this. Some design questions (thoughts welcome!):

Create a new RangeIndex:
import xarray as xr
from xarray.indexes import RangeIndex
index = RangeIndex.arange(0.0, 1.0, 0.1, coord_name="x", dim="x")
ds = xr.Dataset(coords=xr.Coordinates.from_xindex(index))
Index import: Should we expose all public built-in Xarray indexes at the top level, or only in `xarray.indexes`? Currently the …
Note: this Xarray RangeIndex is designed for floating-point value ranges. For integer ranges it is probably best to use a PandasIndex wrapping a `pandas.RangeIndex`.
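For illustration, a minimal sketch of the integer case in plain pandas (wrapping it into an Xarray index is shown further down the thread):

```python
import pandas as pd

# pandas.RangeIndex stores only start/stop/step, like Python's range,
# so it is memory-efficient for integer ranges
idx = pd.RangeIndex(start=0, stop=10, step=2)

print(list(idx))                       # materialized values
print(idx.start, idx.stop, idx.step)   # 0 10 2
```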
xarray/indexes/range_index.py (outdated):

    dim : str
        Dimension name.
    start : float, optional
        Start of interval (default: 0.0). The interval includes this value.
Could consider adding a closed kwarg like pd.Interval, but in a future PR of course.
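For reference, a quick sketch of how `closed` behaves on `pd.Interval` (the kwarg the comment refers to); whether a future `RangeIndex` counterpart would adopt the same semantics is an open question:

```python
import pandas as pd

# "left" means the interval includes its left endpoint and excludes the right
iv = pd.Interval(0.0, 1.0, closed="left")

print(0.0 in iv)  # True: left endpoint included
print(1.0 in iv)  # False: right endpoint excluded
```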
        "`Coordinates.from_xindex()`"
    )

    @property
Can these all be cached_property?
Would there be much benefit of caching those simple aliases to attributes of the underlying transform?
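A quick stdlib comparison of the two options being discussed; for attributes that merely alias an underlying object, caching mostly saves one attribute lookup per access, at the cost of staleness (a hypothetical `Transform` stands in for the real one):

```python
from functools import cached_property


class Transform:
    # hypothetical stand-in for the underlying transform object
    def __init__(self, start: float):
        self.start = start


class Index:
    def __init__(self, transform: Transform):
        self.transform = transform

    @property
    def start_plain(self) -> float:
        # re-evaluated on every access
        return self.transform.start

    @cached_property
    def start_cached(self) -> float:
        # computed once, then stored in the instance __dict__;
        # goes stale if self.transform is replaced afterwards
        return self.transform.start


idx = Index(Transform(0.0))
print(idx.start_plain, idx.start_cached)  # 0.0 0.0
idx.transform = Transform(5.0)
print(idx.start_plain, idx.start_cached)  # 5.0 0.0 (cached value is stale)
```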
Make `dim` a required keyword argument and `coord_name` an optional keyword argument (defaults to `dim`).
    dtype : dtype, optional
        The dtype of the coordinate variable (default: float64).

    Examples
Suggested change:

    Note that all `start`, `stop` & `step` must be passed, which is more explicit than `np.arange` or `range`

    Examples
(optional, no strong view)
> Note that all `start`, `stop` & `step` must be passed
This isn't exactly true, but yes the API here is more explicit than np.arange and range, e.g., RangeIndex.arange(10.0) means start=10 while np.arange(10.0) means stop=10.
RangeIndex.arange(10.0) doesn't make much sense, though, considering the default value of stop=1.0. I'll see if we can get closer to np.arange using typing.overload.
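The "first argument changes meaning" logic could look roughly like this (a hypothetical helper, not the actual implementation):

```python
def arange_bounds(start, stop=None, step=1.0):
    """Mimic np.arange's first-argument shift: with a single value,
    interpret it as ``stop`` (hypothetical sketch)."""
    if stop is None:
        start, stop = 0.0, start
    return start, stop, step


print(arange_bounds(10.0))       # (0.0, 10.0, 1.0): single value -> stop
print(arange_bounds(2.0, 10.0))  # (2.0, 10.0, 1.0): two values -> start, stop
# The caveat from the thread: arange_bounds(start=10.0) also yields
# (0.0, 10.0, 1.0), because the function cannot tell how the value was passed.
```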
> `RangeIndex.arange(10.0)` doesn't make much sense, though, considering the default value of stop=1.0. I'll see if we can get closer to `np.arange` using `typing.overload`.
Yeah, no objection to the more explicit approach; it's useful-but-a-bit-magic that arange / range changes the meaning of the first arg based on how many are supplied.
Mimicking numpy.arange behavior is surprisingly difficult! (at least for me; I've been struggling with this).
I got it close with some simple logic, but then I hit the same issue as numpy/numpy#17878 (i.e., RangeIndex.arange(start=10) returns a range in the [0, 10) interval, which makes no sense). I could fix it with some heavy refactoring, but that makes the code / API ugly to the point that I'm not sure I want to push it here :).
Numpy relies on the Python C API PyArg_ParseTupleAndKeywords(), but AFAIK there seems to be no easy way to know from Python whether a value has been passed as a positional or keyword argument.
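Indeed, there is no Python-level way to tell positional from keyword. The closest workaround is a sentinel default, which only distinguishes "passed" from "not passed":

```python
_UNSET = object()  # sentinel: distinguishes "not passed" from any real value


def describe(start=_UNSET, stop=_UNSET):
    return {
        "start_given": start is not _UNSET,
        "stop_given": stop is not _UNSET,
    }


# Both calls look identical from inside the function, which is exactly
# why np.arange-style first-argument shifting can't be replicated cleanly.
print(describe(10.0))
print(describe(start=10.0))
```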
!!
I think the explicit approach is very valid, but maybe we just call it out / ensure people need to pass kwargs where it could be confusing
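Requiring kwargs is easy to enforce with a keyword-only signature (a sketch of the idea, not the API ultimately chosen in the PR):

```python
def arange(*, start, stop, step):
    # the bare ``*`` makes all three parameters keyword-only,
    # so no call can rely on positional order
    return (start, stop, step)


print(arange(start=0.0, stop=1.0, step=0.1))
# arange(0.0, 1.0, 0.1) would raise TypeError: takes 0 positional arguments
```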
In 4a128d0 I've chosen to mimic the numpy and pandas APIs anyway, with some simple logic and by clearly documenting the caveat above.
I find myself (likely others too) writing range(10) or np.arange(10.0) all the time, while I doubt many will write something like np.arange(start=10).
pandas.RangeIndex(start=10) actually still returns RangeIndex(start=0, stop=10, step=1), and I haven't seen anyone complaining in the pandas issues (or I missed it).
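This pandas behavior is easy to check:

```python
import pandas as pd

# a lone value is interpreted as ``stop``, even when passed as ``start=``
idx = pd.RangeIndex(start=10)

print(idx)  # RangeIndex(start=0, stop=10, step=1)
```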
Hope it's alright to chime in here.
One use case for …: e.g. say you have a regular grid with … (I understand for integer ranges one wants a …)
@wpbonelli I don't think this is well documented, but you could do:

import numpy as np
import pandas as pd
import xarray as xr

da = xr.DataArray(np.zeros((5, 10)), dims=("rows", "cols"))
da.coords["r"] = ("rows", pd.RangeIndex(da.sizes["rows"]))
da = da.set_xindex("r")
da
# <xarray.DataArray (rows: 5, cols: 10)> Size: 400B
# array([[0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
#        [0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
#        [0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
#        [0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
#        [0., 0., 0., 0., 0., 0., 0., 0., 0., 0.]])
# Coordinates:
#   * r        (rows) int64 40B 0 1 2 3 4
# Dimensions without coordinates: rows, cols

da.xindexes["r"]
# PandasIndex(RangeIndex(start=0, stop=5, step=1, name='r'))

This relies on the fact that Xarray internally keeps track of the … A more explicit way of achieving the same result:

from xarray.indexes import PandasIndex

r_index = PandasIndex(pd.RangeIndex(da.sizes["rows"], name="r"), dim="rows")
da = da.assign_coords(xr.Coordinates.from_xindex(r_index))
With one caveat (also in pandas): `RangeIndex.arange(4.0)` creates an index within the range [0.0, 4.0) (`start` is interpreted as `stop`). This caveat is documented.
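The caveat mirrors numpy's behavior, which may make it less surprising in practice:

```python
import numpy as np

# a single positional value is interpreted as ``stop``, not ``start``
print(np.arange(4.0))  # [0. 1. 2. 3.]
```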
This is ready for another round of review! I don't think the CI failures are related to anything in this PR.
@benbovy thanks! Sorry to hijack the thread.
* main: (76 commits)
  - Update how-to-add-new-backend.rst (#10240)
  - Support extension array indexes (#9671)
  - Switch documentation to pydata-sphinx-theme (#8708)
  - Bump codecov/codecov-action from 5.4.0 to 5.4.2 in the actions group (#10239)
  - Fix mypy, min-versions CI, xfail Zarr tests (#10255)
  - Remove `test_dask_layers_and_dependencies` (#10242)
  - Fix: Docs generation create temporary files that are not cleaned up. (#10238)
  - opendap / dap4 support for pydap backend (#10182)
  - Add RangeIndex (#10076)
  - Fix mypy (#10232)
  - Fix doctests (#10230)
  - Fix broken Sphinx Roles (#10225)
  - `DatasetView.map` fix `keep_attrs` (#10219)
  - Add datatree repr asv (#10214)
  - CI: Automatic PR labelling is back (#10201)
  - Fixes dimension order in `xarray.Dataset.to_stacked_array` (#10205)
  - Fix references to core classes in docs (#10207)
  - Update pre-commit hooks (#10208)
  - add `scipy-stubs` as extra `[types]` dependency (#10202)
  - Fix sparse dask repr test (#10200)
  - ...
whats-new.rst, api.rst

Work in progress → Ready for review (copied and adapted the example from #9543 (comment)).