Is your feature request related to a problem?
In #9077 (comment) I suggested the idea of a function which could open any netCDF file with groups as a dictionary mapping group path strings to xr.Dataset objects.
The motivation is as follows:
People want the new xarray.DataTree class to support inheriting coordinates from parent groups,
This can only be done if the coordinates align with the variables in the child group (i.e. using xr.align),
The best time to enforce this alignment is at DataTree construction time,
This requirement is not enforced in the netCDF/Zarr model, so this would mean some files can no longer be opened by open_datatree directly, as doing so would raise an alignment error,
But we still really want users to have some way to open an arbitrary file with xarray and see what's inside (including displaying all the groups; see #4840, "Opening a dataset doesn't display groups").
A simpler intermediate structure of a dictionary mapping group paths to xarray.Dataset objects doesn't enforce alignment, so can represent any file.
We should add a new opening function to allow any file to be opened as this dict-of-datasets structure.
Users can then use this to inspect "untidy" data, and make changes to the dict returned before creating an aligned DataTree object via DataTree.from_dict if they like.
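To make the reasoning above concrete, here is a minimal pure-Python sketch of the inspect-then-fix workflow. It is a toy model, not xarray: each group is just a dict of coordinate values, "alignment" is assumed to mean exact equality of coordinates shared with the immediate parent, and `parent_path` and `check_alignment` are hypothetical helper names invented for this illustration.

```python
from __future__ import annotations

# Toy model: each "group" maps coordinate names to lists of values.
# Real alignment via xr.align is richer; exact equality of shared
# coordinates is an assumption made purely for illustration.


def parent_path(path: str) -> str | None:
    """Return the parent group path, or None for the root group."""
    if path == "/":
        return None
    head = path.rsplit("/", 1)[0]
    return head or "/"


def check_alignment(groups: dict[str, dict[str, list]]) -> list[str]:
    """Return paths of groups whose coordinates conflict with their parent's."""
    bad = []
    for path, coords in groups.items():
        parent = parent_path(path)
        if parent is None or parent not in groups:
            continue
        shared = coords.keys() & groups[parent].keys()
        if any(coords[name] != groups[parent][name] for name in shared):
            bad.append(path)
    return sorted(bad)


# A flat dict can hold every group, aligned or not.
groups = {
    "/": {"time": [0, 1, 2]},
    "/model": {"time": [0, 1, 2]},  # matches the parent
    "/obs": {"time": [0, 1]},       # conflicts with the parent: "untidy"
}

misaligned = check_alignment(groups)

# Fix the offending group by hand; only then would a tree be built.
groups["/obs"]["time"] = [0, 1, 2]
still_misaligned = check_alignment(groups)
```

The point of the sketch is only that the flat dict never rejects anything, so the check (and any repair) can happen after opening rather than during it.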
Describe the solution you'd like
Add a function like this:
def open_dict_of_datasets(
    filename_or_obj: str | os.PathLike[Any] | BufferedIOBase | AbstractDataStore,
    engine: T_Engine = None,
    group: Optional[str] = None,
    **kwargs,
) -> dict[str, Dataset]:
    """
    Open and decode a file or file-like object, creating a dictionary containing one xarray Dataset for each group in the file.

    Useful when you have e.g. a netCDF file containing many groups, some of which are not alignable with their parents and so the file cannot be opened directly with ``open_datatree``.

    It is encouraged to use this function to inspect your data, then make the necessary changes to make the structure coercible to a `DataTree` object before calling `DataTree.from_dict()` and proceeding with your analysis.

    Parameters
    ----------
    filename_or_obj : str, Path, file-like, or DataStore
        Strings and Path objects are interpreted as a path to a netCDF file or Zarr store.
    engine : str, optional
        Xarray backend engine to use. Valid options include `{"netcdf4", "h5netcdf", "zarr"}`.
    group : str, optional
        Group to use as the root group to start reading from. Groups above this root group will not be included in the output.
    **kwargs : dict
        Additional keyword arguments passed to :py:func:`~xarray.open_dataset` for each group.

    Returns
    -------
    dict[str, xarray.Dataset]

    See Also
    --------
    open_datatree()
    DataTree.from_dict()
    """
    ...
This would live inside backends/api.py, and be exposed publicly as a top-level function alongside open_datatree, DataTree, etc. as part of #9033.
The actual implementation could re-use the code from #9014 for efficiently opening many groups of the same file. Indeed we could add an open_dict_of_datasets method to the BackendEntryPoint class, which uses pretty much the same code as the existing open_datatree method added in #9014 but just doesn't actually create a DataTree object.
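As a rough illustration of the "same traversal, no DataTree" idea, the sketch below walks a nested group hierarchy and collects one entry per group, keyed by its path. Everything here is an assumption made for illustration: the nested dict stands in for a real file, plain dicts of variables stand in for Dataset objects, and `iter_groups` is a hypothetical helper, not part of xarray or of #9014.

```python
from __future__ import annotations

from typing import Any, Iterator


def iter_groups(
    node: dict[str, Any], path: str = "/"
) -> Iterator[tuple[str, dict[str, Any]]]:
    """Yield (path, variables) for a group and, recursively, its subgroups."""
    yield path, node.get("variables", {})
    for name, child in node.get("groups", {}).items():
        # Avoid a double slash when the current group is the root.
        child_path = path.rstrip("/") + "/" + name
        yield from iter_groups(child, child_path)


# Hypothetical stand-in for a file with nested groups.
fake_file = {
    "variables": {"time": [0, 1, 2]},
    "groups": {
        "obs": {
            "variables": {"time": [0, 1]},
            "groups": {"raw": {"variables": {"counts": [5, 7]}}},
        },
    },
}

# The flat result: one entry per group, with no alignment check and no tree built.
groups_by_path = dict(iter_groups(fake_file))
```

A real backend method would open each group with the backend's own machinery instead of reading from a nested dict, but the shape of the output (group path to per-group contents) would be the same.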
Describe alternatives you've considered
Really the main alternative to this is not to have coordinate inheritance in DataTree at all (see #9077), in which case open_datatree would be sufficient to open any file.
The name of the function is up for debate. I prefer nothing with the word "datatree" in it since this doesn't actually create a DataTree object at any point. (In fact we could and perhaps should have implemented this function years ago, even without the new DataTree class.) The reason for not calling it "open_as_dict_of_datasets" is that we don't use "as" in the existing open_dataset/open_dataarray etc.
Additional context
cc @eni-awowale @flamingbear @owenlittlejohns @keewis @shoyer @autydp