
Changing the Dataset Class #9

@treigerm

Description


I've been thinking a bit more about how to add the remaining data sources to the data-loader repository. I think we want to slightly adjust our downloading code to account for the fact that the remaining data sources don't necessarily have convenient mechanisms for treating them like Zarr datasets (e.g. the problems we encountered in #7). All of them can usually be downloaded as NetCDF files, though.

I think we can easily adjust the current folder structure to accommodate this. Instead of having a download.zarr folder, each dataset directory would simply have a download folder. The two types of dataset directory would then look like this:

dataset_name1/
	download/
		download.zarr
	standardized.zarr
dataset_name2/
	download/
		file1.nc
		file2.nc
		...
	standardized.zarr
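As a sanity check, the two layouts can be told apart purely from the contents of download/. A small helper along these lines (hypothetical, not part of the proposal itself) illustrates the distinction:

```python
from pathlib import Path


def download_kind(download_dir: Path) -> str:
    """Classify a dataset's download directory by its contents.

    Hypothetical helper for illustration: returns "zarr" if the
    directory holds a download.zarr store, "netcdf" if it holds
    NetCDF files.
    """
    if (download_dir / "download.zarr").exists():
        return "zarr"
    if any(download_dir.glob("*.nc")):
        return "netcdf"
    raise ValueError(f"unrecognized download layout in {download_dir}")
```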

We can adjust the Dataset class slightly to accommodate the new structure by explicitly adding a download function:

class ExDataset(Dataset):
	name = "example"

	@staticmethod
	def download(download_dir: Path):
		...

	@staticmethod
	def open(download_dir: Path) -> xr.Dataset:
		...
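For instance, a data source that is only reachable over HTTP as a collection of NetCDF files might implement download roughly like this. This is a sketch with made-up URLs and a made-up class name, not a real data source:

```python
import urllib.request
from pathlib import Path


class HTTPNetCDFDataset:
    name = "http_netcdf_example"

    # Hypothetical file list; a real data source would define its own.
    _urls = [
        "https://example.org/data/file1.nc",
        "https://example.org/data/file2.nc",
    ]

    @staticmethod
    def download(download_dir: Path):
        download_dir.mkdir(parents=True, exist_ok=True)
        for url in HTTPNetCDFDataset._urls:
            target = download_dir / url.rsplit("/", 1)[-1]
            # Skip files already fetched by an interrupted earlier run.
            if not target.exists():
                urllib.request.urlretrieve(url, target)
```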

The download function can then be customized to the data source as necessary. Crucially, with this design we don't need to interpret a remote dataset, which might only be accessible over HTTP as a collection of NetCDF files, as a Zarr array. Instead, we can download the dataset in its original form and convert it into a Zarr array once the whole dataset is on disk.
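The matching open step for a NetCDF-based source could then lazily combine the downloaded files. A sketch, assuming the files share dimensions and can be concatenated by coordinates (and that dask is installed, which xr.open_mfdataset requires):

```python
from pathlib import Path

import xarray as xr


def open_netcdf_download(download_dir: Path) -> xr.Dataset:
    # Lazily combine all downloaded NetCDF files into one dataset;
    # the conversion to Zarr happens later in the shared pipeline code.
    files = sorted(download_dir.glob("*.nc"))
    return xr.open_mfdataset(files, combine="by_coords")
```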

The open_downloaded_canonicalized_dataset function then only needs a minor update:

def open_downloaded_canonicalized_dataset(
    cls: type[Dataset],
    basepath: Path = Path(),
    progress: bool = True,
) -> xr.Dataset:
    datasets = basepath / "datasets"

    # Fetch the dataset in its original form (Zarr or NetCDF files),
    # unless a previous run has already done so.
    download = datasets / cls.name / "download"
    if not download.exists():
        cls.download(download)

    # Standardize once: open the download, canonicalize it, and write
    # it out as a Zarr store.
    standardized = datasets / cls.name / "standardized.zarr"
    if not standardized.exists():
        ds = cls.open(download)
        ds = canon.canonicalize_dataset(ds)

        with monitor.progress_bar(progress):
            ds.to_zarr(standardized, encoding=dict(), compute=False).compute()

    # Always read back lazily from the standardized Zarr store.
    return xr.open_dataset(standardized, chunks=dict(), engine="zarr")

Does this seem like a reasonable path forward?
