
Changing the Dataset Class #9

@treigerm

Description


I've been thinking a bit more about how to add the remaining data sources to the data-loader repository. I think we want to slightly adjust our downloading code to account for the fact that the remaining data sources don't necessarily have convenient mechanisms for treating them like Zarr datasets (e.g. the problems we encountered in #7). All of them can usually be downloaded as NetCDF files, though.

I think we can easily adjust the current folder structure to accommodate this. Instead of having a download.zarr folder, each dataset directory would simply have a download folder. The two types of dataset directory would then look like this:

dataset_name1/
	download/
		download.zarr
	standardized.zarr
dataset_name2/
	download/
		file1.nc
		file2.nc
		...
	standardized.zarr
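As a sanity check, the two layouts can be told apart purely from the contents of download/. A small helper along these lines (hypothetical, not part of the proposal itself) illustrates the distinction:

```python
from pathlib import Path


def download_kind(download_dir: Path) -> str:
    """Classify a dataset's download directory by its contents.

    Hypothetical helper for illustration: returns "zarr" if the
    directory holds a download.zarr store, "netcdf" if it holds
    NetCDF files.
    """
    if (download_dir / "download.zarr").exists():
        return "zarr"
    if any(download_dir.glob("*.nc")):
        return "netcdf"
    raise ValueError(f"unrecognized download layout in {download_dir}")
```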

We can adjust the Dataset class slightly to accommodate the new structure by explicitly adding a download function:

class ExDataset(Dataset):
	name = "example"

	@staticmethod
	def download(download_dir: Path):
		...

	@staticmethod
	def open(download_dir: Path) -> xr.Dataset:
		...
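For instance, a data source that is only reachable over HTTP as a collection of NetCDF files might implement download roughly like this. This is a sketch with made-up URLs and a made-up class name, not a real data source:

```python
import urllib.request
from pathlib import Path


class HTTPNetCDFDataset:
    name = "http_netcdf_example"

    # Hypothetical file list; a real data source would define its own.
    _urls = [
        "https://example.org/data/file1.nc",
        "https://example.org/data/file2.nc",
    ]

    @staticmethod
    def download(download_dir: Path):
        download_dir.mkdir(parents=True, exist_ok=True)
        for url in HTTPNetCDFDataset._urls:
            target = download_dir / url.rsplit("/", 1)[-1]
            # Skip files already fetched by an interrupted earlier run.
            if not target.exists():
                urllib.request.urlretrieve(url, target)
```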

The download function can then be customized to the data source as necessary. Crucially, with this design we don't need to interpret a remote dataset, which might only be accessible over HTTP as a collection of NetCDF files, as a Zarr array. Instead, we can download the dataset in its original form and convert it into a Zarr array once the whole dataset is on disk.
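The matching open step for a NetCDF-based source could then lazily combine the downloaded files. A sketch, assuming the files share dimensions and can be concatenated by coordinates (and that dask is installed, which xr.open_mfdataset requires):

```python
from pathlib import Path

import xarray as xr


def open_netcdf_download(download_dir: Path) -> xr.Dataset:
    # Lazily combine all downloaded NetCDF files into one dataset;
    # the conversion to Zarr happens later in the shared pipeline code.
    files = sorted(download_dir.glob("*.nc"))
    return xr.open_mfdataset(files, combine="by_coords")
```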

The open_downloaded_canonicalized_dataset function then only needs a minor update:

def open_downloaded_canonicalized_dataset(
    cls: type[Dataset],
    basepath: Path = Path(),
    progress: bool = True,
) -> xr.Dataset:
    datasets = basepath / "datasets"

    # Fetch the dataset in its original form (Zarr or NetCDF files),
    # unless a previous run has already done so.
    download = datasets / cls.name / "download"
    if not download.exists():
        cls.download(download)

    # Standardize once: open the download, canonicalize it, and write
    # it out as a Zarr store.
    standardized = datasets / cls.name / "standardized.zarr"
    if not standardized.exists():
        ds = cls.open(download)
        ds = canon.canonicalize_dataset(ds)

        with monitor.progress_bar(progress):
            ds.to_zarr(standardized, encoding=dict(), compute=False).compute()

    # Always read back lazily from the standardized Zarr store.
    return xr.open_dataset(standardized, chunks=dict(), engine="zarr")

Does this seem like a reasonable path forward?
