[ENH] V1 → V2 API Migration - datasets#1608

Open
JATAYU000 wants to merge 240 commits into openml:main from JATAYU000:dataset_resource

Conversation

@JATAYU000
Contributor

Metadata

@codecov-commenter

codecov-commenter commented Jan 8, 2026

Codecov Report

❌ Patch coverage is 69.17586% with 475 lines in your changes missing coverage. Please review.
✅ Project coverage is 51.56%. Comparing base (da993f7) to head (ea80785).
⚠️ Report is 4 commits behind head on main.

Files with missing lines Patch % Lines
openml/_api/resources/dataset.py 59.41% 194 Missing ⚠️
openml/_config.py 73.86% 75 Missing ⚠️
openml/_api/clients/http.py 71.68% 62 Missing ⚠️
openml/_api/clients/minio.py 57.14% 30 Missing ⚠️
openml/datasets/dataset.py 30.76% 18 Missing ⚠️
openml/datasets/functions.py 31.57% 13 Missing ⚠️
openml/_api/resources/base/versions.py 84.41% 12 Missing ⚠️
openml/_api/setup/backend.py 80.70% 11 Missing ⚠️
openml/cli.py 0.00% 11 Missing ⚠️
openml/_api/resources/base/base.py 71.42% 10 Missing ⚠️
... and 9 more
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1608      +/-   ##
==========================================
- Coverage   53.09%   51.56%   -1.53%     
==========================================
  Files          37       61      +24     
  Lines        4362     5360     +998     
==========================================
+ Hits         2316     2764     +448     
- Misses       2046     2596     +550     

☔ View full report in Codecov by Sentry.

@JATAYU000
Contributor Author

JATAYU000 commented Jan 9, 2026

FYI @geetu040: currently the get_dataset() function has three download requirements:

  • download_data : uses api_calls._download_minio_bucket() to download all the files in the bucket when the download_all_files parameter is True, and api_calls._download_minio_file() to download the dataset.pq file when it is not found in the cache. When the parquet download fails, it falls back to downloading the dataset.arff file via a GET request.
  • download_features : if feature_file is passed via init, the features are extracted during initialization; otherwise a GET request is made and the XML is cached.
  • download_qualities : if qualities_file is passed via init, the qualities are extracted during initialization; otherwise a GET request is made and the XML is cached.
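For illustration, here is a self-contained sketch of the parquet-then-ARFF fallback in the first bullet. The helper names below are stand-ins, not the real api_calls functions:

```python
from pathlib import Path

# Hypothetical stand-ins for api_calls._download_minio_file and the
# ARFF GET request; names and signatures are illustrative only.
def _download_parquet(cache_dir: Path) -> Path:
    raise IOError("minio unavailable")  # simulate a failed parquet download

def _download_arff(cache_dir: Path) -> Path:
    path = cache_dir / "dataset.arff"
    path.touch()  # stand-in for the actual GET request + write
    return path

def download_data(cache_dir: Path) -> Path:
    """Try the cached/downloaded parquet file first; fall back to ARFF."""
    pq_path = cache_dir / "dataset.pq"
    if pq_path.exists():  # cache hit: nothing to download
        return pq_path
    try:
        return _download_parquet(cache_dir)
    except IOError:  # parquet download failed -> ARFF fallback
        return _download_arff(cache_dir)
```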

Issues:

  • The data files (.pq and .arff) are common across versions, so it does not make sense to download them multiple times.
  • Path handling for downloads needs to return the path, especially for the data files. As mentioned in the meet, I can try a download-specific class that uses the cache mixin and is inherited only by the dataset resource.
  • The current implementation in OpenMLDataset has v1-specific parsing, which in my opinion should go through the current interface (api_context) instead.
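A minimal sketch of the download-specific class idea from the second bullet, assuming a simple cache-path mixin (class names and the cache layout here are illustrative, not the real openml internals):

```python
from pathlib import Path

class CacheMixin:
    """Minimal cache-path helper; the real mixin lives elsewhere."""
    cache_root = Path("cache")

    def cache_path(self, *parts: str) -> Path:
        return self.cache_root.joinpath(*parts)

class DownloadMixin(CacheMixin):
    """Download helpers inherited only by the dataset resource, so the
    shared data files land in one place and are returned as paths."""

    def data_file_path(self, dataset_id: int, filename: str) -> Path:
        # data files (.pq/.arff) keyed by dataset id only, not by version
        return self.cache_path("datasets", str(dataset_id), filename)

class DatasetResource(DownloadMixin):
    pass
```

This keeps the return-a-path behavior local to datasets without touching the generic client methods.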

Example:

The current load_features() (ref link) calls a function that downloads the features file, returns its path, and then parses from that path.
This can be changed by changing that function's definition (ref link) to get -> parse -> return features instead of file paths:

def _get_dataset_features_file(did_cache_dir: str | Path | None, dataset_id: int) -> dict[int, OpenMLDataFeature]:
    ...
    return _features
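For context, a self-contained sketch of that get -> parse -> return shape; the XML schema is simplified and plain dicts stand in for OpenMLDataFeature:

```python
import xml.etree.ElementTree as ET

def parse_dataset_features(features_xml: str) -> dict[int, dict]:
    """Parse a features XML payload and return features keyed by index,
    rather than writing the XML to disk and returning a file path."""
    root = ET.fromstring(features_xml)
    features = {}
    for node in root.iter("feature"):
        idx = int(node.findtext("index"))
        features[idx] = {
            "name": node.findtext("name"),
            "data_type": node.findtext("data_type"),
        }
    return features
```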

Or by updating the Dataset class to use the underlying interface method from api_context directly:

def _load_features(self) -> None:
    ...
    self._features = api_context.backend.datasets.get_features(self.dataset_id)

Another option is to add a return_path parameter to client requests, but in my opinion that would be wasteful: it adds a parameter to every client method for the sake of the dataset resource alone, and the same result can be achieved without it, as described above.

Collaborator

@geetu040 geetu040 left a comment


please sync with base PR and update with these comments #1576 (comment)

Copilot AI review requested due to automatic review settings February 26, 2026 10:11

Copilot AI left a comment


Pull request overview

Copilot reviewed 51 out of 52 changed files in this pull request and generated 10 comments.



Copilot AI review requested due to automatic review settings February 26, 2026 11:26

Copilot AI left a comment


Pull request overview

Copilot reviewed 51 out of 52 changed files in this pull request and generated 3 comments.



@JATAYU000
Contributor Author

@geetu040 _dataset_file_is_downloaded uses the cache path to check for description.xml, features.xml, etc.
It is used by multiple tests via _dataset_description_is_downloaded(id), and those tests are now failing. What should be done about it?

9 participants