Skip to content

[ENH] Simplified Unified Get/List API#1552

Open
Omswastik-11 wants to merge 14 commits intoopenml:mainfrom
Omswastik-11:prototype-api
Open

[ENH] Simplified Unified Get/List API#1552
Omswastik-11 wants to merge 14 commits intoopenml:mainfrom
Omswastik-11:prototype-api

Conversation

@Omswastik-11
Copy link
Contributor

@Omswastik-11 Omswastik-11 commented Dec 23, 2025

Redundant syntax for getters

In getters, syntax is repeated and redundant, mainly through
the submodule having to be imported or addressed.

import openml

# List all datasets and their properties
openml.datasets.list_datasets(output_format="dataframe")

# Get dataset by ID
dataset = openml.datasets.get_dataset(61)

# Get dataset by name
dataset = openml.datasets.get_dataset('Fashion-MNIST')

# This is similar for flows, runs, studies, such as

study = openml.studies.get_study(42)
flow = openml.flows.get_flows(42)

API implementation

import openml

# List all datasets
datasets_df = openml.list_all("dataset", output_format="dataframe")

# Get dataset by ID
dataset = openml.get("dataset", 61)

# Get dataset by name
dataset = openml.get("dataset", "Fashion-MNIST")

# Get task
task = openml.get("task", 31)

# Get flow
flow = openml.get("flow", 10)

# Get run
run = openml.get("run", 20)

# Shortcut: infer dataset from name when no type specified
dataset = openml.get("Fashion-MNIST")

Implementation Details

  • Added openml.list_all(object_type: str, **kwargs) -> Any, a dispatcher that forwards to:

    • list_datasets
    • list_tasks
    • list_flows
    • list_runs
  • Added openml.get(object_type_or_name, identifier=None, **kwargs) -> Any, a unified getter with support for:

    • Type-based lookup

      openml.get("dataset", 61)
      openml.get("dataset", "dataset_name")
    • Name-only shortcut for datasets

      openml.get("Fashion-MNIST")
  • Exported both functions via __all__ and documented them with docstrings.

  • Preserved full backward compatibility:

    • Existing submodule APIs (e.g., openml.datasets.get_dataset) remain unchanged.
  • Added unit tests to validate dispatcher behavior without requiring network access.

@Omswastik-11 Omswastik-11 marked this pull request as ready for review December 24, 2025 10:22
@Omswastik-11 Omswastik-11 changed the title [ENH] improved the Getter API for users [ENH] Simplified Unified Get/List API Dec 24, 2025
@Omswastik-11 Omswastik-11 requested a review from fkiraly December 25, 2025 08:30
@codecov-commenter
Copy link

codecov-commenter commented Dec 30, 2025

Codecov Report

❌ Patch coverage is 50.00000% with 14 lines in your changes missing coverage. Please review.
✅ Project coverage is 52.12%. Comparing base (7feb2a3) to head (59bd74f).

Files with missing lines Patch % Lines
openml/dispatchers.py 48.14% 14 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1552      +/-   ##
==========================================
- Coverage   52.82%   52.12%   -0.70%     
==========================================
  Files          37       38       +1     
  Lines        4371     4399      +28     
==========================================
- Hits         2309     2293      -16     
- Misses       2062     2106      +44     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Copilot AI review requested due to automatic review settings February 26, 2026 10:50
Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR introduces a simplified unified API for getting and listing OpenML objects through two new dispatcher functions: openml.get() and openml.list_all(). These functions provide a more concise alternative to the existing submodule-specific APIs (e.g., openml.datasets.get_dataset()) while maintaining full backward compatibility. The implementation uses dispatch dictionaries to route calls to the appropriate underlying functions.

Changes:

  • Added unified get() and list_all() dispatcher functions for datasets, tasks, flows, and runs
  • Updated example tutorials to demonstrate the new API with legacy alternatives shown in comments
  • Added unit tests for the dispatcher functions using mocking

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 7 comments.

Show a summary per file
File Description
openml/dispatchers.py New module containing the unified get() and list_all() dispatcher functions with type checking and error handling
openml/init.py Exports the new dispatcher functions and module; contains duplicate entries in __all__ list
tests/test_openml/test_openml.py Unit tests for dispatchers covering datasets and tasks; missing coverage for flows, runs, and error conditions
examples/Basics/simple_tasks_tutorial.py Updated to demonstrate new openml.get() API with legacy alternative in comments
examples/Basics/simple_flows_and_runs_tutorial.py Updated to demonstrate both openml.list_all() and openml.get() APIs
examples/Basics/simple_datasets_tutorial.py Updated to demonstrate both openml.list_all() and openml.get() APIs
examples/Advanced/tasks_tutorial.py Updated multiple examples to demonstrate new openml.list_all() API

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +85 to +88
Raises
------
ValueError
If object_type is not one of the supported types.
Copy link

Copilot AI Feb 26, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The docstring for get states it raises ValueError for unsupported object types, but it also raises TypeError when object_type is not a string (line 98-99). Add TypeError to the "Raises" section of the docstring to accurately document this behavior.

Copilot uses AI. Check for mistakes.
Comment on lines +113 to 115
"__version__",
"_api_calls",
"_api_calls",
Copy link

Copilot AI Feb 26, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Duplicate entries in the __all__ list. Lines 113 and 114 duplicate "version" and "_api_calls" which are already present on lines 112 and 115 respectively. Remove the duplicates on lines 113 and 114.

Suggested change
"__version__",
"_api_calls",
"_api_calls",
"_api_calls",

Copilot uses AI. Check for mistakes.
Comment on lines +68 to +108
def get(identifier: int | str, *, object_type: str = "dataset", **kwargs: Any) -> Any:
"""Get an OpenML object by identifier.

Parameters
----------
identifier : int | str
The ID or name of the object to retrieve.
object_type : str, default="dataset"
The type of object to get. Must be one of 'dataset', 'task', 'flow', 'run'.
**kwargs : Any
Additional arguments passed to the underlying get function.

Returns
-------
Any
The requested OpenML object.

Raises
------
ValueError
If object_type is not one of the supported types.

Examples
--------
>>> openml.get(61) # Get dataset 61 (default object_type="dataset")
>>> openml.get("Fashion-MNIST") # Get dataset by name
>>> openml.get(31, object_type="task") # Get task 31
>>> openml.get(10, object_type="flow") # Get flow 10
>>> openml.get(20, object_type="run") # Get run 20
"""
if not isinstance(object_type, str):
raise TypeError(f"object_type must be a string, got {type(object_type).__name__}")

func = _GET_DISPATCH.get(object_type.lower())
if func is None:
valid_types = ", ".join(repr(k) for k in _GET_DISPATCH)
raise ValueError(
f"Unsupported object_type {object_type!r}; expected one of {valid_types}.",
)

return func(identifier, **kwargs)
Copy link

Copilot AI Feb 26, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The PR description shows an API like openml.get("dataset", 61) and openml.list_all("dataset", output_format="dataframe"), but the actual implementation has get(identifier, *, object_type="dataset", **kwargs) and list_all(object_type, /, **kwargs). This means the correct usage is openml.get(61) (with default object_type="dataset") or openml.get(31, object_type="task"), not openml.get("dataset", 61). The PR description should be updated to accurately reflect the implemented API, or the implementation should be changed to match the description.

Copilot uses AI. Check for mistakes.
get_dataset_mock.assert_called_with("Fashion-MNIST")

openml.get(31, object_type="task")
get_task_mock.assert_called_with(31)
Copy link

Copilot AI Feb 26, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Test coverage is incomplete. While the dispatcher supports "flow" and "run" object types, the tests only cover "dataset" and "task". Add test cases for openml.list_all("flow"), openml.list_all("run"), openml.get(flow_id, object_type="flow"), and openml.get(run_id, object_type="run") to ensure all supported object types are tested.

Suggested change
get_task_mock.assert_called_with(31)
get_task_mock.assert_called_with(31)
def test_list_dispatch_flow_and_run(self):
flow_list_mock = mock.Mock()
run_list_mock = mock.Mock()
# Need to patch after import, so update dispatch dict for flow and run
with mock.patch.dict(
"openml.dispatchers._LIST_DISPATCH",
{
"flow": flow_list_mock,
"run": run_list_mock,
},
):
openml.list_all("flow")
flow_list_mock.assert_called_once_with()
openml.list_all("run", size=10)
run_list_mock.assert_called_once_with(size=10)
def test_get_dispatch_flow_and_run(self):
get_flow_mock = mock.Mock()
get_run_mock = mock.Mock()
# Need to patch after import, so update dispatch dict for flow and run
with mock.patch.dict(
"openml.dispatchers._GET_DISPATCH",
{
"flow": get_flow_mock,
"run": get_run_mock,
},
):
openml.get(5, object_type="flow")
get_flow_mock.assert_called_once_with(5)
openml.get(7, object_type="run")
get_run_mock.assert_called_once_with(7)

Copilot uses AI. Check for mistakes.
Comment on lines +73 to +74
identifier : int | str
The ID or name of the object to retrieve.
Copy link

Copilot AI Feb 26, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The get function accepts identifier: int | str but only get_dataset supports string identifiers (dataset names). The other get functions (get_task, get_flow, get_run) only accept integer IDs. This means calling openml.get("some-name", object_type="task") would fail with a confusing error. Consider either: (1) documenting that only datasets support string identifiers, or (2) adding validation to raise a clear error when a string identifier is used with non-dataset object types, or (3) restricting the type signature based on object_type.

Copilot uses AI. Check for mistakes.
get_dataset_mock.assert_called_with("Fashion-MNIST")

openml.get(31, object_type="task")
get_task_mock.assert_called_with(31)
Copy link

Copilot AI Feb 26, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The tests don't verify error handling for invalid inputs. Add test cases to verify that list_all and get raise appropriate errors for: (1) invalid object types (e.g., openml.list_all("invalid_type")), (2) non-string object types (e.g., openml.list_all(123)), and (3) other edge cases. This ensures the validation logic in lines 55-63 and 98-106 of openml/dispatchers.py is working correctly.

Suggested change
get_task_mock.assert_called_with(31)
get_task_mock.assert_called_with(31)
def test_list_all_invalid_object_type(self):
# Invalid string object type should raise an error
with self.assertRaises((ValueError, KeyError, RuntimeError)):
openml.list_all("invalid_type")
# Non-string object type should raise a type-related error
with self.assertRaises((TypeError, ValueError)):
openml.list_all(123) # type: ignore[arg-type]
def test_get_invalid_object_type(self):
# Invalid string object type should raise an error
with self.assertRaises((ValueError, KeyError, RuntimeError)):
openml.get(1, object_type="invalid_type")
# Non-string object type should raise a type-related error
with self.assertRaises((TypeError, ValueError)):
openml.get(1, object_type=123) # type: ignore[arg-type]

Copilot uses AI. Check for mistakes.
The result from the type-specific list function (typically a DataFrame).

Raises
------
Copy link

Copilot AI Feb 26, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The docstring for list_all states it raises ValueError for unsupported object types, but it also raises TypeError when object_type is not a string (line 55-56). Add TypeError to the "Raises" section of the docstring to accurately document this behavior.

Suggested change
------
------
TypeError
If object_type is not a string.

Copilot uses AI. Check for mistakes.
Copy link
Collaborator

@geetu040 geetu040 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copilot left a few solid review comments above, please take a look and address them.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants