[ENH] Simplified Unified Get/List API#1552
[ENH] Simplified Unified Get/List API#1552Omswastik-11 wants to merge 14 commits intoopenml:mainfrom
Conversation
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #1552 +/- ##
==========================================
- Coverage 52.82% 52.12% -0.70%
==========================================
Files 37 38 +1
Lines 4371 4399 +28
==========================================
- Hits 2309 2293 -16
- Misses 2062 2106 +44 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
There was a problem hiding this comment.
Pull request overview
This PR introduces a simplified unified API for getting and listing OpenML objects through two new dispatcher functions: openml.get() and openml.list_all(). These functions provide a more concise alternative to the existing submodule-specific APIs (e.g., openml.datasets.get_dataset()) while maintaining full backward compatibility. The implementation uses dispatch dictionaries to route calls to the appropriate underlying functions.
Changes:
- Added unified
get()andlist_all()dispatcher functions for datasets, tasks, flows, and runs - Updated example tutorials to demonstrate the new API with legacy alternatives shown in comments
- Added unit tests for the dispatcher functions using mocking
Reviewed changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated 7 comments.
Show a summary per file
| File | Description |
|---|---|
| openml/dispatchers.py | New module containing the unified get() and list_all() dispatcher functions with type checking and error handling |
| openml/init.py | Exports the new dispatcher functions and module; contains duplicate entries in __all__ list |
| tests/test_openml/test_openml.py | Unit tests for dispatchers covering datasets and tasks; missing coverage for flows, runs, and error conditions |
| examples/Basics/simple_tasks_tutorial.py | Updated to demonstrate new openml.get() API with legacy alternative in comments |
| examples/Basics/simple_flows_and_runs_tutorial.py | Updated to demonstrate both openml.list_all() and openml.get() APIs |
| examples/Basics/simple_datasets_tutorial.py | Updated to demonstrate both openml.list_all() and openml.get() APIs |
| examples/Advanced/tasks_tutorial.py | Updated multiple examples to demonstrate new openml.list_all() API |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| Raises | ||
| ------ | ||
| ValueError | ||
| If object_type is not one of the supported types. |
There was a problem hiding this comment.
The docstring for get states it raises ValueError for unsupported object types, but it also raises TypeError when object_type is not a string (line 98-99). Add TypeError to the "Raises" section of the docstring to accurately document this behavior.
| "__version__", | ||
| "_api_calls", | ||
| "_api_calls", |
There was a problem hiding this comment.
Duplicate entries in the __all__ list. Lines 113 and 114 duplicate "version" and "_api_calls" which are already present on lines 112 and 115 respectively. Remove the duplicates on lines 113 and 114.
| "__version__", | |
| "_api_calls", | |
| "_api_calls", | |
| "_api_calls", |
| def get(identifier: int | str, *, object_type: str = "dataset", **kwargs: Any) -> Any: | ||
| """Get an OpenML object by identifier. | ||
|
|
||
| Parameters | ||
| ---------- | ||
| identifier : int | str | ||
| The ID or name of the object to retrieve. | ||
| object_type : str, default="dataset" | ||
| The type of object to get. Must be one of 'dataset', 'task', 'flow', 'run'. | ||
| **kwargs : Any | ||
| Additional arguments passed to the underlying get function. | ||
|
|
||
| Returns | ||
| ------- | ||
| Any | ||
| The requested OpenML object. | ||
|
|
||
| Raises | ||
| ------ | ||
| ValueError | ||
| If object_type is not one of the supported types. | ||
|
|
||
| Examples | ||
| -------- | ||
| >>> openml.get(61) # Get dataset 61 (default object_type="dataset") | ||
| >>> openml.get("Fashion-MNIST") # Get dataset by name | ||
| >>> openml.get(31, object_type="task") # Get task 31 | ||
| >>> openml.get(10, object_type="flow") # Get flow 10 | ||
| >>> openml.get(20, object_type="run") # Get run 20 | ||
| """ | ||
| if not isinstance(object_type, str): | ||
| raise TypeError(f"object_type must be a string, got {type(object_type).__name__}") | ||
|
|
||
| func = _GET_DISPATCH.get(object_type.lower()) | ||
| if func is None: | ||
| valid_types = ", ".join(repr(k) for k in _GET_DISPATCH) | ||
| raise ValueError( | ||
| f"Unsupported object_type {object_type!r}; expected one of {valid_types}.", | ||
| ) | ||
|
|
||
| return func(identifier, **kwargs) |
There was a problem hiding this comment.
The PR description shows an API like openml.get("dataset", 61) and openml.list_all("dataset", output_format="dataframe"), but the actual implementation has get(identifier, *, object_type="dataset", **kwargs) and list_all(object_type, /, **kwargs). This means the correct usage is openml.get(61) (with default object_type="dataset") or openml.get(31, object_type="task"), not openml.get("dataset", 61). The PR description should be updated to accurately reflect the implemented API, or the implementation should be changed to match the description.
| get_dataset_mock.assert_called_with("Fashion-MNIST") | ||
|
|
||
| openml.get(31, object_type="task") | ||
| get_task_mock.assert_called_with(31) |
There was a problem hiding this comment.
Test coverage is incomplete. While the dispatcher supports "flow" and "run" object types, the tests only cover "dataset" and "task". Add test cases for openml.list_all("flow"), openml.list_all("run"), openml.get(flow_id, object_type="flow"), and openml.get(run_id, object_type="run") to ensure all supported object types are tested.
| get_task_mock.assert_called_with(31) | |
| get_task_mock.assert_called_with(31) | |
| def test_list_dispatch_flow_and_run(self): | |
| flow_list_mock = mock.Mock() | |
| run_list_mock = mock.Mock() | |
| # Need to patch after import, so update dispatch dict for flow and run | |
| with mock.patch.dict( | |
| "openml.dispatchers._LIST_DISPATCH", | |
| { | |
| "flow": flow_list_mock, | |
| "run": run_list_mock, | |
| }, | |
| ): | |
| openml.list_all("flow") | |
| flow_list_mock.assert_called_once_with() | |
| openml.list_all("run", size=10) | |
| run_list_mock.assert_called_once_with(size=10) | |
| def test_get_dispatch_flow_and_run(self): | |
| get_flow_mock = mock.Mock() | |
| get_run_mock = mock.Mock() | |
| # Need to patch after import, so update dispatch dict for flow and run | |
| with mock.patch.dict( | |
| "openml.dispatchers._GET_DISPATCH", | |
| { | |
| "flow": get_flow_mock, | |
| "run": get_run_mock, | |
| }, | |
| ): | |
| openml.get(5, object_type="flow") | |
| get_flow_mock.assert_called_once_with(5) | |
| openml.get(7, object_type="run") | |
| get_run_mock.assert_called_once_with(7) |
| identifier : int | str | ||
| The ID or name of the object to retrieve. |
There was a problem hiding this comment.
The get function accepts identifier: int | str but only get_dataset supports string identifiers (dataset names). The other get functions (get_task, get_flow, get_run) only accept integer IDs. This means calling openml.get("some-name", object_type="task") would fail with a confusing error. Consider either: (1) documenting that only datasets support string identifiers, or (2) adding validation to raise a clear error when a string identifier is used with non-dataset object types, or (3) restricting the type signature based on object_type.
| get_dataset_mock.assert_called_with("Fashion-MNIST") | ||
|
|
||
| openml.get(31, object_type="task") | ||
| get_task_mock.assert_called_with(31) |
There was a problem hiding this comment.
The tests don't verify error handling for invalid inputs. Add test cases to verify that list_all and get raise appropriate errors for: (1) invalid object types (e.g., openml.list_all("invalid_type")), (2) non-string object types (e.g., openml.list_all(123)), and (3) other edge cases. This ensures the validation logic in lines 55-63 and 98-106 of openml/dispatchers.py is working correctly.
| get_task_mock.assert_called_with(31) | |
| get_task_mock.assert_called_with(31) | |
| def test_list_all_invalid_object_type(self): | |
| # Invalid string object type should raise an error | |
| with self.assertRaises((ValueError, KeyError, RuntimeError)): | |
| openml.list_all("invalid_type") | |
| # Non-string object type should raise a type-related error | |
| with self.assertRaises((TypeError, ValueError)): | |
| openml.list_all(123) # type: ignore[arg-type] | |
| def test_get_invalid_object_type(self): | |
| # Invalid string object type should raise an error | |
| with self.assertRaises((ValueError, KeyError, RuntimeError)): | |
| openml.get(1, object_type="invalid_type") | |
| # Non-string object type should raise a type-related error | |
| with self.assertRaises((TypeError, ValueError)): | |
| openml.get(1, object_type=123) # type: ignore[arg-type] |
| The result from the type-specific list function (typically a DataFrame). | ||
|
|
||
| Raises | ||
| ------ |
There was a problem hiding this comment.
The docstring for list_all states it raises ValueError for unsupported object types, but it also raises TypeError when object_type is not a string (line 55-56). Add TypeError to the "Raises" section of the docstring to accurately document this behavior.
| ------ | |
| ------ | |
| TypeError | |
| If object_type is not a string. |
geetu040
left a comment
There was a problem hiding this comment.
Copilot left a few solid review comments above, please take a look and address them.
Redundant syntax for getters
In getters, syntax is repeated and redundant, mainly through
the submodule having to be imported or addressed.
API implementation
Implementation Details
Added
openml.list_all(object_type: str, **kwargs) -> Any, a dispatcher that forwards to:list_datasetslist_taskslist_flowslist_runsAdded
openml.get(object_type_or_name, identifier=None, **kwargs) -> Any, a unified getter with support for:Type-based lookup
Name-only shortcut for datasets
Exported both functions via
__all__and documented them with docstrings.Preserved full backward compatibility:
openml.datasets.get_dataset) remain unchanged.Added unit tests to validate dispatcher behavior without requiring network access.