Skip to content

[MNT] Complete test for sparse dataset#1654

Open
omkar-334 wants to merge 2 commits intoopenml:mainfrom
omkar-334:sparseds
Open

[MNT] Complete test for sparse dataset#1654
omkar-334 wants to merge 2 commits intoopenml:mainfrom
omkar-334:sparseds

Conversation

@omkar-334
Copy link

@omkar-334 omkar-334 commented Feb 18, 2026

Fixes #1644

  1. Added coverage for sparse dataset get_data when row_id and ignore attributes are included.
  2. Split behavior into two tests: test_get_sparse_dataset_excludes_rowid (checks 19998 columns) and test_get_sparse_dataset_includes_rowid (checks 20000 columns).
  3. Updated test names to make include vs exclude behavior clear.

cc @geetu040 @PGijsbers

Copilot AI review requested due to automatic review settings February 18, 2026 18:50
Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR addresses issue #1644 by improving test coverage for sparse datasets with row_id and ignore attributes. The original test only verified behavior when these attributes were excluded; this PR adds a complementary test to verify behavior when they are included.

Changes:

  • Renamed test_get_sparse_dataset_rowid_and_ignore_and_target to test_get_sparse_dataset_excludes_rowid to clarify that it tests exclusion behavior
  • Added new test test_get_sparse_dataset_includes_rowid that verifies correct behavior when row_id and ignore attributes are included in the output
  • Both tests verify correct shape (19998 vs 20000 columns) and categorical indicators

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.


def test_get_sparse_dataset_rowid_and_ignore_and_target(self):
# TODO: re-add row_id and ignore attributes
def test_get_sparse_dataset_excludes_rowid(self):
Copy link

Copilot AI Feb 18, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The test name test_get_sparse_dataset_excludes_rowid is imprecise. This test excludes both the row_id attribute AND the ignore attribute (as specified on lines 383-384 and the parameters on lines 387-388). Consider renaming to test_get_sparse_dataset_excludes_rowid_and_ignore to match the naming pattern used in the non-sparse test test_get_data_rowid_and_ignore_and_target on line 192.

Suggested change
def test_get_sparse_dataset_excludes_rowid(self):
def test_get_sparse_dataset_excludes_rowid_and_ignore(self):

Copilot uses AI. Check for mistakes.
self.assertListEqual(categorical, [False] * 19998)
assert y.shape == (600,)

def test_get_sparse_dataset_includes_rowid(self):
Copy link

Copilot AI Feb 18, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The test name test_get_sparse_dataset_includes_rowid is imprecise. This test includes both the row_id attribute AND the ignore attribute (as specified on lines 400-401 and the parameters on lines 404-405). Consider renaming to test_get_sparse_dataset_includes_rowid_and_ignore to be clearer about what's being tested and to maintain consistency with the exclude test naming.

Suggested change
def test_get_sparse_dataset_includes_rowid(self):
def test_get_sparse_dataset_includes_rowid_and_ignore(self):

Copilot uses AI. Check for mistakes.
assert len(categorical) == 20000
self.assertListEqual(categorical, [False] * 20000)
assert y.shape == (600,)

Copy link

Copilot AI Feb 18, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Consider adding test coverage for partial inclusion scenarios. Currently, the tests only cover the cases where both row_id and ignore attributes are either excluded together or included together. The API supports four combinations: (1) both excluded, (2) both included, (3) include row_id but exclude ignore, and (4) include ignore but exclude row_id. Adding tests for scenarios (3) and (4) would provide more comprehensive coverage of the get_data method's behavior with sparse datasets.

Suggested change
def test_get_sparse_dataset_includes_rowid_excludes_ignore(self):
self.sparse_dataset.ignore_attribute = ["V256"]
self.sparse_dataset.row_id_attribute = ["V512"]
X, y, categorical, _ = self.sparse_dataset.get_data(
target="class",
include_row_id=True,
include_ignore_attribute=False,
)
assert all(dtype == pd.SparseDtype(np.float32, fill_value=0.0) for dtype in X.dtypes)
assert isinstance(y.dtypes, pd.SparseDtype)
assert X.shape == (600, 19999)
assert len(categorical) == 19999
self.assertListEqual(categorical, [False] * 19999)
assert y.shape == (600,)
def test_get_sparse_dataset_excludes_rowid_includes_ignore(self):
self.sparse_dataset.ignore_attribute = ["V256"]
self.sparse_dataset.row_id_attribute = ["V512"]
X, y, categorical, _ = self.sparse_dataset.get_data(
target="class",
include_row_id=False,
include_ignore_attribute=True,
)
assert all(dtype == pd.SparseDtype(np.float32, fill_value=0.0) for dtype in X.dtypes)
assert isinstance(y.dtypes, pd.SparseDtype)
assert X.shape == (600, 19999)
assert len(categorical) == 19999
self.assertListEqual(categorical, [False] * 19999)
assert y.shape == (600,)

Copilot uses AI. Check for mistakes.
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@geetu040 do you think we need the other 2 scenarios as well?

@omkar-334
Copy link
Author

Screenshot 2026-02-19 at 12 27 14 AM

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[MNT] Complete Test for Sparse Dataset Row ID and Ignore Attributes Inclusion

2 participants