[MNT] Complete test for sparse dataset#1654
Conversation
There was a problem hiding this comment.
Pull request overview
This PR addresses issue #1644 by improving test coverage for sparse datasets with row_id and ignore attributes. The original test only verified behavior when these attributes were excluded; this PR adds a complementary test to verify behavior when they are included.
Changes:
- Renamed
test_get_sparse_dataset_rowid_and_ignore_and_targettotest_get_sparse_dataset_excludes_rowidto clarify that it tests exclusion behavior - Added new test
test_get_sparse_dataset_includes_rowidthat verifies correct behavior when row_id and ignore attributes are included in the output - Both tests verify correct shape (19998 vs 20000 columns) and categorical indicators
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
|
|
||
| def test_get_sparse_dataset_rowid_and_ignore_and_target(self): | ||
| # TODO: re-add row_id and ignore attributes | ||
| def test_get_sparse_dataset_excludes_rowid(self): |
There was a problem hiding this comment.
The test name test_get_sparse_dataset_excludes_rowid is imprecise. This test excludes both the row_id attribute AND the ignore attribute (as specified on lines 383-384 and the parameters on lines 387-388). Consider renaming to test_get_sparse_dataset_excludes_rowid_and_ignore to match the naming pattern used in the non-sparse test test_get_data_rowid_and_ignore_and_target on line 192.
| def test_get_sparse_dataset_excludes_rowid(self): | |
| def test_get_sparse_dataset_excludes_rowid_and_ignore(self): |
| self.assertListEqual(categorical, [False] * 19998) | ||
| assert y.shape == (600,) | ||
|
|
||
| def test_get_sparse_dataset_includes_rowid(self): |
There was a problem hiding this comment.
The test name test_get_sparse_dataset_includes_rowid is imprecise. This test includes both the row_id attribute AND the ignore attribute (as specified on lines 400-401 and the parameters on lines 404-405). Consider renaming to test_get_sparse_dataset_includes_rowid_and_ignore to be clearer about what's being tested and to maintain consistency with the exclude test naming.
| def test_get_sparse_dataset_includes_rowid(self): | |
| def test_get_sparse_dataset_includes_rowid_and_ignore(self): |
| assert len(categorical) == 20000 | ||
| self.assertListEqual(categorical, [False] * 20000) | ||
| assert y.shape == (600,) | ||
|
|
There was a problem hiding this comment.
Consider adding test coverage for partial inclusion scenarios. Currently, the tests only cover the cases where both row_id and ignore attributes are either excluded together or included together. The API supports four combinations: (1) both excluded, (2) both included, (3) include row_id but exclude ignore, and (4) include ignore but exclude row_id. Adding tests for scenarios (3) and (4) would provide more comprehensive coverage of the get_data method's behavior with sparse datasets.
| def test_get_sparse_dataset_includes_rowid_excludes_ignore(self): | |
| self.sparse_dataset.ignore_attribute = ["V256"] | |
| self.sparse_dataset.row_id_attribute = ["V512"] | |
| X, y, categorical, _ = self.sparse_dataset.get_data( | |
| target="class", | |
| include_row_id=True, | |
| include_ignore_attribute=False, | |
| ) | |
| assert all(dtype == pd.SparseDtype(np.float32, fill_value=0.0) for dtype in X.dtypes) | |
| assert isinstance(y.dtypes, pd.SparseDtype) | |
| assert X.shape == (600, 19999) | |
| assert len(categorical) == 19999 | |
| self.assertListEqual(categorical, [False] * 19999) | |
| assert y.shape == (600,) | |
| def test_get_sparse_dataset_excludes_rowid_includes_ignore(self): | |
| self.sparse_dataset.ignore_attribute = ["V256"] | |
| self.sparse_dataset.row_id_attribute = ["V512"] | |
| X, y, categorical, _ = self.sparse_dataset.get_data( | |
| target="class", | |
| include_row_id=False, | |
| include_ignore_attribute=True, | |
| ) | |
| assert all(dtype == pd.SparseDtype(np.float32, fill_value=0.0) for dtype in X.dtypes) | |
| assert isinstance(y.dtypes, pd.SparseDtype) | |
| assert X.shape == (600, 19999) | |
| assert len(categorical) == 19999 | |
| self.assertListEqual(categorical, [False] * 19999) | |
| assert y.shape == (600,) |
There was a problem hiding this comment.
@geetu040 do you think we need the other 2 scenarios as well?

Fixes #1644
test_get_sparse_dataset_excludes_rowid(checks 19998 columns) andtest_get_sparse_dataset_includes_rowid(checks 20000 columns).cc @geetu040 @PGijsbers