feat: Add TinyIMDB dataset for lightweight experiments by pranayyb · Pull Request #374 · PrunaAI/pruna

pranayyb · 2025-10-01T15:46:16Z

Description

This PR adds a new TinyIMDB dataset to Pruna's data module, providing a lightweight version of the IMDB movie reviews dataset with 1,000 samples for quick experiments, testing, and proof-of-concepts.

Related Issue

Fixes #357

Type of Change

New feature (non-breaking change which adds functionality)
This change requires a documentation update

How Has This Been Tested?

Added TinyIMDB to the existing parametrized test suite in test_datamodule.py.
Verified the dataset follows the same pattern as existing text datasets (e.g., WikiText, SmolTalk).
Confirmed the dataset size constraint (<=1,000 samples: 800 train + ~200 val + 200 test).
Tested dataset loading and splitting functionality.
Verified proper integration with Pruna's text_generation_collate function and SmashConfig.
Ensured the tokenizer (BERT) is required and properly registered in SmashConfig.

Checklist

My code follows the style guidelines of this project.
I have performed a self-review of my own code.
I have commented my code, particularly in hard-to-understand areas.
I have made corresponding changes to the documentation.
I have added tests that prove my feature works.
New and existing unit tests pass locally with my changes.

Additional Notes

The implementation reuses the existing stanfordnlp/imdb dataset and creates a subset rather than requiring a new dataset upload.
Maintains the same text format and sentiment labels as the full IMDB dataset.
Uses the same text_generation_collate function for consistency with other text datasets.
Total sample count: ~1,000 samples (800 train + ~200 val + 200 test), staying within the 1,000-sample requirement.
Can be used with smash_config.add_data("TinyIMDB") following the same API as other Pruna datasets.

Files Modified

src/pruna/data/datasets/text_generation.py – Added setup_tiny_imdb_dataset() function.
src/pruna/data/__init__.py – Added import and registered TinyIMDB in base_datasets.
tests/data/test_datamodule.py – Added TinyIMDB test case.
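
As an illustration of the registration pattern described above, here is a minimal sketch of a name-to-dataset registry. It is a simplified stand-in and not Pruna's actual code: the tuple layout of `base_datasets`, the `load_by_name` helper, and the placeholder rows are assumptions made for this example; only the names `setup_tiny_imdb_dataset`, `base_datasets`, `TinyIMDB`, and `text_generation_collate` come from the PR.

```python
from typing import Callable, Dict, List, Tuple

Split = List[dict]
Splits = Tuple[Split, Split, Split]


def setup_tiny_imdb_dataset(seed: int = 42) -> Splits:
    """Placeholder setup function (assumption): returns (train, val, test).

    The real function loads stanfordnlp/imdb; here we fabricate dummy rows
    so the sketch runs offline. The seed is accepted but unused.
    """
    rows = [{"text": f"review {i}", "label": i % 2} for i in range(1000)]
    return rows[:800], rows[800:900], rows[900:]


# Assumed layout: name -> (setup function, collate function name, collate kwargs)
base_datasets: Dict[str, Tuple[Callable[..., Splits], str, dict]] = {
    "TinyIMDB": (setup_tiny_imdb_dataset, "text_generation_collate", {}),
}


def load_by_name(name: str, seed: int = 42) -> Tuple[Split, Split, Split, str]:
    """Hypothetical helper: look up a dataset by name and build its splits."""
    if name not in base_datasets:
        raise KeyError(f"Unknown dataset: {name!r}")
    setup_fn, collate_name, _collate_kwargs = base_datasets[name]
    train, val, test = setup_fn(seed=seed)
    return train, val, test, collate_name


train, val, test, collate_name = load_by_name("TinyIMDB")
print(len(train), len(val), len(test), collate_name)
```

Keeping the lookup keyed by a plain string is what lets a caller request the dataset by name, as in the `smash_config.add_data("TinyIMDB")` call mentioned in the notes above.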

Note

Introduce setup_tiny_imdb_dataset (1k-sample IMDB split) and register TinyIMDB in base_datasets, with tests updated to cover it.

Data:
- New dataset: setup_tiny_imdb_dataset in src/pruna/data/datasets/text_generation.py creating a 1k subset from stanfordnlp/imdb (≈800 train, 200 val, 200 test; falls back to val for test if empty).
- Registration: Add "TinyIMDB" to base_datasets in src/pruna/data/__init__.py using text_generation_collate.
Tests:
- Extend tests/data/test_datamodule.py to parametrize and run TinyIMDB via PrunaDataModule.from_string.

Written by Cursor Bugbot for commit 25d84b7. This will update automatically on new commits.

johannaSommer

Looks wonderful to me! I agree with Bugbot, the split / actual size of the dataset might not be super clear. How about we select the tiny dataset with 1000 samples like in line 208 and split into train/val/test from there?
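
One way to read this suggestion is sketched below in plain Python, working on row indices only: first pick 1,000 samples, then partition those into train/val/test so the total can never exceed 1,000. The 800/100/100 split, the seed, and the helper name `split_tiny_indices` are assumptions for illustration, not values or code taken from the PR.

```python
import random
from typing import List, Tuple


def split_tiny_indices(
    n_total: int,
    n_samples: int = 1000,
    n_train: int = 800,
    n_val: int = 100,
    seed: int = 42,
) -> Tuple[List[int], List[int], List[int]]:
    """Pick n_samples distinct indices, then partition them into three splits."""
    if n_samples > n_total:
        raise ValueError("n_samples cannot exceed n_total")
    if n_train + n_val > n_samples:
        raise ValueError("train + val cannot exceed n_samples")
    rng = random.Random(seed)  # local RNG keeps the selection reproducible
    picked = rng.sample(range(n_total), n_samples)
    train = picked[:n_train]
    val = picked[n_train:n_train + n_val]
    test = picked[n_train + n_val:]  # whatever remains goes to test
    return train, val, test


# 25,000 is the size of the full IMDB train split.
train_idx, val_idx, test_idx = split_tiny_indices(n_total=25_000)
print(len(train_idx), len(val_idx), len(test_idx))
```

Because all three splits are carved out of the same 1,000 selected indices, they are disjoint by construction and their sizes always sum to exactly the selected count.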

johannaSommer · 2025-10-01T16:16:17Z

@pranayyb Could you please leave a brief comment in #357 so that I can assign this issue to you? Thanks!

pranayyb · 2025-10-01T16:29:23Z

I've commented on issue #357 and will make the suggested changes.

pranayyb · 2025-10-01T17:14:26Z

@johannaSommer please review the changes

sdiazlor

Hi @pranayyb, thanks for the contribution! It looks great!

feat: Add TinyIMDB dataset for lightweight experiments

46c139c

pranayyb changed the title ~~Add TinyIMDB dataset for lightweight experiments~~ feat: Add TinyIMDB dataset for lightweight experiments Oct 1, 2025

johannaSommer requested changes Oct 1, 2025

fix: size of the dataset

25d84b7

pranayyb requested a review from johannaSommer October 4, 2025 11:34

johannaSommer approved these changes Oct 6, 2025

johannaSommer requested review from minettekaum and sdiazlor October 6, 2025 14:29

sdiazlor approved these changes Oct 6, 2025

johannaSommer merged commit 3d5be7a into PrunaAI:main Oct 6, 2025
4 checks passed

davidberenstein1957 added the hacktoberfest label Oct 14, 2025

johannaSommer merged 2 commits into PrunaAI:main from pranayyb:feat/tiny-imdb
