feat: Add TinyIMDB dataset for lightweight experiments #374

Merged
johannaSommer merged 2 commits into PrunaAI:main from pranayyb:feat/tiny-imdb
Oct 6, 2025

Contributor

@pranayyb pranayyb commented Oct 1, 2025

Description

This PR adds a new TinyIMDB dataset to Pruna's data module, providing a lightweight version of the IMDB movie reviews dataset with 1,000 samples for quick experiments, testing, and proof-of-concepts.

Related Issue

Fixes #357

Type of Change

  • New feature (non-breaking change which adds functionality)
  • This change requires a documentation update

How Has This Been Tested?

  • Added TinyIMDB to the existing parametrized test suite in test_datamodule.py.
  • Verified the dataset follows the same pattern as existing text datasets (e.g., WikiText, SmolTalk).
  • Confirmed the dataset size constraint (<=1,000 samples: 800 train + ~200 val + 200 test).
  • Tested dataset loading and splitting functionality.
  • Verified proper integration with Pruna's text_generation_collate function and SmashConfig.
  • Ensured the tokenizer (BERT) is required and properly registered in SmashConfig.

Checklist

  • My code follows the style guidelines of this project.
  • I have performed a self-review of my own code.
  • I have commented my code, particularly in hard-to-understand areas.
  • I have made corresponding changes to the documentation.
  • I have added tests that prove my feature works.
  • New and existing unit tests pass locally with my changes.

Additional Notes

  • The implementation reuses the existing stanfordnlp/imdb dataset and creates a subset rather than requiring a new dataset upload.
  • Maintains the same text format and sentiment labels as the full IMDB dataset.
  • Uses the same text_generation_collate function for consistency with other text datasets.
  • Total sample count: ~1,000 samples (800 train + ~200 val + 200 test), staying well under the 1,000 sample requirement.
  • Can be used with smash_config.add_data("TinyIMDB") following the same API as other Pruna datasets.
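As a rough illustration of the last note, the snippet below sketches how the dataset might be attached to a config. `smash_config.add_data("TinyIMDB")` comes from this PR; the tokenizer checkpoint `bert-base-uncased` and the helper name `configure_tiny_imdb` are assumptions made for the example, not something the PR specifies.

```python
DATASET_NAME = "TinyIMDB"
# Assumption: the PR only says a BERT tokenizer is required,
# so this specific checkpoint name is illustrative.
TOKENIZER_NAME = "bert-base-uncased"


def configure_tiny_imdb():
    """Return a SmashConfig with a tokenizer and TinyIMDB attached (hypothetical helper)."""
    # Imported lazily so this module can be loaded without pruna installed.
    from pruna import SmashConfig

    smash_config = SmashConfig()
    # The PR states that a tokenizer must be registered for this text dataset.
    smash_config.add_tokenizer(TOKENIZER_NAME)
    smash_config.add_data(DATASET_NAME)
    return smash_config
```

Calling `configure_tiny_imdb()` requires pruna to be installed and downloads the tokenizer and the `stanfordnlp/imdb` data on first use.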

Files Modified

  • src/pruna/data/datasets/text_generation.py – Added setup_tiny_imdb_dataset() function.
  • src/pruna/data/__init__.py – Added import and registered TinyIMDB in base_datasets.
  • tests/data/test_datamodule.py – Added TinyIMDB test case.

Note

Introduce setup_tiny_imdb_dataset (1k-sample IMDB split) and register TinyIMDB in base_datasets, with tests updated to cover it.

  • Data:
    • New dataset: setup_tiny_imdb_dataset in src/pruna/data/datasets/text_generation.py creating a 1k subset from stanfordnlp/imdb (≈800 train, 200 val, 200 test; falls back to val for test if empty).
    • Registration: Add "TinyIMDB" to base_datasets in src/pruna/data/__init__.py using text_generation_collate.
  • Tests:
    • Extend tests/data/test_datamodule.py to parametrize and run TinyIMDB via PrunaDataModule.from_string.
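The "falls back to val for test if empty" behaviour in the summary above can be sketched in plain Python. This is not the code from the PR: the function name `split_with_test_fallback` and the slicing strategy are assumptions, used only to show how an 800/200/200 request against a 1,000-sample pool leaves the test split empty and triggers the fallback.

```python
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def split_with_test_fallback(
    samples: Sequence[T], n_train: int, n_val: int, n_test: int
) -> Tuple[List[T], List[T], List[T]]:
    """Slice samples into train/val/test; reuse val when test comes out empty."""
    train = list(samples[:n_train])
    val = list(samples[n_train:n_train + n_val])
    test = list(samples[n_train + n_val:n_train + n_val + n_test])
    if not test:
        # Nothing left for the test split, so fall back to the validation split.
        test = list(val)
    return train, val, test


# With only 1,000 samples, train (800) and val (200) use up the whole pool.
train, val, test = split_with_test_fallback(list(range(1000)), 800, 200, 200)
```

In this sketch the test split ends up identical to the validation split, which is one way the reported sizes could add up to more than 1,000.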

Written by Cursor Bugbot for commit 25d84b7.

@pranayyb changed the title from "Add TinyIMDB dataset for lightweight experiments" to "feat: Add TinyIMDB dataset for lightweight experiments" Oct 1, 2025
cursor[bot]

This comment was marked as outdated.

Member

@johannaSommer johannaSommer left a comment

Looks wonderful to me! I agree with Bugbot, the split / actual size of the dataset might not be super clear. How about we select the tiny dataset with 1000 samples like in line 208 and split into train/val/test from there?
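A minimal sketch of the suggestion above: pick 1,000 samples first, then divide that pool into train/val/test so the total is fixed up front. The 800/100/100 sizes, the seeded shuffle, and the name `select_and_split` are assumptions for illustration; the comment itself does not specify the proportions.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def select_and_split(
    samples: Sequence[T],
    sizes: Tuple[int, int, int] = (800, 100, 100),  # assumed split, sums to 1,000
    seed: int = 42,
) -> Tuple[List[T], List[T], List[T]]:
    """Select sum(sizes) samples, then carve them into train/val/test."""
    total = sum(sizes)
    if len(samples) < total:
        raise ValueError(f"need at least {total} samples, got {len(samples)}")
    # Seeded shuffle of indices keeps the selection reproducible.
    indices = list(range(len(samples)))
    random.Random(seed).shuffle(indices)
    chosen = [samples[i] for i in indices[:total]]
    n_train, n_val, _ = sizes
    train = chosen[:n_train]
    val = chosen[n_train:n_train + n_val]
    test = chosen[n_train + n_val:]
    return train, val, test
```

Because the three splits are slices of the same 1,000-sample pool, they are disjoint and their sizes always add up to exactly the selected total.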

@johannaSommer
Member

@pranayyb Could you please leave a brief comment in #357 so that I can assign this issue to you? Thanks!

@pranayyb
Contributor Author

pranayyb commented Oct 1, 2025

I commented on issue #357 and will make the suggested changes.

@pranayyb
Contributor Author

pranayyb commented Oct 1, 2025

@johannaSommer, please review the changes.

@pranayyb pranayyb requested a review from johannaSommer October 4, 2025 11:34
Collaborator

@sdiazlor sdiazlor left a comment

Hi @pranayyb, thanks for the contribution! It looks great!

@johannaSommer johannaSommer merged commit 3d5be7a into PrunaAI:main Oct 6, 2025
4 checks passed
Development

Successfully merging this pull request may close these issues.

[FEATURE] Add TinyIMDB dataset
