feat: Add TinyIMDB dataset for lightweight experiments #374
Merged
johannaSommer merged 2 commits into PrunaAI:main on Oct 6, 2025
Conversation
johannaSommer (Member) requested changes on Oct 1, 2025 and left a comment:
Looks wonderful to me! I agree with Bugbot, the split / actual size of the dataset might not be super clear. How about we select the tiny dataset with 1000 samples like in line 208 and split into train/val/test from there?
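A minimal sketch of the approach suggested above: pick 1,000 samples first, then carve train/val/test out of that subset so the total size is unambiguous. This is not the code merged in the PR. The function name `split_tiny_subset`, the 800/100/100 split sizes, and the seed are all hypothetical choices made for illustration, and a plain Python sequence stands in for the real dataset object.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def split_tiny_subset(
    samples: Sequence[T],
    subset_size: int = 1000,  # total size of the tiny dataset
    n_train: int = 800,       # hypothetical split sizes, not taken from the PR
    n_val: int = 100,
    seed: int = 42,
) -> Tuple[List[T], List[T], List[T]]:
    """Select `subset_size` samples, then split them into train/val/test."""
    if n_train + n_val > subset_size:
        raise ValueError("n_train + n_val must not exceed subset_size")
    if len(samples) < subset_size:
        raise ValueError("not enough samples to build the subset")

    # Draw the subset indices once, with a fixed seed for reproducibility.
    rng = random.Random(seed)
    indices = rng.sample(range(len(samples)), subset_size)

    # Slice the already-selected indices so the three splits are disjoint
    # and together contain exactly `subset_size` samples.
    train = [samples[i] for i in indices[:n_train]]
    val = [samples[i] for i in indices[n_train:n_train + n_val]]
    test = [samples[i] for i in indices[n_train + n_val:]]
    return train, val, test
```

Because all three splits are cut from the same 1,000 selected indices, their sizes always add up to the advertised dataset size, which is the ambiguity the review comment points at.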
Contributor, Author:
I commented on issue #357 and will make the suggested changes.
Contributor, Author:
@johannaSommer please review the changes.
johannaSommer approved these changes on Oct 6, 2025
Description
This PR adds a new TinyIMDB dataset to Pruna's data module, providing a lightweight version of the IMDB movie reviews dataset with 1,000 samples for quick experiments, testing, and proof-of-concepts.
Related Issue
Fixes #357
Type of Change
How Has This Been Tested?
Checklist
Additional Notes
Files Modified
Note
Introduce setup_tiny_imdb_dataset (1k-sample IMDB split) and register TinyIMDB in base_datasets, with tests updated to cover it.
- Add setup_tiny_imdb_dataset in src/pruna/data/datasets/text_generation.py, creating a 1k subset from stanfordnlp/imdb (≈800 train, 200 val, 200 test; falls back to val for test if empty).
- Add "TinyIMDB" to base_datasets in src/pruna/data/__init__.py using text_generation_collate.
- Update tests/data/test_datamodule.py to parametrize and run TinyIMDB via PrunaDataModule.from_string.
Written by Cursor Bugbot for commit 25d84b7.