feat: add prompt only image generation datasets #310

Merged
nifleisch merged 4 commits into main from feat/add-prompt-datasets on Sep 22, 2025

Conversation

@nifleisch
Collaborator

Description

Currently, we only support image-generation datasets consisting of prompt–image pairs. For many applications (e.g., evaluation agents, distillation), only the prompt is needed, and some benchmarking datasets for image-generation models consist of prompts only. This PR relaxes the requirement to always include images by adjusting the collate function. It also adds three common benchmarking datasets for image-generation models: DrawBench, PartiPrompts, and GenAiBench.
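The relaxed collate described above could look roughly like the following. This is a minimal sketch, not the actual implementation: the real `image_generation_collate` also takes an `output_format` argument and returns a `torch.Tensor`, while here images are kept as plain lists so the example is self-contained.

```python
from typing import Any, List, Optional, Tuple


def image_generation_collate(
    data: List[dict], img_size: Optional[int] = None
) -> Tuple[List[str], Optional[List[Any]]]:
    # Collect the prompts, which every sample must provide.
    prompts = [item["text"] for item in data]
    # Prompt-only datasets (DrawBench, PartiPrompts, GenAIBench) carry no
    # images, so return None in place of an image batch.
    if "image" not in data[0]:
        return prompts, None
    return prompts, [item["image"] for item in data]


prompts, images = image_generation_collate(
    [{"text": "a red cube"}, {"text": "a blue sphere"}]
)
```

With a prompt–image dataset the second return value would hold the batched images; for the new prompt-only datasets it is simply `None`.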

Type of Change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • This change requires a documentation update

How Has This Been Tested?

  • I added tests for the newly added datasets that pass locally.
  • I tried the old and new image generation datasets together with the optimization agent.

Checklist

  • My code follows the style guidelines of this project
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

cursor[bot]

This comment was marked as outdated.

Member

@davidberenstein1957 davidberenstein1957 left a comment


Cool! Feel free to merge after the minor bug fixes and the tests pass.

"""
ds = load_dataset("sayakpaul/drawbench", trust_remote_code=True)["train"]
ds = ds.rename_column("Prompts", "text")
return ds.select([0]), ds.select([0]), ds
Member


Perhaps this is leftover from testing? I believe we would want to split the dataset into train, test, and val datasets?

Collaborator Author


This decision was made deliberately to ensure that the evaluation agent uses all prompts in the benchmark for evaluation. Because these are benchmarking datasets, they are not intended for training models. But maybe I am missing cases in which it is favorable to split the benchmark dataset into train, val, and test.

Member


Makes sense. For consistency, I would add a note explaining why you are doing that. Also, wouldn't it be better to provide an empty dataset?

Member


I second the idea of empty datasets for training and validation here :)

Member

@begumcig begumcig left a comment


THEY ARE HERE! You're a champ 🥇🥇🥇🥇. One teeeny request about passing empty datasets for train/val modes. Also left one suggestion about the collate, but only if you also think that makes sense, left it up to you. Everything already looks so good!

Comment thread src/pruna/data/collate.py Outdated


```diff
-def image_generation_collate(data: Any, img_size: int, output_format: str = "int") -> Tuple[List[str], torch.Tensor]:
+def image_generation_collate(
```
Member


I totally get why it feels natural to link prompt datasets with the image_generation_collate. At the same time, I wonder if it might make sense to introduce a really simple, pass-through style collate function specifically for prompt-only datasets. I can imagine us adding more and more datasets in the future, not just for images but for video generation too! So extending the current collate might not be the most sustainable approach long term.

Collaborator Author


Really good point. I did not think about video models. With them in mind it really makes sense to treat the prompt datasets separately.
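A dedicated pass-through collate as suggested above might look like this. The name `prompt_collate` matches the registry entry added later in the PR, but the body is a sketch of the obvious implementation, not the merged code.

```python
from typing import List


def prompt_collate(data: List[dict]) -> List[str]:
    # Pass the prompt strings through unchanged; no image tensor and no
    # img_size argument are needed for prompt-only datasets.
    return [item["text"] for item in data]


batch = [
    {"text": "a corgi wearing sunglasses"},
    {"text": "an oil painting of a lighthouse"},
]
prompts = prompt_collate(batch)
```

Because it never touches images, the same collate would work unchanged for future video-generation prompt datasets.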

"""
ds = load_dataset("sayakpaul/drawbench", trust_remote_code=True)["train"]
ds = ds.rename_column("Prompts", "text")
return ds.select([0]), ds.select([0]), ds
Member


I second the idea of empty datasets for training and validation here :)

Comment thread src/pruna/data/__init__.py Outdated
"CIFAR10": (setup_cifar10_dataset, "image_classification_collate", {"img_size": 32}),
"DrawBench": (setup_drawbench_dataset, "image_generation_collate", {"img_size": None}),
"PartiPrompts": (setup_parti_prompts_dataset, "image_generation_collate", {"img_size": None}),
"GenAIBench": (setup_genai_bench_dataset, "image_generation_collate", {"img_size": None}),
Member


I left a more detailed comment below, but if we were to have a separate collate, we also wouldn't have to set the image size for prompt datasets!


github-actions bot commented Sep 7, 2025

This PR has been inactive for 10 days and is now marked as stale.

@github-actions github-actions bot added the stale label Sep 7, 2025
@nifleisch nifleisch force-pushed the feat/add-prompt-datasets branch from e4ea39d to 1b8f003 on September 8, 2025 12:06
@nifleisch nifleisch requested a review from begumcig September 8, 2025 12:26
Member

@begumcig begumcig left a comment


Amazing job Nils! Super excited to use these datasets already 🏆🏆🏆. It could be useful to add an info message to the user about how they should use the test split, but everything looks super good to me, so already approved!

"Polyglot": (setup_polyglot_dataset, "question_answering_collate", {}),
"OpenImage": (setup_open_image_dataset, "image_generation_collate", {"img_size": 1024}),
"CIFAR10": (setup_cifar10_dataset, "image_classification_collate", {"img_size": 32}),
"DrawBench": (setup_drawbench_dataset, "prompt_collate", {}),
Member


slay

"""
ds = load_dataset("BaiqiL/GenAI-Bench")["train"]
ds = ds.rename_column("Prompt", "text")
return ds.select([0]), ds.select([0]), ds
Member


After our discussion I see why we cannot pass an empty dataset for the train and validation. Do you think it would make sense to print an info / warning in the setup functions to let people know they should be using the test dataloader?

Copy link
Copy Markdown
Collaborator Author


Yes, that sounds like a good idea. Will log an info message when loading the dataset.
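The agreed-upon info message could be sketched as below. The logger name `pruna.data` and the helper `log_benchmark_notice` are assumptions for illustration; the actual pruna logging setup and where the call lands in the setup functions may differ. The `StringIO` handler is only here to make the example self-contained.

```python
import io
import logging

# Hypothetical logger name; pruna's real logging configuration may differ.
logger = logging.getLogger("pruna.data")
stream = io.StringIO()
logger.addHandler(logging.StreamHandler(stream))
logger.setLevel(logging.INFO)


def log_benchmark_notice(name: str) -> None:
    # Called from the dataset setup function so users know the train/val
    # splits are single-row placeholders and only the test split matters.
    logger.info(
        "%s is a benchmark dataset: the train and val splits are placeholders; "
        "use the test dataloader for evaluation.",
        name,
    )


log_benchmark_notice("DrawBench")
```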

@github-actions github-actions bot removed the stale label Sep 9, 2025

This PR has been inactive for 10 days and is now marked as stale.

@github-actions github-actions bot added the stale label Sep 20, 2025
@nifleisch nifleisch merged commit 3d99160 into main Sep 22, 2025
7 checks passed
@nifleisch nifleisch deleted the feat/add-prompt-datasets branch September 22, 2025 08:56