diff --git a/docs/assets/images/stable_diffusion_algorithms.png b/docs/assets/images/stable_diffusion_algorithms.png new file mode 100644 index 00000000..835fcf90 Binary files /dev/null and b/docs/assets/images/stable_diffusion_algorithms.png differ diff --git a/docs/assets/images/stable_diffusion_quantized.png b/docs/assets/images/stable_diffusion_quantized.png new file mode 100644 index 00000000..0bd2fc65 Binary files /dev/null and b/docs/assets/images/stable_diffusion_quantized.png differ diff --git a/docs/tutorials/flux_small.ipynb b/docs/tutorials/flux_small.ipynb deleted file mode 100644 index 8f559f68..00000000 --- a/docs/tutorials/flux_small.ipynb +++ /dev/null @@ -1,177 +0,0 @@ -{ - "cells": [ - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# Run your Flux model with half the memory" - ] - }, - { - "cell_type": "raw", - "metadata": { - "vscode": { - "languageId": "raw" - } - }, - "source": [ - "
\n", - " \n", - " \"Open\n", - " \n", - " \n", - " \"Open\n", - " \n", - "
" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "This tutorial demonstrates how to use the `pruna` package to optimize your Flux model for memory consumption." - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "This tutorial smashes the Flux model on CPU, which will require around 28GB of memory. As the example inference is run on GPU with the smashed model, a GPU with around 24 GB VRAM is sufficient when using 4bit quantization." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# if you are not running the latest version of this tutorial, make sure to install the matching version of pruna\n", - "# the following command will install the latest version of pruna\n", - "!pip install pruna" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### 1. Loading the Flux Model\n", - "\n", - "First, load your Flux model." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "import torch\n", - "from diffusers import FluxPipeline\n", - "\n", - "pipe = FluxPipeline.from_pretrained(\"black-forest-labs/FLUX.1-schnell\", torch_dtype=torch.bfloat16)\n" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### 2. Initializing the Smash Config\n", - "\n", - "Next, initialize the smash_config. You can uncomment the `torch_compile` line to additionally enable 50% speed up." - ] - }, - { - "cell_type": "code", - "execution_count": 2, - "metadata": {}, - "outputs": [], - "source": [ - "from pruna import SmashConfig\n", - "\n", - "# Initialize the SmashConfig\n", - "smash_config = SmashConfig()\n", - "# smash_config['compiler'] = 'torch_compile'\n", - "smash_config['quantizer'] = 'hqq_diffusers'\n", - "smash_config['hqq_diffusers_weight_bits'] = 4 # or 2, 4, 8" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### 3. 
Smashing the Model\n", - "\n", - "Now, you can smash the model" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "from pruna import smash\n", - "\n", - "pipe = smash(\n", - " model=pipe,\n", - " smash_config=smash_config,\n", - ").to(\"cuda\")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### 4. Running the Model\n", - "\n", - "Finally, run the model to generate the image." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "prompt = \"A cat holding a sign that says hello world\"\n", - "pipe(\n", - " prompt,\n", - " guidance_scale=0.0,\n", - " num_inference_steps=4,\n", - " max_sequence_length=256,\n", - " generator=torch.Generator(\"cpu\").manual_seed(0)\n", - ").images[0]" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### Wrap Up\n", - "\n", - "Congratulations! You have successfully smashed a Flux model. Enjoy the smaller memory footprint!" 
- ] - } - ], - "metadata": { - "kernelspec": { - "display_name": "pruna", - "language": "python", - "name": "python3" - }, - "language_info": { - "codemirror_mode": { - "name": "ipython", - "version": 3 - }, - "file_extension": ".py", - "mimetype": "text/x-python", - "name": "python", - "nbconvert_exporter": "python", - "pygments_lexer": "ipython3", - "version": "3.11.11" - } - }, - "nbformat": 4, - "nbformat_minor": 2 -} diff --git a/docs/tutorials/image_generation.ipynb b/docs/tutorials/image_generation.ipynb new file mode 100644 index 00000000..aca1aceb --- /dev/null +++ b/docs/tutorials/image_generation.ipynb @@ -0,0 +1,442 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Compress and Evaluate Image Generation Models\n", + "\n", + "\n", + " \"Open\n", + "\n", + "\n", + "| Component | Details |\n", + "|-----------|---------|\n", + "| **Goal** | Demonstrate a standard workflow for optimizing and evaluating an image generation model |\n", + "| **Model** | [stabilityai/stable-diffusion-xl-base-1.0](https://huggingface.co/stabilityai/stable-diffusion-xl-base-1.0) |\n", + "| **Dataset** | [nannullna/laion_subset](https://huggingface.co/datasets/nannullna/laion_subset) |\n", + "| **Optimization Algorithms** | cacher(deepcache), compiler(torch_compile), quantizer(hqq_diffusers) |\n", + "| **Evaluation Metrics** | `throughput`, `total time`, `clip_score` |\n", + "\n", + "## Getting Started\n", + "\n", + "To install the dependencies, run the following command:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "!pip install pruna" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "vscode": { + "languageId": "plaintext" + } + }, + "source": [ + "The device is set to the best available option to maximize the benefits of the optimization process." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 1, + "metadata": {}, + "outputs": [], + "source": [ + "import torch\n", + "\n", + "device = \"cuda\" if torch.cuda.is_available() else \"mps\" if torch.backends.mps.is_available() else \"cpu\"" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 1. Load the Model\n", + "\n", + "Before optimizing the model, we first ensure that it loads correctly and fits into memory. For this example, we will use a lightweight image generation model, [segmind/Segmind-Vega](https://huggingface.co/segmind/Segmind-Vega), a distilled version of [stabilityai/stable-diffusion-xl-base-1.0](https://huggingface.co/stabilityai/stable-diffusion-xl-base-1.0), but feel free to use any [text-to-image model on Hugging Face](https://huggingface.co/models?pipeline_tag=text-to-image).\n", + "\n", + "Although Pruna works at least as well with much larger models, such as FLUX or SD3.5, a small model is a good starting point for walking through the steps of the optimization process." + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "metadata": {}, + "outputs": [], + "source": [ + "import torch\n", + "from diffusers import DiffusionPipeline\n", + "\n", + "pipe = DiffusionPipeline.from_pretrained(\n", + "    pretrained_model_name_or_path=\"segmind/Segmind-Vega\", torch_dtype=torch.bfloat16\n", + ")\n", + "pipe = pipe.to(device)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Now that we've loaded the pipeline, let's examine some of the outputs it can generate. We use an example from [this amazing prompt guide](https://strikingloo.github.io/stable-diffusion-vs-dalle-2)."
+ ] + }, + { + "cell_type": "code", + "execution_count": 10, + "metadata": {}, + "outputs": [ + { + "data": { + "application/vnd.jupyter.widget-view+json": { + "model_id": "d58dce8c11304d8b863a1cd022d19c93", + "version_major": 2, + "version_minor": 0 + }, + "text/plain": [ + " 0%| | 0/50 [00:00" + ] + }, + "execution_count": 10, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "prompt = \"Editorial Style Photo, Bonsai Apple Tree, Task Lighting, Inspiring and Sunset, Afternoon, Beautiful, 4k\"\n", + "image = pipe(\n", + " prompt,\n", + " generator=torch.Generator().manual_seed(42),\n", + ")\n", + "image.images[0]" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "As we can see, the model is able to generate a beautiful image based on the provided input prompt.\n", + "\n", + "## 2. Define the SmashConfig\n", + "\n", + "Now that we've confirmed the model is functioning correctly, let's proceed with the optimization process by defining the `SmashConfig`, which will be used later to optimize the model.\n", + "\n", + "For diffusion models, the most important categories of optimization algorithms are cachers, compilers, and quantizers. Note that not all algorithms are compatible with every model. For Stable Diffusion models, the following options are available:\n", + "\n", + "\"Stable\n", + "\n", + "You can learn more about the various optimization algorithms and their hyperparameters in the [Algorithms Overview](https://docs.pruna.ai/en/stable/compression.html) section of the documentation.\n", + "\n", + "In this optimization, we'll combine [``deepcache``](https://docs.pruna.ai/en/stable/compression.html#deepcache), [``torch_compile``](https://docs.pruna.ai/en/stable/compression.html#torch-compile), and [`hqq-diffusers`](https://docs.pruna.ai/en/stable/compression.html#hqq-diffusers). We'll also update some of the parameters for these algorithms, setting `hqq_diffusers_weight_bits` to `4`. 
This is just one of many possible configurations and is intended to serve as an example.\n", + "\n", + "\"Stable\n", + "\n", + "Let's define the `SmashConfig` object." + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "metadata": {}, + "outputs": [], + "source": [ + "from pruna import SmashConfig\n", + "\n", + "smash_config = SmashConfig(device=device)\n", + "# configure the deepcache cacher\n", + "smash_config[\"cacher\"] = \"deepcache\"\n", + "smash_config[\"deepcache_interval\"] = 2\n", + "# configure the torch_compile compiler\n", + "smash_config[\"compiler\"] = \"torch_compile\"\n", + "# configure the hqq_diffusers quantizer\n", + "smash_config[\"quantizer\"] = \"hqq_diffusers\"\n", + "smash_config[\"hqq_diffusers_weight_bits\"] = 4\n", + "smash_config[\"hqq_diffusers_group_size\"] = 64\n", + "smash_config[\"hqq_diffusers_backend\"] = \"marlin\"" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 3. Smash the Model\n", + "\n", + "Now that we've defined the `SmashConfig` object, we can proceed to smash the model. We'll use the `smash` function, passing both the `model` and the `smash_config` as arguments. We make a deep copy of the model to avoid modifying the original model.\n", + "\n", + "Let's smash the model, which should take around 20 seconds for this configuration." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 4, + "metadata": {}, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "INFO - Starting quantizer hqq_diffusers...\n", + "100%|██████████| 220/220 [00:00<00:00, 51515.57it/s]\n", + "100%|██████████| 190/190 [00:01<00:00, 126.83it/s]\n", + "INFO - quantizer hqq_diffusers was applied successfully.\n", + "INFO - Starting cacher deepcache...\n", + "INFO - cacher deepcache was applied successfully.\n", + "INFO - Starting compiler torch_compile...\n", + "self.unet=None of type cannot be saved.\n", + "INFO - compiler torch_compile was applied successfully.\n" + ] + } + ], + "source": [ + "import copy\n", + "\n", + "from pruna import smash\n", + "\n", + "copy_pipe = copy.deepcopy(pipe).to(\"cpu\")\n", + "smashed_pipe = smash(\n", + " model=pipe,\n", + " smash_config=smash_config,\n", + ")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Now that we've smashed the model, let's verify that everything still works as expected by running inference with the smashed model.\n", + "\n", + "If you are using torch_compile as your compiler, you can expect the first inference warmup to take a bit longer than the actual inference." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 6, + "metadata": {}, + "outputs": [ + { + "data": { + "application/vnd.jupyter.widget-view+json": { + "model_id": "b645008b76a948e8b2a59664a2aa01b8", + "version_major": 2, + "version_minor": 0 + }, + "text/plain": [ + "  0%|          | 0/50 [00:00" + ] + }, + "execution_count": 6, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "prompt = \"Editorial Style Photo, Bonsai Apple Tree, Task Lighting, Inspiring and Sunset, Afternoon, Beautiful, 4k\"\n", + "image = smashed_pipe(\n", + "    prompt,\n", + "    generator=torch.Generator().manual_seed(42),\n", + ")\n", + "image.images[0]" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "As we can see, the smashed model generates an image very similar to the one produced by the original model.\n", + "\n", + "If you notice a significant difference, it can have several causes: the model, the configuration, the hardware, etc. As optimization can be non-deterministic, we encourage you to retry the optimization process or try out different configurations and models to find the best fit for your use case. Feel free to reach out to us on [Discord](https://discord.gg/Tun8YgzxZ9) if you have any questions or feedback.\n", + "\n", + "## 4. Evaluate the Smashed Model\n", + "\n", + "Now that the model has been optimized, we can evaluate its performance using the `EvaluationAgent`. This evaluation includes performance metrics like `latency` and `throughput`, as well as `clip_score` to assess the quality of the generated images.\n", + "\n", + "You can find a complete overview of all available metrics in our [documentation](https://docs.pruna.ai/)."
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "from pruna import PrunaModel\n", + "from pruna.data.pruna_datamodule import PrunaDataModule\n", + "from pruna.evaluation.evaluation_agent import EvaluationAgent\n", + "from pruna.evaluation.metrics import (\n", + " LatencyMetric,\n", + " ThroughputMetric,\n", + " TorchMetricWrapper,\n", + ")\n", + "from pruna.evaluation.task import Task\n", + "\n", + "# Define the metrics\n", + "metrics = [\n", + " LatencyMetric(n_iterations=20, n_warmup_iterations=5),\n", + " ThroughputMetric(n_iterations=20, n_warmup_iterations=5),\n", + " TorchMetricWrapper(\"clip\"),\n", + "]\n", + "\n", + "# Define the datamodule\n", + "datamodule = PrunaDataModule.from_string(\"LAION256\")\n", + "datamodule.limit_datasets(10)\n", + "\n", + "# Define the task and evaluation agent\n", + "task = Task(metrics, datamodule=datamodule, device=device)\n", + "eval_agent = EvaluationAgent(task)\n", + "\n", + "# Evaluate base model and offload it to CPU\n", + "wrapped_pipe = PrunaModel(model=pipe)\n", + "wrapped_pipe.move_to_device(device)\n", + "base_model_results = eval_agent.evaluate(wrapped_pipe)\n", + "wrapped_pipe.move_to_device(\"cpu\")\n", + "\n", + "# Evaluate smashed model and offload it to CPU\n", + "smashed_pipe.move_to_device(device)\n", + "smashed_model_results = eval_agent.evaluate(smashed_pipe)\n", + "smashed_pipe.move_to_device(\"cpu\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "We can now review the evaluation results and compare the performance of the original model with the optimized version." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 17, + "metadata": {}, + "outputs": [ + { + "data": { + "text/markdown": [ + "| Metric | Base Model | Compressed Model | Relative Difference |\n", + "|--------|----------|-----------|------------|\n", + "| clip_score | 28.5992 | 28.0233 | -2.01% |\n", + "| throughput | 0.0002 num_iterations/ms | 0.0003 num_iterations/ms | +91.06% |\n", + "| latency | 5739.0089 ms | 3004.1498 ms | -47.65% |\n" + ], + "text/plain": [ + "" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "from IPython.display import Markdown, display # noqa\n", + "\n", + "\n", + "# Calculate percentage differences for each metric\n", + "def calculate_percentage_diff(original, optimized): # noqa\n", + " return ((optimized - original) / original) * 100\n", + "\n", + "\n", + "# Calculate differences and prepare table data\n", + "table_data = []\n", + "for base_metric_result in base_model_results:\n", + " for smashed_metric_result in smashed_model_results:\n", + " if base_metric_result.name == smashed_metric_result.name:\n", + " diff = calculate_percentage_diff(base_metric_result.result, smashed_metric_result.result)\n", + " table_data.append(\n", + " {\n", + " \"Metric\": base_metric_result.name,\n", + " \"Base Model\": f\"{base_metric_result.result:.4f}\",\n", + " \"Compressed Model\": f\"{smashed_metric_result.result:.4f}\",\n", + " \"Relative Difference\": f\"{diff:+.2f}%\",\n", + " }\n", + " )\n", + " break\n", + "\n", + "# Create and display markdown table manually\n", + "markdown_table = \"| Metric | Base Model | Compressed Model | Relative Difference |\\n\"\n", + "markdown_table += \"|--------|----------|-----------|------------|\\n\"\n", + "for row in table_data:\n", + " metric = [m for m in metrics if m.metric_name == row[\"Metric\"]][0]\n", + " unit = metric.metric_units if hasattr(metric, \"metric_units\") else \"\"\n", + " markdown_table += f\"| {row['Metric']} | {row['Base Model']} {unit} | 
{row['Compressed Model']} {unit} | {row['Relative Difference']} |\\n\"  # noqa: E501\n", + "\n", + "display(Markdown(markdown_table))" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "As we can see, the optimized model is roughly twice as fast as the base model (about 48% lower latency and 91% higher throughput), while the CLIP score drops only marginally, by about 2%. This is expected, given the nature of the optimization process.\n", + "\n", + "We can now save the optimized model to disk or share it with others:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# save the model to disk\n", + "smashed_pipe.save_pretrained(\"segmind-vega-smashed\")\n", + "# after saving the model, you can load it with\n", + "# smashed_pipe = PrunaModel.from_pretrained(\"segmind-vega-smashed\")\n", + "\n", + "# save the model to HuggingFace\n", + "# smashed_pipe.save_to_hub(\"PrunaAI/segmind-vega-smashed\")\n", + "# smashed_pipe = PrunaModel.from_hub(\"PrunaAI/segmind-vega-smashed\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Conclusion\n", + "\n", + "In this tutorial, we demonstrated a standard workflow for optimizing and evaluating an image generation model using Pruna.\n", + "\n", + "We defined our optimization strategy using the `SmashConfig` object and applied it to the model with the `smash` function. We then evaluated the performance of the optimized model using the `EvaluationAgent`, comparing key metrics such as latency, throughput, and CLIP score.\n", + "\n", + "To support the workflow, we also used the `PrunaDataModule` to load the dataset and the `Task` object to configure the task and link it to the evaluation process.\n", + "\n", + "The results show that we can significantly improve runtime performance, and reduce memory usage through 4-bit quantization, while maintaining a high level of output quality.
This makes it easy to explore trade-offs and iterate on configurations to find the best optimization strategy for your specific use case." + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": "python3" + } + }, + "nbformat": 4, + "nbformat_minor": 2 +} diff --git a/docs/tutorials/index.rst b/docs/tutorials/index.rst index 235866fa..2cfdd685 100644 --- a/docs/tutorials/index.rst +++ b/docs/tutorials/index.rst @@ -7,6 +7,12 @@ This tutorial will guide you through the process of using |pruna| to optimize yo .. grid:: 1 2 2 2 + .. grid-item-card:: Compress and Evaluate Image Generation Models + :text-align: center + :link: ./image_generation.ipynb + + Compress with an ``hqq_diffusers`` ``quantizer`` and a ``deepcache`` ``cacher``, and evaluate with ``throughput``, ``total time``, ``clip_score``. + .. grid-item-card:: Transcribe 2 hour of audio in 2 minutes with Whisper :text-align: center :link: ./asr_tutorial.ipynb @@ -31,12 +37,6 @@ This tutorial will guide you through the process of using |pruna| to optimize yo ``Evaluate`` image generation quality with ``CMMD`` and ``EvaluationAgent``. - .. grid-item-card:: Run your Flux model with half the memory - :text-align: center - :link: ./flux_small.ipynb - - Speed up your image generation model with ``torch_compile`` ``compilation`` and ``hqq_diffusers`` ``quantization``. - .. grid-item-card:: Making your LLMs 4x smaller :text-align: center :link: ./llms.ipynb