[Stable Diffusion] Image2image and inpaint pipeline support (#161)
* draft img2img sd pipe

* refactor inheritance

* refactoring

* refactoring

* fix sdxl unet inf

* inference done

* add post processing and doc

* fix style

* add test

* update doc prompt

* fix num images per prompt issue

* fix test

* add img2img pipe

* better scale

* test no resize

* remove image

* remove hack

* hack for debug vae encoder

* img2img done

* add inpaint pipe

* add tests

* update doc

* title upper class

* add results img

* fix shape

* address comments & api doc

* fix doc

* due with name

* improve api doc

* update doc

* Update docs/source/guides/models.mdx

Co-authored-by: Michael Benayoun <[email protected]>

* Update docs/source/guides/models.mdx

Co-authored-by: Michael Benayoun <[email protected]>

* Update docs/source/guides/models.mdx

Co-authored-by: Michael Benayoun <[email protected]>

* apply suggestion

---------

Co-authored-by: JingyaHuang <[email protected]>
Co-authored-by: Michael Benayoun <[email protected]>
3 people authored Sep 21, 2023
1 parent d2f5c9d commit 16115ac
Showing 16 changed files with 1,456 additions and 383 deletions.
74 changes: 73 additions & 1 deletion docs/source/guides/models.mdx
@@ -211,7 +211,11 @@ You can also accelerate the inference of stable diffusion on neuronx devices (in
* VAE encoder
* VAE decoder

The export can be done either with the CLI or with `NeuronStableDiffusionPipeline` API. Here is an example of exporting stable diffusion components with `NeuronStableDiffusionPipeline`:
### Text-to-Image

The `NeuronStableDiffusionPipeline` class allows you to generate images from a text prompt on Neuron devices, similar to the experience with `diffusers`.

As with other tasks, you need to compile the models before running inference. The export can be done either via the CLI or via the `NeuronStableDiffusionPipeline` API. Here is an example of exporting stable diffusion components with `NeuronStableDiffusionPipeline`:
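
For orientation, here is a minimal sketch of the flow this paragraph describes (the checkpoint id and shapes are illustrative, not part of this diff; the guide's full example is in the collapsed lines below):

```python
from optimum.neuron import NeuronStableDiffusionPipeline

# Compile (export) once with static input shapes, then reuse the compiled graphs.
model_id = "runwayml/stable-diffusion-v1-5"  # illustrative checkpoint
input_shapes = {"batch_size": 1, "height": 512, "width": 512}
pipeline = NeuronStableDiffusionPipeline.from_pretrained(model_id, export=True, **input_shapes)
pipeline.save_pretrained("sd_neuron/")

# Prompts can change freely; input shapes are fixed at export time.
image = pipeline(prompt="a photo of an astronaut riding a horse on mars").images[0]
image.save("astronaut.png")
```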

<Tip>

@@ -247,9 +251,75 @@ Now generate an image with a prompt on neuron:

<img
src="https://raw.githubusercontent.com/huggingface/optimum-neuron/main/docs/assets/guides/models/01-sd-image.png"
width="256"
height="256"
alt="stable diffusion generated image"
/>

### Image-to-Image

With the `NeuronStableDiffusionImg2ImgPipeline` class, you can generate a new image conditioned on a text prompt and an initial image.

```python
import requests
from PIL import Image
from io import BytesIO
from optimum.neuron import NeuronStableDiffusionImg2ImgPipeline

model_id = "nitrosocke/Ghibli-Diffusion"
input_shapes = {"batch_size": 1, "height": 512, "width": 512}
pipeline = NeuronStableDiffusionImg2ImgPipeline.from_pretrained(model_id, export=True, **input_shapes, device_ids=[0, 1])
pipeline.save_pretrained("sd_img2img/")

url = "https://raw.githubusercontent.com/CompVis/stable-diffusion/main/assets/stable-samples/img2img/sketch-mountains-input.jpg"

response = requests.get(url)
init_image = Image.open(BytesIO(response.content)).convert("RGB")
init_image = init_image.resize((512, 512))

prompt = "ghibli style, a fantasy landscape with snowcapped mountains, trees, lake with detailed reflection. sunlight and cloud in the sky, warm colors, 8K"

image = pipeline(prompt=prompt, image=init_image, strength=0.75, guidance_scale=7.5).images[0]
image.save("fantasy_landscape.png")
```
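
The `strength` argument controls how much the initial image is noised before denoising begins. Assuming the Neuron mixin keeps the upstream `diffusers` img2img scheduling, roughly the first `1 - strength` fraction of the schedule is skipped, as in this back-of-the-envelope sketch:

```python
# diffusers-style img2img step math (assumption: mirrored by the Neuron mixin).
num_inference_steps = 50
strength = 0.75
effective_steps = min(int(num_inference_steps * strength), num_inference_steps)
print(effective_steps)  # 37 denoising steps on a 75%-noised init image
# strength=1.0 fully noises the init image, behaving almost like text-to-image.
```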
`image` | `prompt` | output |
:-------------------------:|:-------------------------:|:-------------------------:|
<img src="https://huggingface.co/datasets/optimum/documentation-images/resolve/main/neuron/models/03-sd-img2img-init.png" alt="landscape photo" width="256" height="256"/> | ***ghibli style, a fantasy landscape with snowcapped mountains, trees, lake with detailed reflection. warm colors, 8K*** | <img src="https://huggingface.co/datasets/optimum/documentation-images/resolve/main/neuron/models/04-sd-img2img.png" alt="drawing" width="250"/> |

### Inpaint

With the `NeuronStableDiffusionInpaintPipeline` class, you can edit specific parts of an image by providing a mask and a text prompt.

```python
import requests
from PIL import Image
from io import BytesIO
from optimum.neuron import NeuronStableDiffusionInpaintPipeline

model_id = "runwayml/stable-diffusion-inpainting"
input_shapes = {"batch_size": 1, "height": 512, "width": 512}
pipeline = NeuronStableDiffusionInpaintPipeline.from_pretrained(model_id, export=True, **input_shapes, device_ids=[0, 1])
pipeline.save_pretrained("sd_inpaint/")

def download_image(url):
response = requests.get(url)
return Image.open(BytesIO(response.content)).convert("RGB")

img_url = "https://raw.githubusercontent.com/CompVis/latent-diffusion/main/data/inpainting_examples/overture-creations-5sI6fQgYIuo.png"
mask_url = "https://raw.githubusercontent.com/CompVis/latent-diffusion/main/data/inpainting_examples/overture-creations-5sI6fQgYIuo_mask.png"

init_image = download_image(img_url).resize((512, 512))
mask_image = download_image(mask_url).resize((512, 512))

prompt = "Face of a yellow cat, high resolution, sitting on a park bench"
image = pipeline(prompt=prompt, image=init_image, mask_image=mask_image).images[0]
image.save("cat_on_bench.png")
```
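
Because the compiled pipeline was saved with `save_pretrained`, a later session can presumably reload it without re-exporting (a sketch under that assumption):

```python
from optimum.neuron import NeuronStableDiffusionInpaintPipeline

# Loads the pre-compiled Neuron graphs from disk; no `export=True` needed.
pipeline = NeuronStableDiffusionInpaintPipeline.from_pretrained("sd_inpaint/")
```

Keep in mind that the compiled shapes are static: images and masks must be resized to the `height`/`width` used at export time (512x512 here).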

`image` | `mask_image` | `prompt` | output |
:-------------------------:|:-------------------------:|:-------------------------:|:-------------------------:|
<img src="https://raw.githubusercontent.com/CompVis/latent-diffusion/main/data/inpainting_examples/overture-creations-5sI6fQgYIuo.png" alt="drawing" width="250"/> | <img src="https://raw.githubusercontent.com/CompVis/latent-diffusion/main/data/inpainting_examples/overture-creations-5sI6fQgYIuo_mask.png" alt="drawing" width="250"/> | ***Face of a yellow cat, high resolution, sitting on a park bench*** | <img src="https://huggingface.co/datasets/optimum/documentation-images/resolve/main/neuron/models/05-sd-inpaint.png" alt="drawing" width="250"/> |

## Stable Diffusion XL

Similar to Stable Diffusion, you can use the `NeuronStableDiffusionXLPipeline` API to export and run inference with SDXL models on Neuron devices.
@@ -280,6 +350,8 @@ Now generate an image with a prompt on neuron:

<img
src="https://raw.githubusercontent.com/huggingface/optimum-neuron/main/docs/assets/guides/models/02-sdxl-image.jpeg"
width="256"
height="256"
alt="sdxl generated image"
/>
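
The collapsed example above presumably follows the same pattern as the Stable Diffusion ones; for orientation, a sketch (checkpoint id and shapes are illustrative):

```python
from optimum.neuron import NeuronStableDiffusionXLPipeline

model_id = "stabilityai/stable-diffusion-xl-base-1.0"  # illustrative checkpoint
input_shapes = {"batch_size": 1, "height": 1024, "width": 1024}
pipeline = NeuronStableDiffusionXLPipeline.from_pretrained(model_id, export=True, **input_shapes)
pipeline.save_pretrained("sd_xl/")

image = pipeline(prompt="a close-up of a fox in an autumn forest, golden hour").images[0]
image.save("fox.png")
```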

3 changes: 2 additions & 1 deletion docs/source/package_reference/export.mdx
@@ -70,7 +70,8 @@ Since many architectures share similar properties for their Neuron configuration
| RoFormer | feature-extraction, fill-mask, multiple-choice, question-answering, text-classification, token-classification |
| XLM | feature-extraction, fill-mask, multiple-choice, question-answering, text-classification, token-classification |
| XLM-RoBERTa | feature-extraction, fill-mask, multiple-choice, question-answering, text-classification, token-classification |
| Stable Diffusion | text-to-image |
| Stable Diffusion | text-to-image, image-to-image, inpaint |
| Stable Diffusion XL | text-to-image |


<Tip>
18 changes: 17 additions & 1 deletion docs/source/package_reference/modeling.mdx
@@ -71,4 +71,20 @@ The following Neuron model classes are available for natural language processing

### NeuronStableDiffusionPipeline

[[autodoc]] modeling_diffusion.NeuronStableDiffusionPipeline
[[autodoc]] modeling_diffusion.NeuronStableDiffusionPipeline
- __call__

### NeuronStableDiffusionImg2ImgPipeline

[[autodoc]] modeling_diffusion.NeuronStableDiffusionImg2ImgPipeline
- __call__

### NeuronStableDiffusionInpaintPipeline

[[autodoc]] modeling_diffusion.NeuronStableDiffusionInpaintPipeline
- __call__

### NeuronStableDiffusionXLPipeline

[[autodoc]] modeling_diffusion.NeuronStableDiffusionXLPipeline
- __call__
15 changes: 11 additions & 4 deletions optimum/exporters/neuron/__main__.py
@@ -143,18 +143,25 @@ def infer_stable_diffusion_shapes_from_diffusers(
vae_encoder_num_channels = model.vae.config.in_channels
vae_decoder_num_channels = model.vae.config.latent_channels
vae_scale_factor = 2 ** (len(model.vae.config.block_out_channels) - 1) or 8
height = input_shapes["unet_input_shapes"]["height"] // vae_scale_factor
width = input_shapes["unet_input_shapes"]["width"] // vae_scale_factor
height = input_shapes["unet_input_shapes"]["height"]
scaled_height = height // vae_scale_factor
width = input_shapes["unet_input_shapes"]["width"]
scaled_width = width // vae_scale_factor

input_shapes["text_encoder_input_shapes"].update({"sequence_length": sequence_length})
input_shapes["unet_input_shapes"].update(
{"sequence_length": sequence_length, "num_channels": unet_num_channels, "height": height, "width": width}
{
"sequence_length": sequence_length,
"num_channels": unet_num_channels,
"height": scaled_height,
"width": scaled_width,
}
)
input_shapes["vae_encoder_input_shapes"].update(
{"num_channels": vae_encoder_num_channels, "height": height, "width": width}
)
input_shapes["vae_decoder_input_shapes"].update(
{"num_channels": vae_decoder_num_channels, "height": height, "width": width}
{"num_channels": vae_decoder_num_channels, "height": scaled_height, "width": scaled_width}
)

return input_shapes
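
The refactor separates pixel-space shapes from latent-space shapes: the UNet and VAE decoder operate in latent space (dimensions divided by `vae_scale_factor`), while the newly exported VAE encoder consumes full-resolution pixels. A quick numeric sketch with typical SD 1.5 values (illustrative, not taken from this diff):

```python
# Typical SD 1.5 VAE config (illustrative values).
block_out_channels = [128, 256, 512, 512]
vae_scale_factor = 2 ** (len(block_out_channels) - 1)  # 8

height, width = 512, 512
scaled_height, scaled_width = height // vae_scale_factor, width // vae_scale_factor

print(scaled_height, scaled_width)  # 64 64   -> UNet and VAE decoder shapes
print(height, width)                # 512 512 -> VAE encoder shapes
```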
19 changes: 18 additions & 1 deletion optimum/exporters/neuron/model_configs.py
@@ -256,6 +256,7 @@ def outputs(self) -> List[str]:

def generate_dummy_inputs(self, return_tuple: bool = False, **kwargs):
# For neuron, we use static shapes for compiling the unet. Unlike `optimum`, we use the given `height` and `width` instead of the `sample_size`.
# TODO: Modify optimum.utils.DummyVisionInputGenerator to support unequal height and width (it currently prioritizes `image_size` over custom h/w)
if self.height == self.width:
self._normalized_config.image_size = self.height
else:
@@ -302,7 +303,7 @@ def check_model_inputs_order(self, model, dummy_inputs):

@register_in_tasks_manager("vae-encoder", *["semantic-segmentation"])
class VaeEncoderNeuronConfig(VisionNeuronConfig):
ATOL_FOR_VALIDATION = 1e-2
ATOL_FOR_VALIDATION = 1e-3
MODEL_TYPE = "vae-encoder"

NORMALIZED_CONFIG_CLASS = NormalizedConfig.with_args(
Expand All @@ -319,6 +320,22 @@ def inputs(self) -> List[str]:
def outputs(self) -> List[str]:
return ["latent_sample"]

def generate_dummy_inputs(self, return_tuple: bool = False, **kwargs):
# For neuron, we use static shapes for compiling the vae encoder. Unlike `optimum`, we use the given `height` and `width` instead of the `sample_size`.
# TODO: Modify optimum.utils.DummyVisionInputGenerator to support unequal height and width (it currently prioritizes `image_size` over custom h/w)
if self.height == self.width:
self._normalized_config.image_size = self.height
else:
raise ValueError(
f"You need to input the same value for `self.height({self.height})` and `self.width({self.width})`."
)
dummy_inputs = super().generate_dummy_inputs(**kwargs)

if return_tuple is True:
return tuple(dummy_inputs.values())
else:
return dummy_inputs


@register_in_tasks_manager("vae-decoder", *["semantic-segmentation"])
class VaeDecoderNeuronConfig(VisionNeuronConfig):
4 changes: 4 additions & 0 deletions optimum/neuron/__init__.py
@@ -34,6 +34,8 @@
],
"modeling_diffusion": [
"NeuronStableDiffusionPipeline",
"NeuronStableDiffusionImg2ImgPipeline",
"NeuronStableDiffusionInpaintPipeline",
"NeuronStableDiffusionXLPipeline",
],
"modeling_decoder": ["NeuronDecoderModel"],
@@ -60,6 +62,8 @@
from .modeling_base import NeuronBaseModel
from .modeling_decoder import NeuronDecoderModel
from .modeling_diffusion import (
NeuronStableDiffusionImg2ImgPipeline,
NeuronStableDiffusionInpaintPipeline,
NeuronStableDiffusionPipeline,
NeuronStableDiffusionXLPipeline,
)
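
With the lazy `_import_structure` entries above, the new pipelines become importable from the package root (a quick check, assuming an environment with `optimum-neuron` installed):

```python
from optimum.neuron import (
    NeuronStableDiffusionImg2ImgPipeline,
    NeuronStableDiffusionInpaintPipeline,
)
```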
49 changes: 37 additions & 12 deletions optimum/neuron/modeling_diffusion.py
@@ -59,8 +59,12 @@
from diffusers.schedulers.scheduling_utils import SCHEDULER_CONFIG_NAME
from diffusers.utils import CONFIG_NAME, is_invisible_watermark_available

from .pipelines.diffusers.pipeline_stable_diffusion import StableDiffusionPipelineMixin
from .pipelines.diffusers.pipeline_stable_diffusion_xl import StableDiffusionXLPipelineMixin
from .pipelines import (
NeuronStableDiffusionImg2ImgPipelineMixin,
NeuronStableDiffusionInpaintPipelineMixin,
NeuronStableDiffusionPipelineMixin,
NeuronStableDiffusionXLPipelineMixin,
)


if TYPE_CHECKING:
@@ -158,16 +162,16 @@ def __init__(
self.unet = NeuronModelUnet(
unet, self, self.configs[DIFFUSION_MODEL_UNET_NAME], self.neuron_configs[DIFFUSION_MODEL_UNET_NAME]
)
self.vae_encoder = (
NeuronModelVaeEncoder(
if vae_encoder is not None:
self.vae_encoder = NeuronModelVaeEncoder(
vae_encoder,
self,
self.configs[DIFFUSION_MODEL_VAE_ENCODER_NAME],
self.neuron_configs[DIFFUSION_MODEL_VAE_ENCODER_NAME],
)
if vae_encoder is not None
else None
)
else:
self.vae_encoder = None

self.vae_decoder = NeuronModelVaeDecoder(
vae_decoder,
self,
@@ -623,15 +627,36 @@ def __init__(
):
super().__init__(model, parent_model, config, neuron_config, DIFFUSION_MODEL_VAE_DECODER_NAME)

def forward(self, latent_sample: torch.Tensor):
def forward(
self,
latent_sample: torch.Tensor,
image: Optional[torch.Tensor] = None,
mask: Optional[torch.Tensor] = None,
):
inputs = (latent_sample,)
if image is not None:
inputs += (image,)
if mask is not None:
inputs += (mask,)
outputs = self.model(*inputs)

return tuple(output for output in outputs.values())


class NeuronStableDiffusionPipeline(NeuronStableDiffusionPipelineBase, StableDiffusionPipelineMixin):
__call__ = StableDiffusionPipelineMixin.__call__
class NeuronStableDiffusionPipeline(NeuronStableDiffusionPipelineBase, NeuronStableDiffusionPipelineMixin):
__call__ = NeuronStableDiffusionPipelineMixin.__call__


class NeuronStableDiffusionImg2ImgPipeline(
NeuronStableDiffusionPipelineBase, NeuronStableDiffusionImg2ImgPipelineMixin
):
__call__ = NeuronStableDiffusionImg2ImgPipelineMixin.__call__


class NeuronStableDiffusionInpaintPipeline(
NeuronStableDiffusionPipelineBase, NeuronStableDiffusionInpaintPipelineMixin
):
__call__ = NeuronStableDiffusionInpaintPipelineMixin.__call__


class NeuronStableDiffusionXLPipelineBase(NeuronStableDiffusionPipelineBase):
@@ -689,5 +714,5 @@ def __init__(
self.watermark = None


class NeuronStableDiffusionXLPipeline(NeuronStableDiffusionXLPipelineBase, StableDiffusionXLPipelineMixin):
__call__ = StableDiffusionXLPipelineMixin.__call__
class NeuronStableDiffusionXLPipeline(NeuronStableDiffusionXLPipelineBase, NeuronStableDiffusionXLPipelineMixin):
__call__ = NeuronStableDiffusionXLPipelineMixin.__call__
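
All of the concrete classes above follow one pattern: the base class owns the compiled components, the mixin owns the generation loop, and binding `__call__` explicitly pins the mixin's entry point regardless of method resolution order. In miniature (a simplified sketch, not the real classes):

```python
class Base:
    def __init__(self, unet):
        self.unet = unet  # stands in for the compiled components

class Img2ImgMixin:
    def __call__(self, prompt, image):
        return f"running {self.unet} on {prompt!r} with init image {image!r}"

class Img2ImgPipeline(Base, Img2ImgMixin):
    __call__ = Img2ImgMixin.__call__  # pin the mixin's entry point

print(Img2ImgPipeline("compiled-unet")("a cat", "init.png"))
```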
12 changes: 12 additions & 0 deletions optimum/neuron/pipelines/__init__.py
@@ -20,9 +20,21 @@

_import_structure = {
"transformers": ["pipeline"],
"diffusers": [
"NeuronStableDiffusionPipelineMixin",
"NeuronStableDiffusionImg2ImgPipelineMixin",
"NeuronStableDiffusionInpaintPipelineMixin",
"NeuronStableDiffusionXLPipelineMixin",
],
}

if TYPE_CHECKING:
from .diffusers import (
NeuronStableDiffusionImg2ImgPipelineMixin,
NeuronStableDiffusionInpaintPipelineMixin,
NeuronStableDiffusionPipelineMixin,
NeuronStableDiffusionXLPipelineMixin,
)
from .transformers import (
pipeline,
)
19 changes: 19 additions & 0 deletions optimum/neuron/pipelines/diffusers/__init__.py
@@ -0,0 +1,19 @@
# coding=utf-8
# Copyright 2023 The HuggingFace Team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

from .pipeline_stable_diffusion import NeuronStableDiffusionPipelineMixin
from .pipeline_stable_diffusion_img2img import NeuronStableDiffusionImg2ImgPipelineMixin
from .pipeline_stable_diffusion_inpaint import NeuronStableDiffusionInpaintPipelineMixin
from .pipeline_stable_diffusion_xl import NeuronStableDiffusionXLPipelineMixin
