Understanding Tasks in Diffusers: Part 1
In this tutorial, you will learn the basics of different tasks and the corresponding models and pipelines inside Hugging Face Diffusers.
This lesson is the 1st of a 3-part series on Understanding Tasks in Diffusers:
- Understanding Tasks in Diffusers: Part 1 (this tutorial)
- Understanding Tasks in Diffusers: Part 2
- Understanding Tasks in Diffusers: Part 3
To learn how to make the most of your diffusion pipelines and work with the latest models, just keep reading.
Welcome to your guide for image generation with Diffusers, a state-of-the-art toolkit for exploring and harnessing the power of diffusion models. In this tutorial, we will look at tasks inside the Hugging Face Diffusers Library.
What will we cover?
- Unconditional Image Generation
- Text-to-Image Generation
- Image-to-Image Generation
In a previous tutorial, we looked at how to get started with the diffusers library and understand pipelines, models, and schedulers. Here, we will make use of what we previously learned and understand some fundamental tasks in the Diffusers library.
Configuring Your Development Environment
To follow this guide, you need to have the diffusers and accelerate libraries installed on your system.
Luckily, both are pip-installable:
$ pip install diffusers
$ pip install accelerate
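If you want to double-check your environment before moving on, a quick sanity check (not part of the original setup, just a suggestion) is to print the installed versions:
import torch
import diffusers
import accelerate

# Confirm the libraries import cleanly and print their versions.
# The rest of this tutorial assumes torch > 2.0.
print("torch:", torch.__version__)
print("diffusers:", diffusers.__version__)
print("accelerate:", accelerate.__version__)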
Setup and Imports
A pipeline is an end-to-end class that provides a quick and easy way to use a diffusion system for inference by bundling independently trained models and schedulers together.
—Hugging Face: Diffusers Overview
Having installed diffusers and accelerate, we import the necessary libraries for the pipelines.
This includes the following:
- torch: be sure to have torch>2.0
- DiffusionPipeline, AutoPipelineForText2Image, and AutoPipelineForImage2Image: from the diffusers library
- load_image and make_image_grid: image utilities from diffusers.utils
import torch
from diffusers import DiffusionPipeline
from diffusers import AutoPipelineForText2Image
from diffusers import AutoPipelineForImage2Image
from diffusers.utils import load_image
from diffusers.utils import make_image_grid
device = "cuda" if torch.cuda.is_available() else "cpu"
generator = torch.Generator(device).manual_seed(31)
We set the device type to “cuda” if a CUDA-capable GPU is available, or else fall back to “cpu”.
A torch Generator object enables reproducibility in a pipeline by setting a manual seed.
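To illustrate (this tiny check is not from the original tutorial and reuses the torch import and device variable defined above): two generators seeded with the same value produce identical noise, which is what makes a pipeline run reproducible.
# Two generators with the same manual seed yield identical noise tensors,
# so a pipeline fed either one produces the same image.
gen_a = torch.Generator(device).manual_seed(31)
gen_b = torch.Generator(device).manual_seed(31)
noise_a = torch.randn((1, 4, 64, 64), generator=gen_a, device=device)
noise_b = torch.randn((1, 4, 64, 64), generator=gen_b, device=device)
print(torch.equal(noise_a, noise_b))  # True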
Unconditional Image Generation
Imagine conjuring up images out of thin air without even a single word as a guide. That’s unconditional image generation, a technique in which diffusion models paint pictures based solely on what they learned from their training data.
Unconditional image generation generates images that resemble a random sample from the images in the training data. This is because the denoising process in the diffusion model is not guided by a text prompt or any other parameters.
Here, we use the DiffusionPipeline to load google/ddpm-celebahq-256, a model that has been trained on images of celebrity faces.
model_id = "google/ddpm-celebahq-256"
pipeline = DiffusionPipeline.from_pretrained(model_id).to(device)
images = pipeline(generator=generator).images
We use the make_image_grid utility function to show the generated image.
make_image_grid(images, rows=1, cols=1)
pipeline.to("cpu")
del pipeline
torch.cuda.empty_cache()
In the above lines, we offload the pipeline to the CPU and then delete it. This frees GPU memory so that it can be used for other operations and to load other models as required.
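Since this offload-and-delete pattern repeats after every pipeline in this tutorial, you could wrap it in a small helper. This is just a convenience sketch, not part of the original code:
def offload_pipeline(pipe):
    # Move the weights back to the CPU and clear PyTorch's CUDA cache.
    # The caller still needs to drop its own reference (del pipeline)
    # before the memory can actually be reclaimed.
    pipe.to("cpu")
    if torch.cuda.is_available():
        torch.cuda.empty_cache()

# Usage after each section:
# offload_pipeline(pipeline)
# del pipeline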
Text-to-Image Generation
When you think of diffusion models, text-to-image is usually one of the first things that come to mind. Text-to-image generation produces an image based on the images the model has been trained on while also being guided by a text prompt.
Here, we use the AutoPipelineForText2Image to load the stabilityai/stable-diffusion-xl-base-1.0 model.
Stable Diffusion XL (SDXL) is a much larger variant of the Stable Diffusion models. It involves a two-stage model process that makes the generated image even more detailed. One can even make use of micro-conditionings within the model to generate fine-grained images.
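As a concrete example of those micro-conditionings, SDXL's text-to-image pipeline accepts extra size-conditioning call arguments such as original_size, target_size, and crops_coords_top_left. The sketch below (an illustration with default-like values, not code from the original tutorial) shows how they could be passed, reusing the imports, device, and generator from the setup:
# A minimal sketch of SDXL micro-conditioning (illustrative, not tutorial code).
sdxl = AutoPipelineForText2Image.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float16,
    variant="fp16",
).to(device)
image = sdxl(
    "Ernest Hemingway in watercolour, cold color palette, muted colors, detailed, 8k",
    original_size=(1024, 1024),      # size conditioning: behave as if trained on 1024x1024 images
    target_size=(1024, 1024),        # ask for a composition typical of 1024x1024 outputs
    crops_coords_top_left=(0, 0),    # (0, 0) mimics an uncropped training image
    generator=generator,
).images[0]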
If you are interested in a complete overview of Stable Diffusion models, please let us know your choice here.
model_id = "stabilityai/stable-diffusion-xl-base-1.0"
pipeline = AutoPipelineForText2Image.from_pretrained(
    model_id, torch_dtype=torch.float16, variant="fp16"
).to(device)
In the pipeline that we just defined, we can pass in our prompt and the generator we defined in the setup.
prompt = "Ernest Hemingway in watercolour, cold color palette, muted colors, detailed, 8k"
images = pipeline(
    prompt,
    generator=generator
).images
Finally, visualize the image when the generation is complete using the make_image_grid functionality:
make_image_grid(images, rows=1, cols=1)
pipeline.to("cpu")
del pipeline
torch.cuda.empty_cache()
Again, we offload the pipeline and delete it to preserve resources.
Specifying Parameters
Something that can be additionally helpful is specifying parameters that let us control exactly how we want our images to turn out.
- height: indicates the height of the generated image
- width: indicates the width of the generated image
- guidance_scale: indicates how much the prompt influences image generation; a lower guidance_scale means more creative output, and a higher guidance_scale means following the prompt to the ‘t’
- negative_prompt: indicates objects or visuals to steer away from while generating
model_id = "runwayml/stable-diffusion-v1-5"
pipeline = AutoPipelineForText2Image.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    variant="fp16",
).to(device)
prompt = "Ernest Hemingway in watercolor, warm color palette, few colors, detailed, 8k"
negative_prompt = "ugly, deformed, disfigured, poor details, bad anatomy"
images = pipeline(
    prompt,
    height=512,
    width=512,
    guidance_scale=7,
    negative_prompt=negative_prompt,
    generator=generator,
).images
We visualize the image after it has finished generation.
make_image_grid(images, rows=1, cols=1)
pipeline.to("cpu")
del pipeline
torch.cuda.empty_cache()
Image-to-Image Generation
Image-to-image is another application similar to text-to-image, but we can also pass in an input image in addition to the prompt.
The initial image is encoded to latent space, and noise is added to it.
The latent diffusion model starts with this “noisy” version of the image. It then analyzes the noisy latent and the provided prompt (if any) to understand what the final image should look like. Based on this understanding, the model predicts the specific noise that was added during the noising process. Finally, it subtracts this predicted noise from the noisy latent, giving us a clearer, more defined result.
Finally, a decoder transforms the new latent representation back into an image.
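To make the encode-and-add-noise step concrete, here is a rough sketch of what happens under the hood. It uses the Stable Diffusion v1.5 VAE and a DDPM scheduler purely as illustrative stand-ins (and reuses torch, device, and load_image from the setup); it is not the exact code the Kandinsky pipeline below executes:
from diffusers import AutoencoderKL, DDPMScheduler
from diffusers.image_processor import VaeImageProcessor

# Illustrative components: a VAE to map images to/from latent space and a
# scheduler that knows how to add noise at a given timestep.
vae = AutoencoderKL.from_pretrained(
    "runwayml/stable-diffusion-v1-5", subfolder="vae"
).to(device)
scheduler = DDPMScheduler.from_pretrained(
    "runwayml/stable-diffusion-v1-5", subfolder="scheduler"
)

image = load_image("https://i.imgur.com/gq6q4CJ.jpg").resize((512, 512))
pixels = VaeImageProcessor().preprocess(image).to(device)  # scaled to [-1, 1]

with torch.no_grad():
    # 1. Encode the image into latent space.
    latents = vae.encode(pixels).latent_dist.sample() * vae.config.scaling_factor
    # 2. Add noise at some timestep; the diffusion model then denoises from
    #    here, and the VAE decoder maps the result back to pixels.
    noise = torch.randn_like(latents)
    noisy_latents = scheduler.add_noise(latents, noise, torch.tensor([500]))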
model_id = "kandinsky-community/kandinsky-2-2-decoder"
pipeline = AutoPipelineForImage2Image.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
).to(device)
Here, we fetch the image from a URL using the load_image utility and then pass the image and the prompt to the pipeline.
init_image = load_image("https://i.imgur.com/gq6q4CJ.jpg")
prompt = "animated Mona Lisa, detailed, fantasy, cute, adorable, Pixar, Disney, 8k"
images = pipeline(
    prompt,
    image=init_image
).images
make_image_grid([init_image, images[0]], rows=1, cols=2)
pipeline.to("cpu")
del pipeline
torch.cuda.empty_cache()
Stable Diffusion XL (SDXL) Model
Let us see how the same image turns out with an SDXL model with similar parameters.
TL;DR
- A Stable Diffusion model generates an image
- It does this by iteratively adding noise to an image and then removing noise to regain the original image
- In the process, it learns how to remove noise
- The model that learns this can be conditioned on text
- It can also be conditioned on other types of images, like depth maps, image masks, and sometimes just patches of colors.
SDXL is a larger and more powerful version of Stable Diffusion v1.5. This model can follow a two-stage model process: the base model generates an image, and a refiner model takes that image and further enhances its details and quality.
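For reference, the full two-stage recipe chains the base and refiner checkpoints, with the base handing its latents directly to the refiner. The following is a sketch of that standard pattern (the tutorial below instead uses the refiner on its own, as an image-to-image model):
# A sketch of the standard SDXL base + refiner chain (not the tutorial's code).
base = DiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float16,
    variant="fp16",
).to(device)
refiner = DiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-refiner-1.0",
    text_encoder_2=base.text_encoder_2,  # share components to save memory
    vae=base.vae,
    torch_dtype=torch.float16,
    variant="fp16",
).to(device)

prompt = "animated Mona Lisa, detailed, fantasy, cute, adorable, Pixar, Disney, 8k"
# The base runs the first 80% of the denoising schedule and returns latents;
# the refiner finishes the remaining 20% and decodes the final image.
latents = base(
    prompt, num_inference_steps=40, denoising_end=0.8, output_type="latent"
).images
image = refiner(
    prompt, num_inference_steps=40, denoising_start=0.8, image=latents
).images[0]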
First, we load the model using the AutoPipelineForImage2Image pipeline. Next, we pass in the prompt and the initial image as a starting point.
model_id = "stabilityai/stable-diffusion-xl-refiner-1.0"
pipeline = AutoPipelineForImage2Image.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    variant="fp16",
).to(device)
init_image = load_image("https://i.imgur.com/gq6q4CJ.jpg")
prompt = "animated Mona Lisa, detailed, fantasy, cute, adorable, Pixar, Disney, 8k"
images = pipeline(prompt, image=init_image, strength=0.5).images
As usual, we visualize the images after the generation is complete.
make_image_grid([init_image, images[0]], rows=1, cols=2)
pipeline.to("cpu")
del pipeline
torch.cuda.empty_cache()
A Closer Look at Pipeline Parameters
As we saw in this tutorial, one of the factors determining the quality of generated images inside the diffusers library is the pipeline parameters. Let’s take a closer look at how these parameters affect image generation.
- Strength: strength is one of the most important parameters. It determines how much the generated image resembles the initial image.
  - A higher strength value makes the image more creative; a value close to or equal to 1.0 means the initial image is ignored.
  - A lower strength value means the generated image is more similar to the initial image.
Note: A little caveat here is that the strength and num_inference_steps parameters are related. If num_inference_steps is 20 and strength is 0.6, this means adding 12 (20 * 0.6) steps of noise to the initial image and then denoising for 12 steps to get the final image (see the short sketch after this list).
- Guidance Scale: The guidance_scale parameter is used to control how closely aligned the generated image and text prompt are. A higher guidance_scale value means your generated image is more aligned with the prompt, while a lower guidance_scale value means your generated image deviates more from the prompt.
- Negative Prompt: A negative prompt conditions the model to not include things in an image, and it can be used to improve image quality or modify an image. We can also use it as a safety guardrail for the model.
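Here is the short sketch promised above: the bookkeeping that relates strength to the number of denoising steps actually run (a simplified illustration of what the image-to-image pipelines do internally, not library code):
num_inference_steps = 20
strength = 0.6

# The image-to-image pipelines skip the early part of the schedule:
# noise is added up to step int(num_inference_steps * strength), and the
# same number of steps is then denoised to produce the final image.
effective_steps = int(num_inference_steps * strength)
print(effective_steps)  # 12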
Let’s see how we can combine these parameters for the SDXL model.
model_id = "stabilityai/stable-diffusion-xl-refiner-1.0"
pipeline = AutoPipelineForImage2Image.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    variant="fp16",
).to(device)
Select and experiment with the parameters to see how they affect the generation.
init_image = load_image("https://i.imgur.com/gq6q4CJ.jpg")
prompt = "animated Mona Lisa, detailed, fantasy, cute, adorable, Pixar, Disney, 8k"
negative_prompt = "ugly, deformed, disfigured, poor details, bad anatomy"
images = pipeline(
    prompt,
    negative_prompt=negative_prompt,
    guidance_scale=8.0,
    strength=0.2,
    num_inference_steps=40,
    image=init_image
).images
Finally, visualize the image.
make_image_grid([init_image, images[0]], rows=1, cols=2)
Summary
We are at the point in history where human creativity is no longer limited by skill, perseverance, or generational talent but by imagination, vocabulary, and the power of computing. It is on us to learn how to utilize these tools to the best of our potential.
In this tutorial, we focused on how to leverage different pipelines for simple tasks in the Hugging Face diffusers library. The tasks involved are:
- Unconditional Image Generation
- Text-to-Image Generation
- Image-to-Image Generation
The next two parts of this series will focus on how to do:
- Image Inpainting
- Depth to Image
- ControlNet
- Some advanced tasks
Did you find value in this tutorial, and do you want us to cover more tutorials on diffusion and stable diffusion in the future?
Let us know in the survey here.
Citation Information
A. R. Gosthipaty and R. Raha. “Understanding Tasks in Diffusers: Part 1,” PyImageSearch, P. Chugh, S. Huot, and K. Kidriavsteva, eds., 2024, https://pyimg.co/5azv1
@incollection{ARG-RR_2024_Understanding-Tasks-in-Diffusers-Part-1,
  author = {Aritra Roy Gosthipaty and Ritwik Raha},
  title = {Understanding Tasks in Diffusers: Part 1},
  booktitle = {PyImageSearch},
  editor = {Puneet Chugh and Susan Huot and Kseniia Kidriavsteva},
  year = {2024},
  url = {https://pyimg.co/5azv1},
}