Stable Diffusion Hugging Face Tutorial

Stable Diffusion is a powerful generative model that allows you to create realistic images from text prompts.

To get started, you'll need to install Hugging Face's Diffusers library along with the Transformers library.

The Diffusers library provides a simple and efficient way to run diffusion models, including Stable Diffusion.

You can install both libraries using pip, the Python package manager.

The Transformers library is a collection of pre-trained models with a simple interface for using them. Diffusers relies on it for Stable Diffusion's text encoders.

Getting Started

To use Stable Diffusion with Hugging Face, you'll need to install the necessary libraries, including requests and Pillow, which can be installed using `pip install requests Pillow`.

The Hugging Face Inference endpoints can directly work with binary data, allowing you to send your prompt and receive an image in return.

To get started, you'll need to install the huggingface_hub library and log in with your Hugging Face access token. Because the model is gated, you also need to fill out the form on the Stable Diffusion 3 Medium Hugging Face page and accept the gate before you can download it.

You can use the 🤗 Diffusers library, which provides a high-level interface for working with Stable Diffusion.

Here are the three official models in the SD3 family that you can use:

  • stabilityai/stable-diffusion-3-medium-diffusers
  • stabilityai/stable-diffusion-3.5-large
  • stabilityai/stable-diffusion-3.5-large-turbo

Make sure you have a GPU machine to run the code. The first time you run the command to initialize a pipeline, it will download the model from the Hugging Face model hub to your local machine.

Integrate API with Python

To integrate the Stable Diffusion API with Python, you can use the Hugging Face Inference endpoints. These endpoints can directly work with binary data, allowing you to send your prompt and receive an image in return.

You'll need to install the requests and Pillow libraries using pip, as they are required for sending HTTP requests and saving generated images to disk.

One way to generate images is to use the requests library to send your prompt and receive the generated image. Generation options, such as the image size, go in the `parameters` attribute of the request body.

Here's an example JSON payload for generating a 768x768 image:

```json
{
  "inputs": "a sunset on the beach",
  "parameters": {
    "height": 768,
    "width": 768
  }
}
```
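
Sending such a payload from Python could look roughly like the sketch below. It assumes a deployed endpoint that returns raw image bytes and sends the prompt text under the `inputs` key, the convention used by Hugging Face inference APIs. The helper names `build_payload` and `generate_image` and the `Accept` header value are assumptions for illustration, not part of any library.

```python
import io
import json


def build_payload(prompt, height=768, width=768):
    """Return the request body: prompt under "inputs", size under "parameters"."""
    return {"inputs": prompt, "parameters": {"height": height, "width": width}}


def generate_image(endpoint_url, token, payload, timeout=120):
    """POST the payload and decode the binary response into a PIL image."""
    # Third-party imports stay local so build_payload works without them.
    import requests
    from PIL import Image

    headers = {
        "Authorization": f"Bearer {token}",
        "Content-Type": "application/json",
        "Accept": "image/png",  # assumption: the endpoint answers with PNG bytes
    }
    response = requests.post(endpoint_url, headers=headers, json=payload, timeout=timeout)
    response.raise_for_status()  # surface HTTP errors instead of decoding an error body
    return Image.open(io.BytesIO(response.content))


# Show the JSON body that would be sent for the example prompt.
print(json.dumps(build_payload("a sunset on the beach"), indent=2))
```

With a real endpoint URL and access token, `generate_image(url, token, build_payload("a sunset on the beach")).save("sunset.png")` would write the result to disk.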

You can also use the 🤗 Diffusers library to run Stable Diffusion directly from Python. Unlike the Inference Endpoints, it loads the model and generates images on your own machine.

Call

The call function is where the magic happens in the Stable Diffusion pipeline. It's what generates the image based on the prompt and parameters you provide. Calling the pipeline object like a function invokes its `__call__` method, which takes several parameters that control the generation process.

The prompt is the most important parameter, as it guides the image generation. You can provide a single string or a list of strings as the prompt. The height and width parameters control the size of the generated image, and are set to 512 by default.

The number of denoising steps, or `num_inference_steps`, determines the quality of the image, with more steps leading to higher quality but slower inference. The guidance scale, or `guidance_scale`, controls how closely the generated image should match the prompt, with higher values encouraging closer matches but potentially lower image quality.

You can also pass a torch generator to make the generation deterministic, and pre-generated latents to tweak the same generation with different prompts. The output type can be set to PIL or np.array, and the return_dict parameter determines whether to return a StableDiffusionPipelineOutput or a plain tuple.

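Here's a summary of the parameters you can pass to the `__call__` method:

  • prompt: a string or list of strings that guides the image generation
  • height, width: size of the generated image in pixels (512 by default)
  • num_inference_steps: number of denoising steps
  • guidance_scale: how closely the image should match the prompt
  • generator: a torch generator that makes the generation deterministic
  • latents: pre-generated latents to reuse across prompts
  • output_type: PIL or np.array
  • return_dict: whether to return a StableDiffusionPipelineOutput or a plain tuple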
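
As a rough illustration, these arguments can be collected in one place before calling the pipeline. The helper `build_call_kwargs` is hypothetical. Its defaults of 50 steps and a guidance scale of 7.5 mirror the StableDiffusionPipeline defaults, and the divisibility check reflects the pipeline's requirement that height and width be multiples of 8.

```python
def build_call_kwargs(prompt, height=512, width=512, num_inference_steps=50,
                      guidance_scale=7.5, seed=None, output_type="pil"):
    """Collect keyword arguments for a StableDiffusionPipeline call."""
    if height % 8 != 0 or width % 8 != 0:
        raise ValueError("height and width must be divisible by 8")
    kwargs = {
        "prompt": prompt,
        "height": height,
        "width": width,
        "num_inference_steps": num_inference_steps,
        "guidance_scale": guidance_scale,
        "output_type": output_type,
    }
    if seed is not None:
        import torch  # only needed when a fixed seed is requested

        # A seeded generator makes repeated runs produce the same image.
        kwargs["generator"] = torch.Generator().manual_seed(seed)
    return kwargs


print(build_call_kwargs("a sunset on the beach"))
```

Given a loaded pipeline `pipe`, the call would then be `pipe(**build_call_kwargs("a sunset on the beach", seed=42)).images[0]`.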

Running

To get started with Stable Diffusion, you need to accept the gate by filling out the form on the Stable Diffusion 3 Medium Hugging Face page and logging in.

Logging in is a crucial step before you can start generating images. From a terminal, you can log in with the `huggingface-cli login` command and paste your access token when prompted.
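
From Python, the login step can be sketched with the `login` function of the huggingface_hub library. This is a minimal sketch that assumes the access token is stored in an environment variable. The variable name `HF_TOKEN` and the helpers `read_hf_token` and `login_to_hub` are choices made for this example.

```python
import os


def read_hf_token(env_var="HF_TOKEN"):
    """Read the access token from the environment and fail loudly if it is missing."""
    token = os.environ.get(env_var, "").strip()
    if not token:
        raise RuntimeError(f"Set the {env_var} environment variable to your access token")
    return token


def login_to_hub(env_var="HF_TOKEN"):
    """Authenticate this machine with the Hugging Face Hub."""
    from huggingface_hub import login  # imported here so the token helper has no dependencies

    login(token=read_hf_token(env_var))
```

Calling `login_to_hub()` once per machine is enough, since the library stores the token locally.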

Note that you can also run Stable Diffusion using the SD3 pipeline, which applies the same optimizations and techniques.

To run Stable Diffusion, you'll need to import StableDiffusionPipeline from the diffusers library and initialize it with from_pretrained, which downloads the model from the Hugging Face model hub to your local machine the first time it runs. You'll need a GPU machine to run this code.
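
For the SD3 family, loading a pipeline might look like the sketch below. It uses the StableDiffusion3Pipeline class from diffusers. The helper `load_sd3_pipeline`, the fixed list of accepted model IDs, and the float16 default are assumptions for this example, so check the model card for the recommended precision.

```python
SD3_MODELS = (
    "stabilityai/stable-diffusion-3-medium-diffusers",
    "stabilityai/stable-diffusion-3.5-large",
    "stabilityai/stable-diffusion-3.5-large-turbo",
)


def load_sd3_pipeline(model_id=SD3_MODELS[0], device="cuda", dtype_name="float16"):
    """Download (on first use) and load an SD3 pipeline onto the given device."""
    if model_id not in SD3_MODELS:
        raise ValueError(f"Unknown SD3 model: {model_id}")
    # Heavy imports are deferred until a pipeline is actually requested.
    import torch
    from diffusers import StableDiffusion3Pipeline

    pipe = StableDiffusion3Pipeline.from_pretrained(
        model_id, torch_dtype=getattr(torch, dtype_name)
    )
    return pipe.to(device)
```

After logging in and accepting the gate, `load_sd3_pipeline()("a sunset on the beach").images[0]` would return a PIL image.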

Core Components

Stable Diffusion has three core components: a text encoder, an autoencoder, and a U-Net. These components work together to generate high-quality images.

A text encoder is used to convert text into embeddings, which are then used as input to the U-Net. Specifically, Stable Diffusion uses a CLIP-trained encoder for this conversion. CLIP is trained so that the embeddings of an image and its matching text description end up close together in latent space.

The autoencoder, also known as a Variational Auto Encoder (VAE), is used to encode and decode images to and from latent representations. This is done to reduce the computational time required to generate high-resolution images.

The U-Net is used to gradually subtract noise in the latent space over several steps to reach the desired output. It has an encoder and a decoder, which are comprised of ResNet blocks, and cross-attention layers are added to both the encoder and the decoder part of the U-Net.

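Here are the three main components of Stable Diffusion:

  • Text encoder: a CLIP-trained encoder that converts the prompt into embeddings
  • Autoencoder (VAE): encodes and decodes images to and from latent representations
  • U-Net: gradually subtracts noise in the latent space, conditioned on the text embeddings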

Using the Diffusers Library

To use the Diffusers library, you'll need to install it along with the Hugging Face Hub library. Both are available through pip: `pip install diffusers huggingface_hub`.

The Diffusers library works in conjunction with the Hugging Face Hub library to download models from the Hugging Face model hub to your local machine. This process is necessary for running Stable Diffusion.

The Hugging Face Hub library also handles logging in, which gives you access to gated models on the hub.

Here's a list of resources for learning more:

  • Fast.ai course — 1st Two Lessons of From Deep Learning Foundations to Stable Diffusion
  • Stable Diffusion with 🧨 Diffusers
  • Getting Started in the World of Stable Diffusion

These resources will provide you with a solid foundation for using the Diffusers library and getting started with Stable Diffusion.

What's Their Role

The core components of Stable Diffusion are the text encoder, autoencoder, and U-Net. Together they turn random Gaussian noise into a high-quality image by denoising it step by step.

The text encoder is a CLIP-trained encoder that converts text into embeddings, which are then used as input to the U-Net. This is a key part of the Stable Diffusion pipeline, as it allows the model to understand the text description and generate an image that matches.

The autoencoder, specifically a Variational Auto-Encoder (VAE), is used to reduce the computational time to generate high-resolution images. It encodes the image into a lower-dimensional latent space, where the diffusion process can occur more efficiently.

The U-Net is responsible for gradually subtracting noise in the latent space over several steps to reach the desired output. It uses cross-attention layers to condition the output on the text description provided.

Here are the roles of each component in the Stable Diffusion pipeline:

  • Text encoder: converts the text description into embeddings
  • VAE: moves images into and out of the lower-dimensional latent space
  • U-Net: removes noise from the latents step by step, conditioned on the text embeddings

The Stable Diffusion pipeline relies on these components working together to generate high-quality images.

Understanding the Process

The Stable Diffusion model takes a text prompt and a seed as input. The prompt is passed through the CLIP model to generate a text embedding of size 77x768, and the seed is used to generate the initial random latent image representation.

This text embedding then conditions the iterative denoising process of the U-Net, which repeatedly denoises the random latent image representation.

The output of the U-Net is the predicted noise residual, which a scheduler algorithm uses to compute the denoised latents for the next step. This process is repeated N times to retrieve a better latent image representation.

Typical steps are in the range of 30–80, but recent papers claim to reduce it to 4–5 steps by using distillation techniques. We will use 50 steps in this case.

The final latent image representation is then decoded by the VAE decoder to retrieve the final output image of size 3x512x512.
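
The structure of this loop can be sketched in plain Python. The functions below are toy stand-ins, not the real U-Net or scheduler, and the 4-channel latent with an 8x downscaling factor is an assumption taken from the standard Stable Diffusion v1 configuration rather than from this article.

```python
def latent_shape(height=512, width=512, latent_channels=4, vae_scale_factor=8):
    """Shape of the latent the U-Net works on for a given output image size."""
    return (latent_channels, height // vae_scale_factor, width // vae_scale_factor)


def denoise(latents, predict_noise, scheduler_step, num_steps=50):
    """Predict the noise residual, then let the scheduler update the latents, N times."""
    for t in reversed(range(num_steps)):
        noise_pred = predict_noise(latents, t)            # U-Net stand-in
        latents = scheduler_step(noise_pred, t, latents)  # scheduler stand-in
    return latents


def toy_unet(latents, t):
    """Pretend that half of each latent value is noise."""
    return [0.5 * value for value in latents]


def toy_scheduler(noise_pred, t, latents):
    """Subtract the predicted noise from the latents."""
    return [value - noise for value, noise in zip(latents, noise_pred)]


print(latent_shape(512, 512))                       # (4, 64, 64)
print(denoise([1.0], toy_unet, toy_scheduler, 3))   # [0.125]
```

In the real pipeline, the U-Net call also receives the text embeddings, and the final latents are passed to the VAE decoder to produce the image.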

Hugging Face Inference Endpoints

Hugging Face Inference Endpoints offer a secure production solution to easily deploy Machine Learning models on dedicated and autoscaling infrastructure managed by Hugging Face.

You can access the UI of Inference Endpoints directly at https://ui.endpoints.huggingface.co/ or through the Landing page.

To deploy a model as an Inference Endpoint, you need to add the Hugging Face repository ID of the model you want to deploy, such as stabilityai/stable-diffusion-2. If the repository is gated, you need to accept the terms on the model page.

You can make changes to the provider, region, or instance you want to use, as well as configure the security level of your endpoint. It's a good idea to keep the suggested defaults from the application.

Inference Endpoints will create a dedicated container with the model and start your resources once you click the "Create Endpoint" button. After a few minutes, your endpoint is up and running.

You can deploy any Stable-Diffusion model from the Hugging Face Hub to Hugging Face Inference Endpoints and integrate it via an API into your products.

Image Generation

The image generation process with Stable Diffusion 2.0 is quite impressive. You can test and generate images directly in the UI, thanks to the inference widget that comes with each Inference Endpoint.

To get started, you can provide a prompt for the image to be generated. For example, you might try to create a realistic render of a group of flying blue whales towards the moon, with a sci-fi twist and extremely detailed digital painting.

Each prompt is a unique opportunity to explore the capabilities of the model. You can experiment with different descriptions to see what kind of images the model can produce.

The inference widget is similar to the ones you know from the Hugging Face Hub, making it easy to navigate and use.

Optimizations

Memory optimizations are crucial for running SD3 on low-resource hardware, and Diffusers provides a few key techniques to achieve this.

SD3 uses three text encoders, one of which is the massive T5-XXL model, making it challenging to run on GPUs with less than 24GB of VRAM, even with fp16 precision.

Using fp16 precision can help reduce memory usage, but it's not always enough. The T5-XXL model is just too large to fit in memory on smaller GPUs.
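
Two memory-saving options in Diffusers are loading the pipeline without the T5-XXL text encoder and offloading model components to the CPU. The sketch below combines them. The helper names are hypothetical, while `text_encoder_3=None`, `tokenizer_3=None`, and `enable_model_cpu_offload()` are the Diffusers arguments and method used for this purpose (CPU offloading additionally requires the accelerate package).

```python
def sd3_low_memory_kwargs(drop_t5=True):
    """Extra from_pretrained arguments that skip loading the T5-XXL text encoder."""
    if not drop_t5:
        return {}
    return {"text_encoder_3": None, "tokenizer_3": None}


def load_sd3_low_memory(model_id="stabilityai/stable-diffusion-3-medium-diffusers",
                        drop_t5=True):
    """Load SD3 in fp16, optionally without T5, and enable CPU offloading."""
    # Heavy imports are deferred until a pipeline is actually requested.
    import torch
    from diffusers import StableDiffusion3Pipeline

    pipe = StableDiffusion3Pipeline.from_pretrained(
        model_id, torch_dtype=torch.float16, **sd3_low_memory_kwargs(drop_t5)
    )
    # Keeps each component on the CPU until it is needed on the GPU.
    pipe.enable_model_cpu_offload()
    return pipe
```

Dropping T5 lowers memory use noticeably but can reduce how well the model follows long or detailed prompts.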

Separately from memory savings, using compiled components in the SD3 pipeline can speed up inference by as much as 4X.

SD3 Performance Optimizations

Running SD3 can be a challenge on GPUs with less than 24GB of VRAM, even with fp16 precision, largely because of the T5-XXL text encoder.

To make it easier to run SD3 on low-resource hardware, Diffusers offers a few memory optimizations, such as offloading model components to the CPU when they are not in use or dropping the T5-XXL text encoder entirely.

Using compiled components in the SD3 pipeline can speed up inference by as much as 4X. The components to compile with torch.compile are the Transformer and the VAE.
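
Compiling those two components could be sketched as follows. The helper `compile_sd3_components` and its injectable `compile_fn` argument are assumptions made so the example can be dry-run without a GPU. By default it falls back to `torch.compile` with the `max-autotune` mode.

```python
from types import SimpleNamespace


def compile_sd3_components(pipe, compile_fn=None):
    """Replace the transformer and the VAE decode method with compiled versions."""
    if compile_fn is None:
        import torch  # imported lazily so the dry run below needs no PyTorch

        def compile_fn(module):
            return torch.compile(module, mode="max-autotune", fullgraph=True)

    pipe.transformer = compile_fn(pipe.transformer)
    pipe.vae.decode = compile_fn(pipe.vae.decode)
    return pipe


# Dry run with a stand-in object and a fake compile function.
fake_pipe = SimpleNamespace(transformer="transformer", vae=SimpleNamespace(decode="decode"))
compile_sd3_components(fake_pipe, compile_fn=lambda module: f"compiled({module})")
print(fake_pipe.transformer, fake_pipe.vae.decode)
```

With a real pipeline, `compile_sd3_components(pipe)` is called once after loading. The first generation is slow because compilation happens then, and later calls benefit from the speedup.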

Note that the prompt passed to the CLIP text encoders is still truncated to the 77-token limit.

Disable Attention Slicing

Disabling attention slicing makes the model compute attention in one step again. Sliced attention saves memory at the cost of some speed, so computing attention in one step is usually faster when your GPU has enough memory.

If enable_attention_slicing was previously invoked, calling disable_attention_slicing reverts to the original computation method.

This change affects both the speed and the memory use of your model, so consider the trade-offs before making it.
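
Switching between the two modes can be wrapped in a small helper. `enable_attention_slicing` and `disable_attention_slicing` are the pipeline methods involved. The wrapper `set_attention_slicing` and the `RecordingPipe` stand-in class are hypothetical and exist only so the sketch runs without a model.

```python
def set_attention_slicing(pipe, enabled, slice_size="auto"):
    """Turn sliced attention on (less memory) or off (attention in one step)."""
    if enabled:
        pipe.enable_attention_slicing(slice_size)
    else:
        pipe.disable_attention_slicing()


class RecordingPipe:
    """Stand-in that records which pipeline method was called."""

    def __init__(self):
        self.calls = []

    def enable_attention_slicing(self, slice_size="auto"):
        self.calls.append(("enable", slice_size))

    def disable_attention_slicing(self):
        self.calls.append(("disable",))


demo_pipe = RecordingPipe()
set_attention_slicing(demo_pipe, enabled=True)
set_attention_slicing(demo_pipe, enabled=False)
print(demo_pipe.calls)
```

A real StableDiffusionPipeline can be passed in place of the stand-in.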

The StableDiffusionPipeline Class

diffusers.StableDiffusionPipeline is the class for text-to-image generation using Stable Diffusion. It inherits from DiffusionPipeline and is built from several components, such as the vae, text_encoder, tokenizer, and unet.

The vae (Variational Auto-Encoder) Model is used to encode and decode images to and from latent representations. The text_encoder is a frozen text-encoder, specifically the CLIPTextModel, which is used to extract text features. The tokenizer is a CLIPTokenizer, which is used to tokenize the text input. The unet is a Conditional U-Net architecture to denoise the encoded image latents.

Here are the parameters of the StableDiffusionPipeline class:

  • vae: AutoencoderKL
  • text_encoder: CLIPTextModel
  • tokenizer: CLIPTokenizer
  • unet: UNet2DConditionModel
  • scheduler: SchedulerMixin
  • safety_checker: StableDiffusionSafetyChecker
  • feature_extractor: CLIPFeatureExtractor

The pipeline's call also accepts a variety of optional parameters, such as the number of inference steps, the guidance scale, and the eta value. These parameters can be used to tune the generation and achieve better results.

With the right parameters, you can generate high-quality images from text prompts.

Frequently Asked Questions

Does hugging face use Stable Diffusion?

Yes. Stable Diffusion was created by CompVis, Stability AI, and LAION for text-to-image generation, and Hugging Face hosts its models on the Hugging Face Hub and supports them through the Diffusers library.

What is the best Stable Diffusion AI?

There is no single best model, since the right choice depends on your use case. RealVisXL V4.0 is a popular choice, known for its ability to create highly realistic human images, including lifelike faces and eyes.

Can Stable Diffusion generate faces?

Stable Diffusion can generate human faces, but may not always produce seamless, photorealistic results. Experimenting with prompts and new methodologies may be needed to achieve the best face generation outcomes.

How to speed up Stable Diffusion in Huggingface?

To speed up Stable Diffusion in Hugging Face Diffusers, try switching to float16 precision and reducing the number of inference steps. Both changes can significantly reduce generation time, though fewer steps may lower image quality.

Carrie Chambers

Senior Writer

Carrie Chambers is a seasoned blogger with years of experience in writing about a variety of topics. She is passionate about sharing her knowledge and insights with others, and her writing style is engaging, informative and thought-provoking. Carrie's blog covers a wide range of subjects, from travel and lifestyle to health and wellness.
