How to Run Diffusion Image GGUFs in ComfyUI
Guide for running Unsloth Diffusion GGUF models in ComfyUI.
ComfyUI is an open-source diffusion model GUI, API, and backend that uses a node-based (graph/flowchart) interface. ComfyUI is the most popular way to run workflows for image models like Qwen-Image-Edit or FLUX.
GGUF is one of the best and most efficient formats for running diffusion models locally, and Unsloth Dynamic GGUFs use smart quantization to preserve accuracy even at low bit-widths.
You'll learn how to install ComfyUI (Windows, Linux, macOS), build workflows, and tune hyperparameters in this step-by-step tutorial.
Prerequisites & Requirements
You don’t need a GPU to run diffusion GGUFs, just a CPU with RAM. VRAM isn’t required but will make inference much faster. For best results, ensure your total usable memory (RAM + VRAM, or unified memory) is slightly larger than the GGUF size; for example, the 4-bit (Q4_K_M) unsloth/Qwen-Image-Edit-2511-GGUF is 13.1 GB, so you should have at least ~13.2 GB of combined memory. You can find all Unsloth diffusion GGUFs in our Collection.
We recommend at least 3-bit quantization for diffusion models, since their layers, especially the vision components, are very sensitive to quantization. Unsloth Dynamic quants upcast the most important layers to recover as much accuracy as possible.
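If you're unsure how much memory you have to work with, a quick check (shown here for Linux; the second command assumes an NVIDIA GPU and can be skipped on CPU-only or Apple machines) is:
free -h                                            # total and available system RAM
nvidia-smi --query-gpu=memory.total --format=csv   # total VRAM per GPU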
📖 ComfyUI Tutorial
ComfyUI represents the entire image generation pipeline as a graph of connected nodes. This guide focuses on machines with CUDA, but the setup on Apple silicon or CPU-only machines is similar.
#1. Install & Setup
To install ComfyUI, you can download the desktop app for Windows or macOS here. Otherwise, to set up ComfyUI for running GGUF models, run the following:
mkdir comfy_ggufs
cd comfy_ggufs
python -m venv .venv
source .venv/bin/activate
git clone https://github.com/comfyanonymous/ComfyUI.git
cd ComfyUI
pip install -r requirements.txt
cd custom_nodes
git clone https://github.com/city96/ComfyUI-GGUF
cd ComfyUI-GGUF
pip install -r requirements.txt
cd ../..
#2. Download Models
Running a diffusion model typically requires three components: a Variational AutoEncoder (VAE) that maps between image pixel space and latent space, a text encoder that turns your prompt into input embeddings, and the diffusion transformer itself. You can find all Unsloth diffusion GGUFs in our Collection here.
Both the diffusion transformer and the text encoder can be in GGUF format, while the VAE is typically a safetensors file. Let's download the models we will use.
See GGUF uploads for: Qwen-Image-Edit-2511, FLUX.2-dev and Qwen-Image-Layered
The format of the VAE and diffusion model may differ from the diffusers checkpoints. Only use checkpoints that are compatible with ComfyUI.
These files must be placed in the correct folders for ComfyUI to see them. In addition, if the text encoder ships with an mmproj (vision tower) file, it must use the same filename prefix as the text encoder.
Download reference images to be used later as well.
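As a concrete sketch, you could fetch the FLUX.2 files used later in this guide with huggingface-cli and drop them straight into ComfyUI's model folders. The repository names below are assumptions, so double-check the Collection for the exact Unsloth uploads and file names; run the commands from the ComfyUI directory:
# Diffusion transformer (GGUF) -> models/unet
huggingface-cli download unsloth/FLUX.2-dev-GGUF flux2-dev-Q4_K_M.gguf --local-dir models/unet
# Text encoder (GGUF) -> models/clip (keep any mmproj GGUF next to it, with the same filename prefix)
huggingface-cli download unsloth/Mistral-Small-3.2-24B-Instruct-2506-GGUF Mistral-Small-3.2-24B-Instruct-2506-UD-Q4_K_XL.gguf --local-dir models/clip
# VAE (safetensors) -> models/vae (download flux2-vae.safetensors from a ComfyUI-compatible repo and place it here)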
Workflow and Hyperparameters
You can also view our detailed 🎯 Workflow and Hyperparameters Guide.
Navigate to the ComfyUI directory and run:
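python main.py
# optional, for a remote/cloud machine where you want to bind to all interfaces:
# python main.py --listen 0.0.0.0 --port 8188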
This will launch a web server that you can access at http://127.0.0.1:8188. If you are running this in the cloud, you'll need to make sure port forwarding is set up so you can access it from your local machine.
Workflows are saved as JSON files embedded in output images (PNG metadata) or as separate .json files. You can:
Drag & drop an image into ComfyUI to load its workflow
Export/import workflows via the menu
Share workflows as JSON files
Below are two example FLUX.2 workflow JSON files which you can download and use:
Instead of setting up the workflow from scratch, you can download the text-to-image workflow here.
Load it into the browser page by clicking the Comfy logo -> File -> Open, then choose the unsloth_flux2_t2i_gguf.json file you just downloaded. It should look like this:


This workflow is based on the officially published ComfyUI workflow, except that it uses the GGUF loader extension and is simplified to illustrate text-to-image generation.
#3. Inference
ComfyUI is highly customizable: you can mix models and create extremely complex pipelines. For a basic text-to-image setup, we need to load the models, specify the prompt and image details, and decide on a sampling strategy.
Upload Models + Set Prompt
We already downloaded the models, so we just need to pick the correct ones. For Unet Loader pick flux2-dev-Q4_K_M.gguf, for CLIPLoader pick Mistral-Small-3.2-24B-Instruct-2506-UD-Q4_K_XL.gguf, and for Load VAE pick flux2-vae.safetensors.
You can set any prompt you'd like. Since classifier-free guidance is baked into the model, we do not need to specify a negative prompt.
Image Size + Sampler Parameters
FLUX.2-dev supports different image sizes; you can make rectangular images by adjusting the width and height values. For the sampler parameters, you can experiment with samplers other than euler and with more or fewer sampling steps. Change the RandomNoise setting from randomize to fixed if you want to see how different settings change the output for the same seed.
Run
Click Run and an image will be generated in 45-60 seconds. The output image can be saved, and the interesting part is that the entire ComfyUI workflow is stored in the image's metadata: you can share it, and anyone can see exactly how it was created by loading it back into the UI.
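If you want to peek at that embedded workflow outside of ComfyUI, a minimal sketch using Pillow (which ComfyUI already installs) is below; the file name is just the default SaveImage pattern, so adjust it to your actual output:
python -c "from PIL import Image; print(Image.open('output/ComfyUI_00001_.png').info.get('workflow'))"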

Multi Reference Generation
A key feature of FLUX.2 is multi-reference generation, where you can supply multiple images to help control generation. This time, load unsloth_flux2_i2i_gguf.json. We will use the same models; the only difference is a few extra nodes that select the reference images we downloaded earlier. You'll notice the prompt refers to both image 1 and image 2, which act as prompt anchors for the reference images. Once loaded, click Run, and you'll see an output that places our two unique sloth characters together while preserving their likeness.

🎯 Workflow and Hyperparameters
For text-to-image workflows, we need to specify a prompt, sampling parameters, an image size, a guidance scale, and any optimization configs.
Sampling
Sampling works differently than in LLMs: instead of sampling one token at a time, we sample the whole image over multiple steps. Each step progressively "denoises" the image, which is why running more steps tends to produce higher-quality images. There are also different sampling algorithms, ranging from first-order to second-order and from deterministic to stochastic. For this tutorial we will use euler, a standard sampler that balances quality and speed.
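For intuition, a single euler step roughly follows x_next = x + (sigma_next - sigma) * (x - denoised) / sigma, where denoised is the model's current estimate of the clean image and sigma is the current noise level; more steps just means smaller, more careful moves along this trajectory.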
Guidance
Guidance is another important hyperparameter for diffusion models. There are many flavors of guidance, but the two most widely used are classifier-free guidance (CFG) and guidance distillation. The concept of classifier-free guidance stems from the paper Classifier-Free Diffusion Guidance. Historically, you needed a separate classifier model to guide generation toward the input condition; this paper shows that you can instead use the difference between the model’s conditional and unconditional predictions to form a guidance direction.
In practice, the unconditional prediction is usually replaced by a negative-prompt prediction, i.e. a prompt describing what we definitely don't want, so the model steers away from it. With CFG you do not need a separate classifier model, but you do need a second inference pass for the unconditional or negative prompt. Other models have guidance baked in during training (guidance distillation), but you can still set the strength of that guidance. This is separate from CFG since it does not need a second inference pass, yet it remains a tunable hyperparameter that controls how strong the effect is.
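As a rough formula, the guided prediction is guided = uncond + cfg_scale * (cond - uncond): a scale of 1 simply recovers the conditional prediction, while larger values push the output more strongly toward the prompt (and away from the negative prompt, when one is used).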
Conclusion
Putting it all together: you set a prompt to tell the model what to produce, the text encoder turns that prompt into embeddings, the VAE encodes any reference images into latent space, the diffusion model denoises the latents over the chosen number of steps according to the sampling parameters and guidance, and finally the VAE decodes the result into a usable image.
Key Concepts & Glossary
Latent: Compressed image representation (what the model operates on).
Conditioning: Text/image information that guides generation.
Diffusion Model / UNet: Neural network that performs the denoising.
VAE: Encoder/decoder between pixel space and latent space.
CLIP (text encoder): Converts a prompt into embeddings.
Sampler: Algorithm that iteratively denoises the latent.
Scheduler: Controls the noise schedule across steps.
Nodes: Operations (load model, encode text, sample, decode, etc.).
Edges: Data flowing between nodes.