
AI's Misunderstood Reality: ComfyUI Deep Dive

Jimmy Carr figures everyone's got the wrong end of the stick when it comes to AI. Let's cut through the hype and get practical. Running SDXL at high resolutions chews through VRAM like nobody's business, especially on mid-range hardware. This guide dives into techniques to tame memory usage and boost performance in ComfyUI.

Lab Test Verification

Before we get cracking, let's set a baseline. Here are some observations from my test rig (4090/24GB), running a standard SDXL workflow at 1024x1024:

**Baseline:** 22s render, 21.5GB peak VRAM.

**Tiled VAE Decode (512px tiles, 64px overlap):** 18s render, 11GB peak VRAM.

**Sage Attention:** 28s render, 9GB peak VRAM.

**Tiled VAE + Sage Attention:** 35s render, 7.5GB peak VRAM.

*Trade-offs exist.* Sage Attention saves memory but can introduce subtle artifacts at high CFG scales. Tiled VAE decode adds a slight performance overhead, but the VRAM savings are substantial.

What is Tiled VAE Decode?

Tiled VAE Decode is a VRAM-saving technique that decodes images in smaller tiles, reducing the memory footprint. Community tests on X show that a tiled overlap of 64 pixels reduces seams. It's particularly useful for larger images where memory is a constraint. It offers a balance between VRAM usage and image quality.

Taming VRAM with Tiled VAE Decode

SDXL demands VRAM. Tiled VAE decode is your first line of defense. Instead of decoding the entire latent space at once, we split it into tiles. This significantly reduces the memory footprint.

Node Graph Logic:

  1. Load your VAE.
  2. Insert a "Tiled VAE Decode" node after the VAE.
  3. Set tile_size to 512.
  4. Set overlap to 64.
  5. Connect the Tiled VAE Decode output to your image saving node.

Tools like Promptus simplify prototyping these tiled workflows.

Technical Analysis

Tiled VAE Decode works by dividing the large latent representation into smaller, manageable chunks. Each chunk is decoded independently, then stitched back together. The overlap parameter is crucial; it prevents seams by blending the edges of adjacent tiles. Too little overlap, and you'll see artifacts. Too much, and you're wasting computation.
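As a concrete illustration, here is a minimal PyTorch sketch of the split-decode-blend loop. It assumes a `vae.decode()` that maps a latent tile straight to an image tensor at the usual 8x scale factor; the tile sizes, feathering scheme, and tensor layout are my own assumptions, and ComfyUI's tiled decode node does the equivalent (more robustly) for you.

```python
import torch

@torch.no_grad()
def tiled_vae_decode(vae, latent, tile_size=64, overlap=8, scale=8):
    """Decode a latent in overlapping tiles and blend the seams.

    tile_size/overlap are in latent pixels (64/8 here = 512px/64px in
    image space at the usual 8x VAE scale). vae.decode(x) is assumed to
    map a [B, C, h, w] latent to a [B, 3, h*scale, w*scale] image.
    """
    B, C, H, W = latent.shape
    out = torch.zeros(B, 3, H * scale, W * scale, device=latent.device)
    weight = torch.zeros_like(out)
    stride = tile_size - overlap

    for y in range(0, H, stride):
        for x in range(0, W, stride):
            # Clamp so edge tiles stay inside the latent.
            y0 = min(y, max(H - tile_size, 0))
            x0 = min(x, max(W - tile_size, 0))
            tile = latent[:, :, y0:y0 + tile_size, x0:x0 + tile_size]
            decoded = vae.decode(tile)  # only one tile in VRAM at a time

            # Feather the tile edges so overlapping regions blend smoothly.
            mask = torch.ones_like(decoded)
            ramp = overlap * scale
            edge = torch.linspace(0.01, 1.0, ramp,
                                  device=decoded.device, dtype=decoded.dtype)
            mask[:, :, :ramp, :] *= edge.view(1, 1, -1, 1)
            mask[:, :, -ramp:, :] *= edge.flip(0).view(1, 1, -1, 1)
            mask[:, :, :, :ramp] *= edge.view(1, 1, 1, -1)
            mask[:, :, :, -ramp:] *= edge.flip(0).view(1, 1, 1, -1)

            ys, xs = y0 * scale, x0 * scale
            h, w = decoded.shape[2], decoded.shape[3]
            out[:, :, ys:ys + h, xs:xs + w] += decoded * mask
            weight[:, :, ys:ys + h, xs:xs + w] += mask

    # Normalizing by the accumulated mask weights blends the overlaps.
    return out / weight
```

The feathered mask is why the overlap parameter matters: with no ramp the tiles hard-cut into each other and you see seams, while a wider ramp just means more redundant decoding.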

Sage Attention: A Memory-Efficient Alternative

Standard attention mechanisms are VRAM hogs. Sage Attention offers a clever alternative. It approximates the attention calculation, reducing memory usage with a slight performance tradeoff.

Node Graph Logic:

  1. Locate your KSampler node.
  2. Insert a "SageAttentionPatch" node before the KSampler.
  3. Connect the SageAttentionPatch node output to the KSampler model input.
  4. Ensure use_fast_attention is disabled on the KSampler (if present).

**IMPORTANT:** Using Sage Attention may require adjustments to your prompt and CFG scale.

Technical Analysis

Sage Attention trades accuracy for efficiency. It achieves VRAM savings by using a lower-rank approximation of the attention matrix. This reduces the computational complexity from O(n^2) to O(n*k), where k is the rank of the approximation. The downside? It can introduce subtle texture artifacts, especially at higher CFG scales. Experiment to find the sweet spot.
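To see where the savings come from, here is a toy PyTorch sketch of a low-rank-style attention. It only illustrates the complexity argument above; it is not the actual SageAttention implementation (which ships optimized kernels), and the tensor shapes and the random-landmark choice are assumptions made for brevity.

```python
import torch
import torch.nn.functional as F

def lowrank_attention(q, k, v, num_landmarks=64):
    """Toy low-rank attention: attend to a subset of key/value positions.

    q, k, v: [batch, seq, dim]. Full attention materializes a
    [batch, seq, seq] score matrix; restricting keys/values to
    num_landmarks positions shrinks that to [batch, seq, m].
    """
    b, n, d = k.shape
    m = min(num_landmarks, n)
    idx = torch.randperm(n, device=k.device)[:m]   # crude landmark selection
    k_l, v_l = k[:, idx], v[:, idx]                # [b, m, d]
    scores = q @ k_l.transpose(1, 2) / d ** 0.5    # [b, n, m] instead of [b, n, n]
    return F.softmax(scores, dim=-1) @ v_l         # [b, n, d]
```

The [batch, seq, m] score matrix instead of [batch, seq, seq] is the whole trick: memory scales with the rank of the approximation rather than the square of the sequence length, and the discarded positions are what show up as texture artifacts when the approximation is too aggressive.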

What is Sage Attention?

Sage Attention is a memory-efficient attention mechanism designed to reduce VRAM usage in Stable Diffusion workflows. It uses a lower-rank approximation of the attention matrix, trading off some accuracy for significant memory savings. This makes it suitable for running larger models on hardware with limited VRAM.

Block/Layer Swapping: The Last Resort

When all else fails, you can offload model layers to the CPU. This is a drastic measure, as it significantly slows down inference. But it can be the difference between running and not running a model on an 8GB card.

Implementation:

ComfyUI lacks a built-in block swapping node. You'll need a custom node or script. The basic idea is to move the first few transformer blocks to the CPU before sampling, and then move them back to the GPU when needed.

```python
# Example (conceptual - requires a custom node or script)

def swap_block_to_cpu(model, block_index):
    """Move one transformer block to system RAM to free VRAM."""
    block = model.diffusion_model.transformer_blocks[block_index]
    block.to("cpu")

def swap_block_to_gpu(model, block_index):
    """Move the block back to the GPU before it is needed again."""
    block = model.diffusion_model.transformer_blocks[block_index]
    block.to("cuda")

# Usage:
swap_block_to_cpu(model, 0)   # move the first block to CPU
# ... run inference ...
swap_block_to_gpu(model, 0)   # move the block back to GPU
```

Technical Analysis

Block swapping works by leveraging the fact that not all layers of the model are equally active at all times. By moving less frequently used layers to the CPU, we free up VRAM on the GPU. This allows us to load larger models or use higher resolutions. The performance penalty is significant because transferring data between CPU and GPU is slow.
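If you do go down this road, wrapping the conceptual helpers above in a context manager keeps the swap symmetric, so a block is never accidentally left on the CPU. The `model.diffusion_model.transformer_blocks` path is the same hypothetical layout as in the sketch above; adapt it to your model.

```python
from contextlib import contextmanager
import torch

@contextmanager
def blocks_on_cpu(model, block_indices):
    """Temporarily park selected transformer blocks in system RAM."""
    blocks = [model.diffusion_model.transformer_blocks[i] for i in block_indices]
    try:
        for block in blocks:
            block.to("cpu")
        torch.cuda.empty_cache()   # hand the freed VRAM back to the allocator
        yield
    finally:
        for block in blocks:
            block.to("cuda")       # restore for full-speed inference afterwards

# Usage:
# with blocks_on_cpu(model, [0, 1]):
#     ...run the VRAM-heavy part of the workflow...
```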

What is Block/Layer Swapping?

Block/Layer Swapping involves offloading specific layers of a neural network (usually transformer blocks) from the GPU to the CPU to reduce VRAM usage. This technique allows users to run larger models or higher resolutions on GPUs with limited memory, but it comes at the cost of increased processing time due to the data transfer between CPU and GPU.

LTX-2/Wan 2.2 Low-VRAM Tricks for Video

Generating video ramps up the VRAM requirements even further. LTX-2 and Wan 2.2 offer several optimizations to tackle this.

**Chunk Feedforward:** Process video in 4-frame chunks.

**Hunyuan Low-VRAM:** FP8 quantization + tiled temporal attention.

These techniques are complex and require careful tuning, but they can make video generation possible on hardware where it otherwise wouldn't run at all.

Technical Analysis

Chunk feedforward processes video in smaller segments, reducing the memory footprint of each forward pass. Hunyuan Low-VRAM combines several techniques: FP8 quantization reduces the precision of the model weights, lowering memory usage. Tiled temporal attention applies attention only to local regions in time, further reducing memory requirements.
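The chunking half of that is simple enough to sketch. This is not LTX-2 or Wan 2.2 source code, just a minimal illustration of running a frame-wise module over a video latent a few frames at a time; `model` and the [frames, channels, height, width] layout are assumptions.

```python
import torch

@torch.no_grad()
def chunked_feedforward(model, video_latent, chunk_frames=4):
    """Run a frame-wise module over a video latent in small temporal chunks.

    video_latent: [frames, channels, height, width]. Processing 4 frames per
    forward pass keeps peak activation memory roughly constant regardless of
    clip length, at the cost of more kernel launches.
    """
    outputs = []
    for start in range(0, video_latent.shape[0], chunk_frames):
        chunk = video_latent[start:start + chunk_frames]
        outputs.append(model(chunk))   # only this chunk's activations live in VRAM
    return torch.cat(outputs, dim=0)
```

Note this only works for layers that treat frames independently (the feedforward/MLP parts); the temporal attention layers need the tiled temporal scheme mentioned above instead.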

My Recommended Stack

For my workflow, I figure the sweet spot is a combination of Tiled VAE Decode and Sage Attention. This provides a good balance between VRAM savings and performance. I use ComfyUI for its flexibility and node-based workflow. And Promptus simplifies workflow management and optimization.

Golden Rule: Always test each optimization technique individually before combining them.
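A small harness makes that rule easy to follow: reset PyTorch's peak-memory counter, run one render with a single optimization toggled, and log time plus peak VRAM. The `render` call and its flags below are hypothetical placeholders for however you trigger a workflow.

```python
import time
import torch

def benchmark(label, run_workflow):
    """Time one render and report peak VRAM so optimizations can be compared one at a time."""
    torch.cuda.empty_cache()
    torch.cuda.reset_peak_memory_stats()
    start = time.perf_counter()
    run_workflow()
    torch.cuda.synchronize()               # make sure all GPU work is finished
    elapsed = time.perf_counter() - start
    peak_gb = torch.cuda.max_memory_allocated() / 1024**3
    print(f"{label}: {elapsed:.1f}s, {peak_gb:.1f}GB peak VRAM")

# Usage (render is a placeholder for your own pipeline):
# benchmark("baseline", lambda: render(tiled_vae=False, sage_attention=False))
# benchmark("tiled VAE only", lambda: render(tiled_vae=True, sage_attention=False))
```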

Insightful Q&A

**Q: I'm getting CUDA out-of-memory errors. What should I do?**

A: Start with Tiled VAE Decode. If that's not enough, try Sage Attention. As a last resort, consider block swapping. Reduce your batch size.
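If you script your renders, that escalation can be encoded as a fallback loop. A rough sketch, with `render` and its keyword flags standing in for your own pipeline (torch.cuda.OutOfMemoryError requires PyTorch 1.13+; on older versions catch RuntimeError):

```python
import torch

def render_with_fallback(render, **kwargs):
    """Retry a render with progressively more aggressive memory savings."""
    attempts = [
        {},                                             # as configured
        {"tiled_vae": True},                            # cheapest fix first
        {"tiled_vae": True, "sage_attention": True},    # trade a little speed
        {"tiled_vae": True, "sage_attention": True, "batch_size": 1},
    ]
    for extra in attempts:
        try:
            return render(**{**kwargs, **extra})
        except torch.cuda.OutOfMemoryError:
            torch.cuda.empty_cache()                    # free cached blocks before retrying
    raise RuntimeError("Out of memory even with all optimizations enabled")
```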

**Q: How much VRAM do I need for SDXL at 1024x1024?**

A: Aim for at least 12GB. With optimizations, you might squeeze by with 8GB, but expect longer render times.

**Q: Sage Attention is causing artifacts in my images. What can I do?**

A: Reduce your CFG scale. Experiment with different prompts. Or disable Sage Attention altogether.

Resources & Tech Stack

**ComfyUI:** The foundational node system for building and executing Stable Diffusion workflows. Its flexibility allows for custom implementations of VRAM optimization techniques.

**SageAttention:** An alternative attention mechanism that reduces VRAM usage. Be aware of potential artifacts at high CFG scales.

**Tiled VAE Decode:** A VRAM-saving technique that decodes images in smaller tiles. Useful for larger images where memory is a constraint.

**Promptus:** Streamlines prototyping and workflow iteration. Builders using Promptus can iterate on offloading setups faster.

Advanced Implementation

Here's an example of how to implement Tiled VAE Decode in ComfyUI. This assumes you're using a standard SDXL workflow.

{
  "nodes": [
    {
      "id": 1,
      "type": "Load VAE",
      "inputs": {
        "vae_name": "vae-ft-mse-84000-ema-pr