
Overview

Stable Diffusion XL (SDXL) is Stability AI’s flagship text-to-image model, offering exceptional quality and versatility. It is among the most widely used diffusion models, with excellent community support and thousands of fine-tuned variants available.
SDXL is the recommended starting point for most users due to its excellent balance of quality, speed, and VRAM requirements.

Model Variants

SDXL Base 1.0

The standard SDXL model optimized for high-quality image generation.
from hypergen import model

m = model.load("stabilityai/stable-diffusion-xl-base-1.0")
m.to("cuda")
Key Features:
  • Resolution: Native 1024x1024 (can generate up to 2048x2048)
  • Quality: Excellent detail and composition
  • VRAM: 8GB minimum, 12GB recommended
  • Speed: ~4 seconds per image (RTX 4090, 50 steps)

SDXL Turbo

A distilled variant optimized for ultra-fast generation (1-4 steps).
from hypergen import model

m = model.load("stabilityai/sdxl-turbo")
m.to("cuda")
Key Features:
  • Speed: 3-4x faster than base (1-4 inference steps)
  • Quality: Very good (slightly below base)
  • VRAM: 8GB minimum
  • Use Case: Rapid prototyping, real-time applications
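
Because Turbo is distilled for few-step sampling, generation calls differ mainly in the step count. A minimal sketch; disabling classifier-free guidance (guidance_scale=0.0) follows the common recommendation for Turbo-style distilled models and is an assumption here, not a documented hypergen default:
from hypergen import model

m = model.load("stabilityai/sdxl-turbo")
m.to("cuda")

# 1-4 steps is the intended operating range for Turbo
image = m.generate(
    "A watercolor painting of a lighthouse at dusk",
    num_inference_steps=4,
    guidance_scale=0.0,  # assumption: CFG disabled, as is typical for Turbo
)
image[0].save("turbo_output.png")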

SDXL Refiner

A specialized model for refining SDXL base outputs (optional).
from hypergen import model

# Load base model
m_base = model.load("stabilityai/stable-diffusion-xl-base-1.0")

# Load refiner (optional)
m_refiner = model.load("stabilityai/stable-diffusion-xl-refiner-1.0")
The refiner is optional and typically used for professional workflows. Most users don’t need it.
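
This page doesn’t show how hypergen hands the base output to the refiner; the sketch below assumes a hypothetical image= parameter on generate for img2img-style refinement, so treat it as illustrative rather than the library’s confirmed API:
from hypergen import model

# Stage 1: draft with the base model
m_base = model.load("stabilityai/stable-diffusion-xl-base-1.0")
m_base.to("cuda")
draft = m_base.generate("A portrait of an astronaut, studio lighting")

# Stage 2: polish details with the refiner
# (image= is a hypothetical parameter, used here for illustration)
m_refiner = model.load("stabilityai/stable-diffusion-xl-refiner-1.0")
m_refiner.to("cuda")
final = m_refiner.generate(
    "A portrait of an astronaut, studio lighting",
    image=draft[0],
)
final[0].save("refined.png")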

Loading SDXL with HyperGen

Basic Loading

from hypergen import model

# Load SDXL
m = model.load("stabilityai/stable-diffusion-xl-base-1.0")
m.to("cuda")

# Generate an image
image = m.generate("A majestic lion in the African savanna")
image[0].save("output.png")

Optimized Loading

For better performance and lower VRAM usage:
from hypergen import model

# Load with fp16 precision
m = model.load(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype="float16",
    variant="fp16",
    use_safetensors=True,
)
m.to("cuda")
Using torch_dtype="float16" reduces VRAM usage by ~50% with minimal quality loss.

Memory-Optimized Loading

For GPUs with 8GB VRAM:
from hypergen import model

m = model.load("stabilityai/stable-diffusion-xl-base-1.0", torch_dtype="float16")
m.to("cuda")

# Enable memory optimizations
m.enable_vae_slicing()           # Reduce VAE memory usage
m.enable_attention_slicing()     # Reduce attention memory usage

Training LoRAs with SDXL

SDXL is the most popular model for LoRA training due to its excellent quality and wide compatibility.

Basic LoRA Training

from hypergen import model, dataset

# Load model
m = model.load("stabilityai/stable-diffusion-xl-base-1.0", torch_dtype="float16")
m.to("cuda")

# Load dataset
ds = dataset.load("./my_images")

# Train LoRA
lora = m.train_lora(
    ds,
    steps=1000,
    rank=16,
    alpha=32,
    learning_rate=1e-4,
)
Three preset configurations cover common VRAM budgets: Quick Training (8GB), Balanced Training (12GB), and High-Quality Training (16GB+). Quick Training, for fast iteration and testing (the other two presets are sketched after its settings list):
lora = m.train_lora(
    ds,
    steps=800,
    learning_rate=1e-4,
    rank=8,
    alpha=16,
    batch_size=1,
    gradient_accumulation_steps=4,
)
Settings:
  • Lower rank (8) for faster training
  • Fewer steps for quick results
  • Works on 8GB VRAM
  • Training time: ~10 minutes (50 images)
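
The Balanced and High-Quality presets follow the same pattern. The configurations below are sketches assembled from settings that appear elsewhere on this page (the defaults under Hyperparameter Selection and the rank 32 / batch 2 row in the training benchmarks), not separately tuned recipes:
# Balanced preset (~12GB VRAM): the default hyperparameters
lora = m.train_lora(
    ds,
    steps=1000,
    learning_rate=1e-4,
    rank=16,
    alpha=32,
    batch_size=1,
    gradient_accumulation_steps=4,
)

# High-quality preset (16GB+ VRAM): higher rank and a real batch
lora = m.train_lora(
    ds,
    steps=1500,
    learning_rate=5e-5,
    rank=32,
    alpha=64,
    batch_size=2,
    gradient_accumulation_steps=2,
)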

Training for Different Use Cases

Learning an artistic style or aesthetic:
lora = m.train_lora(
    ds,
    steps=1500,
    learning_rate=1e-4,
    rank=16,
    alpha=32,
    batch_size=1,
    gradient_accumulation_steps=4,
)
Dataset:
  • 50-200 images in the target style
  • Consistent aesthetic across all images
  • Captions describing content, not style
  • High resolution (1024x1024+)
Example caption:
A landscape with mountains and a lake, trees in the foreground
(Describe the content you see, not the style; the style itself is what the LoRA learns.)
Learning a specific person, character, or object:
lora = m.train_lora(
    ds,
    steps=1000,
    learning_rate=5e-5,
    rank=32,
    alpha=64,
    batch_size=1,
    gradient_accumulation_steps=4,
)
Dataset:
  • 20-100 images of the subject
  • Variety of poses, angles, and expressions
  • Different lighting conditions
  • Detailed captions
Example caption:
A photo of [subject name], smiling, wearing a blue shirt,
front-facing portrait, natural lighting
Learning a new concept or composition style:
lora = m.train_lora(
    ds,
    steps=2000,
    learning_rate=1e-4,
    rank=24,
    alpha=48,
    batch_size=1,
    gradient_accumulation_steps=8,
)
Dataset:
  • 30-150 images demonstrating the concept
  • Varied examples showing different aspects
  • Captions focusing on composition and elements

Inference Parameters

Basic Generation

image = m.generate(
    prompt="A serene Japanese garden with cherry blossoms",
    negative_prompt="blurry, low quality, distorted",
    num_inference_steps=50,
    guidance_scale=7.5,
    height=1024,
    width=1024,
)

Parameter Guide

prompt (str, required)
Text description of the desired image.

negative_prompt (str, default: "")
What to avoid in the generated image. Common negative prompts:
negative_prompt="blurry, low quality, distorted, deformed, bad anatomy"

num_inference_steps (int, default: 50)
Number of denoising steps:
  • 20-30: Fast, good quality
  • 40-50: Better quality (recommended)
  • 50-100: Highest quality, diminishing returns

guidance_scale (float, default: 7.5)
How closely to follow the prompt:
  • 5-6: More creative, less literal
  • 7-8: Balanced (recommended)
  • 9-12: Very literal, can be oversaturated

height (int, default: 1024)
Image height in pixels (must be a multiple of 8):
  • 1024: Standard (recommended)
  • 768: Faster, lower quality
  • 1536-2048: Higher detail, slower

width (int, default: 1024)
Image width in pixels (must be a multiple of 8).

Speed Priority

num_inference_steps=20
guidance_scale=7.0
height=768
width=768
Generation time: ~1.5s

Balanced

num_inference_steps=40
guidance_scale=7.5
height=1024
width=1024
Generation time: ~3.5s

Quality Priority

num_inference_steps=50
guidance_scale=7.5
height=1024
width=1024
Generation time: ~4.5s
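
The presets above can live in a plain dictionary and be unpacked into generate, keeping the trade-off explicit at each call site; an illustrative pattern:
# Preset table matching the values above
PRESETS = {
    "speed":    {"num_inference_steps": 20, "guidance_scale": 7.0, "height": 768,  "width": 768},
    "balanced": {"num_inference_steps": 40, "guidance_scale": 7.5, "height": 1024, "width": 1024},
    "quality":  {"num_inference_steps": 50, "guidance_scale": 7.5, "height": 1024, "width": 1024},
}

image = m.generate("A quiet harbor at dawn", **PRESETS["balanced"])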

Advanced Generation

# Generate with custom seed for reproducibility
image = m.generate(
    prompt="A futuristic cityscape at night",
    seed=42,
    num_inference_steps=50,
)

# Generate multiple variations
images = m.generate(
    prompt="A cat wearing a wizard hat",
    num_images=4,  # Generate 4 images
    guidance_scale=7.5,
)

# High-resolution generation
image = m.generate(
    prompt="A detailed portrait",
    height=1536,
    width=1536,
    num_inference_steps=60,
)

Performance Benchmarks

Generation Performance

Based on NVIDIA RTX 4090, 1024x1024 resolution:
Steps | VRAM | Time  | Quality
20    | ~9GB | ~1.8s | Good
30    | ~9GB | ~2.5s | Very Good
40    | ~9GB | ~3.5s | Excellent
50    | ~9GB | ~4.2s | Excellent+
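
To get comparable numbers on your own hardware, time generation directly; a minimal sketch (the warm-up run keeps one-time setup cost out of the measurement):
import time

from hypergen import model

m = model.load("stabilityai/stable-diffusion-xl-base-1.0", torch_dtype="float16")
m.to("cuda")

# Warm-up: excludes one-time compilation and caching overhead
m.generate("warm-up", num_inference_steps=20)

start = time.perf_counter()
m.generate("A majestic lion in the African savanna", num_inference_steps=40)
print(f"Generation took {time.perf_counter() - start:.1f}s")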

Training Performance

LoRA training on RTX 4090, 50 images:
Configuration    | VRAM  | Time (1000 steps)
Rank 8, Batch 1  | ~8GB  | ~10 min
Rank 16, Batch 1 | ~9GB  | ~12 min
Rank 32, Batch 1 | ~11GB | ~15 min
Rank 32, Batch 2 | ~14GB | ~18 min

VRAM Requirements

1. 8GB VRAM

GPUs: RTX 3060 Ti, RTX 2080
Capabilities:
  • Generation: 1024x1024 
  • Training: Rank 8-16 
  • Batch size: 1 

2. 12GB VRAM

GPUs: RTX 3060 12GB, RTX 4070 Ti
Capabilities:
  • Generation: 1024x1024 
  • Training: Rank 16-32 
  • Batch size: 1-2 

3. 16GB+ VRAM

GPUs: RTX 4080, RTX 4090, A100
Capabilities:
  • Generation: Up to 2048x2048 
  • Training: Rank 32-64 
  • Batch size: 2-4 
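
The tiers above map directly to a small lookup if you want to pick training settings programmatically. A sketch that reads available VRAM via PyTorch (assumes torch is importable alongside hypergen):
import torch

def training_config() -> dict:
    """Pick LoRA settings from the VRAM tiers above."""
    vram_gb = torch.cuda.get_device_properties(0).total_memory / 1024**3
    if vram_gb >= 16:
        return {"rank": 32, "alpha": 64, "batch_size": 2}
    if vram_gb >= 12:
        return {"rank": 16, "alpha": 32, "batch_size": 1}
    return {"rank": 8, "alpha": 16, "batch_size": 1}

# m and ds as loaded in the training examples above
lora = m.train_lora(ds, steps=1000, **training_config())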

Best Practices

Prompt Engineering

Effective prompts combine four elements: structure, negative prompts, style control, and quality modifiers. Good prompt structure:
[Main subject], [Details], [Style], [Lighting], [Quality modifiers]
Good:
prompt = """
A majestic mountain landscape with snow-capped peaks,
pine forest in the foreground, golden hour lighting,
photorealistic, highly detailed, 8k
"""
Poor:
prompt = "nice mountain"

Training Best Practices

1. Dataset Quality

Prepare high-quality training data. Do:
  • Use high-resolution images (1024x1024 or higher)
  • Ensure consistent quality
  • Include variety (poses, angles, lighting)
  • Write detailed captions
  • 20-150 images is usually sufficient
Don’t:
  • Use low-resolution or blurry images
  • Include duplicates
  • Mix different subjects in same dataset
  • Leave images uncaptioned

2. Caption Writing

Write effective captions. Good caption:
A person wearing a red jacket and blue jeans,
standing in front of a brick wall,
natural daylight, slight smile
Poor caption:
person
Tips:
  • Describe what you see objectively
  • Include composition, lighting, colors
  • Be consistent in style
  • Don’t describe what you want to learn
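
This page doesn’t specify how captions are attached to images; a common convention, assumed here, is one sidecar .txt file per image, which a short script can generate:
from pathlib import Path

# Assumption: dataset.load("./my_images") reads a .txt caption per image;
# adjust to hypergen's actual dataset format.
captions = {
    "img_001.jpg": "A person wearing a red jacket and blue jeans, "
                   "standing in front of a brick wall, natural daylight",
    "img_002.jpg": "A person in a green sweater at a wooden desk, "
                   "soft window light, slight smile",
}

root = Path("./my_images")
for filename, caption in captions.items():
    (root / filename).with_suffix(".txt").write_text(caption)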

3. Hyperparameter Selection

Choose appropriate hyperparameters. Start with the defaults:
steps=1000
learning_rate=1e-4
rank=16
alpha=32
Adjust based on results:
  • Underfitting? Increase steps, rank, or learning rate
  • Overfitting? Decrease steps, add more data
  • Out of memory? Reduce rank or batch size

4. Checkpoint Management

Save and test checkpoints:
lora = m.train_lora(
    ds,
    steps=2000,
    save_steps=500,  # Save every 500 steps
    output_dir="./checkpoints"
)
Test checkpoints at 500, 1000, 1500, and 2000 steps to find the best one.
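
Applying a saved checkpoint at inference isn’t covered on this page; the loop below assumes a hypothetical m.load_lora(path) purely for illustration:
# m.load_lora is assumed here, not a confirmed hypergen API
test_prompt = "A photo of [subject name], front-facing portrait"

for step in (500, 1000, 1500, 2000):
    m.load_lora(f"./checkpoints/checkpoint-{step}")  # hypothetical call
    image = m.generate(test_prompt, seed=42)  # fixed seed for comparability
    image[0].save(f"test_step_{step}.png")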

Memory Optimization

Reduce VAE memory usage:
m.enable_vae_slicing()
  • Reduces VRAM by ~10%
  • Minimal performance impact
  • Recommended for all users
Reduce attention memory usage:
m.enable_attention_slicing()
  • Reduces VRAM by ~15-20%
  • Small performance impact (~5% slower)
  • Useful for 8GB GPUs
Offload to CPU when not in use:
m.enable_model_cpu_offload()
  • Reduces VRAM by ~40-50%
  • Significant performance impact (~20% slower)
  • Use only if necessary
Use float16 instead of float32:
m = model.load(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype="float16"
)
  • Reduces VRAM by ~50%
  • Minimal quality impact
  • Strongly recommended
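
These optimizations compose; a typical stack for an 8GB card, using only the calls shown above:
from hypergen import model

m = model.load(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype="float16",    # ~50% VRAM reduction
)
m.to("cuda")
m.enable_vae_slicing()        # ~10% reduction, negligible cost
m.enable_attention_slicing()  # ~15-20% reduction, ~5% slower
# Last resort if generation still runs out of memory (~20% slower):
# m.enable_model_cpu_offload()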

Troubleshooting

Common Issues

Error: CUDA out of memory during image generation. Solutions:
  1. Enable memory optimizations:
    m.enable_vae_slicing()
    m.enable_attention_slicing()
    
  2. Reduce image resolution:
    image = m.generate(prompt, height=768, width=768)
    
  3. Use float16 precision:
    m = model.load("stabilityai/stable-diffusion-xl-base-1.0", torch_dtype="float16")
    
  4. Generate fewer images:
    image = m.generate(prompt, num_images=1)
    
Error: CUDA out of memory during LoRA training. Solutions:
  1. Reduce LoRA rank:
    lora = m.train_lora(ds, rank=8, alpha=16)
    
  2. Use batch size 1:
    lora = m.train_lora(ds, batch_size=1)
    
  3. Use gradient accumulation:
    lora = m.train_lora(ds, batch_size=1, gradient_accumulation_steps=8)
    
  4. Use float16 precision:
    m = model.load("stabilityai/stable-diffusion-xl-base-1.0", torch_dtype="float16")
    
Issue: Generated images are low quality or don’t match the prompt. Solutions:
  1. Increase inference steps:
    image = m.generate(prompt, num_inference_steps=50)
    
  2. Adjust guidance scale:
    image = m.generate(prompt, guidance_scale=8.0)
    
  3. Improve prompt:
    prompt = "A detailed [subject], [style], highly detailed, 8k, professional"
    
  4. Use negative prompts:
    negative_prompt = "blurry, low quality, distorted"
    
Issue: Trained LoRA doesn’t work well. Solutions:
  1. Increase training steps:
    lora = m.train_lora(ds, steps=2000)
    
  2. Improve dataset:
    • Add more images
    • Improve caption quality
    • Use higher resolution images
    • Add more variety
  3. Adjust hyperparameters:
    lora = m.train_lora(
        ds,
        steps=1500,
        learning_rate=5e-5,
        rank=32,
        alpha=64,
    )
    
  4. Check earlier checkpoints:
    • Model might be overfitting
    • Try checkpoint-500 or checkpoint-1000
Issue: Image generation is very slow. Solutions:
  1. Reduce inference steps:
    image = m.generate(prompt, num_inference_steps=30)
    
  2. Use SDXL Turbo:
    m = model.load("stabilityai/sdxl-turbo")
    image = m.generate(prompt, num_inference_steps=4)
    
  3. Reduce resolution:
    image = m.generate(prompt, height=768, width=768)
    
  4. Disable CPU offload if enabled:
    m.disable_model_cpu_offload()
    

Example Workflows

Basic Image Generation

from hypergen import model

# Load SDXL
m = model.load("stabilityai/stable-diffusion-xl-base-1.0", torch_dtype="float16")
m.to("cuda")

# Generate image
image = m.generate(
    prompt="A serene mountain lake at sunset, photorealistic, highly detailed",
    negative_prompt="blurry, low quality",
    num_inference_steps=40,
    guidance_scale=7.5,
)

image[0].save("mountain_lake.png")

LoRA Training Pipeline

from hypergen import model, dataset

# Load model
m = model.load("stabilityai/stable-diffusion-xl-base-1.0", torch_dtype="float16")
m.to("cuda")

# Load and prepare dataset
ds = dataset.load("./training_images")

# Train LoRA with checkpoints
lora = m.train_lora(
    ds,
    steps=2000,
    learning_rate=1e-4,
    rank=16,
    alpha=32,
    batch_size=1,
    gradient_accumulation_steps=4,
    save_steps=500,
    output_dir="./lora_checkpoints"
)

print("Training complete!")
print("Checkpoints saved in ./lora_checkpoints/")

Batch Generation

from hypergen import model

# Load model
m = model.load("stabilityai/stable-diffusion-xl-base-1.0", torch_dtype="float16")
m.to("cuda")

# Define prompts
prompts = [
    "A majestic lion",
    "A serene landscape",
    "A futuristic city",
    "A beautiful flower",
]

# Generate multiple images
for i, prompt in enumerate(prompts):
    image = m.generate(
        prompt=f"{prompt}, highly detailed, 8k, professional",
        negative_prompt="blurry, low quality",
        num_inference_steps=40,
    )
    image[0].save(f"output_{i}.png")

print(f"Generated {len(prompts)} images")

Community Fine-Tunes

SDXL has thousands of community fine-tunes available. Here are some popular ones:
# Anime style
m = model.load("stablediffusionapi/anything-v5")

# Realistic photography
m = model.load("SG161222/RealVisXL_V4.0")

# Artistic style
m = model.load("RunDiffusion/Juggernaut-XL-v9")

# Product photography
m = model.load("playgroundai/playground-v2.5-1024px-aesthetic")
Browse HuggingFace’s SDXL models for more options.
