Stable Diffusion 3 Guide

Overview

Stable Diffusion 3 (SD3) is Stability AI’s latest-generation text-to-image model, built on a completely redesigned architecture. It improves significantly on SDXL in text rendering, prompt understanding, and image quality.

SD3 uses a new Multimodal Diffusion Transformer (MMDiT) architecture, making it fundamentally different from SDXL’s UNet-based approach.

Key Features

Text Rendering

Outstanding text generation:

Accurate spelling in images
Multiple text elements
Various fonts and styles
Proper text integration

Prompt Understanding

Improved comprehension:

Better complex prompt handling
More accurate composition
Better spatial relationships
Fewer artifacts

Image Quality

Enhanced visuals:

Better detail preservation
Improved colors and lighting
More coherent compositions
Reduced common artifacts

Efficiency

Optimized performance:

Moderately higher VRAM than SDXL (~12GB vs ~9GB for generation)
Competitive speed
Better quality per step
Efficient training

Model Variants

SD3 Medium

The primary SD3 model optimized for quality and accessibility.

from hypergen import model

m = model.load("stabilityai/stable-diffusion-3-medium-diffusers")
m.to("cuda")

Specifications:

Parameters: 2B (transformer), 8B total with text encoders
Resolution: Native 1024x1024
VRAM: 12GB minimum, 16GB recommended
Architecture: Multimodal Diffusion Transformer (MMDiT)

SD3 Large (Coming Soon)

A larger variant with enhanced capabilities.

SD3 Large is not yet publicly available. Check the Stability AI website for release information.

What’s New in SD3

Architectural Changes

SD3 introduces several key differences from SDXL:

MMDiT Architecture
Text Encoders
Rectified Flow

Multimodal Diffusion Transformer:

Replaces UNet with transformer architecture
Processes text and image jointly
Better cross-modal understanding
More efficient attention mechanisms

Impact:

Superior text rendering
Better prompt comprehension
More coherent compositions
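
To make "processes text and image jointly" concrete, the sketch below counts how many tokens would enter one joint attention pass. The 8x VAE downsampling, 2x2 patch size, and 77 + 256 text-token lengths are assumptions based on commonly reported SD3 Medium defaults, not values stated in this guide, and `joint_sequence_length` is a hypothetical helper.

```python
def joint_sequence_length(height, width, vae_factor=8, patch_size=2,
                          clip_tokens=77, t5_tokens=256):
    """Return (image_tokens, text_tokens, total) for one joint attention pass.

    All default values are assumptions for illustration only.
    """
    block = vae_factor * patch_size
    if height % block or width % block:
        raise ValueError(f"height and width must be multiples of {block}")
    latent_h = height // vae_factor          # latent grid after the VAE
    latent_w = width // vae_factor
    image_tokens = (latent_h // patch_size) * (latent_w // patch_size)
    text_tokens = clip_tokens + t5_tokens    # text sequences concatenated
    return image_tokens, text_tokens, image_tokens + text_tokens


print(joint_sequence_length(1024, 1024))  # (4096, 333, 4429)
print(joint_sequence_length(768, 768))    # (2304, 333, 2637)
```

Under these assumptions, image tokens dominate the sequence, which is why lowering the resolution is an effective way to cut attention cost.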

Improvements Over SDXL

Feature	SDXL	SD3
Text rendering	Poor	Excellent
Complex prompts	Good	Excellent
Spatial understanding	Good	Better
Architecture	UNet	Transformer
Text encoders	2 (CLIP)	3 (CLIP + T5)
Artifacts	Occasional	Fewer

Loading SD3 with HyperGen

Basic Loading

from hypergen import model

# Load SD3 Medium
m = model.load("stabilityai/stable-diffusion-3-medium-diffusers")
m.to("cuda")

# Generate an image
image = m.generate("A cat holding a sign that says 'HELLO WORLD'")
image[0].save("output.png")

SD3 excels at generating text in images. Try prompts that include signs, labels, or text elements!

Optimized Loading

For better performance:

from hypergen import model

# Load with fp16 or bf16 precision
m = model.load(
    "stabilityai/stable-diffusion-3-medium-diffusers",
    torch_dtype="float16",  # or "bfloat16"
    variant="fp16",
    use_safetensors=True,
)
m.to("cuda")

Memory-Optimized Loading

For 12GB VRAM GPUs:

from hypergen import model

m = model.load(
    "stabilityai/stable-diffusion-3-medium-diffusers",
    torch_dtype="float16"
)
m.to("cuda")

# Enable memory optimizations
m.enable_vae_slicing()
m.enable_attention_slicing()

Training LoRAs with SD3

SD3 supports LoRA training with HyperGen’s optimized pipeline.

Basic LoRA Training

from hypergen import model, dataset

# Load model
m = model.load(
    "stabilityai/stable-diffusion-3-medium-diffusers",
    torch_dtype="float16"
)
m.to("cuda")

# Load dataset
ds = dataset.load("./my_images")

# Train LoRA
lora = m.train_lora(
    ds,
    steps=1000,
    rank=16,
    alpha=32,
    learning_rate=1e-4,
)

Recommended Training Parameters

Quick Training (12GB VRAM)
Balanced Training (16GB VRAM)
High-Quality Training (24GB VRAM)

Quick Training (12GB VRAM)

For fast iteration:

lora = m.train_lora(
    ds,
    steps=800,
    learning_rate=1e-4,
    rank=16,
    alpha=32,
    batch_size=1,
    gradient_accumulation_steps=4,
)

Settings:

Standard rank (16)
Works on 12GB VRAM
Training time: ~12 minutes (50 images)

Balanced Training (16GB VRAM)

For most use cases:

lora = m.train_lora(
    ds,
    steps=1000,
    learning_rate=1e-4,
    rank=32,
    alpha=64,
    batch_size=1,
    gradient_accumulation_steps=4,
)

Settings:

Higher rank (32) for better capacity
1000 steps for quality results
Training time: ~15 minutes (50 images)

High-Quality Training (24GB VRAM)

For best results:

lora = m.train_lora(
    ds,
    steps=2000,
    learning_rate=5e-5,
    rank=64,
    alpha=128,
    batch_size=2,
    gradient_accumulation_steps=4,
    save_steps=500,
    output_dir="./sd3_lora_checkpoints"
)

Settings:

High rank (64) for maximum capacity
More steps for best quality
Training time: ~35 minutes (50 images)
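
The three configurations differ in how many images each run actually consumes. The sketch below works that out, assuming `steps` counts optimizer updates (this guide does not say whether HyperGen counts updates or micro-batches); `training_coverage` is a hypothetical helper.

```python
def training_coverage(steps, batch_size, gradient_accumulation_steps, num_images):
    """Return (effective_batch, images_seen, passes_over_dataset).

    Assumption: `steps` counts optimizer updates, so every step consumes
    batch_size * gradient_accumulation_steps images.
    """
    effective_batch = batch_size * gradient_accumulation_steps
    images_seen = steps * effective_batch
    return effective_batch, images_seen, images_seen / num_images


# (steps, batch_size, gradient_accumulation_steps) from the three configurations
configs = {
    "quick": (800, 1, 4),
    "balanced": (1000, 1, 4),
    "high-quality": (2000, 2, 4),
}
for name, (steps, batch, accum) in configs.items():
    eff, seen, passes = training_coverage(steps, batch, accum, num_images=50)
    print(f"{name}: effective batch {eff}, {seen} images seen, {passes:.0f} passes")
# quick: effective batch 4, 3200 images seen, 64 passes
# balanced: effective batch 4, 4000 images seen, 80 passes
# high-quality: effective batch 8, 16000 images seen, 320 passes
```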

Training for Different Use Cases

Style Transfer

Learning an artistic style:

lora = m.train_lora(
    ds,
    steps=1500,
    learning_rate=1e-4,
    rank=24,
    alpha=48,
    batch_size=1,
    gradient_accumulation_steps=4,
)

Dataset:

50-200 images in target style
Consistent aesthetic
High resolution (1024x1024+)
Captions describing content

Subject/Character

Learning a specific subject:

lora = m.train_lora(
    ds,
    steps=1200,
    learning_rate=5e-5,
    rank=32,
    alpha=64,
    batch_size=1,
    gradient_accumulation_steps=4,
)

Dataset:

20-100 images of subject
Variety of poses and angles
Detailed captions
Different lighting conditions

Text Rendering Style

Learning text rendering:

lora = m.train_lora(
    ds,
    steps=1000,
    learning_rate=1e-4,
    rank=32,
    alpha=64,
    batch_size=1,
    gradient_accumulation_steps=4,
)

Dataset:

Images with various text elements
Different fonts and styles
Captions describing the text content
Variety of text placements

Inference Parameters

Basic Generation

image = m.generate(
    prompt="A neon sign that says 'Open 24/7' on a brick wall",
    negative_prompt="blurry, low quality",
    num_inference_steps=40,
    guidance_scale=7.0,
    height=1024,
    width=1024,
)

Parameter Guide

prompt (str, required)

Text description of the desired image. SD3 excels at complex, detailed prompts with multiple elements.

negative_prompt (str, default: "")

What to avoid in the generated image.

Recommended:

negative_prompt="blurry, low quality, distorted"

num_inference_steps (int, default: 28)

Number of denoising steps. SD3’s default is 28 steps (vs 50 for SDXL):

15-20: Fast, good quality
28-40: Better quality (recommended)
40-50: Highest quality

guidance_scale (float, default: 7)

How closely to follow the prompt. SD3 uses slightly lower guidance than SDXL:

5-6: More creative
7-8: Balanced (recommended)
9-10: Very literal
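
The effect of `guidance_scale` is easiest to see in the usual classifier-free guidance formula. The sketch below assumes HyperGen follows that standard formulation, which this guide does not confirm; `apply_guidance` is a hypothetical helper operating on plain lists instead of tensors.

```python
def apply_guidance(uncond, cond, guidance_scale):
    """Standard classifier-free guidance, applied element-wise.

    result = uncond + guidance_scale * (cond - uncond)
    """
    return [u + guidance_scale * (c - u) for u, c in zip(uncond, cond)]


uncond = [0.0, 1.0]   # prediction without the prompt
cond = [1.0, 3.0]     # prediction with the prompt

print(apply_guidance(uncond, cond, 1.0))  # [1.0, 3.0]
print(apply_guidance(uncond, cond, 7.0))  # [7.0, 15.0]
```

A scale of 1.0 reproduces the prompt-conditioned prediction; larger values push the result further along the direction the prompt adds, which is why high values read as "very literal".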

Recommended Settings

Speed Priority

num_inference_steps=20
guidance_scale=6.5

Time: ~2.5s (RTX 4090)

Balanced

num_inference_steps=28
guidance_scale=7.0

Time: ~3.5s (RTX 4090)

Quality Priority

num_inference_steps=40
guidance_scale=7.5

Time: ~5s (RTX 4090)
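
The three presets can be kept in one place and expanded into keyword arguments. This is a minimal sketch: `PRESETS` and `generation_kwargs` are hypothetical names, and only the step and guidance values come from this guide.

```python
PRESETS = {
    "speed": {"num_inference_steps": 20, "guidance_scale": 6.5},
    "balanced": {"num_inference_steps": 28, "guidance_scale": 7.0},
    "quality": {"num_inference_steps": 40, "guidance_scale": 7.5},
}


def generation_kwargs(preset="balanced", **overrides):
    """Return a copy of the preset, with any explicit overrides applied."""
    if preset not in PRESETS:
        raise ValueError(f"unknown preset {preset!r}; choose from {sorted(PRESETS)}")
    kwargs = dict(PRESETS[preset])  # copy so the shared preset is never mutated
    kwargs.update(overrides)
    return kwargs


print(generation_kwargs("speed"))
# {'num_inference_steps': 20, 'guidance_scale': 6.5}
```

The result can then be passed along with a prompt, for example `m.generate(prompt=prompt, **generation_kwargs("quality"))`.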

Text Generation in Images

SD3’s standout feature is accurate text rendering:

# Simple text
image = m.generate(
    prompt='A coffee shop sign that says "FRESH COFFEE" in bold letters',
    num_inference_steps=28,
)

# Multiple text elements
image = m.generate(
    prompt='''
    A street scene with multiple signs: a red "STOP" sign,
    a blue "Main Street" street sign, and a neon "CAFE" sign
    ''',
    num_inference_steps=40,
)

# Stylized text
image = m.generate(
    prompt='A vintage poster with "SUMMER SALE" in retro typography',
    num_inference_steps=28,
)

For best text results:

Put text in quotes
Describe the text style (bold, neon, handwritten, etc.)
Specify the object containing the text (sign, poster, label)
Keep text relatively short (1-5 words)
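
These tips can be folded into a small prompt builder. The sketch below is illustrative only; `text_prompt` is a hypothetical helper, and the five-word limit mirrors the guideline above.

```python
def text_prompt(text, container, style=None, max_words=5):
    """Build a prompt that quotes the text, names its container, and adds a style."""
    words = text.split()
    if not words:
        raise ValueError("text must not be empty")
    if len(words) > max_words:
        raise ValueError(f"text has {len(words)} words; keep it to {max_words} or fewer")
    if '"' in text:
        raise ValueError("text must not contain double quotes")
    style_part = f" in {style}" if style else ""
    return f'{container} that says "{text}"{style_part}'


print(text_prompt("FRESH COFFEE", "A coffee shop sign", style="bold letters"))
# A coffee shop sign that says "FRESH COFFEE" in bold letters
```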

Performance Benchmarks

Generation Performance

Based on NVIDIA RTX 4090, 1024x1024 resolution:

Steps	VRAM	Time	Quality
20	~12GB	~2.5s	Good
28	~12GB	~3.5s	Excellent
40	~12GB	~5s	Excellent+
50	~12GB	~6s	Outstanding
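
The table is close to linear in the number of steps, so a rough per-step cost can be derived from it. The sketch below uses only the approximate RTX 4090 figures from the table and ignores any fixed overhead; `seconds_per_step` and `estimate_seconds` are hypothetical helpers.

```python
# steps -> approximate seconds, taken from the table above (RTX 4090, 1024x1024)
BENCHMARK = {20: 2.5, 28: 3.5, 40: 5.0, 50: 6.0}


def seconds_per_step(benchmark=BENCHMARK):
    """Average cost of one denoising step across all benchmark rows."""
    return sum(benchmark.values()) / sum(benchmark)


def estimate_seconds(steps, benchmark=BENCHMARK):
    """Linear estimate of generation time for an arbitrary step count."""
    return steps * seconds_per_step(benchmark)


print(f"{seconds_per_step():.3f} s/step")        # 0.123 s/step
print(f"{estimate_seconds(35):.1f} s at 35 steps")  # 4.3 s at 35 steps
```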

Training Performance

LoRA training on RTX 4090, 50 images:

Configuration	VRAM	Time (1000 steps)
Rank 16, Batch 1	~12GB	~15 min
Rank 32, Batch 1	~14GB	~18 min
Rank 64, Batch 1	~16GB	~22 min
Rank 32, Batch 2	~18GB	~20 min

Comparison with SDXL

Metric	SDXL	SD3
Generation time (28 steps)	~3s	~3.5s
VRAM (generation)	~9GB	~12GB
VRAM (training, rank 16)	~9GB	~12GB
Text rendering	Poor	Excellent
Overall quality	Excellent	Excellent+

Best Practices

Prompt Engineering for SD3

Complex Compositions
Text in Images
Detailed Descriptions
Spatial Relationships

Complex Compositions

SD3 excels at complex scenes.

Good:

prompt = """
A bustling farmer's market with wooden stalls selling
fresh vegetables, a red awning overhead, people shopping,
warm afternoon sunlight, photorealistic
"""

SD3 better understands spatial relationships and multiple elements.

Text in Images

Leverage SD3’s text capabilities.

Effective:

# Single text element
prompt = 'A storefront with "BAKERY" written in gold letters'

# Multiple text elements
prompt = '''
A cookbook cover with the title "Easy Recipes" at the top
and "By Jane Smith" at the bottom
'''

# Stylized text
prompt = 'A neon sign saying "OPEN" in bright blue light'

Less effective:

# Very long text
prompt = 'A sign with a full paragraph of text...'

# Complex formatting
prompt = 'A document with multiple columns and tables...'

Detailed Descriptions

Use rich, detailed prompts.

Detailed:

prompt = """
A cozy coffee shop interior with wooden tables, hanging
pendant lights, exposed brick walls, a barista behind
the counter, warm lighting, morning atmosphere
"""

Too simple:

prompt = "A coffee shop"

SD3’s improved understanding benefits from detailed descriptions.

Spatial Relationships

Describe positions precisely.

Clear spatial description:

prompt = """
A red apple on the left, a green pear in the center,
and an orange on the right, all on a white table
"""

SD3 better understands “left”, “right”, “above”, “below”, “between”, etc.

Training Best Practices

Dataset Preparation

Prepare high-quality data:

Use 1024x1024 or higher resolution
20-150 images for most use cases
Ensure consistent quality
Remove duplicates
Include variety in poses/angles
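
Removing duplicates can be partly automated. The sketch below finds byte-identical image files using only the standard library; it does not catch resized or re-encoded copies, which would need a perceptual hash. `find_exact_duplicates` and the suffix list are hypothetical, not part of HyperGen.

```python
import hashlib
from pathlib import Path

IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg", ".webp"}


def find_exact_duplicates(folder):
    """Return groups of file names in `folder` whose contents are byte-identical."""
    by_digest = {}
    for path in sorted(Path(folder).iterdir()):
        if path.is_file() and path.suffix.lower() in IMAGE_SUFFIXES:
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            by_digest.setdefault(digest, []).append(path.name)
    # Keep only digests shared by more than one file
    return [names for names in by_digest.values() if len(names) > 1]
```

Calling `find_exact_duplicates("./my_images")` returns a list of groups such as `[["a.png", "b.png"]]`; keep one file from each group.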

Caption Quality

Write effective captions.

Good caption:

A person wearing a blue jacket and jeans, standing
in front of a brick wall, natural daylight from the
left, neutral expression, looking at camera

Poor caption:

person

Tips for SD3:

Describe spatial relationships
Include text content if present
Describe lighting and colors
Be detailed but natural

Hyperparameter Selection

Start with recommended settings:

steps=1000
learning_rate=1e-4
rank=16-32
alpha=2*rank
batch_size=1
gradient_accumulation_steps=4

Adjust based on results and VRAM.
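
The `alpha=2*rank` rule keeps the adapter's scaling constant while the rank changes its capacity. The sketch below shows both effects under the standard LoRA formulation (scale = alpha / rank); whether HyperGen applies exactly this scaling is an assumption, and the 1536-wide layer is an illustrative size, not a value from this guide.

```python
def lora_scale(rank, alpha):
    """Scaling factor applied to the low-rank update in standard LoRA."""
    return alpha / rank


def lora_trainable_params(d_in, d_out, rank):
    """Parameters added to one linear layer: A is rank x d_in, B is d_out x rank."""
    return rank * (d_in + d_out)


for rank in (16, 32, 64):
    alpha = 2 * rank
    print(rank, lora_scale(rank, alpha), lora_trainable_params(1536, 1536, rank))
# 16 2.0 49152
# 32 2.0 98304
# 64 2.0 196608
```

Doubling the rank doubles the trainable parameters per layer, which matches the rising VRAM figures in the training benchmarks.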

Monitoring Progress

Save and test checkpoints:

lora = m.train_lora(
    ds,
    steps=2000,
    save_steps=500,
    output_dir="./checkpoints"
)

Test multiple checkpoints to find the optimal stopping point.
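
To plan which checkpoints to compare, it helps to list the steps at which one would be written. The sketch assumes a checkpoint is saved every `save_steps` steps, including the final step when it is a multiple; this guide does not document HyperGen's exact behavior or file naming. `checkpoint_steps` is a hypothetical helper.

```python
def checkpoint_steps(steps, save_steps):
    """List the training steps at which a checkpoint would be written."""
    if save_steps <= 0:
        raise ValueError("save_steps must be positive")
    return list(range(save_steps, steps + 1, save_steps))


print(checkpoint_steps(2000, 500))  # [500, 1000, 1500, 2000]
print(checkpoint_steps(1200, 400))  # [400, 800, 1200]
```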

Memory Optimization

VAE Slicing

m.enable_vae_slicing()

Reduces VRAM by ~10-15%
Minimal performance impact
Recommended for all users

Attention Slicing

m.enable_attention_slicing()

Reduces VRAM by ~15-20%
Small performance impact
Useful for 12GB GPUs

Lower Precision

m = model.load(
    "stabilityai/stable-diffusion-3-medium-diffusers",
    torch_dtype="float16"
)

Reduces VRAM by ~40-50%
Minimal quality impact
Strongly recommended
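
The saving follows from simple arithmetic on weight storage. The sketch below covers the transformer weights only, using the 2B parameter count from the specifications; activations, text encoders, and the VAE are excluded, and it assumes the alternative to float16 would be float32.

```python
BYTES_PER_PARAM = {"float32": 4, "float16": 2, "bfloat16": 2}


def weight_gib(num_params, dtype):
    """Memory needed to store `num_params` weights in the given dtype, in GiB."""
    return num_params * BYTES_PER_PARAM[dtype] / 1024**3


transformer_params = 2_000_000_000  # "2B (transformer)" from the specifications
for dtype in ("float32", "float16"):
    print(f"{dtype}: {weight_gib(transformer_params, dtype):.2f} GiB")
# float32: 7.45 GiB
# float16: 3.73 GiB
```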

Troubleshooting

Out of Memory (Generation)

Solutions:

Use float16 precision:

m = model.load(
    "stabilityai/stable-diffusion-3-medium-diffusers",
    torch_dtype="float16"
)

Enable memory optimizations:

m.enable_vae_slicing()
m.enable_attention_slicing()

Reduce resolution:

image = m.generate(prompt, height=768, width=768)

Out of Memory (Training)

Solutions:

Reduce LoRA rank:

lora = m.train_lora(ds, rank=16, alpha=32)

Use gradient accumulation:

lora = m.train_lora(ds, batch_size=1, gradient_accumulation_steps=8)

Use float16:

m = model.load(
    "stabilityai/stable-diffusion-3-medium-diffusers",
    torch_dtype="float16"
)

Text Not Rendering Correctly

Tips for better text:

Put text in quotes:
```
prompt = 'A sign that says "OPEN"'
```

Describe the text container:

prompt = 'A wooden sign with "WELCOME" carved in it'

Keep text short (1-5 words)

Increase inference steps:

image = m.generate(prompt, num_inference_steps=40)

Adjust guidance:

image = m.generate(prompt, guidance_scale=8.0)

Poor Training Results

Solutions:

Increase training steps:
```
lora = m.train_lora(ds, steps=1500)
```

Improve dataset quality:

- Add more images
- Write better captions
- Use higher resolution

Adjust learning rate:

lora = m.train_lora(ds, learning_rate=5e-5)

Increase LoRA rank:

lora = m.train_lora(ds, rank=32, alpha=64)

Example Workflows

Text-Rich Image Generation

from hypergen import model

# Load SD3
m = model.load(
    "stabilityai/stable-diffusion-3-medium-diffusers",
    torch_dtype="float16"
)
m.to("cuda")

# Generate image with text
image = m.generate(
    prompt='''
    A vintage travel poster with "VISIT PARIS" in bold art deco
    letters at the top, the Eiffel Tower in the background,
    warm sunset colors
    ''',
    num_inference_steps=28,
    guidance_scale=7.0,
)

image[0].save("paris_poster.png")

LoRA Training for Character

from hypergen import model, dataset

# Load model
m = model.load(
    "stabilityai/stable-diffusion-3-medium-diffusers",
    torch_dtype="float16"
)
m.to("cuda")

# Load character dataset
ds = dataset.load("./character_images")

# Train character LoRA
lora = m.train_lora(
    ds,
    steps=1200,
    learning_rate=5e-5,
    rank=32,
    alpha=64,
    batch_size=1,
    gradient_accumulation_steps=4,
    save_steps=400,
    output_dir="./character_lora"
)

print("Character LoRA training complete!")

Batch Generation with SD3

from hypergen import model

# Load SD3
m = model.load(
    "stabilityai/stable-diffusion-3-medium-diffusers",
    torch_dtype="float16"
)
m.to("cuda")

# Generate multiple images
prompts = [
    'A storefront with "BOOKS" written above the door',
    'A coffee cup with "MORNING" printed on it',
    'A street sign that says "MAIN ST"',
    'A neon sign saying "PIZZA" in red letters',
]

for i, prompt in enumerate(prompts):
    image = m.generate(
        prompt=prompt,
        num_inference_steps=28,
        guidance_scale=7.0,
    )
    image[0].save(f"text_image_{i}.png")

print(f"Generated {len(prompts)} images with text")

SD3 vs SDXL: When to Use Which

Use SD3 When

SD3 is better for:

Text in images (signs, labels, posters)
Complex compositions with multiple elements
Precise spatial relationships
Detailed scene understanding
Latest technology and improvements

Example use cases:

Product mockups with labels
Signage and branding
Posters and advertisements
Complex scene compositions

GPU Requirements

Minimum

VRAM: 12GB

GPUs:

RTX 3060 (12GB)
RTX 4070

Capabilities:

Generation: 1024x1024
Training: Rank 16
Batch size: 1

Optimal

VRAM: 24GB+

GPUs:

RTX 4090
A100
H100

Capabilities:

Generation: Up to 2048x2048
Training: Rank 64+
Batch size: 2-4
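
The VRAM tiers can be mapped to the training configurations from Recommended Training Parameters. This is a minimal sketch: `TRAINING_PRESETS` and `pick_training_preset` are hypothetical names, and the thresholds are the 12GB, 16GB, and 24GB figures quoted in this guide.

```python
# (minimum VRAM in GB, preset name, train_lora keyword arguments from this guide)
TRAINING_PRESETS = [
    (24, "high-quality", {"steps": 2000, "learning_rate": 5e-5, "rank": 64,
                          "alpha": 128, "batch_size": 2,
                          "gradient_accumulation_steps": 4}),
    (16, "balanced", {"steps": 1000, "learning_rate": 1e-4, "rank": 32,
                      "alpha": 64, "batch_size": 1,
                      "gradient_accumulation_steps": 4}),
    (12, "quick", {"steps": 800, "learning_rate": 1e-4, "rank": 16,
                   "alpha": 32, "batch_size": 1,
                   "gradient_accumulation_steps": 4}),
]


def pick_training_preset(vram_gb):
    """Return (name, kwargs) for the largest preset that fits in `vram_gb`."""
    for min_vram, name, kwargs in TRAINING_PRESETS:  # ordered largest first
        if vram_gb >= min_vram:
            return name, dict(kwargs)
    raise ValueError(f"{vram_gb}GB is below the 12GB minimum for SD3 training")


name, kwargs = pick_training_preset(16)
print(name, kwargs["rank"])  # balanced 32
```

The returned dictionary could be passed as `m.train_lora(ds, **kwargs)`. On a machine with PyTorch, the available VRAM in bytes can be read from `torch.cuda.get_device_properties(0).total_memory`.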

Next Steps

Training Guide

Complete LoRA training documentation

Dataset Preparation

Learn how to prepare training data

SDXL Guide

Compare with SDXL

FLUX.1 Guide

Explore the latest models
