
Overview

FLUX.1 is Black Forest Labs’ state-of-the-art text-to-image diffusion model. It delivers exceptional image quality, strong prompt adherence, and fine detail in generated images.

FLUX.1 Dev

Best for: Production use, highest quality
  • Superior image quality
  • Excellent prompt following
  • Requires 16GB+ VRAM

FLUX.1 Schnell

Best for: Fast iteration, prototyping
  • Optimized for speed (1-4 steps)
  • Good quality/speed tradeoff
  • Requires 12GB+ VRAM

Model Variants

FLUX.1 Dev

The development variant optimized for the highest quality outputs.
from hypergen import model

m = model.load("black-forest-labs/FLUX.1-dev", torch_dtype="bfloat16")
m.to("cuda")
Key Features:
  • Quality: State-of-the-art image generation
  • Prompt Following: Excellent text comprehension
  • License: Non-commercial (requires license for commercial use)
  • VRAM: 16GB+ recommended

FLUX.1 Schnell

The “schnell” (fast) variant optimized for rapid generation.
from hypergen import model

m = model.load("black-forest-labs/FLUX.1-schnell", torch_dtype="bfloat16")
m.to("cuda")
Key Features:
  • Speed: 3-4x faster than Dev
  • Quality: Excellent (slightly below Dev)
  • License: Apache 2.0 (permissive, commercial-friendly)
  • VRAM: 12GB+ recommended

Loading FLUX.1 with HyperGen

Basic Loading

from hypergen import model

# Load FLUX.1 Dev
m = model.load("black-forest-labs/FLUX.1-dev", torch_dtype="bfloat16")
m.to("cuda")

# Generate an image
image = m.generate("A serene mountain landscape at sunset")
image[0].save("output.png")
Always use bfloat16 dtype with FLUX.1 for optimal quality and memory efficiency. The model was trained with bfloat16 precision.

Advanced Loading Options

from hypergen import model

# Load with custom configuration
m = model.load(
    "black-forest-labs/FLUX.1-dev",
    torch_dtype="bfloat16",
    variant="bf16",              # Use bfloat16 weights
    use_safetensors=True,        # Use safetensors format
)
m.to("cuda")

# Enable memory optimizations (for lower VRAM)
m.enable_model_cpu_offload()    # Offload to CPU when not in use
m.enable_vae_slicing()           # Process VAE in slices

Low VRAM Configuration

For systems with limited VRAM (12GB):
from hypergen import model

m = model.load("black-forest-labs/FLUX.1-schnell", torch_dtype="bfloat16")
m.to("cuda")

# Enable memory optimizations
m.enable_model_cpu_offload()
m.enable_vae_slicing()
m.enable_attention_slicing()

Training LoRAs with FLUX.1

FLUX.1 supports efficient LoRA fine-tuning with HyperGen’s optimized training pipeline.

Basic LoRA Training

from hypergen import model, dataset

# Load model
m = model.load("black-forest-labs/FLUX.1-dev", torch_dtype="bfloat16")
m.to("cuda")

# Load training data
ds = dataset.load("./my_training_images")

# Train LoRA
lora = m.train_lora(
    ds,
    steps=1500,
    rank=32,
    alpha=64,
    learning_rate=5e-5,
)
Goal: learn an artistic style or aesthetic. For this use case, a configuration like the following works well:
lora = m.train_lora(
    ds,
    steps=1500,
    learning_rate=5e-5,
    rank=32,
    alpha=64,
    batch_size=1,
    gradient_accumulation_steps=4,
)
Dataset Requirements:
  • 50-200 images
  • Consistent style across images
  • Captions describing content, not style
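
A dataset that meets these requirements is simply a folder of images with matching caption files, loaded the same way as in the training examples on this page. A minimal sketch (the folder path and file names are illustrative):
from hypergen import dataset

# Assumed layout: one .txt caption per image, same base name
#   my_style_images/
#     painting_001.png
#     painting_001.txt   # "a watercolor painting of a lighthouse at dusk"
#     painting_002.png
#     painting_002.txt
ds = dataset.load("./my_style_images", caption_extension=".txt")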

Memory-Optimized Training

For 16GB VRAM GPUs:
lora = m.train_lora(
    ds,
    steps=1500,
    learning_rate=5e-5,
    rank=32,                        # Medium capacity
    alpha=64,
    batch_size=1,                   # Single image per step
    gradient_accumulation_steps=8,  # Simulate batch_size=8
    save_steps=500,
    output_dir="./flux_lora_checkpoints"
)

High-Quality Training

For 24GB+ VRAM GPUs:
lora = m.train_lora(
    ds,
    steps=2500,
    learning_rate=4e-5,
    rank=64,                        # High capacity
    alpha=128,
    batch_size=2,                   # Process 2 images at once
    gradient_accumulation_steps=4,  # Simulate batch_size=8
    save_steps=500,
    output_dir="./flux_lora_checkpoints"
)

Inference Parameters

Generation Settings

image = m.generate(
    prompt="A photo of a cat wearing a space suit on Mars",
    num_inference_steps=50,      # 20-50 for Dev, 1-4 for Schnell
    guidance_scale=7.5,          # Prompt adherence strength
    height=1024,                 # Image height
    width=1024,                  # Image width
    num_images=4,                # Generate 4 images
    seed=42,                     # For reproducibility
)

FLUX.1 Dev

Quality Priority:
num_inference_steps=50
guidance_scale=7.5
Balanced:
num_inference_steps=30
guidance_scale=7.0
Speed Priority:
num_inference_steps=20
guidance_scale=6.5

FLUX.1 Schnell

Best Settings:
num_inference_steps=4
guidance_scale=0.0  # Schnell doesn't use CFG
Alternative:
num_inference_steps=2
guidance_scale=0.0
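
These presets plug directly into m.generate. A quick sketch applying the Dev "Balanced" and Schnell "Best" presets (the prompt is only an example, and each call assumes the matching variant is loaded):
# With FLUX.1 Dev loaded: balanced quality/speed preset
image = m.generate(
    "A lighthouse on a rocky coast at dawn",
    num_inference_steps=30,
    guidance_scale=7.0,
)
image[0].save("dev_balanced.png")

# With FLUX.1 Schnell loaded: best settings (Schnell ignores CFG)
image = m.generate(
    "A lighthouse on a rocky coast at dawn",
    num_inference_steps=4,
    guidance_scale=0.0,
)
image[0].save("schnell_best.png")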

Advanced Generation

# Generate with custom scheduler
from diffusers import DPMSolverMultistepScheduler

m.scheduler = DPMSolverMultistepScheduler.from_config(m.scheduler.config)

image = m.generate(
    prompt="A futuristic cityscape at night",
    negative_prompt="blurry, low quality, distorted",
    num_inference_steps=30,
    guidance_scale=7.5,
    height=1024,
    width=1024,
)

Performance Benchmarks

Generation Benchmarks

Based on NVIDIA RTX 4090, 1024x1024 images:
Variant          Steps   VRAM Used   Time     Quality
FLUX.1 Dev       50      ~18GB       ~8s      Outstanding
FLUX.1 Dev       30      ~18GB       ~5s      Excellent
FLUX.1 Dev       20      ~18GB       ~3.5s    Very Good
FLUX.1 Schnell   4       ~16GB       ~1.5s    Excellent
FLUX.1 Schnell   2       ~16GB       ~1s      Very Good

Training Benchmarks

LoRA training on RTX 4090, 50 images, rank 32:
Configuration    VRAM    Time (1000 steps)   Time (2000 steps)
Batch 1, GA 1    ~16GB   ~20 min             ~40 min
Batch 1, GA 4    ~16GB   ~22 min             ~44 min
Batch 1, GA 8    ~16GB   ~25 min             ~50 min
Batch 2, GA 4    ~22GB   ~28 min             ~56 min
GA = gradient accumulation steps; the effective batch size is batch_size × gradient_accumulation_steps (e.g. 1 × 8 = 8). Higher values slightly increase training time but improve quality.

GPU Requirements

Minimum (Schnell)

VRAM: 12GB
Examples:
  • RTX 3060 (12GB)
  • RTX 4070
  • A10
Settings:
  • Enable optimizations
  • Batch size 1
  • Rank 16-32 (a combined sketch follows below)
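
Putting the minimum-spec settings together, here is a sketch for a 12GB card using FLUX.1 Schnell with the memory optimizations shown earlier (the dataset path and LoRA values are illustrative starting points, not hard requirements):
from hypergen import model, dataset

# Minimum configuration: Schnell + all memory optimizations
m = model.load("black-forest-labs/FLUX.1-schnell", torch_dtype="bfloat16")
m.to("cuda")
m.enable_model_cpu_offload()
m.enable_vae_slicing()
m.enable_attention_slicing()

# Low-capacity LoRA (rank 16) to stay within 12GB
ds = dataset.load("./my_training_images")
lora = m.train_lora(ds, steps=1500, learning_rate=5e-5, rank=16, alpha=32, batch_size=1)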

Recommended (Dev)

VRAM: 16GB
Examples:
  • RTX 4080
  • RTX 4090
  • A100 (40GB)
Settings:
  • Standard settings
  • Batch size 1-2
  • Rank 32-64

Optimal (Dev)

VRAM: 24GB+
Examples:
  • RTX 4090
  • A100 (40GB)
  • H100
Settings:
  • Maximum quality
  • Batch size 2-4
  • Rank 64-128

Best Practices

Prompt Engineering

FLUX.1 has excellent prompt comprehension. Here are tips for best results:
Good prompt structure:
[Subject] [Action/Pose] [Environment] [Lighting] [Style] [Quality]
Example:
prompt = """
A majestic red fox sitting on a moss-covered rock in a misty forest,
soft morning light filtering through the trees, photorealistic style,
highly detailed, 8k quality
"""

Training Best Practices

Step 1: Dataset Preparation

Quality over quantity:
  • Use high-resolution images (1024x1024 or higher)
  • Ensure consistent quality across dataset
  • 20-150 images is usually sufficient
  • Remove duplicates and near-duplicates

Step 2: Caption Quality

Write descriptive captions:
  • Describe what you see, not what you want to learn
  • Include details about composition, lighting, colors
  • Be consistent in caption style
  • Use natural language
Example:
A close-up portrait of a person wearing a red jacket,
standing in front of a blue wall, soft natural lighting
from the left, neutral expression

Step 3: Hyperparameter Tuning

Start with defaults, then adjust:
  1. Begin with recommended settings
  2. If underfitting (not learning), increase:
    • Training steps
    • LoRA rank
    • Learning rate (carefully)
  3. If overfitting (memorizing the training data):
    • Decrease training steps
    • Decrease LoRA rank
    • Add more training images
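
As a concrete illustration of the adjustment direction, here is a sketch of an "underfitting" retry that raises steps and rank while keeping alpha at 2x rank (the exact values are illustrative):
# Underfitting: give the LoRA more steps and more capacity
lora = m.train_lora(
    ds,
    steps=2500,          # up from 1500
    rank=64,             # up from 32
    alpha=128,           # keep alpha = 2x rank
    learning_rate=5e-5,  # raise cautiously (e.g. 6e-5) only if still underfitting
)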

Step 4: Monitor Training

Save checkpoints regularly:
lora = m.train_lora(
    ds,
    steps=2000,
    save_steps=500,  # Save every 500 steps
    output_dir="./checkpoints"
)
Test different checkpoints to find the best one.
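
How a saved checkpoint is applied for a test render depends on HyperGen's LoRA-loading API, which is not shown on this page. The loop below assumes a hypothetical m.load_lora(path) method and a checkpoint-<step> naming scheme; adapt both to the actual interface:
# Hypothetical comparison loop: m.load_lora() and the checkpoint paths are assumed
for step in (500, 1000, 1500, 2000):
    m.load_lora(f"./checkpoints/checkpoint-{step}")
    image = m.generate("A test prompt in the trained style", seed=42)
    image[0].save(f"checkpoint_{step}.png")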

Memory Optimization

Offload model components to CPU when not in use:
m.enable_model_cpu_offload()
Pros: Reduces VRAM by 40-50%
Cons: Slower generation (10-20% slower)
Process VAE in smaller slices:
m.enable_vae_slicing()
Pros: Reduces VRAM by 10-15%
Cons: Minimal performance impact
Compute attention in slices:
m.enable_attention_slicing()
Pros: Reduces VRAM by 15-20%
Cons: Slower generation (5-10% slower)
Reduce LoRA rank during training:
lora = m.train_lora(ds, rank=16, alpha=32)  # Instead of rank=32
Pros: Reduces VRAM by 20-30%
Cons: Lower model capacity

Troubleshooting

Common Issues

Out of memory (OOM) during generation

Solutions:
  1. Enable memory optimizations:
    m.enable_model_cpu_offload()
    m.enable_vae_slicing()
    m.enable_attention_slicing()
    
  2. Reduce image resolution:
    image = m.generate(prompt, height=768, width=768)
    
  3. Generate fewer images at once:
    image = m.generate(prompt, num_images=1)  # Instead of 4
    
Out of memory (OOM) during training

Solutions:
  1. Reduce batch size:
    lora = m.train_lora(ds, batch_size=1)
    
  2. Lower LoRA rank:
    lora = m.train_lora(ds, rank=16, alpha=32)
    
  3. Use gradient accumulation:
    lora = m.train_lora(ds, batch_size=1, gradient_accumulation_steps=8)
    
LoRA not producing the expected results

Possible causes and solutions:
  1. Not enough training steps:
    • Increase to 2000-3000 steps
  2. Low quality dataset:
    • Use higher resolution images
    • Add more diverse examples
    • Improve caption quality
  3. Wrong hyperparameters:
    • Try learning_rate=4e-5 or 6e-5
    • Increase rank to 64
    • Adjust alpha to 2x rank
Slow generation

Solutions:
  1. Use FLUX.1 Schnell instead of Dev:
    m = model.load("black-forest-labs/FLUX.1-schnell", torch_dtype="bfloat16")
    
  2. Reduce inference steps:
    image = m.generate(prompt, num_inference_steps=20)
    
  3. Disable CPU offload if enabled:
    m.disable_model_cpu_offload()
    

Example Projects

Portrait LoRA Training

from hypergen import model, dataset

# Load FLUX.1 Dev
m = model.load("black-forest-labs/FLUX.1-dev", torch_dtype="bfloat16")
m.to("cuda")

# Load portrait dataset
ds = dataset.load("./portraits", caption_extension=".txt")

# Train portrait LoRA
lora = m.train_lora(
    ds,
    steps=2000,
    learning_rate=4e-5,
    rank=64,
    alpha=128,
    batch_size=1,
    gradient_accumulation_steps=4,
    save_steps=500,
    output_dir="./portrait_lora"
)

print("Training complete! LoRA saved to ./portrait_lora")

Style Transfer LoRA

from hypergen import model, dataset

# Load FLUX.1 Dev
m = model.load("black-forest-labs/FLUX.1-dev", torch_dtype="bfloat16")
m.to("cuda")

# Load artistic style dataset
ds = dataset.load("./art_style")

# Train style LoRA with lower rank (style doesn't need high capacity)
lora = m.train_lora(
    ds,
    steps=1500,
    learning_rate=5e-5,
    rank=32,
    alpha=64,
    batch_size=1,
    gradient_accumulation_steps=8,
    output_dir="./style_lora"
)

Batch Generation

from hypergen import model

# Load FLUX.1 Schnell for fast generation
m = model.load("black-forest-labs/FLUX.1-schnell", torch_dtype="bfloat16")
m.to("cuda")

# Generate multiple variations
prompts = [
    "A serene mountain landscape at sunrise",
    "A bustling city street at night",
    "A peaceful garden with cherry blossoms",
    "A dramatic ocean sunset with waves",
]

for i, prompt in enumerate(prompts):
    images = m.generate(
        prompt,
        num_inference_steps=4,
        guidance_scale=0.0,
        num_images=2,
    )

    for j, img in enumerate(images):
        img.save(f"output_{i}_{j}.png")

print("Generated", len(prompts) * 2, "images")

License Information

Important: FLUX.1 variants have different licenses!

FLUX.1 Dev

License: FLUX.1 Dev Non-Commercial License
  • ✓ Personal use
  • ✓ Research
  • ✓ Evaluation
  • ✗ Commercial use (requires separate license)
Contact Black Forest Labs for commercial licensing.

FLUX.1 Schnell

License: Apache 2.0
  • ✓ Personal use
  • ✓ Research
  • ✓ Commercial use
  • ✓ Modification and distribution
Fully permissive open-source license.

Next Steps

Training Guide

Complete LoRA training documentation

Dataset Preparation

Learn how to prepare training data

Serving FLUX.1

Deploy FLUX.1 with the API

API Reference

Complete model API documentation

Additional Resources