Datasets

Overview

HyperGen loads training datasets from a plain folder of images. Put your images in a folder, optionally add caption files, and load everything with a single call.

Basic Usage

from hypergen import dataset

# Load all images from a folder
ds = dataset.load("./my_training_images")
print(f"Loaded {len(ds)} images")

Folder Structure

Images Only

The simplest structure is a single folder containing all your images:

my_images/
  photo1.jpg
  photo2.png
  photo3.webp
  ...

Supported formats:

.jpg / .jpeg
.png
.webp
.bmp

Images with Captions

For better results, add caption files next to each image:

my_images/
  photo1.jpg
  photo1.txt        <- "A sunset over the ocean"
  photo2.jpg
  photo2.txt        <- "A person walking in a forest"
  photo3.png
  photo3.txt        <- "Close-up of a flower"
  ...

Caption files should:

Have the same name as the image (except the extension)
Be plain text files (.txt)
Contain a descriptive caption on the first line
Be UTF-8 encoded

Captions are optional but highly recommended. They help the model learn what features to associate with your style or subject.
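
The naming rule above can be sketched with the standard library alone. This is an illustration of the convention, not HyperGen's own loader: the helper names `read_caption` and `pair_images_with_captions` are hypothetical, and the case-insensitive extension match is an assumption.

```python
from pathlib import Path
from typing import Optional

# Extensions listed under "Supported formats".
IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png", ".webp", ".bmp"}


def read_caption(image_path: Path) -> Optional[str]:
    """Return the first line of the sibling .txt file, or None if there is none."""
    caption_path = image_path.with_suffix(".txt")
    if not caption_path.is_file():
        return None
    with caption_path.open(encoding="utf-8") as f:
        first_line = f.readline().strip()
    # Treat an empty caption file the same as a missing one.
    return first_line or None


def pair_images_with_captions(folder: str) -> list[tuple[Path, Optional[str]]]:
    """List (image_path, caption) pairs for every image directly inside folder."""
    images = sorted(
        p for p in Path(folder).iterdir()
        if p.is_file() and p.suffix.lower() in IMAGE_EXTENSIONS
    )
    return [(p, read_caption(p)) for p in images]
```

Running `pair_images_with_captions("./my_images")` before training is a quick way to see which images would end up without a caption.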

Loading Datasets

Simple Loading

from hypergen import dataset

ds = dataset.load("./my_images")

Custom Extensions

Specify which file extensions to include:

ds = dataset.load(
    "./my_images",
    extensions=[".jpg", ".png"]  # Only load JPG and PNG files
)

Checking Dataset Contents

# Get dataset size
print(f"Dataset has {len(ds)} images")

# Iterate over items
for image, caption in ds:
    print(f"Image size: {image.size}")
    print(f"Caption: {caption}")
    break  # Just show first item

Batch Iteration

Process dataset in batches:

for batch in ds.batch(batch_size=4):
    print(f"Batch has {len(batch)} items")
    for image, caption in batch:
        # Process each item in the batch
        pass
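
To make the batching behaviour concrete, here is a minimal stand-alone sketch of fixed-size batching over any sequence. It is not the implementation behind `ds.batch`; the function name `iter_batches` is hypothetical, and keeping a smaller final batch (rather than dropping it) is an assumption this page does not confirm.

```python
from typing import Iterator, Sequence, TypeVar

T = TypeVar("T")


def iter_batches(items: Sequence[T], batch_size: int) -> Iterator[list[T]]:
    """Yield consecutive slices of items, each with at most batch_size elements."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        # The last slice may be shorter when len(items) is not a multiple of batch_size.
        yield list(items[start:start + batch_size])
```

With 10 items and `batch_size=4`, this yields batches of 4, 4, and 2 items.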

Dataset Guidelines

Image Count

Minimum

10-20 images
Minimum for basic style/subject learning

Maximum

1000+ images
For complex styles or high diversity

Diminishing Returns

Beyond 500 images
More data helps, but gains are smaller

Image Quality

Resolution:

Minimum: 512x512
Recommended: 1024x1024 or higher
The model will resize images automatically
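
The thresholds above can be turned into a quick pre-flight check. The sketch below is a hypothetical helper, not part of HyperGen, and it assumes the thresholds apply to the shorter side of the image, which this page does not specify.

```python
from collections import Counter
from typing import Iterable

MINIMUM_SIDE = 512       # "Minimum: 512x512"
RECOMMENDED_SIDE = 1024  # "Recommended: 1024x1024 or higher"


def classify_resolution(width: int, height: int) -> str:
    """Classify an image by its shorter side (an assumed interpretation)."""
    shortest = min(width, height)
    if shortest < MINIMUM_SIDE:
        return "below minimum"
    if shortest < RECOMMENDED_SIDE:
        return "acceptable"
    return "recommended"


def summarize_resolutions(sizes: Iterable[tuple[int, int]]) -> Counter:
    """Count how many (width, height) pairs fall into each class."""
    return Counter(classify_resolution(w, h) for w, h in sizes)
```

If the images yielded by the dataset expose `image.size` as a `(width, height)` pair, as Pillow images do, then `summarize_resolutions(image.size for image, _ in ds)` gives an overview of the whole folder.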

Quality Tips:

Use high-quality, sharp images
Avoid heavily compressed JPEGs
Remove watermarks if possible
Crop to relevant content

Variety:

Include different angles and compositions
Vary lighting conditions
Mix different aspects of your subject/style
Avoid duplicate or near-duplicate images
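
Byte-identical duplicates are easy to find by hashing file contents. The sketch below uses only the standard library; `find_exact_duplicates` is a hypothetical helper, and it does not catch near-duplicates such as resized or re-encoded copies, which need a perceptual comparison instead.

```python
import hashlib
from collections import defaultdict
from pathlib import Path

IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png", ".webp", ".bmp"}


def find_exact_duplicates(folder: str) -> list[list[Path]]:
    """Group image files in folder whose contents are byte-for-byte identical."""
    by_digest: dict[str, list[Path]] = defaultdict(list)
    for path in sorted(Path(folder).iterdir()):
        if path.is_file() and path.suffix.lower() in IMAGE_EXTENSIONS:
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            by_digest[digest].append(path)
    # Only groups with more than one file are duplicates.
    return [paths for paths in by_digest.values() if len(paths) > 1]
```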

Caption Guidelines

Good captions help the model learn better.

Do:

Describe what’s in the image objectively
Mention key visual elements (colors, objects, actions)
Be specific but concise (1-2 sentences)
Use consistent terminology across captions

Don’t:

Write subjective opinions (“beautiful”, “amazing”)
Add metadata or keywords
Copy the same caption for all images
Write overly long descriptions

Example:

A woman with red hair wearing a blue dress standing in a garden with blooming roses
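
Some of the guidelines above can be checked mechanically. The following is a rough heuristic sketch, not a HyperGen feature: the word list and the 40-word cutoff are illustrative assumptions, and `lint_captions` is a hypothetical name.

```python
from collections import Counter

# Illustrative values only; adjust both to your own dataset.
SUBJECTIVE_WORDS = {"beautiful", "amazing", "stunning", "gorgeous"}
MAX_WORDS = 40


def lint_captions(captions: list[str]) -> list[tuple[int, str]]:
    """Return (index, problem) pairs for captions that break a guideline."""
    counts = Counter(c.strip().lower() for c in captions)
    problems = []
    for i, caption in enumerate(captions):
        # Strip surrounding punctuation so "beautiful," still matches.
        words = [w.strip(".,!?;:\"'").lower() for w in caption.split()]
        flagged = sorted(SUBJECTIVE_WORDS.intersection(words))
        if flagged:
            problems.append((i, f"subjective wording: {', '.join(flagged)}"))
        if len(words) > MAX_WORDS:
            problems.append((i, f"too long: {len(words)} words"))
        if counts[caption.strip().lower()] > 1:
            problems.append((i, "duplicate caption"))
    return problems
```

A check like this only flags candidates for review; whether a caption is objective and specific enough still needs a human read.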

Advanced Dataset Usage

Accessing Individual Items

# Get specific item by index
image, caption = ds[0]

# Check if captions are available
if caption is not None:
    print(f"Has caption: {caption}")
else:
    print("No caption for this image")

Dataset Properties

# Image paths
print(ds.image_paths)  # List of Path objects

# Captions (may be None)
print(ds.captions)  # List of strings or None values

Custom Dataset Class

For advanced use cases, you can subclass Dataset and override item loading:

from hypergen.dataset import Dataset

class MyCustomDataset(Dataset):
    def __getitem__(self, idx):
        # Custom loading logic
        image, caption = super().__getitem__(idx)

        # Apply custom preprocessing
        # ...

        return image, caption

Common Issues

No Images Found

Error:

ValueError: No images found in ./my_images with extensions ['.jpg', '.jpeg', ...]

Solutions:

Check that the path is correct
Verify images have supported extensions
Ensure the files aren’t hidden (their names must not start with .)
Try absolute path instead of relative
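
The checks above can be run in one pass with a small diagnostic script. This is a stand-alone sketch using only the standard library; `diagnose_folder` is a hypothetical helper, and the case-insensitive extension match is an assumption about how the loader behaves.

```python
from collections import Counter
from pathlib import Path

SUPPORTED = {".jpg", ".jpeg", ".png", ".webp", ".bmp"}


def diagnose_folder(folder: str) -> dict:
    """Report why a folder might yield no images: bad path, hidden files, odd extensions."""
    path = Path(folder).resolve()  # shows the absolute path a relative one points to
    report = {
        "resolved_path": str(path),
        "exists": path.is_dir(),
        "supported": 0,
        "hidden": [],
        "other_extensions": Counter(),
    }
    if not report["exists"]:
        return report
    for entry in path.iterdir():
        if not entry.is_file():
            continue
        if entry.name.startswith("."):
            report["hidden"].append(entry.name)
        elif entry.suffix.lower() in SUPPORTED:
            report["supported"] += 1
        else:
            # Caption .txt files are counted here as well.
            report["other_extensions"][entry.suffix.lower()] += 1
    return report
```

If `supported` is 0 while `other_extensions` shows entries such as `.tiff` or `.heic`, the images need converting to a supported format first.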

Missing Captions

If some images don’t have captions, they’ll have None as their caption:

for image, caption in ds:
    if caption is None:
        print("Image has no caption")
    else:
        print(f"Caption: {caption}")

This is fine: the model will work without captions, just less effectively.
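
To see how many images lack a caption, a short helper over a list of captions is enough. The sketch below is hypothetical and works on any list of strings and `None` values, such as the `ds.captions` list described under Dataset Properties.

```python
from typing import Optional, Sequence


def caption_coverage(captions: Sequence[Optional[str]]) -> tuple[float, list[int]]:
    """Return (fraction of items with a caption, indices of items without one)."""
    missing = [i for i, caption in enumerate(captions) if caption is None]
    total = len(captions)
    # An empty dataset has no captions to count, so report 0.0 rather than divide by zero.
    covered = (total - len(missing)) / total if total else 0.0
    return covered, missing
```

Calling `caption_coverage(ds.captions)` returns the covered fraction together with the indices to look up in `ds.image_paths`.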

Unicode/Encoding Issues

If you see encoding errors, ensure caption files are UTF-8:

# When creating caption files, specify UTF-8
with open("photo1.txt", "w", encoding="utf-8") as f:
    f.write("Your caption with émojis and ñ")
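
For caption files that already exist, the following sketch finds the ones that are not valid UTF-8 and converts them. Both helpers are hypothetical and use only the standard library. The source encoding has to be supplied by you, because it cannot be detected reliably from the bytes alone.

```python
from pathlib import Path


def find_non_utf8_captions(folder: str) -> list[Path]:
    """Return the .txt files in folder whose bytes do not decode as UTF-8."""
    bad = []
    for path in sorted(Path(folder).glob("*.txt")):
        try:
            path.read_bytes().decode("utf-8")
        except UnicodeDecodeError:
            bad.append(path)
    return bad


def convert_to_utf8(path: Path, source_encoding: str) -> None:
    """Rewrite path as UTF-8, decoding its current bytes with source_encoding."""
    text = path.read_bytes().decode(source_encoding)
    path.write_text(text, encoding="utf-8")
```

For example, files saved by an older Windows editor are often `cp1252`; keep a backup before converting, since decoding with the wrong encoding silently produces wrong characters.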

Dataset Examples

Style Transfer Dataset

For learning an art style:

art_style_dataset/
  painting1.jpg
  painting1.txt     <- "Abstract painting with blue and red geometric shapes"
  painting2.jpg
  painting2.txt     <- "Impressionist landscape with trees and water"
  painting3.jpg
  painting3.txt     <- "Cubist portrait with angular features"
  ...

Subject/Character Dataset

For learning a specific person or character:

character_dataset/
  portrait1.jpg
  portrait1.txt     <- "Close-up portrait of John facing forward"
  portrait2.jpg
  portrait2.txt     <- "John smiling in profile view"
  portrait3.jpg
  portrait3.txt     <- "Full body shot of John wearing a suit"
  ...

Product Dataset

For learning product photography:

product_dataset/
  product1.jpg
  product1.txt      <- "Red sneaker on white background from side angle"
  product2.jpg
  product2.txt      <- "Red sneaker on white background from top view"
  product3.jpg
  product3.txt      <- "Red sneaker on white background from front"
  ...

Next Steps

LoRA Training

Learn how to train with your dataset

Training Overview

Understand the training process

Examples

View complete training examples

Quick Start

Train your first LoRA in 5 minutes
