Dataset Configuration¶
This section explains how to configure datasets and gives an overview of the dataset architecture. It covers how to use configuration files to load, process, organize, and augment datasets.
Overview¶
In the RainbowNeko Engine, datasets are defined through configuration files. It is recommended to use Python configuration files, which allow users to define data sources, data processing logic, data bucketing strategies, and more in a flexible manner.
Below is a typical example of a dataset configuration:
from functools import partial

import torchvision
import torchvision.transforms as T

# neko_cfg, BaseDataset, IndexSource, HandlerChain, LoadImageHandler,
# ImageHandler, and FixedBucket come from RainbowNeko (imports omitted here).

@neko_cfg
def cfg_data():
    return dict(
        dataset1=partial(BaseDataset, batch_size=128, loss_weight=1.0,
            source=dict(
                data_source1=IndexSource(
                    data=torchvision.datasets.cifar.CIFAR10(root=r'D:\others\dataset\cifar', train=True, download=True)
                ),
            ),
            handler=HandlerChain(
                load=LoadImageHandler(),
                bucket=FixedBucket.handler,
                image=ImageHandler(transform=T.Compose([
                        T.RandomCrop(size=32, padding=4),
                        T.RandomHorizontalFlip(),
                        T.ToTensor(),
                        T.Normalize(mean=[0.4914, 0.4822, 0.4465], std=[0.2023, 0.1994, 0.2010]),
                    ]),
                )
            ),
            bucket=FixedBucket(target_size=32),
        )
    )
Core Components of a Dataset¶
The dataset configuration primarily consists of the following core components:
Data Source (DataSource)
Implemented via DataSource and its subclasses.
Defines the source of the data (e.g., loading from local files or fetching from remote APIs).
Specifies the structure of the data (e.g., (image, label) or (image, image, text)).
Data Handler (DataHandler)
Used for processing raw data, such as image reading, augmentation, and format conversion. Multiple handlers can be combined using HandlerChain or HandlerGroup.
Data Bucket (Bucket)
Organizes data into groups (e.g., grouping images with the same size into one batch). Implemented via BaseBucket and its subclasses.
Dataset
Wraps the data source, handler, and bucket while providing the standard __getitem__ and __len__ interfaces.
Detailed Dataset Configuration¶
1. Data Source Configuration¶
The data source is defined in the source parameter of Dataset. It is a dictionary that supports various types of data sources. Below is an example of a typical data source configuration:
source=dict(
    data_source1=IndexSource(
        data=torchvision.datasets.cifar.CIFAR10(root=r'D:\others\dataset\cifar', train=True, download=True)
    ),
)
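IndexSource works with any indexable object, meaning one that implements __getitem__ and __len__. As a minimal sketch of that protocol, the hypothetical SquaresDataset class below (an illustration, not part of RainbowNeko) implements both methods in plain Python:

```python
class SquaresDataset:
    """Hypothetical toy dataset: sample i is the pair (i, i * i)."""

    def __init__(self, size):
        self.size = size

    def __len__(self):
        # Number of samples available.
        return self.size

    def __getitem__(self, index):
        # Return one sample; raise IndexError when out of range.
        if not 0 <= index < self.size:
            raise IndexError(index)
        return index, index * index


toy_data = SquaresDataset(5)
samples = [toy_data[i] for i in range(len(toy_data))]
```

Assuming IndexSource relies only on these two methods, an object like toy_data could be passed as data=toy_data in place of the CIFAR10 dataset.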
Data Source Types
IndexSource
Loads data directly from an indexable object (one providing the __getitem__ and __len__ interfaces), such as PyTorch’s built-in Dataset.
ImageLabelSource
Loads images and labels from an image folder and a label file.
data_source1=ImageLabelSource(
    img_root='Path to image folder',
    label_file='Path to label file',
)
Supported formats for label_file can be found in Label File Formats.
Data structure:
{ 'image': Path to image, 'label': Label }
ImagePairSource
Loads paired image-to-image datasets from an image folder and a label file.
data_source1=ImagePairSource(
    img_root='Path to image folder',
    label_file='Path to label file',
)
Data structure:
{ 'image': Path to first image, 'label': Path to second image }
ImageFolderClassSource
Loads datasets in which each class’s images are stored in a separate folder; suitable for classification models.
dataset/
├── class1/
│   ├── img1.png
│   └── img2.png
├── class2/
└── ...
...
Usage:
data_source1=ImageFolderClassSource(
    img_root='Path to dataset folder',
    use_cls_index=True,  # True for using class IDs; False for using class names.
)
Data structure:
{ 'image': Path to image, 'label': Class name or class ID }
UnLabelSource
Loads unlabeled datasets that contain only raw images.
data_source1=UnLabelSource(
    img_root='Path to dataset folder',
)
Data structure:
{ 'image': Path to image, }
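To illustrate the folder-per-class layout and the {'image', 'label'} record structure described for ImageFolderClassSource, here is a minimal standard-library sketch. The helper scan_class_folders is hypothetical, and assigning class IDs by sorted folder name is an assumption; RainbowNeko's own implementation may differ.

```python
import tempfile
from pathlib import Path


def scan_class_folders(img_root, use_cls_index=True):
    """Hypothetical helper: build {'image', 'label'} records from class subfolders."""
    root = Path(img_root)
    # Assumption: class IDs follow the alphabetical order of folder names.
    class_names = sorted(p.name for p in root.iterdir() if p.is_dir())
    records = []
    for cls_index, cls_name in enumerate(class_names):
        for img_path in sorted((root / cls_name).iterdir()):
            if img_path.is_file():
                label = cls_index if use_cls_index else cls_name
                records.append({'image': str(img_path), 'label': label})
    return records


# Build a tiny throwaway dataset folder to try the helper on.
with tempfile.TemporaryDirectory() as tmp:
    layout = {'class1': ['img1.png', 'img2.png'], 'class2': ['img3.png']}
    for cls_name, file_names in layout.items():
        (Path(tmp) / cls_name).mkdir()
        for file_name in file_names:
            (Path(tmp) / cls_name / file_name).touch()
    by_index = scan_class_folders(tmp, use_cls_index=True)
    by_name = scan_class_folders(tmp, use_cls_index=False)
```

With use_cls_index=True the labels are integers; with False they are the folder names, mirroring the two modes described for the use_cls_index parameter.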
Multiple Data Sources¶
You can define multiple data sources that will be merged after being processed by handlers and grouped by buckets.
source=dict(
    data_source1=...,
    data_source2=...,
)
2. Data Handler Configuration¶
Data handlers are defined in the handler field of Dataset, used for preprocessing or augmenting the data. Below is an example of a commonly used image handler configuration:
handler=HandlerChain(
    load=LoadImageHandler(),  # Reads images.
    bucket=FixedBucket.handler,  # Built-in bucket handler.
    # Image transformation and augmentation.
    image=ImageHandler(transform=T.Compose([
            T.RandomCrop(size=32, padding=4),
            T.RandomHorizontalFlip(),
            T.ToTensor(),
            T.Normalize(mean=[0.4914, 0.4822, 0.4465], std=[0.2023, 0.1994, 0.2010]),
        ]),
    )
)
Tip
In most common scenarios, you only need to modify the transform section within ImageHandler. For more advanced handler configurations, refer to Advanced Data Processing Configurations.
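Conceptually, a handler chain passes each sample through its handlers in the order they are listed. The sketch below shows that idea in plain Python. SimpleChain, load, and normalize are hypothetical stand-ins, and the dict-in, dict-out convention is an assumption for illustration, not RainbowNeko's actual HandlerChain implementation.

```python
class SimpleChain:
    """Hypothetical stand-in for a handler chain: applies named steps in order."""

    def __init__(self, **handlers):
        # Keyword arguments keep their insertion order (Python 3.7+).
        self.handlers = handlers

    def __call__(self, sample):
        for handler in self.handlers.values():
            sample = handler(sample)
        return sample


def load(sample):
    # Pretend to read the file: replace the path with a list of pixel values.
    return {**sample, 'image': [0, 128, 255]}


def normalize(sample):
    # Scale pixel values from [0, 255] to [0.0, 1.0].
    return {**sample, 'image': [value / 255 for value in sample['image']]}


chain = SimpleChain(load=load, image=normalize)
result = chain({'image': 'img1.png', 'label': 3})
```

Each step receives the output of the previous one, so the order of the keyword arguments determines the order of processing; fields a step does not touch (here, label) pass through unchanged.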
Batch Handlers¶
BaseDataset supports adding batch handlers for operations such as MixUp that must process a whole batch at once. For instance:
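from rainbowneko.data.handler import MixUPHandler

# num_classes must be defined beforehand (e.g., 10 for CIFAR-10).
dataset1 = BaseDataset(
    # source, handler, and bucket arguments omitted for brevity.
    batch_handler=HandlerChain(
        mixup=MixUPHandler(num_classes=num_classes)
    )
)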
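To show why MixUp needs batch-level processing, the following sketch implements the standard MixUp formulation in plain Python: every sample is blended with a partner from the same batch using a coefficient drawn from a Beta distribution. The function names and the alpha parameter are assumptions for illustration; the internals of MixUPHandler are not documented here and may differ.

```python
import random


def one_hot(label, num_classes):
    return [1.0 if i == label else 0.0 for i in range(num_classes)]


def mixup_batch(images, labels, num_classes, alpha=1.0, rng=None):
    """Blend each sample with a randomly chosen partner from the same batch."""
    rng = rng or random.Random()
    lam = rng.betavariate(alpha, alpha)  # mixing coefficient in [0, 1]
    partner = list(range(len(images)))
    rng.shuffle(partner)  # partner[i] is the sample mixed into sample i
    mixed_images, mixed_labels = [], []
    for i, j in enumerate(partner):
        mixed_images.append(
            [lam * a + (1 - lam) * b for a, b in zip(images[i], images[j])]
        )
        y_i = one_hot(labels[i], num_classes)
        y_j = one_hot(labels[j], num_classes)
        mixed_labels.append([lam * a + (1 - lam) * b for a, b in zip(y_i, y_j)])
    return mixed_images, mixed_labels, lam


# Toy batch: three "images" of two pixels each, one class per sample.
images = [[0.0, 0.0], [1.0, 1.0], [0.5, 0.5]]
labels = [0, 1, 2]
mixed_images, mixed_labels, lam = mixup_batch(
    images, labels, num_classes=3, rng=random.Random(0)
)
```

Because the labels are mixed into soft targets over all classes, the handler has to know the number of classes, which is consistent with the num_classes argument in the configuration above.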
3. Data Bucket Configuration¶
Buckets are defined via the bucket field. They group samples so that all images in a batch have the same size.
If all your training images have identical sizes or cropping is not critical: use FixedBucket. It scales all images by their shorter side and crops them to a fixed size.
from rainbowneko.data import FixedBucket

bucket = FixedBucket(target_size=32)  # Scales and crops every image to 32x32.
For tasks sensitive to cropping but not scaling: use RatioBucket. It clusters images into buckets based on aspect ratios while minimizing cropping.
from rainbowneko.data import RatioBucket

bucket = RatioBucket.from_files(
    target_area=512 * 512,
    step_size=8,
    num_bucket=10,
)
For tasks sensitive to both scaling and cropping: use SizeBucket. It clusters based on resolution similarity instead of aspect ratio.
from rainbowneko.data import SizeBucket

bucket = SizeBucket.from_files(step_size=8, num_bucket=10)
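The trade-off between a fixed square bucket and aspect-ratio buckets can be made concrete with a little geometry. The sketch below is a simplified illustration, not RainbowNeko's algorithm: all function names are hypothetical, and the three candidate ratios are hard-coded, whereas RatioBucket.from_files derives its buckets from the dataset. It compares how much of a 1920x1080 image is cropped away in each case.

```python
import math


def crop_fraction(width, height, bucket_w, bucket_h):
    """Fraction of image area discarded when the image is scaled to cover
    the bucket and then cropped to exactly (bucket_w, bucket_h)."""
    scale = max(bucket_w / width, bucket_h / height)
    scaled_area = (width * scale) * (height * scale)
    return 1.0 - (bucket_w * bucket_h) / scaled_area


def bucket_size_for_ratio(ratio, target_area, step_size):
    """Bucket (w, h) with w / h near `ratio` and area near `target_area`,
    both sides snapped to multiples of `step_size`."""
    def snap(value):
        return max(step_size, round(value / step_size) * step_size)
    return snap(math.sqrt(target_area * ratio)), snap(math.sqrt(target_area / ratio))


def nearest_bucket(width, height, buckets):
    """Pick the bucket whose aspect ratio is closest in log space."""
    log_ratio = math.log(width / height)
    return min(buckets, key=lambda b: abs(math.log(b[0] / b[1]) - log_ratio))


width, height = 1920, 1080

# Fixed square bucket: scale by the shorter side, crop the rest.
fixed_crop = crop_fraction(width, height, 512, 512)

# Aspect-ratio buckets with roughly the same area (hard-coded ratios).
buckets = [bucket_size_for_ratio(r, 512 * 512, 8) for r in (0.5, 1.0, 2.0)]
ratio_bucket = nearest_bucket(width, height, buckets)
ratio_crop = crop_fraction(width, height, *ratio_bucket)
```

For this landscape image, the square bucket discards a much larger share of the pixels than the closest ratio bucket does, which is why ratio-based bucketing suits tasks that are sensitive to cropping.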
Dataset Wrapping¶
A dataset is created with the BaseDataset class, which combines sources, handlers, and buckets with additional parameters such as batch_size and loss_weight.
Multiple datasets can be configured side by side, each with its own resolution or input format during training.
dict(
    dataset1=partial(BaseDataset, batch_size=128, loss_weight=1.0,
        source=...,
        handler=...,
        bucket=...,
    ),
    dataset2=partial(BaseDataset, batch_size=32, loss_weight=0.2,
        source=...,
        handler=...,
        bucket=...,
    )
)
In this configuration:
batch_size specifies the number of samples in each batch for the respective dataset.
loss_weight determines the weight of this dataset’s loss when calculating the total loss during training.
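One natural reading of loss_weight is a weighted sum over the per-dataset losses. The sketch below illustrates that reading with made-up loss values; it is an assumption about how the total is formed, not a description of the engine's training loop, and total_loss is a hypothetical helper.

```python
def total_loss(losses, loss_weights):
    """Weighted sum of per-dataset losses (assumed combination rule)."""
    return sum(loss_weights[name] * loss for name, loss in losses.items())


loss_weights = {'dataset1': 1.0, 'dataset2': 0.2}
step_losses = {'dataset1': 0.50, 'dataset2': 2.00}  # made-up values for one step
combined = total_loss(step_losses, loss_weights)
```

Under this reading, dataset2 contributes only a fifth of its raw loss, so a lower loss_weight reduces that dataset's influence on the gradient.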
Configuring Your Own Dataset¶
Below is an example configuration for the ImageNet dataset:
@neko_cfg
def cfg_data():
    return dict(
        dataset1=partial(BaseDataset, batch_size=64, loss_weight=1.0,
            source=dict(
                data_source1=ImageFolderClassSource(img_root='./imagenet'),
            ),
            handler=HandlerChain(
                load=LoadImageHandler(),
                image=ImageHandler(transform=T.Compose([
                        T.Resize(224),
                        T.RandomResizedCrop(224),
                        T.RandomHorizontalFlip(),
                        T.ToTensor(),
                        T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
                    ]),
                )
            ),
            bucket=FixedBucket(target_size=224),
        )
    )
Explanation of the Configuration:¶
Data Source:
The ImageFolderClassSource is used to load ImageNet data, where images are organized into one folder per class.
Handlers:
LoadImageHandler: Reads image files.
ImageHandler: Applies a series of transformations and augmentations:
Resize images so that the shorter side is 224 pixels.
Randomly crop a region and resize it to 224x224 to introduce variation.
Apply horizontal flipping with a probability of 0.5 (the torchvision default).
Convert images to tensors.
Normalize pixel values using ImageNet’s mean and standard deviation.
Bucket:
A FixedBucket ensures that all images are resized to a uniform size (224x224) for efficient batching.
Batch Size and Loss Weight:
The dataset is configured with a batch size of 64 and a loss weight of 1.0, meaning its contribution to the total loss is fully weighted.
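The normalization step in the configuration computes (value - mean) / std independently for each channel. As a small worked example in plain Python (the helper normalize_pixel is hypothetical and operates on a single RGB pixel rather than a tensor):

```python
IMAGENET_MEAN = [0.485, 0.456, 0.406]
IMAGENET_STD = [0.229, 0.224, 0.225]


def normalize_pixel(rgb, mean=IMAGENET_MEAN, std=IMAGENET_STD):
    """Apply per-channel (value - mean) / std to one RGB pixel in [0, 1]."""
    return [(value - m) / s for value, m, s in zip(rgb, mean, std)]


# A pixel equal to the dataset mean maps to zero in every channel.
mean_pixel = normalize_pixel(IMAGENET_MEAN)
# A pure white pixel ends up a couple of standard deviations above zero.
white_pixel = normalize_pixel([1.0, 1.0, 1.0])
```

After this step the inputs are centered near zero with roughly unit variance per channel, which is the usual motivation for normalizing with dataset statistics.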
By following this guide, you can configure datasets tailored to your specific needs in the RainbowNeko Engine framework. For further customization or advanced features such as integrating custom handlers or buckets, refer to the advanced documentation sections linked within this guide.