Data Loading
DatasetProvider
The DatasetProvider is responsible for creating dataset and data collator instances.
Distributed-Awareness
d9d will not apply sharding to your dataset automatically. You have to configure it manually (optionally applying other dataset wrappers).
Please see the Dataset Utilities documentation.
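To illustrate what manual sharding means, here is a minimal sketch in which each rank reads every `world_size`-th sample, starting at its own rank. `StridedShard` is a hypothetical stand-in written for this page. It is not `d9d.dataset.ShardedDataset`, whose actual signature and behavior are described in the Dataset Utilities documentation.

```python
from typing import Any, Sequence


class StridedShard:
    """Hypothetical wrapper: exposes every `world_size`-th item, starting at `rank`."""

    def __init__(self, data: Sequence[Any], rank: int, world_size: int) -> None:
        if not 0 <= rank < world_size:
            raise ValueError("rank must satisfy 0 <= rank < world_size")
        self._data = data
        self._rank = rank
        self._world_size = world_size

    def __len__(self) -> int:
        # Count the indices i in range(len(data)) with i % world_size == rank.
        remaining = len(self._data) - self._rank
        return max(0, -(-remaining // self._world_size))  # ceiling division

    def __getitem__(self, index: int) -> Any:
        if not 0 <= index < len(self):
            raise IndexError(index)
        # Map the local index back to a position in the full dataset.
        return self._data[self._rank + index * self._world_size]
```

In this sketch the shards are disjoint and together cover the whole dataset, but their lengths differ when the dataset size is not divisible by the world size. A production wrapper would typically pad or truncate so that every rank sees the same number of samples.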
Example Implementation
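The sketch below shows the general shape of a provider: a callable that receives a context and returns a dataset together with a collator. It deliberately avoids importing d9d, so every type in it is a hypothetical stand-in. `StubContext` flattens rank and world size into plain fields, whereas the real `InitializeDatasetContext` carries `dist_context` and `batch_maths`. The field names on `StubResult` are likewise assumptions about what `InitializeDatasetResult` holds.

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Sequence


@dataclass
class StubContext:
    """Stand-in for InitializeDatasetContext (assumed, simplified fields)."""
    rank: int
    world_size: int


@dataclass
class StubResult:
    """Stand-in for InitializeDatasetResult (field names are assumptions)."""
    dataset: Sequence[Any]
    collator: Callable[[Sequence[Any]], Any]


def collate_as_list(samples: Sequence[Any]) -> List[Any]:
    # Trivial collator: gather the samples of one batch into a list.
    return list(samples)


class ToyDatasetProvider:
    """Callable with the same shape as a DatasetProvider: context in, result out."""

    def __init__(self, num_samples: int) -> None:
        self._num_samples = num_samples

    def __call__(self, context: StubContext) -> StubResult:
        full = list(range(self._num_samples))
        # Manual sharding: keep only the samples that belong to this rank.
        shard = full[context.rank::context.world_size]
        return StubResult(dataset=shard, collator=collate_as_list)
```

Calling `ToyDatasetProvider(8)(StubContext(rank=1, world_size=4))` yields a result whose dataset contains only samples 1 and 5. In real code, the sharding step would be replaced by a d9d dataset wrapper configured from the distributed context.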
d9d.loop.control.dataset_provider
DatasetProvider
Bases: Protocol
Protocol that allows users to define how datasets are loaded and collated.
Users should implement this protocol to provide custom data loading logic.
__call__(context)
Initializes the dataset components.
The user must shard the dataset manually, for example by using d9d.dataset.ShardedDataset.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| context | InitializeDatasetContext | Context for this operation. | required |
Returns:
| Type | Description |
|---|---|
| InitializeDatasetResult | Result of this operation. |
InitializeDatasetContext
dataclass
Context data required to initialize a dataset provider.
Attributes:
| Name | Type | Description |
|---|---|---|
| dist_context | DistributedContext | The distributed context containing rank and world size information. |
| batch_maths | BatchMaths | The batch maths component handling global batch size calculations. |