Configuration Schemas
The d9d.loop.config package defines the structure for configuring a training job using Pydantic models. This enables strict validation of configurations (e.g., checking that the global batch size is divisible by the product of the microbatch size and the data-parallel (DP) size).
Main Config
d9d.loop.config.TrainerConfig
Bases: BaseModel
Top-level configuration object defining a complete training job.
Attributes:
| Name | Type | Description |
|---|---|---|
| run | RunConfig | Meta-information about the run (name, ID, tags). |
| batching | BatchingConfig | Batch sizing strategy. |
| data_loading | DataLoadingConfig | DataLoader settings. |
| logging | JobLoggerConfig | Experiment tracking settings. |
| pipelining | PipeliningConfig | Pipeline parallelism schedule and settings. If None, pipeline parallelism is disabled. |
| model_stage_factory | ModelStageFactoryConfig | Model initialization and additional checkpointing logic. |
| determinism | DeterminismConfig | Random seed settings. |
| gc | GarbageCollectionConfig | Garbage collection settings. |
| checkpointing | CheckpointingConfig | Checkpoint saving settings. |
| gradient_clipping | GradientClippingConfig | Gradient clipping settings. |
| profiling | ProfilingConfig \| None | Profiler settings. |
| gradient_manager | GradientManagerConfig | Gradient synchronization settings. |
| timeout | TimeoutConfig | Distributed timeout settings. |
d9d.loop.config.InferenceConfig
Bases: BaseModel
Top-level configuration object defining an inference/evaluation job.
Attributes:
| Name | Type | Description |
|---|---|---|
| batching | BatchingConfig | Batch sizing strategy. |
| data_loading | DataLoadingConfig | DataLoader settings. |
| model_stage_factory | ModelStageFactoryConfig | Model initialization logic. |
| determinism | DeterminismConfig | Random seed settings. |
| gc | GarbageCollectionConfig | Garbage collection settings. |
| checkpointing | CheckpointingConfig | Checkpointing settings. |
| profiling | ProfilingConfig \| None | Profiler settings. |
| timeout | TimeoutConfig | Distributed timeout settings. |
Sub-Configurations
Diagnostics & Reproducibility
d9d.tracker.RunConfig
Bases: BaseModel
Configuration for initializing a specific logged run.
Attributes:
| Name | Type | Description |
|---|---|---|
| name | str | The display name of the experiment. |
| description | str \| None | An optional description of the experiment. |
| hparams | dict[str, Any] | A dictionary of hyperparameters to log at the start of the run. |
d9d.loop.config.JobLoggerConfig
Bases: BaseModel
Configuration for experiment tracking and logging.
Attributes:
| Name | Type | Description |
|---|---|---|
| period_steps | StepActionPeriod | How frequently metrics are flushed to the logger. |
| tracker | AnyTrackerConfig | Logic for the specific tracking backend (e.g., WandB, MLflow, stdout). |
d9d.loop.config.ProfilingConfig
Bases: BaseModel
Configuration for the PyTorch Profiler.
Attributes:
| Name | Type | Description |
|---|---|---|
| enabled | bool | Whether to enable the profiler. |
| traces_dir | Path | Directory where trace files will be saved. |
| period_steps | int | Total length of a profiling cycle (wait + warmup + active). |
| warmup_steps | int | Number of steps to ignore before recording, to allow for warm-up. |
| active_steps | int | Number of steps to actively record traces. |
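The cycle semantics can be made concrete with a small helper. This is an illustrative sketch, not the actual d9d implementation: it assumes the cycle is ordered wait, then warmup, then active, with the wait length derived as period minus warmup minus active.

```python
def profiler_phase(step: int, period_steps: int, warmup_steps: int,
                   active_steps: int) -> str:
    """Classify a global step within the repeating profiling cycle.

    Assumes the cycle order is: wait -> warmup -> active, where
    wait = period_steps - warmup_steps - active_steps.
    """
    wait_steps = period_steps - warmup_steps - active_steps
    assert wait_steps >= 0, "warmup + active must fit inside the period"
    pos = step % period_steps
    if pos < wait_steps:
        return "wait"
    if pos < wait_steps + warmup_steps:
        return "warmup"
    return "active"


# With a 10-step cycle, 2 warmup and 1 active: steps 0-6 wait,
# steps 7-8 warm up, step 9 records a trace, then the cycle repeats.
print([profiler_phase(s, 10, 2, 1) for s in range(10)])
```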
d9d.loop.config.DeterminismConfig
Experiment Trackers
d9d.tracker.AnyTrackerConfig = Annotated[AimConfig | NullTrackerConfig, Field(discriminator='provider')]
module-attribute
d9d.tracker.provider.null.NullTrackerConfig
d9d.tracker.provider.aim.config.AimConfig
Bases: BaseModel
Configuration for the Aim tracker backend.
Attributes:
| Name | Type | Description |
|---|---|---|
| provider | Literal['aim'] | Discriminator field; must be 'aim'. |
| repo | str | Path to the Aim repository directory or URL. |
| log_system_params | bool | Whether to log system resource usage (CPU/GPU/memory). |
| capture_terminal_logs | bool | Whether to capture stdout/stderr. |
| system_tracking_interval | int | Interval in seconds for system monitoring. |
Batching & Data
d9d.loop.config.BatchingConfig
Bases: BaseModel
Configuration for batch sizing logic.
Attributes:
| Name | Type | Description |
|---|---|---|
| global_batch_size | int | The total effective batch size across all distributed replicas and gradient accumulation steps. |
| microbatch_size | int | The batch size fed into the model during a single forward pass on a single device. |
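Together these two fields determine the number of gradient-accumulation steps per optimizer update. The following sketch mirrors the divisibility check described above using a plain dataclass (the real d9d models are Pydantic; the method name here is illustrative):

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BatchingConfig:
    """Illustrative stand-in for d9d.loop.config.BatchingConfig."""
    global_batch_size: int
    microbatch_size: int

    def gradient_accumulation_steps(self, dp_size: int) -> int:
        """Microbatches each DP replica processes per optimizer step."""
        denom = self.microbatch_size * dp_size
        if self.global_batch_size % denom != 0:
            raise ValueError(
                f"global_batch_size={self.global_batch_size} must be "
                f"divisible by microbatch_size * dp_size = {denom}"
            )
        return self.global_batch_size // denom


cfg = BatchingConfig(global_batch_size=256, microbatch_size=4)
print(cfg.gradient_accumulation_steps(dp_size=8))  # 256 / (4 * 8) = 8
```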
d9d.loop.config.DataLoadingConfig
Bases: BaseModel
Configuration for PyTorch DataLoaders.
Attributes:
| Name | Type | Description |
|---|---|---|
| num_workers | int | The number of subprocesses to use for data loading. |
| pin_memory | bool | Whether to copy tensors into CUDA pinned memory before returning them. |
| persistent_workers | bool | If True, the data loader will not shut down the worker processes after a dataset has been consumed once. |
Checkpointing
d9d.loop.config.CheckpointingConfig
Bases: BaseModel
Configuration for saving model snapshots.
Attributes:
| Name | Type | Description |
|---|---|---|
| save_dir | Path | The root directory where checkpoints will be stored. |
| period_steps | StepActionPeriod | How frequently to save a checkpoint. |
| num_to_keep | int \| None | The maximum number of recent checkpoints to retain. If None, all checkpoints are kept. |
Model Initialization
d9d.loop.config.ModelStageFactoryConfig
Bases: BaseModel
Configuration for initializing model weights.
Attributes:
| Name | Type | Description |
|---|---|---|
| source_checkpoint | Path \| None | Path to an initial checkpoint to load into the model before training starts. If None, random initialization is used. |
| checkpoint_only_trainable_parameters | bool | If True, only parameters with requires_grad=True will be saved in checkpoints. Useful for PEFT/LoRA. |
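The effect of checkpoint_only_trainable_parameters can be illustrated with a small filter. Plain booleans stand in here for real torch parameters' requires_grad flags; the function name and parameter names are invented for the example:

```python
def select_checkpoint_params(params: dict[str, bool],
                             only_trainable: bool) -> list[str]:
    """Pick which parameter names enter the checkpoint.

    `params` maps a parameter name to its requires_grad flag.
    With only_trainable=True, frozen weights (e.g. the base model
    under LoRA) are excluded, shrinking the checkpoint.
    """
    if not only_trainable:
        return sorted(params)
    return sorted(name for name, requires_grad in params.items()
                  if requires_grad)


params = {"base.weight": False, "lora.A": True, "lora.B": True}
print(select_checkpoint_params(params, only_trainable=True))
# ['lora.A', 'lora.B']
```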
Optimization
d9d.loop.config.GradientClippingConfig
Bases: BaseModel
Configuration for gradient norm clipping.
Attributes:
| Name | Type | Description |
|---|---|---|
| max_norm | float \| None | The maximum norm value for gradient clipping. If None, no clipping is performed. |
| log_total_steps | StepActionPeriod | Frequency at which to log the total gradient norm. |
d9d.loop.config.GradientManagerConfig
Infrastructure
d9d.loop.config.PipeliningConfig
Bases: BaseModel
Configuration for pipeline parallelism orchestration.
Attributes:
| Name | Type | Description |
|---|---|---|
| schedule | AnyPipelineScheduleConfig | The specific scheduling strategy configuration used to manage pipeline execution. |
d9d.loop.config.GarbageCollectionConfig
Bases: BaseModel
Configuration for manual Python garbage collection control.
Attributes:
| Name | Type | Description |
|---|---|---|
| period_steps | StepActionPeriod | How frequently to manually trigger the Python garbage collector. |
d9d.loop.config.TimeoutConfig
Types
d9d.loop.config.StepActionPeriod = int | StepActionSpecial
module-attribute
Union type representing a configuration for periodic events.
Values:

- int: The period in steps (frequency) at which the event occurs.
- StepActionSpecial: A special flag indicating end-of-run execution or disabling.
d9d.loop.config.StepActionSpecial
Bases: StrEnum
Special flag values for configuring periodic actions.
Attributes:
| Name | Description |
|---|---|
| last_step | Indicates the action should occur exactly once at the very end of the training run. |
| disable | Indicates the action should never occur. |