
Configuration Schemas

The d9d.loop.config package defines the structure for configuring a training job using Pydantic models. This enables strict validation of configurations (e.g., ensuring that the global batch size is divisible by the microbatch size and the data-parallel (DP) size).
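For illustration, a cross-field check like the batch-size constraint can be expressed as a Pydantic model validator. The model below is a cut-down sketch, not the library's actual code, and it checks only the microbatch constraint (the real schema also accounts for the DP size):

```python
from pydantic import BaseModel, model_validator


class BatchingConfig(BaseModel):
    """Illustrative stand-in for d9d.loop.config.BatchingConfig."""

    global_batch_size: int
    microbatch_size: int

    @model_validator(mode="after")
    def _check_divisibility(self) -> "BatchingConfig":
        # Reject configs where the global batch cannot be evenly split
        # into microbatches.
        if self.global_batch_size % self.microbatch_size != 0:
            raise ValueError(
                f"global_batch_size={self.global_batch_size} is not divisible "
                f"by microbatch_size={self.microbatch_size}"
            )
        return self
```

Constructing an invalid config then raises pydantic.ValidationError (a ValueError subclass) instead of failing later inside the training loop.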

Main Config

d9d.loop.config.TrainerConfig

Bases: BaseModel

Top-level configuration object defining a complete training job.

Attributes:

run (RunConfig): Meta-information about the run (name, ID, tags).
batching (BatchingConfig): Batch sizing strategy.
data_loading (DataLoadingConfig): DataLoader settings.
logging (JobLoggerConfig): Experiment tracking settings.
pipelining (PipeliningConfig): Pipeline parallelism schedule and settings. If None, pipeline parallelism is disabled.
model_stage_factory (ModelStageFactoryConfig): Model initialization and additional checkpointing logic.
determinism (DeterminismConfig): Random seed settings.
gc (GarbageCollectionConfig): Garbage collection settings.
checkpointing (CheckpointingConfig): Checkpoint saving settings.
gradient_clipping (GradientClippingConfig): Gradient clipping settings.
profiling (ProfilingConfig | None): Profiler settings.
gradient_manager (GradientManagerConfig): Gradient synchronization settings.
timeout (TimeoutConfig): Distributed timeout settings.

d9d.loop.config.InferenceConfig

Bases: BaseModel

Top-level configuration object defining an inference/evaluation job.

Attributes:

batching (BatchingConfig): Batch sizing strategy.
data_loading (DataLoadingConfig): DataLoader settings.
model_stage_factory (ModelStageFactoryConfig): Model initialization logic.
determinism (DeterminismConfig): Random seed settings.
gc (GarbageCollectionConfig): Garbage collection settings.
checkpointing (CheckpointingConfig): Checkpointing settings.
profiling (ProfilingConfig | None): Profiler settings.
timeout (TimeoutConfig): Distributed timeout settings.

Sub-Configurations

Diagnostics & Reproducibility

d9d.tracker.RunConfig

Bases: BaseModel

Configuration for initializing a specific logged run.

Attributes:

name (str): The display name of the experiment.
description (str | None): An optional description of the experiment.
hparams (dict[str, Any]): A dictionary of hyperparameters to log at the start of the run.

d9d.loop.config.JobLoggerConfig

Bases: BaseModel

Configuration for experiment tracking and logging.

Attributes:

period_steps (StepActionPeriod): How frequently metrics are flushed to the logger.
tracker (AnyTrackerConfig): Configuration for the specific tracking backend (e.g., WandB, MLflow, stdout).

d9d.loop.config.ProfilingConfig

Bases: BaseModel

Configuration for the PyTorch Profiler.

Attributes:

enabled (bool): Whether to enable the profiler.
traces_dir (Path): Directory where trace files will be saved.
period_steps (int): Total length of a profiling cycle (wait + warmup + active).
warmup_steps (int): Number of steps to ignore before recording, to allow for warm-up.
active_steps (int): Number of steps to actively record traces.
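Because period_steps covers the whole cycle, the implied wait phase is whatever remains after the warmup and active windows. A hypothetical helper (names are illustrative, not part of d9d) makes the arithmetic explicit:

```python
def profiler_phases(period_steps: int, warmup_steps: int, active_steps: int) -> dict[str, int]:
    """Split a profiling cycle into its wait, warmup, and active phases.

    Illustrative only: mirrors the wait + warmup + active decomposition
    used by schedulers such as torch.profiler.schedule.
    """
    wait_steps = period_steps - warmup_steps - active_steps
    if wait_steps < 0:
        raise ValueError("warmup_steps + active_steps must not exceed period_steps")
    return {"wait": wait_steps, "warmup": warmup_steps, "active": active_steps}
```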

d9d.loop.config.DeterminismConfig

Bases: BaseModel

Configuration for reproducibility and random number generation.

Attributes:

base_seed (int): The base integer seed used to initialize random number generators (Python, NumPy, PyTorch) across all ranks.
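A minimal sketch of how a base seed might be fanned out across ranks, using only the standard library. The per-rank offset and the helper name are assumptions for illustration; the actual d9d seeding scheme may differ, and the real loop also seeds NumPy and PyTorch (e.g., numpy.random.seed and torch.manual_seed):

```python
import random


def seed_everything(base_seed: int, rank: int = 0) -> int:
    """Derive a deterministic per-rank seed and seed the stdlib RNG."""
    seed = base_seed + rank  # simple, deterministic per-rank offset
    random.seed(seed)
    return seed
```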

Experiment Trackers

d9d.tracker.AnyTrackerConfig = Annotated[AimConfig | NullTrackerConfig, Field(discriminator='provider')] (module attribute)

d9d.tracker.provider.null.NullTrackerConfig

Bases: BaseModel

Configuration for the Null (no-op) tracker.

Attributes:

provider (Literal['null']): Discriminator field; must be 'null'.

d9d.tracker.provider.aim.config.AimConfig

Bases: BaseModel

Configuration for the Aim tracker backend.

Attributes:

provider (Literal['aim']): Discriminator field; must be 'aim'.
repo (str): Path to the Aim repository directory or URL.
log_system_params (bool): Whether to log system resource usage (CPU/GPU/Memory).
capture_terminal_logs (bool): Whether to capture stdout/stderr.
system_tracking_interval (int): Interval in seconds for system monitoring.
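The discriminated union above can be exercised with Pydantic v2's TypeAdapter: the value of the provider field selects which model is instantiated. The models below are cut-down stand-ins (the real AimConfig has more fields):

```python
from typing import Annotated, Literal, Union

from pydantic import BaseModel, Field, TypeAdapter


class NullTrackerConfig(BaseModel):
    provider: Literal["null"]


class AimConfig(BaseModel):
    provider: Literal["aim"]
    repo: str


# Mirrors the AnyTrackerConfig union: 'provider' picks the concrete model.
AnyTrackerConfig = Annotated[
    Union[AimConfig, NullTrackerConfig], Field(discriminator="provider")
]

adapter = TypeAdapter(AnyTrackerConfig)
cfg = adapter.validate_python({"provider": "aim", "repo": "./aim-repo"})
assert isinstance(cfg, AimConfig)
```

A payload with an unknown provider value fails validation immediately, which is the main benefit of the discriminator over a plain union.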

Batching & Data

d9d.loop.config.BatchingConfig

Bases: BaseModel

Configuration for batch sizing logic.

Attributes:

global_batch_size (int): The total effective batch size across all distributed replicas and gradient accumulation steps.
microbatch_size (int): The batch size fed into the model during a single forward pass on a single device.
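These two sizes, together with the data-parallel size, determine how many microbatches each rank accumulates per optimizer step. A hypothetical helper (the function name and dp_size parameter are illustrative, not part of d9d) shows the arithmetic behind the divisibility constraint:

```python
def accumulation_steps(global_batch_size: int, microbatch_size: int, dp_size: int) -> int:
    """Number of microbatches each data-parallel rank runs per optimizer step."""
    per_step = microbatch_size * dp_size  # samples consumed per accumulation step
    if global_batch_size % per_step != 0:
        raise ValueError("global_batch_size must be divisible by microbatch_size * dp_size")
    return global_batch_size // per_step
```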

d9d.loop.config.DataLoadingConfig

Bases: BaseModel

Configuration for PyTorch DataLoaders.

Attributes:

num_workers (int): The number of subprocesses to use for data loading.
pin_memory (bool): Whether to copy tensors into CUDA pinned memory before returning them.
persistent_workers (bool): If True, the data loader will not shut down the worker processes after a dataset has been consumed once.

Checkpointing

d9d.loop.config.CheckpointingConfig

Bases: BaseModel

Configuration for saving model snapshots.

Attributes:

save_dir (Path): The root directory where checkpoints will be stored.
period_steps (StepActionPeriod): How frequently to save a checkpoint.
num_to_keep (int | None): The maximum number of recent checkpoints to retain. If None, all checkpoints are kept.

Model Initialization

d9d.loop.config.ModelStageFactoryConfig

Bases: BaseModel

Configuration for initializing model weights.

Attributes:

source_checkpoint (Path | None): Path to an initial checkpoint to load into the model before training starts. If None, random initialization is used.
checkpoint_only_trainable_parameters (bool): If True, only parameters with requires_grad=True will be saved in checkpoints. Useful for PEFT/LoRA.

Optimization

d9d.loop.config.GradientClippingConfig

Bases: BaseModel

Configuration for gradient norm clipping.

Attributes:

max_norm (float | None): The maximum norm value for gradient clipping. If None, no clipping is performed.
log_total_steps (StepActionPeriod): How frequently to log the total gradient norm.

d9d.loop.config.GradientManagerConfig

Bases: BaseModel

Configuration for gradient synchronization.

Attributes:

grad_dtype (str | None): The data type used to store gradients. If None, follows the model's dtype.
bucket_size_mb (int): The size of gradient buckets, in megabytes, used for communication.
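Size-capped bucketing groups gradients so that each all-reduce moves roughly bucket_size_mb of data. The greedy helper below is a sketch of the idea, not d9d's implementation (DDP-style implementations typically bucket in reverse parameter order and overlap communication with the backward pass):

```python
def bucket_parameters(param_sizes_bytes: list[int], bucket_size_mb: int) -> list[list[int]]:
    """Greedily group parameter indices into buckets capped at bucket_size_mb."""
    cap = bucket_size_mb * 1024 * 1024
    buckets: list[list[int]] = []
    current: list[int] = []
    current_bytes = 0
    for idx, size in enumerate(param_sizes_bytes):
        # Start a new bucket when adding this parameter would exceed the cap.
        if current and current_bytes + size > cap:
            buckets.append(current)
            current, current_bytes = [], 0
        current.append(idx)
        current_bytes += size
    if current:
        buckets.append(current)
    return buckets
```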

Infrastructure

d9d.loop.config.PipeliningConfig

Bases: BaseModel

Configuration for pipeline parallelism orchestration.

Attributes:

schedule (AnyPipelineScheduleConfig): The specific scheduling strategy configuration used to manage pipeline execution.

d9d.loop.config.GarbageCollectionConfig

Bases: BaseModel

Configuration for manual Python garbage collection control.

Attributes:

period_steps (StepActionPeriod): How frequently to manually trigger the Python garbage collector.

d9d.loop.config.TimeoutConfig

Bases: BaseModel

Configuration for distributed process group timeouts.

Attributes:

init_timeout (int): Timeout in seconds for initializing the process group.
step_timeout (int): Timeout in seconds for individual step communications.

Types

d9d.loop.config.StepActionPeriod = int | StepActionSpecial (module attribute)

Union type representing a configuration for periodic events.

Values:

int: The period, in steps, at which the event occurs.
StepActionSpecial: A special flag indicating end-of-run execution or disabling.

d9d.loop.config.StepActionSpecial

Bases: StrEnum

Special flag values for configuring periodic actions.

Attributes:

last_step: Indicates the action should occur exactly once at the very end of the training run.
disable: Indicates the action should never occur.
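A StepActionPeriod might be resolved per step roughly as follows. This is a hypothetical sketch (the helper name and 1-based step convention are assumptions; the stand-in enum uses a str mixin in place of the library's StrEnum base for portability to Python < 3.11):

```python
from enum import Enum


class StepActionSpecial(str, Enum):
    """Stand-in for d9d.loop.config.StepActionSpecial."""

    last_step = "last_step"
    disable = "disable"


def should_run(period: "int | StepActionSpecial", step: int, total_steps: int) -> bool:
    """Decide whether a periodic action fires on a given (1-based) step."""
    if period is StepActionSpecial.disable:
        return False  # the action never occurs
    if period is StepActionSpecial.last_step:
        return step == total_steps  # exactly once, at the end of the run
    return step % period == 0  # plain integer period
```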