
Configuration Schemas

The d9d.loop.config package defines the structure for configuring a training job using Pydantic models. This enables strict validation of configurations (e.g., ensuring that the global batch size is divisible by the microbatch size and the data-parallel (DP) size).
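For illustration, a cross-field check like the batch-size constraint can be expressed as a Pydantic model validator. The model below is a cut-down sketch, not the library's actual code, and it checks only the microbatch constraint (the real schema also accounts for the DP size):

```python
from pydantic import BaseModel, model_validator


class BatchingConfig(BaseModel):
    """Illustrative stand-in for d9d.loop.config.BatchingConfig."""

    global_batch_size: int
    microbatch_size: int

    @model_validator(mode="after")
    def _check_divisibility(self) -> "BatchingConfig":
        # Reject configs where the global batch cannot be evenly split
        # into microbatches.
        if self.global_batch_size % self.microbatch_size != 0:
            raise ValueError(
                f"global_batch_size={self.global_batch_size} is not divisible "
                f"by microbatch_size={self.microbatch_size}"
            )
        return self
```

Constructing an invalid config then raises pydantic.ValidationError (a ValueError subclass) instead of failing later inside the training loop.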

Main Config

d9d.loop.config.TrainerConfig

Bases: BaseModel

Top-level configuration object defining a complete training job.

Attributes:

run (RunConfig): Meta-information about the run (name, ID, tags).
batching (BatchingConfig): Batch sizing strategy.
data_loading (DataLoadingConfig): DataLoader settings.
logging (JobLoggerConfig): Experiment tracking settings.
pipelining (PipeliningConfig): Pipeline parallelism schedule and settings. If None, pipeline parallelism is disabled.
model_stage_factory (ModelStageFactoryConfig): Model initialization and additional checkpointing logic.
determinism (DeterminismConfig): Random seed settings.
gc (GarbageCollectionConfig): Garbage collection settings.
checkpointing (CheckpointingConfig): Checkpoint saving settings.
gradient_clipping (GradientClippingConfig): Gradient clipping settings.
profiling (ProfilingConfig | None): Profiler settings.
gradient_manager (GradientManagerConfig): Gradient synchronization settings.
timeout (TimeoutConfig): Distributed timeout settings.

d9d.loop.config.InferenceConfig

Bases: BaseModel

Top-level configuration object defining an inference/evaluation job.

Attributes:

batching (BatchingConfig): Batch sizing strategy.
data_loading (DataLoadingConfig): DataLoader settings.
model_stage_factory (ModelStageFactoryConfig): Model initialization logic.
determinism (DeterminismConfig): Random seed settings.
gc (GarbageCollectionConfig): Garbage collection settings.
checkpointing (CheckpointingConfig): Checkpointing settings.
profiling (ProfilingConfig | None): Profiler settings.
timeout (TimeoutConfig): Distributed timeout settings.

Sub-Configurations

Diagnostics & Reproducibility

d9d.tracker.RunConfig

Bases: BaseModel

Configuration for initializing a specific logged run.

Attributes:

name (str): The display name of the experiment.
description (str | None): An optional description of the experiment.
hparams (dict[str, Any]): A dictionary of hyperparameters to log at the start of the run.

d9d.loop.config.JobLoggerConfig

Bases: BaseModel

Configuration for experiment tracking and logging.

Attributes:

period_steps (StepActionPeriod): How frequently metrics are flushed to the logger.
tracker (AnyTrackerConfig): Configuration for the specific tracking backend (e.g., WandB, MLflow, stdout).

d9d.loop.config.ProfilingConfig

Bases: BaseModel

Configuration for the PyTorch Profiler.

Attributes:

enabled (bool): Whether to enable the profiler.
traces_dir (Path): Directory where trace files will be saved.
period_steps (int): Total length of a profiling cycle (wait + warmup + active).
warmup_steps (int): Number of steps to ignore before recording, to allow for warm-up.
active_steps (int): Number of steps to actively record traces.
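Because period_steps covers the whole cycle, the implied wait phase is whatever remains after the warmup and active windows. A hypothetical helper (names are illustrative, not part of d9d) makes the arithmetic explicit:

```python
def profiler_phases(period_steps: int, warmup_steps: int, active_steps: int) -> dict[str, int]:
    """Split a profiling cycle into its wait, warmup, and active phases.

    Illustrative only: mirrors the wait + warmup + active decomposition
    used by schedulers such as torch.profiler.schedule.
    """
    wait_steps = period_steps - warmup_steps - active_steps
    if wait_steps < 0:
        raise ValueError("warmup_steps + active_steps must not exceed period_steps")
    return {"wait": wait_steps, "warmup": warmup_steps, "active": active_steps}
```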

d9d.loop.config.DeterminismConfig

Bases: BaseModel

Configuration for reproducibility and random number generation.

Attributes:

base_seed (int): The base integer seed used to initialize random number generators (Python, NumPy, PyTorch) across all ranks.
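A minimal sketch of how a base seed might be fanned out across ranks, using only the standard library. The per-rank offset and the helper name are assumptions for illustration; the actual d9d seeding scheme may differ, and the real loop also seeds NumPy and PyTorch (e.g., numpy.random.seed and torch.manual_seed):

```python
import random


def seed_everything(base_seed: int, rank: int = 0) -> int:
    """Derive a deterministic per-rank seed and seed the stdlib RNG."""
    seed = base_seed + rank  # simple, deterministic per-rank offset
    random.seed(seed)
    return seed
```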

Experiment Trackers

d9d.tracker.AnyTrackerConfig = Annotated[AimConfig | NullTrackerConfig, Field(discriminator='provider')] (module attribute)

d9d.tracker.provider.null.NullTrackerConfig

Bases: BaseModel

Configuration for the Null (no-op) tracker.

Attributes:

provider (Literal['null']): Discriminator field; must be 'null'.

d9d.tracker.provider.aim.config.AimConfig

Bases: BaseModel

Configuration for the Aim tracker backend.

Attributes:

provider (Literal['aim']): Discriminator field; must be 'aim'.
repo (str): Path to the Aim repository directory or URL.
log_system_params (bool): Whether to log system resource usage (CPU/GPU/Memory).
capture_terminal_logs (bool): Whether to capture stdout/stderr.
system_tracking_interval (int): Interval in seconds for system monitoring.
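The discriminated union above can be exercised with Pydantic v2's TypeAdapter: the value of the provider field selects which model is instantiated. The models below are cut-down stand-ins (the real AimConfig has more fields):

```python
from typing import Annotated, Literal, Union

from pydantic import BaseModel, Field, TypeAdapter


class NullTrackerConfig(BaseModel):
    provider: Literal["null"]


class AimConfig(BaseModel):
    provider: Literal["aim"]
    repo: str


# Mirrors the AnyTrackerConfig union: 'provider' picks the concrete model.
AnyTrackerConfig = Annotated[
    Union[AimConfig, NullTrackerConfig], Field(discriminator="provider")
]

adapter = TypeAdapter(AnyTrackerConfig)
cfg = adapter.validate_python({"provider": "aim", "repo": "./aim-repo"})
assert isinstance(cfg, AimConfig)
```

A payload with an unknown provider value fails validation immediately, which is the main benefit of the discriminator over a plain union.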

Batching & Data

d9d.loop.config.BatchingConfig

Bases: BaseModel

Configuration for batch sizing logic.

Attributes:

global_batch_size (int): The total effective batch size across all distributed replicas and gradient accumulation steps.
microbatch_size (int): The batch size fed into the model during a single forward pass on a single device.
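These two sizes, together with the data-parallel size, determine how many microbatches each rank accumulates per optimizer step. A hypothetical helper (the function name and dp_size parameter are illustrative, not part of d9d) shows the arithmetic behind the divisibility constraint:

```python
def accumulation_steps(global_batch_size: int, microbatch_size: int, dp_size: int) -> int:
    """Number of microbatches each data-parallel rank runs per optimizer step."""
    per_step = microbatch_size * dp_size  # samples consumed per accumulation step
    if global_batch_size % per_step != 0:
        raise ValueError("global_batch_size must be divisible by microbatch_size * dp_size")
    return global_batch_size // per_step
```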

d9d.loop.config.DataLoadingConfig

Bases: BaseModel

Configuration for PyTorch DataLoaders.

Attributes:

num_workers (int): The number of subprocesses to use for data loading.
pin_memory (bool): Whether to copy tensors into CUDA pinned memory before returning them.
persistent_workers (bool): If True, the data loader will not shut down the worker processes after a dataset has been consumed once.

Checkpointing

d9d.loop.config.CheckpointingConfig

Bases: BaseModel

Configuration for saving model snapshots.

Attributes:

save_dir (Path): The root directory where checkpoints will be stored.
period_steps (StepActionPeriod): How frequently to save a checkpoint.
num_to_keep (int | None): The maximum number of recent checkpoints to retain. If None, all checkpoints are kept.

Model Initialization

d9d.loop.config.ModelStageFactoryConfig

Bases: BaseModel

Configuration for initializing model weights.

Attributes:

source_checkpoint (Path | None): Path to an initial checkpoint to load into the model before training starts. If None, random initialization is used.
checkpoint_only_trainable_parameters (bool): If True, only parameters with requires_grad=True will be saved in checkpoints. Useful for PEFT/LoRA.

Optimization

d9d.loop.config.GradientClippingConfig

Bases: BaseModel

Configuration for gradient norm clipping.

Attributes:

max_norm (float | None): The maximum norm value for gradient clipping. If None, no clipping is performed.
log_total_steps (StepActionPeriod): How frequently to log the total gradient norm.

d9d.loop.config.GradientManagerConfig

Bases: BaseModel

Configuration for gradient synchronization.

Attributes:

grad_dtype (str | None): The data type used to store gradients. If None, follows the model's dtype.
bucket_size_mb (int): The size of gradient buckets, in megabytes, used for communication.
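Size-capped bucketing groups gradients so that each all-reduce moves roughly bucket_size_mb of data. The greedy helper below is a sketch of the idea, not d9d's implementation (DDP-style implementations typically bucket in reverse parameter order and overlap communication with the backward pass):

```python
def bucket_parameters(param_sizes_bytes: list[int], bucket_size_mb: int) -> list[list[int]]:
    """Greedily group parameter indices into buckets capped at bucket_size_mb."""
    cap = bucket_size_mb * 1024 * 1024
    buckets: list[list[int]] = []
    current: list[int] = []
    current_bytes = 0
    for idx, size in enumerate(param_sizes_bytes):
        # Start a new bucket when adding this parameter would exceed the cap.
        if current and current_bytes + size > cap:
            buckets.append(current)
            current, current_bytes = [], 0
        current.append(idx)
        current_bytes += size
    if current:
        buckets.append(current)
    return buckets
```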

Infrastructure

d9d.loop.config.PipeliningConfig

Bases: BaseModel

Configuration for pipeline parallelism orchestration.

Attributes:

schedule (AnyPipelineScheduleConfig): The specific scheduling strategy configuration used to manage pipeline execution.

d9d.loop.config.GarbageCollectionConfig

Bases: BaseModel

Configuration for manual Python garbage collection control.

Attributes:

period_steps (StepActionPeriod): How frequently to manually trigger the Python garbage collector.

d9d.loop.config.TimeoutConfig

Bases: BaseModel

Configuration for distributed process group timeouts.

Attributes:

init_timeout (int): Timeout in seconds for initializing the process group.
step_timeout (int): Timeout in seconds for individual step communications.

Types

d9d.loop.config.StepActionPeriod = int | StepActionSpecial (module attribute)

Union type representing a configuration for periodic events.

Values:

int: The period, in steps, at which the event occurs.
StepActionSpecial: A special flag indicating end-of-run execution or disabling.

d9d.loop.config.StepActionSpecial

Bases: StrEnum

Special flag values for configuring periodic actions.

Attributes:

last_step: Indicates the action should occur exactly once at the very end of the training run.
disable: Indicates the action should never occur.
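A StepActionPeriod might be resolved per step roughly as follows. This is a hypothetical sketch (the helper name and 1-based step convention are assumptions; the stand-in enum uses a str mixin in place of the library's StrEnum base for portability to Python < 3.11):

```python
from enum import Enum


class StepActionSpecial(str, Enum):
    """Stand-in for d9d.loop.config.StepActionSpecial."""

    last_step = "last_step"
    disable = "disable"


def should_run(period: "int | StepActionSpecial", step: int, total_steps: int) -> bool:
    """Decide whether a periodic action fires on a given (1-based) step."""
    if period is StepActionSpecial.disable:
        return False  # the action never occurs
    if period is StepActionSpecial.last_step:
        return step == total_steps  # exactly once, at the end of the run
    return step % period == 0  # plain integer period
```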