Experiment Tracking

About

Warning:

If you are utilizing the standard d9d training infrastructure, you do not need to call these functions manually. The framework automatically handles tracking based on configuration. This package is primarily intended for users extending d9d.

The d9d.tracker package provides a unified, configuration-driven interface for logging metrics, hyperparameters, and distributions during training.

It abstracts the specific backend (such as Aim or simple console logging) behind a common API. Combined with a Pydantic-based configuration system, this lets users switch logging backends through configuration files without changing any training loop code.

Crucially, the tracker is State Aware. It implements the PyTorch Stateful protocol, ensuring that if a training job is interrupted and resumed, the tracker automatically re-attaches to the existing experiment run rather than creating a fragmented new one.
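The resume behavior can be pictured with a small dependency-free sketch. `ToyTracker` is a hypothetical class that only mimics the `state_dict` / `load_state_dict` pair of the Stateful protocol; how d9d actually stores and restores the run identifier is not shown here.

```python
import uuid


class ToyTracker:
    """Hypothetical tracker that remembers which run it belongs to."""

    def __init__(self):
        self.run_id = None

    def start(self) -> str:
        # Reuse the restored run ID if there is one; otherwise start a new run.
        if self.run_id is None:
            self.run_id = uuid.uuid4().hex
        return self.run_id

    def state_dict(self) -> dict:
        return {"run_id": self.run_id}

    def load_state_dict(self, state_dict: dict) -> None:
        self.run_id = state_dict.get("run_id")


# First process: start a run, then checkpoint the tracker state.
first = ToyTracker()
original_run = first.start()
checkpoint = first.state_dict()

# Restarted process: restore the state before starting, so the same run continues.
resumed = ToyTracker()
resumed.load_state_dict(checkpoint)
resumed_run = resumed.start()
```

Without the `load_state_dict` call, the restarted process would generate a fresh identifier and the experiment would be split across two runs.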

Architecture Separation of Concerns

The module splits tracking logic between two objects with different lifetimes:

The Tracker (Factory/Manager): Represented by BaseTracker. This object persists throughout the lifecycle of the application. It holds configuration (where to save logs) and state (the ID of the current run). It is responsible for creating "Runs".
The Run (Session): Represented by BaseTrackerRun. This is a context-managed object active only during the actual training loop. It handles the logging operations: set_step, set_context, scalar, and bins.

There is also a factory function, tracker_from_config, that creates a BaseTracker object from a Pydantic configuration object.
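The split between a long-lived tracker and a short-lived run might look roughly like the sketch below. `PrintTracker` and `PrintRun` are hypothetical stand-ins, and `open` takes a plain name here instead of a `RunConfig` to keep the example self-contained.

```python
from contextlib import contextmanager


class PrintRun:
    """Hypothetical run object that prints every scalar it receives."""

    def __init__(self, name: str):
        self.name = name
        self.step = 0
        self.logged = []

    def set_step(self, step: int) -> None:
        self.step = step

    def scalar(self, name: str, value: float) -> None:
        self.logged.append((self.step, name, value))
        print(f"[{self.name}] step={self.step} {name}={value}")


class PrintTracker:
    """Hypothetical long-lived tracker that hands out short-lived runs."""

    def __init__(self):
        self.closed_runs = []

    @contextmanager
    def open(self, name: str):
        run = PrintRun(name)
        try:
            yield run
        finally:
            # Cleanup happens even if the training loop raises.
            self.closed_runs.append(run.name)


tracker = PrintTracker()  # lives for the whole application
with tracker.open("demo") as run:  # lives only for the training loop
    for step, loss in enumerate([0.9, 0.5, 0.3]):
        run.set_step(step)
        run.scalar("loss", loss)
```

Keeping the run inside a `with` block gives the backend a well-defined point to flush and close the session.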

Adding a New Tracker

To support a new logging backend (e.g., Weights & Biases, MLFlow), you need to implement three components and register them in the factory.

The Configuration

Create a Pydantic model for your tracker's settings. It must contain a provider field typed as a Literal, which acts as the discriminator for polymorphic deserialization.

from typing import Literal
from pydantic import BaseModel

class WandbConfig(BaseModel):
    provider: Literal['wandb'] = 'wandb'
    project: str
    entity: str | None = None

The Run Handler

Implement BaseTrackerRun. This class maps d9d calls (scalar, bins) to the specific calls of your backend SDK.

from d9d.tracker import BaseTrackerRun

class WandbRun(BaseTrackerRun):
    def __init__(self, run_obj):
        self._run = run_obj
        self._step = 0

    def set_step(self, step: int):
        self._step = step

    # ... implement scalar(), bins(), etc. to call self._run.log()

The Tracker Factory

Implement BaseTracker. This handles initialization and state persistence (resuming).

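from contextlib import contextmanager

import wandb
from d9d.tracker import BaseTracker, RunConfig

class WandbTracker(BaseTracker[WandbConfig]):
    def __init__(self, config: WandbConfig):
        self.config = config
        self.run_id = None  # State to persist across restarts

    @classmethod
    def from_config(cls, config: WandbConfig) -> "WandbTracker":
        # Required by BaseTracker; called by tracker_from_config
        return cls(config)

    def state_dict(self):
        # This is saved to the checkpoint
        return {"run_id": self.run_id}

    def load_state_dict(self, state_dict):
        # This is restored from the checkpoint
        self.run_id = state_dict.get("run_id")

    @contextmanager
    def open(self, properties: RunConfig):
        # Re-attach to the saved run if there is one; otherwise a new run is created
        run = wandb.init(
            project=self.config.project, entity=self.config.entity,
            id=self.run_id, resume="allow",
            name=properties.name, notes=properties.description, config=properties.hparams,
        )
        self.run_id = run.id
        try:
            yield WandbRun(run)
        finally:
            run.finish()  # Close the run even if training raises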

Registration

To make tracker_from_config recognize your new tracker, you must modify d9d/tracker/factory.py.

Add your config to the AnyTrackerConfig type alias:

AnyTrackerConfig = Annotated[
    AimConfig | NullTrackerConfig | WandbConfig, # <--- Add here
    Field(discriminator='provider')
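]

Then register the tracker class in _MAP. Guard the import so that a missing optional dependency is only reported when the tracker is actually requested:

try: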
    from .provider.wandb.tracker import WandbTracker
    _MAP[WandbConfig] = WandbTracker
except ImportError as e:
    _MAP[WandbConfig] = _TrackerImportFailed('wandb', e)
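
The guarded-import registration can be sketched in isolation as follows. All names here (`REGISTRY`, `ImportFailed`, `create`, `some_missing_tracking_sdk`) are hypothetical, and the registry is keyed by provider string for brevity, whereas the factory above keys `_MAP` by config type.

```python
from dataclasses import dataclass


@dataclass
class ImportFailed:
    """Placeholder stored in the registry when a backend's dependency is missing."""

    dependency: str
    exception: ImportError


class NullTracker:
    """Hypothetical backend with no third-party dependencies."""


REGISTRY: dict[str, object] = {"null": NullTracker}

try:
    # Hypothetical module name that is assumed not to be installed.
    import some_missing_tracking_sdk  # noqa: F401

    REGISTRY["fancy"] = object
except ImportError as e:
    REGISTRY["fancy"] = ImportFailed("some_missing_tracking_sdk", e)


def create(provider: str):
    entry = REGISTRY[provider]
    if isinstance(entry, ImportFailed):
        # Fail only when the backend is requested, and keep the original cause.
        raise ImportError(
            f"provider {provider!r} needs: {entry.dependency}"
        ) from entry.exception
    return entry()
```

With this pattern, importing the package never fails because of an optional backend; the error surfaces only when that backend is selected, with the original import error attached as the cause.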

`d9d.tracker`

Package providing a unified interface for experiment tracking and logging.

`BaseTracker`

Bases: ABC, Stateful, Generic[TConfig]

Abstract base class for a tracker backend factory.

This class manages the lifecycle of runs and integrates with the distributed checkpointing system to ensure experiment continuity (e.g., resuming the same run hash after a restart).

Source code in d9d/tracker/base.py

class BaseTracker(abc.ABC, Stateful, Generic[TConfig]):
    """
    Abstract base class for a tracker backend factory.

    This class manages the lifecycle of runs and integration with the
    distributed checkpointing system to ensure experiment continuity
    (e.g., resuming the same run hash after a restart).
    """

    @contextmanager
    @abc.abstractmethod
    def open(self, properties: RunConfig) -> Generator[BaseTrackerRun, None, None]:
        """
        Context manager that initiates and manages an experiment run.

        Args:
            properties: Configuration metadata for the run.

        Yields:
            An active BaseTrackerRun instance for logging metrics.
        """

        ...

    @classmethod
    @abc.abstractmethod
    def from_config(cls, config: TConfig) -> Self:
        """
        Factory method to create a tracker instance from a configuration object.

        Args:
            config: The backend-specific configuration object.

        Returns:
            An initialized instance of the tracker.
        """

        ...

`from_config(config)` `abstractmethod` `classmethod`

Factory method to create a tracker instance from a configuration object.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `config` | `TConfig` | The backend-specific configuration object. | required |

Returns:

| Type | Description |
| --- | --- |
| `Self` | An initialized instance of the tracker. |

Source code in d9d/tracker/base.py

@classmethod
@abc.abstractmethod
def from_config(cls, config: TConfig) -> Self:
    """
    Factory method to create a tracker instance from a configuration object.

    Args:
        config: The backend-specific configuration object.

    Returns:
        An initialized instance of the tracker.
    """

    ...

`open(properties)` `abstractmethod`

Context manager that initiates and manages an experiment run.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `properties` | `RunConfig` | Configuration metadata for the run. | required |

Yields:

| Type | Description |
| --- | --- |
| `BaseTrackerRun` | An active BaseTrackerRun instance for logging metrics. |

Source code in d9d/tracker/base.py

@contextmanager
@abc.abstractmethod
def open(self, properties: RunConfig) -> Generator[BaseTrackerRun, None, None]:
    """
    Context manager that initiates and manages an experiment run.

    Args:
        properties: Configuration metadata for the run.

    Yields:
        An active BaseTrackerRun instance for logging metrics.
    """

    ...

`BaseTrackerRun`

Bases: ABC

Abstract base class representing an active tracking session (run).

This object is responsible for the actual logging of metrics and parameters during a training or inference run.

Source code in d9d/tracker/base.py

class BaseTrackerRun(abc.ABC):
    """
    Abstract base class representing an active tracking session (run).

    This object is responsible for the actual logging of metrics, parameters,
    during train or inference run.
    """

    @abc.abstractmethod
    def set_step(self, step: int):
        """
        Updates the global step counter for subsequent logs.

        Args:
            step: The current step index (e.g., iteration number).
        """
        ...

    @abc.abstractmethod
    def set_context(self, context: dict[str, str]):
        """
        Sets a persistent context dictionary for subsequent logs.

        These context values (tags) will be attached to every metric logged
        until changed.

        Args:
            context: A dictionary of tag names and values.
        """
        ...

    @abc.abstractmethod
    def scalar(self, name: str, value: float, context: dict[str, str] | None = None):
        """
        Logs a scalar value.

        Args:
            name: The name of the metric.
            value: The scalar value to log.
            context: Optional ephemeral context specific to this metric event.
                Merged with global context if present.
        """
        ...

    @abc.abstractmethod
    def bins(self, name: str, values: torch.Tensor, context: dict[str, str] | None = None):
        """
        Logs a distribution/histogram of values.

        Args:
            name: The name of the metric.
            values: A tensor containing the population of values to bin.
            context: Optional ephemeral context specific to this metric event.
                Merged with global context if present.
        """
        ...

`bins(name, values, context=None)` `abstractmethod`

Logs a distribution/histogram of values.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `name` | `str` | The name of the metric. | required |
| `values` | `Tensor` | A tensor containing the population of values to bin. | required |
| `context` | `dict[str, str] \| None` | Optional ephemeral context specific to this metric event. Merged with global context if present. | `None` |

Source code in d9d/tracker/base.py

@abc.abstractmethod
def bins(self, name: str, values: torch.Tensor, context: dict[str, str] | None = None):
    """
    Logs a distribution/histogram of values.

    Args:
        name: The name of the metric.
        values: A tensor containing the population of values to bin.
        context: Optional ephemeral context specific to this metric event.
            Merged with global context if present.
    """
    ...

`scalar(name, value, context=None)` `abstractmethod`

Logs a scalar value.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `name` | `str` | The name of the metric. | required |
| `value` | `float` | The scalar value to log. | required |
| `context` | `dict[str, str] \| None` | Optional ephemeral context specific to this metric event. Merged with global context if present. | `None` |

Source code in d9d/tracker/base.py

@abc.abstractmethod
def scalar(self, name: str, value: float, context: dict[str, str] | None = None):
    """
    Logs a scalar value.

    Args:
        name: The name of the metric.
        value: The scalar value to log.
        context: Optional ephemeral context specific to this metric event.
            Merged with global context if present.
    """
    ...

`set_context(context)` `abstractmethod`

Sets a persistent context dictionary for subsequent logs.

These context values (tags) will be attached to every metric logged until changed.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `context` | `dict[str, str]` | A dictionary of tag names and values. | required |

Source code in d9d/tracker/base.py

@abc.abstractmethod
def set_context(self, context: dict[str, str]):
    """
    Sets a persistent context dictionary for subsequent logs.

    These context values (tags) will be attached to every metric logged
    until changed.

    Args:
        context: A dictionary of tag names and values.
    """
    ...

`set_step(step)` `abstractmethod`

Updates the global step counter for subsequent logs.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `step` | `int` | The current step index (e.g., iteration number). | required |

Source code in d9d/tracker/base.py

@abc.abstractmethod
def set_step(self, step: int):
    """
    Updates the global step counter for subsequent logs.

    Args:
        step: The current step index (e.g., iteration number).
    """
    ...

`RunConfig`

Bases: BaseModel

Configuration for initializing a specific logged run.

Attributes:

| Name | Type | Description |
| --- | --- | --- |
| `name` | `str` | The display name of the experiment. |
| `description` | `str \| None` | An optional description of the experiment. |
| `hparams` | `dict[str, Any]` | A dictionary of hyperparameters to log at the start of the run. |

Source code in d9d/tracker/base.py

class RunConfig(BaseModel):
    """
    Configuration for initializing a specific logged run.

    Attributes:
        name: The display name of the experiment.
        description: An optional description of the experiment.
        hparams: A dictionary of hyperparameters to log at the start of the run.
    """

    name: str
    description: str | None
    hparams: dict[str, Any] = Field(default_factory=dict)

`tracker_from_config(config)`

Instantiates a specific tracker implementation based on the configuration.

Based on the 'provider' field in the config, this function selects the appropriate backend (e.g., Aim, Null). It handles checking for missing dependencies for optional backends.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `config` | `AnyTrackerConfig` | A specific tracker configuration object. | required |

Returns:

| Type | Description |
| --- | --- |
| `BaseTracker` | An initialized BaseTracker instance. |

Raises:

| Type | Description |
| --- | --- |
| `ImportError` | If the dependencies for the requested provider are not installed. |

Source code in d9d/tracker/factory.py

def tracker_from_config(config: AnyTrackerConfig) -> BaseTracker:
    """
    Instantiates a specific tracker implementation based on the configuration.

    Based on the 'provider' field in the config, this function selects the
    appropriate backend (e.g., Aim, Null). It handles checking for missing
    dependencies for optional backends.

    Args:
        config: A specific tracker configuration object.

    Returns:
        An initialized BaseTracker instance.

    Raises:
        ImportError: If the dependencies for the requested provider are not installed.
    """

    tracker_type = _MAP[type(config)]

    if isinstance(tracker_type, _TrackerImportFailed):
        raise ImportError(
            f"The tracker configuration {config.provider} could not be loaded - "
            f"ensure these dependencies are installed: {tracker_type.dependency}"
        ) from tracker_type.exception

    return tracker_type.from_config(config)
