About

Warning:

If you are using the standard d9d training infrastructure, you do not need to call these functions manually; the framework handles profiling automatically based on its configuration. This package is primarily intended for users extending d9d.

The d9d.internals.profiling package provides a distributed-aware wrapper around the standard PyTorch Profiler.

In large-scale distributed training, profiling often becomes difficult due to:

  1. File Naming: Thousands of ranks writing to the same filename causes race conditions.
  2. Storage Space: Raw Chrome tracing JSON files can grow to gigabytes very quickly.
  3. Synchronization: Ensuring that every rank profiles the same step requires manual coordination.

The Profiler class solves these issues by automatically handling file naming based on the DeviceMesh coordinates, compressing traces into .tar.gz archives on the fly, and managing the profiling schedule (wait/warmup/active).
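
As a rough sketch of typical usage (the import path mirrors the package name above; dist_context, total_steps and train_one_step are illustrative placeholders rather than d9d APIs):

    from pathlib import Path

    from d9d.internals.profiling import Profiler

    profiler = Profiler(
        save_dir=Path("traces"),
        period_steps=100,   # length of one full wait/warmup/active cycle
        warmup_steps=3,     # steps run under the profiler but discarded
        active_steps=5,     # steps actually recorded and exported
        dist_context=dist_context,  # an already-initialised DistributedContext
    )

    with profiler.open(start_step=0) as prof:
        for step in range(total_steps):
            train_one_step()
            prof.step()  # advance the schedule; traces are dumped automatically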

d9d.internals.profiling

Exposes the internal distributed profiler.

Profiler

Manages distributed performance profiling using PyTorch Profiler.

This class wraps torch.profiler to provide automatic trace exporting, compression, and file naming consistent with the distributed DeviceMesh topology. It configures the schedule to repeat periodically based on the provided step counts.
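
As an illustration of how the schedule is derived inside open() (the numbers below are arbitrary example values, not defaults of the class):

    period_steps, warmup_steps, active_steps = 10, 2, 3
    wait = period_steps - (warmup_steps + active_steps)  # = 5 idle steps per cycle
    # Each 10-step cycle: 5 steps idle, 2 warming up, 3 recorded,
    # then the cycle repeats for as long as the context manager stays open.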

Source code in d9d/internals/profiling/profile.py
class Profiler:
    """
    Manages distributed performance profiling using PyTorch Profiler.

    This class wraps `torch.profiler` to provide automatic trace exporting,
    compression, and file naming consistent with the distributed DeviceMesh
    topology. It configures the schedule to repeat periodically based on
    the provided step counts.
    """

    def __init__(
            self,
            save_dir: Path,
            period_steps: int,
            warmup_steps: int,
            active_steps: int,
            dist_context: DistributedContext
    ):
        """
        Constructs a Profiler object.

        Args:
            save_dir: Directory where trace files will be saved.
            period_steps: Total length of a profiling cycle (wait + warmup + active).
            warmup_steps: Number of steps to ignore before recording to allow for warming-up.
            active_steps: Number of steps to actively record traces.
            dist_context: The distributed context object.
        """

        self._save_dir = save_dir
        self._period = period_steps
        self._warmup = warmup_steps
        self._active = active_steps
        self._dist_context = dist_context

    def _dump_trace(self, prof: tprof.profile):
        save_dir = self._save_dir / f"step_{prof.step_num}"
        save_dir.mkdir(parents=True, exist_ok=True)
        mesh_regular = self._dist_context.mesh_for(REGULAR_DOMAIN)
        coord = mesh_regular.get_coordinate()
        if coord is None:
            raise RuntimeError("Invalid mesh")
        coord_str = "-".join(map(str, coord))
        rank = mesh_regular.get_rank()
        save_file = save_dir / f"rank-{rank}-coord-{coord_str}-trace.json"

        begin = time.monotonic()

        prof.export_chrome_trace(str(save_file))
        with tarfile.open(save_file.with_suffix(".tar.gz"), "w:gz") as tar:
            tar.add(save_file, arcname=save_file.name)
        save_file.unlink()

        end = time.monotonic()

        self._dist_context.logger.info(
            f"Finished dumping profiler traces in {end - begin:.2f} seconds"
        )

    @contextmanager
    def open(self, start_step: int):
        """
        Opens a context manager for profiling execution.

        This sets up the `torch.profiler.profile` with a schedule derived from
        the initialization parameters. It captures both CPU and CUDA activities,
        records shapes, and tracks stack traces.

        When the schedule triggers `on_trace_ready`, the trace is automatically
        exported to the `save_dir`, compressed into a `.tar.gz` file, and the
        raw JSON is removed to save space.

        Args:
            start_step: The current global step number to initialize the
                profiler state.

        Yields:
            The configured torch profiler instance.
        """

        wait = self._period - (self._active + self._warmup)
        warmup = self._warmup
        active = self._active

        with tprof.profile(
                activities=[
                    tprof.ProfilerActivity.CPU,
                    tprof.ProfilerActivity.CUDA
                ],
                schedule=tprof.schedule(wait=wait, warmup=warmup, active=active),
                on_trace_ready=self._dump_trace,
                record_shapes=True,
                with_stack=True
        ) as profiler:
            profiler.step_num = start_step
            yield profiler

__init__(save_dir, period_steps, warmup_steps, active_steps, dist_context)

Constructs a Profiler object.

Parameters:

    save_dir (Path, required): Directory where trace files will be saved.
    period_steps (int, required): Total length of a profiling cycle (wait + warmup + active).
    warmup_steps (int, required): Number of steps to ignore before recording, to allow for warm-up.
    active_steps (int, required): Number of steps to actively record traces.
    dist_context (DistributedContext, required): The distributed context object.
Source code in d9d/internals/profiling/profile.py
def __init__(
        self,
        save_dir: Path,
        period_steps: int,
        warmup_steps: int,
        active_steps: int,
        dist_context: DistributedContext
):
    """
    Constructs a Profiler object.

    Args:
        save_dir: Directory where trace files will be saved.
        period_steps: Total length of a profiling cycle (wait + warmup + active).
        warmup_steps: Number of steps to ignore before recording to allow for warming-up.
        active_steps: Number of steps to actively record traces.
        dist_context: The distributed context object.
    """

    self._save_dir = save_dir
    self._period = period_steps
    self._warmup = warmup_steps
    self._active = active_steps
    self._dist_context = dist_context

open(start_step)

Opens a context manager for profiling execution.

This sets up the torch.profiler.profile with a schedule derived from the initialization parameters. It captures both CPU and CUDA activities, records shapes, and tracks stack traces.

When the schedule triggers on_trace_ready, the trace is automatically exported to the save_dir, compressed into a .tar.gz file, and the raw JSON is removed to save space.

Parameters:

    start_step (int, required): The current global step number to initialize the profiler state.

Yields:

    The configured torch profiler instance.
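
Given the naming in _dump_trace above, each rank that finishes an active window leaves one compressed archive under save_dir. For example, with save_dir="traces", an active window ending at step 105 on rank 3 with mesh coordinate (0, 1) would produce roughly:

    traces/step_105/rank-3-coord-0-1-trace.tar.gz    (containing rank-3-coord-0-1-trace.json)

The rank number, coordinates and step shown here are purely illustrative.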

Source code in d9d/internals/profiling/profile.py
@contextmanager
def open(self, start_step: int):
    """
    Opens a context manager for profiling execution.

    This sets up the `torch.profiler.profile` with a schedule derived from
    the initialization parameters. It captures both CPU and CUDA activities,
    records shapes, and tracks stack traces.

    When the schedule triggers `on_trace_ready`, the trace is automatically
    exported to the `save_dir`, compressed into a `.tar.gz` file, and the
    raw JSON is removed to save space.

    Args:
        start_step: The current global step number to initialize the
            profiler state.

    Yields:
        The configured torch profiler instance.
    """

    wait = self._period - (self._active + self._warmup)
    warmup = self._warmup
    active = self._active

    with tprof.profile(
            activities=[
                tprof.ProfilerActivity.CPU,
                tprof.ProfilerActivity.CUDA
            ],
            schedule=tprof.schedule(wait=wait, warmup=warmup, active=active),
            on_trace_ready=self._dump_trace,
            record_shapes=True,
            with_stack=True
    ) as profiler:
        profiler.step_num = start_step
        yield profiler