
Distributed Profiling

Internal API Warning

If you are using the standard d9d training infrastructure, you do not need to call these functions manually: the framework handles profiling automatically based on its configuration. This package is primarily intended for users extending d9d.

About

The d9d.internals.profiling package provides a distributed-aware wrapper around the standard PyTorch Profiler.

In large-scale distributed training, profiling is complicated by:

  1. File Naming: Thousands of ranks writing to the same filename cause race conditions.
  2. Storage Space: Raw Chrome trace JSON files can grow to gigabytes very quickly.
  3. Synchronization: Getting all ranks to profile the same step normally requires manual coordination.

The Profiler class solves these issues by automatically handling file naming based on the DeviceMesh coordinates, compressing traces into .tar.gz archives on the fly, and managing the profiling schedule (wait/warmup/active).
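For illustration, the rank-unique naming can be sketched as follows. The helper name and the exact filename format are assumptions, not d9d's actual scheme; the point is that encoding the DeviceMesh coordinates into the name prevents any two ranks from writing to the same path.

```python
# Sketch (assumed naming scheme, not necessarily d9d's exact format):
# encode the rank's DeviceMesh coordinates into the trace filename so
# that no two ranks ever write to the same path.

def trace_filename(mesh_coords: tuple[int, ...], step: int) -> str:
    """Build a collision-free trace name from mesh coordinates and step."""
    coord = "_".join(str(c) for c in mesh_coords)
    return f"trace_mesh{coord}_step{step}.json"

print(trace_filename((1, 0, 3), 100))  # trace_mesh1_0_3_step100.json
```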

d9d.internals.profiling

Exposes the internal distributed profiler.

Profiler

Manages distributed performance profiling using PyTorch Profiler.

This class wraps torch.profiler to provide automatic trace exporting, compression, and file naming consistent with the distributed DeviceMesh topology. It configures the schedule to repeat periodically based on the provided step counts.

__init__(save_dir, period_steps, warmup_steps, active_steps, dist_context)

Constructs a Profiler object.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `save_dir` | `Path` | Directory where trace files will be saved. | *required* |
| `period_steps` | `int` | Total length of a profiling cycle (wait + warmup + active). | *required* |
| `warmup_steps` | `int` | Number of steps to ignore before recording, to allow for warm-up. | *required* |
| `active_steps` | `int` | Number of steps to actively record traces. | *required* |
| `dist_context` | `DistributedContext` | The distributed context object. | *required* |
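A minimal sketch of how these step counts could map onto `torch.profiler.schedule` arguments. The helper name and return shape are assumptions; the arithmetic follows from the parameter descriptions above, with the wait phase being whatever remains of the cycle after warmup and active.

```python
# Sketch: derive torch.profiler.schedule arguments from the constructor
# parameters. The wait phase is whatever is left of the cycle after the
# warmup and active phases.

def schedule_args(period_steps: int, warmup_steps: int, active_steps: int) -> dict:
    wait_steps = period_steps - warmup_steps - active_steps
    if wait_steps < 0:
        raise ValueError("period_steps must cover warmup_steps + active_steps")
    # repeat=0 makes the cycle repeat indefinitely in torch.profiler.schedule
    return {"wait": wait_steps, "warmup": warmup_steps, "active": active_steps, "repeat": 0}

print(schedule_args(period_steps=100, warmup_steps=5, active_steps=3))
# {'wait': 92, 'warmup': 5, 'active': 3, 'repeat': 0}
```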

open(start_step)

Opens a context manager for profiling execution.

This sets up the torch.profiler.profile with a schedule derived from the initialization parameters. It captures both CPU and CUDA activities, records shapes, and tracks stack traces.

When the schedule triggers on_trace_ready, the trace is automatically exported to the save_dir, compressed into a .tar.gz file, and the raw JSON is removed to save space.
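The export-compress-delete step can be sketched with the standard library alone. The hook name `compress_trace` and the archive naming are hypothetical stand-ins for whatever d9d does internally.

```python
# Sketch (hypothetical helper, not d9d's implementation): pack the raw
# Chrome trace into a .tar.gz and delete the JSON to save space.
import json
import tarfile
import tempfile
from pathlib import Path

def compress_trace(json_path: Path) -> Path:
    """Archive the raw trace as <name>.tar.gz and remove the original."""
    archive_path = json_path.with_name(json_path.name + ".tar.gz")
    with tarfile.open(archive_path, "w:gz") as tar:
        tar.add(json_path, arcname=json_path.name)
    json_path.unlink()  # raw traces can reach gigabytes; keep only the archive
    return archive_path

with tempfile.TemporaryDirectory() as save_dir:
    trace = Path(save_dir) / "trace_mesh0_step100.json"
    trace.write_text(json.dumps({"traceEvents": []}))
    archive = compress_trace(trace)
    print(archive.name, trace.exists())  # trace_mesh0_step100.json.tar.gz False
```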

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `start_step` | `int` | The current global step number used to initialize the profiler state. | *required* |

Yields:

The configured torch profiler instance.
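How the schedule cycles as the yielded profiler is stepped can be modeled in plain Python. This is a simplified model of `torch.profiler.schedule` phase behavior under the wait/warmup/active parameters, not d9d code; each call to the profiler's `step()` advances one position in the cycle.

```python
# Simplified model of the repeating wait/warmup/active cycle that the
# profiler schedule follows; each prof.step() call advances one step.

def phase(step: int, wait: int, warmup: int, active: int) -> str:
    pos = step % (wait + warmup + active)
    if pos < wait:
        return "wait"      # profiler idle
    if pos < wait + warmup:
        return "warmup"    # tracing on, results discarded
    return "active"        # tracing on, trace exported at cycle end

print([phase(s, wait=2, warmup=1, active=2) for s in range(6)])
# ['wait', 'wait', 'warmup', 'active', 'active', 'wait']
```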