Table of Contents
🌐 Distributed Core
The foundational primitives managing the cluster.
- Distributed Context: The Source of Truth for topology. Understanding `DeviceMesh` domains (dense, expert, batch).
- Distributed Operations: Utilities for gathering variable-length tensors and objects.
- PyTree Sharding: Utilities for splitting complex nested structures across ranks.
- Typing Extensions: Python type annotations for common objects and structures.
🚀 Execution Engine
How to configure and run jobs.
- Training Loop: The lifecycle of the `Trainer`, dependency injection, and execution flow.
- Inference Loop: The lifecycle of distributed `Inference` and forward-only execution.
- Configuration: Pydantic schemas for configuring jobs, batching, and logging.
- Interfaces (Providers & Tasks): How to inject your custom Model, Dataset, and Step logic (Train & Infer).
💾 Data & State
Managing data loading and model checkpoints.
- Model State Mapper: The graph-based transformation engine for checkpoints (transform architectures on-the-fly).
- Model State I/O: Streaming reader/writers for checkpoints.
- Datasets: Distributed-aware dataset wrappers and smart bucketing.
🧠 Modeling & Architecture
Building blocks for modern LLMs.
- Model Catalogue: Models available directly in d9d.
- Model Design: Principles for creating compatible models.
- Modules: Building blocks for implementing compatible models.
⚡ Parallelism
Strategies for distributing computations.
- Horizontal Parallelism: Data Parallelism, Fully-Sharded Data Parallelism, Expert Parallelism, Tensor Parallelism.
- Pipeline Parallelism: Vertical scaling, schedules (1F1B, ZeroBubble), and cross-stage communication.
🔧 Fine-Tuning (PEFT)
Parameter-Efficient Fine-Tuning framework.
- Overview: Injection lifecycle and state mapping.
- Methods: LoRA, Full Tune, and Method Stacking.
📈 Optimization & Metrics
- Metrics: Distributed-aware statistic accumulation.
- Metric Catalogue: Ready-to-use metric implementations.
- Custom Metrics: Implementing custom metrics.
- Experiment Tracking: Integration with logging backends (WandB, Aim).
- Piecewise Scheduler: Composable LR schedules and visualization.
- Stochastic Optimizers: Low-precision training using stochastic rounding.
⚙️ Internals
Deep dive into the engine room.
- AutoGrad Extensions: How we do split-backward for Pipeline Parallelism.
- Pipelining Internals: How the VM and Schedules work.
- Gradient Sync: Custom backward hooks for overlapping comms.
- Gradient Norm & Clipping: Correct global norm calculation across hybrid meshes.
- Metric Collection: Custom overlapped metric synchronization & computation.
- Pipeline State: Context switching between Global and Microbatch scopes.
- Determinism.
- Profiling.