Table of Contents

🌐 Distributed Core

The foundational primitives managing the cluster.

🚀 Execution Engine

How to configure and run jobs.

  • Training Loop: The lifecycle of the Trainer, dependency injection, and execution flow.
  • Inference Loop: The lifecycle of distributed inference and forward-only execution.
  • Configuration: Pydantic schemas for configuring jobs, batching, and logging.
  • Interfaces (Providers & Tasks): How to inject your custom Model, Dataset, and Step logic (Train & Infer).
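For a flavor of what Pydantic-based job configuration looks like, here is a minimal sketch; all class and field names are hypothetical illustrations, not d9d's actual schemas (those are documented on the Configuration page).

```python
from pydantic import BaseModel, Field


class BatchingConfig(BaseModel):
    # Per-device batch size; must be positive.
    micro_batch_size: int = Field(8, gt=0)
    gradient_accumulation: int = Field(1, ge=1)


class LoggingConfig(BaseModel):
    level: str = "INFO"
    log_every_n_steps: int = 50


class JobConfig(BaseModel):
    name: str
    batching: BatchingConfig = BatchingConfig()
    logging: LoggingConfig = LoggingConfig()


# Validate a raw dict (e.g. parsed from YAML) into a typed, defaulted config.
cfg = JobConfig.model_validate({"name": "sft-run", "batching": {"micro_batch_size": 16}})
```

Validation happens up front, so a malformed job file fails fast with a readable error instead of crashing mid-training.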

💾 Data & State

Managing data loading and model checkpoints.

  • Model State Mapper: The graph-based transformation engine for checkpoints (transform architectures on-the-fly).
  • Model State I/O: Streaming reader/writers for checkpoints.
  • Datasets: Distributed-aware dataset wrappers and smart bucketing.
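Smart bucketing groups samples of similar sequence length into the same batch to cut padding waste. A conceptual sketch (the helper name, boundaries, and signature are hypothetical, not d9d's API):

```python
from collections import defaultdict


def bucket_by_length(lengths, bucket_size=16, boundaries=(32, 64, 128)):
    """Group sample indices into batches of similar sequence length.

    Each sample goes into the smallest length bucket that fits it; a
    bucket is emitted as a batch once it holds `bucket_size` samples.
    """
    buckets = defaultdict(list)
    batches = []
    for idx, n in enumerate(lengths):
        # Smallest boundary that fits the sequence (longest bucket as fallback).
        bound = next((b for b in boundaries if n <= b), boundaries[-1])
        buckets[bound].append(idx)
        if len(buckets[bound]) == bucket_size:
            batches.append(buckets[bound])
            buckets[bound] = []
    # Flush leftover partial buckets.
    batches += [v for v in buckets.values() if v]
    return batches


print(bucket_by_length([10, 40, 20, 50], bucket_size=2))
```

Because every batch pads only to its bucket boundary rather than the global maximum length, token throughput improves with no change to the model.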

🧠 Modeling & Architecture

Building blocks for modern LLMs.

  • Model Catalogue: Models available directly in d9d.
  • Model Design: Principles for creating compatible models.
  • Modules: Building blocks for implementing compatible models.

⚡ Parallelism

Strategies for distributing computations.

  • Horizontal Parallelism: Data Parallelism, Fully-Sharded Data Parallelism, Expert Parallelism, Tensor Parallelism.
  • Pipeline Parallelism: Vertical scaling, schedules (1F1B, ZeroBubble), and cross-stage communication.
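To make the 1F1B schedule concrete, here is a small, library-agnostic sketch that computes the operation order a single pipeline stage executes (it illustrates the schedule only, not d9d's scheduler API):

```python
def one_f1b_schedule(num_microbatches: int, num_stages: int, stage: int) -> list[str]:
    """Return the ordered ops ("F{i}" / "B{i}") a pipeline stage runs under 1F1B."""
    # Warm-up: earlier stages run extra forwards before their first backward.
    warmup = min(num_stages - stage - 1, num_microbatches)
    ops = [f"F{i}" for i in range(warmup)]
    fwd, bwd = warmup, 0
    # Steady state: alternate one forward with one backward.
    while fwd < num_microbatches:
        ops.append(f"F{fwd}")
        fwd += 1
        ops.append(f"B{bwd}")
        bwd += 1
    # Cool-down: drain the remaining backwards.
    while bwd < num_microbatches:
        ops.append(f"B{bwd}")
        bwd += 1
    return ops


# The last stage alternates immediately; the first stage warms up with
# (num_stages - 1) forwards before its first backward.
print(one_f1b_schedule(4, 4, 3))
print(one_f1b_schedule(4, 4, 0))
```

Interleaving backwards early bounds activation memory: each stage holds at most `num_stages - stage` in-flight forward activations instead of all microbatches, which is the motivation behind 1F1B and its ZeroBubble refinements.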

🔧 Fine-Tuning (PEFT)

Parameter-Efficient Fine-Tuning framework.

📈 Optimization & Metrics

Optimizers, learning-rate schedules, and metrics tracking.

⚙️ Internals

Deep dive into the engine room.