About
The d9d.module.block.moe package provides a complete, high-performance implementation of Sparse Mixture-of-Experts layers.
Expert Parallelism
For information on setting up Expert Parallelism, see this page.
Features
Sparse Expert Router
TopKRouter is a learnable router implementation.
It computes routing probabilities in FP32 to ensure numeric stability.
Sparse Expert Token Dispatcher
ExpertCommunicationHandler is the abstract messaging layer; concrete handlers decide how tokens move between devices.
NoCommunicationHandler is used by default for single-GPU or Tensor Parallel setups where no token movement is needed.
DeepEpCommunicationHandler is enabled when Expert Parallelism is used. It relies on the DeepEP library for highly optimized all-to-all communication over NVLink/RDMA, enabling scaling to thousands of experts.
Sparse Experts
GroupedSwiGLU implements the sparse SwiGLU experts module.
Instead of looping over experts, it uses Grouped GEMM kernels to execute all experts in parallel, regardless of how many tokens each expert received.
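For intuition, the grouped execution is numerically equivalent to looping over experts and applying each expert's weight matrix to its contiguous slice of the expert-sorted token buffer. The sketch below shows only this reference semantics with plain PyTorch tensors; it is not the library's kernel, which fuses the whole loop into a single Grouped GEMM launch.

```python
import torch

def grouped_mm_reference(x, weights, tokens_per_group):
    """Reference semantics of a grouped GEMM: group g's weight matrix is applied
    to its contiguous slice of the expert-sorted token buffer `x`."""
    outputs, start = [], 0
    for g, count in enumerate(tokens_per_group.tolist()):
        outputs.append(x[start:start + count] @ weights[g])
        start += count
    return torch.cat(outputs, dim=0)

# 3 experts with uneven token counts (4, 0, 2), projecting 8 -> 16 features.
x = torch.randn(6, 8)
weights = torch.randn(3, 8, 16)
tokens_per_group = torch.tensor([4, 0, 2])  # CPU tensor, sums to x.shape[0]
out = grouped_mm_reference(x, weights, tokens_per_group)  # shape (6, 16)
```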
Shared Experts
Currently not supported; feel free to contribute :)
d9d.module.block.moe
Provides building blocks for Mixture-of-Experts (MoE) architectures.
GroupedLinear
Bases: Module, ModuleLateInit
Applies a linear transformation using Grouped GEMM (Generalized Matrix Multiplication).
This module allows efficient execution of multiple linear layers (experts) in parallel, where each expert processes a variable number of tokens. It is the computational core of the Mixture-of-Experts layer.
Source code in d9d/module/block/moe/grouped_linear.py
__init__(n_groups, in_features, out_features, device=None, dtype=None)
Constructs the GroupedLinear layer.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| n_groups | int | Number of groups (experts). | required |
| in_features | int | Input hidden size. | required |
| out_features | int | Output hidden size. | required |
| device | device \| str \| None | Target device. | None |
| dtype | dtype \| None | Target data type. | None |
Source code in d9d/module/block/moe/grouped_linear.py
forward(x, x_groups)
Performs the grouped matrix multiplication.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| x | Tensor | Flattened input tensor containing tokens for all groups. | required |
| x_groups | Tensor | CPU Tensor indicating the number of tokens assigned to each group. Must sum to the number of tokens in x. | required |

Returns:

| Type | Description |
|---|---|
| Tensor | The output tensor. |
Source code in d9d/module/block/moe/grouped_linear.py
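A minimal usage sketch, assuming the class is importable from the source file shown above and that calling reset_parameters() is enough to materialize the weights (the ModuleLateInit details are not covered on this page):

```python
import torch
from d9d.module.block.moe.grouped_linear import GroupedLinear  # assumed import path

layer = GroupedLinear(n_groups=4, in_features=64, out_features=128)
layer.reset_parameters()  # assumption: materializes/initializes the weights

# Tokens must already be sorted so each group's rows form one contiguous block.
x = torch.randn(10, 64)
x_groups = torch.tensor([3, 0, 5, 2])  # CPU tensor; sums to x.shape[0]
y = layer(x, x_groups)                 # output has 10 rows with 128 features each
```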
reset_parameters()
Initializes weights using a uniform distribution based on input features.
Source code in d9d/module/block/moe/grouped_linear.py
GroupedSwiGLU
Bases: Module, ModuleLateInit
Executes a collection of SwiGLU experts efficiently using Grouped GEMM.
This module implements the architectural pattern: down_proj(SiLU(gate_proj(x)) * up_proj(x)).
It applies this operation across multiple discrete experts in parallel without padding or masking.
Source code in d9d/module/block/moe/grouped_experts.py
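For reference, the per-expert computation named above corresponds to the following single-expert sketch with plain tensors; GroupedSwiGLU runs it for every expert at once via grouped GEMMs rather than in this explicit form.

```python
import torch
import torch.nn.functional as F

def swiglu_reference(x, w_gate, w_up, w_down):
    # down_proj(SiLU(gate_proj(x)) * up_proj(x)) for a single expert.
    return (F.silu(x @ w_gate) * (x @ w_up)) @ w_down
```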
__init__(hidden_dim, intermediate_dim, num_experts)
Constructs the GroupedSwiGLU module.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| hidden_dim | int | Dimensionality of the input and output hidden states. | required |
| intermediate_dim | int | Dimensionality of the intermediate projection. | required |
| num_experts | int | Total number of experts managed by this local instance. | required |
Source code in d9d/module/block/moe/grouped_experts.py
forward(permuted_x, permuted_probs, tokens_per_expert)
Computes expert outputs for sorted input tokens.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| permuted_x | Tensor | Input tokens sorted by their assigned expert. | required |
| permuted_probs | Tensor | Routing weights/probabilities corresponding to the sorted tokens. | required |
| tokens_per_expert | Tensor | Number of tokens assigned to each consecutive expert. It is a CPU tensor. | required |

Returns:

| Type | Description |
|---|---|
| Tensor | The computed and weighted output tokens (still permuted). |
Source code in d9d/module/block/moe/grouped_experts.py
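A hedged end-to-end sketch of feeding this module: it builds the permuted inputs by replicating each token once per selected expert and sorting by expert id. The import path, and the assumption that reset_parameters() is enough to materialize the weights, are mine rather than guarantees of the library.

```python
import torch
from d9d.module.block.moe.grouped_experts import GroupedSwiGLU  # assumed import path

num_experts, hidden_dim, top_k = 4, 64, 2
experts = GroupedSwiGLU(hidden_dim=hidden_dim, intermediate_dim=128, num_experts=num_experts)
experts.reset_parameters()  # assumption: materializes/initializes the weights

# Fake routing results for 8 tokens.
tokens = torch.randn(8, hidden_dim)
topk_ids = torch.randint(0, num_experts, (8, top_k))
topk_probs = torch.rand(8, top_k)

# Replicate each token once per selected expert, then sort by expert id so that
# every expert's tokens occupy one contiguous block.
flat_ids = topk_ids.reshape(-1)
order = torch.argsort(flat_ids, stable=True)
permuted_x = tokens.repeat_interleave(top_k, dim=0)[order]
permuted_probs = topk_probs.reshape(-1)[order]
tokens_per_expert = torch.bincount(flat_ids, minlength=num_experts).cpu()

out = experts(permuted_x, permuted_probs, tokens_per_expert)  # still in permuted order
```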
reset_parameters()
Resets parameters for all internal linear projections.
Source code in d9d/module/block/moe/grouped_experts.py
MoELayer
Bases: Module, ModuleLateInit
A complete Mixture-of-Experts (MoE) block comprising routing, communication, and computation.
This layer integrates:
- Router: Selects experts for each token.
- Communicator: Handles token dispatch to local or remote experts (EP).
- Experts: Performs parallelized computation (Grouped SwiGLU).
Source code in d9d/module/block/moe/layer.py
__init__(hidden_dim, intermediate_dim_grouped, num_grouped_experts, top_k, router_renormalize_probabilities)
Constructs the MoELayer.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| hidden_dim | int | Hidden size. | required |
| intermediate_dim_grouped | int | Intermediate dimension for the Expert FFNs. | required |
| num_grouped_experts | int | Total number of experts. | required |
| top_k | int | Number of experts to route each token to. | required |
| router_renormalize_probabilities | bool | Configures router probability normalization behavior. | required |
Source code in d9d/module/block/moe/layer.py
enable_distributed_communicator(group)
Switches from local no-op communication to distributed DeepEP communication.
This should be called during model initialization if the model is running in a distributed Expert Parallel environment.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| group | ProcessGroup | The PyTorch process group spanning the expert parallel ranks. | required |
Source code in d9d/module/block/moe/layer.py
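A hedged sketch of calling this during model setup, assuming torch.distributed is already initialized (e.g. via torchrun) and that `moe_layer` is a constructed MoELayer; how the expert-parallel group is derived from the overall parallelism layout is outside the scope of this snippet.

```python
import torch.distributed as dist

# Example: use every rank as an expert-parallel rank. Real setups typically
# carve the EP group out of a larger device mesh.
ep_group = dist.new_group(ranks=list(range(dist.get_world_size())))

moe_layer.enable_distributed_communicator(ep_group)  # moe_layer: a constructed MoELayer
```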
forward(hidden_states)
Routes tokens to experts, computes, and combines results.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| hidden_states | Tensor | Input tensor. | required |

Returns:

| Type | Description |
|---|---|
| Tensor | Output tensor combined from experts. |
Source code in d9d/module/block/moe/layer.py
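A minimal single-device usage sketch, assuming the import path below and a (batch, sequence, hidden) input layout; the original shape annotation is not reproduced on this page, so the layout is an assumption.

```python
import torch
from d9d.module.block.moe.layer import MoELayer  # assumed import path

moe = MoELayer(
    hidden_dim=512,
    intermediate_dim_grouped=1024,
    num_grouped_experts=8,
    top_k=2,
    router_renormalize_probabilities=True,
)
moe.reset_parameters()  # assumption: materializes/initializes the weights

hidden_states = torch.randn(4, 16, 512)  # assumed (batch, seq, hidden) layout
output = moe(hidden_states)              # combined expert outputs, same shape as input
```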
reset_parameters()
Resets module parameters.
Source code in d9d/module/block/moe/layer.py
reset_stats()
Resets the expert load balancing counters.
Source code in d9d/module/block/moe/layer.py
TopKRouter
Bases: Module, ModuleLateInit
Selects the top-K experts based on a learned gating mechanism.
This router:
- Projects input tokens into expert space
- Applies softmax, optionally adding an expert bias to influence selection
- Selects the experts with the highest probabilities
- Re-normalizes the selected probabilities to sum to 1, if configured
Source code in d9d/module/block/moe/router.py
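The described behaviour roughly corresponds to the sketch below (FP32 gating, optional bias added for selection, optional renormalization). It illustrates the bullet points above and is not the library's code; in particular, returning the unbiased probabilities while selecting on the biased scores follows the usual loss-free load-balancing convention and is an assumption on my part.

```python
import torch

def topk_route(hidden_states, gate_weight, top_k,
               renormalize=True, expert_bias=None):
    # Gate projection and softmax in FP32 for numeric stability.
    logits = hidden_states.float() @ gate_weight.float().t()
    probs = torch.softmax(logits, dim=-1)
    scores = probs + expert_bias if expert_bias is not None else probs
    # Select experts by (possibly biased) score, but keep the unbiased probabilities.
    _, topk_ids = torch.topk(scores, top_k, dim=-1)
    topk_probs = probs.gather(-1, topk_ids)
    if renormalize:
        topk_probs = topk_probs / topk_probs.sum(dim=-1, keepdim=True)
    return topk_ids, topk_probs
```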
__init__(dim, num_experts, top_k, renormalize_probabilities, enable_expert_bias=False)
Constructs the TopKRouter.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| dim | int | Input feature dimensionality. | required |
| num_experts | int | Total number of experts to choose from. | required |
| top_k | int | Number of experts to select for each token. | required |
| renormalize_probabilities | bool | If True, probabilities of selected experts will be renormalized to sum up to 1. | required |
| enable_expert_bias | bool | If True, adds a bias term to the routing scores before top-k selection. This can be used for loss-free load balancing. | False |
Source code in d9d/module/block/moe/router.py
forward(hidden_states)
Calculates routing decisions for the input tokens.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| hidden_states | Tensor | Input tokens. | required |

Returns:

| Type | Description |
|---|---|
| tuple[Tensor, Tensor] | A tuple containing the selected top-k expert indices and the corresponding routing probabilities for each token. |
Source code in d9d/module/block/moe/router.py
reset_parameters()
Resets module parameters.
Source code in d9d/module/block/moe/router.py
d9d.module.block.moe.communications
Provides communication strategies for Mixture-of-Experts routing operations.
DeepEpCommunicationHandler
Bases: ExpertCommunicationHandler
Handles MoE communication using the high-performance DeepEP library.
Source code in d9d/module/block/moe/communications/deepep.py
__init__(num_experts)
Constructs the DeepEpCommunicationHandler.
Source code in d9d/module/block/moe/communications/deepep.py
setup(group, hidden_size, hidden_dtype)
Initializes the backend buffer and calculates expert sharding.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| group | ProcessGroup | The process group containing all experts. | required |
| hidden_size | int | Dimensionality of the hidden states. | required |
| hidden_dtype | dtype | Data type of the hidden states. | required |
Source code in d9d/module/block/moe/communications/deepep.py
ExpertCommunicationHandler
Bases: ABC
Abstract base class for Mixture-of-Experts communication strategies.
Source code in d9d/module/block/moe/communications/base.py
combine(hidden_states)
abstractmethod
Restores hidden states to their original order and location.
Undoes the permutation and performs the reverse All-to-All communication to return processed results to the workers that originated the requests.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| hidden_states | Tensor | The processed hidden states. | required |

Returns:

| Type | Description |
|---|---|
| Tensor | The combined hidden states with the original shape and order. |
Source code in d9d/module/block/moe/communications/base.py
dispatch(hidden_states, topk_ids, topk_weights)
abstractmethod
Prepares and routes local hidden states to their target experts (possibly on other workers).
This process involves:
- All-to-All Communication: Transfers hidden states to workers containing the assigned experts. States assigned to multiple experts are replicated.
- Permutation: Sorts tokens by expert ID to prepare for Grouped GEMM.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| hidden_states | Tensor | Input tokens. | required |
| topk_ids | Tensor | Indices of the top-k experts selected for each token. | required |
| topk_weights | Tensor | Routing weights associated with the selected experts. | required |

Returns:

| Type | Description |
|---|---|
| tuple[Tensor, Tensor, Tensor] | A tuple containing the dispatched (permuted) hidden states together with the per-expert routing weights and token counts consumed by the grouped experts. |
Source code in d9d/module/block/moe/communications/base.py
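The interface is normally used as a dispatch → expert compute → combine round trip inside the MoE layer. A sketch under the assumption that dispatch returns (permuted hidden states, permuted routing weights, tokens per local expert) in that order, with `handler` and `experts` standing for an already constructed ExpertCommunicationHandler and GroupedSwiGLU:

```python
# Hypothetical round trip; the tuple order below is an assumption.
permuted_x, permuted_probs, tokens_per_expert = handler.dispatch(
    hidden_states, topk_ids, topk_weights
)
expert_out = experts(permuted_x, permuted_probs, tokens_per_expert)
output = handler.combine(expert_out)  # restored to the original order and owning ranks
```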
NoCommunicationHandler
Bases: ExpertCommunicationHandler
Handles MoE routing within a single device or when no cross-device routing is needed.
This handler performs no network operations; it only permutes tokens locally, for logical grouping and debugging.
Source code in d9d/module/block/moe/communications/naive.py
__init__(num_experts)
Constructs the NoCommunicationHandler.
Source code in d9d/module/block/moe/communications/naive.py