easydel.modules.gpt_oss.gpt_oss_configuration#

GPT-OSS Model Configuration

This module provides configuration classes for the GPT-OSS model, a transformer-based language model with a Mixture of Experts (MoE) architecture. The model features sparse routing, sliding window attention, and parameter sharding for distributed training.

The configuration defines the model hyperparameters and custom sharding specifications for the MoE components.

class easydel.modules.gpt_oss.gpt_oss_configuration.GptOssConfig(num_hidden_layers: int = 36, num_local_experts: int = 128, vocab_size: int = 201088, hidden_size: int = 2880, intermediate_size: int = 2880, head_dim: int = 64, num_attention_heads: int = 64, num_key_value_heads: int = 8, sliding_window: int = 128, rope_theta: float = 150000.0, tie_word_embeddings=False, hidden_act: str = 'silu', initializer_range: float = 0.02, max_position_embeddings=131072, rms_norm_eps: float = 1e-05, rope_scaling=None, attention_dropout: float = 0.0, num_experts_per_tok=4, router_aux_loss_coef: float = 0.9, output_router_logits=False, use_cache=True, layer_types=None, mlp_activations_limit: float = 7.0, **kwargs)#

Bases: EasyDeLBaseConfig

Configuration class for GPT-OSS model.

GPT-OSS is a transformer-based language model featuring:

- Mixture of Experts (MoE) architecture with sparse routing
- Alternating sliding window and full attention layers
- RMSNorm for layer normalization
- Rotary Position Embeddings (RoPE) with optional scaling
- Efficient parameter sharding for distributed training

num_hidden_layers#

Number of transformer layers. Default: 36

Type

int

num_local_experts#

Number of expert networks per MoE layer. Default: 128

Type

int

vocab_size#

Size of the vocabulary. Default: 201088

Type

int

hidden_size#

Dimension of hidden representations. Default: 2880

Type

int

intermediate_size#

Dimension of MLP intermediate layer. Default: 2880

Type

int
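
As a rough sanity check on these sizes, the sketch below estimates how many expert weights the default values imply. It assumes each expert is a gated MLP with gate, up, and down projection matrices and ignores biases and the router; that layout is an assumption, not something stated on this page.

```python
# Defaults taken from the GptOssConfig signature.
hidden_size = 2880
intermediate_size = 2880
num_local_experts = 128
num_hidden_layers = 36

# ASSUMPTION: each expert holds gate, up, and down projection matrices
# (a gated MLP); biases and router weights are ignored.
params_per_expert = 3 * hidden_size * intermediate_size
params_per_layer = num_local_experts * params_per_expert
total_expert_params = num_hidden_layers * params_per_layer

print(f"per expert: {params_per_expert:,}")
print(f"per layer:  {params_per_layer:,}")
print(f"all layers: {total_expert_params:,}")
```

Under this assumption the expert weights dominate the parameter count, which is why the MoE components get their own sharding rules.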

head_dim#

Dimension of each attention head. Default: 64

Type

int

num_attention_heads#

Number of attention heads. Default: 64

Type

int

num_key_value_heads#

Number of key-value heads for GQA. Default: 8

Type

int
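
To make the relationship between head_dim, num_attention_heads, and num_key_value_heads concrete, here is a small arithmetic sketch using the defaults. It assumes the usual grouped-query attention layout, where the query projection has num_attention_heads * head_dim outputs and the key and value projections each have num_key_value_heads * head_dim outputs.

```python
# Defaults taken from the GptOssConfig signature.
num_attention_heads = 64
num_key_value_heads = 8
head_dim = 64
hidden_size = 2880

# In GQA, each key/value head is shared by a group of query heads,
# so the query head count must be a multiple of the KV head count.
assert num_attention_heads % num_key_value_heads == 0
queries_per_kv_head = num_attention_heads // num_key_value_heads

# ASSUMPTION: standard projection widths for grouped-query attention.
q_proj_width = num_attention_heads * head_dim    # query projection output width
kv_proj_width = num_key_value_heads * head_dim   # key (or value) projection output width

print(queries_per_kv_head, q_proj_width, kv_proj_width)
```

With the defaults, the query width (64 * 64) does not equal hidden_size, which is why head_dim is configured separately rather than derived from hidden_size.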

sliding_window#

Size of sliding window for local attention. Default: 128

Type

int
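
The effect of sliding_window can be sketched as a causal mask in which each query position sees only the most recent positions. This is a minimal illustration in plain Python; it assumes the window counts the current token, and the exact boundary convention used by the implementation may differ.

```python
def sliding_window_mask(seq_len: int, window: int) -> list[list[bool]]:
    """mask[i][j] is True when query position i may attend to key position j."""
    return [
        # Causal (j <= i) and within the last `window` positions.
        [(j <= i) and (i - j < window) for j in range(seq_len)]
        for i in range(seq_len)
    ]

# A tiny window keeps the printed mask readable; the config default is 128.
mask = sliding_window_mask(seq_len=6, window=3)
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
```

Layers that use full attention would apply only the causal condition, without the window limit.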

rope_theta#

Base frequency for RoPE. Default: 150000.0

Type

float

tie_word_embeddings#

Whether to tie input/output embeddings. Default: False

Type

bool

hidden_act#

Activation function for MLP. Default: 'silu'

Type

str

initializer_range#

Standard deviation for weight initialization. Default: 0.02

Type

float

max_position_embeddings#

Maximum sequence length. Default: 131072

Type

int

rms_norm_eps#

Epsilon for RMS normalization. Default: 1e-5

Type

float

rope_scaling#

Configuration for RoPE scaling. Default: YaRN scaling with factor 32 (the signature default is None)

Type

dict
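
The following sketch shows what a YaRN-style rope_scaling dictionary might look like and what the factor implies about context length. The key names follow the common Hugging Face rope_scaling layout and are an assumption here; the exact keys EasyDeL expects are not listed on this page.

```python
max_position_embeddings = 131072  # default from the GptOssConfig signature

# ASSUMPTION: key names follow the common Hugging Face rope_scaling layout.
rope_scaling = {"rope_type": "yarn", "factor": 32.0}

# If the scaled context equals max_position_embeddings, the implied
# pre-scaling context length is the maximum divided by the factor.
implied_original_context = int(max_position_embeddings / rope_scaling["factor"])
print(implied_original_context)
```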

attention_dropout#

Dropout rate for attention weights. Default: 0.0

Type

float

num_experts_per_tok#

Number of experts to route each token to. Default: 4

Type

int
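
Sparse routing with num_experts_per_tok can be illustrated for a single token as follows. This is a plain-Python sketch, not the library's router: the function name route_token is hypothetical, and it assumes the softmax is taken over the selected logits only.

```python
import math

def route_token(router_logits: list[float], num_experts_per_tok: int):
    """Pick the top-k experts for one token and normalize their weights."""
    # Indices of the k largest logits, highest first.
    top = sorted(
        range(len(router_logits)),
        key=lambda e: router_logits[e],
        reverse=True,
    )[:num_experts_per_tok]
    # ASSUMPTION: softmax over the selected logits only (numerically stabilized).
    peak = max(router_logits[e] for e in top)
    exps = [math.exp(router_logits[e] - peak) for e in top]
    total = sum(exps)
    return top, [x / total for x in exps]

# Six experts instead of the default 128, to keep the example readable.
experts, weights = route_token([0.1, 2.0, -1.0, 0.5, 1.5, 0.0], num_experts_per_tok=4)
print(experts, [round(w, 3) for w in weights])
```

Only the selected experts run for the token, and their outputs would be combined using the returned weights.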

router_aux_loss_coef#

Coefficient for load balancing auxiliary loss. Default: 0.9

Type

float

output_router_logits#

Whether to output router logits. Default: False

Type

bool

use_cache#

Whether to use key-value caching. Default: True

Type

bool

layer_types#

Attention type for each layer. Default: alternating sliding-window/full attention (the signature default is None)

Type

list
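
An explicit alternating layer_types list could be built as in the sketch below. The string labels and the choice of a sliding-window layer at index 0 mirror the Hugging Face GPT-OSS convention and are assumptions here; check them against the EasyDeL source before relying on them.

```python
num_hidden_layers = 36  # default from the GptOssConfig signature

# ASSUMPTION: labels and starting layer follow the Hugging Face GPT-OSS
# convention; this page does not spell them out.
layer_types = [
    "sliding_attention" if i % 2 == 0 else "full_attention"
    for i in range(num_hidden_layers)
]

print(layer_types[:4])
```

The list must contain one entry per layer, so its length should equal num_hidden_layers.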

Example

>>> from easydel.modules.gpt_oss.gpt_oss_configuration import GptOssConfig
>>> config = GptOssConfig(
...     num_hidden_layers=24,
...     num_local_experts=64,
...     hidden_size=2048,
...     num_attention_heads=32
... )
>>> model = GptOssForCausalLM(config)

get_partition_rules(*args, **kwargs)#

Get the partition rules for distributed training of GPT-OSS model.

Returns partition specifications for different parameter groups to enable efficient model parallelism. The rules specify how to shard parameters across devices for:

- Embeddings: Column-wise sharding
- Attention: Column-wise for QKV, row-wise for output projection
- MoE: Custom expert-parallel sharding for expert parameters
- Normalization: Replicated across devices

Returns

Partition rules as (regex_pattern, PartitionSpec) pairs

Return type

tuple
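
The sketch below illustrates how a tuple of (regex_pattern, spec) pairs is typically consumed: each parameter name is tested against the patterns in order, and the first match decides the sharding. The patterns and parameter names are hypothetical, plain strings stand in for PartitionSpec objects so the example runs without JAX, and first-match-wins is an assumption about the matching policy.

```python
import re

# HYPOTHETICAL rules: patterns are illustrative only, and string labels
# stand in for the PartitionSpec objects the real rules would carry.
rules = (
    (r"embed_tokens/embedding", "column_wise"),
    (r"self_attn/(q|k|v)_proj/kernel", "column_wise"),
    (r"self_attn/o_proj/kernel", "row_wise"),
    (r".*norm.*", "replicated"),
    (r".*", "replicated"),  # fallback for anything not matched above
)

def spec_for(param_name: str, rules) -> str:
    """Return the spec of the first rule whose regex matches the name."""
    for pattern, spec in rules:
        if re.search(pattern, param_name):
            return spec
    raise KeyError(f"no partition rule matches {param_name!r}")

print(spec_for("model/layers/0/self_attn/q_proj/kernel", rules))
print(spec_for("model/layers/0/self_attn/o_proj/kernel", rules))
print(spec_for("model/layers/0/input_layernorm/kernel", rules))
```

Because matching stops at the first hit, more specific patterns should come before general ones such as the catch-all at the end.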

model_type: str = 'gpt_oss'#