easydel.modules.gpt_oss.gpt_oss_configuration#
GPT-OSS Model Configuration
This module provides configuration classes for the GPT-OSS model, a transformer-based language model with a Mixture of Experts (MoE) architecture. The model features sparse routing, sliding window attention, and efficient parameter sharding for distributed training.
The configuration includes custom sharding specifications for MoE components and comprehensive model hyperparameters.
- class easydel.modules.gpt_oss.gpt_oss_configuration.GptOssConfig(num_hidden_layers: int = 36, num_local_experts: int = 128, vocab_size: int = 201088, hidden_size: int = 2880, intermediate_size: int = 2880, head_dim: int = 64, num_attention_heads: int = 64, num_key_value_heads: int = 8, sliding_window: int = 128, rope_theta: float = 150000.0, tie_word_embeddings=False, hidden_act: str = 'silu', initializer_range: float = 0.02, max_position_embeddings=131072, rms_norm_eps: float = 1e-05, rope_scaling=None, attention_dropout: float = 0.0, num_experts_per_tok=4, router_aux_loss_coef: float = 0.9, output_router_logits=False, use_cache=True, layer_types=None, mlp_activations_limit: float = 7.0, **kwargs)[source]#
Bases:
EasyDeLBaseConfig
Configuration class for the GPT-OSS model.
GPT-OSS is a transformer-based language model featuring:
- Mixture of Experts (MoE) architecture with sparse routing
- Alternating sliding window and full attention layers
- RMSNorm for layer normalization
- Rotary Position Embeddings (RoPE) with optional scaling
- Efficient parameter sharding for distributed training
- num_hidden_layers#
Number of transformer layers. Default: 36
- Type
int
- num_local_experts#
Number of expert networks per MoE layer. Default: 128
- Type
int
- vocab_size#
Size of the vocabulary. Default: 201088
- Type
int
- hidden_size#
Dimension of hidden representations. Default: 2880
- Type
int
- intermediate_size#
Dimension of MLP intermediate layer. Default: 2880
- Type
int
- head_dim#
Dimension of each attention head. Default: 64
- Type
int
- num_attention_heads#
Number of attention heads. Default: 64
- Type
int
- num_key_value_heads#
Number of key-value heads for GQA. Default: 8
- Type
int
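The head-related defaults above imply the projection widths worked out below. This is plain arithmetic on the documented defaults; the variable names `q_width`, `kv_width`, and `group_size` are illustrative and not part of the EasyDeL API. Note that the query width (num_attention_heads times head_dim) does not have to equal hidden_size.

```python
# Arithmetic on the documented defaults; names are illustrative only.
hidden_size = 2880
head_dim = 64
num_attention_heads = 64
num_key_value_heads = 8

# Total width of the query projection output.
q_width = num_attention_heads * head_dim
# Total width of the key (or value) projection output under GQA.
kv_width = num_key_value_heads * head_dim
# Number of query heads that share one key-value head.
group_size = num_attention_heads // num_key_value_heads

print(q_width, kv_width, group_size)  # 4096 512 8
```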
- sliding_window#
Size of sliding window for local attention. Default: 128
- Type
int
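As a rough illustration of what a sliding window means for local attention, the sketch below builds a causal mask in which each query position sees at most `window` keys, itself included. The helper `sliding_window_mask` is hypothetical, and the exact boundary convention used by EasyDeL is not stated here, so treat the off-by-one choice as an assumption.

```python
def sliding_window_mask(seq_len: int, window: int) -> list[list[bool]]:
    """Return mask[q][k] == True where query q may attend to key k.

    Assumption: the window is causal and counts the current token,
    so query q sees keys q - window + 1 .. q.
    """
    return [
        [k <= q and q - k < window for k in range(seq_len)]
        for q in range(seq_len)
    ]


mask = sliding_window_mask(seq_len=6, window=3)
for row in mask:
    # "x" marks an allowed (query, key) pair, "." a masked one.
    print("".join("x" if allowed else "." for allowed in row))
```

With the default sliding_window of 128, the same rule would limit each query in a sliding layer to the 128 most recent positions.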
- rope_theta#
Base frequency for RoPE. Default: 150000.0
- Type
float
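For orientation, the standard unscaled RoPE formulation derives one rotation frequency per pair of head dimensions from the base: the i-th inverse frequency is theta raised to the power -2i/head_dim. The sketch below applies that formula to the documented defaults. It ignores any rope_scaling, and `rope_inverse_frequencies` is a hypothetical helper rather than an EasyDeL function.

```python
def rope_inverse_frequencies(head_dim: int, theta: float) -> list[float]:
    """Standard unscaled RoPE inverse frequencies, one per dimension pair."""
    return [theta ** (-(2 * i) / head_dim) for i in range(head_dim // 2)]


inv_freq = rope_inverse_frequencies(head_dim=64, theta=150000.0)
# The first pair rotates fastest (frequency 1.0); later pairs rotate more slowly.
print(len(inv_freq), inv_freq[0], inv_freq[-1])
```

A larger rope_theta stretches the slowest wavelengths, which is generally why long-context models use a large base.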
- tie_word_embeddings#
Whether to tie input/output embeddings. Default: False
- Type
bool
- hidden_act#
Activation function for MLP. Default: “silu”
- Type
str
- initializer_range#
Standard deviation for weight initialization. Default: 0.02
- Type
float
- max_position_embeddings#
Maximum sequence length. Default: 131072
- Type
int
- rms_norm_eps#
Epsilon for RMS normalization. Default: 1e-5
- Type
float
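To show where the epsilon enters, here is a minimal pure-Python sketch of RMS normalization over one vector. It follows the usual RMSNorm definition (divide by the root mean square, then apply a learned per-feature weight); `rms_norm` is an illustrative helper, not the EasyDeL layer.

```python
import math


def rms_norm(x: list[float], weight: list[float], eps: float = 1e-5) -> list[float]:
    """Scale x by the reciprocal of its root mean square, then by weight."""
    mean_square = sum(v * v for v in x) / len(x)
    # eps keeps the division stable when x is close to all zeros.
    inv_rms = 1.0 / math.sqrt(mean_square + eps)
    return [w * v * inv_rms for v, w in zip(x, weight)]


out = rms_norm([3.0, 4.0], weight=[1.0, 1.0])
print(out)
```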
- rope_scaling#
Configuration for RoPE scaling. Documented default: YaRN scaling with factor 32 (the constructor signature shows None).
- Type
dict
- attention_dropout#
Dropout rate for attention weights. Default: 0.0
- Type
float
- num_experts_per_tok#
Number of experts to route each token to. Default: 4
- Type
int
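The sketch below illustrates one common way sparse routing uses this value: pick the num_experts_per_tok highest-scoring experts for a token, then normalize their scores with a softmax over the selected experts only. Whether the softmax is applied before or after selection varies between MoE implementations, so the ordering here is an assumption, and `route_token` is a hypothetical helper.

```python
import math


def route_token(router_logits: list[float], num_experts_per_tok: int) -> list[tuple[int, float]]:
    """Return (expert_index, weight) pairs for the top-k experts of one token."""
    # Indices of the k largest logits, highest first.
    top = sorted(
        range(len(router_logits)),
        key=lambda i: router_logits[i],
        reverse=True,
    )[:num_experts_per_tok]
    # Softmax over the selected logits only (assumed ordering, see lead-in).
    max_logit = max(router_logits[i] for i in top)
    exps = [math.exp(router_logits[i] - max_logit) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]


# Toy example with 8 experts instead of the default 128.
logits = [0.1, 2.0, -1.0, 0.5, 1.5, 0.0, -0.5, 1.0]
selected = route_token(logits, num_experts_per_tok=4)
print(selected)
```

Only the selected experts run for the token, so compute per token scales with num_experts_per_tok rather than num_local_experts.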
- router_aux_loss_coef#
Coefficient for load balancing auxiliary loss. Default: 0.9
- Type
float
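As an illustration of what such a coefficient scales, the sketch below computes a Switch-Transformer-style load balancing term: the number of experts times the sum, over experts, of the fraction of routed assignments and the mean router probability. This is a generic formulation chosen for illustration; the exact loss used by EasyDeL may differ, and `load_balancing_loss` is a hypothetical helper.

```python
def load_balancing_loss(
    assignments: list[list[int]],
    router_probs: list[list[float]],
    num_experts: int,
) -> float:
    """Switch-style auxiliary loss; equals 1.0 under perfectly uniform routing.

    assignments[t] lists the experts chosen for token t.
    router_probs[t][e] is the router probability of expert e for token t.
    """
    total_assignments = sum(len(chosen) for chosen in assignments)
    # Fraction of all routed assignments that went to each expert.
    fraction = [0.0] * num_experts
    for chosen in assignments:
        for expert in chosen:
            fraction[expert] += 1.0 / total_assignments
    # Mean router probability per expert across tokens.
    mean_prob = [
        sum(p[e] for p in router_probs) / len(router_probs)
        for e in range(num_experts)
    ]
    return num_experts * sum(f * p for f, p in zip(fraction, mean_prob))


router_aux_loss_coef = 0.9  # documented default
balanced = load_balancing_loss([[0], [1]], [[0.5, 0.5], [0.5, 0.5]], num_experts=2)
skewed = load_balancing_loss([[0], [0]], [[0.9, 0.1], [0.9, 0.1]], num_experts=2)
print(router_aux_loss_coef * balanced, router_aux_loss_coef * skewed)
```

The term grows when routing collapses onto a few experts, so adding it to the training loss pushes the router toward an even spread.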
- output_router_logits#
Whether to output router logits. Default: False
- Type
bool
- use_cache#
Whether to use key-value caching. Default: True
- Type
bool
- layer_types#
Attention type for each layer. Documented default: alternating sliding window and full attention (the constructor signature shows None).
- Type
list
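A minimal sketch of what an alternating sliding/full pattern could look like when layer_types is left unset. The label strings "sliding_attention" and "full_attention", and the choice that the first layer is a sliding one, are assumptions made for illustration; check the EasyDeL source for the actual values.

```python
def alternating_layer_types(num_hidden_layers: int) -> list[str]:
    """Build an alternating attention pattern, one entry per layer.

    Assumption: even-indexed layers use sliding window attention and
    odd-indexed layers use full attention.
    """
    return [
        "sliding_attention" if i % 2 == 0 else "full_attention"
        for i in range(num_hidden_layers)
    ]


layer_types = alternating_layer_types(36)
print(layer_types[:4])
```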
Example
>>> config = GptOssConfig(
...     num_hidden_layers=24,
...     num_local_experts=64,
...     hidden_size=2048,
...     num_attention_heads=32
... )
>>> model = GptOssForCausalLM(config)
- get_partition_rules(*args, **kwargs)[source]#
Get the partition rules for distributed training of GPT-OSS model.
Returns partition specifications for different parameter groups to enable efficient model parallelism. The rules specify how to shard parameters across devices for:
- Embeddings: Column-wise sharding
- Attention: Column-wise for QKV, row-wise for output projection
- MoE: Custom expert-parallel sharding for expert parameters
- Normalization: Replicated across devices
- Returns
Partition rules as (regex_pattern, PartitionSpec) pairs
- Return type
tuple
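To make the (regex_pattern, PartitionSpec) shape concrete, the sketch below resolves a parameter path against an ordered tuple of rules, where the first matching pattern wins. Everything in it is hypothetical: plain tuples of axis names stand in for PartitionSpec objects, and the patterns, the parameter paths, and the "tp" axis name are invented for illustration rather than taken from EasyDeL.

```python
import re

# Hypothetical rules; tuples of axis names stand in for PartitionSpec objects.
RULES = (
    (r"embed_tokens/embedding", ("tp", None)),
    (r"self_attn/(q_proj|k_proj|v_proj)/kernel", (None, "tp")),
    (r"self_attn/o_proj/kernel", ("tp", None)),
    (r".*norm/(scale|kernel)", (None,)),  # replicated
    (r".*", (None,)),                     # fallback: replicate everything else
)


def match_spec(param_path: str, rules=RULES) -> tuple:
    """Return the spec of the first rule whose pattern matches the path."""
    for pattern, spec in rules:
        if re.search(pattern, param_path):
            return spec
    raise ValueError(f"no partition rule matches {param_path!r}")


print(match_spec("model/layers/0/self_attn/q_proj/kernel"))
print(match_spec("model/layers/0/input_layernorm/scale"))
```

Because matching stops at the first hit, specific patterns belong before the catch-all fallback.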
- model_type: str = 'gpt_oss'#