easydel.modules.gpt_oss.gpt_oss_configuration

GPT-OSS Model Configuration

This module provides configuration classes for the GPT-OSS model, a transformer-based language model with a Mixture of Experts (MoE) architecture. The model features sparse expert routing, sliding window attention, and parameter sharding for distributed training.

The configuration includes custom sharding specifications for MoE components and comprehensive model hyperparameters.

class easydel.modules.gpt_oss.gpt_oss_configuration.GptOssConfig(num_hidden_layers: int = 36, num_local_experts: int = 128, vocab_size: int = 201088, hidden_size: int = 2880, intermediate_size: int = 2880, head_dim: int = 64, num_attention_heads: int = 64, num_key_value_heads: int = 8, sliding_window: int = 128, rope_theta: float = 150000.0, tie_word_embeddings=False, hidden_act: str = 'silu', initializer_range: float = 0.02, max_position_embeddings=131072, rms_norm_eps: float = 1e-05, rope_scaling=None, attention_dropout: float = 0.0, num_experts_per_tok=4, router_aux_loss_coef: float = 0.9, output_router_logits=False, use_cache=True, layer_types=None, mlp_activations_limit: float = 7.0, **kwargs)

Bases: EasyDeLBaseConfig

Configuration class for the GPT-OSS model.

GPT-OSS is a transformer-based language model featuring:

- Mixture of Experts (MoE) architecture with sparse routing
- Alternating sliding window and full attention layers
- RMSNorm for layer normalization
- Rotary Position Embeddings (RoPE) with optional scaling
- Efficient parameter sharding for distributed training

num_hidden_layers

Number of transformer layers. Default: 36

Type: int

num_local_experts

Number of expert networks per MoE layer. Default: 128

Type: int

vocab_size

Size of the vocabulary. Default: 201088

Type: int

hidden_size

Dimension of hidden representations. Default: 2880

Type: int

intermediate_size

Dimension of MLP intermediate layer. Default: 2880

Type: int

head_dim

Dimension of each attention head. Default: 64

Type: int

num_attention_heads

Number of attention heads. Default: 64

Type: int

num_key_value_heads

Number of key-value heads for grouped-query attention (GQA). Default: 8

Type: int
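As a rough illustration of how the head counts relate, the arithmetic below uses the defaults from the signature above. The projection-width names (`q_proj_out`, `kv_proj_out`) are labels chosen for this sketch and are not taken from the EasyDeL source.

```python
# Defaults copied from the GptOssConfig signature.
hidden_size = 2880
head_dim = 64
num_attention_heads = 64
num_key_value_heads = 8

# In grouped-query attention, each key-value head serves a group of query heads.
assert num_attention_heads % num_key_value_heads == 0
queries_per_kv_head = num_attention_heads // num_key_value_heads

# Output widths implied by the head counts (names are illustrative only).
q_proj_out = num_attention_heads * head_dim
kv_proj_out = num_key_value_heads * head_dim

print(queries_per_kv_head, q_proj_out, kv_proj_out)  # 8 4096 512
```

Because head_dim is set explicitly, the query width (4096) does not have to equal hidden_size (2880).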

sliding_window

Size of sliding window for local attention. Default: 128

Type: int
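To make the window size concrete, here is a minimal sketch of a causal sliding-window mask in plain Python. It assumes the common convention that a query at position i sees itself and the previous `window - 1` positions; the exact boundary convention used by EasyDeL is not stated here, and `sliding_window_mask` is a hypothetical helper.

```python
def sliding_window_mask(seq_len: int, window: int) -> list[list[bool]]:
    """mask[i][j] is True when query position i may attend to key position j."""
    return [
        # Causal (j <= i) and within the last `window` positions (i - j < window).
        [(j <= i) and (i - j < window) for j in range(seq_len)]
        for i in range(seq_len)
    ]


mask = sliding_window_mask(seq_len=6, window=3)
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
```

With the default sliding_window of 128, each row would contain at most 128 allowed positions regardless of sequence length.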

rope_theta

Base frequency for RoPE. Default: 150000.0

Type: float
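The role of the base frequency can be sketched with the standard RoPE inverse-frequency formula, theta ** (-2i / head_dim). This ignores any rope_scaling adjustment, and `rope_inv_freq` is a hypothetical helper rather than an EasyDeL function.

```python
import math


def rope_inv_freq(head_dim: int, theta: float) -> list[float]:
    """Unscaled RoPE inverse frequencies, one per pair of head dimensions."""
    return [theta ** (-(2 * i) / head_dim) for i in range(head_dim // 2)]


inv_freq = rope_inv_freq(head_dim=64, theta=150000.0)

# Wavelength (in token positions) of the slowest-rotating pair.
longest_wavelength = 2 * math.pi / inv_freq[-1]
print(len(inv_freq), inv_freq[0], longest_wavelength)
```

A larger rope_theta lowers the slowest frequencies, so the longest wavelength covers more positions.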

tie_word_embeddings

Whether to tie input/output embeddings. Default: False

Type: bool

hidden_act

Activation function for MLP. Default: 'silu'

Type: str

initializer_range

Standard deviation for weight initialization. Default: 0.02

Type: float

max_position_embeddings

Maximum sequence length. Default: 131072

Type: int

rms_norm_eps

Epsilon for RMS normalization. Default: 1e-5

Type: float
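For readers unfamiliar with where the epsilon enters, the following is a minimal pure-Python sketch of RMS normalization. It follows the textbook definition and is not EasyDeL's implementation; `rms_norm` is a hypothetical helper.

```python
import math


def rms_norm(x: list[float], weight: list[float], eps: float = 1e-5) -> list[float]:
    """Scale x by the reciprocal of its root mean square, then by a learned weight."""
    mean_sq = sum(v * v for v in x) / len(x)
    # eps keeps the denominator away from zero for near-zero inputs.
    scale = 1.0 / math.sqrt(mean_sq + eps)
    return [v * scale * w for v, w in zip(x, weight)]


out = rms_norm([3.0, 4.0], weight=[1.0, 1.0])
print(out)
```

After normalization with unit weights, the mean of the squared outputs is approximately 1.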

rope_scaling

Configuration for RoPE scaling. The constructor default is None, which this documentation describes as YaRN scaling with factor 32.

Type: dict

attention_dropout

Dropout rate for attention weights. Default: 0.0

Type: float

num_experts_per_tok

Number of experts to route each token to. Default: 4

Type: int
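Sparse routing with num_experts_per_tok can be sketched as follows. The sketch assumes the router keeps the highest-scoring experts and applies a softmax over only the selected logits; whether EasyDeL normalizes before or after selection is not stated here, and `route_token` is a hypothetical helper.

```python
import math


def route_token(router_logits: list[float], num_experts_per_tok: int = 4):
    """Return (expert indices, mixing weights) for one token."""
    # Rank experts by logit and keep the top-k.
    ranked = sorted(range(len(router_logits)), key=lambda e: router_logits[e], reverse=True)
    top = ranked[:num_experts_per_tok]
    # Numerically stable softmax over the selected logits only (an assumption).
    peak = max(router_logits[e] for e in top)
    exps = [math.exp(router_logits[e] - peak) for e in top]
    total = sum(exps)
    return top, [v / total for v in exps]


logits = [0.1, 2.0, -1.0, 0.5, 1.5, 0.0, 3.0, -0.5]  # toy example with 8 experts
experts, weights = route_token(logits)
print(experts)  # [6, 1, 4, 3]
```

With the default configuration, each token would select 4 of the 128 local experts in every MoE layer.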

router_aux_loss_coef

Coefficient for load balancing auxiliary loss. Default: 0.9

Type: float
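As an illustration of what a load-balancing auxiliary loss measures, the sketch below uses a Switch-Transformer-style formulation with top-1 assignments: the number of experts times the sum over experts of (fraction of tokens dispatched) times (mean router probability). This formulation is an assumption made for the example; the exact loss used by EasyDeL may differ, and `load_balancing_loss` is a hypothetical helper.

```python
def load_balancing_loss(router_probs, assignments, num_experts):
    """Switch-style auxiliary loss for top-1 routing (illustrative only)."""
    num_tokens = len(router_probs)
    # Fraction of tokens dispatched to each expert.
    dispatched = [0.0] * num_experts
    for expert in assignments:
        dispatched[expert] += 1.0 / num_tokens
    # Mean router probability assigned to each expert.
    mean_prob = [sum(p[i] for p in router_probs) / num_tokens for i in range(num_experts)]
    return num_experts * sum(f * p for f, p in zip(dispatched, mean_prob))


balanced = load_balancing_loss([[0.5, 0.5], [0.5, 0.5]], assignments=[0, 1], num_experts=2)
skewed = load_balancing_loss([[0.9, 0.1], [0.9, 0.1]], assignments=[0, 0], num_experts=2)

router_aux_loss_coef = 0.9  # default from the signature
lm_loss = 2.0               # placeholder value for the language-modeling loss
total_loss = lm_loss + router_aux_loss_coef * skewed
print(balanced, skewed, total_loss)
```

The coefficient scales how strongly an unbalanced router is penalized relative to the language-modeling loss.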

output_router_logits

Whether to output router logits. Default: False

Type: bool

use_cache

Whether to use key-value caching. Default: True

Type: bool

layer_types

Attention type for each layer. The constructor default is None, which this documentation describes as alternating sliding-window and full attention layers.

Type: list
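An alternating pattern of this kind could be built as shown below. The string labels "sliding_attention" and "full_attention", and the choice to start with a sliding-window layer, are assumptions for illustration; check the EasyDeL source for the values it actually expects. `alternating_layer_types` is a hypothetical helper.

```python
def alternating_layer_types(num_hidden_layers: int) -> list[str]:
    """Build one attention-type label per layer, alternating sliding and full."""
    return [
        "sliding_attention" if i % 2 == 0 else "full_attention"
        for i in range(num_hidden_layers)
    ]


layer_types = alternating_layer_types(36)  # 36 is the default num_hidden_layers
print(layer_types[:4])
```

A list passed explicitly as layer_types would presumably need one entry per hidden layer.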

Example

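>>> from easydel.modules.gpt_oss.gpt_oss_configuration import GptOssConfig
>>> config = GptOssConfig(
...     num_hidden_layers=24,
...     num_local_experts=64,
...     hidden_size=2048,
...     num_attention_heads=32,
... )
>>> # GptOssForCausalLM is not part of this configuration module and must be imported separately.
>>> model = GptOssForCausalLM(config)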

get_partition_rules(*args, **kwargs)

Get the partition rules for distributed training of the GPT-OSS model.

Returns partition specifications for different parameter groups to enable efficient model parallelism. The rules specify how to shard parameters across devices for:

- Embeddings: Column-wise sharding
- Attention: Column-wise for QKV, row-wise for output projection
- MoE: Custom expert-parallel sharding for expert parameters
- Normalization: Replicated across devices

Returns: Partition rules as (regex_pattern, PartitionSpec) pairs
Return type: tuple
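The sketch below shows how a tuple of (regex_pattern, spec) pairs can be resolved against parameter names. The patterns and parameter paths are hypothetical, plain strings stand in for PartitionSpec objects, and the first-match-wins lookup is an assumption about how such rules are typically applied.

```python
import re

# Hypothetical rules; strings stand in for real PartitionSpec objects.
rules = (
    (r"embed_tokens", "column-wise"),
    (r"self_attn/(q_proj|k_proj|v_proj)", "column-wise"),
    (r"self_attn/o_proj", "row-wise"),
    (r"norm", "replicated"),
    (r".*", "replicated"),  # fallback for anything not matched above
)


def match_rule(param_path: str, rules) -> str:
    """Return the spec of the first rule whose pattern matches the parameter path."""
    for pattern, spec in rules:
        if re.search(pattern, param_path):
            return spec
    raise ValueError(f"no partition rule matches {param_path!r}")


for path in (
    "model/layers/0/self_attn/q_proj/kernel",
    "model/layers/0/self_attn/o_proj/kernel",
    "model/layers/0/input_layernorm/kernel",
):
    print(path, "->", match_rule(path, rules))
```

Ordering matters in a first-match scheme: specific patterns should precede the catch-all fallback.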

model_type: str = 'gpt_oss'
