easydel.modules.qwen3_moe.modeling_qwen3_moe_flax


class easydel.modules.qwen3_moe.modeling_qwen3_moe_flax.Qwen3MoeAttention(*args: Any, **kwargs: Any)#

Bases: AttentionModule

Qwen3Moe Attention module.

This module implements the multi-head attention mechanism used in the Qwen3Moe model. It supports Grouped Query Attention (GQA) and Rotary Position Embeddings (RoPE).
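As a rough illustration of Grouped Query Attention, the framework-free sketch below maps each query head to the key/value head it shares. The helper name `kv_head_for_query_head`, the contiguous head layout, and the head counts are assumptions made for illustration; they are not taken from the EasyDeL source or from any Qwen3Moe checkpoint.

```python
def kv_head_for_query_head(q_head: int, num_attention_heads: int, num_key_value_heads: int) -> int:
    """Return the index of the key/value head shared by query head `q_head`.

    Assumes the common contiguous layout, where consecutive query heads
    share one key/value head.
    """
    if num_attention_heads % num_key_value_heads != 0:
        raise ValueError("num_attention_heads must be divisible by num_key_value_heads")
    # Number of query heads that share a single key/value head.
    num_key_value_groups = num_attention_heads // num_key_value_heads
    return q_head // num_key_value_groups


# Illustrative sizes only (assumed): 8 query heads sharing 2 key/value heads.
mapping = [
    kv_head_for_query_head(h, num_attention_heads=8, num_key_value_heads=2)
    for h in range(8)
]
print(mapping)  # [0, 0, 0, 0, 1, 1, 1, 1]
```

With equal head counts the mapping is the identity, which corresponds to ordinary multi-head attention.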

config#

Configuration object for the model.

Type: Qwen3MoeConfig

dtype#

Data type for computations.

Type: jnp.dtype

param_dtype#

Data type for parameters.

Type: jnp.dtype

precision#

Precision setting for JAX operations.

Type: jax.lax.PrecisionLike

rngs#

Random number generators.

Type: nn.Rngs

hidden_size#

Dimensionality of the hidden states.

Type: int

head_dim#

Dimensionality of each attention head.

Type: int

num_key_value_groups#

Number of query heads that share each key/value head (the number of attention heads divided by the number of key/value heads).

Type: int

q_proj#

Linear layer for query projection.

Type: ParallelLinear

k_proj#

Linear layer for key projection.

Type: ParallelLinear

v_proj#

Linear layer for value projection.

Type: ParallelLinear

o_proj#

Linear layer for the output projection.

Type: ParallelLinear

attention_performer#

Module to perform the core attention computation.

Type: FlexibleAttentionModule

rotary#

Rotary position embedding module.

Type: RoPE
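The effect of rotary position embeddings can be sketched without any framework. The example below assumes the half-split pairing convention (feature `i` is paired with feature `i + head_dim / 2`) and a base of 10000; both are assumptions, and `apply_rope` is a hypothetical helper rather than the `RoPE` module's API. It shows that the query/key dot product depends only on the relative offset between positions.

```python
import math


def apply_rope(vec, position, base=10000.0):
    """Rotate an even-length vector by position-dependent angles (half-split pairing)."""
    if len(vec) % 2 != 0:
        raise ValueError("vector length must be even")
    half = len(vec) // 2
    out = [0.0] * len(vec)
    for i in range(half):
        # Each feature pair rotates at its own frequency.
        angle = position * base ** (-2.0 * i / len(vec))
        cos, sin = math.cos(angle), math.sin(angle)
        x1, x2 = vec[i], vec[i + half]
        out[i] = x1 * cos - x2 * sin
        out[i + half] = x1 * sin + x2 * cos
    return out


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


q = [1.0, 0.5, -0.3, 2.0]
k = [0.2, -1.0, 0.7, 0.4]

# Both position pairs have the same offset (3), so the scores agree
# up to floating-point error.
score_a = dot(apply_rope(q, 5), apply_rope(k, 2))
score_b = dot(apply_rope(q, 13), apply_rope(k, 10))
```

Because each pair is only rotated, the norm of the vector is unchanged, so the embedding encodes position without rescaling queries or keys.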

class easydel.modules.qwen3_moe.modeling_qwen3_moe_flax.Qwen3MoeDecoderLayer(*args: Any, **kwargs: Any)#

Bases: Module

Qwen3Moe Transformer Decoder Layer.

This module represents a single decoder layer in the Qwen3Moe model, combining self-attention and MLP sub-layers with residual connections and RMS normalization.

config#

Configuration object for the model.

Type: Qwen3MoeConfig

layer_idx#

The index of the layer in the model.

Type: int

dtype#

Data type for computations.

Type: jnp.dtype

param_dtype#

Data type for parameters.

Type: jnp.dtype

precision#

Precision setting for JAX operations.

Type: jax.lax.PrecisionLike

rngs#

Random number generators.

Type: nn.Rngs

input_layernorm#

RMS normalization applied before the attention layer.

Type: RMSNorm

self_attn#

The self-attention module.

Type: Qwen3MoeAttention

mlp#

The feed-forward (MLP) module.

Type: Qwen3MoeMLP

post_attention_layernorm#

RMS normalization applied after the attention layer and before the MLP layer.

Type: RMSNorm
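The layer structure described above (normalize, apply a sub-layer, add the residual, twice) can be sketched on a single token vector. This is a minimal plain-Python sketch, not the EasyDeL implementation: `rms_norm` and `decoder_layer` are hypothetical helpers, the `eps` value is assumed, and the attention and MLP sub-layers are replaced by stand-in functions.

```python
import math


def rms_norm(x, weight, eps=1e-6):
    """Divide a vector by its root mean square, then apply a per-feature scale."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [w * v / rms for v, w in zip(x, weight)]


def decoder_layer(x, attn, mlp, ln1_weight, ln2_weight):
    """Pre-norm residual block: h = x + attn(norm(x)); out = h + mlp(norm(h))."""
    h = [a + b for a, b in zip(x, attn(rms_norm(x, ln1_weight)))]
    return [a + b for a, b in zip(h, mlp(rms_norm(h, ln2_weight)))]


def zero_sublayer(v):
    # Stand-in for self-attention that contributes nothing.
    return [0.0] * len(v)


def doubling_sublayer(v):
    # Stand-in for the MLP.
    return [2.0 * t for t in v]


x = [3.0, 4.0]
ones = [1.0, 1.0]
out = decoder_layer(x, attn=zero_sublayer, mlp=doubling_sublayer,
                    ln1_weight=ones, ln2_weight=ones)
```

When both sub-layers return zeros the block reduces to the identity, which is the property the residual connections are there to provide.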

class easydel.modules.qwen3_moe.modeling_qwen3_moe_flax.Qwen3MoeForCausalLM(*args: Any, **kwargs: Any)#

Bases: EasyDeLBaseModule

Qwen3Moe model with a Causal Language Modeling head.

This model consists of the base Qwen3Moe transformer (Qwen3MoeModel) followed by a linear layer (lm_head) that projects the transformer’s output hidden states to the vocabulary size, producing logits for next token prediction. Optionally, the input token embeddings can be tied to the output projection layer.

config#

Configuration object for the model.

Type: Qwen3MoeConfig

dtype#

Data type for computation.

Type: jnp.dtype

param_dtype#

Data type for parameters.

Type: jnp.dtype

precision#

Precision setting for JAX operations.

Type: jax.lax.PrecisionLike

rngs#

Random number generators.

Type: nn.Rngs

model#

The core Qwen3Moe transformer model.

Type: Qwen3MoeModel

lm_head#

The linear layer for projecting hidden states to vocabulary logits.

Type: ParallelLinear
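To make the role of lm_head concrete, the sketch below projects one hidden state to vocabulary logits and picks the highest-scoring token. It is a toy, framework-free illustration: `project_to_logits`, the `[vocab_size][hidden_size]` weight layout, the tiny sizes, and the greedy choice of the next token are all assumptions, not the module's actual API or decoding strategy.

```python
def project_to_logits(hidden, lm_head_weight):
    """Map a [hidden_size] vector to [vocab_size] logits.

    `lm_head_weight` is assumed to be laid out as [vocab_size][hidden_size].
    """
    return [sum(h * w for h, w in zip(hidden, row)) for row in lm_head_weight]


# Toy sizes (assumed): vocabulary of 3 tokens, hidden size of 2.
embedding = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

# With tied embeddings, the output projection reuses the input embedding matrix.
lm_head_weight = embedding

hidden = [0.5, 2.0]  # final hidden state at the last position
logits = project_to_logits(hidden, lm_head_weight)
next_token = max(range(len(logits)), key=logits.__getitem__)
print(logits, next_token)  # [0.5, 2.0, 2.5] 2
```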

class easydel.modules.qwen3_moe.modeling_qwen3_moe_flax.Qwen3MoeForSequenceClassification(*args: Any, **kwargs: Any)#

Bases: EasyDeLBaseModule

Qwen3Moe model with a Sequence Classification head.

This model consists of the base Qwen3Moe transformer (Qwen3MoeModel) followed by a linear layer (score) that projects the transformer’s output hidden states (typically the hidden state of the last token or a pooled representation) to the number of classes for classification.

config#

Configuration object for the model.

Type: Qwen3MoeConfig

dtype#

Data type for computation.

Type: jnp.dtype

param_dtype#

Data type for parameters.

Type: jnp.dtype

precision#

Precision setting for JAX operations.

Type: jax.lax.PrecisionLike

rngs#

Random number generators.

Type: nn.Rngs

model#

The core Qwen3Moe transformer model.

Type: Qwen3MoeModel

score#

The linear layer for classification.

Type: ParallelLinear
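One common way to pool for classification is to take the hidden state of the last non-padding token and pass it through the score layer. The sketch below assumes right padding, a known `pad_token_id`, and a `[num_classes][hidden_size]` weight layout; `last_token_index` and `classify` are hypothetical helpers, and the pooling actually used by this class may differ.

```python
def last_token_index(input_ids, pad_token_id):
    """Index of the last non-padding token, assuming right padding."""
    for i in range(len(input_ids) - 1, -1, -1):
        if input_ids[i] != pad_token_id:
            return i
    raise ValueError("sequence contains only padding")


def classify(hidden_states, input_ids, score_weight, pad_token_id):
    """Pool the last real token's hidden state and project it to class logits."""
    pooled = hidden_states[last_token_index(input_ids, pad_token_id)]
    return [sum(h * w for h, w in zip(pooled, row)) for row in score_weight]


# Toy inputs (assumed): 5 positions, hidden size 2, 2 classes, pad id 0.
input_ids = [5, 7, 9, 0, 0]
hidden_states = [[1.0, 0.0], [0.0, 1.0], [2.0, 3.0], [9.0, 9.0], [9.0, 9.0]]
score_weight = [[1.0, 0.0], [0.0, 1.0]]

class_logits = classify(hidden_states, input_ids, score_weight, pad_token_id=0)
print(class_logits)  # [2.0, 3.0]
```

The hidden states at the padded positions (the rows of 9.0) never reach the score layer.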

class easydel.modules.qwen3_moe.modeling_qwen3_moe_flax.Qwen3MoeMLP(*args: Any, **kwargs: Any)#

Bases: Module

Qwen3Moe MLP module.

This module implements the feed-forward network (MLP) used in the Qwen3Moe model. It uses a Gated Linear Unit (GLU) structure with SiLU activation.

config#

Configuration object for the model.

Type: Qwen3MoeConfig

dtype#

Data type for computations.

Type: jnp.dtype

param_dtype#

Data type for parameters.

Type: jnp.dtype

precision#

Precision setting for JAX operations.

Type: jax.lax.PrecisionLike

gate_proj#

Linear layer for the GLU gate.

Type: ParallelLinear

down_proj#

Linear layer for the down projection.

Type: ParallelLinear

up_proj#

Linear layer for the GLU value.

Type: ParallelLinear

act_fn#

Activation function (SiLU).

Type: callable
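The gated structure can be written as `down_proj(act_fn(gate_proj(x)) * up_proj(x))`. The following plain-Python sketch follows that formula on a single vector; the helper names, the bias-free `[out][in]` weight layout, and the toy weights are assumptions for illustration only.

```python
import math


def silu(v):
    """SiLU activation: v * sigmoid(v)."""
    return v / (1.0 + math.exp(-v))


def linear(x, weight):
    """Bias-free linear layer; `weight` is assumed to be laid out as [out][in]."""
    return [sum(a * b for a, b in zip(x, row)) for row in weight]


def gated_mlp(x, gate_w, up_w, down_w):
    """down_proj(silu(gate_proj(x)) * up_proj(x))."""
    gate = [silu(g) for g in linear(x, gate_w)]  # GLU gate
    up = linear(x, up_w)                         # GLU value
    return linear([g * u for g, u in zip(gate, up)], down_w)


# Toy weights (assumed): hidden size 2, intermediate size 2, one output feature.
x = [1.0, -1.0]
gate_w = [[1.0, 0.0], [0.0, 1.0]]
up_w = [[2.0, 0.0], [0.0, 2.0]]
down_w = [[1.0, 1.0]]

out = gated_mlp(x, gate_w, up_w, down_w)
```

Here the result is approximately 2.0, because `silu(1) * 2 + silu(-1) * (-2) = 2 * (sigmoid(1) + sigmoid(-1)) = 2`. A gate pre-activation of zero suppresses the corresponding value entirely, since `silu(0) = 0`.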

class easydel.modules.qwen3_moe.modeling_qwen3_moe_flax.Qwen3MoeModel(*args: Any, **kwargs: Any)#

Bases: EasyDeLBaseModule

The base Qwen3Moe model transformer.

This class represents the core transformer architecture of the Qwen3Moe model, consisting of an embedding layer, multiple Qwen3MoeDecoderLayer layers, and a final RMS normalization layer.

config#

Configuration object for the model.

Type: Qwen3MoeConfig

dtype#

Data type for computation.

Type: jnp.dtype

param_dtype#

Data type for parameters.

Type: jnp.dtype

precision#

Precision setting for JAX operations.

Type: jax.lax.PrecisionLike

rngs#

Random number generators.

Type: nn.Rngs

embed_tokens#

Embedding layer for input tokens.

Type: nn.Embed

layers#

List of decoder layers.

Type: tp.List[Qwen3MoeDecoderLayer]

norm#

Final RMS normalization, applied to the output of the last decoder layer.

Type: RMSNorm

gradient_checkpointing#

Gradient checkpointing configuration.

Type: EasyDeLGradientCheckPointers

class easydel.modules.qwen3_moe.modeling_qwen3_moe_flax.Qwen3MoeSparseMoeBlock(*args: Any, **kwargs: Any)#

Bases: Module

Sparse Mixture of Experts (MoE) block for Qwen3 MoE.

This block routes input hidden states to a selected subset of experts and combines their outputs.

config#

Configuration object for the model.

Type: Qwen3MoeConfig

gate#

Linear layer for the gating network.

Type: ParallelLinear

experts#

List of expert MLP modules.

Type: tp.List[Qwen3MoeMLP]

dtype#

Data type for computations.

Type: jnp.dtype

param_dtype#

Data type for parameters.

Type: jnp.dtype

precision#

Precision setting for matrix multiplications.

Type: jax.lax.PrecisionLike

rngs#

Random number generators.

Type: nn.Rngs
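The routing idea can be sketched for a single token: the gate produces one logit per expert, the top-k experts are selected, and their outputs are combined using the routing weights. The sketch below is an assumption-laden illustration, not the block's implementation: `moe_forward` is a hypothetical helper, the softmax-then-top-k order and the renormalization of the selected weights are assumed, and the experts are stand-in functions rather than `Qwen3MoeMLP` modules.

```python
import math


def softmax(xs):
    m = max(xs)
    exps = [math.exp(v - m) for v in xs]
    total = sum(exps)
    return [e / total for e in exps]


def moe_forward(x, router_logits, experts, top_k, renormalize=True):
    """Route one token to its top-k experts and mix their outputs."""
    probs = softmax(router_logits)
    # Indices of the k experts with the highest routing probability.
    chosen = sorted(range(len(probs)), key=probs.__getitem__, reverse=True)[:top_k]
    weights = [probs[i] for i in chosen]
    if renormalize:
        # Assumed behaviour: make the selected weights sum to 1.
        total = sum(weights)
        weights = [w / total for w in weights]
    out = [0.0] * len(x)
    for w, i in zip(weights, chosen):
        out = [o + w * y for o, y in zip(out, experts[i](x))]
    return out, chosen


def make_scaling_expert(scale):
    # Stand-in expert that multiplies its input by a constant.
    return lambda v: [scale * t for t in v]


experts = [make_scaling_expert(s) for s in (1.0, 10.0, 100.0)]
router_logits = [-5.0, math.log(3.0), 0.0]  # would come from the gate layer

out, chosen = moe_forward([1.0], router_logits, experts, top_k=2)
```

In this toy case experts 1 and 2 are selected with renormalized weights of about 0.75 and 0.25, so the output is approximately 32.5; expert 0 is never evaluated.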
