easydel.modules.deepseek_v3.deepseek_configuration
- class easydel.modules.deepseek_v3.deepseek_configuration.DeepseekV3Config(vocab_size=129280, hidden_size=7168, intermediate_size=18432, moe_intermediate_size=2048, num_hidden_layers=61, num_nextn_predict_layers=1, num_attention_heads=128, num_key_value_heads=128, n_shared_experts=1, n_routed_experts=256, ep_size=1, routed_scaling_factor=2.5, kv_lora_rank=512, q_lora_rank=1536, qk_rope_head_dim=64, v_head_dim=128, qk_nope_head_dim=128, topk_method='noaux_tc', n_group=8, topk_group=4, num_experts_per_tok=8, moe_layer_freq=1, first_k_dense_replace=3, norm_topk_prob=True, scoring_func='sigmoid', aux_loss_alpha=0.001, seq_aux=True, hidden_act='silu', max_position_embeddings=4096, initializer_range=0.02, rms_norm_eps=1e-06, use_cache=True, pad_token_id=None, bos_token_id=0, eos_token_id=1, pretraining_tp=1, tie_word_embeddings=False, rope_theta=10000.0, rope_scaling=None, attention_bias=False, attention_dropout=0.0, **kwargs)
Bases: EasyDeLBaseConfig
This is the configuration class to store the configuration of a [DeepseekV3Model]. It is used to instantiate a DeepSeek model according to the specified arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a configuration similar to that of DeepSeek-V3. Configuration objects inherit from [PretrainedConfig] and can be used to control the model outputs. Read the documentation from [PretrainedConfig] for more information.
- Parameters
vocab_size (int, optional, defaults to 129280) – Vocabulary size of the DeepSeek model. Defines the number of different tokens that can be represented by the input_ids passed when calling [DeepseekV3Model].
hidden_size (int, optional, defaults to 7168) – Dimension of the hidden representations.
intermediate_size (int, optional, defaults to 18432) – Dimension of the MLP representations.
moe_intermediate_size (int, optional, defaults to 2048) – Dimension of the MoE representations.
num_hidden_layers (int, optional, defaults to 61) – Number of hidden layers in the Transformer decoder.
num_nextn_predict_layers (int, optional, defaults to 1) – Number of next-n prediction layers in the DeepSeek-V3 model.
num_attention_heads (int, optional, defaults to 128) – Number of attention heads for each attention layer in the Transformer decoder.
n_shared_experts (int, optional, defaults to 1) – Number of shared experts. None means a dense model.
n_routed_experts (int, optional, defaults to 256) – Number of routed experts. None means a dense model.
routed_scaling_factor (float, optional, defaults to 2.5) – Scaling factor for routed experts.
topk_method (str, optional, defaults to 'noaux_tc') – Top-k method used in the routing gate.
n_group (int, optional, defaults to 8) – Number of groups for routed experts.
topk_group (int, optional, defaults to 4) – Number of groups selected for each token. The experts selected for a token are restricted to these topk_group groups.
num_experts_per_tok (int, optional, defaults to 8) – Number of selected experts per token. None means a dense model.
moe_layer_freq (int, optional, defaults to 1) – The frequency of the MoE layer: one expert layer for every moe_layer_freq - 1 dense layers.
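first_k_dense_replace (int, optional, defaults to 3) – Number of dense layers at the shallow end of the stack, before the MoE layers begin (embed -> dense -> dense -> ... -> dense -> moe -> moe -> ... -> lm_head, where the leading run contains k = first_k_dense_replace dense layers).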
norm_topk_prob (bool, optional, defaults to True) – Whether to normalize the weights of the routed experts.
scoring_func (str, optional, defaults to 'sigmoid') – Method of computing expert weights.
aux_loss_alpha (float, optional, defaults to 0.001) – Auxiliary loss weight coefficient.
seq_aux (bool, optional, defaults to True) – Whether to compute the auxiliary loss for each individual sample.
num_key_value_heads (int, optional) – The number of key/value heads used to implement Grouped Query Attention. If num_key_value_heads=num_attention_heads, the model uses Multi Head Attention (MHA); if num_key_value_heads=1, it uses Multi Query Attention (MQA); otherwise GQA is used. When converting a multi-head checkpoint to a GQA checkpoint, each group's key and value head should be constructed by mean-pooling all the original heads within that group. For more details, check out [this paper](https://arxiv.org/pdf/2305.13245.pdf). If not specified, it defaults to num_attention_heads.
hidden_act (str or function, optional, defaults to 'silu') – The non-linear activation function (function or string) in the decoder.
max_position_embeddings (int, optional, defaults to 4096) – The maximum sequence length that this model might ever be used with.
initializer_range (float, optional, defaults to 0.02) – The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
rms_norm_eps (float, optional, defaults to 1e-06) – The epsilon used by the RMS normalization layers.
use_cache (bool, optional, defaults to True) – Whether or not the model should return the last key/values attentions (not used by all models). Only relevant if config.is_decoder=True.
pad_token_id (int, optional) – Padding token id.
bos_token_id (int, optional, defaults to 0) – Beginning of stream token id.
eos_token_id (int, optional, defaults to 1) – End of stream token id.
pretraining_tp (int, optional, defaults to 1) – Experimental feature. Tensor parallelism rank used during pretraining. Please refer to [this document](https://huggingface.co/docs/transformers/parallelism) to understand more about it. This value is necessary to ensure exact reproducibility of the pretraining results. Please refer to [this issue](pytorch/pytorch#76232).
tie_word_embeddings (bool, optional, defaults to False) – Whether to tie the input and output word embeddings.
rope_theta (float, optional, defaults to 10000.0) – The base period of the RoPE embeddings.
rope_scaling (Dict, optional) – Dictionary containing the scaling configuration for the RoPE embeddings. Currently supports two scaling strategies: linear and dynamic. Their scaling factor must be a float greater than 1. The expected format is {"type": strategy name, "factor": scaling factor}. When using this flag, don't update max_position_embeddings to the expected new maximum.
attention_bias (bool, optional, defaults to False) – Whether to use a bias in the query, key, value and output projection layers during self-attention.
attention_dropout (float, optional, defaults to 0.0) – The dropout ratio for the attention probabilities.
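The interaction between first_k_dense_replace, moe_layer_freq and n_routed_experts can be illustrated with a small sketch. The selection rule used here (a layer is MoE when its index is at least first_k_dense_replace and divisible by moe_layer_freq) is an assumption based on the parameter descriptions above, and layer_kinds is a hypothetical helper, not part of EasyDeL.

```python
def layer_kinds(num_hidden_layers=61, first_k_dense_replace=3,
                moe_layer_freq=1, n_routed_experts=256):
    """Label each decoder layer as 'dense' or 'moe'.

    Assumed rule (not confirmed by this page): a layer uses MoE when routed
    experts are configured, its index is past the leading dense layers, and
    its index is a multiple of moe_layer_freq.
    """
    kinds = []
    for layer_idx in range(num_hidden_layers):
        is_moe = (
            n_routed_experts is not None
            and layer_idx >= first_k_dense_replace
            and layer_idx % moe_layer_freq == 0
        )
        kinds.append("moe" if is_moe else "dense")
    return kinds


kinds = layer_kinds()
print(kinds[:5])  # ['dense', 'dense', 'dense', 'moe', 'moe']
print(kinds.count("dense"), kinds.count("moe"))  # 3 58
```

Under this assumed rule, the default values give three leading dense layers followed by 58 MoE layers, and passing n_routed_experts=None yields a fully dense stack.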
```python
>>> from easydel.modules.deepseek_v3.deepseek_configuration import DeepseekV3Config
>>> # Initializing a Deepseek-V3 style configuration
>>> configuration = DeepseekV3Config()
>>> # Accessing a field of the configuration
>>> configuration.model_type
```
- get_partition_rules(*args, **kwargs)
Get the partition rules for the model.
- Returns
The partition rules.
- Return type
tp.Tuple[tp.Tuple[str, PartitionSpec]]
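As a rough illustration of the documented return shape, a tuple of (pattern, PartitionSpec) pairs, the sketch below resolves a parameter path against such rules. Everything in it is hypothetical: the patterns, the axis names, the parameter paths and the first-match-wins lookup are assumptions for illustration, and plain tuples stand in for PartitionSpec so the example runs without JAX. It does not reproduce the rules that get_partition_rules actually returns.

```python
import re

# Hypothetical rules in the documented shape: (regex pattern, partition spec).
# Plain tuples stand in for PartitionSpec so this sketch needs no JAX install.
EXAMPLE_RULES = (
    (r"embed_tokens/embedding", ("tp", "fsdp")),
    (r"self_attn/.*proj/kernel", ("fsdp", "tp")),
    (r".*", (None,)),  # catch-all: leave the parameter replicated
)


def spec_for(param_path, rules):
    """Return the spec of the first rule whose pattern matches param_path.

    First-match-wins is an assumption about how such rules are consumed.
    """
    for pattern, spec in rules:
        if re.search(pattern, param_path):
            return spec
    raise ValueError(f"no partition rule matches {param_path!r}")


print(spec_for("model/layers/0/self_attn/o_proj/kernel", EXAMPLE_RULES))
print(spec_for("model/norm/kernel", EXAMPLE_RULES))
```

Ordering matters under this assumption: the catch-all pattern is placed last so that more specific rules are tried first.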
- keys_to_ignore_at_inference = ['past_key_values']
- model_type: str = 'deepseek_v3'
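The routing parameters of DeepseekV3Config (n_routed_experts, n_group, topk_group, num_experts_per_tok, scoring_func, norm_topk_prob, routed_scaling_factor) can be read together as a group-limited top-k selection. The following is a minimal sketch for a single token in plain Python, under stated assumptions: each group is scored by its best expert, and no bias correction or auxiliary loss is applied. The actual 'noaux_tc' method may score groups differently, so route_token is a hypothetical helper and not the EasyDeL implementation.

```python
import math
import random


def route_token(logits, n_group=8, topk_group=4, num_experts_per_tok=8,
                norm_topk_prob=True, routed_scaling_factor=2.5):
    """Pick experts for one token using group-limited top-k (simplified)."""
    n_experts = len(logits)
    if n_experts % n_group != 0:
        raise ValueError("number of experts must be divisible by n_group")
    group_size = n_experts // n_group

    # scoring_func='sigmoid': squash each router logit independently.
    scores = [1.0 / (1.0 + math.exp(-x)) for x in logits]

    # Assumption: a group's score is the best expert score inside it.
    group_scores = [
        max(scores[g * group_size:(g + 1) * group_size]) for g in range(n_group)
    ]
    kept_groups = sorted(
        range(n_group), key=lambda g: group_scores[g], reverse=True
    )[:topk_group]

    # Only experts inside the kept groups are eligible for selection.
    candidates = [
        e for g in kept_groups
        for e in range(g * group_size, (g + 1) * group_size)
    ]
    chosen = sorted(candidates, key=lambda e: scores[e], reverse=True)
    chosen = chosen[:num_experts_per_tok]

    weights = [scores[e] for e in chosen]
    if norm_topk_prob:
        total = sum(weights)
        weights = [w / total for w in weights]
    # routed_scaling_factor rescales the combined routed-expert output.
    weights = [w * routed_scaling_factor for w in weights]
    return chosen, weights


rng = random.Random(0)
logits = [rng.gauss(0.0, 1.0) for _ in range(256)]  # n_routed_experts=256
experts, weights = route_token(logits)
print(len(experts), len({e // 32 for e in experts}), round(sum(weights), 6))
```

With the defaults, each token receives 8 experts drawn from at most 4 of the 8 groups, and because norm_topk_prob is enabled the weights sum to routed_scaling_factor.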