easydel.modules.deepseek_v3.__init__

class easydel.modules.deepseek_v3.__init__.DeepseekV3Config(vocab_size=129280, hidden_size=7168, intermediate_size=18432, moe_intermediate_size=2048, num_hidden_layers=61, num_nextn_predict_layers=1, num_attention_heads=128, num_key_value_heads=128, n_shared_experts=1, n_routed_experts=256, ep_size=1, routed_scaling_factor=2.5, kv_lora_rank=512, q_lora_rank=1536, qk_rope_head_dim=64, v_head_dim=128, qk_nope_head_dim=128, topk_method='noaux_tc', n_group=8, topk_group=4, num_experts_per_tok=8, moe_layer_freq=1, first_k_dense_replace=3, norm_topk_prob=True, scoring_func='sigmoid', aux_loss_alpha=0.001, seq_aux=True, hidden_act='silu', max_position_embeddings=4096, initializer_range=0.02, rms_norm_eps=1e-06, use_cache=True, pad_token_id=None, bos_token_id=0, eos_token_id=1, pretraining_tp=1, tie_word_embeddings=False, rope_theta=10000.0, rope_scaling=None, attention_bias=False, attention_dropout=0.0, **kwargs)

Bases: EasyDeLBaseConfig

This is the configuration class that stores the configuration of a DeepseekV3Model. It is used to instantiate a DeepSeek model according to the specified arguments, which define the model architecture. Instantiating a configuration with the defaults yields a configuration similar to that of DeepSeek-V3. Configuration objects inherit from PretrainedConfig and can be used to control the model outputs. Read the documentation of PretrainedConfig for more information.

Parameters
  • vocab_size (int, optional, defaults to 129280) – Vocabulary size of the DeepSeek model. Defines the number of different tokens that can be represented by the input_ids passed when calling DeepseekV3Model.

  • hidden_size (int, optional, defaults to 7168) – Dimension of the hidden representations.

  • intermediate_size (int, optional, defaults to 18432) – Dimension of the MLP representations.

  • moe_intermediate_size (int, optional, defaults to 2048) – Dimension of the MoE representations.

  • num_hidden_layers (int, optional, defaults to 61) – Number of hidden layers in the Transformer decoder.

  • num_nextn_predict_layers (int, optional, defaults to 1) – Number of next-n prediction layers in the DeepSeek-V3 model.

  • num_attention_heads (int, optional, defaults to 128) – Number of attention heads for each attention layer in the Transformer decoder.

  • n_shared_experts (int, optional, defaults to 1) – Number of shared experts. None means a dense model.

  • n_routed_experts (int, optional, defaults to 256) – Number of routed experts. None means a dense model.

  • routed_scaling_factor (float, optional, defaults to 2.5) – Scaling factor for routed experts.

  • topk_method (str, optional, defaults to 'noaux_tc') – Top-k method used in the routing gate.

  • n_group (int, optional, defaults to 8) – Number of groups for routed experts.

  • topk_group (int, optional, defaults to 4) – Number of selected groups for each token (the experts selected for a token are restricted to topk_group groups).

  • num_experts_per_tok (int, optional, defaults to 8) – Number of selected experts. None means a dense model.

  • moe_layer_freq (int, optional, defaults to 1) – The frequency of the MoE layer: one expert layer for every moe_layer_freq - 1 dense layers.

  • first_k_dense_replace (int, optional, defaults to 3) – Number of dense layers at the shallow end of the stack. The layer order is embed -> dense -> ... -> dense -> moe -> moe -> ... -> lm_head, where the leading run of dense layers has length k = first_k_dense_replace.

  • norm_topk_prob (bool, optional, defaults to True) – Whether to normalize the weights of the routed experts.

  • scoring_func (str, optional, defaults to 'sigmoid') – Method of computing expert weights.

  • aux_loss_alpha (float, optional, defaults to 0.001) – Auxiliary loss weight coefficient.

  • seq_aux (bool, optional, defaults to True) – Whether to compute the auxiliary loss for each individual sample.

  • num_key_value_heads (int, optional) – The number of key/value heads used to implement Grouped Query Attention (GQA). If num_key_value_heads=num_attention_heads, the model uses Multi Head Attention (MHA); if num_key_value_heads=1, the model uses Multi Query Attention (MQA); otherwise GQA is used. When converting a multi-head checkpoint to a GQA checkpoint, each group's key and value head should be constructed by mean-pooling all the original heads within that group. For more details, see [this paper](https://arxiv.org/pdf/2305.13245.pdf). If it is not specified, it defaults to num_attention_heads.

  • hidden_act (str or function, optional, defaults to "silu") – The non-linear activation function (function or string) in the decoder.

  • max_position_embeddings (int, optional, defaults to 4096) – The maximum sequence length that this model might ever be used with.

  • initializer_range (float, optional, defaults to 0.02) – The standard deviation of the truncated_normal_initializer for initializing all weight matrices.

  • rms_norm_eps (float, optional, defaults to 1e-06) – The epsilon used by the RMS normalization layers.

  • use_cache (bool, optional, defaults to True) – Whether or not the model should return the last key/value attentions (not used by all models). Only relevant if config.is_decoder=True.

  • pad_token_id (int, optional) – Padding token id.

  • bos_token_id (int, optional, defaults to 0) – Beginning of stream token id.

  • eos_token_id (int, optional, defaults to 1) – End of stream token id.

  • pretraining_tp (int, optional, defaults to 1) – Experimental feature. Tensor parallelism rank used during pretraining. Please refer to [this document](https://huggingface.co/docs/transformers/parallelism) to understand more about it. This value is necessary to ensure exact reproducibility of the pretraining results. Please refer to [this issue](https://github.com/pytorch/pytorch/issues/76232).

  • tie_word_embeddings (bool, optional, defaults to False) – Whether to tie the input and output word embeddings.

  • rope_theta (float, optional, defaults to 10000.0) – The base period of the RoPE embeddings.

  • rope_scaling (Dict, optional) – Dictionary containing the scaling configuration for the RoPE embeddings. Currently supports two scaling strategies: linear and dynamic. Their scaling factor must be a float greater than 1. The expected format is {"type": strategy name, "factor": scaling factor}. When using this flag, don't update max_position_embeddings to the expected new maximum.

  • attention_bias (bool, optional, defaults to False) – Whether to use a bias in the query, key, value and output projection layers during self-attention.

  • attention_dropout (float, optional, defaults to 0.0) – The dropout ratio for the attention probabilities.
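The routing parameters above (scoring_func, n_group, topk_group, num_experts_per_tok, norm_topk_prob, routed_scaling_factor) interact in a way that is easier to see in code. The following is a minimal pure-Python sketch for a single token, not EasyDeL's implementation. The function name route_token is hypothetical, the learned per-expert bias used by 'noaux_tc' is omitted, and scoring a group by the sum of its two best experts is an assumption taken from the reference DeepSeek-V3 code.

```python
import math


def route_token(logits, n_group=8, topk_group=4, num_experts_per_tok=8,
                norm_topk_prob=True, routed_scaling_factor=2.5):
    """Pick experts for one token with group-limited top-k routing.

    Simplified sketch: the per-expert correction bias that 'noaux_tc'
    adds before selection is assumed to be zero.
    """
    n_experts = len(logits)
    assert n_experts % n_group == 0, "experts must split evenly into groups"
    group_size = n_experts // n_group

    # scoring_func='sigmoid': every expert gets an independent score in (0, 1).
    scores = [1.0 / (1.0 + math.exp(-x)) for x in logits]

    # Score each group by the sum of its two highest expert scores (assumed rule).
    group_scores = []
    for g in range(n_group):
        members = scores[g * group_size:(g + 1) * group_size]
        group_scores.append(sum(sorted(members, reverse=True)[:2]))

    # Keep only the topk_group best groups; experts elsewhere are excluded.
    kept = sorted(range(n_group), key=lambda g: group_scores[g], reverse=True)[:topk_group]
    candidates = [e for g in kept for e in range(g * group_size, (g + 1) * group_size)]

    # Choose the top experts among the surviving candidates.
    chosen = sorted(candidates, key=lambda e: scores[e], reverse=True)[:num_experts_per_tok]

    weights = [scores[e] for e in chosen]
    if norm_topk_prob:
        total = sum(weights) + 1e-20
        weights = [w / total for w in weights]
    # routed_scaling_factor rescales the combined routed-expert output.
    weights = [w * routed_scaling_factor for w in weights]
    return chosen, weights


# Deterministic toy logits for 256 routed experts.
logits = [((i * 37) % 101) / 25.0 - 2.0 for i in range(256)]
experts, weights = route_token(logits)
```

With the defaults, the token is sent to 8 experts drawn from at most 4 of the 8 groups, and the returned weights sum to routed_scaling_factor because norm_topk_prob is enabled.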

```python
>>> from easydel.modules.deepseek_v3 import DeepseekV3Config

>>> # Initializing a DeepSeek-V3 style configuration
>>> configuration = DeepseekV3Config()

>>> # Accessing fields of the configuration
>>> hidden_size = configuration.hidden_size
>>> model_type = configuration.model_type
```
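To make first_k_dense_replace and moe_layer_freq concrete, here is a small sketch of which decoder layers would be dense and which would be MoE under the default configuration. The helper is_moe_layer is hypothetical, and the placement rule it encodes is an assumption based on the reference DeepSeek implementation; it has not been checked against EasyDeL's own layer construction.

```python
def is_moe_layer(layer_idx, n_routed_experts=256, first_k_dense_replace=3,
                 moe_layer_freq=1):
    """Return True if the decoder layer at layer_idx would use an MoE block.

    Assumed rule: the first first_k_dense_replace layers are dense; after
    that, layers whose index is a multiple of moe_layer_freq are MoE.
    A config with n_routed_experts=None is fully dense.
    """
    return (
        n_routed_experts is not None
        and layer_idx >= first_k_dense_replace
        and layer_idx % moe_layer_freq == 0
    )


num_hidden_layers = 61
layout = ["moe" if is_moe_layer(i) else "dense" for i in range(num_hidden_layers)]
print(layout[:5])           # ['dense', 'dense', 'dense', 'moe', 'moe']
print(layout.count("moe"))  # 58
```

Under this rule the defaults give 3 dense layers followed by 58 MoE layers.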

get_partition_rules(*args, **kwargs)

Get the partition rules for the model.

Returns
  The partition rules.

Return type
  tp.Tuple[tp.Tuple[str, PartitionSpec]]
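The return type is a tuple of (pattern, PartitionSpec) pairs. As a rough illustration of how rules of this shape are commonly consumed, the sketch below matches parameter paths against regex patterns and takes the first hit. Everything in it is an assumption for illustration: the patterns, the axis names, the parameter paths, the helper spec_for, and the first-match-wins policy are not taken from EasyDeL, and plain tuples stand in for jax.sharding.PartitionSpec so the snippet runs without JAX.

```python
import re

# Hypothetical rules: (regex over the parameter path, sharding spec).
# Plain tuples stand in for jax.sharding.PartitionSpec; the patterns
# and axis names below are invented for this example.
rules = (
    (r"embed_tokens/embedding", ("tp", "fsdp")),
    (r"self_attn/.*proj/kernel", ("fsdp", "tp")),
    (r".*", (None,)),  # catch-all: replicate anything not matched above
)


def spec_for(param_path, rules):
    """Return the spec of the first rule whose pattern matches the path."""
    for pattern, spec in rules:
        if re.search(pattern, param_path):
            return spec
    raise ValueError(f"no partition rule matches {param_path!r}")


print(spec_for("model/layers/0/self_attn/o_proj/kernel", rules))  # ('fsdp', 'tp')
print(spec_for("model/norm/kernel", rules))                        # (None,)
```

Because the first match wins in this sketch, more specific patterns need to come before the catch-all.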

keys_to_ignore_at_inference = ['past_key_values']

model_type: str = 'deepseek_v3'

class easydel.modules.deepseek_v3.__init__.DeepseekV3ForCausalLM(*args: Any, **kwargs: Any)

Bases: EasyDeLBaseModule

class easydel.modules.deepseek_v3.__init__.DeepseekV3Model(*args: Any, **kwargs: Any)

Bases: EasyDeLBaseModule

property frequencies

Returns frequency values from the config.