easydel.modules.mamba2.mamba2_configuration
- class easydel.modules.mamba2.mamba2_configuration.Mamba2Config(num_heads=128, head_dim=64, vocab_size=32768, hidden_size=4096, state_size=128, num_hidden_layers=64, layer_norm_epsilon=1e-05, pad_token_id=1, bos_token_id=0, eos_token_id=2, expand=2, conv_kernel=4, n_groups=8, use_bias=False, use_conv_bias=True, hidden_act='silu', initializer_range=0.1, residual_in_fp32=True, time_step_rank='auto', time_step_min=0.001, time_step_max=0.1, time_step_floor=0.0001, time_step_limit=(0.0, inf), rescale_prenorm_residual=False, use_cache=True, norm_before_gate=True, rms_norm=True, chunk_size=256, tie_word_embeddings=False, gradient_checkpointing: EasyDeLGradientCheckPointers = EasyDeLGradientCheckPointers.NONE, **kwargs)
Bases: EasyDeLBaseConfig
Configuration objects inherit from [EasyDeLBaseConfig] and can be used to control the model outputs. Read the documentation from [EasyDeLBaseConfig] for more information.
- Parameters
vocab_size (int, optional, defaults to 32768) – Vocabulary size of the Mamba2 model. Defines the number of different tokens that can be represented by the input_ids passed to the forward method.
hidden_size (int, optional, defaults to 4096) – Dimensionality of the embeddings and hidden states.
state_size (int, optional, defaults to 128) – State size of the Mamba2 model.
num_hidden_layers (int, optional, defaults to 64) – Number of hidden layers in the model.
layer_norm_epsilon (float, optional, defaults to 1e-5) – The epsilon used by the layer normalization layers.
pad_token_id (int, optional, defaults to 1) – The index of the padding token in the vocabulary.
bos_token_id (int, optional, defaults to 0) – The id of the beginning-of-sequence token.
eos_token_id (int, optional, defaults to 2) – The id of the end-of-sequence token.
expand (int, optional, defaults to 2) – Expansion factor for the intermediate size.
conv_kernel (int, optional, defaults to 4) – Kernel size of the convolution layer.
use_bias (bool, optional, defaults to False) – Whether to use bias in the linear layers.
use_conv_bias (bool, optional, defaults to True) – Whether to use bias in the convolution layer.
hidden_act (str or function, optional, defaults to “silu”) – The non-linear activation function (function or string) to use in the model. If string, “gelu”, “relu”, “swish” and “gelu_new” are supported.
initializer_range (float, optional, defaults to 0.1) – The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
residual_in_fp32 (bool, optional, defaults to True) – Whether to compute the residual connection in float32.
time_step_rank (str or int, optional, defaults to “auto”) – The rank of the time step embedding. If set to “auto”, the rank is calculated as math.ceil(self.hidden_size / 16).
time_step_scale (float, optional, defaults to 1.0) – The scale factor for the time step embedding. This parameter is not listed in the constructor signature above.
time_step_min (float, optional, defaults to 0.001) – The minimum value for the time step embedding.
time_step_max (float, optional, defaults to 0.1) – The maximum value for the time step embedding.
time_step_floor (float, optional, defaults to 1e-4) – The floor value for the time step embedding.
rescale_prenorm_residual (bool, optional, defaults to False) – Whether to rescale the pre-norm residual.
use_cache (bool, optional, defaults to True) – Whether or not the cache should be used.
gradient_checkpointing (EasyDeLGradientCheckPointers, optional, defaults to EasyDeLGradientCheckPointers.NONE) – The gradient checkpointing configuration.
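The relationships between several of these parameters can be sketched in plain Python. The block below is illustrative only: `Mamba2Sizes` is a hypothetical stand-in, not the EasyDeL class, and it holds just the fields it needs with the defaults from the signature above. It assumes that the intermediate size is `expand * hidden_size`, and that time steps are drawn log-uniformly between `time_step_min` and `time_step_max` and then clamped from below by `time_step_floor`, as in the reference Mamba implementation. Whether EasyDeL computes these values the same way is not confirmed here.

```python
import math
import random
from dataclasses import dataclass
from typing import Union


@dataclass
class Mamba2Sizes:
    """Hypothetical stand-in for a few Mamba2Config fields (not the EasyDeL class)."""

    hidden_size: int = 4096
    expand: int = 2
    num_heads: int = 128
    head_dim: int = 64
    time_step_rank: Union[int, str] = "auto"
    time_step_min: float = 0.001
    time_step_max: float = 0.1
    time_step_floor: float = 1e-4


def intermediate_size(cfg: Mamba2Sizes) -> int:
    # Assumption: the expansion factor multiplies the hidden size.
    return cfg.expand * cfg.hidden_size


def resolve_time_step_rank(cfg: Mamba2Sizes) -> int:
    # "auto" resolves to ceil(hidden_size / 16), as documented for time_step_rank.
    if cfg.time_step_rank == "auto":
        return math.ceil(cfg.hidden_size / 16)
    return int(cfg.time_step_rank)


def sample_time_step(cfg: Mamba2Sizes, rng: random.Random) -> float:
    # Assumption: log-uniform sample in [time_step_min, time_step_max],
    # then clamped from below by time_step_floor.
    lo, hi = math.log(cfg.time_step_min), math.log(cfg.time_step_max)
    dt = math.exp(rng.random() * (hi - lo) + lo)
    return max(dt, cfg.time_step_floor)


cfg = Mamba2Sizes()
print(intermediate_size(cfg))  # 8192
print(resolve_time_step_rank(cfg))  # 256
print(cfg.num_heads * cfg.head_dim == intermediate_size(cfg))  # True
```

With the default values, `num_heads * head_dim` equals `expand * hidden_size` (128 × 64 = 2 × 4096 = 8192), so it may be worth keeping these four parameters consistent when changing any one of them.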
- attach_custom_arguments(gradient_checkpointing: EasyDeLGradientCheckPointers = EasyDeLGradientCheckPointers.NONE)
- get_partition_rules(*args, **kwargs)
Get the partition rules for the model.
- Returns
The partition rules.
- Return type
tp.Tuple[tp.Tuple[str, PartitionSpec]]
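As a rough illustration of how a tuple of `(str, PartitionSpec)` pairs might be consumed, the sketch below treats each string as a regular expression and returns the spec of the first rule that matches a parameter path. Everything in it is an assumption made for illustration: the rule patterns, the parameter paths, the axis names, and the first-match semantics are hypothetical, and plain tuples stand in for `PartitionSpec` so that the example needs no third-party imports. It does not reproduce the rules that `get_partition_rules` actually returns.

```python
import re

# Hypothetical rules: (regex, spec) pairs. Plain tuples stand in for
# PartitionSpec objects; the axis names "fsdp" and "tp" are made up.
EXAMPLE_RULES = (
    (r"embeddings/embedding", ("tp", "fsdp")),
    (r"mixer/(in_proj|out_proj)/kernel", ("fsdp", "tp")),
    (r".*", (None,)),  # fallback: no sharding
)


def match_rule(rules, param_path):
    """Return the spec of the first rule whose regex matches the parameter path."""
    for pattern, spec in rules:
        if re.search(pattern, param_path):
            return spec
    raise ValueError(f"no partition rule matches {param_path!r}")


print(match_rule(EXAMPLE_RULES, "backbone/layers/0/mixer/in_proj/kernel"))
print(match_rule(EXAMPLE_RULES, "backbone/norm_f/kernel"))
```

Ordering matters under this assumption: the catch-all `.*` rule goes last so that more specific patterns are tried first.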
- model_type: str = 'mamba2'