easydel.modules.mosaic_mpt.__init__

class easydel.modules.mosaic_mpt.__init__.MptAttentionConfig(attn_type='multihead_attention', attn_pdrop=0, attn_impl='torch', clip_qkv=None, softmax_scale=None, prefix_lm=False, qk_ln=False, attn_uses_sequence_id=False, alibi=True, alibi_bias_max=8, **kwargs)[source]#

Bases: EasyDeLBaseConfig

This is the configuration class that stores the attention-related configuration of an MptModel.

Parameters

attn_type (str, optional, defaults to “multihead_attention”) – The type of attention to use. Can be either “multihead_attention” or “multiquery_attention”.
attn_pdrop (float, optional, defaults to 0.0) – The dropout probability applied to the attention output.
attn_impl (str, optional, defaults to “torch”) – The implementation of the attention mechanism. Can be either “torch” or “flash”.
clip_qkv (float, optional) – The clip value applied to the query, key, and value tensors.
softmax_scale (float, optional) – The scale factor applied to the softmax function in the attention layer.
prefix_lm (bool, optional, defaults to False) – Whether to use a prefix LM.
qk_ln (bool, optional, defaults to False) – Whether to apply layer normalization to the query and key tensors.
attn_uses_sequence_id (bool, optional, defaults to False) – Whether the attention layer uses sequence IDs.
alibi (bool, optional, defaults to True) – Whether to use the ALiBi (Attention with Linear Biases) method.
alibi_bias_max (int, optional, defaults to 8) – The maximum value for the ALiBi bias.
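A minimal sketch of how alibi_bias_max can enter the ALiBi computation, following the formulation commonly used for MPT-style models. It is an illustration only: alibi_slopes and alibi_bias are hypothetical helpers, the code is restricted to a power-of-two number of heads, and whether EasyDeL computes the bias exactly this way is an assumption.

```python
import math


def alibi_slopes(n_heads: int, alibi_bias_max: int = 8) -> list[float]:
    """Per-head ALiBi slopes for a power-of-two number of heads (sketch)."""
    if n_heads <= 0 or n_heads & (n_heads - 1):
        raise ValueError("this sketch only handles a power-of-two n_heads")
    # Head i (1-based) gets slope 2 ** -(alibi_bias_max * i / n_heads),
    # so the last head always has slope 2 ** -alibi_bias_max.
    return [math.pow(2.0, -(alibi_bias_max * i / n_heads)) for i in range(1, n_heads + 1)]


def alibi_bias(n_heads: int, seq_len: int, alibi_bias_max: int = 8) -> list[list[float]]:
    """Bias table indexed as [head][key_position] (sketch)."""
    # Distance of each key position from the last position: -(seq_len - 1), ..., -1, 0.
    distances = range(1 - seq_len, 1)
    return [[slope * d for d in distances] for slope in alibi_slopes(n_heads, alibi_bias_max)]
```

The bias is added to the attention scores before the softmax, so more distant keys receive a larger penalty, and heads with smaller slopes attend further back.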

classmethod from_pretrained(pretrained_model_name_or_path, **kwargs) → EasyDeLBaseConfig[source]#

Loads attention configuration from a pretrained model configuration file.

Parameters

cls (type) – The class itself.
pretrained_model_name_or_path (str) – Path or identifier of the pretrained model.
**kwargs – Additional keyword arguments passed to get_config_dict and from_dict.

Returns

An instance of MptAttentionConfig loaded from the pretrained model.

Return type

EasyDeLBaseConfig
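A rough sketch of the kind of step such a loader might perform. It assumes, without confirmation from the description above, that the attention settings sit under an attn_config key when the loaded file describes a full model whose model_type is 'mpt'. extract_attn_dict is a hypothetical helper, not an EasyDeL function.

```python
def extract_attn_dict(config_dict: dict) -> dict:
    """Return the attention sub-dictionary of a loaded config (hypothetical helper)."""
    # Assumption: a full model config nests the attention settings under "attn_config".
    if config_dict.get("model_type") == "mpt" and "attn_config" in config_dict:
        return dict(config_dict["attn_config"])
    # Otherwise treat the dictionary as an attention config already.
    return dict(config_dict)
```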

class easydel.modules.mosaic_mpt.__init__.MptConfig(d_model: int = 2048, n_heads: int = 16, n_layers: int = 24, expansion_ratio: int = 4, max_seq_len: int = 2048, vocab_size: int = 50368, resid_prob_drop: float = 0.0, layer_norm_epsilon: float = 1e-05, emb_prob_drop: float = 0.0, learned_pos_emb: bool = True, attn_config: Optional[MptAttentionConfig] = None, init_device: str = 'cpu', logit_scale: Optional[Union[float, str]] = None, no_bias: bool = True, verbose: int = 0, embedding_fraction: float = 1.0, norm_type: str = 'low_precision_layernorm', use_cache: bool = False, initializer_range=0.02, alibi: bool = True, use_bias: bool = False, act_fn: str = 'gelu', qk_ln: bool = False, use_lm_head: bool = False, use_norm_bias: bool = False, gradient_checkpointing: EasyDeLGradientCheckPointers = EasyDeLGradientCheckPointers.NONE, bits: Optional[int] = None, **kwargs)[source]#

Bases: EasyDeLBaseConfig

Configuration objects inherit from EasyDeLBaseConfig and can be used to control the model outputs. See the EasyDeLBaseConfig documentation for more information.

Parameters

d_model (int, optional, defaults to 2048) – Dimensionality of the hidden states (token embeddings and transformer blocks).
n_heads (int, optional, defaults to 16) – Number of attention heads in each attention layer of the Transformer.
n_layers (int, optional, defaults to 24) – Number of hidden layers (transformer blocks) in the model.
expansion_ratio (int, optional, defaults to 4) – Expansion ratio of the feed-forward layer.
max_seq_len (int, optional, defaults to 2048) – The maximum sequence length that this model might ever be used with. Typically set this to something large just in case (e.g., 2048 or 4096).
vocab_size (int, optional, defaults to 50368) – Vocabulary size of the MPT model. Defines the number of different tokens that can be represented by the input_ids passed to the forward method.
resid_prob_drop (float, optional, defaults to 0.0) – The residual dropout probability, applied within each transformer block.
layer_norm_epsilon (float, optional, defaults to 1e-5) – The epsilon used by the layer normalization layers.
emb_prob_drop (float, optional, defaults to 0.0) – The dropout ratio for the embeddings.
learned_pos_emb (bool, optional, defaults to True) – Whether to learn positional embeddings.
attn_config (MptAttentionConfig, optional) – The configuration of the attention layer.
init_device (str, optional, defaults to “cpu”) – The device to initialize the model on.
logit_scale (float or str, optional) – The logit scale. If set to “inv_sqrt_d_model”, the logit scale is calculated as 1 / math.sqrt(d_model).
no_bias (bool, optional, defaults to True) – Whether to omit the bias terms in the linear layers.
verbose (int, optional, defaults to 0) – The verbosity level.
embedding_fraction (float, optional, defaults to 1.0) – The fraction of the embedding matrix to use.
norm_type (str, optional, defaults to “low_precision_layernorm”) – The type of layer normalization to use.
use_cache (bool, optional, defaults to False) – Whether or not the model should return the last key/values attentions (not used by all models). Only relevant if config.is_decoder=True.
initializer_range (float, optional, defaults to 0.02) – The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
alibi (bool, optional, defaults to True) – Whether to use ALiBi (Attention with Linear Biases) method.
use_bias (bool, optional, defaults to False) – Whether to use bias in the linear layers.
act_fn (str, optional, defaults to “gelu”) – The activation function to use.
qk_ln (bool, optional, defaults to False) – Whether to apply layer normalization to the query and key tensors.
use_lm_head (bool, optional, defaults to False) – Whether to use a language modeling head.
use_norm_bias (bool, optional, defaults to False) – Whether to use bias in the layer normalization layers.
gradient_checkpointing (EasyDeLGradientCheckPointers, optional, defaults to EasyDeLGradientCheckPointers.NONE) – The gradient checkpointing strategy.
bits (int, optional) – The number of bits to quantize the model to.
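The logit_scale rule in the parameter list can be sketched as a small resolver. resolve_logit_scale is a hypothetical helper written for illustration; only the 'inv_sqrt_d_model' case is described above, so rejecting other strings and returning None when no scale is set are assumptions.

```python
import math
from typing import Optional, Union


def resolve_logit_scale(
    logit_scale: Optional[Union[float, str]],
    d_model: int,
) -> Optional[float]:
    """Turn a logit_scale setting into a number, or None when unset (sketch)."""
    if logit_scale is None:
        return None
    if isinstance(logit_scale, str):
        if logit_scale == "inv_sqrt_d_model":
            # The documented special case: 1 / math.sqrt(d_model).
            return 1 / math.sqrt(d_model)
        # Assumption: any other string is an error.
        raise ValueError(f"unsupported logit_scale: {logit_scale!r}")
    return float(logit_scale)
```

With the default d_model of 2048, the 'inv_sqrt_d_model' setting resolves to 1 / math.sqrt(2048), roughly 0.0221.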

attach_custom_arguments(gradient_checkpointing: EasyDeLGradientCheckPointers = EasyDeLGradientCheckPointers.NONE, bits: Optional[int] = None, **kwargs)[source]#

Attaches custom arguments to the configuration object.

This method allows adding or overriding configuration attributes dynamically. It primarily sets attributes related to gradient checkpointing and quantization bits.

Parameters

gradient_checkpointing (EasyDeLGradientCheckPointers, optional) – Gradient checkpointing strategy. Defaults to EasyDeLGradientCheckPointers.NONE.
bits (tp.Optional[int], optional) – Quantization bits. Defaults to None.
**kwargs – Additional keyword arguments (ignored).

attribute_map: dict[str, str] = {'hidden_size': 'd_model', 'max_position_embeddings': 'max_seq_len', 'num_attention_heads': 'n_heads', 'num_hidden_layers': 'n_layers', 'tie_word_embeddings': 'use_lm_head'}#
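An attribute map of this kind usually lets generic names such as hidden_size resolve to the model-specific fields. The sketch below shows one way such aliasing can work. AliasedConfig is a hypothetical stand-in class, and the lookup mechanism is an assumption rather than EasyDeL's implementation; only the mapping itself is taken from the entry above.

```python
class AliasedConfig:
    """Hypothetical stand-in that resolves aliases through attribute_map."""

    attribute_map = {
        "hidden_size": "d_model",
        "max_position_embeddings": "max_seq_len",
        "num_attention_heads": "n_heads",
        "num_hidden_layers": "n_layers",
        "tie_word_embeddings": "use_lm_head",
    }

    def __init__(self, d_model=2048, n_heads=16, n_layers=24, max_seq_len=2048, use_lm_head=False):
        self.d_model = d_model
        self.n_heads = n_heads
        self.n_layers = n_layers
        self.max_seq_len = max_seq_len
        self.use_lm_head = use_lm_head

    def __getattr__(self, name):
        # Only called when normal lookup fails, so real attributes take priority.
        target = type(self).attribute_map.get(name)
        if target is None:
            raise AttributeError(name)
        return getattr(self, target)

    def __setattr__(self, name, value):
        # Writes to an alias are redirected to the underlying field.
        super().__setattr__(type(self).attribute_map.get(name, name), value)


cfg = AliasedConfig()
cfg.num_hidden_layers = 12  # stored as n_layers
```

Code written against the generic names can then read cfg.hidden_size or cfg.num_attention_heads without knowing the MPT-specific field names.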

get_partition_rules(*args, **kwargs)[source]#

Get the partition rules for the model. This method defines how the model’s parameters are partitioned across devices for distributed training and inference.

Parameters

*args – Additional positional arguments (unused).
**kwargs – Additional keyword arguments (unused).

Returns

A tuple of partition rules, where each rule is a tuple containing a regex pattern for parameter names and the corresponding PartitionSpec.

Return type

tp.Tuple[tp.Tuple[str, PartitionSpec]]
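To illustrate how (regex, spec) pairs of this shape are typically consumed, the sketch below matches a parameter name against a rule table and returns the first spec whose pattern matches. The patterns, the axis names, and the first-match-wins policy are all hypothetical, and plain tuples stand in for PartitionSpec so the example runs without JAX.

```python
import re

# Hypothetical rules: plain tuples stand in for PartitionSpec objects.
EXAMPLE_RULES = (
    (r"wte/embedding", ("tp", "fsdp")),
    (r"attn/Wqkv/kernel", ("fsdp", "tp")),
    (r".*", (None,)),  # catch-all: leave everything else unsharded
)


def match_partition_rule(rules, param_name):
    """Return the spec of the first rule whose regex matches param_name."""
    for pattern, spec in rules:
        if re.search(pattern, param_name):
            return spec
    raise ValueError(f"no partition rule matches {param_name!r}")
```

Under a first-match policy, specific patterns should come before the catch-all, because a rule is never reached once an earlier pattern has matched.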

property granted_freq_max_position_embedding: int#

Returns the maximum position embedding size specifically for frequency-based position embeddings.

If freq_max_position_embeddings is set, it returns that value. Otherwise, it falls back to max_seq_len.

Returns: The granted maximum position embedding size for frequency encoding.
Return type: int

property granted_mask_max_position_embedding: int#

Returns the maximum position embedding size specifically for mask-based position embeddings.

If mask_max_position_embeddings is set, it returns that value. Otherwise, it falls back to max_seq_len.

Returns: The granted maximum position embedding size for mask encoding.
Return type: int
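The fallback described for both properties can be sketched as follows. PositionLimits is a hypothetical stand-in class, and treating None as "not set" is an assumption.

```python
from typing import Optional


class PositionLimits:
    """Hypothetical stand-in showing the documented fallback to max_seq_len."""

    def __init__(
        self,
        max_seq_len: int = 2048,
        freq_max_position_embeddings: Optional[int] = None,
        mask_max_position_embeddings: Optional[int] = None,
    ):
        self.max_seq_len = max_seq_len
        self.freq_max_position_embeddings = freq_max_position_embeddings
        self.mask_max_position_embeddings = mask_max_position_embeddings

    @property
    def granted_freq_max_position_embedding(self) -> int:
        # Use the explicit value when present, otherwise fall back to max_seq_len.
        if self.freq_max_position_embeddings is not None:
            return self.freq_max_position_embeddings
        return self.max_seq_len

    @property
    def granted_mask_max_position_embedding(self) -> int:
        if self.mask_max_position_embeddings is not None:
            return self.mask_max_position_embeddings
        return self.max_seq_len
```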

model_type: str = 'mpt'#

class easydel.modules.mosaic_mpt.__init__.MptForCausalLM(*args: Any, **kwargs: Any)[source]#

Bases: EasyDeLBaseModule

MPT model with a language modeling head.

This model extends the base MptModel by adding a linear layer (lm_head) on top to predict the next token in a sequence, making it suitable for causal language modeling tasks.

config#

Configuration object for the model.

Type: MptConfig

dtype#

Data type for computations.

Type: jnp.dtype

param_dtype#

Data type for parameters.

Type: jnp.dtype

precision#

Precision setting for JAX operations.

Type: jax.lax.PrecisionLike

rngs#

Random number generators.

Type: nn.Rngs

transformer#

The core MPT transformer model.

Type: MptModel

lm_head#

The language modeling head. If use_lm_head in the config is True (tying embeddings), this will be None.

Type: ParallelLinear, optional

class easydel.modules.mosaic_mpt.__init__.MptModel(*args: Any, **kwargs: Any)[source]#

Bases: EasyDeLBaseModule

MPT model implementation.

This class implements the main MPT transformer model architecture, consisting of an embedding layer (token and optional positional), multiple MptBlock layers, and a final layer normalization.

config#

Configuration object for the model.

Type: MptConfig

dtype#

Data type for computations.

Type: jnp.dtype

param_dtype#

Data type for parameters.

Type: jnp.dtype

precision#

Precision setting for JAX operations.

Type: jax.lax.PrecisionLike

rngs#

Random number generators.

Type: nn.Rngs

wte#

Token embedding layer.

Type: nn.Embed

emb_drop#

Dropout layer applied after embeddings.

Type: nn.Dropout

blocks#

List of transformer blocks.

Type: tp.List[MptBlock]

norm_f#

Final layer normalization.

Type: nn.LayerNorm

alibi#

Precomputed ALiBi tensor if using ALiBi.

Type: chex.Array, optional

property alibi#
