easydel.modules.internlm2.__init__#
- class easydel.modules.internlm2.__init__.InternLM2Config(vocab_size=103168, hidden_size=4096, intermediate_size=11008, num_hidden_layers=32, num_attention_heads=32, num_key_value_heads=None, hidden_act='silu', max_position_embeddings=2048, initializer_range=0.02, rms_norm_eps=1e-06, use_cache=True, pad_token_id=0, bos_token_id=1, eos_token_id=2, pretraining_tp=1, tie_word_embeddings=False, bias=True, rope_theta=10000, rope_scaling=None, gradient_checkpointing: EasyDeLGradientCheckPointers = EasyDeLGradientCheckPointers.NONE, fcm_min_ratio: float = -1, fcm_max_ratio: float = -1, scan_mlp_chunk_size: int = 1024, bits: Optional[int] = None, scan_layers: bool = False, **kwargs)[source]#
Bases:
EasyDeLBaseConfig
Configuration objects inherit from [EasyDeLBaseConfig] and can be used to control the model outputs. Read the documentation from [EasyDeLBaseConfig] for more information.
- Parameters
vocab_size (int, optional, defaults to 103168) – Vocabulary size of the InternLM2 model. Defines the number of different tokens that can be represented by the input_ids passed to the forward method.
hidden_size (int, optional, defaults to 4096) – Dimensionality of the hidden states in the Transformer decoder layers.
intermediate_size (int, optional, defaults to 11008) – Dimensionality of the “intermediate” (i.e., feed-forward) layer in each Transformer decoder block.
num_hidden_layers (int, optional, defaults to 32) – Number of hidden layers in the Transformer decoder.
num_attention_heads (int, optional, defaults to 32) – Number of attention heads for each attention layer in the Transformer decoder.
num_key_value_heads (int, optional) – Number of key and value heads for each attention layer in the Transformer decoder. Will default to number_rep_kv * num_attention_heads if not set.
max_position_embeddings (int, optional, defaults to 2048) – The maximum sequence length that this model might ever be used with. Typically set this to something large just in case (e.g., 2048 or 4096).
rms_norm_eps (float, optional, defaults to 1e-6) – The epsilon used by the rms normalization layers.
initializer_range (float, optional, defaults to 0.02) – The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
use_cache (bool, optional, defaults to True) – Whether or not the model should return the last key/values attentions (not used by all models). Only relevant if config.is_decoder=True.
pad_token_id (int, optional, defaults to 0) – The id of the pad token.
bos_token_id (int, optional, defaults to 1) – The id of the beginning-of-sequence token.
eos_token_id (int, optional, defaults to 2) – The id of the end-of-sequence token.
attention_dropout (float, optional, defaults to 0.0) – The dropout ratio for the attention probabilities.
rope_theta (float, optional, defaults to 10000.0) – The theta value to use for rotary position embeddings.
bias (bool, optional, defaults to True) – Whether to use attention bias.
tie_word_embeddings (bool, optional, defaults to False) – Whether to tie the weights of the input embeddings and the output embeddings.
gradient_checkpointing (EasyDeLGradientCheckPointers, optional, defaults to EasyDeLGradientCheckPointers.NONE) – The gradient checkpointing strategy.
fcm_min_ratio (float, optional, defaults to -1) – The minimum ratio for Flash Attention.
fcm_max_ratio (float, optional, defaults to -1) – The maximum ratio for Flash Attention.
rope_scaling (tp.Dict[str, tp.Union[str, float]], optional) – The configuration for rope scaling.
scan_mlp_chunk_size (int, optional, defaults to 1024) – The chunk size to use when scanning the MLP.
bits (int, optional) – The number of bits to quantize the model to.
hidden_act (str, optional, defaults to “silu”) – The hidden activation function to use.
pretraining_tp (int, optional, defaults to 1) – The tensor parallelism degree used during pretraining.
mlp_bias (bool, optional, defaults to False) – Whether to use bias in the MLP.
scan_layers (bool, optional, defaults to False) – Whether to use the scan implementation for the layers.
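As a rough illustration of how several of these parameters relate, the sketch below derives the per-head dimension, the number of query heads sharing each key/value head, and the rotary inverse frequencies. `ConfigSketch` is a hypothetical stand-in written for this example, not the EasyDeL class, and the `num_key_value_heads=8` value is chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class ConfigSketch:
    """Hypothetical stand-in for InternLM2Config; not the EasyDeL class."""

    hidden_size: int = 4096
    num_attention_heads: int = 32
    num_key_value_heads: int = 8  # illustrative value, not a library default
    rope_theta: float = 10000.0

    @property
    def head_dim(self) -> int:
        # Each attention head gets an equal slice of the hidden dimension.
        return self.hidden_size // self.num_attention_heads

    def queries_per_kv_head(self) -> int:
        # Grouped-query attention: several query heads share one KV head.
        return self.num_attention_heads // self.num_key_value_heads

    def rope_inv_freq(self) -> list:
        # Standard rotary inverse frequencies: theta ** (-i / head_dim)
        # for every even index i below head_dim.
        d = self.head_dim
        return [self.rope_theta ** (-i / d) for i in range(0, d, 2)]


cfg = ConfigSketch()
print(cfg.head_dim, cfg.queries_per_kv_head(), len(cfg.rope_inv_freq()))
```

A larger `rope_theta` makes the frequencies decay faster across the head dimension, which is the usual lever for stretching rotary embeddings to longer contexts.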
- attach_custom_arguments(tie_word_embeddings: bool = False, gradient_checkpointing: EasyDeLGradientCheckPointers = EasyDeLGradientCheckPointers.NONE, fcm_min_ratio: float = 0.0, fcm_max_ratio: float = 0.0, bits: Optional[int] = None, rope_theta: float = 10000.0, hidden_act: str = 'silu', scan_layers: bool = True, **kwargs)[source]#
Attaches custom arguments to the configuration object.
This method allows adding or overriding configuration attributes dynamically. It iterates through the provided arguments and sets them as attributes of the configuration object.
- Parameters
tie_word_embeddings (bool, optional) – Whether to tie input/output embeddings. Defaults to False.
gradient_checkpointing (EasyDeLGradientCheckPointers, optional) – Gradient checkpointing strategy. Defaults to EasyDeLGradientCheckPointers.NONE.
fcm_min_ratio (float, optional) – Minimum ratio for Flash Attention. Defaults to 0.0.
fcm_max_ratio (float, optional) – Maximum ratio for Flash Attention. Defaults to 0.0.
bits (tp.Optional[int], optional) – Quantization bits. Defaults to None.
rope_theta (float, optional) – Base value for RoPE. Defaults to 10000.0.
hidden_act (str, optional) – Activation function. Defaults to “silu”.
scan_layers (bool, optional) – Whether to use scan layers. Defaults to True.
**kwargs – Additional keyword arguments (ignored in this implementation).
- get_partition_rules(*args, **kwargs)[source]#
Get the partition rules for the model. This method defines how the model’s parameters are partitioned across devices for distributed training and inference.
- Parameters
*args – Additional positional arguments (unused).
**kwargs – Additional keyword arguments (unused).
- Returns
A tuple of partition rules, where each rule is a tuple containing a regex pattern for parameter names and the corresponding PartitionSpec.
- Return type
tp.Tuple[tp.Tuple[str, PartitionSpec]]
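To make the shape of the return value concrete, here is a minimal sketch of how (regex, spec) rules could be resolved against parameter paths. The patterns, the axis names, and the first-match-wins lookup are assumptions made for this example; plain tuples stand in for `PartitionSpec`, and `match_partition_rule` is a hypothetical helper, not part of EasyDeL.

```python
import re

# Hypothetical rules: (regex over the parameter path, sharding spec).
# Plain tuples of mesh-axis names stand in for jax.sharding.PartitionSpec.
RULES = (
    (r"embed_tokens/embedding", ("tp", "fsdp")),
    (r"attention/.*/kernel", ("fsdp", "tp")),
    (r".*", (None,)),  # catch-all: replicate anything not matched above
)


def match_partition_rule(rules, param_path):
    """Return the spec of the first rule whose pattern matches the path."""
    for pattern, spec in rules:
        if re.search(pattern, param_path):
            return spec
    raise ValueError(f"no partition rule matches {param_path!r}")


print(match_partition_rule(RULES, "model/layers/0/attention/wqkv/kernel"))
print(match_partition_rule(RULES, "model/norm/kernel"))
```

Under a first-match scheme like this one, rule order matters: specific patterns go first and a catch-all goes last.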
- static get_weight_decay_exclusions()[source]#
Returns a tuple of parameter names for which weight decay should be excluded.
- Returns
An empty tuple, indicating no specific weight decay exclusions for this model.
- Return type
tuple
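One way such an exclusion tuple could be consumed is to build a per-parameter weight decay mask. The sketch below assumes substring matching on parameter names; `decay_mask` is a hypothetical helper and the parameter names are invented for illustration.

```python
def decay_mask(param_names, exclusions):
    """Map each parameter name to True if weight decay should apply to it."""
    return {
        name: not any(excluded in name for excluded in exclusions)
        for name in param_names
    }


names = ["layers/0/attention/wqkv/kernel", "norm/kernel"]

# An empty tuple excludes nothing, so every parameter is decayed.
print(decay_mask(names, ()))

# A non-empty tuple switches decay off for the matching parameters.
print(decay_mask(names, ("norm",)))
```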
- property granted_freq_max_position_embedding: int#
Returns the maximum position embedding size specifically for frequency-based position embeddings.
If freq_max_position_embeddings is set, it returns that value. Otherwise, it falls back to max_position_embeddings.
- Returns
The granted maximum position embedding size for frequency encoding.
- Return type
int
- property granted_mask_max_position_embedding: int#
Returns the maximum position embedding size specifically for mask-based position embeddings.
If mask_max_position_embeddings is set, it returns that value. Otherwise, it falls back to max_position_embeddings.
- Returns
The granted maximum position embedding size for mask encoding.
- Return type
int
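The fallback behaviour of the two `granted_*` properties can be sketched as follows. `PositionLimitsSketch` is a hypothetical stand-in, and treating "set" as "not None" is an assumption of this example.

```python
from typing import Optional


class PositionLimitsSketch:
    """Hypothetical stand-in showing the fallback described above."""

    def __init__(
        self,
        max_position_embeddings: int = 2048,
        freq_max_position_embeddings: Optional[int] = None,
        mask_max_position_embeddings: Optional[int] = None,
    ):
        self.max_position_embeddings = max_position_embeddings
        self.freq_max_position_embeddings = freq_max_position_embeddings
        self.mask_max_position_embeddings = mask_max_position_embeddings

    @property
    def granted_freq_max_position_embedding(self) -> int:
        # Prefer the frequency-specific limit when one was provided.
        if self.freq_max_position_embeddings is not None:
            return self.freq_max_position_embeddings
        return self.max_position_embeddings

    @property
    def granted_mask_max_position_embedding(self) -> int:
        # Prefer the mask-specific limit when one was provided.
        if self.mask_max_position_embeddings is not None:
            return self.mask_max_position_embeddings
        return self.max_position_embeddings


limits = PositionLimitsSketch(freq_max_position_embeddings=8192)
print(limits.granted_freq_max_position_embedding)
print(limits.granted_mask_max_position_embedding)
```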
- model_type: str = 'internlm2'#
- class easydel.modules.internlm2.__init__.InternLM2ForCausalLM(*args: Any, **kwargs: Any)[source]#
Bases:
EasyDeLBaseModule
InternLM2 model with a Causal Language Modeling head.
This model consists of the base InternLM2 transformer (InternLM2Model) followed by a linear layer (lm_head) that projects the transformer’s output hidden states to the vocabulary size, producing logits for next token prediction.
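The projection step can be illustrated with a toy example in plain Python: one hidden vector is multiplied by a weight matrix of shape [hidden, vocab] to produce logits, and the highest logit gives a greedy next-token choice. The function names and the tiny weights are assumptions for this sketch, not the library's implementation.

```python
def lm_head_logits(hidden_state, lm_head_weight):
    """Project a hidden vector [hidden] with a weight [hidden][vocab]."""
    vocab_size = len(lm_head_weight[0])
    return [
        sum(h * row[v] for h, row in zip(hidden_state, lm_head_weight))
        for v in range(vocab_size)
    ]


def greedy_next_token(logits):
    """Return the index of the largest logit."""
    return max(range(len(logits)), key=logits.__getitem__)


hidden = [1.0, -2.0]              # hidden_size = 2
weight = [[0.5, 0.0, 1.0],        # vocab_size = 3
          [0.25, 1.0, 0.0]]

logits = lm_head_logits(hidden, weight)
print(logits, greedy_next_token(logits))
```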
- config#
Configuration object for the model.
- Type
InternLM2Config
- dtype#
Data type for computation. Default is jnp.float32.
- Type
jnp.dtype
- param_dtype#
Data type for parameters. Default is jnp.float32.
- Type
jnp.dtype
- precision#
Precision setting for JAX operations. Default is None.
- Type
jax.lax.PrecisionLike
- rngs#
Random number generators.
- Type
nn.Rngs
- module#
The core InternLM2 transformer model.
- Type
InternLM2Model
- lm_head#
The linear layer for projecting hidden states to vocabulary logits.
- Type
- class easydel.modules.internlm2.__init__.InternLM2ForSequenceClassification(*args: Any, **kwargs: Any)[source]#
Bases:
EasyDeLBaseModule
InternLM2 model with a Sequence Classification head.
This model consists of the base InternLM2 transformer (InternLM2Model) followed by a linear layer (score) that projects the transformer’s output hidden states (typically the hidden state of the last non-padding token, as is usual for decoder-only models) to the number of classes for classification.
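A toy sketch of the classification head: one token's hidden state is pooled and projected to per-class logits. The pooled position is passed in explicitly, since the exact selection rule is an implementation detail; `classify` and the small matrices are hypothetical and exist only for this example.

```python
def classify(hidden_states, score_weight, position):
    """Pool one token and project it to class logits.

    hidden_states: [seq_len][hidden]; score_weight: [hidden][num_labels].
    """
    pooled = hidden_states[position]
    num_labels = len(score_weight[0])
    return [
        sum(h * row[c] for h, row in zip(pooled, score_weight))
        for c in range(num_labels)
    ]


hidden_states = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]  # seq_len = 3
score_weight = [[1.0, -1.0], [0.5, 0.5]]              # num_labels = 2

print(classify(hidden_states, score_weight, position=-1))
```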
- config#
Configuration object for the model.
- Type
InternLM2Config
- dtype#
Data type for computation. Default is jnp.float32.
- Type
jnp.dtype
- param_dtype#
Data type for parameters. Default is jnp.float32.
- Type
jnp.dtype
- precision#
Precision setting for JAX operations. Default is None.
- Type
jax.lax.PrecisionLike
- rngs#
Random number generators.
- Type
nn.Rngs
- module#
The core InternLM2 transformer model.
- Type
InternLM2Model
- score#
The linear layer for classification.
- Type
- class easydel.modules.internlm2.__init__.InternLM2Model(*args: Any, **kwargs: Any)[source]#
Bases:
EasyDeLBaseModule
The base InternLM2 model transformer.
This class represents the core transformer architecture of the InternLM2 model, consisting of embedding layers, multiple transformer blocks, and a final layer normalization.
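The embed, blocks, final-norm pipeline can be sketched in plain Python. This is a simplified illustration that assumes the final normalization is an RMS norm (in line with `rms_norm_eps`); `toy_block` is a placeholder for a real transformer block, and none of the names below come from EasyDeL.

```python
import math


def rms_norm(x, weight, eps=1e-6):
    """Scale x by the reciprocal of its root mean square, then by weight."""
    mean_sq = sum(v * v for v in x) / len(x)
    scale = 1.0 / math.sqrt(mean_sq + eps)
    return [v * scale * w for v, w in zip(x, weight)]


def toy_block(hidden):
    """Placeholder for a transformer block: adds a constant to every value."""
    return [[v + 1.0 for v in h] for h in hidden]


def forward(token_ids, embedding, blocks, norm_weight, eps=1e-6):
    hidden = [list(embedding[t]) for t in token_ids]  # embedding lookup
    for block in blocks:                              # transformer blocks
        hidden = block(hidden)
    return [rms_norm(h, norm_weight, eps) for h in hidden]  # final norm


embedding = [[0.0, 0.0], [3.0, 3.0]]  # vocab_size = 2, hidden_size = 2
out = forward([1], embedding, [toy_block], norm_weight=[1.0, 1.0])
print(out)
```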
- config#
Configuration object for the model.
- Type
InternLM2Config
- dtype#
Data type for computation. Default is jnp.float32.
- Type
jnp.dtype
- param_dtype#
Data type for parameters. Default is jnp.float32.
- Type
jnp.dtype
- precision#
Precision setting for JAX operations. Default is None.
- Type
jax.lax.PrecisionLike
- embed_tokens#
Embedding layer for input tokens.
- Type
nn.Embed
- layers#
Sequence of transformer blocks.
- Type
tp.Sequence[InternLM2Block]
- gradient_checkpointing#
Gradient checkpointing configuration.
- scan_layers#
Whether to use JAX scan for layer processing.
- Type
bool
- blocks_class#
The class used for the transformer blocks.
- Type