easydel.modules.olmo.olmo_configuration

class easydel.modules.olmo.olmo_configuration.OlmoConfig(vocab_size=50304, hidden_size=4096, intermediate_size=11008, num_hidden_layers=32, num_attention_heads=32, num_key_value_heads=None, hidden_act='silu', max_position_embeddings=2048, initializer_range=0.02, use_cache=True, pad_token_id=1, bos_token_id=None, eos_token_id=50279, tie_word_embeddings=False, rope_theta=10000.0, rope_scaling=None, attention_bias=False, attention_dropout=0.0, clip_qkv=None, gradient_checkpointing: EasyDeLGradientCheckPointers = EasyDeLGradientCheckPointers.NONE, use_scan_mlp: bool = False, scan_mlp_chunk_size: int = 1024, bits: Optional[int] = None, **kwargs)

Bases: EasyDeLBaseConfig

Configuration objects inherit from [EasyDeLBaseConfig] and can be used to control the model outputs. Read the documentation from [EasyDeLBaseConfig] for more information.

Parameters

vocab_size (int, optional, defaults to 50304) – Vocabulary size of the Olmo model. Defines the number of different tokens that can be represented by the input_ids passed to the forward method.
hidden_size (int, optional, defaults to 4096) – Dimensionality of the hidden representations.
intermediate_size (int, optional, defaults to 11008) – Dimensionality of the “intermediate” (i.e., feed-forward) layer in each Transformer decoder block.
num_hidden_layers (int, optional, defaults to 32) – Number of hidden layers in the Transformer decoder.
num_attention_heads (int, optional, defaults to 32) – Number of attention heads for each attention layer in the Transformer decoder.
num_key_value_heads (int, optional) – Number of key and value heads for each attention layer in the Transformer decoder. Defaults to num_attention_heads if not set.
hidden_act (str or function, optional, defaults to “silu”) – The non-linear activation function (function or string) to use in the decoder. The default is “silu”; if a string is given, “gelu”, “relu”, “swish” and “gelu_new” are also supported.
max_position_embeddings (int, optional, defaults to 2048) – The maximum sequence length that this model might ever be used with. Typically set this to something large just in case (e.g., 2048 or 4096).
initializer_range (float, optional, defaults to 0.02) – The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
use_cache (bool, optional, defaults to True) – Whether the model should return the last key/value attention states (not used by all models). Only relevant if config.is_decoder=True.
pad_token_id (int, optional, defaults to 1) – The index of the padding token in the vocabulary.
bos_token_id (int, optional) – The id of the beginning-of-sequence token.
eos_token_id (int, optional, defaults to 50279) – The id of the end-of-sequence token.
tie_word_embeddings (bool, optional, defaults to False) – Whether to tie the weights of the input embeddings and the output embeddings.
rope_theta (float, optional, defaults to 10000.0) – The theta value to use for rotary position embeddings.
rope_scaling (tp.Dict[str, tp.Union[str, float]], optional) – The configuration for rope scaling.
attention_bias (bool, optional, defaults to False) – Whether to use a bias term in the attention projection layers.
attention_dropout (float, optional, defaults to 0.0) – The dropout ratio for the attention probabilities.
clip_qkv (float, optional) – If set, the clip value applied to the query, key, and value tensors. No clipping is applied when it is None.
gradient_checkpointing (EasyDeLGradientCheckPointers, optional, defaults to EasyDeLGradientCheckPointers.NONE) – The gradient checkpointing strategy.
use_scan_mlp (bool, optional, defaults to False) – Whether to use the scan implementation for the MLP.
scan_mlp_chunk_size (int, optional, defaults to 1024) – The chunk size to use when scanning the MLP.
bits (int, optional) – The number of bits to quantize the model to.
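
As a rough illustration of the fallback described for num_key_value_heads, the sketch below uses a small stand-in dataclass rather than the real OlmoConfig. The class name OlmoConfigSketch and the kv_groups helper are hypothetical and exist only for this example; the default values are copied from the signature above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class OlmoConfigSketch:
    """Hypothetical stand-in for illustration; not the EasyDeL class."""

    hidden_size: int = 4096
    num_attention_heads: int = 32
    num_key_value_heads: Optional[int] = None

    def __post_init__(self) -> None:
        # Documented behaviour: fall back to num_attention_heads when unset.
        if self.num_key_value_heads is None:
            self.num_key_value_heads = self.num_attention_heads

    @property
    def kv_groups(self) -> int:
        # Assumed helper: number of query heads sharing each key/value head.
        return self.num_attention_heads // self.num_key_value_heads


default_cfg = OlmoConfigSketch()
gqa_cfg = OlmoConfigSketch(num_key_value_heads=8)
print(default_cfg.num_key_value_heads, default_cfg.kv_groups)  # 32 1
print(gqa_cfg.num_key_value_heads, gqa_cfg.kv_groups)  # 8 4
```

With the defaults, every query head has its own key/value head; passing a smaller num_key_value_heads makes several query heads share one.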

attach_custom_arguments(gradient_checkpointing: EasyDeLGradientCheckPointers = EasyDeLGradientCheckPointers.NONE, use_scan_mlp: bool = False, scan_mlp_chunk_size: int = 1024, bits: Optional[int] = None)

Attaches custom arguments to the configuration object.

This method allows adding or overriding configuration attributes dynamically. It primarily sets attributes related to gradient checkpointing, MLP scanning, and quantization bits.

Parameters

gradient_checkpointing (EasyDeLGradientCheckPointers, optional) – Gradient checkpointing strategy. Defaults to EasyDeLGradientCheckPointers.NONE.
use_scan_mlp (bool, optional) – Whether to use scan for MLP layers. Defaults to False.
scan_mlp_chunk_size (int, optional) – Chunk size for scan MLP. Defaults to 1024.
bits (tp.Optional[int], optional) – Quantization bits. Defaults to None.
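
One way to picture what this method does is the following sketch, which assumes each argument is simply stored as an attribute of the same name. ConfigSketch and CheckpointPolicySketch are hypothetical stand-ins for the real config class and for EasyDeLGradientCheckPointers; the actual implementation may differ.

```python
from enum import Enum, auto
from typing import Optional


class CheckpointPolicySketch(Enum):
    """Hypothetical stand-in for EasyDeLGradientCheckPointers."""

    NONE = auto()


class ConfigSketch:
    """Hypothetical stand-in; not the EasyDeL implementation."""

    def attach_custom_arguments(
        self,
        gradient_checkpointing: CheckpointPolicySketch = CheckpointPolicySketch.NONE,
        use_scan_mlp: bool = False,
        scan_mlp_chunk_size: int = 1024,
        bits: Optional[int] = None,
    ) -> None:
        # Assumption: each argument becomes an attribute with the same name,
        # overriding any value set by an earlier call.
        self.gradient_checkpointing = gradient_checkpointing
        self.use_scan_mlp = use_scan_mlp
        self.scan_mlp_chunk_size = scan_mlp_chunk_size
        self.bits = bits


cfg = ConfigSketch()
cfg.attach_custom_arguments(use_scan_mlp=True, bits=8)
print(cfg.use_scan_mlp, cfg.scan_mlp_chunk_size, cfg.bits)  # True 1024 8
```

In this sketch, arguments that are not passed are reset to their defaults on every call.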

get_partition_rules(*args, **kwargs)

Get the partition rules for the model. This method defines how the model’s parameters are partitioned across devices for distributed training and inference.

Parameters

*args – Additional positional arguments (unused).
**kwargs – Additional keyword arguments (unused).

Returns

A tuple of partition rules, where each rule is a tuple containing a regex pattern for parameter names and the corresponding PartitionSpec.

Return type

tp.Tuple[tp.Tuple[str, PartitionSpec]]
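
To show how rules of this shape are typically consumed, the sketch below matches parameter names against (regex, spec) pairs and returns the first hit. Everything here is an assumption for illustration: the patterns, the parameter names, the axis names, and the first-match-wins policy are not taken from EasyDeL, and plain tuples stand in for PartitionSpec so the example runs without JAX.

```python
import re
from typing import Optional, Tuple

Spec = Tuple[Optional[str], ...]  # stand-in for PartitionSpec

# Hypothetical rules; the real ones come from get_partition_rules().
RULES: Tuple[Tuple[str, Spec], ...] = (
    (r"embed_tokens/embedding", ("tp", "fsdp")),
    (r"self_attn/(q_proj|k_proj|v_proj)/kernel", ("fsdp", "tp")),
    (r".*", (None,)),  # catch-all: leave the parameter unsharded
)


def match_rule(param_name: str, rules: Tuple[Tuple[str, Spec], ...]) -> Spec:
    """Return the spec of the first rule whose regex matches the name."""
    for pattern, spec in rules:
        if re.search(pattern, param_name):
            return spec
    raise ValueError(f"no partition rule matches {param_name!r}")


print(match_rule("model/layers/0/self_attn/q_proj/kernel", RULES))  # ('fsdp', 'tp')
print(match_rule("model/norm/scale", RULES))  # (None,)
```

Because the first matching rule is used in this sketch, specific patterns go before the catch-all.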

property granted_freq_max_position_embedding: int

Returns the maximum position embedding size specifically for frequency-based position embeddings.

If freq_max_position_embeddings is set, it returns that value. Otherwise, it falls back to max_position_embeddings.

Returns

The granted maximum position embedding size for frequency encoding.

Return type

int

property granted_mask_max_position_embedding: int

Returns the maximum position embedding size specifically for mask-based position embeddings.

If mask_max_position_embeddings is set, it returns that value. Otherwise, it falls back to max_position_embeddings.

Returns

The granted maximum position embedding size for mask encoding.

Return type

int
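
The fallback shared by these two properties can be sketched as follows. PositionLimitsSketch is a hypothetical stand-in, and treating "set" as "not None" is an assumption; the real properties may check the attributes differently.

```python
from typing import Optional


class PositionLimitsSketch:
    """Hypothetical stand-in; not the EasyDeL implementation."""

    def __init__(
        self,
        max_position_embeddings: int = 2048,
        freq_max_position_embeddings: Optional[int] = None,
        mask_max_position_embeddings: Optional[int] = None,
    ) -> None:
        self.max_position_embeddings = max_position_embeddings
        self.freq_max_position_embeddings = freq_max_position_embeddings
        self.mask_max_position_embeddings = mask_max_position_embeddings

    @property
    def granted_freq_max_position_embedding(self) -> int:
        # Prefer the frequency-specific limit; otherwise use the general one.
        if self.freq_max_position_embeddings is not None:
            return self.freq_max_position_embeddings
        return self.max_position_embeddings

    @property
    def granted_mask_max_position_embedding(self) -> int:
        # Prefer the mask-specific limit; otherwise use the general one.
        if self.mask_max_position_embeddings is not None:
            return self.mask_max_position_embeddings
        return self.max_position_embeddings


limits = PositionLimitsSketch(freq_max_position_embeddings=8192)
print(limits.granted_freq_max_position_embedding)  # 8192
print(limits.granted_mask_max_position_embedding)  # 2048
```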

model_type: str = 'olmo'
