easydel.modules.falcon.__init__
- class easydel.modules.falcon.__init__.FalconConfig(vocab_size=65024, hidden_size=4544, num_hidden_layers=32, num_attention_heads=71, num_ln_in_parallel_attn=None, layer_norm_epsilon=1e-05, initializer_range=0.02, use_cache=True, hidden_dropout=0.0, attention_dropout=0.0, num_kv_heads=None, alibi=False, new_decoder_architecture=False, multi_query=True, parallel_attn=True, bias=False, max_position_embeddings=2048, rope_theta=10000.0, rope_scaling=None, bos_token_id=11, eos_token_id=11, ffn_hidden_size=None, ff_factor=None, activation='gelu', gradient_checkpointing: EasyDeLGradientCheckPointers = EasyDeLGradientCheckPointers.NONE, bits: Optional[int] = None, **kwargs)
Bases:
EasyDeLBaseConfig
Configuration objects inherit from [EasyDeLBaseConfig] and can be used to control the model outputs. See the [EasyDeLBaseConfig] documentation for more information.
- Parameters
vocab_size (int, optional, defaults to 65024) – Vocabulary size of the Falcon model. Defines the number of different tokens that can be represented by the input_ids passed to the forward method.
hidden_size (int, optional, defaults to 4544) – Dimensionality of the hidden representations.
num_hidden_layers (int, optional, defaults to 32) – Number of hidden layers in the Transformer decoder.
num_attention_heads (int, optional, defaults to 71) – Number of attention heads for each attention layer in the Transformer decoder.
num_ln_in_parallel_attn (int, optional) – The number of layer norms in the parallel attention layer.
layer_norm_epsilon (float, optional, defaults to 1e-5) – The epsilon used by the layer normalization layers.
initializer_range (float, optional, defaults to 0.02) – The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
use_cache (bool, optional, defaults to True) – Whether or not the model should return the last key/values attentions (not used by all models). Only relevant if config.is_decoder=True.
hidden_dropout (float, optional, defaults to 0.0) – The dropout probability for the fully connected layers.
attention_dropout (float, optional, defaults to 0.0) – The dropout ratio for the attention probabilities.
num_kv_heads (int, optional) – Number of key and value heads for each attention layer in the Transformer decoder. Defaults to num_attention_heads if not set.
alibi (bool, optional) – Whether to use alibi attention.
new_decoder_architecture (bool, optional) – Whether to use the new decoder architecture.
multi_query (bool, optional, defaults to True) – Whether to use multi-query attention.
parallel_attn (bool, optional, defaults to True) – Whether to use parallel attention.
bias (bool, optional, defaults to False) – Whether to use bias in the linear layers.
max_position_embeddings (int, optional, defaults to 2048) – The maximum sequence length that this model might ever be used with. Typically set this to something large just in case (e.g., 2048 or 4096).
rope_theta (float, optional, defaults to 10000.0) – The theta value to use for rotary position embeddings.
rope_scaling (tp.Dict[str, tp.Union[str, float]], optional) – The rope scaling configuration.
bos_token_id (int, optional, defaults to 11) – The index of the beginning of sequence token in the vocabulary.
eos_token_id (int, optional, defaults to 11) – The index of the end of sequence token in the vocabulary.
ffn_hidden_size (int, optional) – Dimensionality of the hidden layer in the FFN.
ff_factor (int, optional) – The scaling factor of the FFN.
activation (str, optional, defaults to 'gelu') – The non-linear activation function, given as a string. 'gelu', 'relu', 'swish' and 'gelu_new' are supported.
gradient_checkpointing (EasyDeLGradientCheckPointers, optional, defaults to EasyDeLGradientCheckPointers.NONE) – The gradient checkpointing configuration.
bits (int, optional) – The number of bits to quantize the model to.
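As a rough illustration of how a few of these parameters relate, the sketch below derives the per-head dimension and the num_kv_heads fallback from the documented defaults. It uses a plain dataclass, `FalconShapeSketch` (a hypothetical name), not the real FalconConfig; the `head_dim` and `effective_kv_heads` helpers are assumptions made for illustration only.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FalconShapeSketch:
    """Hypothetical stand-in mirroring a few documented FalconConfig defaults."""

    hidden_size: int = 4544
    num_attention_heads: int = 71
    num_kv_heads: Optional[int] = None

    @property
    def head_dim(self) -> int:
        # Assumption: the hidden dimension is split evenly across attention heads.
        if self.hidden_size % self.num_attention_heads != 0:
            raise ValueError("hidden_size must be divisible by num_attention_heads")
        return self.hidden_size // self.num_attention_heads

    @property
    def effective_kv_heads(self) -> int:
        # Documented fallback: num_kv_heads defaults to num_attention_heads if not set.
        if self.num_kv_heads is None:
            return self.num_attention_heads
        return self.num_kv_heads


shapes = FalconShapeSketch()
print(shapes.head_dim)            # 4544 // 71 = 64
print(shapes.effective_kv_heads)  # 71
```

The real class may resolve the key/value head count differently when multi_query or new_decoder_architecture is set; this sketch mirrors only the fallback described in the parameter list.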
- attach_custom_arguments(gradient_checkpointing: EasyDeLGradientCheckPointers = EasyDeLGradientCheckPointers.NONE, bits: Optional[int] = None, **kwargs)
Attach custom arguments to the configuration.
- Parameters
gradient_checkpointing (EasyDeLGradientCheckPointers, optional) – Gradient checkpointing strategy. Defaults to EasyDeLGradientCheckPointers.NONE.
bits (int, optional) – Quantization bits. Defaults to None.
**kwargs – Additional keyword arguments.
- Returns
The updated configuration instance.
- Return type
- attribute_map: dict[str, str] = {'num_attention_heads': 'num_attention_heads', 'num_hidden_layers': 'num_hidden_layers'}
- static get_mesh_names()
Returns the mesh names used for model parallelism.
- Returns
A tuple containing 'dp', 'fsdp', and 'tp' as the mesh names.
- Return type
tuple
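To illustrate how such axis names are typically used, the sketch below pairs each name with a size and checks that the sizes multiply to the device count. The `build_axis_sizes` helper and the choice to place all leftover devices on the fsdp axis are assumptions for illustration; the tuple is hard-coded from the description above, not obtained from the library.

```python
from math import prod

MESH_NAMES = ("dp", "fsdp", "tp")  # copied from the documented return value


def build_axis_sizes(num_devices: int, dp: int = 1, tp: int = 1) -> dict:
    """Hypothetical helper: give leftover devices to the fsdp axis."""
    if num_devices % (dp * tp) != 0:
        raise ValueError("dp * tp must divide the number of devices")
    sizes = {"dp": dp, "fsdp": num_devices // (dp * tp), "tp": tp}
    # Every documented axis name gets a size, and the sizes cover all devices.
    assert tuple(sizes) == MESH_NAMES
    assert prod(sizes.values()) == num_devices
    return sizes


axis_sizes = build_axis_sizes(num_devices=8, tp=2)
print(axis_sizes)  # {'dp': 1, 'fsdp': 4, 'tp': 2}
```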
- get_partition_rules(*args, **kwargs)
Get the partition rules for the model.
- Returns
The partition rules.
- Return type
tp.Tuple[tp.Tuple[str, PartitionSpec]]
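The return type pairs a string with a PartitionSpec, which suggests pattern-based matching against parameter names. The sketch below shows one common way such rules are applied: the first pattern that matches a parameter path wins. The rule patterns, the parameter paths, and the `match_rule` helper are all hypothetical, and plain tuples of axis names stand in for PartitionSpec so the example has no dependencies.

```python
import re

# Hypothetical rules: (regex over the parameter path, axis names per dimension).
# Plain tuples stand in for PartitionSpec so the sketch needs no extra packages.
RULES = (
    (r"word_embeddings/embedding", ("tp", "fsdp")),
    (r"query_key_value/kernel", ("fsdp", "tp")),
    (r".*", (None,)),  # catch-all: replicate anything not matched above
)


def match_rule(param_path: str, rules=RULES):
    """Return the spec of the first rule whose pattern is found in the path."""
    for pattern, spec in rules:
        if re.search(pattern, param_path):
            return spec
    raise ValueError(f"no partition rule matches {param_path!r}")


print(match_rule("transformer/h/0/self_attention/query_key_value/kernel"))
print(match_rule("transformer/ln_f/scale"))
```

Because the first match wins, the catch-all pattern has to come last; the actual rules returned by the library are not reproduced here.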
- property granted_freq_max_position_embedding: int
Returns the maximum position embedding size for frequency-based position embeddings.
- Returns
The maximum position embedding size, falling back to max_position_embeddings if not explicitly set.
- Return type
int
- property granted_mask_max_position_embedding: int
Returns the maximum position embedding size for mask-based position embeddings.
- Returns
The maximum position embedding size, falling back to max_position_embeddings if not explicitly set.
- Return type
int
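The fallback shared by both properties can be sketched as follows. `PositionLimitSketch` is a hypothetical stand-in, not the real configuration class, and the names of the override attributes (`freq_max_position_embeddings`, `mask_max_position_embeddings`) are assumptions.

```python
class PositionLimitSketch:
    """Hypothetical stand-in showing the documented fallback behaviour."""

    def __init__(
        self,
        max_position_embeddings=2048,
        freq_max_position_embeddings=None,
        mask_max_position_embeddings=None,
    ):
        self.max_position_embeddings = max_position_embeddings
        # Assumption: the optional overrides live in attributes with these names.
        self.freq_max_position_embeddings = freq_max_position_embeddings
        self.mask_max_position_embeddings = mask_max_position_embeddings

    @property
    def granted_freq_max_position_embedding(self) -> int:
        # Fall back to max_position_embeddings when no override is set.
        if self.freq_max_position_embeddings is None:
            return self.max_position_embeddings
        return self.freq_max_position_embeddings

    @property
    def granted_mask_max_position_embedding(self) -> int:
        # Same fallback, applied to the mask-related limit.
        if self.mask_max_position_embeddings is None:
            return self.max_position_embeddings
        return self.mask_max_position_embeddings


default_limits = PositionLimitSketch()
custom_limits = PositionLimitSketch(freq_max_position_embeddings=8192)
print(default_limits.granted_freq_max_position_embedding)  # 2048
print(custom_limits.granted_freq_max_position_embedding)   # 8192
print(custom_limits.granted_mask_max_position_embedding)   # 2048
```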
- model_type: str = 'falcon'
- property rotary
- class easydel.modules.falcon.__init__.FalconForCausalLM(*args: Any, **kwargs: Any)
Bases:
EasyDeLBaseModule
Falcon model with a language modeling head for causal language modeling tasks.
This model extends the base FalconModel with a linear language modeling head for text generation. Depending on the configuration, it uses either ALiBi positional biases or rotary position embeddings (RoPE).
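For readers unfamiliar with RoPE, the sketch below computes the standard inverse-frequency schedule from a head dimension and a theta value, using the documented default rope_theta=10000.0 and a head dimension of 64 (4544 / 71 from the default configuration). The `rope_inverse_frequencies` and `position_scheme` functions are hypothetical illustrations, not the library's implementation.

```python
def rope_inverse_frequencies(head_dim: int, theta: float = 10000.0) -> list:
    """Standard RoPE schedule: theta ** (-2i / head_dim) for each dimension pair."""
    if head_dim % 2 != 0:
        raise ValueError("head_dim must be even")
    return [theta ** (-(2 * i) / head_dim) for i in range(head_dim // 2)]


def position_scheme(alibi: bool) -> str:
    """Hypothetical selector mirroring the alibi configuration flag."""
    return "alibi" if alibi else "rope"


inv_freq = rope_inverse_frequencies(head_dim=64)
print(position_scheme(alibi=False), len(inv_freq), inv_freq[0])  # rope 32 1.0
```

Each frequency rotates one pair of dimensions, so a head dimension of 64 yields 32 values that decrease from 1.0 toward 1 / theta.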
- class easydel.modules.falcon.__init__.FalconModel(*args: Any, **kwargs: Any)
Bases:
EasyDeLBaseModule