easydel.modules.dbrx.__init__

class easydel.modules.dbrx.__init__.DbrxAttentionConfig(attn_pdrop: float = 0, clip_qkv: Optional[float] = 8, kv_n_heads: int = 1, rope_theta: float = 10000.0, **kwargs: Any)

Bases: EasyDeLBaseConfig

This is the configuration class to store the attention related configuration of a [DbrxModel].

Parameters
  • attn_pdrop (float, optional, defaults to 0.0) – The dropout probability applied to the attention output.

  • clip_qkv (float, optional, defaults to 8.0) – The clip value applied to the query, key, and value tensors.

  • kv_n_heads (int, optional, defaults to 1) – The number of attention heads for the key and value tensors.

  • rope_theta (float, optional, defaults to 10000.0) – The theta value for the rotary position embedding.
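
As a rough illustration of what two of these fields control, the sketch below clamps projected values to `[-clip_qkv, clip_qkv]` and maps query heads onto shared key/value heads. The helper names `clip_values` and `kv_head_for_query_head` are hypothetical and not part of EasyDeL. The grouping rule (each key/value head serves `n_heads // kv_n_heads` consecutive query heads) is the usual grouped-query convention and is assumed here, not taken from the EasyDeL source.

```python
from typing import List, Optional


def clip_values(values: List[float], clip_qkv: Optional[float]) -> List[float]:
    """Clamp each value to [-clip_qkv, clip_qkv]; None disables clipping."""
    if clip_qkv is None:
        return list(values)
    return [max(-clip_qkv, min(clip_qkv, v)) for v in values]


def kv_head_for_query_head(query_head: int, n_heads: int, kv_n_heads: int) -> int:
    """Return the key/value head shared by a query head (assumed grouping)."""
    if n_heads % kv_n_heads != 0:
        raise ValueError("n_heads must be divisible by kv_n_heads")
    group_size = n_heads // kv_n_heads
    return query_head // group_size


# Values outside [-8, 8] are pulled back to the boundary.
clipped = clip_values([-12.5, 0.3, 9.0], clip_qkv=8.0)

# 16 query heads sharing 4 key/value heads: 4 query heads per key/value head.
mapping = [kv_head_for_query_head(h, n_heads=16, kv_n_heads=4) for h in range(16)]
```

With `kv_n_heads=1`, the default above, every query head maps to the same single key/value head.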

classmethod from_pretrained(pretrained_model_name_or_path: str, **kwargs: Any) -> PretrainedConfig

Instantiate a [PretrainedConfig] (or a derived class) from a pretrained model configuration.

Parameters
  • pretrained_model_name_or_path (str or os.PathLike) –

    This can be either:

    • a string, the model id of a pretrained model configuration hosted inside a model repo on huggingface.co.

    • a path to a directory containing a configuration file saved using the [~PretrainedConfig.save_pretrained] method, e.g., ./my_model_directory/.

    • a path or url to a saved configuration JSON file, e.g., ./my_model_directory/configuration.json.

  • cache_dir (str or os.PathLike, optional) – Path to a directory in which a downloaded pretrained model configuration should be cached if the standard cache should not be used.

  • force_download (bool, optional, defaults to False) – Whether or not to force a (re-)download of the configuration files, overriding the cached versions if they exist.

  • resume_download – Deprecated and ignored. All downloads are now resumed by default when possible. Will be removed in v5 of Transformers.

  • proxies (Dict[str, str], optional) – A dictionary of proxy servers to use by protocol or endpoint, e.g., {'http': 'foo.bar:3128', 'http://hostname': 'foo.bar:4012'}. The proxies are used on each request.

  • token (str or bool, optional) – The token to use as HTTP bearer authorization for remote files. If True, or not specified, will use the token generated when running huggingface-cli login (stored in ~/.huggingface).

  • revision (str, optional, defaults to "main") –

    The specific model version to use. It can be a branch name, a tag name, or a commit id, since we use a git-based system for storing models and other artifacts on huggingface.co, so revision can be any identifier allowed by git.

    <Tip>

    To test a pull request you made on the Hub, you can pass `revision="refs/pr/<pr_number>"`.

    </Tip>

  • return_unused_kwargs (bool, optional, defaults to False) –

    If False, then this function returns just the final configuration object.

    If True, then this function returns a tuple (config, unused_kwargs), where unused_kwargs is a dictionary of the key/value pairs whose keys are not configuration attributes, i.e., the part of kwargs which has not been used to update config and is otherwise ignored.

  • subfolder (str, optional, defaults to "") – In case the relevant files are located inside a subfolder of the model repo on huggingface.co, you can specify the folder name here.

  • kwargs (Dict[str, Any], optional) – The values in kwargs of any keys which are configuration attributes will be used to override the loaded values. Behavior concerning key/value pairs whose keys are not configuration attributes is controlled by the return_unused_kwargs keyword parameter.

Returns

The configuration object instantiated from this pretrained model.

Return type

[PretrainedConfig]

Examples:

```python
>>> from transformers import BertConfig

>>> # The base class PretrainedConfig cannot be instantiated directly, so these examples use a
>>> # derived class: BertConfig
>>> config = BertConfig.from_pretrained(
...     "google-bert/bert-base-uncased"
... )  # Download configuration from huggingface.co and cache.
>>> config = BertConfig.from_pretrained(
...     "./test/saved_model/"
... )  # E.g. config (or model) was saved using save_pretrained('./test/saved_model/')
>>> config = BertConfig.from_pretrained("./test/saved_model/my_configuration.json")
>>> config = BertConfig.from_pretrained(
...     "google-bert/bert-base-uncased", output_attentions=True, foo=False
... )
>>> assert config.output_attentions == True
>>> config, unused_kwargs = BertConfig.from_pretrained(
...     "google-bert/bert-base-uncased",
...     output_attentions=True,
...     foo=False,
...     return_unused_kwargs=True,
... )
>>> assert config.output_attentions == True
>>> assert unused_kwargs == {"foo": False}
```

class easydel.modules.dbrx.__init__.DbrxConfig(d_model: int = 2048, n_heads: int = 16, n_layers: int = 24, max_seq_len: int = 2048, vocab_size: int = 32000, resid_pdrop: float = 0.0, emb_pdrop: float = 0.0, attn_config: Optional[DbrxAttentionConfig] = None, ffn_config: Optional[DbrxFFNConfig] = None, use_cache: bool = True, initializer_range: float = 0.02, output_router_logits: bool = False, router_aux_loss_coef: float = 0.05, gradient_checkpointing: EasyDeLGradientCheckPointers = EasyDeLGradientCheckPointers.NONE, **kwargs: Any)

Bases: EasyDeLBaseConfig

Configuration objects inherit from [EasyDeLBaseConfig] and can be used to control the model outputs. Read the documentation from [EasyDeLBaseConfig] for more information.

Parameters
  • d_model (int, optional, defaults to 2048) – Dimensionality of the embeddings and hidden states.

  • n_heads (int, optional, defaults to 16) – Number of attention heads for each attention layer in the Transformer decoder.

  • n_layers (int, optional, defaults to 24) – Number of hidden layers in the Transformer decoder.

  • max_seq_len (int, optional, defaults to 2048) – The maximum sequence length that this model might ever be used with. Typically set this to something large just in case (e.g., 2048 or 4096).

  • vocab_size (int, optional, defaults to 32000) – Vocabulary size of the DBRX model. Defines the number of different tokens that can be represented by the input_ids passed to the forward method.

  • resid_pdrop (float, optional, defaults to 0.0) – The residual dropout probability, applied to a sub-layer's output before it is added back to the residual.

  • emb_pdrop (float, optional, defaults to 0.0) – The dropout probability for the embedding layer.

  • attn_config ([DbrxAttentionConfig], optional) – The configuration of the attention layer.

  • ffn_config ([DbrxFFNConfig], optional) – The configuration of the feed forward layer.

  • use_cache (bool, optional, defaults to True) – Whether or not the model should return the last key/values attentions (not used by all models). Only relevant if config.is_decoder=True.

  • initializer_range (float, optional, defaults to 0.02) – The standard deviation of the truncated_normal_initializer for initializing all weight matrices.

  • output_router_logits (bool, optional, defaults to False) – Whether or not to output the router logits.

  • router_aux_loss_coef (float, optional, defaults to 0.05) – The coefficient of the router auxiliary loss.

attribute_map: dict[str, str] = {'hidden_size': 'd_model', 'max_position_embeddings': 'max_seq_len', 'num_attention_heads': 'n_heads', 'num_hidden_layers': 'n_layers'}
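
The `attribute_map` lets code written against the common Hugging Face names (`hidden_size`, `num_attention_heads`, and so on) read and write the DBRX-specific fields. The following is a minimal sketch of that aliasing idea only. `AliasedConfig` is a hypothetical stand-in, not the real `EasyDeLBaseConfig`, whose internals may differ.

```python
class AliasedConfig:
    """Hypothetical stand-in showing how an attribute map can redirect names."""

    attribute_map = {
        "hidden_size": "d_model",
        "max_position_embeddings": "max_seq_len",
        "num_attention_heads": "n_heads",
        "num_hidden_layers": "n_layers",
    }

    def __init__(self, d_model=2048, n_heads=16, n_layers=24, max_seq_len=2048):
        self.d_model = d_model
        self.n_heads = n_heads
        self.n_layers = n_layers
        self.max_seq_len = max_seq_len

    def __getattr__(self, name):
        # Called only when normal lookup fails, so real fields are unaffected.
        target = type(self).attribute_map.get(name)
        if target is None:
            raise AttributeError(name)
        return getattr(self, target)

    def __setattr__(self, name, value):
        # Writes to an alias are redirected to the underlying field.
        name = type(self).attribute_map.get(name, name)
        object.__setattr__(self, name, value)


cfg = AliasedConfig()
hidden_size = cfg.hidden_size   # reads d_model through the alias
cfg.num_hidden_layers = 12      # writes n_layers through the alias
```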

get_partition_rules(*args, **kwargs)

Get the partition rules for the model parameters.

These rules define how parameters should be sharded across devices when using model parallelism.

Parameters
  • *args – Variable length argument list.

  • **kwargs – Arbitrary keyword arguments.

Returns

A tuple of partition rules for different parameter patterns.

Return type

Tuple
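
The rules returned for DBRX are not listed on this page. As an illustration only, the sketch below assumes each rule is a `(regex, spec)` pair and that the first pattern found in a parameter path wins. The patterns, the axis names, the parameter paths, and the helper `match_rule` are all hypothetical, and plain tuples stand in for `jax.sharding.PartitionSpec`.

```python
import re
from typing import Optional, Sequence, Tuple

Spec = Tuple[Optional[str], ...]  # stand-in for jax.sharding.PartitionSpec
Rule = Tuple[str, Spec]

# Hypothetical rules: NOT the ones DbrxConfig.get_partition_rules returns.
EXAMPLE_RULES: Tuple[Rule, ...] = (
    (r"wte/embedding", ("tp", "fsdp")),
    (r"attn/.*/kernel", ("fsdp", "tp")),
    (r".*", (None,)),  # catch-all: replicate everything else
)


def match_rule(param_path: str, rules: Sequence[Rule]) -> Spec:
    """Return the spec of the first rule whose regex is found in the path."""
    for pattern, spec in rules:
        if re.search(pattern, param_path):
            return spec
    raise ValueError(f"no partition rule matches {param_path!r}")


attn_spec = match_rule("blocks/0/attn/Wqkv/kernel", EXAMPLE_RULES)
norm_spec = match_rule("blocks/0/norm_1/scale", EXAMPLE_RULES)
```

Because the first match wins in this sketch, specific patterns must come before the catch-all.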

property granted_freq_max_position_embedding: int

Returns the maximum position embedding size for frequency-based position embeddings.

Returns

The maximum position embedding size, falling back to max_seq_len if not explicitly set.

Return type

int

property granted_mask_max_position_embedding: int

Returns the maximum position embedding size for mask-based position embeddings.

Returns

The maximum position embedding size, falling back to max_seq_len if not explicitly set.

Return type

int

model_type: str = 'dbrx'

class easydel.modules.dbrx.__init__.DbrxFFNConfig(ffn_act_fn: Optional[dict] = None, ffn_hidden_size: int = 3584, moe_num_experts: int = 4, moe_top_k: int = 1, moe_jitter_eps: Optional[float] = None, moe_loss_weight: float = 0.01, moe_normalize_expert_weights: Optional[float] = 1, uniform_expert_assignment: bool = False, **kwargs: Any)

Bases: EasyDeLBaseConfig

This is the configuration class to store the feed forward related configuration of a [DbrxModel].

Parameters
  • ffn_act_fn (dict, optional) – The activation function configuration for the feed-forward network.

  • ffn_hidden_size (int, optional, defaults to 3584) – The hidden size of the feed-forward network.

  • moe_num_experts (int, optional, defaults to 4) – The number of experts in the Mixture-of-Experts (MoE) layer.

  • moe_top_k (int, optional, defaults to 1) – The number of top experts to use in the MoE layer.

  • moe_jitter_eps (float, optional) – The jitter epsilon value for the MoE layer.

  • moe_loss_weight (float, optional, defaults to 0.01) – The loss weight for the MoE auxiliary loss.

  • moe_normalize_expert_weights (float, optional, defaults to 1.0) – The normalization factor for the expert weights in the MoE layer.

  • uniform_expert_assignment (bool, optional, defaults to False) – Whether to use uniform expert assignment in the MoE layer.
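
To make `moe_top_k` and `moe_normalize_expert_weights` more concrete, here is a pure-Python sketch of routing a single token. It assumes the router applies a softmax over the experts, keeps the `moe_top_k` largest weights, and divides them by their p-norm with `p = moe_normalize_expert_weights`. That reading is an assumption and may not match EasyDeL's implementation in detail. The function `route_token` is hypothetical.

```python
import math
from typing import List, Optional, Tuple


def route_token(
    router_logits: List[float],
    moe_top_k: int = 1,
    moe_normalize_expert_weights: Optional[float] = 1.0,
) -> Tuple[List[int], List[float]]:
    """Pick the top-k experts for one token and rescale their weights."""
    # Softmax over all experts (shifted by the max for numerical stability).
    peak = max(router_logits)
    exps = [math.exp(x - peak) for x in router_logits]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Indices of the k largest probabilities, highest first.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    top_experts = order[:moe_top_k]
    top_weights = [probs[i] for i in top_experts]

    # Divide by the p-norm of the selected weights; p=1 makes them sum to 1.
    if moe_normalize_expert_weights is not None:
        p = moe_normalize_expert_weights
        scale = sum(abs(w) ** p for w in top_weights) ** (1.0 / p)
        top_weights = [w / scale for w in top_weights]
    return top_experts, top_weights


# Four experts (the default moe_num_experts), two selected per token.
experts, weights = route_token([2.0, 0.5, 1.0, -1.0], moe_top_k=2)
```

With `moe_normalize_expert_weights=None` the selected softmax weights are used unscaled, so they generally sum to less than 1.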

classmethod from_pretrained(pretrained_model_name_or_path: str, **kwargs: Any) -> EasyDeLBaseConfig

Instantiate a [PretrainedConfig] (or a derived class) from a pretrained model configuration.

Parameters
  • pretrained_model_name_or_path (str or os.PathLike) –

    This can be either:

    • a string, the model id of a pretrained model configuration hosted inside a model repo on huggingface.co.

    • a path to a directory containing a configuration file saved using the [~PretrainedConfig.save_pretrained] method, e.g., ./my_model_directory/.

    • a path or url to a saved configuration JSON file, e.g., ./my_model_directory/configuration.json.

  • cache_dir (str or os.PathLike, optional) – Path to a directory in which a downloaded pretrained model configuration should be cached if the standard cache should not be used.

  • force_download (bool, optional, defaults to False) – Whether or not to force a (re-)download of the configuration files, overriding the cached versions if they exist.

  • resume_download – Deprecated and ignored. All downloads are now resumed by default when possible. Will be removed in v5 of Transformers.

  • proxies (Dict[str, str], optional) – A dictionary of proxy servers to use by protocol or endpoint, e.g., {'http': 'foo.bar:3128', 'http://hostname': 'foo.bar:4012'}. The proxies are used on each request.

  • token (str or bool, optional) – The token to use as HTTP bearer authorization for remote files. If True, or not specified, will use the token generated when running huggingface-cli login (stored in ~/.huggingface).

  • revision (str, optional, defaults to "main") –

    The specific model version to use. It can be a branch name, a tag name, or a commit id, since we use a git-based system for storing models and other artifacts on huggingface.co, so revision can be any identifier allowed by git.

    <Tip>

    To test a pull request you made on the Hub, you can pass `revision="refs/pr/<pr_number>"`.

    </Tip>

  • return_unused_kwargs (bool, optional, defaults to False) –

    If False, then this function returns just the final configuration object.

    If True, then this function returns a tuple (config, unused_kwargs), where unused_kwargs is a dictionary of the key/value pairs whose keys are not configuration attributes, i.e., the part of kwargs which has not been used to update config and is otherwise ignored.

  • subfolder (str, optional, defaults to "") – In case the relevant files are located inside a subfolder of the model repo on huggingface.co, you can specify the folder name here.

  • kwargs (Dict[str, Any], optional) – The values in kwargs of any keys which are configuration attributes will be used to override the loaded values. Behavior concerning key/value pairs whose keys are not configuration attributes is controlled by the return_unused_kwargs keyword parameter.

Returns

The configuration object instantiated from this pretrained model.

Return type

[PretrainedConfig]

Examples:

```python
>>> from transformers import BertConfig

>>> # The base class PretrainedConfig cannot be instantiated directly, so these examples use a
>>> # derived class: BertConfig
>>> config = BertConfig.from_pretrained(
...     "google-bert/bert-base-uncased"
... )  # Download configuration from huggingface.co and cache.
>>> config = BertConfig.from_pretrained(
...     "./test/saved_model/"
... )  # E.g. config (or model) was saved using save_pretrained('./test/saved_model/')
>>> config = BertConfig.from_pretrained("./test/saved_model/my_configuration.json")
>>> config = BertConfig.from_pretrained(
...     "google-bert/bert-base-uncased", output_attentions=True, foo=False
... )
>>> assert config.output_attentions == True
>>> config, unused_kwargs = BertConfig.from_pretrained(
...     "google-bert/bert-base-uncased",
...     output_attentions=True,
...     foo=False,
...     return_unused_kwargs=True,
... )
>>> assert config.output_attentions == True
>>> assert unused_kwargs == {"foo": False}
```

class easydel.modules.dbrx.__init__.DbrxForCausalLM(*args: Any, **kwargs: Any)

Bases: EasyDeLBaseModule

class easydel.modules.dbrx.__init__.DbrxForSequenceClassification(*args: Any, **kwargs: Any)

Bases: EasyDeLBaseModule

class easydel.modules.dbrx.__init__.DbrxModel(*args: Any, **kwargs: Any)

Bases: EasyDeLBaseModule

Base DBRX Model outputting raw hidden-states.

This model is a Transformer-based model with a mixture of experts (MoE) architecture, implementing the DBRX architecture as described in the original paper.

The model uses specialized attention modules and a router-based MoE FFN layer.

property frequencies

Retrieves or computes the frequency components (e.g., for RoPE) from the configuration.

Uses self.config.get_basic_frequencies() and caches the result.

Returns

The frequency components, potentially cached.

Return type

jnp.ndarray
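
What `get_basic_frequencies()` returns is not specified on this page. For orientation, the sketch below computes the standard rotary position embedding (RoPE) inverse frequencies, `rope_theta ** (-2i / head_dim)`, in pure Python. The helpers `rope_inverse_frequencies` and `rope_angles` are hypothetical, and `head_dim = d_model // n_heads` is an assumption, not something the configuration documents.

```python
from typing import List


def rope_inverse_frequencies(head_dim: int, rope_theta: float = 10000.0) -> List[float]:
    """Standard RoPE inverse frequencies: rope_theta ** (-2i / head_dim)."""
    if head_dim % 2 != 0:
        raise ValueError("head_dim must be even")
    return [rope_theta ** (-(2 * i) / head_dim) for i in range(head_dim // 2)]


def rope_angles(position: int, head_dim: int, rope_theta: float = 10000.0) -> List[float]:
    """Rotation angle for each pair of dimensions at a given position."""
    return [position * f for f in rope_inverse_frequencies(head_dim, rope_theta)]


# Assumption: head_dim = d_model // n_heads, i.e. 2048 // 16 = 128 with the defaults.
inv_freq = rope_inverse_frequencies(head_dim=2048 // 16)
angles = rope_angles(position=3, head_dim=2048 // 16)
```

A larger `rope_theta` makes the frequencies fall off faster across dimensions, so the slowest-rotating pairs take more positions to complete a full turn.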