easydel.modules.siglip.configuration_siglip#

class easydel.modules.siglip.configuration_siglip.SiglipConfig(text_config=None, vision_config=None, **kwargs)[source]#

Bases: EasyDeLBaseConfig

[SiglipConfig] is the configuration class to store the configuration of a [SiglipModel]. It is used to instantiate a Siglip model according to the specified arguments, defining the text model and vision model configs. Instantiating a configuration with the defaults will yield a similar configuration to that of the Siglip [google/siglip-base-patch16-224](https://huggingface.co/google/siglip-base-patch16-224) architecture.

Configuration objects inherit from [EasyDeLBaseConfig] and can be used to control the model outputs. Read the documentation from [EasyDeLBaseConfig] for more information.

Parameters
  • text_config (dict, optional) – Dictionary of configuration options used to initialize [SiglipTextConfig].

  • vision_config (dict, optional) – Dictionary of configuration options used to initialize [SiglipVisionConfig].

  • kwargs (optional) – Dictionary of keyword arguments.

classmethod from_text_vision_configs(text_config: SiglipTextConfig, vision_config: SiglipVisionConfig, **kwargs)[source]#

Instantiate a [SiglipConfig] (or a derived class) from siglip text model configuration and siglip vision model configuration.

Returns

An instance of a configuration object

Return type

[SiglipConfig]

get_partition_rules(*args, **kwargs)[source]#

Get the partition rules for the model. :returns: The partition rules. :rtype: tp.Tuple[tp.Tuple[str, PartitionSpec]]

model_type: str = 'siglip'#
sub_configs: Dict[str, 'PretrainedConfig'] = {'text_config': <class 'easydel.modules.siglip.configuration_siglip.SiglipTextConfig'>, 'vision_config': <class 'easydel.modules.siglip.configuration_siglip.SiglipVisionConfig'>}#
class easydel.modules.siglip.configuration_siglip.SiglipTextConfig(vocab_size=32000, hidden_size=768, intermediate_size=3072, num_hidden_layers=12, num_attention_heads=12, max_position_embeddings=64, hidden_act='gelu_pytorch_tanh', layer_norm_eps=1e-06, attention_dropout=0.0, pad_token_id=1, bos_token_id=49406, eos_token_id=49407, projection_size=None, **kwargs)[source]#

Bases: EasyDeLBaseConfig

Configuration objects inherit from [EasyDeLBaseConfig] and can be used to control the model outputs. Read the documentation from [EasyDeLBaseConfig] for more information.

Parameters
  • vocab_size (int, optional, defaults to 32000) – Vocabulary size of the Siglip text model. Defines the number of different tokens that can be represented by the inputs_ids passed when calling [SiglipModel].

  • hidden_size (int, optional, defaults to 768) – Dimensionality of the encoder layers and the pooler layer.

  • intermediate_size (int, optional, defaults to 3072) – Dimensionality of the “intermediate” (i.e., feed-forward) layer in the Transformer encoder.

  • num_hidden_layers (int, optional, defaults to 12) – Number of hidden layers in the Transformer encoder.

  • num_attention_heads (int, optional, defaults to 12) – Number of attention heads for each attention layer in the Transformer encoder.

  • max_position_embeddings (int, optional, defaults to 64) – The maximum sequence length that this model might ever be used with. Typically set this to something large just in case (e.g., 512 or 1024 or 2048).

  • hidden_act (str or function, optional, defaults to “gelu_pytorch_tanh”) – The non-linear activation function (function or string) in the encoder and pooler. If string, “gelu”, “relu”, “selu” and “gelu_new” “quick_gelu” are supported.

  • layer_norm_eps (float, optional, defaults to 1e-06) – The epsilon used by the layer normalization layers.

  • attention_dropout (float, optional, defaults to 0.0) – The dropout ratio for the attention probabilities.

  • pad_token_id (int, optional, defaults to 1) – The id of the padding token in the vocabulary.

  • bos_token_id (int, optional, defaults to 49406) – The id of the beginning-of-sequence token in the vocabulary.

  • eos_token_id (int, optional, defaults to 49407) – The id of the end-of-sequence token in the vocabulary.

  • projection_size (int, optional, defaults to hidden_size) – The size of the projection head.

Example:

```python >>> from transformers import SiglipTextConfig, SiglipTextModel

>>> # Initializing a SiglipTextConfig with google/siglip-base-patch16-224 style configuration
>>> configuration = SiglipTextConfig()
>>> # Initializing a SiglipTextModel (with random weights) from the google/siglip-base-patch16-224 style configuration
>>> model = SiglipTextModel(configuration)
>>> # Accessing the model configuration
>>> configuration = model.config
```
base_config_key: str = 'text_config'#
get_partition_rules(*args, **kwargs)[source]#

Get the partition rules for the model. :returns: The partition rules. :rtype: tp.Tuple[tp.Tuple[str, PartitionSpec]]

model_type: str = 'siglip_text_model'#
class easydel.modules.siglip.configuration_siglip.SiglipVisionConfig(hidden_size=768, intermediate_size=3072, num_hidden_layers=12, num_attention_heads=12, num_channels=3, image_size=224, patch_size=16, hidden_act='gelu_pytorch_tanh', layer_norm_eps=1e-06, attention_dropout=0.0, **kwargs)[source]#

Bases: EasyDeLBaseConfig

Configuration objects inherit from [EasyDeLBaseConfig] and can be used to control the model outputs. Read the documentation from [EasyDeLBaseConfig] for more information.

Parameters
  • hidden_size (int, optional, defaults to 768) – Dimensionality of the encoder layers and the pooler layer.

  • intermediate_size (int, optional, defaults to 3072) – Dimensionality of the “intermediate” (i.e., feed-forward) layer in the Transformer encoder.

  • num_hidden_layers (int, optional, defaults to 12) – Number of hidden layers in the Transformer encoder.

  • num_attention_heads (int, optional, defaults to 12) – Number of attention heads for each attention layer in the Transformer encoder.

  • num_channels (int, optional, defaults to 3) – Number of channels in the input images.

  • image_size (int, optional, defaults to 224) – The size (resolution) of each image.

  • patch_size (int, optional, defaults to 16) – The size (resolution) of each patch.

  • hidden_act (str or function, optional, defaults to “gelu_pytorch_tanh”) – The non-linear activation function (function or string) in the encoder and pooler. If string, “gelu”, “relu”, “selu” and “gelu_new” “quick_gelu” are supported.

  • layer_norm_eps (float, optional, defaults to 1e-06) – The epsilon used by the layer normalization layers.

  • attention_dropout (float, optional, defaults to 0.0) – The dropout ratio for the attention probabilities.

```

base_config_key: str = 'vision_config'#
get_partition_rules(*args, **kwargs)[source]#

Get the partition rules for the model. :returns: The partition rules. :rtype: tp.Tuple[tp.Tuple[str, PartitionSpec]]

model_type: str = 'siglip_vision_model'#