easydel.modules.clip.clip_configuration#
- class easydel.modules.clip.clip_configuration.CLIPConfig(text_config=None, vision_config=None, projection_dim=512, logit_scale_init_value=2.6592, **kwargs)[source]#
Bases:
EasyDeLBaseConfig[CLIPConfig] is the configuration class to store the configuration of a [CLIPModel]. It is used to instantiate a CLIP model according to the specified arguments, defining the text model and vision model configs. Instantiating a configuration with the defaults will yield a similar configuration to that of the CLIP [openai/clip-vit-base-patch32](https://huggingface.co/openai/clip-vit-base-patch32) architecture.
Configuration objects inherit from [EasyDeLBaseConfig] and can be used to control the model outputs. Read the documentation from [EasyDeLBaseConfig] for more information.
- Parameters
text_config (dict, optional) – Dictionary of configuration options used to initialize [CLIPTextConfig].
vision_config (dict, optional) – Dictionary of configuration options used to initialize [CLIPVisionConfig].
projection_dim (int, optional, defaults to 512) – Dimensionality of text and vision projection layers.
logit_scale_init_value (float, optional, defaults to 2.6592) – The initial value of the logit_scale parameter. Default is used as per the original CLIP implementation.
kwargs (optional) – Dictionary of keyword arguments.
Example:
```python >>> from transformers import CLIPConfig, CLIPModel
>>> # Initializing a CLIPConfig with openai/clip-vit-base-patch32 style configuration >>> configuration = CLIPConfig()
>>> # Initializing a CLIPModel (with random weights) from the openai/clip-vit-base-patch32 style configuration >>> model = CLIPModel(configuration)
>>> # Accessing the model configuration >>> configuration = model.config
>>> # We can also initialize a CLIPConfig from a CLIPTextConfig and a CLIPVisionConfig >>> from transformers import CLIPTextConfig, CLIPVisionConfig
>>> # Initializing a CLIPText and CLIPVision configuration >>> config_text = CLIPTextConfig() >>> config_vision = CLIPVisionConfig()
>>> config = CLIPConfig.from_text_vision_configs(config_text, config_vision) ```
- classmethod from_text_vision_configs(text_config: CLIPTextConfig, vision_config: CLIPVisionConfig, **kwargs)[source]#
Instantiate a [CLIPConfig] (or a derived class) from clip text model configuration and clip vision model configuration.
- Returns
An instance of a configuration object
- Return type
[CLIPConfig]
- get_partition_rules(*arg, **kwargs)#
Get the partition rules for the model. :returns: The partition rules. :rtype: tp.Tuple[tp.Tuple[str, PartitionSpec]]
- model_type: str = 'clip'#
- sub_configs: Dict[str, 'PretrainedConfig'] = {'text_config': <class 'easydel.modules.clip.clip_configuration.CLIPTextConfig'>, 'vision_config': <class 'easydel.modules.clip.clip_configuration.CLIPVisionConfig'>}#
- class easydel.modules.clip.clip_configuration.CLIPTextConfig(vocab_size=49408, hidden_size=512, intermediate_size=2048, projection_dim=512, num_hidden_layers=12, num_attention_heads=8, max_position_embeddings=77, hidden_act='quick_gelu', layer_norm_eps=1e-05, attention_dropout=0.0, initializer_range=0.02, initializer_factor=1.0, pad_token_id=1, bos_token_id=49406, eos_token_id=49407, **kwargs)[source]#
Bases:
EasyDeLBaseConfigThis is the configuration class to store the configuration of a [CLIPTextModel]. It is used to instantiate a CLIP text encoder according to the specified arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a similar configuration to that of the text encoder of the CLIP [openai/clip-vit-base-patch32](https://huggingface.co/openai/clip-vit-base-patch32) architecture.
Configuration objects inherit from [EasyDeLBaseConfig] and can be used to control the model outputs. Read the documentation from [EasyDeLBaseConfig] for more information.
- Parameters
vocab_size (int, optional, defaults to 49408) – Vocabulary size of the CLIP text model. Defines the number of different tokens that can be represented by the inputs_ids passed when calling [CLIPModel].
hidden_size (int, optional, defaults to 512) – Dimensionality of the encoder layers and the pooler layer.
intermediate_size (int, optional, defaults to 2048) – Dimensionality of the “intermediate” (i.e., feed-forward) layer in the Transformer encoder.
projection_dim (int, optional, defaults to 512) – Dimensionality of text and vision projection layers.
num_hidden_layers (int, optional, defaults to 12) – Number of hidden layers in the Transformer encoder.
num_attention_heads (int, optional, defaults to 8) – Number of attention heads for each attention layer in the Transformer encoder.
max_position_embeddings (int, optional, defaults to 77) – The maximum sequence length that this model might ever be used with. Typically set this to something large just in case (e.g., 512 or 1024 or 2048).
hidden_act (str or function, optional, defaults to “quick_gelu”) – The non-linear activation function (function or string) in the encoder and pooler. If string, “gelu”, “relu”, “selu” and “gelu_new” “quick_gelu” are supported.
layer_norm_eps (float, optional, defaults to 1e-05) – The epsilon used by the layer normalization layers.
attention_dropout (float, optional, defaults to 0.0) – The dropout ratio for the attention probabilities.
initializer_range (float, optional, defaults to 0.02) – The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
initializer_factor (float, optional, defaults to 1.0) – A factor for initializing all weight matrices (should be kept to 1, used internally for initialization testing).
pad_token_id (int, optional, defaults to 1) – Padding token id.
bos_token_id (int, optional, defaults to 49406) – Beginning of stream token id.
eos_token_id (int, optional, defaults to 49407) – End of stream token id.
Example:
```python >>> from transformers import CLIPTextConfig, CLIPTextModel
>>> # Initializing a CLIPTextConfig with openai/clip-vit-base-patch32 style configuration >>> configuration = CLIPTextConfig()
>>> # Initializing a CLIPTextModel (with random weights) from the openai/clip-vit-base-patch32 style configuration >>> model = CLIPTextModel(configuration)
>>> # Accessing the model configuration >>> configuration = model.config ```
- base_config_key: str = 'text_config'#
- get_partition_rules(*arg, **kwargs)#
Get the partition rules for the model. :returns: The partition rules. :rtype: tp.Tuple[tp.Tuple[str, PartitionSpec]]
- model_type: str = 'clip_text_model'#
- class easydel.modules.clip.clip_configuration.CLIPVisionConfig(hidden_size=768, intermediate_size=3072, projection_dim=512, num_hidden_layers=12, num_attention_heads=12, num_channels=3, image_size=224, patch_size=32, hidden_act='quick_gelu', layer_norm_eps=1e-05, attention_dropout=0.0, initializer_range=0.02, initializer_factor=1.0, **kwargs)[source]#
Bases:
EasyDeLBaseConfigThis is the configuration class to store the configuration of a [CLIPVisionModel]. It is used to instantiate a CLIP vision encoder according to the specified arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a similar configuration to that of the vision encoder of the CLIP [openai/clip-vit-base-patch32](https://huggingface.co/openai/clip-vit-base-patch32) architecture.
Configuration objects inherit from [EasyDeLBaseConfig] and can be used to control the model outputs. Read the documentation from [EasyDeLBaseConfig] for more information.
- Parameters
hidden_size (int, optional, defaults to 768) – Dimensionality of the encoder layers and the pooler layer.
intermediate_size (int, optional, defaults to 3072) – Dimensionality of the “intermediate” (i.e., feed-forward) layer in the Transformer encoder.
projection_dim (int, optional, defaults to 512) – Dimensionality of text and vision projection layers.
num_hidden_layers (int, optional, defaults to 12) – Number of hidden layers in the Transformer encoder.
num_attention_heads (int, optional, defaults to 12) – Number of attention heads for each attention layer in the Transformer encoder.
num_channels (int, optional, defaults to 3) – The number of input channels.
image_size (int, optional, defaults to 224) – The size (resolution) of each image.
patch_size (int, optional, defaults to 32) – The size (resolution) of each patch.
hidden_act (str or function, optional, defaults to “quick_gelu”) – The non-linear activation function (function or string) in the encoder and pooler. If string, “gelu”, “relu”, “selu” and “gelu_new” “quick_gelu” are supported.
layer_norm_eps (float, optional, defaults to 1e-05) – The epsilon used by the layer normalization layers.
attention_dropout (float, optional, defaults to 0.0) – The dropout ratio for the attention probabilities.
initializer_range (float, optional, defaults to 0.02) – The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
initializer_factor (float, optional, defaults to 1.0) – A factor for initializing all weight matrices (should be kept to 1, used internally for initialization testing).
Example:
```python >>> from transformers import CLIPVisionConfig, CLIPVisionModel
>>> # Initializing a CLIPVisionConfig with openai/clip-vit-base-patch32 style configuration >>> configuration = CLIPVisionConfig()
>>> # Initializing a CLIPVisionModel (with random weights) from the openai/clip-vit-base-patch32 style configuration >>> model = CLIPVisionModel(configuration)
>>> # Accessing the model configuration >>> configuration = model.config ```
- base_config_key: str = 'vision_config'#
- get_partition_rules(*arg, **kwargs)#
Get the partition rules for the model. :returns: The partition rules. :rtype: tp.Tuple[tp.Tuple[str, PartitionSpec]]
- model_type: str = 'clip_vision_model'#