easydel.modules.clip.__init__#

class easydel.modules.clip.__init__.CLIPConfig(text_config=None, vision_config=None, projection_dim=512, logit_scale_init_value=2.6592, **kwargs)[source]#

Bases: EasyDeLBaseConfig

[CLIPConfig] is the configuration class to store the configuration of a [CLIPModel]. It is used to instantiate a CLIP model according to the specified arguments, defining the text model and vision model configs. Instantiating a configuration with the defaults will yield a similar configuration to that of the CLIP [openai/clip-vit-base-patch32](https://huggingface.co/openai/clip-vit-base-patch32) architecture.

Configuration objects inherit from [EasyDeLBaseConfig] and can be used to control the model outputs. Read the documentation from [EasyDeLBaseConfig] for more information.

Parameters
  • text_config (dict, optional) – Dictionary of configuration options used to initialize [CLIPTextConfig].

  • vision_config (dict, optional) – Dictionary of configuration options used to initialize [CLIPVisionConfig].

  • projection_dim (int, optional, defaults to 512) – Dimensionality of text and vision projection layers.

  • logit_scale_init_value (float, optional, defaults to 2.6592) – The initial value of the logit_scale parameter. The default follows the original CLIP implementation.

  • kwargs (optional) – Dictionary of keyword arguments.
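
The default logit_scale_init_value is easier to read once converted out of log space. The sketch below assumes the convention of the original CLIP implementation, where the stored parameter is exponentiated before it multiplies the cosine similarities; the variable names are illustrative and not part of the EasyDeL API.

```python
import math

# Default taken from the CLIPConfig signature above.
logit_scale_init_value = 2.6592

# Assumption: the parameter lives in log space, so the multiplier applied to
# the cosine similarities is its exponential.
scale = math.exp(logit_scale_init_value)

# The equivalent softmax temperature is the reciprocal of that multiplier.
temperature = 1.0 / scale

print(f"scale={scale:.4f} temperature={temperature:.4f}")
```

Under that assumption the default corresponds to a starting temperature of roughly 0.07, the value used in the original CLIP work.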

Example:

```python
>>> from transformers import CLIPConfig, CLIPModel

>>> # Initializing a CLIPConfig with openai/clip-vit-base-patch32 style configuration
>>> configuration = CLIPConfig()
>>> # Initializing a CLIPModel (with random weights) from the openai/clip-vit-base-patch32 style configuration
>>> model = CLIPModel(configuration)
>>> # Accessing the model configuration
>>> configuration = model.config
>>> # We can also initialize a CLIPConfig from a CLIPTextConfig and a CLIPVisionConfig
>>> from transformers import CLIPTextConfig, CLIPVisionConfig
>>> # Initializing a CLIPText and CLIPVision configuration
>>> config_text = CLIPTextConfig()
>>> config_vision = CLIPVisionConfig()
>>> config = CLIPConfig.from_text_vision_configs(config_text, config_vision)
```
classmethod from_text_vision_configs(text_config: CLIPTextConfig, vision_config: CLIPVisionConfig, **kwargs)[source]#

Instantiate a [CLIPConfig] (or a derived class) from a CLIP text model configuration and a CLIP vision model configuration.

Parameters
  • text_config (CLIPTextConfig) – The text model configuration.

  • vision_config (CLIPVisionConfig) – The vision model configuration.

  • **kwargs – Additional keyword arguments.

Returns

An instance of a configuration object

Return type

[CLIPConfig]

get_partition_rules(*arg, **kwargs)#

Generic partition rules for CLIP text and vision models.

Parameters
  • self – The configuration object (unused but part of method signature).

  • *arg – Additional positional arguments (unused).

  • **kwargs – Additional keyword arguments (unused).

Returns

A tuple of partition rules for model parameters.

Return type

Tuple
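
The returned tuple is not spelled out here. In JAX codebases of this kind, partition rules are commonly pairs of a regular expression over the parameter path and a sharding spec, with the first matching rule winning. The sketch below illustrates only that matching idea: the rule patterns, axis names, parameter paths, and the match_rule helper are all hypothetical, and plain tuples stand in for jax.sharding.PartitionSpec.

```python
import re

# Hypothetical rules: (regex over the parameter path, sharding spec).
# Tuples of axis names stand in for jax.sharding.PartitionSpec objects.
RULES = (
    (r".*/(q_proj|k_proj|v_proj)/kernel", ("fsdp", "tp")),
    (r".*/out_proj/kernel", ("tp", "fsdp")),
    (r".*", (None,)),  # fallback: replicate everything else
)


def match_rule(path, rules=RULES):
    """Return the spec of the first rule whose regex matches the path."""
    for pattern, spec in rules:
        if re.search(pattern, path):
            return spec
    raise ValueError(f"no partition rule matches {path!r}")


# Hypothetical flattened parameter paths for a text and a vision tower.
params = [
    "text_model/encoder/layers/0/self_attn/q_proj/kernel",
    "vision_model/encoder/layers/0/self_attn/out_proj/kernel",
    "text_model/final_layer_norm/scale",
]
specs = {path: match_rule(path) for path in params}
```

Because one rule set is described as covering both the text and vision models, a path-based pattern like the ones above would apply to either tower as long as the layer names match.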

model_type: str = 'clip'#
sub_configs: dict[str, 'PretrainedConfig'] = {'text_config': <class 'easydel.modules.clip.clip_configuration.CLIPTextConfig'>, 'vision_config': <class 'easydel.modules.clip.clip_configuration.CLIPVisionConfig'>}#
class easydel.modules.clip.__init__.CLIPForImageClassification(*args: Any, **kwargs: Any)[source]#

Bases: EasyDeLBaseModule

CLIP vision model with an image classification head on top (a linear layer on the pooled final hidden state).

config#

Configuration object.

Type

CLIPVisionConfig

dtype#

Data type for computation.

Type

jnp.dtype

param_dtype#

Data type for parameters.

Type

jnp.dtype

precision#

JAX precision level.

Type

jax.lax.PrecisionLike

rngs#

Random number generators.

Type

nn.Rngs

class easydel.modules.clip.__init__.CLIPModel(*args: Any, **kwargs: Any)[source]#

Bases: EasyDeLBaseModule

compute_loss(*, labels=None, loss_config=None, loss_kwargs=None, **batch) → Tuple[Any, CLIPOutput][source]#

Computes the loss for the model given a batch of inputs and labels.

This method performs a forward pass using the provided batch arguments, then calculates the loss using the determined loss_function. It handles potential label inference (e.g., using input_ids as labels for Causal LM) and default loss configurations.

Parameters
  • labels (tp.Optional[chex.Array], optional) – The target labels. If None and the task is Causal LM, input_ids from the batch might be used. Defaults to None.

  • loss_config (tp.Optional[LossConfig], optional) – Specific configuration for the loss calculation. If None, defaults might be inferred (e.g., for sequence classification). Defaults to None.

  • loss_kwargs (tp.Optional[tp.Dict], optional) – Additional keyword arguments to pass directly to the loss function. Defaults to None.

  • **batch – Keyword arguments representing the input batch (e.g., input_ids, attention_mask).

Returns

A tuple containing:
  • The model’s output (a pytree that typically includes logits, hidden states, etc.).

  • A LossMetrics object containing the calculated loss and potentially other metrics.

Return type

tp.Tuple[tp.Any, LossMetrics]

Raises
  • AssertionError – If labels are required for the loss function but are not provided or inferred.

  • AssertionError – If sequence classification loss is used without num_labels in the config.
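
The description above is written generically and mentions causal LM. For CLIP, the original formulation uses a symmetric contrastive loss over the text-image similarity matrix. The following is a minimal pure-Python sketch of that formulation, assuming compute_loss follows it; the function names are illustrative and are not EasyDeL APIs.

```python
import math


def log_softmax(row):
    """Numerically stable log-softmax of a list of floats."""
    m = max(row)
    log_z = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - log_z for x in row]


def l2_normalize(vec):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def clip_contrastive_loss(image_embeds, text_embeds, logit_scale):
    """Symmetric cross-entropy over the text-image similarity matrix."""
    img = [l2_normalize(v) for v in image_embeds]
    txt = [l2_normalize(v) for v in text_embeds]
    scale = math.exp(logit_scale)
    # logits_per_text[i][j] = scale * cosine(text_i, image_j)
    logits_per_text = [
        [scale * sum(a * b for a, b in zip(t, im)) for im in img] for t in txt
    ]
    logits_per_image = [list(col) for col in zip(*logits_per_text)]
    n = len(txt)
    # Matching pairs sit on the diagonal, so the target for row k is index k.
    text_loss = -sum(log_softmax(logits_per_text[k])[k] for k in range(n)) / n
    image_loss = -sum(log_softmax(logits_per_image[k])[k] for k in range(n)) / n
    return (text_loss + image_loss) / 2.0


texts = [[2.0, 0.0], [0.0, 3.0]]
aligned = clip_contrastive_loss([[1.0, 0.0], [0.0, 1.0]], texts, 2.6592)
swapped = clip_contrastive_loss([[0.0, 1.0], [1.0, 0.0]], texts, 2.6592)
```

With correctly paired embeddings the loss is close to zero, while swapping the images pushes it up to roughly the logit scale, which is the behaviour the contrastive objective is meant to produce.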

get_image_features(pixel_values: Union[Array, ndarray, bool, number])[source]#
get_text_features(input_ids: Union[Array, ndarray, bool, number], attention_mask: Optional[Union[Array, ndarray, bool, number]] = None, position_ids: Optional[Union[Array, ndarray, bool, number]] = None)[source]#
class easydel.modules.clip.__init__.CLIPTextConfig(vocab_size=49408, hidden_size=512, intermediate_size=2048, projection_dim=512, num_hidden_layers=12, num_attention_heads=8, max_position_embeddings=77, hidden_act='quick_gelu', layer_norm_eps=1e-05, attention_dropout=0.0, initializer_range=0.02, initializer_factor=1.0, pad_token_id=1, bos_token_id=49406, eos_token_id=49407, **kwargs)[source]#

Bases: EasyDeLBaseConfig

This is the configuration class to store the configuration of a [CLIPTextModel]. It is used to instantiate a CLIP text encoder according to the specified arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a similar configuration to that of the text encoder of the CLIP [openai/clip-vit-base-patch32](https://huggingface.co/openai/clip-vit-base-patch32) architecture.

Configuration objects inherit from [EasyDeLBaseConfig] and can be used to control the model outputs. Read the documentation from [EasyDeLBaseConfig] for more information.

Parameters
  • vocab_size (int, optional, defaults to 49408) – Vocabulary size of the CLIP text model. Defines the number of different tokens that can be represented by the input_ids passed when calling [CLIPModel].

  • hidden_size (int, optional, defaults to 512) – Dimensionality of the encoder layers and the pooler layer.

  • intermediate_size (int, optional, defaults to 2048) – Dimensionality of the “intermediate” (i.e., feed-forward) layer in the Transformer encoder.

  • projection_dim (int, optional, defaults to 512) – Dimensionality of text and vision projection layers.

  • num_hidden_layers (int, optional, defaults to 12) – Number of hidden layers in the Transformer encoder.

  • num_attention_heads (int, optional, defaults to 8) – Number of attention heads for each attention layer in the Transformer encoder.

  • max_position_embeddings (int, optional, defaults to 77) – The maximum sequence length that this model might ever be used with. Typically set this to something large just in case (e.g., 512 or 1024 or 2048).

  • hidden_act (str or function, optional, defaults to “quick_gelu”) – The non-linear activation function (function or string) in the encoder and pooler. If a string, “gelu”, “relu”, “selu”, “gelu_new”, and “quick_gelu” are supported.

  • layer_norm_eps (float, optional, defaults to 1e-05) – The epsilon used by the layer normalization layers.

  • attention_dropout (float, optional, defaults to 0.0) – The dropout ratio for the attention probabilities.

  • initializer_range (float, optional, defaults to 0.02) – The standard deviation of the truncated_normal_initializer for initializing all weight matrices.

  • initializer_factor (float, optional, defaults to 1.0) – A factor for initializing all weight matrices (should be kept to 1, used internally for initialization testing).

  • pad_token_id (int, optional, defaults to 1) – Padding token id.

  • bos_token_id (int, optional, defaults to 49406) – Beginning of stream token id.

  • eos_token_id (int, optional, defaults to 49407) – End of stream token id.
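
Two of the defaults above are easier to interpret with a little arithmetic. The sketch below shows the usual definition of the “quick_gelu” activation (a sigmoid approximation of GELU) and the per-head width implied by hidden_size and num_attention_heads. The helper functions are written here for illustration and are not imported from EasyDeL.

```python
import math


def quick_gelu(x):
    """Sigmoid approximation of GELU: x * sigmoid(1.702 * x)."""
    return x / (1.0 + math.exp(-1.702 * x))


def gelu(x):
    """Exact GELU via the Gaussian CDF, used only for comparison."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


# Per-head width implied by the text defaults listed above.
hidden_size, num_attention_heads = 512, 8
head_dim = hidden_size // num_attention_heads

# Largest gap between the two activations on a grid over [-5, 5].
max_gap = max(abs(quick_gelu(i / 10) - gelu(i / 10)) for i in range(-50, 51))

print(head_dim, max_gap)
```

The approximation stays close to exact GELU over this range, which is why checkpoints trained with “quick_gelu” should keep that setting rather than switch to “gelu”.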

Example:

```python
>>> from transformers import CLIPTextConfig, CLIPTextModel

>>> # Initializing a CLIPTextConfig with openai/clip-vit-base-patch32 style configuration
>>> configuration = CLIPTextConfig()
>>> # Initializing a CLIPTextModel (with random weights) from the openai/clip-vit-base-patch32 style configuration
>>> model = CLIPTextModel(configuration)
>>> # Accessing the model configuration
>>> configuration = model.config
```
base_config_key: str = 'text_config'#
get_partition_rules(*arg, **kwargs)#

Generic partition rules for CLIP text and vision models.

Parameters
  • self – The configuration object (unused but part of method signature).

  • *arg – Additional positional arguments (unused).

  • **kwargs – Additional keyword arguments (unused).

Returns

A tuple of partition rules for model parameters.

Return type

Tuple

model_type: str = 'clip_text_model'#
class easydel.modules.clip.__init__.CLIPTextModel(*args: Any, **kwargs: Any)[source]#

Bases: EasyDeLBaseModule

Bare CLIP text model (transformer) outputting raw hidden-states without any specific head on top.

config#

Configuration object.

Type

CLIPTextConfig

dtype#

Data type for computation.

Type

jnp.dtype

param_dtype#

Data type for parameters.

Type

jnp.dtype

precision#

JAX precision level.

Type

jax.lax.PrecisionLike

rngs#

Random number generators.

Type

nn.Rngs

class easydel.modules.clip.__init__.CLIPTextModelWithProjection(*args: Any, **kwargs: Any)[source]#

Bases: EasyDeLBaseModule

CLIP text model with a projection layer on top.

config#

Configuration object.

Type

CLIPTextConfig

dtype#

Data type for computation.

Type

jnp.dtype

param_dtype#

Data type for parameters.

Type

jnp.dtype

precision#

JAX precision level.

Type

jax.lax.PrecisionLike

rngs#

Random number generators.

Type

nn.Rngs

class easydel.modules.clip.__init__.CLIPVisionConfig(hidden_size=768, intermediate_size=3072, projection_dim=512, num_hidden_layers=12, num_attention_heads=12, num_channels=3, image_size=224, patch_size=32, hidden_act='quick_gelu', layer_norm_eps=1e-05, attention_dropout=0.0, initializer_range=0.02, initializer_factor=1.0, **kwargs)[source]#

Bases: EasyDeLBaseConfig

This is the configuration class to store the configuration of a [CLIPVisionModel]. It is used to instantiate a CLIP vision encoder according to the specified arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a similar configuration to that of the vision encoder of the CLIP [openai/clip-vit-base-patch32](https://huggingface.co/openai/clip-vit-base-patch32) architecture.

Configuration objects inherit from [EasyDeLBaseConfig] and can be used to control the model outputs. Read the documentation from [EasyDeLBaseConfig] for more information.

Parameters
  • hidden_size (int, optional, defaults to 768) – Dimensionality of the encoder layers and the pooler layer.

  • intermediate_size (int, optional, defaults to 3072) – Dimensionality of the “intermediate” (i.e., feed-forward) layer in the Transformer encoder.

  • projection_dim (int, optional, defaults to 512) – Dimensionality of text and vision projection layers.

  • num_hidden_layers (int, optional, defaults to 12) – Number of hidden layers in the Transformer encoder.

  • num_attention_heads (int, optional, defaults to 12) – Number of attention heads for each attention layer in the Transformer encoder.

  • num_channels (int, optional, defaults to 3) – The number of input channels.

  • image_size (int, optional, defaults to 224) – The size (resolution) of each image.

  • patch_size (int, optional, defaults to 32) – The size (resolution) of each patch.

  • hidden_act (str or function, optional, defaults to “quick_gelu”) – The non-linear activation function (function or string) in the encoder and pooler. If a string, “gelu”, “relu”, “selu”, “gelu_new”, and “quick_gelu” are supported.

  • layer_norm_eps (float, optional, defaults to 1e-05) – The epsilon used by the layer normalization layers.

  • attention_dropout (float, optional, defaults to 0.0) – The dropout ratio for the attention probabilities.

  • initializer_range (float, optional, defaults to 0.02) – The standard deviation of the truncated_normal_initializer for initializing all weight matrices.

  • initializer_factor (float, optional, defaults to 1.0) – A factor for initializing all weight matrices (should be kept to 1, used internally for initialization testing).
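
The image_size and patch_size defaults determine how many tokens the vision encoder processes. The arithmetic below is a sketch under the assumption that a single class token is prepended to the patch sequence, as in the original ViT-based CLIP; the variable names are illustrative.

```python
# Defaults taken from the CLIPVisionConfig signature above.
image_size, patch_size, num_channels, hidden_size = 224, 32, 3, 768

# The image is cut into a square grid of non-overlapping patches.
patches_per_side = image_size // patch_size
num_patches = patches_per_side ** 2

# Assumption: one class token is prepended to the patch sequence.
sequence_length = num_patches + 1

# Each patch holds this many raw values before projection to hidden_size.
patch_values = patch_size * patch_size * num_channels

print(patches_per_side, num_patches, sequence_length, patch_values)
```

Halving patch_size to 16 at the same image_size would quadruple the patch count, which is the main cost driver when moving to finer-grained patches.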

Example:

```python
>>> from transformers import CLIPVisionConfig, CLIPVisionModel

>>> # Initializing a CLIPVisionConfig with openai/clip-vit-base-patch32 style configuration
>>> configuration = CLIPVisionConfig()
>>> # Initializing a CLIPVisionModel (with random weights) from the openai/clip-vit-base-patch32 style configuration
>>> model = CLIPVisionModel(configuration)
>>> # Accessing the model configuration
>>> configuration = model.config
```
base_config_key: str = 'vision_config'#
get_partition_rules(*arg, **kwargs)#

Generic partition rules for CLIP text and vision models.

Parameters
  • self – The configuration object (unused but part of method signature).

  • *arg – Additional positional arguments (unused).

  • **kwargs – Additional keyword arguments (unused).

Returns

A tuple of partition rules for model parameters.

Return type

Tuple

model_type: str = 'clip_vision_model'#
class easydel.modules.clip.__init__.CLIPVisionModel(*args: Any, **kwargs: Any)[source]#

Bases: EasyDeLBaseModule

Bare CLIP vision model (transformer) outputting raw hidden-states without any specific head on top.

config#

Configuration object.

Type

CLIPVisionConfig

dtype#

Data type for computation.

Type

jnp.dtype

param_dtype#

Data type for parameters.

Type

jnp.dtype

precision#

JAX precision level.

Type

jax.lax.PrecisionLike

rngs#

Random number generators.

Type

nn.Rngs