easydel.modules.clip.init

class easydel.modules.clip.__init__.CLIPConfig(text_config=None, vision_config=None, projection_dim=512, logit_scale_init_value=2.6592, **kwargs)[source]#

Bases: EasyDeLBaseConfig

[CLIPConfig] is the configuration class to store the configuration of a [CLIPModel]. It is used to instantiate a CLIP model according to the specified arguments, defining the text model and vision model configs. Instantiating a configuration with the defaults will yield a similar configuration to that of the CLIP [openai/clip-vit-base-patch32](https://huggingface.co/openai/clip-vit-base-patch32) architecture.

Configuration objects inherit from [EasyDeLBaseConfig] and can be used to control the model outputs. Read the documentation from [EasyDeLBaseConfig] for more information.

Parameters

text_config (dict, optional) – Dictionary of configuration options used to initialize [CLIPTextConfig].
vision_config (dict, optional) – Dictionary of configuration options used to initialize [CLIPVisionConfig].
projection_dim (int, optional, defaults to 512) – Dimensionality of text and vision projection layers.
logit_scale_init_value (float, optional, defaults to 2.6592) – The initial value of the logit_scale parameter. Default is used as per the original CLIP implementation.
kwargs (optional) – Dictionary of keyword arguments.
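
The default of 2.6592 is easier to read once it is mapped back to a temperature. The sketch below assumes the convention of the original CLIP implementation, where the logit scale is stored in log space and initialised to log(1 / 0.07); the variable names are illustrative and not part of the EasyDeL API.

```python
import math

# Assumption: temperature value used by the original CLIP implementation.
temperature = 0.07

# The learnable scale is kept in log space: log(1 / temperature).
logit_scale_init_value = math.log(1.0 / temperature)

# At run time the similarity logits are multiplied by exp(logit_scale).
effective_scale = math.exp(logit_scale_init_value)

print(round(logit_scale_init_value, 4))
print(round(effective_scale, 2))
```

Under that assumption, the documented default agrees with log(1 / 0.07) to about four decimal places, and the effective multiplier on the cosine similarities is 1 / 0.07.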

Example:

```python
>>> from transformers import CLIPConfig, CLIPModel

>>> # Initializing a CLIPConfig with openai/clip-vit-base-patch32 style configuration
>>> configuration = CLIPConfig()

>>> # Initializing a CLIPModel (with random weights) from the openai/clip-vit-base-patch32 style configuration
>>> model = CLIPModel(configuration)

>>> # Accessing the model configuration
>>> configuration = model.config

>>> # We can also initialize a CLIPConfig from a CLIPTextConfig and a CLIPVisionConfig
>>> from transformers import CLIPTextConfig, CLIPVisionConfig

>>> # Initializing a CLIPText and CLIPVision configuration
>>> config_text = CLIPTextConfig()
>>> config_vision = CLIPVisionConfig()

>>> config = CLIPConfig.from_text_vision_configs(config_text, config_vision)
```

classmethod from_text_vision_configs(text_config: CLIPTextConfig, vision_config: CLIPVisionConfig, **kwargs)[source]#

Instantiate a [CLIPConfig] (or a derived class) from clip text model configuration and clip vision model configuration.

Parameters

text_config (CLIPTextConfig) – The text model configuration.
vision_config (CLIPVisionConfig) – The vision model configuration.
**kwargs – Additional keyword arguments.

Returns

An instance of a configuration object

Return type

[CLIPConfig]
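
As a rough illustration of what a constructor like this typically does, the sketch below builds a composite config from two sub-configs by converting each to a dictionary and forwarding extra keyword arguments. The `Toy*` classes are hypothetical stand-ins written for this example; they are not the EasyDeL classes, and the real method may differ in detail.

```python
from dataclasses import asdict, dataclass, field


@dataclass
class ToyTextConfig:  # hypothetical stand-in for CLIPTextConfig
    hidden_size: int = 512
    projection_dim: int = 512


@dataclass
class ToyVisionConfig:  # hypothetical stand-in for CLIPVisionConfig
    hidden_size: int = 768
    patch_size: int = 32


@dataclass
class ToyCLIPConfig:  # hypothetical stand-in for CLIPConfig
    text_config: dict = field(default_factory=dict)
    vision_config: dict = field(default_factory=dict)
    projection_dim: int = 512

    @classmethod
    def from_text_vision_configs(cls, text_config, vision_config, **kwargs):
        # Sub-configs are passed on as plain dictionaries, matching the
        # `dict` type documented for the text_config / vision_config arguments.
        return cls(
            text_config=asdict(text_config),
            vision_config=asdict(vision_config),
            **kwargs,
        )


toy = ToyCLIPConfig.from_text_vision_configs(
    ToyTextConfig(), ToyVisionConfig(patch_size=16), projection_dim=256
)
print(toy.vision_config["patch_size"], toy.projection_dim)
```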

get_partition_rules(*arg, **kwargs)#

Generic partition rules for CLIP text and vision models.

Parameters

self – The configuration object (unused but part of method signature).
*arg – Additional positional arguments (unused).
**kwargs – Additional keyword arguments (unused).

Returns

A tuple of partition rules for model parameters.

Return type

Tuple
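
The page does not show the shape of the returned rules. A common convention in JAX codebases, assumed here, is a tuple of `(regex, sharding spec)` pairs where the first pattern that matches a parameter path wins. The sketch below uses plain tuples of mesh-axis names in place of a real `PartitionSpec`; the patterns, axis names, and the `match_spec` helper are all hypothetical.

```python
import re

# Hypothetical rules: (regex on the parameter path, tuple of mesh-axis names).
toy_rules = (
    (r"(q_proj|k_proj|v_proj)/kernel$", ("fsdp", "tp")),
    (r"out_proj/kernel$", ("tp", "fsdp")),
    (r".*", (None,)),  # catch-all: leave the parameter replicated
)


def match_spec(param_path, rules):
    """Return the spec of the first rule whose pattern matches the path."""
    for pattern, spec in rules:
        if re.search(pattern, param_path):
            return spec
    raise ValueError(f"no partition rule matches {param_path!r}")


print(match_spec("text_model/encoder/layers/0/self_attn/q_proj/kernel", toy_rules))
print(match_spec("vision_model/pre_layrnorm/scale", toy_rules))
```

Ordering matters in this scheme: the catch-all pattern goes last so that the more specific rules are tried first.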

model_type: str = 'clip'#

sub_configs: dict[str, 'PretrainedConfig'] = {'text_config': <class 'easydel.modules.clip.clip_configuration.CLIPTextConfig'>, 'vision_config': <class 'easydel.modules.clip.clip_configuration.CLIPVisionConfig'>}#

class easydel.modules.clip.__init__.CLIPForImageClassification(*args: Any, **kwargs: Any)[source]#

Bases: EasyDeLBaseModule

CLIP vision model with an image classification head on top (a linear layer on the pooled final hidden state).

config#

Configuration object.

Type: CLIPVisionConfig

dtype#

Data type for computation.

Type: jnp.dtype

param_dtype#

Data type for parameters.

Type: jnp.dtype

precision#

JAX precision level.

Type: jax.lax.PrecisionLike

rngs#

Random number generators.

Type: nn.Rngs

class easydel.modules.clip.__init__.CLIPModel(*args: Any, **kwargs: Any)[source]#

Bases: EasyDeLBaseModule

compute_loss(*, labels=None, loss_config=None, loss_kwargs=None, **batch) → Tuple[Any, CLIPOutput][source]#

Computes the loss for the model given a batch of inputs and labels.

This method performs a forward pass using the provided batch arguments, then calculates the loss using the determined loss_function. It handles potential label inference (e.g., using input_ids as labels for Causal LM) and default loss configurations.

Parameters

labels (tp.Optional[chex.Array], optional) – The target labels. If None and the task is Causal LM, input_ids from the batch might be used. Defaults to None.
loss_config (tp.Optional[LossConfig], optional) – Specific configuration for the loss calculation. If None, defaults might be inferred (e.g., for sequence classification). Defaults to None.
loss_kwargs (tp.Optional[tp.Dict], optional) – Additional keyword arguments to pass directly to the loss function. Defaults to None.
**batch – Keyword arguments representing the input batch (e.g., input_ids, attention_mask).

Returns

A tuple containing:

The model’s output (a pytree, typically including logits, hidden states, etc.)
A LossMetrics object containing the calculated loss and potentially other metrics.

Return type

tp.Tuple[tp.Any, LossMetrics]

Raises

AssertionError – If labels are required for the loss function but are not provided or inferred.
AssertionError – If sequence classification loss is used without num_labels in the config.
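
The description above is generic to the base module and does not say which loss CLIP uses. The original CLIP paper trains with a symmetric contrastive loss over the image-text similarity matrix; whether this method computes exactly that form is an assumption here. The pure-Python sketch below illustrates that loss on a toy logits matrix, with hypothetical helper names.

```python
import math


def cross_entropy(rows, targets):
    """Mean cross-entropy of each row of logits against an integer target."""
    total = 0.0
    for row, target in zip(rows, targets):
        peak = max(row)  # subtract the max for numerical stability
        log_z = peak + math.log(sum(math.exp(x - peak) for x in row))
        total += log_z - row[target]
    return total / len(rows)


def clip_contrastive_loss(logits_per_text):
    """Average of text-to-image and image-to-text cross-entropy.

    Pair i is assumed to be the matching pair, so the targets are 0..n-1.
    """
    targets = list(range(len(logits_per_text)))
    logits_per_image = [list(col) for col in zip(*logits_per_text)]  # transpose
    return 0.5 * (
        cross_entropy(logits_per_text, targets)
        + cross_entropy(logits_per_image, targets)
    )


aligned = [[10.0, 0.0], [0.0, 10.0]]   # matching pairs score highest
shuffled = [[0.0, 10.0], [10.0, 0.0]]  # mismatched pairs score highest
print(clip_contrastive_loss(aligned) < clip_contrastive_loss(shuffled))
```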

get_image_features(pixel_values: Union[Array, ndarray, bool, number])[source]#

get_text_features(input_ids: Union[Array, ndarray, bool, number], attention_mask: Optional[Union[Array, ndarray, bool, number]] = None, position_ids: Optional[Union[Array, ndarray, bool, number]] = None)[source]#
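
Neither feature method is described on this page. In the standard CLIP recipe, image and text features are L2-normalised and compared with a scaled dot product. The sketch below shows that step on toy vectors, assuming the features are already projected into a shared space; `l2_normalize` and `similarity_logits` are illustrative helpers, not EasyDeL functions.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def similarity_logits(image_features, text_features, logit_scale=2.6592):
    """Cosine similarity of every image/text pair, scaled by exp(logit_scale)."""
    images = [l2_normalize(v) for v in image_features]
    texts = [l2_normalize(v) for v in text_features]
    scale = math.exp(logit_scale)
    return [
        [scale * sum(a * b for a, b in zip(img, txt)) for txt in texts]
        for img in images
    ]


# Toy 2-D features standing in for the outputs of the two feature methods.
toy_image_features = [[1.0, 0.0], [0.0, 2.0]]
toy_text_features = [[3.0, 0.0], [0.0, 1.0]]
logits_per_image = similarity_logits(toy_image_features, toy_text_features)
print(logits_per_image)
```

Because of the normalisation, only the direction of each feature vector matters; the differing magnitudes in the toy inputs have no effect on the logits.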

class easydel.modules.clip.__init__.CLIPTextConfig(vocab_size=49408, hidden_size=512, intermediate_size=2048, projection_dim=512, num_hidden_layers=12, num_attention_heads=8, max_position_embeddings=77, hidden_act='quick_gelu', layer_norm_eps=1e-05, attention_dropout=0.0, initializer_range=0.02, initializer_factor=1.0, pad_token_id=1, bos_token_id=49406, eos_token_id=49407, **kwargs)[source]#

Bases: EasyDeLBaseConfig

This is the configuration class to store the configuration of a [CLIPTextModel]. It is used to instantiate a CLIP text encoder according to the specified arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a similar configuration to that of the text encoder of the CLIP [openai/clip-vit-base-patch32](https://huggingface.co/openai/clip-vit-base-patch32) architecture.

Configuration objects inherit from [EasyDeLBaseConfig] and can be used to control the model outputs. Read the documentation from [EasyDeLBaseConfig] for more information.

Parameters

vocab_size (int, optional, defaults to 49408) – Vocabulary size of the CLIP text model. Defines the number of different tokens that can be represented by the input_ids passed when calling [CLIPModel].
hidden_size (int, optional, defaults to 512) – Dimensionality of the encoder layers and the pooler layer.
intermediate_size (int, optional, defaults to 2048) – Dimensionality of the “intermediate” (i.e., feed-forward) layer in the Transformer encoder.
projection_dim (int, optional, defaults to 512) – Dimensionality of text and vision projection layers.
num_hidden_layers (int, optional, defaults to 12) – Number of hidden layers in the Transformer encoder.
num_attention_heads (int, optional, defaults to 8) – Number of attention heads for each attention layer in the Transformer encoder.
max_position_embeddings (int, optional, defaults to 77) – The maximum sequence length that this model might ever be used with. Typically set this to something large just in case (e.g., 512 or 1024 or 2048).
hidden_act (str or function, optional, defaults to “quick_gelu”) – The non-linear activation function (function or string) in the encoder and pooler. If string, “gelu”, “relu”, “selu”, “gelu_new” and “quick_gelu” are supported.
layer_norm_eps (float, optional, defaults to 1e-05) – The epsilon used by the layer normalization layers.
attention_dropout (float, optional, defaults to 0.0) – The dropout ratio for the attention probabilities.
initializer_range (float, optional, defaults to 0.02) – The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
initializer_factor (float, optional, defaults to 1.0) – A factor for initializing all weight matrices (should be kept to 1, used internally for initialization testing).
pad_token_id (int, optional, defaults to 1) – Padding token id.
bos_token_id (int, optional, defaults to 49406) – Beginning of stream token id.
eos_token_id (int, optional, defaults to 49407) – End of stream token id.
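
Two quantities follow directly from these defaults: the per-head width of the attention layers and the hard cap on sequence length. The sketch below derives both. The `clip_to_max_length` helper is hypothetical and only illustrates one way to respect `max_position_embeddings`; it is not a description of any tokenizer's actual truncation behaviour.

```python
# Defaults copied from the CLIPTextConfig signature above.
hidden_size = 512
num_attention_heads = 8
max_position_embeddings = 77
bos_token_id = 49406
eos_token_id = 49407

# hidden_size must split evenly across the attention heads.
head_dim, remainder = divmod(hidden_size, num_attention_heads)


def clip_to_max_length(token_ids, max_len, eos):
    """Hypothetical helper: truncate to max_len while keeping a final EOS."""
    if len(token_ids) <= max_len:
        return list(token_ids)
    return list(token_ids[: max_len - 1]) + [eos]


# A made-up over-long sequence: BOS, 100 placeholder tokens, EOS.
too_long = [bos_token_id] + [320] * 100 + [eos_token_id]
truncated = clip_to_max_length(too_long, max_position_embeddings, eos_token_id)
print(head_dim, remainder, len(truncated))
```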

Example:

```python
>>> from transformers import CLIPTextConfig, CLIPTextModel

>>> # Initializing a CLIPTextConfig with openai/clip-vit-base-patch32 style configuration
>>> configuration = CLIPTextConfig()

>>> # Initializing a CLIPTextModel (with random weights) from the openai/clip-vit-base-patch32 style configuration
>>> model = CLIPTextModel(configuration)

>>> # Accessing the model configuration
>>> configuration = model.config
```

base_config_key: str = 'text_config'#

get_partition_rules(*arg, **kwargs)#

Generic partition rules for CLIP text and vision models.

Parameters

self – The configuration object (unused but part of method signature).
*arg – Additional positional arguments (unused).
**kwargs – Additional keyword arguments (unused).

Returns

A tuple of partition rules for model parameters.

Return type

Tuple

model_type: str = 'clip_text_model'#

class easydel.modules.clip.__init__.CLIPTextModel(*args: Any, **kwargs: Any)[source]#

Bases: EasyDeLBaseModule

Bare CLIP text model (transformer) outputting raw hidden-states without any specific head on top.

config#

Configuration object.

Type: CLIPTextConfig

dtype#

Data type for computation.

Type: jnp.dtype

param_dtype#

Data type for parameters.

Type: jnp.dtype

precision#

JAX precision level.

Type: jax.lax.PrecisionLike

rngs#

Random number generators.

Type: nn.Rngs

class easydel.modules.clip.__init__.CLIPTextModelWithProjection(*args: Any, **kwargs: Any)[source]#

Bases: EasyDeLBaseModule

CLIP text model with a projection layer on top.

config#

Configuration object.

Type: CLIPTextConfig

dtype#

Data type for computation.

Type: jnp.dtype

param_dtype#

Data type for parameters.

Type: jnp.dtype

precision#

JAX precision level.

Type: jax.lax.PrecisionLike

rngs#

Random number generators.

Type: nn.Rngs

class easydel.modules.clip.__init__.CLIPVisionConfig(hidden_size=768, intermediate_size=3072, projection_dim=512, num_hidden_layers=12, num_attention_heads=12, num_channels=3, image_size=224, patch_size=32, hidden_act='quick_gelu', layer_norm_eps=1e-05, attention_dropout=0.0, initializer_range=0.02, initializer_factor=1.0, **kwargs)[source]#

Bases: EasyDeLBaseConfig

This is the configuration class to store the configuration of a [CLIPVisionModel]. It is used to instantiate a CLIP vision encoder according to the specified arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a similar configuration to that of the vision encoder of the CLIP [openai/clip-vit-base-patch32](https://huggingface.co/openai/clip-vit-base-patch32) architecture.

Configuration objects inherit from [EasyDeLBaseConfig] and can be used to control the model outputs. Read the documentation from [EasyDeLBaseConfig] for more information.

Parameters

hidden_size (int, optional, defaults to 768) – Dimensionality of the encoder layers and the pooler layer.
intermediate_size (int, optional, defaults to 3072) – Dimensionality of the “intermediate” (i.e., feed-forward) layer in the Transformer encoder.
projection_dim (int, optional, defaults to 512) – Dimensionality of text and vision projection layers.
num_hidden_layers (int, optional, defaults to 12) – Number of hidden layers in the Transformer encoder.
num_attention_heads (int, optional, defaults to 12) – Number of attention heads for each attention layer in the Transformer encoder.
num_channels (int, optional, defaults to 3) – The number of input channels.
image_size (int, optional, defaults to 224) – The size (resolution) of each image.
patch_size (int, optional, defaults to 32) – The size (resolution) of each patch.
hidden_act (str or function, optional, defaults to “quick_gelu”) – The non-linear activation function (function or string) in the encoder and pooler. If string, “gelu”, “relu”, “selu”, “gelu_new” and “quick_gelu” are supported.
layer_norm_eps (float, optional, defaults to 1e-05) – The epsilon used by the layer normalization layers.
attention_dropout (float, optional, defaults to 0.0) – The dropout ratio for the attention probabilities.
initializer_range (float, optional, defaults to 0.02) – The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
initializer_factor (float, optional, defaults to 1.0) – A factor for initializing all weight matrices (should be kept to 1, used internally for initialization testing).
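
The image and patch sizes together fix the length of the sequence the vision Transformer sees. The arithmetic below uses the defaults above and assumes, as in the standard CLIP ViT design, that one class token is prepended to the patch embeddings; the variable names are illustrative.

```python
# Defaults copied from the CLIPVisionConfig signature above.
image_size = 224
patch_size = 32
hidden_size = 768
num_attention_heads = 12

# The image is cut into a square grid of non-overlapping patches.
patches_per_side = image_size // patch_size
num_patches = patches_per_side**2

# Assumption: a single class token is prepended to the patch sequence.
sequence_length = num_patches + 1

# Width of each attention head.
head_dim = hidden_size // num_attention_heads

print(patches_per_side, num_patches, sequence_length, head_dim)
```

Halving `patch_size` to 16 at the same `image_size` quadruples the number of patches, which is the main reason smaller-patch variants cost more to run.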

Example:

```python
>>> from transformers import CLIPVisionConfig, CLIPVisionModel

>>> # Initializing a CLIPVisionConfig with openai/clip-vit-base-patch32 style configuration
>>> configuration = CLIPVisionConfig()

>>> # Initializing a CLIPVisionModel (with random weights) from the openai/clip-vit-base-patch32 style configuration
>>> model = CLIPVisionModel(configuration)

>>> # Accessing the model configuration
>>> configuration = model.config
```

base_config_key: str = 'vision_config'#

get_partition_rules(*arg, **kwargs)#

Generic partition rules for CLIP text and vision models.

Parameters

self – The configuration object (unused but part of method signature).
*arg – Additional positional arguments (unused).
**kwargs – Additional keyword arguments (unused).

Returns

A tuple of partition rules for model parameters.

Return type

Tuple

model_type: str = 'clip_vision_model'#

class easydel.modules.clip.__init__.CLIPVisionModel(*args: Any, **kwargs: Any)[source]#

Bases: EasyDeLBaseModule

Bare CLIP vision model (transformer) outputting raw hidden-states without any specific head on top.

config#

Configuration object.

Type: CLIPVisionConfig

dtype#

Data type for computation.

Type: jnp.dtype

param_dtype#

Data type for parameters.

Type: jnp.dtype

precision#

JAX precision level.

Type: jax.lax.PrecisionLike

rngs#

Random number generators.

Type: nn.Rngs
