easydel.modules.siglip.modeling_siglip

class easydel.modules.siglip.modeling_siglip.MultiheadAttention(*args: Any, **kwargs: Any)

Bases: Module

Simple multi-head attention used by the vision pooling head.

class easydel.modules.siglip.modeling_siglip.SiglipAttention(*args: Any, **kwargs: Any)

Bases: AttentionModule

Multi-head self-attention module used across SigLIP encoders.
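Both attention classes above implement multi-head scaled dot-product attention. The NumPy sketch below illustrates the technique under simplifying assumptions: plain projection matrices without biases, no masking or dropout, and a hypothetical function name `multi_head_self_attention`. It is not the EasyDeL implementation (which is written in JAX), although the array operations map directly onto `jax.numpy`.

```python
import numpy as np

def multi_head_self_attention(x, w_q, w_k, w_v, w_o, num_heads):
    """Illustrative multi-head self-attention over x of shape (batch, seq, dim)."""
    batch, seq, dim = x.shape
    head_dim = dim // num_heads

    def split_heads(t):
        # (batch, seq, dim) -> (batch, heads, seq, head_dim)
        return t.reshape(batch, seq, num_heads, head_dim).transpose(0, 2, 1, 3)

    # Project inputs, then split the feature axis into heads
    q, k, v = split_heads(x @ w_q), split_heads(x @ w_k), split_heads(x @ w_v)
    # Scaled dot-product scores: (batch, heads, seq, seq)
    scores = q @ k.transpose(0, 1, 3, 2) / np.sqrt(head_dim)
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)  # softmax over keys
    # Weighted sum of values, then merge heads back to (batch, seq, dim)
    out = (weights @ v).transpose(0, 2, 1, 3).reshape(batch, seq, dim)
    return out @ w_o, weights

rng = np.random.default_rng(0)
x = rng.normal(size=(2, 5, 8))
w_q, w_k, w_v, w_o = (rng.normal(size=(8, 8)) for _ in range(4))
out, attn = multi_head_self_attention(x, w_q, w_k, w_v, w_o, num_heads=2)
print(out.shape, attn.shape)  # (2, 5, 8) (2, 2, 5, 5)
```

The returned `weights` correspond to what an `output_attentions` flag would expose: one row-stochastic matrix per head.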

class easydel.modules.siglip.modeling_siglip.SiglipEncoder(*args: Any, **kwargs: Any)

Bases: Module

Stack of SigLIP encoder layers with optional attention and hidden state capture.
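The optional capture of hidden states and attentions can be pictured as the loop below. This is a sketch that assumes the common Hugging Face convention: `hidden_states` holds the input to each layer plus the final output (so `num_layers + 1` entries), and each layer returns a `(hidden, attention_weights)` pair. `run_encoder` and `toy_layer` are hypothetical.

```python
import numpy as np

def run_encoder(hidden, layers, output_hidden_states=False, output_attentions=False):
    """Apply layers in order, optionally collecting intermediate results."""
    all_hidden = () if output_hidden_states else None
    all_attn = () if output_attentions else None
    for layer in layers:
        if output_hidden_states:
            all_hidden += (hidden,)  # state entering this layer
        hidden, attn = layer(hidden)
        if output_attentions:
            all_attn += (attn,)
    if output_hidden_states:
        all_hidden += (hidden,)  # final layer output is appended last
    return hidden, all_hidden, all_attn

# Toy layer: add 1 to every activation and return uniform attention weights.
def toy_layer(h):
    seq = h.shape[1]
    return h + 1.0, np.full((h.shape[0], seq, seq), 1.0 / seq)

final, hs, attns = run_encoder(np.zeros((1, 4, 8)), [toy_layer] * 3, True, True)
print(len(hs), len(attns))  # 4 3
```

When both flags are off, nothing is accumulated and the two tuples are returned as `None`.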

class easydel.modules.siglip.modeling_siglip.SiglipEncoderLayer(*args: Any, **kwargs: Any)

Bases: Module

Transformer encoder block with pre-norm attention and MLP.
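"Pre-norm" means each sub-layer receives a layer-normalized copy of its input, while the residual path carries the unnormalized activations. The following minimal NumPy sketch passes in the attention and MLP sub-layers as callables. The function names and the `eps=1e-6` default are assumptions for illustration.

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-6):
    """Normalize over the last axis, then apply learned scale and shift."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps) * gamma + beta

def pre_norm_block(x, attn_fn, mlp_fn, ln1, ln2):
    """ln1 and ln2 are (gamma, beta) pairs for the two LayerNorms."""
    x = x + attn_fn(layer_norm(x, *ln1))  # residual around attention
    x = x + mlp_fn(layer_norm(x, *ln2))   # residual around the MLP
    return x

dim = 8
ln = (np.ones(dim), np.zeros(dim))
x = np.random.default_rng(0).normal(size=(2, 5, dim))
# With a zero attention branch and an identity MLP, the block adds LN(x) to x.
out = pre_norm_block(x, np.zeros_like, lambda h: h, ln, ln)
```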

class easydel.modules.siglip.modeling_siglip.SiglipForImageClassification(*args: Any, **kwargs: Any)

Bases: EasyDeLBaseModule

Image-classification head on top of the SigLIP vision encoder.
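The sketch below shows the classification computation, assuming the head matches the Hugging Face `transformers` port of SigLIP. In that port, the vision tower's `last_hidden_state` is mean-pooled over the patch axis and then passed to a single linear classifier. The names `classify`, `w_cls`, and `b_cls` are hypothetical.

```python
import numpy as np

def classify(last_hidden_state, w_cls, b_cls):
    """last_hidden_state: (batch, num_patches, hidden) -> logits (batch, num_labels)."""
    pooled = last_hidden_state.mean(axis=1)  # average over patch tokens
    return pooled @ w_cls + b_cls

rng = np.random.default_rng(0)
hidden = rng.normal(size=(2, 196, 16))   # e.g. a 14x14 patch grid
w_cls, b_cls = rng.normal(size=(16, 10)), np.zeros(10)
logits = classify(hidden, w_cls, b_cls)
predictions = logits.argmax(axis=-1)      # predicted class index per image
```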

get_decoder()

Returns the decoder part of the model’s graph definition. This is an encoder-only model for classification.

get_embedding()

Returns the embedding layer of the module.

get_encoder()

Returns the encoder part of the model’s graph definition.

get_lm_head()

Returns the language model head of the module. This model has an image classification head, not a language model head.

class easydel.modules.siglip.modeling_siglip.SiglipMLP(*args: Any, **kwargs: Any)

Bases: Module

Two-layer feed-forward network for SigLIP transformer blocks.
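The two-layer network expands each token to an intermediate width, applies a nonlinearity, and projects back to the original width. The sketch below uses GELU with the tanh approximation, which is the default activation (`gelu_pytorch_tanh`) in Hugging Face's SigLIP configurations. For EasyDeL, treat both the activation choice and the hypothetical function names as assumptions.

```python
import numpy as np

def gelu_tanh(x):
    """GELU, tanh approximation."""
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def siglip_mlp(x, w1, b1, w2, b2):
    """fc1 -> GELU -> fc2, applied independently at every token position."""
    return gelu_tanh(x @ w1 + b1) @ w2 + b2

rng = np.random.default_rng(0)
hidden, intermediate = 8, 32
w1, b1 = rng.normal(size=(hidden, intermediate)), np.zeros(intermediate)
w2, b2 = rng.normal(size=(intermediate, hidden)), np.zeros(hidden)
y = siglip_mlp(rng.normal(size=(2, 5, hidden)), w1, b1, w2, b2)
```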

class easydel.modules.siglip.modeling_siglip.SiglipModel(*args: Any, **kwargs: Any)

Bases: EasyDeLBaseModule

Full SigLIP contrastive model combining text and vision towers.
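SigLIP replaces the softmax contrastive loss with a pairwise sigmoid loss. Every image/text pair in the batch is treated as an independent binary classification, labelled +1 on the diagonal (matching pairs) and -1 elsewhere. The NumPy sketch below assumes L2-normalized embeddings, a learned log-temperature `logit_scale`, and a learned bias `logit_bias`; the SigLIP paper initializes these to log 10 and -10. The function names are hypothetical. The outputs are named after the `SiglipOutput` fields.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    return x / np.maximum(np.linalg.norm(x, axis=-1, keepdims=True), eps)

def log_sigmoid(z):
    # Numerically stable log(sigmoid(z)) = -log(1 + exp(-z))
    return -np.logaddexp(0.0, -z)

def siglip_logits_and_loss(text_embeds, image_embeds, logit_scale, logit_bias):
    text_embeds = l2_normalize(text_embeds)
    image_embeds = l2_normalize(image_embeds)
    # Rows index texts, columns index images
    logits_per_text = text_embeds @ image_embeds.T * np.exp(logit_scale) + logit_bias
    logits_per_image = logits_per_text.T
    n = logits_per_text.shape[0]
    labels = 2.0 * np.eye(n) - 1.0  # +1 for matching pairs, -1 elsewhere
    # Sum the per-pair binary losses for each text, then average over the batch
    loss = -log_sigmoid(labels * logits_per_text).sum(axis=-1).mean()
    return logits_per_image, logits_per_text, loss

rng = np.random.default_rng(0)
text, image = rng.normal(size=(4, 16)), rng.normal(size=(4, 16))
logits_per_image, logits_per_text, loss = siglip_logits_and_loss(
    text, image, logit_scale=np.log(10.0), logit_bias=-10.0)
```

Unlike a softmax loss, no normalization runs across the batch, so each pair's term can be computed independently.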

get_decoder()

Returns the decoder part of the model’s graph definition. The text model acts as the decoder in this multi-modal setup.

get_embedding()

Returns the embedding layer of the text model.

get_encoder()

Returns the encoder part of the model’s graph definition. The vision tower acts as the encoder in this multi-modal setup.

get_image_features(pixel_values: Optional[Union[Array, ndarray, bool, number]] = None, output_attentions: bool | None = None, output_hidden_states: bool | None = None, interpolate_pos_encoding: bool = False) → Union[Array, ndarray, bool, number]

get_lm_head()

Returns the language model head of the module. This model has no traditional language model head; it uses a projection head instead.

get_text_features(input_ids: jaxtyping.Int[Array, 'batch seq_len'] | None = None, attention_mask: jaxtyping.Bool[Array, 'batch seq_len'] | None = None, mask_info: ejkernel.types.mask.MaskInfo | None = None, position_ids: jaxtyping.Int[Array, 'batch seq_len'] | None = None, output_attentions: bool | None = None, output_hidden_states: bool | None = None) → Union[Array, ndarray, bool, number]

class easydel.modules.siglip.modeling_siglip.SiglipMultiheadAttentionPoolingHead(*args: Any, **kwargs: Any)

Bases: Module

Pools vision tokens with a learned probe followed by MLP refinement.
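The pooling head uses a single learned query, the "probe", which attends over all patch tokens. A residual MLP then refines the pooled vector. The sketch simplifies this to single-head attention without query/key/value projections and uses a LayerNorm without learned parameters. Those simplifications and all names are assumptions.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def layer_norm(x, eps=1e-6):
    mean = x.mean(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(x.var(axis=-1, keepdims=True) + eps)

def attention_pool(tokens, probe, mlp_fn):
    """tokens: (batch, seq, dim); probe: (1, 1, dim) learned query."""
    batch, _, dim = tokens.shape
    query = np.broadcast_to(probe, (batch, 1, dim))
    scores = query @ tokens.transpose(0, 2, 1) / np.sqrt(dim)  # (batch, 1, seq)
    pooled = softmax(scores) @ tokens                           # (batch, 1, dim)
    pooled = pooled + mlp_fn(layer_norm(pooled))                # residual MLP
    return pooled[:, 0]                                         # (batch, dim)

row = np.arange(8.0)
tokens = np.tile(row, (3, 5, 1))  # every token identical, shape (3, 5, 8)
pooled = attention_pool(tokens, np.ones((1, 1, 8)), mlp_fn=np.zeros_like)
```

With identical tokens the attention weights are uniform, so the pooled vector equals the shared token.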

class easydel.modules.siglip.modeling_siglip.SiglipOutput(loss: Optional[Union[Array, ndarray, bool, number]] = None, logits_per_image: Union[Array, ndarray, bool, number] = None, logits_per_text: Union[Array, ndarray, bool, number] = None, text_embeds: Union[Array, ndarray, bool, number] = None, image_embeds: Union[Array, ndarray, bool, number] = None, text_model_output: BaseModelOutputWithPooling = None, vision_model_output: BaseModelOutputWithPooling = None)

Bases: ModelOutput

Contrastive SigLIP output bundling text/vision logits and embeddings.

classmethod from_dict(data: dict[str, Any]) → T

Deserializes a dictionary into a PyTree object.

classmethod from_json(json_str: str) → T

Deserializes a JSON string into a PyTree object.

image_embeds: Union[Array, ndarray, bool, number] = None
logits_per_image: Union[Array, ndarray, bool, number] = None
logits_per_text: Union[Array, ndarray, bool, number] = None
loss: Optional[Union[Array, ndarray, bool, number]] = None

replace(**kwargs)

Creates a new instance with specified fields replaced.

text_embeds: Union[Array, ndarray, bool, number] = None
text_model_output: BaseModelOutputWithPooling = None

to_dict() → dict[str, Any]

Serializes the PyTree object to a dictionary.

to_json(**kwargs) → str

Serializes the PyTree object to a JSON string.

to_tuple() → tuple[Any]

Convert self to a tuple containing all the attributes/keys that are not None.

vision_model_output: BaseModelOutputWithPooling = None

class easydel.modules.siglip.modeling_siglip.SiglipTextEmbeddings(*args: Any, **kwargs: Any)

Bases: Module

Token and position embeddings for the SigLIP text encoder.
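Token and position embeddings are two lookup tables whose rows are summed at each position. The sketch assumes learned absolute position embeddings and, when no `position_ids` are given, default ids `0..seq_len-1`. The names are hypothetical.

```python
import numpy as np

def text_embeddings(input_ids, token_table, position_table, position_ids=None):
    """input_ids: (batch, seq_len) -> embeddings (batch, seq_len, dim)."""
    seq_len = input_ids.shape[-1]
    if position_ids is None:
        position_ids = np.arange(seq_len)[None, :]  # (1, seq_len), broadcast over batch
    return token_table[input_ids] + position_table[position_ids]

rng = np.random.default_rng(0)
token_table = rng.normal(size=(100, 8))    # vocab_size x dim
position_table = rng.normal(size=(16, 8))  # max_positions x dim
ids = rng.integers(0, 100, size=(2, 6))
emb = text_embeddings(ids, token_table, position_table)
```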

class easydel.modules.siglip.modeling_siglip.SiglipTextModel(*args: Any, **kwargs: Any)

Bases: EasyDeLBaseModule

Public text-only SigLIP model wrapper exposing the transformer backbone.

get_decoder()

Returns the decoder part of the model’s graph definition. This is an encoder-only model.

get_embedding()

Returns the embedding layer of the module.

get_encoder()

Returns the encoder part of the model’s graph definition.

get_lm_head()

Returns the language model head of the module. This model has a projection head, not a language model head.

class easydel.modules.siglip.modeling_siglip.SiglipTextModelOutput(text_embeds: Optional[Union[Array, ndarray, bool, number]] = None, last_hidden_state: Union[Array, ndarray, bool, number] = None, hidden_states: tuple[Union[jax.Array, numpy.ndarray, numpy.bool, numpy.number], ...] | None = None, attentions: tuple[Union[jax.Array, numpy.ndarray, numpy.bool, numpy.number], ...] | None = None)

Bases: ModelOutput

Outputs from the SigLIP text encoder with optional attentions.

attentions: tuple[Union[jax.Array, numpy.ndarray, numpy.bool, numpy.number], ...] | None = None

classmethod from_dict(data: dict[str, Any]) → T

Deserializes a dictionary into a PyTree object.

classmethod from_json(json_str: str) → T

Deserializes a JSON string into a PyTree object.

hidden_states: tuple[Union[jax.Array, numpy.ndarray, numpy.bool, numpy.number], ...] | None = None
last_hidden_state: Union[Array, ndarray, bool, number] = None

replace(**kwargs)

Creates a new instance with specified fields replaced.

text_embeds: Optional[Union[Array, ndarray, bool, number]] = None

to_dict() → dict[str, Any]

Serializes the PyTree object to a dictionary.

to_json(**kwargs) → str

Serializes the PyTree object to a JSON string.

class easydel.modules.siglip.modeling_siglip.SiglipTextTransformer(*args: Any, **kwargs: Any)

Bases: EasyDeLBaseModule

Text-side transformer backbone providing embeddings, encoder, and projection head.
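After the encoder, the text tower reduces the token sequence to a single embedding. Assuming EasyDeL follows the Hugging Face port, the tower does this in three steps:

1. Apply a final LayerNorm.
2. Take the hidden state at the last sequence position.
3. Pass that vector through a linear projection head.

The sketch omits the LayerNorm's learned scale and shift. All names are hypothetical, though the outputs are named after the `SiglipTextModelOutput` fields.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    mean = x.mean(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(x.var(axis=-1, keepdims=True) + eps)

def pool_text(encoder_output, w_head, b_head):
    last_hidden_state = layer_norm(encoder_output)  # final LayerNorm (no learned params here)
    pooled = last_hidden_state[:, -1, :]            # hidden state at the last position
    return last_hidden_state, pooled @ w_head + b_head

rng = np.random.default_rng(0)
enc = rng.normal(size=(2, 64, 16))
w_head, b_head = rng.normal(size=(16, 16)), np.zeros(16)
last_hidden_state, text_embeds = pool_text(enc, w_head, b_head)
```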

get_decoder()

Returns the decoder part of the model’s graph definition. This is an encoder-only model.

get_embedding()

Returns the embedding layer of the module.

get_encoder()

Returns the encoder part of the model’s graph definition.

get_lm_head()

Returns the language model head of the module. This model has a projection head, not a language model head.

class easydel.modules.siglip.modeling_siglip.SiglipVisionEmbeddings(*args: Any, **kwargs: Any)

Bases: Module

Patch projection and positional encoding for the SigLIP vision encoder.
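In SigLIP ports, the patch projection is typically a convolution whose kernel size and stride both equal the patch size. That operation is equivalent to cutting the image into non-overlapping patches, flattening each one, and applying a single shared linear map. The sketch below shows the flatten-and-project form. It assumes channels-last (NHWC) input, image sides divisible by the patch size, and learned absolute position embeddings; all names are hypothetical.

```python
import numpy as np

def patchify(pixel_values, patch_size):
    """(batch, H, W, C) -> (batch, num_patches, patch_size * patch_size * C)."""
    b, h, w, c = pixel_values.shape
    p = patch_size
    x = pixel_values.reshape(b, h // p, p, w // p, p, c)
    x = x.transpose(0, 1, 3, 2, 4, 5)  # (b, grid_h, grid_w, p, p, c), row-major patch order
    return x.reshape(b, (h // p) * (w // p), p * p * c)

def vision_embeddings(pixel_values, w_patch, b_patch, position_table, patch_size):
    patches = patchify(pixel_values, patch_size)
    num_patches = patches.shape[1]
    # Shared linear projection per patch, plus one learned position vector per patch slot
    return patches @ w_patch + b_patch + position_table[None, :num_patches]

rng = np.random.default_rng(0)
pixels = rng.normal(size=(2, 32, 32, 3))
w_patch, b_patch = rng.normal(size=(8 * 8 * 3, 16)), np.zeros(16)
position_table = rng.normal(size=(16, 16))  # (32 / 8) ** 2 = 16 patch slots
emb = vision_embeddings(pixels, w_patch, b_patch, position_table, patch_size=8)
```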

interpolate(embeddings: Union[Array, ndarray, bool, number], height: int, width: int)

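Judging by its signature and the `interpolate_pos_encoding` flag on `get_image_features`, this method presumably resamples the learned position-embedding grid. That would let an input at a different resolution from the training one still receive one position vector per patch. The sketch below illustrates the idea with bilinear interpolation; the actual kernel may differ (bicubic is common). `resize_bilinear` and `interpolate_pos_embeddings` are hypothetical helpers, and their arguments do not reproduce the method's signature.

```python
import numpy as np

def resize_bilinear(grid, out_h, out_w):
    """Bilinearly resize a (H, W, D) grid to (out_h, out_w, D)."""
    in_h, in_w, _ = grid.shape
    # Map output cell centres to input coordinates (half-pixel convention)
    ys = np.clip((np.arange(out_h) + 0.5) * in_h / out_h - 0.5, 0, in_h - 1)
    xs = np.clip((np.arange(out_w) + 0.5) * in_w / out_w - 0.5, 0, in_w - 1)
    y0, x0 = np.floor(ys).astype(int), np.floor(xs).astype(int)
    y1, x1 = np.minimum(y0 + 1, in_h - 1), np.minimum(x0 + 1, in_w - 1)
    wy = (ys - y0)[:, None, None]
    wx = (xs - x0)[None, :, None]
    # Interpolate along width on two neighbouring rows, then along height
    top = grid[y0][:, x0] * (1 - wx) + grid[y0][:, x1] * wx
    bottom = grid[y1][:, x0] * (1 - wx) + grid[y1][:, x1] * wx
    return top * (1 - wy) + bottom * wy

def interpolate_pos_embeddings(position_table, height, width, patch_size):
    num_pos, dim = position_table.shape
    side = int(round(num_pos ** 0.5))  # assumes a square training grid
    grid = position_table.reshape(side, side, dim)
    new_grid = resize_bilinear(grid, height // patch_size, width // patch_size)
    return new_grid.reshape(-1, dim)

table = np.random.default_rng(0).normal(size=(16, 8))  # 4x4 grid of 8-dim vectors
new_table = interpolate_pos_embeddings(table, height=48, width=32, patch_size=8)
```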
class easydel.modules.siglip.modeling_siglip.SiglipVisionModel(*args: Any, **kwargs: Any)

Bases: Module

Convenience wrapper around the SigLIP vision transformer backbone.

get_decoder()

Returns the decoder part of the model’s graph definition. This is an encoder-only model.

get_embedding()

Returns the embedding layer of the module.

get_encoder()

Returns the encoder part of the model’s graph definition.

get_lm_head()

Returns the language model head of the module. This vision model does not have a language model head.

class easydel.modules.siglip.modeling_siglip.SiglipVisionModelOutput(image_embeds: Optional[Union[Array, ndarray, bool, number]] = None, last_hidden_state: Union[Array, ndarray, bool, number] = None, hidden_states: tuple[Union[jax.Array, numpy.ndarray, numpy.bool, numpy.number], ...] | None = None, attentions: tuple[Union[jax.Array, numpy.ndarray, numpy.bool, numpy.number], ...] | None = None)

Bases: ModelOutput

Outputs from the SigLIP vision tower including pooled embeddings.

attentions: tuple[Union[jax.Array, numpy.ndarray, numpy.bool, numpy.number], ...] | None = None

classmethod from_dict(data: dict[str, Any]) → T

Deserializes a dictionary into a PyTree object.

classmethod from_json(json_str: str) → T

Deserializes a JSON string into a PyTree object.

hidden_states: tuple[Union[jax.Array, numpy.ndarray, numpy.bool, numpy.number], ...] | None = None
image_embeds: Optional[Union[Array, ndarray, bool, number]] = None
last_hidden_state: Union[Array, ndarray, bool, number] = None

replace(**kwargs)

Creates a new instance with specified fields replaced.

to_dict() → dict[str, Any]

Serializes the PyTree object to a dictionary.

to_json(**kwargs) → str

Serializes the PyTree object to a JSON string.

class easydel.modules.siglip.modeling_siglip.SiglipVisionTransformer(*args: Any, **kwargs: Any)

Bases: EasyDeLBaseModule

Vision-side transformer encoder used by SigLIP for patch representations.

get_decoder()

Returns the decoder part of the model’s graph definition. This is an encoder-only model.

get_embedding()

Returns the embedding layer of the module.

get_encoder()

Returns the encoder part of the model’s graph definition.

get_lm_head()

Returns the language model head of the module. This vision model does not have a language model head.