easydel.modules.mistral3.modeling_mistral3


class easydel.modules.mistral3.modeling_mistral3.Mistral3CausalLMOutputWithPast(loss: Optional[Union[Array, ndarray, bool, number]] = None, logits: Union[Array, ndarray, bool, number] = None, past_key_values: easydel.layers.caching.transformer.cache.TransformerCache | None = None, hidden_states: tuple[Union[jax.Array, numpy.ndarray, numpy.bool, numpy.number]] | None = None, attentions: tuple[Union[jax.Array, numpy.ndarray, numpy.bool, numpy.number]] | None = None, image_hidden_states: jaxtyping.Float[Array, 'batch seq_len hidden_dim'] | None = None)[source]#

Bases: ModelOutput

Base class for Mistral3 causal language model (or autoregressive) outputs.

Parameters
  • loss (chex.Array of shape (1,), optional, returned when labels is provided) – Language modeling loss (for next-token prediction).

  • logits (chex.Array of shape (batch_size, sequence_length, config.vocab_size)) – Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).

  • past_key_values (tuple(tuple(chex.Array)), optional, returned when use_cache=True is passed or when config.use_cache=True) – Tuple of tuple(chex.Array) of length config.n_layers, with each tuple having 2 tensors of shape (batch_size, num_heads, sequence_length, embed_size_per_head).

    Contains pre-computed hidden states (keys and values in the self-attention blocks) that can be used (see the past_key_values input) to speed up sequential decoding. Note that the class signature annotates this field as a TransformerCache rather than a raw tuple.

  • hidden_states (tuple(chex.Array), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) – Tuple of chex.Array (one for the output of the embeddings, if the model has an embedding layer, plus one for the output of each layer) of shape (batch_size, sequence_length, hidden_size).

    Hidden states of the model at the output of each layer, plus the optional initial embedding outputs.

  • attentions (tuple(chex.Array), optional, returned when output_attentions=True is passed or when config.output_attentions=True) – Tuple of chex.Array (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length).

    Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.

  • image_hidden_states (chex.Array, optional) – A chex.Array of size (batch_size * num_patches, num_images, sequence_length, hidden_size). Image hidden states produced by the vision encoder after projecting its last hidden state.

attentions: tuple[Union[jax.Array, numpy.ndarray, numpy.bool, numpy.number]] | None = None#
classmethod from_dict(data: dict[str, Any]) T#

Deserializes a dictionary into a PyTree object.

classmethod from_json(json_str: str) T#

Deserializes a JSON string into a PyTree object.

hidden_states: tuple[Union[jax.Array, numpy.ndarray, numpy.bool, numpy.number]] | None = None#
image_hidden_states: jaxtyping.Float[Array, 'batch seq_len hidden_dim'] | None = None#
logits: Union[Array, ndarray, bool, number] = None#
loss: Optional[Union[Array, ndarray, bool, number]] = None#
past_key_values: easydel.layers.caching.transformer.cache.TransformerCache | None = None#
replace(**kwargs)#

Creates a new instance with specified fields replaced.

to_dict() dict[str, Any]#

Serializes the PyTree object to a dictionary.

to_json(**kwargs) str#

Serializes the PyTree object to a JSON string.
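
The helper methods listed above (replace, to_dict, from_dict, to_json, from_json) follow a familiar immutable-container pattern. The sketch below illustrates that pattern with a plain frozen dataclass. ToyOutput is a hypothetical stand-in written for this example; it is not EasyDeL's ModelOutput and makes no claim about how EasyDeL serializes arrays.

```python
import dataclasses
import json
from typing import Any, Optional


@dataclasses.dataclass(frozen=True)
class ToyOutput:
    """Hypothetical stand-in for a ModelOutput-style container (not EasyDeL code)."""

    loss: Optional[float] = None
    logits: Optional[list] = None

    def replace(self, **kwargs: Any) -> "ToyOutput":
        # Build a new instance; the original object is left untouched.
        return dataclasses.replace(self, **kwargs)

    def to_dict(self) -> dict:
        return dataclasses.asdict(self)

    @classmethod
    def from_dict(cls, data: dict) -> "ToyOutput":
        return cls(**data)

    def to_json(self, **kwargs: Any) -> str:
        # Extra keyword arguments are forwarded to json.dumps in this sketch.
        return json.dumps(self.to_dict(), **kwargs)

    @classmethod
    def from_json(cls, json_str: str) -> "ToyOutput":
        return cls.from_dict(json.loads(json_str))


out = ToyOutput(logits=[[0.1, 0.9]])
with_loss = out.replace(loss=0.5)  # new object, `out` keeps loss=None
round_trip = ToyOutput.from_json(with_loss.to_json())
```

In this toy version, replace never mutates the receiver, which is the behaviour the description "Creates a new instance with specified fields replaced" suggests.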

class easydel.modules.mistral3.modeling_mistral3.Mistral3ForConditionalGeneration(*args: Any, **kwargs: Any)[source]#

Bases: BaseVisionLanguageModule[Mistral3Model, Mistral3Config]

Mistral3 model for conditional generation with vision-language capabilities.

Combines a vision tower, patch merger/projector, and language model for image-to-text generation. Inherits from BaseVisionLanguageModule.

config#

Configuration object.

Type

Mistral3Config

dtype#

Data type for computation.

Type

jnp.dtype

param_dtype#

Data type for parameters.

Type

jnp.dtype

precision#

JAX precision level.

Type

jax.lax.PrecisionLike

rngs#

Random number generators.

Type

nn.Rngs

Class Attributes:

  • _task_type: IMAGE_TEXT_TO_TEXT task type

  • _model_type: “mistral3” model identifier

  • _supports_video: False (Mistral3 is image-only)

  • _uses_mrope: False (uses standard RoPE)

apply_lm_head(hidden_states: Array) Array[source]#

Apply the language modeling head.

get_image_features(pixel_values: Float[Array, 'batch channels height width'], image_sizes: Optional[Union[Array, ndarray, bool, number]] = None, **kwargs) Float[Array, 'batch num_patches hidden'][source]#

Extract and project image features from pixel values.

Mistral3 uses a patch merger that requires image_sizes to handle variable-sized images.

Parameters
  • pixel_values – Input image pixel values

  • image_sizes – Original sizes of the images (height, width) for patch merging

  • **kwargs – Additional arguments (unused)

Returns

Projected image features ready for merging with text embeddings
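
The reference does not describe how the returned features are merged with text embeddings. One common approach in vision-language models is to overwrite the embeddings at image placeholder positions; the sketch below shows that idea with NumPy. The function name, the image_token_id argument, and the flat (num_image_tokens, hidden) feature layout are assumptions made for illustration and are not taken from EasyDeL.

```python
import numpy as np


def sketch_merge_image_features(input_ids, text_embeds, image_features, image_token_id):
    """Hypothetical merge: replace placeholder-token embeddings with image features."""
    is_image = input_ids == image_token_id  # (batch, seq_len) boolean mask
    if int(is_image.sum()) != image_features.shape[0]:
        raise ValueError("number of image placeholders and image features differ")
    merged = text_embeds.copy()  # leave the caller's array untouched
    merged[is_image] = image_features  # rows are filled in row-major order
    return merged


# Toy data: token id 99 is the (assumed) image placeholder.
input_ids = np.array([[1, 99, 99, 2]])
text_embeds = np.zeros((1, 4, 3), dtype=np.float32)
image_features = np.ones((2, 3), dtype=np.float32)
merged = sketch_merge_image_features(input_ids, text_embeds, image_features, image_token_id=99)
```

The same indexing can be expressed with jnp.where in JAX, where arrays are immutable and in-place assignment is not available.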

get_language_model() Module[source]#

Returns the language model component.

get_projector() Module[source]#

Returns the multimodal projector component.

get_vision_tower() Module[source]#

Returns the vision tower component.

init_cache(batch_size: int, max_length: int, starts: int | None = None, shardings: dict | None = None, pad_token_id: int | None = None)[source]#

Initialize KV cache for generation.

loss_type = 'ForCausalLM'#
class easydel.modules.mistral3.modeling_mistral3.Mistral3Model(*args: Any, **kwargs: Any)[source]#

Bases: EasyDeLBaseModule

Multimodal Mistral3 wrapper combining a vision tower, projector, and language model.

get_decoder()[source]#

Returns the decoder part of the model’s graph definition.

get_embedding()[source]#

Returns the embedding layer of the module.

get_encoder()[source]#

Returns the encoder part of the model’s graph definition. The vision tower acts as the encoder in this multi-modal setup.

get_image_features(pixel_values: Union[Array, ndarray, bool, number], image_sizes: Union[Array, ndarray, bool, number]) Union[Array, ndarray, bool, number][source]#
get_lm_head()[source]#

Returns the language model head of the module. Base models do not have a language model head.

init_cache(batch_size: int, max_length: int, starts: int | None = None, shardings: dict | None = None, pad_token_id: int | None = None)[source]#

Initializes and returns a standard (non-paged) Key-Value cache.

This method first creates the necessary metadata using create_cache_metadata and then calls TransformerCache.init_cache to allocate and initialize the cache tensors based on the model’s configuration, dtype, sharding, quantization settings, and provided batch size and maximum length.

Parameters
  • batch_size (int) – The batch size for the cache.

  • max_length (int) – The maximum sequence length the cache needs to support.

  • starts (int | None) – Optional starting positions for the cache sequences. If provided, influences the initial state. Defaults to None (usually 0).

  • shardings (dict | None) – Optional dictionary specifying sharding configurations. (Note: This argument appears unused in the current implementation shown).

  • pad_token_id (int | None) – The ID of the padding token. If None, it’s inferred.

Returns

An initialized standard TransformerCache object.

Return type

TransformerCache
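
To make the allocation step concrete, here is a minimal NumPy sketch of a non-paged key-value cache sized by batch_size and max_length. It is not TransformerCache: the per-layer dictionary layout, the (batch, max_length, kv_heads, head_dim) axis order, and the num_layers, num_kv_heads and head_dim arguments are all assumptions chosen for illustration.

```python
import numpy as np


def sketch_init_cache(batch_size, max_length, num_layers, num_kv_heads, head_dim, dtype=np.float32):
    """Hypothetical cache allocation: one zero-filled key/value buffer pair per layer."""
    shape = (batch_size, max_length, num_kv_heads, head_dim)
    return [
        {
            "key": np.zeros(shape, dtype=dtype),
            "value": np.zeros(shape, dtype=dtype),
            "index": 0,  # next write position along the max_length axis
        }
        for _ in range(num_layers)
    ]


cache = sketch_init_cache(batch_size=2, max_length=8, num_layers=3, num_kv_heads=4, head_dim=16)
```

Allocating the full max_length up front keeps buffer shapes static across decoding steps, which is generally what JIT-compiled generation loops need.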

prepare_inputs_for_generation(input_ids: Int[Array, 'batch seq_len'], max_length: int, pad_token_id: int, starts: int | None = None, pixel_values: Optional[Union[Array, ndarray, bool, number]] = None, attention_mask: jaxtyping.Bool[Array, 'batch seq_len'] | None = None)[source]#

Sets up the initial inputs required for starting autoregressive generation.

This function initializes the Key-Value cache (past_key_values) using init_cache, calculates the initial position_ids based on the input attention_mask (or assumes a contiguous range if no mask is provided), and prepares an extended attention_mask suitable for caching. It ensures inputs are placed on the correct devices/shards.

Parameters
  • input_ids (chex.Array) – The initial sequence of token IDs. Shape (batch_size, seq_length).

  • max_length (int) – The maximum sequence length that the KV cache should support.

  • pad_token_id (int) – The ID used for padding tokens. Used to calculate starts if not provided.

  • starts (int | None) – Optional pre-calculated starting positions (number of leading pads). If None, calculated using compute_prefill_length.

  • shardings (dict | None) – Optional sharding configuration passed to init_cache.

  • attention_mask (tp.Optional[chex.Array]) – An optional mask indicating which tokens should be attended to. Shape (batch_size, seq_length).

  • token_type_ids (tp.Optional[chex.Array]) – Optional segment IDs for models that use them.

Returns

A dictionary containing the prepared inputs, typically including:
  • "past_key_values": The initialized KV cache.

  • "attention_mask": The extended attention mask for generation.

  • "position_ids": The calculated initial position IDs.

  • "token_type_ids": (Optional) Prepared token type IDs.

This dictionary is then passed through prepare_inputs_for_call.

Return type

dict
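
The position_ids and extended attention_mask computations described above can be sketched in NumPy as follows. This is an illustration under stated assumptions, not the EasyDeL implementation: the function name is hypothetical, cache initialization and device placement are omitted, padded slots are assigned position 0, and the extended mask marks slots beyond the prompt as attendable (relying on causal masking elsewhere), which is one common convention.

```python
import numpy as np


def sketch_prepare_inputs(input_ids, max_length, attention_mask=None):
    """Hypothetical helper: derive position_ids and a max_length-wide mask."""
    batch_size, seq_len = input_ids.shape
    # Start from an all-True mask covering the whole cache length.
    extended_mask = np.ones((batch_size, max_length), dtype=bool)
    if attention_mask is None:
        # No mask given: assume a contiguous range 0..seq_len-1 for every row.
        position_ids = np.broadcast_to(np.arange(seq_len), (batch_size, seq_len))
    else:
        mask = attention_mask.astype(np.int32)
        # A real token's position is the count of real tokens before it.
        position_ids = np.where(mask > 0, np.cumsum(mask, axis=-1) - 1, 0)
        # Copy the prompt mask into the first seq_len slots.
        extended_mask[:, :seq_len] = mask.astype(bool)
    return {"attention_mask": extended_mask, "position_ids": position_ids}


input_ids = np.array([[0, 0, 5, 6], [7, 8, 9, 10]])  # first row is left-padded
attention_mask = np.array([[0, 0, 1, 1], [1, 1, 1, 1]])
prepared = sketch_prepare_inputs(input_ids, max_length=6, attention_mask=attention_mask)
unmasked = sketch_prepare_inputs(input_ids, max_length=6)
```

With left padding, the first row's real tokens receive positions 0 and 1 rather than 2 and 3, so rotary embeddings see the same positions they would for an unpadded prompt.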

update_inputs_for_generation(model_outputs, model_kwargs)[source]#

Updates the keyword arguments for the next generation step.

Specifically, it takes the past_key_values from the model_outputs of the current step and updates the model_kwargs with them. It also increments the position_ids by one for the next token prediction.

Parameters
  • model_outputs – The output object from the model’s forward pass in the previous step (should contain a past_key_values attribute).

  • model_kwargs (dict) – The dictionary of keyword arguments used for the model call. This dictionary will be modified in-place or a new one returned.

Returns

The updated model_kwargs dictionary ready for the next generation step.

Return type

dict
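
A minimal sketch of this update step is shown below. It assumes that model_kwargs holds a position_ids array of shape (batch, seq_len) and that the next step needs only the last position plus one; it also returns a new dictionary rather than mutating the input. These choices, the function name, and the SimpleNamespace stand-in for the model output are illustrative assumptions.

```python
from types import SimpleNamespace

import numpy as np


def sketch_update_inputs(model_outputs, model_kwargs):
    """Hypothetical per-step update: carry the cache forward, advance positions."""
    updated = dict(model_kwargs)  # shallow copy so the caller's dict is unchanged
    updated["past_key_values"] = model_outputs.past_key_values
    # Keep only the last position and advance it by one for the next token.
    updated["position_ids"] = model_kwargs["position_ids"][:, -1:] + 1
    return updated


step_outputs = SimpleNamespace(past_key_values="cache-after-step")  # placeholder cache
model_kwargs = {"past_key_values": "cache-before-step", "position_ids": np.array([[0, 1, 2]])}
next_kwargs = sketch_update_inputs(step_outputs, model_kwargs)
```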

class easydel.modules.mistral3.modeling_mistral3.Mistral3ModelOutput(last_hidden_state: Union[Array, ndarray, bool, number] = None, hidden_states: tuple[Union[jax.Array, numpy.ndarray, numpy.bool, numpy.number]] | None = None, attentions: tuple[Union[jax.Array, numpy.ndarray, numpy.bool, numpy.number]] | None = None, past_key_values: dict[str, Union[jax.Array, numpy.ndarray, numpy.bool, numpy.number]] | None = None, loss: Optional[Union[Array, ndarray, bool, number]] = None, image_hidden_states: jaxtyping.Float[Array, 'batch seq_len hidden_dim'] | None = None)[source]#

Bases: BaseModelOutput

Model output carrying text hidden states and optional projected image embeddings.

classmethod from_dict(data: dict[str, Any]) T#

Deserializes a dictionary into a PyTree object.

classmethod from_json(json_str: str) T#

Deserializes a JSON string into a PyTree object.

image_hidden_states: jaxtyping.Float[Array, 'batch seq_len hidden_dim'] | None = None#
replace(**kwargs)#

Creates a new instance with specified fields replaced.

to_dict() dict[str, Any]#

Serializes the PyTree object to a dictionary.

to_json(**kwargs) str#

Serializes the PyTree object to a JSON string.

class easydel.modules.mistral3.modeling_mistral3.Mistral3MultiModalProjector(*args: Any, **kwargs: Any)[source]#

Bases: Module

Projects vision tower features into the language model embedding space.

class easydel.modules.mistral3.modeling_mistral3.Mistral3PatchMerger(*args: Any, **kwargs: Any)[source]#

Bases: Module

Spatially merges neighboring vision patches before projecting into text space.

forward(image_features: Array, image_sizes: Array) Array[source]#
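
The spatial merging idea can be illustrated with a small NumPy sketch that groups each 2x2 block of neighboring patches into a single, wider feature vector. This covers only the regrouping step for one image. The function name, the merge_size default of 2, and the row-major patch grid are assumptions; any projection the real module applies afterwards is not shown.

```python
import numpy as np


def sketch_merge_patches(patches, grid_h, grid_w, merge_size=2):
    """Hypothetical merge: concatenate each merge_size x merge_size patch block."""
    hidden = patches.shape[-1]
    rows, cols = grid_h // merge_size, grid_w // merge_size
    # (grid_h * grid_w, hidden) -> (rows, merge, cols, merge, hidden)
    blocks = patches.reshape(rows, merge_size, cols, merge_size, hidden)
    # Bring the two block-index axes together: (rows, cols, merge, merge, hidden)
    blocks = blocks.transpose(0, 2, 1, 3, 4)
    # Flatten each block into one vector of size merge * merge * hidden.
    return blocks.reshape(rows * cols, merge_size * merge_size * hidden)


# A 4x4 grid of patches with hidden size 1; values are the row-major patch indices.
patches = np.arange(16).reshape(16, 1)
merged_patches = sketch_merge_patches(patches, grid_h=4, grid_w=4)
```

Each output row holds the four patches of one 2x2 block, so the sequence length shrinks by a factor of merge_size squared while the feature width grows by the same factor.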