easydel.modules.qwen2_vl.modeling_qwen2_vl


class easydel.modules.qwen2_vl.modeling_qwen2_vl.Qwen2VLAttention(*args: Any, **kwargs: Any)[source]#

Bases: UnifiedAttention

Causal self-attention used in the Qwen2-VL language decoder.

class easydel.modules.qwen2_vl.modeling_qwen2_vl.Qwen2VLCausalLMOutputWithPast(loss: Optional[Union[Array, ndarray, bool, number]] = None, logits: Union[Array, ndarray, bool, number] = None, past_key_values: list[Union[jax.Array, numpy.ndarray, numpy.bool, numpy.number]] | None = None, hidden_states: tuple[Union[jax.Array, numpy.ndarray, numpy.bool, numpy.number]] | None = None, attentions: tuple[Union[jax.Array, numpy.ndarray, numpy.bool, numpy.number]] | None = None, rope_deltas: Optional[Union[Array, ndarray, bool, number]] = None)[source]#

Bases: ModelOutput

Base class for Qwen2VL causal language model (or autoregressive) outputs.

attentions: tuple[Union[jax.Array, numpy.ndarray, numpy.bool, numpy.number]] | None = None#
classmethod from_dict(data: dict[str, Any]) T#

Deserializes a dictionary into a PyTree object.

classmethod from_json(json_str: str) T#

Deserializes a JSON string into a PyTree object.

hidden_states: tuple[Union[jax.Array, numpy.ndarray, numpy.bool, numpy.number]] | None = None#
logits: Union[Array, ndarray, bool, number] = None#
loss: Optional[Union[Array, ndarray, bool, number]] = None#
past_key_values: list[Union[jax.Array, numpy.ndarray, numpy.bool, numpy.number]] | None = None#
replace(**kwargs)#

Creates a new instance with specified fields replaced.

rope_deltas: Optional[Union[Array, ndarray, bool, number]] = None#
to_dict() dict[str, Any]#

Serializes the PyTree object to a dictionary.

to_json(**kwargs) str#

Serializes the PyTree object to a JSON string.

class easydel.modules.qwen2_vl.modeling_qwen2_vl.Qwen2VLDecoderLayer(*args: Any, **kwargs: Any)[source]#

Bases: Module

Transformer decoder layer coupling Qwen2-VL attention and feed-forward modules.

class easydel.modules.qwen2_vl.modeling_qwen2_vl.Qwen2VLForConditionalGeneration(*args: Any, **kwargs: Any)[source]#

Bases: BaseVisionLanguageModule[Qwen2VLModel, Qwen2VLConfig]

Multimodal Qwen2-VL model for conditional generation from images/video and text.

Inherits from BaseVisionLanguageModule to leverage common VLM infrastructure.

config#

Configuration object.

Type

Qwen2VLConfig

dtype#

Data type for computation.

Type

jnp.dtype

param_dtype#

Data type for parameters.

Type

jnp.dtype

precision#

JAX precision level.

Type

jax.lax.PrecisionLike

rngs#

Random number generators.

Type

nn.Rngs

Class Attributes:

  • _task_type – IMAGE_TEXT_TO_TEXT task type

  • _model_type – “qwen2_vl” model identifier

  • _supports_video – True (Qwen2-VL supports video input)

  • _uses_mrope – True (uses multi-dimensional RoPE)

apply_lm_head(hidden_states: Array) Array[source]#

Apply the language modeling head.

get_image_features(pixel_values: Float[Array, 'batch channels height width'], image_grid_thw: tuple | None = None, **kwargs) Float[Array, 'batch num_patches hidden'][source]#

Extract and project image features.

Parameters
  • pixel_values – Input image pixel values

  • image_grid_thw – Image grid shape (temporal=1, height, width)

  • **kwargs – Additional arguments

Returns

Projected image features

get_input_embeddings()[source]#
get_language_model() Module[source]#

Returns the language model component.

get_static_arguments()[source]#

Returns a tuple of static arguments required by the module’s __call__ method.

Static arguments are those that don’t change across calls and can be potentially cached or handled differently by JIT compilation. This base implementation returns an empty tuple. Subclasses should override this if they have static arguments.

Returns

A tuple containing static arguments.

Return type

tp.Tuple

get_video_features(pixel_values_videos: Float[Array, 'batch temporal channels height width'], video_grid_thw: tuple | None = None, **kwargs) Float[Array, 'batch num_tokens hidden'][source]#

Extract and project video features.

Parameters
  • pixel_values_videos – Input video pixel values

  • video_grid_thw – Video grid shape (temporal, height, width)

  • **kwargs – Additional arguments

Returns

Projected video features

get_vision_tower() Module[source]#

Returns the vision tower component.

loss_type = 'ForCausalLM'#
prepare_inputs_for_call(image_grid_thw: Optional[Union[Array, ndarray, bool, number]] = None, video_grid_thw: Optional[Union[Array, ndarray, bool, number]] = None, drop_ids: bool = True, **others)[source]#

Prepare inputs with mRoPE position IDs computed from grid shapes.

prepare_inputs_for_generation(input_ids, max_length: int, pad_token_id: int, starts: int | None = None, past_key_values=None, attention_mask=None, mask_info=None, inputs_embeds=None, position_ids=None, pixel_values=None, pixel_values_videos=None, image_grid_thw=None, video_grid_thw=None, **kwargs)[source]#

Prepare inputs for generation, including vision inputs.

Parameters
  • input_ids – Input token IDs

  • max_length – Maximum generation length

  • pad_token_id – Padding token ID

  • starts – Starting positions

  • pixel_values – Image pixel values

  • attention_mask – Attention mask

  • **kwargs – Additional kwargs

Returns

Dictionary of prepared inputs

set_input_embeddings(value)[source]#
update_inputs_for_generation(model_outputs, model_kwargs)[source]#

Update inputs for next generation step, removing vision inputs.

Vision inputs are only used on the first generation step, so they are removed for subsequent steps.

Parameters
  • model_outputs – Outputs from the model

  • model_kwargs – Current model kwargs

Returns

Updated model kwargs with vision inputs removed
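
The idea of dropping vision inputs after the first step can be sketched as a plain dictionary filter. This is a minimal illustration, not the EasyDeL implementation: the helper name and the VISION_KEYS tuple are hypothetical, with key names taken from the prepare_inputs_for_generation signature. Which keys the real method removes is an assumption.

```python
# Hypothetical list of vision-related keys (assumption, based on the
# prepare_inputs_for_generation signature documented above).
VISION_KEYS = ("pixel_values", "pixel_values_videos", "image_grid_thw", "video_grid_thw")


def drop_vision_inputs_sketch(model_kwargs):
    """Return a copy of model_kwargs without the vision-related entries."""
    return {k: v for k, v in model_kwargs.items() if k not in VISION_KEYS}


first_step_kwargs = {"attention_mask": [1, 1, 1], "pixel_values": "image-array", "image_grid_thw": (1, 2, 2)}
next_step_kwargs = drop_vision_inputs_sketch(first_step_kwargs)
```

After the first step the image content is already reflected in the cached keys and values, so later steps only need the text-side arguments.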

property visual#

Property to access the vision transformer for backward compatibility.

class easydel.modules.qwen2_vl.modeling_qwen2_vl.Qwen2VLMLP(*args: Any, **kwargs: Any)[source]#

Bases: Module

Feed-forward network used in the Qwen2-VL language decoder.

class easydel.modules.qwen2_vl.modeling_qwen2_vl.Qwen2VLModel(*args: Any, **kwargs: Any)[source]#

Bases: EasyDeLBaseModule

The Qwen2-VL model which consists of a vision encoder and a language model.

get_image_features(pixel_values: Union[Array, ndarray, bool, number], image_grid_thw: Optional[Union[Array, ndarray, bool, number]] = None)[source]#
get_input_embeddings()[source]#
get_rope_index(input_ids: Union[Array, ndarray, bool, number] = None, image_grid_thw: Union[Array, ndarray, bool, number] = None, video_grid_thw: Union[Array, ndarray, bool, number] = None, attention_mask: Union[Array, ndarray, bool, number] = None)[source]#
get_video_features(pixel_values_videos: Union[Array, ndarray, bool, number], video_grid_thw: Optional[Union[Array, ndarray, bool, number]] = None)[source]#
set_input_embeddings(value)[source]#
class easydel.modules.qwen2_vl.modeling_qwen2_vl.Qwen2VLPatchEmbed(*args: Any, **kwargs: Any)[source]#

Bases: Module

Convert images or video frames into patch embeddings for Qwen2-VL.

class easydel.modules.qwen2_vl.modeling_qwen2_vl.Qwen2VLPatchMerger(*args: Any, **kwargs: Any)[source]#

Bases: Module

Merge neighboring spatial patches to downsample visual tokens.

class easydel.modules.qwen2_vl.modeling_qwen2_vl.Qwen2VLTextModel(*args: Any, **kwargs: Any)[source]#

Bases: EasyDeLBaseModule

Language decoder stack for Qwen2-VL that consumes projected vision tokens.

get_decoder()[source]#

Returns the decoder part of the model’s graph definition.

get_embedding()[source]#

Returns the embedding layer of the module.

get_encoder()[source]#

Returns the encoder part of the model’s graph definition. Decoder-only models do not have an encoder.

get_input_embeddings()[source]#

Returns the input embedding layer of the module.

get_lm_head()[source]#

Returns the language model head of the module. Base models do not have a language model head.

set_input_embeddings(value)[source]#

Sets the input embedding layer of the module.

class easydel.modules.qwen2_vl.modeling_qwen2_vl.Qwen2VLVisionAttention(*args: Any, **kwargs: Any)[source]#

Bases: UnifiedAttention

Self-attention layer for vision patches with rotary position encoding.

define_network(config, dtype: dtype, param_dtype: dtype, precision: Union[None, str, Precision, tuple[str, str], tuple[jax._src.lax.lax.Precision, jax._src.lax.lax.Precision], DotAlgorithm, DotAlgorithmPreset], rngs: Rngs) None[source]#

Define network structure.

Override this to customize projection structure (e.g., fused QKV). Default creates separate Q/K/V/O projections.

Parameters
  • config – Model configuration

  • dtype – Data type for computations

  • param_dtype – Data type for parameters

  • precision – JAX precision setting

  • rngs – Random number generators

class easydel.modules.qwen2_vl.modeling_qwen2_vl.Qwen2VLVisionBlock(*args: Any, **kwargs: Any)[source]#

Bases: Module

Vision transformer block combining attention and MLP with pre-normalization.

class easydel.modules.qwen2_vl.modeling_qwen2_vl.Qwen2VLVisionMLP(*args: Any, **kwargs: Any)[source]#

Bases: Module

Feed-forward module for the Qwen2-VL vision encoder.

class easydel.modules.qwen2_vl.modeling_qwen2_vl.Qwen2VLVisionTransformer(*args: Any, **kwargs: Any)[source]#

Bases: EasyDeLBaseModule

Vision transformer encoder used to extract image features for Qwen2-VL.

config_class#

alias of Qwen2VLVisionConfig

get_decoder()[source]#

Returns the decoder part of the model’s graph definition. This is an encoder-only model and does not have a decoder.

get_dtype() dtype[source]#
get_embedding()[source]#

Returns the embedding layer of the module. In this case, it’s the patch embedding layer.

get_encoder()[source]#

Returns the encoder part of the model’s graph definition. This vision model acts as the encoder.

get_lm_head()[source]#

Returns the language model head of the module. This vision model does not have a language model head.

rot_pos_emb(grid_thw, max_grid_size)[source]#
easydel.modules.qwen2_vl.modeling_qwen2_vl.apply_rotary_pos_emb_vision(array: Union[Array, ndarray, bool, number], freqs: Union[Array, ndarray, bool, number]) Union[Array, ndarray, bool, number][source]#

Apply rotary positional embedding to vision features.
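
A minimal sketch of the usual rotary formulation, assuming array has shape (seq, heads, head_dim) and freqs holds rotation angles of shape (seq, head_dim // 2). These shapes and the helper name are assumptions for illustration; the library function may lay out its inputs differently.

```python
import jax.numpy as jnp


def apply_rotary_sketch(array, freqs):
    # Duplicate the angles so they cover both halves of the last axis,
    # then add a broadcast axis for the heads dimension.
    cos = jnp.concatenate([jnp.cos(freqs), jnp.cos(freqs)], axis=-1)[:, None, :]
    sin = jnp.concatenate([jnp.sin(freqs), jnp.sin(freqs)], axis=-1)[:, None, :]
    # "Rotate half": (x1, x2) -> (-x2, x1) along the last axis.
    half = array.shape[-1] // 2
    rotated = jnp.concatenate([-array[..., half:], array[..., :half]], axis=-1)
    return array * cos + rotated * sin


x = jnp.array([[[1.0, 2.0, 3.0, 4.0]]])      # (seq=1, heads=1, head_dim=4)
unchanged = apply_rotary_sketch(x, jnp.zeros((1, 2)))          # zero angle: identity
quarter_turn = apply_rotary_sketch(x, jnp.full((1, 2), jnp.pi / 2))
```

With zero angles the input comes back unchanged, and the rotation preserves the norm of each feature vector.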

easydel.modules.qwen2_vl.modeling_qwen2_vl.create_attention_mask(cu_seqlens, seq_length, dtype)[source]#

Creates an attention mask matrix.

Parameters
  • cu_seqlens – Cumulative sequence lengths.

  • seq_length – Length of each sequence.

  • dtype – Data type of the mask.

Returns

Attention mask matrix.
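
As an illustration, a block-diagonal mask can be built from cumulative sequence lengths as follows. This sketch makes two assumptions that the source does not confirm: seq_length is treated as the total packed length, and the mask is additive (0 where attention is allowed, the most negative value of dtype elsewhere). The helper name is hypothetical.

```python
import jax.numpy as jnp


def block_diagonal_mask_sketch(cu_seqlens, seq_length, dtype=jnp.float32):
    positions = jnp.arange(seq_length)
    # Segment id of each position: number of boundaries at or before it, minus one.
    segment_ids = jnp.searchsorted(cu_seqlens, positions, side="right") - 1
    # Tokens may attend to each other only inside the same segment.
    same_segment = segment_ids[:, None] == segment_ids[None, :]
    return jnp.where(same_segment, 0.0, jnp.finfo(dtype).min).astype(dtype)


# Two packed sequences: positions 0-1 and positions 2-4.
cu_seqlens = jnp.array([0, 2, 5])
mask = block_diagonal_mask_sketch(cu_seqlens, seq_length=5)
```

Each packed image or video then attends only to its own patches, even though all patches share one flattened sequence.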

easydel.modules.qwen2_vl.modeling_qwen2_vl.get_rope_index(input_ids: ndarray, image_grid_thw: numpy.ndarray | None = None, video_grid_thw: numpy.ndarray | None = None, attention_mask: numpy.ndarray | None = None, spatial_merge_size: int = 1, image_token_id: int = -1, video_token_id: int = -1, vision_start_token_id: int = -1, tokens_per_second: float = 1.0, second_per_grid_ts: list[float] | None = None, context_len: int = 0, seq_len: int | None = None) tuple[numpy.ndarray, numpy.ndarray][source]#

Calculate the 3D RoPE position indices from the temporal, height, and width dimensions of the images and videos in the LLM input.

Parameters
  • input_ids (np.ndarray of shape (batch_size, sequence_length)) – Indices of input sequence tokens in the vocabulary. Padding is ignored by default if provided.

  • image_grid_thw (np.ndarray of shape (num_images, 3), optional) – The temporal, height, and width dimensions of the feature grid of each image in the LLM.

  • video_grid_thw (np.ndarray of shape (num_videos, 3), optional) – The temporal, height, and width dimensions of the feature grid of each video in the LLM.

  • attention_mask (np.ndarray of shape (batch_size, sequence_length), optional) – Mask to avoid performing attention on padding token indices. Mask values are 0 or 1: 1 for tokens that are not masked, 0 for tokens that are masked.

  • spatial_merge_size (int) – The spatial merge size for vision embeddings.

  • image_token_id (int) – The token ID representing an image.

  • video_token_id (int) – The token ID representing a video.

  • vision_start_token_id (int) – The token ID representing the start of a vision sequence.

  • tokens_per_second (float) – Temporal scaling applied to video tokens.

  • second_per_grid_ts (list[float] | None) – Per-video seconds per temporal grid step, if available.

  • context_len (int) – Length of any existing KV context to offset positions.

  • seq_len (int | None) – Target sequence length to slice positions to. Defaults to full length.

Returns

  • position_ids (np.ndarray of shape (3, batch_size, sequence_length))

  • mrope_position_deltas (np.ndarray of shape (batch_size))
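
To make the 3D indexing concrete, here is a simplified sketch for a single sequence containing some text, one image, and more text. It follows the commonly described Qwen2-VL scheme: text tokens share one index on all three axes, vision tokens get separate temporal, height, and width indices. The helper name and its arguments are hypothetical. It omits the batch axis, token-ID lookup, video timing, and context offsets that the documented function handles.

```python
import numpy as np


def mrope_positions_sketch(num_text_before, grid_thw, num_text_after, spatial_merge_size=1):
    t, h, w = grid_thw
    h, w = h // spatial_merge_size, w // spatial_merge_size
    # Leading text: the same index on the temporal, height, and width axes.
    before = np.tile(np.arange(num_text_before), (3, 1))
    # Vision tokens: one index per axis, offset by the preceding text length.
    t_idx = np.repeat(np.arange(t), h * w)
    h_idx = np.tile(np.repeat(np.arange(h), w), t)
    w_idx = np.tile(np.arange(w), t * h)
    vision = np.stack([t_idx, h_idx, w_idx]) + num_text_before
    # Trailing text resumes after the largest vision index.
    start = vision.max() + 1 if vision.size else num_text_before
    after = np.tile(np.arange(num_text_after) + start, (3, 1))
    positions = np.concatenate([before, vision, after], axis=1)
    # Delta: next position to use, relative to the raw sequence length.
    delta = positions.max() + 1 - positions.shape[1]
    return positions, delta


# 2 text tokens, one image with a 1 x 2 x 2 grid, 1 trailing text token.
positions, delta = mrope_positions_sketch(2, (1, 2, 2), 1)
# positions[0] (temporal): [0, 1, 2, 2, 2, 2, 4]
# positions[1] (height):   [0, 1, 2, 2, 3, 3, 4]
# positions[2] (width):    [0, 1, 2, 3, 2, 3, 4]
# delta: -2
```

The negative delta reflects that four image tokens advance the position counter by only two, so decoding continues from position 5 rather than 7.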

easydel.modules.qwen2_vl.modeling_qwen2_vl.merge_multimodal_embeddings(input_ids: Array, inputs_embeds: Array, multimodal_embeddings: Array, placeholder_token_id: int | list[int]) Array[source]#

Overwrite inputs_embeds wherever input_ids matches placeholder tokens.
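
One way to express this overwrite without dynamic shapes is a cumulative-sum gather followed by a select. The sketch below is an illustration under assumptions, not the library code: the helper name is hypothetical, and it assumes multimodal_embeddings has shape (num_placeholders, hidden) with rows ordered as the placeholders appear in the flattened input_ids.

```python
import jax.numpy as jnp


def merge_embeddings_sketch(input_ids, inputs_embeds, multimodal_embeddings, placeholder_token_id):
    # Accept a single ID or a list of IDs.
    ids = jnp.atleast_1d(jnp.asarray(placeholder_token_id))
    flat_mask = jnp.isin(input_ids, ids).reshape(-1)
    # The k-th placeholder (in order) reads row k of multimodal_embeddings.
    gather_idx = jnp.cumsum(flat_mask.astype(jnp.int32)) - 1
    gather_idx = jnp.clip(gather_idx, 0, multimodal_embeddings.shape[0] - 1)
    gathered = multimodal_embeddings[gather_idx]
    flat_embeds = inputs_embeds.reshape(-1, inputs_embeds.shape[-1])
    # Keep text embeddings where the mask is False, vision rows where True.
    merged = jnp.where(flat_mask[:, None], gathered, flat_embeds)
    return merged.reshape(inputs_embeds.shape)


input_ids = jnp.array([[5, 99, 99, 7]])           # 99 is an illustrative placeholder ID
inputs_embeds = jnp.zeros((1, 4, 2))
vision_rows = jnp.array([[1.0, 1.0], [2.0, 2.0]])
merged = merge_embeddings_sketch(input_ids, inputs_embeds, vision_rows, 99)
```

Because every intermediate array has a static shape, this formulation works under jit, where boolean-mask indexing would not.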

easydel.modules.qwen2_vl.modeling_qwen2_vl.precompute_vl_rotary(dim, theta, max_position)[source]#

Precompute rotary angle matrix for the vision-language attention stack.
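
A minimal sketch of a standard rotary angle table, assuming the conventional inverse-frequency formula theta ** (-2i / dim). The helper name is hypothetical, and the exact layout returned by the library function is not confirmed by this page.

```python
import jax.numpy as jnp


def precompute_rotary_angles_sketch(dim, theta, max_position):
    # One inverse frequency per pair of channels: theta ** (-2i / dim).
    inv_freq = 1.0 / (theta ** (jnp.arange(0, dim, 2, dtype=jnp.float32) / dim))
    positions = jnp.arange(max_position, dtype=jnp.float32)
    # angles[p, i] = p * inv_freq[i], shape (max_position, dim // 2).
    return jnp.outer(positions, inv_freq)


angles = precompute_rotary_angles_sketch(dim=8, theta=10000.0, max_position=16)
```

Taking the cosine and sine of this table gives the factors used when rotating query and key features.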

easydel.modules.qwen2_vl.modeling_qwen2_vl.rotate_half(x)[source]#

Rotates half the hidden dims of the input.
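
In the usual rotary-embedding convention, this means splitting the last axis into two halves (x1, x2) and returning (-x2, x1). A minimal sketch under that assumption, with a hypothetical helper name:

```python
import jax.numpy as jnp


def rotate_half_sketch(x):
    # Split the last axis into two equal halves and swap them with a sign flip.
    half = x.shape[-1] // 2
    x1, x2 = x[..., :half], x[..., half:]
    return jnp.concatenate([-x2, x1], axis=-1)


x = jnp.array([1.0, 2.0, 3.0, 4.0])
rotated = rotate_half_sketch(x)  # [-3., -4., 1., 2.]
```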