easydel.modules.qwen2_vl.modeling_qwen2_vl#
- class easydel.modules.qwen2_vl.modeling_qwen2_vl.Qwen2VLAttention(*args: Any, **kwargs: Any)[source]#
Bases: UnifiedAttention
Causal self-attention used in the Qwen2-VL language decoder.
- class easydel.modules.qwen2_vl.modeling_qwen2_vl.Qwen2VLCausalLMOutputWithPast(loss: Optional[Union[Array, ndarray, bool, number]] = None, logits: Union[Array, ndarray, bool, number] = None, past_key_values: list[Union[jax.Array, numpy.ndarray, numpy.bool, numpy.number]] | None = None, hidden_states: tuple[Union[jax.Array, numpy.ndarray, numpy.bool, numpy.number]] | None = None, attentions: tuple[Union[jax.Array, numpy.ndarray, numpy.bool, numpy.number]] | None = None, rope_deltas: Optional[Union[Array, ndarray, bool, number]] = None)[source]#
Bases: ModelOutput
Base class for Qwen2VL causal language model (or autoregressive) outputs.
- attentions: tuple[Union[jax.Array, numpy.ndarray, numpy.bool, numpy.number]] | None = None#
- classmethod from_dict(data: dict[str, Any]) T#
Deserializes a dictionary into a PyTree object.
- classmethod from_json(json_str: str) T#
Deserializes a JSON string into a PyTree object.
- past_key_values: list[Union[jax.Array, numpy.ndarray, numpy.bool, numpy.number]] | None = None#
- replace(**kwargs)#
Creates a new instance with specified fields replaced.
- to_dict() dict[str, Any]#
Serializes the PyTree object to a dictionary.
- to_json(**kwargs) str#
Serializes the PyTree object to a JSON string.
- class easydel.modules.qwen2_vl.modeling_qwen2_vl.Qwen2VLDecoderLayer(*args: Any, **kwargs: Any)[source]#
Bases: Module
Transformer decoder layer coupling Qwen2-VL attention and feed-forward modules.
- class easydel.modules.qwen2_vl.modeling_qwen2_vl.Qwen2VLForConditionalGeneration(*args: Any, **kwargs: Any)[source]#
Bases: BaseVisionLanguageModule[Qwen2VLModel, Qwen2VLConfig]
Multimodal Qwen2-VL model for conditional generation from images/video and text.
Inherits from BaseVisionLanguageModule to leverage common VLM infrastructure.
- config#
Configuration object.
- Type
Qwen2VLConfig
- dtype#
Data type for computation.
- Type
jnp.dtype
- param_dtype#
Data type for parameters.
- Type
jnp.dtype
- precision#
JAX precision level.
- Type
jax.lax.PrecisionLike
- rngs#
Random number generators.
- Type
nn.Rngs
- Class Attributes:
- _task_type: IMAGE_TEXT_TO_TEXT task type
- _model_type: “qwen2_vl” model identifier
- _supports_video: True (Qwen2-VL supports video input)
- _uses_mrope: True (uses multi-dimensional RoPE)
- get_image_features(pixel_values: Float[Array, 'batch channels height width'], image_grid_thw: tuple | None = None, **kwargs) Float[Array, 'batch num_patches hidden'][source]#
Extract and project image features.
- Parameters
pixel_values – Input image pixel values
image_grid_thw – Image grid shape (temporal=1, height, width)
**kwargs – Additional arguments
- Returns
Projected image features
- get_static_arguments()[source]#
Returns a tuple of static arguments required by the module’s __call__ method.
Static arguments are those that don’t change across calls and can be potentially cached or handled differently by JIT compilation. This base implementation returns an empty tuple. Subclasses should override this if they have static arguments.
- Returns
A tuple containing static arguments.
- Return type
tp.Tuple
- get_video_features(pixel_values_videos: Float[Array, 'batch temporal channels height width'], video_grid_thw: tuple | None = None, **kwargs) Float[Array, 'batch num_tokens hidden'][source]#
Extract and project video features.
- Parameters
pixel_values_videos – Input video pixel values
video_grid_thw – Video grid shape (temporal, height, width)
**kwargs – Additional arguments
- Returns
Projected video features
- loss_type = 'ForCausalLM'#
- prepare_inputs_for_call(image_grid_thw: Optional[Union[Array, ndarray, bool, number]] = None, video_grid_thw: Optional[Union[Array, ndarray, bool, number]] = None, drop_ids: bool = True, **others)[source]#
Prepare inputs with mRoPE position IDs computed from grid shapes.
- prepare_inputs_for_generation(input_ids, max_length: int, pad_token_id: int, starts: int | None = None, past_key_values=None, attention_mask=None, mask_info=None, inputs_embeds=None, position_ids=None, pixel_values=None, pixel_values_videos=None, image_grid_thw=None, video_grid_thw=None, **kwargs)[source]#
Prepare inputs for generation, including vision inputs.
- Parameters
input_ids – Input token IDs
max_length – Maximum generation length
pad_token_id – Padding token ID
starts – Starting positions
pixel_values – Image pixel values
attention_mask – Attention mask
**kwargs – Additional kwargs
- Returns
Dictionary of prepared inputs
- update_inputs_for_generation(model_outputs, model_kwargs)[source]#
Update inputs for next generation step, removing vision inputs.
Vision inputs are only used on the first generation step, so they are removed for subsequent steps.
- Parameters
model_outputs – Outputs from the model
model_kwargs – Current model kwargs
- Returns
Updated model kwargs with vision inputs removed
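The idea of dropping vision inputs after the first step can be sketched in plain Python. This is a minimal illustration, not EasyDeL's implementation: the helper name `drop_vision_inputs_sketch` is hypothetical, and the set of keys removed is an assumption based on the vision-related arguments that appear in the `prepare_inputs_for_generation` signature.

```python
# Assumed vision-related keys, taken from the prepare_inputs_for_generation signature.
VISION_KEYS = ("pixel_values", "pixel_values_videos", "image_grid_thw", "video_grid_thw")


def drop_vision_inputs_sketch(model_kwargs):
    """Return a copy of model_kwargs without the vision inputs (hypothetical helper)."""
    return {k: v for k, v in model_kwargs.items() if k not in VISION_KEYS}


step_kwargs = {"attention_mask": [1, 1, 1], "pixel_values": "image-tensor", "image_grid_thw": (1, 4, 4)}
next_kwargs = drop_vision_inputs_sketch(step_kwargs)
```

After the first decoding step the image or video content is presumably already reflected in the cached keys and values, so passing the pixel data again would only repeat work.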
- property visual#
Property to access the vision transformer for backward compatibility.
- class easydel.modules.qwen2_vl.modeling_qwen2_vl.Qwen2VLMLP(*args: Any, **kwargs: Any)[source]#
Bases: Module
Feed-forward network used in the Qwen2-VL language decoder.
- class easydel.modules.qwen2_vl.modeling_qwen2_vl.Qwen2VLModel(*args: Any, **kwargs: Any)[source]#
Bases: EasyDeLBaseModule
The Qwen2-VL model, which consists of a vision encoder and a language model.
- get_image_features(pixel_values: Union[Array, ndarray, bool, number], image_grid_thw: Optional[Union[Array, ndarray, bool, number]] = None)[source]#
- get_rope_index(input_ids: Union[Array, ndarray, bool, number] = None, image_grid_thw: Union[Array, ndarray, bool, number] = None, video_grid_thw: Union[Array, ndarray, bool, number] = None, attention_mask: Union[Array, ndarray, bool, number] = None)[source]#
- class easydel.modules.qwen2_vl.modeling_qwen2_vl.Qwen2VLPatchEmbed(*args: Any, **kwargs: Any)[source]#
Bases: Module
Convert images or video frames into patch embeddings for Qwen2-VL.
- class easydel.modules.qwen2_vl.modeling_qwen2_vl.Qwen2VLPatchMerger(*args: Any, **kwargs: Any)[source]#
Bases: Module
Merge neighboring spatial patches to downsample visual tokens.
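As a rough illustration of how merging neighboring patches reduces the token count, the sketch below groups consecutive patch tokens with a reshape. It is not the module's actual code: the function name is hypothetical, the token ordering is an assumption, and any normalization or projection the real module applies around this step is omitted.

```python
import jax.numpy as jnp


def merge_patches_sketch(tokens, spatial_merge_size=2):
    """Concatenate each group of spatial_merge_size**2 tokens along the hidden axis.

    Assumes tokens has shape (num_tokens, hidden) and that every run of
    spatial_merge_size**2 consecutive rows forms one spatial neighborhood.
    """
    group = spatial_merge_size ** 2
    num_tokens, hidden = tokens.shape
    return tokens.reshape(num_tokens // group, group * hidden)


tokens = jnp.ones((8, 3))
merged = merge_patches_sketch(tokens)  # 8 tokens of width 3 become 2 tokens of width 12
```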
- class easydel.modules.qwen2_vl.modeling_qwen2_vl.Qwen2VLTextModel(*args: Any, **kwargs: Any)[source]#
Bases: EasyDeLBaseModule
Language decoder stack for Qwen2-VL that consumes projected vision tokens.
- get_encoder()[source]#
Returns the encoder part of the model’s graph definition. Decoder-only models don’t have an encoder.
- class easydel.modules.qwen2_vl.modeling_qwen2_vl.Qwen2VLVisionAttention(*args: Any, **kwargs: Any)[source]#
Bases: UnifiedAttention
Self-attention layer for vision patches with rotary position encoding.
- define_network(config, dtype: dtype, param_dtype: dtype, precision: Union[None, str, Precision, tuple[str, str], tuple[jax._src.lax.lax.Precision, jax._src.lax.lax.Precision], DotAlgorithm, DotAlgorithmPreset], rngs: Rngs) None[source]#
Define network structure.
Override this to customize projection structure (e.g., fused QKV). Default creates separate Q/K/V/O projections.
- Parameters
config – Model configuration
dtype – Data type for computations
param_dtype – Data type for parameters
precision – JAX precision setting
rngs – Random number generators
- class easydel.modules.qwen2_vl.modeling_qwen2_vl.Qwen2VLVisionBlock(*args: Any, **kwargs: Any)[source]#
Bases: Module
Vision transformer block combining attention and MLP with pre-normalization.
- class easydel.modules.qwen2_vl.modeling_qwen2_vl.Qwen2VLVisionMLP(*args: Any, **kwargs: Any)[source]#
Bases: Module
Feed-forward module for the Qwen2-VL vision encoder.
- class easydel.modules.qwen2_vl.modeling_qwen2_vl.Qwen2VLVisionTransformer(*args: Any, **kwargs: Any)[source]#
Bases: EasyDeLBaseModule
Vision transformer encoder used to extract image features for Qwen2-VL.
- config_class#
alias of
Qwen2VLVisionConfig
- get_decoder()[source]#
Returns the decoder part of the model’s graph definition. This is an encoder-only model and does not have a decoder.
- get_embedding()[source]#
Returns the embedding layer of the module. In this case, it’s the patch embedding layer.
- get_encoder()[source]#
Returns the encoder part of the model’s graph definition. This vision model acts as the encoder.
- easydel.modules.qwen2_vl.modeling_qwen2_vl.apply_rotary_pos_emb_vision(array: Union[Array, ndarray, bool, number], freqs: Union[Array, ndarray, bool, number]) Union[Array, ndarray, bool, number][source]#
Apply rotary positional embedding to vision features.
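For readers unfamiliar with rotary embeddings, here is a minimal sketch of the common "rotate-half" formulation. The shapes are assumptions (`array` as `(seq, heads, head_dim)`, `freqs` as `(seq, head_dim // 2)`), the function names are hypothetical, and the library's own function may differ in layout or dtype handling.

```python
import jax.numpy as jnp


def rotate_half(x):
    # Split the last axis in two halves and map (x1, x2) -> (-x2, x1).
    x1, x2 = jnp.split(x, 2, axis=-1)
    return jnp.concatenate([-x2, x1], axis=-1)


def rotary_sketch(array, freqs):
    # Duplicate the angles so both halves of each feature pair share one angle,
    # then add a heads axis so the result broadcasts over (seq, heads, head_dim).
    cos = jnp.tile(jnp.cos(freqs), (1, 2))[:, None, :]
    sin = jnp.tile(jnp.sin(freqs), (1, 2))[:, None, :]
    return array * cos + rotate_half(array) * sin


x = jnp.arange(8, dtype=jnp.float32).reshape(2, 1, 4)
angles = jnp.array([[0.1, 0.2], [0.3, 0.4]])
rotated = rotary_sketch(x, angles)
```

Each feature pair is rotated by its angle, so the per-token vector norm is unchanged and zero angles leave the input untouched.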
- easydel.modules.qwen2_vl.modeling_qwen2_vl.create_attention_mask(cu_seqlens, seq_length, dtype)[source]#
Creates an attention mask matrix.
- Parameters
cu_seqlens – Cumulative sequence lengths.
seq_length – Length of each sequence.
dtype – Data type of the mask.
- Returns
Attention mask matrix.
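One way to build such a mask from cumulative sequence lengths is sketched below. It assumes an additive block-diagonal mask (0 inside a sequence, a large negative value across sequences); the source does not state whether the real function returns an additive or a boolean mask, and the function name here is hypothetical.

```python
import jax.numpy as jnp


def block_diagonal_mask_sketch(cu_seqlens, seq_length, dtype=jnp.float32):
    positions = jnp.arange(seq_length)
    # Segment index of each position: how many sequence ends lie at or before it.
    segment_ids = jnp.searchsorted(cu_seqlens[1:], positions, side="right")
    same_segment = segment_ids[:, None] == segment_ids[None, :]
    # Additive mask: 0 where attention is allowed, most negative value elsewhere.
    return jnp.where(same_segment, 0.0, jnp.finfo(dtype).min).astype(dtype)


# Two packed sequences of lengths 2 and 3 (boundaries at 0, 2, 5).
mask = block_diagonal_mask_sketch(jnp.array([0, 2, 5]), seq_length=5)
```

With these boundaries, positions 0-1 attend only to each other, and likewise positions 2-4.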
- easydel.modules.qwen2_vl.modeling_qwen2_vl.get_rope_index(input_ids: ndarray, image_grid_thw: numpy.ndarray | None = None, video_grid_thw: numpy.ndarray | None = None, attention_mask: numpy.ndarray | None = None, spatial_merge_size: int = 1, image_token_id: int = -1, video_token_id: int = -1, vision_start_token_id: int = -1, tokens_per_second: float = 1.0, second_per_grid_ts: list[float] | None = None, context_len: int = 0, seq_len: int | None = None) tuple[numpy.ndarray, numpy.ndarray][source]#
Calculate the 3D RoPE position index from the temporal, height, and width grid dimensions of each image and video as seen by the LLM.
- Parameters
input_ids (np.ndarray of shape (batch_size, sequence_length)) – Indices of input sequence tokens in the vocabulary. Padding will be ignored by default should you provide it.
image_grid_thw (np.ndarray of shape (num_images, 3), optional) – The temporal, height, and width dimensions of each image’s feature grid in the LLM.
video_grid_thw (np.ndarray of shape (num_videos, 3), optional) – The temporal, height, and width dimensions of each video’s feature grid in the LLM.
attention_mask (np.ndarray of shape (batch_size, sequence_length), optional) – Mask to avoid performing attention on padding token indices. Mask values are in [0, 1]:
- 1 for tokens that are not masked,
- 0 for tokens that are masked.
spatial_merge_size (int) – The spatial merge size for vision embeddings.
image_token_id (int) – The token ID representing an image.
video_token_id (int) – The token ID representing a video.
vision_start_token_id (int) – The token ID representing the start of a vision sequence.
tokens_per_second (float) – Temporal scaling applied to video tokens.
second_per_grid_ts (list[float] | None) – Per-video seconds per temporal grid step, if available.
context_len (int) – Length of any existing KV context to offset positions.
seq_len (int | None) – Target sequence length to slice positions to. Defaults to full length.
- Returns
position_ids (np.ndarray of shape (3, batch_size, sequence_length))
mrope_position_deltas (np.ndarray of shape (batch_size))
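To make the output shapes concrete, the sketch below covers only the text-only case, where no image or video grids are supplied. It follows a convention commonly used for multi-axis RoPE (identical positions on all three axes, derived from the attention mask); whether `get_rope_index` matches it exactly is an assumption, and the function name is hypothetical.

```python
import numpy as np


def text_only_rope_index_sketch(attention_mask):
    mask = attention_mask.astype(np.int64)
    # Running count of real tokens gives 0-based positions.
    positions = np.cumsum(mask, axis=-1) - 1
    # Padding slots get a dummy position of 1.
    positions = np.where(mask == 0, 1, positions)
    # Same positions on the temporal, height, and width axes.
    position_ids = np.broadcast_to(positions[None], (3, *positions.shape)).copy()
    # Offset between the next free position and the padded sequence length.
    deltas = positions.max(axis=-1) + 1 - mask.shape[-1]
    return position_ids, deltas


attention_mask = np.array([[1, 1, 1, 1], [1, 1, 1, 0]])
position_ids, deltas = text_only_rope_index_sketch(attention_mask)
```

When images or videos are present, the three axes diverge inside each vision span, which is what the grid arguments are for.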
- easydel.modules.qwen2_vl.modeling_qwen2_vl.merge_multimodal_embeddings(input_ids: Array, inputs_embeds: Array, multimodal_embeddings: Array, placeholder_token_id: int | list[int]) Array[source]#
Overwrite inputs_embeds wherever input_ids matches placeholder tokens.
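A minimal sketch of this kind of placeholder replacement is shown below. It is not EasyDeL's code: the function name is hypothetical, and it assumes `multimodal_embeddings` has shape `(num_placeholders, hidden)` with rows in the same order as the placeholder tokens appear in the flattened `input_ids`.

```python
import jax.numpy as jnp


def merge_embeddings_sketch(input_ids, inputs_embeds, multimodal_embeddings, placeholder_token_id):
    ids = jnp.atleast_1d(jnp.asarray(placeholder_token_id))
    flat_mask = jnp.isin(input_ids, ids).reshape(-1)
    # The k-th placeholder (in flattened order) reads the k-th multimodal row.
    gather_idx = jnp.clip(jnp.cumsum(flat_mask) - 1, 0, multimodal_embeddings.shape[0] - 1)
    candidates = multimodal_embeddings[gather_idx]
    flat_embeds = inputs_embeds.reshape(-1, inputs_embeds.shape[-1])
    # Keep text embeddings everywhere except at placeholder positions.
    merged = jnp.where(flat_mask[:, None], candidates, flat_embeds)
    return merged.reshape(inputs_embeds.shape)


input_ids = jnp.array([[1, 99, 99, 2]])
inputs_embeds = jnp.zeros((1, 4, 2))
vision_rows = jnp.array([[1.0, 1.0], [2.0, 2.0]])
merged = merge_embeddings_sketch(input_ids, inputs_embeds, vision_rows, placeholder_token_id=99)
```

The cumulative-sum gather keeps all shapes static, which is convenient under `jax.jit`; a boolean-index assignment would not trace.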