easydel.modules.qwen2_vl.qwen2_vl_configuration

class easydel.modules.qwen2_vl.qwen2_vl_configuration.Qwen2VLConfig(text_config: Optional[Union[Mapping[str, Any], Qwen2VLTextConfig]] = None, vision_config: Optional[Union[Mapping[str, Any], Qwen2VLVisionConfig]] = None, image_token_id: int = 151655, video_token_id: int = 151656, vision_start_token_id: int = 151652, vision_end_token_id: int = 151653, **kwargs)

Bases: EasyDeLBaseConfig

This is the configuration class that stores the configuration of a Qwen2VLModel. It is used to instantiate a Qwen2-VL model according to the specified arguments, which define the model architecture. Instantiating a configuration with the defaults yields a configuration similar to that of [Qwen/Qwen2-VL-7B-Instruct](https://huggingface.co/Qwen/Qwen2-VL-7B-Instruct).

Parameters
  • text_config (Union[Qwen2VLTextConfig, dict], optional) – The config for the text decoder.

  • vision_config (Union[Qwen2VLVisionConfig, dict], optional) – The config for the vision encoder.

  • image_token_id (int, optional, defaults to 151655) – The image token index to encode image prompts.

  • video_token_id (int, optional, defaults to 151656) – The video token index to encode video prompts.

  • vision_start_token_id (int, optional, defaults to 151652) – The token index to denote start of vision input.

  • vision_end_token_id (int, optional, defaults to 151653) – The token index to denote end of vision input.
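A minimal usage sketch, assuming EasyDeL is installed and that the constructor accepts plain dictionaries for the two sub-configs, as the signature above indicates. The override values are arbitrary small numbers chosen for illustration and do not correspond to any released checkpoint. The helper names `tiny_overrides` and `build_tiny_config` are hypothetical and not part of EasyDeL.

```python
from typing import Any


def tiny_overrides() -> dict[str, Any]:
    """Illustrative keyword arguments for a deliberately small Qwen2-VL config."""
    return {
        # Keys mirror parameters listed in the Qwen2VLTextConfig signature.
        "text_config": {
            "hidden_size": 256,
            "intermediate_size": 512,
            "num_hidden_layers": 2,
            "num_attention_heads": 4,
            "num_key_value_heads": 2,
        },
        # Keys mirror parameters listed in the Qwen2VLVisionConfig signature.
        "vision_config": {
            "depth": 2,
            "embed_dim": 128,
            "hidden_size": 256,
            "num_heads": 4,
        },
        # Token IDs kept at their documented defaults.
        "image_token_id": 151655,
        "video_token_id": 151656,
    }


def build_tiny_config():
    """Build the config; the import is deferred so EasyDeL is only needed here."""
    from easydel.modules.qwen2_vl.qwen2_vl_configuration import Qwen2VLConfig

    return Qwen2VLConfig(**tiny_overrides())
```

Calling `build_tiny_config()` should return a `Qwen2VLConfig` whose sub-configs carry the overridden values; any parameter not listed keeps its documented default.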

get_mask_details() dict[int, easydel.infra.utils.AttnMaskDetail]

Retrieve attention mask details for each layer in the model.

This method generates a dictionary mapping layer indices to their corresponding attention mask details. If a sliding window is defined, each layer is assigned a sliding window attention mask with the specified size.

Returns

A dictionary where keys are layer indices (int) and values are AttnMaskDetail objects specifying the attention mask type and size for each layer.

Return type

dict[int, AttnMaskDetail]

Notes

  • If self.sliding_window is None, an empty dictionary is returned.

  • The method iterates over self.num_hidden_layers to assign mask details for each layer.

  • The attention mask type is set to AttnMaskType.SLIDING when a sliding window is defined.
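The behaviour described in the notes above can be sketched in plain Python. This is not the EasyDeL implementation: `MaskType` and `MaskDetail` are hypothetical stand-ins for `AttnMaskType` and `AttnMaskDetail`, and their field names are assumptions made only so the example runs without EasyDeL installed.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class MaskType(Enum):
    """Stand-in for AttnMaskType; only the member used in this sketch."""

    SLIDING = "sliding"


@dataclass(frozen=True)
class MaskDetail:
    """Stand-in for AttnMaskDetail; the field names here are assumptions."""

    mask_type: MaskType
    size: int


def sketch_mask_details(
    num_hidden_layers: int,
    sliding_window: Optional[int],
) -> dict[int, MaskDetail]:
    """Map every layer index to a sliding-window mask, or return {} if unset."""
    if sliding_window is None:
        # No sliding window configured: no per-layer mask details.
        return {}
    return {
        layer_idx: MaskDetail(mask_type=MaskType.SLIDING, size=sliding_window)
        for layer_idx in range(num_hidden_layers)
    }


if __name__ == "__main__":
    print(sketch_mask_details(num_hidden_layers=2, sliding_window=4096))
```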

get_partition_rules(*args, **kwargs)

Get the partition rules for the model.

Returns

The partition rules.

Return type

tp.Tuple[tp.Tuple[str, PartitionSpec]]
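To illustrate the shape of the return value, the sketch below builds a tuple of (pattern, spec) pairs and resolves a parameter name against it. It does not import JAX: each spec is a plain tuple of mesh-axis names standing in for a `PartitionSpec`. The patterns, the axis names, and the helper `resolve_spec` are invented for illustration, and first-match-wins resolution with `re.search` is an assumption about how such rules are commonly consumed, not something this page states.

```python
import re
from typing import Optional

# Stand-in for jax.sharding.PartitionSpec: one mesh-axis name (or None) per dim.
Spec = tuple[Optional[str], ...]

# Hypothetical rules; the real ones come from get_partition_rules().
EXAMPLE_RULES: tuple[tuple[str, Spec], ...] = (
    (r"embed_tokens/embedding", ("tp", "fsdp")),
    (r"self_attn/(q_proj|k_proj|v_proj)/kernel", ("fsdp", "tp")),
    (r"self_attn/o_proj/kernel", ("tp", "fsdp")),
    (r".*", (None,)),  # catch-all: treat everything else as replicated
)


def resolve_spec(
    param_name: str,
    rules: tuple[tuple[str, Spec], ...] = EXAMPLE_RULES,
) -> Spec:
    """Return the spec of the first rule whose pattern matches the name."""
    for pattern, spec in rules:
        if re.search(pattern, param_name):
            return spec
    raise ValueError(f"no partition rule matches {param_name!r}")
```

Ordering matters under this assumption: more specific patterns go first, and the catch-all goes last.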

keys_to_ignore_at_inference: ClassVar = ['past_key_values']
model_type: str = 'qwen2_vl'
sub_configs: ClassVar = {'text_config': <class 'easydel.modules.qwen2_vl.qwen2_vl_configuration.Qwen2VLTextConfig'>, 'vision_config': <class 'easydel.modules.qwen2_vl.qwen2_vl_configuration.Qwen2VLVisionConfig'>}

class easydel.modules.qwen2_vl.qwen2_vl_configuration.Qwen2VLTextConfig(vocab_size: int = 152064, hidden_size: int = 8192, intermediate_size: int = 29568, num_hidden_layers: int = 80, num_attention_heads: int = 64, num_key_value_heads: int | None = None, hidden_act: str = 'silu', max_position_embeddings: int = 32768, initializer_range: float = 0.02, rms_norm_eps: float = 1e-05, use_cache: bool = True, tie_word_embeddings: bool = False, rope_theta: float = 1000000.0, use_sliding_window: bool = False, sliding_window: int = 4096, max_window_layers: int = 80, attention_dropout: float = 0.0, rope_scaling: dict | None = None, rope_parameters: dict | None = None, layer_types: list[str] | None = None, **kwargs)

Bases: EasyDeLBaseConfig

Configuration for the Qwen2-VL text decoder stack.
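The defaults in the signature above imply a few derived attention shapes, sketched below. Two assumptions are made that this page does not state: the per-head dimension is `hidden_size // num_attention_heads`, and a `num_key_value_heads` of `None` falls back to `num_attention_heads`, as is conventional for Qwen2-style decoders. The function name `derive_attention_shape` is hypothetical.

```python
from typing import Optional


def derive_attention_shape(
    hidden_size: int = 8192,
    num_attention_heads: int = 64,
    num_key_value_heads: Optional[int] = None,
) -> dict[str, int]:
    """Compute head sizes implied by the text-config parameters."""
    if hidden_size % num_attention_heads != 0:
        raise ValueError("hidden_size must be divisible by num_attention_heads")

    # Assumption: None means "same as the number of query heads".
    kv_heads = (
        num_attention_heads if num_key_value_heads is None else num_key_value_heads
    )
    if num_attention_heads % kv_heads != 0:
        raise ValueError("num_attention_heads must be divisible by the KV heads")

    return {
        "head_dim": hidden_size // num_attention_heads,
        "num_key_value_heads": kv_heads,
        # Number of query heads sharing one key/value head (1 means no sharing).
        "queries_per_kv_head": num_attention_heads // kv_heads,
    }
```

Under these assumptions the documented defaults give a head dimension of 128, and passing `num_key_value_heads=8` would have eight query heads share each key/value head.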

base_config_key: str = 'text_config'
keys_to_ignore_at_inference: ClassVar = ['past_key_values']
model_type: str = 'qwen2_vl_text'

class easydel.modules.qwen2_vl.qwen2_vl_configuration.Qwen2VLVisionConfig(depth=32, embed_dim=1280, hidden_size=3584, hidden_act='quick_gelu', mlp_ratio=4, num_heads=16, in_channels=3, patch_size=14, spatial_merge_size=2, temporal_patch_size=2, initializer_range=0.02, **kwargs)

Bases: EasyDeLBaseConfig

Configuration class for the vision component of Qwen2VL model. This class stores the configuration parameters for the vision encoder part of the Qwen2VL multimodal model.

Parameters
  • depth (int, optional, defaults to 32) – Number of layers in the vision transformer.

  • embed_dim (int, optional, defaults to 1280) – Dimensionality of the embeddings produced by the vision encoder.

  • hidden_size (int, optional, defaults to 3584) – Dimensionality of the intermediate representations in the vision transformer.

  • hidden_act (str, optional, defaults to “quick_gelu”) – The non-linear activation function used in the vision transformer.

  • mlp_ratio (int, optional, defaults to 4) – Ratio of the MLP intermediate size to the embedding size (embed_dim) in the MLP layers.

  • num_heads (int, optional, defaults to 16) – Number of attention heads in the vision transformer.

  • in_channels (int, optional, defaults to 3) – Number of input channels for the image (typically 3 for RGB).

  • patch_size (int, optional, defaults to 14) – Size of the patches that the image is divided into.

  • spatial_merge_size (int, optional, defaults to 2) – The merge size for spatial dimensions in the vision transformer.

  • temporal_patch_size (int, optional, defaults to 2) – Size of the temporal patches when processing video input.
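The parameters above interact in a way that a short calculation makes concrete. The sketch assumes standard ViT-style patching (non-overlapping square patches), that `spatial_merge_size` merges each m×m block of neighbouring patches into a single token, and that the MLP width is `embed_dim * mlp_ratio`. Both function names are hypothetical, and the calculation covers a single image frame only.

```python
def vision_token_count(
    height: int,
    width: int,
    patch_size: int = 14,
    spatial_merge_size: int = 2,
) -> int:
    """Number of merged vision tokens for one image of the given pixel size."""
    if height % patch_size or width % patch_size:
        raise ValueError("image sides must be multiples of patch_size")
    grid_h, grid_w = height // patch_size, width // patch_size

    if grid_h % spatial_merge_size or grid_w % spatial_merge_size:
        raise ValueError("patch grid must be divisible by spatial_merge_size")
    # Each spatial_merge_size x spatial_merge_size block becomes one token.
    return (grid_h // spatial_merge_size) * (grid_w // spatial_merge_size)


def vision_layer_dims(
    embed_dim: int = 1280,
    num_heads: int = 16,
    mlp_ratio: int = 4,
) -> dict[str, int]:
    """Per-layer sizes implied by the vision-config parameters."""
    if embed_dim % num_heads != 0:
        raise ValueError("embed_dim must be divisible by num_heads")
    return {
        "head_dim": embed_dim // num_heads,
        "mlp_hidden_dim": embed_dim * mlp_ratio,
    }
```

With the documented defaults, a 224×224 image yields a 16×16 patch grid that merges down to 64 tokens, and each vision layer has a head dimension of 80 and an MLP width of 5120.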

base_config_key: str = 'vision_config'
model_type: str = 'qwen2_vl'