easydel.modules.qwen2_vl.qwen2_vl_configuration


class easydel.modules.qwen2_vl.qwen2_vl_configuration.Qwen2VLConfig(text_config: Optional[Union[Mapping[str, Any], Qwen2VLTextConfig]] = None, vision_config: Optional[Union[Mapping[str, Any], Qwen2VLVisionConfig]] = None, image_token_id: int = 151655, video_token_id: int = 151656, vision_start_token_id: int = 151652, vision_end_token_id: int = 151653, **kwargs)

Bases: EasyDeLBaseConfig

This is the configuration class that stores the configuration of a Qwen2VLModel. It is used to instantiate a Qwen2-VL model according to the specified arguments, which define the model architecture. Instantiating a configuration with the defaults yields a configuration similar to that of [Qwen/Qwen2-VL-7B-Instruct](https://huggingface.co/Qwen/Qwen2-VL-7B-Instruct).

Parameters

text_config (Union[Qwen2VLTextConfig, dict], optional) – The config for the text decoder.
vision_config (Union[Qwen2VLVisionConfig, dict], optional) – The config for the vision encoder.
image_token_id (int, optional, defaults to 151655) – The image token index to encode image prompts.
video_token_id (int, optional, defaults to 151656) – The video token index to encode video prompts.
vision_start_token_id (int, optional, defaults to 151652) – The token index to denote start of vision input.
vision_end_token_id (int, optional, defaults to 151653) – The token index to denote end of vision input.
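As a rough illustration of how the four token ids relate, the sketch below builds the token span for one image: a start marker, a run of image placeholder tokens, and an end marker. The helper names `wrap_vision_span` and `count_placeholders` are hypothetical and not part of EasyDeL; the layout is an assumption drawn from the parameter descriptions above.

```python
# Default ids copied from the Qwen2VLConfig signature.
IMAGE_TOKEN_ID = 151655
VIDEO_TOKEN_ID = 151656
VISION_START_TOKEN_ID = 151652
VISION_END_TOKEN_ID = 151653


def wrap_vision_span(num_placeholders: int, placeholder_id: int = IMAGE_TOKEN_ID) -> list:
    """Hypothetical helper: bracket a run of placeholder tokens with start/end markers."""
    if num_placeholders < 0:
        raise ValueError("num_placeholders must be non-negative")
    return [VISION_START_TOKEN_ID] + [placeholder_id] * num_placeholders + [VISION_END_TOKEN_ID]


def count_placeholders(token_ids, placeholder_id: int = IMAGE_TOKEN_ID) -> int:
    """Hypothetical helper: count placeholder positions that visual features would fill."""
    return sum(1 for token in token_ids if token == placeholder_id)


# One image represented by four placeholder positions.
image_span = wrap_vision_span(4)
# A video span uses the video placeholder id between the same start/end markers.
video_span = wrap_vision_span(2, placeholder_id=VIDEO_TOKEN_ID)
```

Passing `VIDEO_TOKEN_ID` as the placeholder gives the analogous span for video input.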

get_mask_details() → dict[int, easydel.infra.utils.AttnMaskDetail]

Retrieve attention mask details for each layer in the model.

This method generates a dictionary mapping layer indices to their corresponding attention mask details. If a sliding window is defined, each layer is assigned a sliding window attention mask with the specified size.

Returns: A dictionary where keys are layer indices (int) and values are AttnMaskDetail objects specifying the attention mask type and size for each layer.
Return type: dict[int, AttnMaskDetail]

Notes

If self.sliding_window is None, an empty dictionary is returned.
The method iterates over self.num_hidden_layers to assign mask details for each layer.
The attention mask type is set to AttnMaskType.SLIDING when a sliding window is defined.
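The behaviour described in these notes can be sketched in plain Python. `MaskType`, `MaskDetail`, and `sketch_mask_details` are hypothetical stand-ins for `AttnMaskType`, `AttnMaskDetail`, and the method itself; the real classes may carry different field names or additional fields.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class MaskType(Enum):
    """Stand-in for AttnMaskType (only the member mentioned in the notes)."""
    SLIDING = "sliding"


@dataclass(frozen=True)
class MaskDetail:
    """Stand-in for AttnMaskDetail; field names are assumptions."""
    mask_type: MaskType
    size: int


def sketch_mask_details(num_hidden_layers: int, sliding_window: Optional[int]) -> dict:
    """Map each layer index to a sliding-window mask, or return {} without a window."""
    if sliding_window is None:
        return {}
    return {
        layer_idx: MaskDetail(mask_type=MaskType.SLIDING, size=sliding_window)
        for layer_idx in range(num_hidden_layers)
    }


details = sketch_mask_details(num_hidden_layers=3, sliding_window=4096)
```

Every layer receives the same mask type and size in this sketch, matching the description that each layer is assigned a sliding window mask of the specified size.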

get_partition_rules(*args, **kwargs)

Get the partition rules for the model.

Returns: The partition rules.
Return type: tp.Tuple[tp.Tuple[str, PartitionSpec]]
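To make the return type more concrete, the sketch below shows one way pairs of (name pattern, partition spec) could be resolved against a parameter path. Everything here is an assumption for illustration: the patterns are invented, plain tuples stand in for `PartitionSpec`, and first-match regex lookup is a common convention rather than confirmed EasyDeL behaviour.

```python
import re

# Hypothetical rules; plain tuples stand in for PartitionSpec objects.
EXAMPLE_RULES = (
    (r"embed_tokens/embedding", ("tp", None)),
    (r"self_attn/(q_proj|k_proj|v_proj)/kernel", (None, "tp")),
    (r".*", (None,)),  # catch-all: leave everything else replicated
)


def match_rule(param_path: str, rules=EXAMPLE_RULES):
    """Return the spec of the first rule whose pattern matches param_path."""
    for pattern, spec in rules:
        if re.search(pattern, param_path):
            return spec
    raise ValueError(f"no partition rule matches {param_path!r}")


attn_spec = match_rule("model/layers/0/self_attn/q_proj/kernel")
norm_spec = match_rule("model/norm/scale")
```

Ordering matters under this assumption: the catch-all pattern must come last, or it would shadow the more specific rules.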

keys_to_ignore_at_inference: ClassVar = ['past_key_values']

model_type: str = 'qwen2_vl'

sub_configs: ClassVar = {'text_config': <class 'easydel.modules.qwen2_vl.qwen2_vl_configuration.Qwen2VLTextConfig'>, 'vision_config': <class 'easydel.modules.qwen2_vl.qwen2_vl_configuration.Qwen2VLVisionConfig'>}

class easydel.modules.qwen2_vl.qwen2_vl_configuration.Qwen2VLTextConfig(vocab_size: int = 152064, hidden_size: int = 8192, intermediate_size: int = 29568, num_hidden_layers: int = 80, num_attention_heads: int = 64, num_key_value_heads: int | None = None, hidden_act: str = 'silu', max_position_embeddings: int = 32768, initializer_range: float = 0.02, rms_norm_eps: float = 1e-05, use_cache: bool = True, tie_word_embeddings: bool = False, rope_theta: float = 1000000.0, use_sliding_window: bool = False, sliding_window: int = 4096, max_window_layers: int = 80, attention_dropout: float = 0.0, rope_scaling: dict | None = None, rope_parameters: dict | None = None, layer_types: list[str] | None = None, **kwargs)

Bases: EasyDeLBaseConfig

Configuration for the Qwen2-VL text decoder stack.

base_config_key: str = 'text_config'

keys_to_ignore_at_inference: ClassVar = ['past_key_values']

model_type: str = 'qwen2_vl_text'
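Several text-decoder defaults combine into derived shapes. The sketch below works through that arithmetic with a small stand-in dataclass, `TextShapeSketch`, which is hypothetical and not the EasyDeL class. It assumes that `num_key_value_heads=None` falls back to `num_attention_heads`, a common convention that this page does not confirm.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TextShapeSketch:
    """Hypothetical stand-in holding a few defaults from the Qwen2VLTextConfig signature."""
    hidden_size: int = 8192
    num_attention_heads: int = 64
    num_key_value_heads: Optional[int] = None

    def __post_init__(self):
        # Assumed fallback: no explicit KV head count means one KV head per query head.
        if self.num_key_value_heads is None:
            self.num_key_value_heads = self.num_attention_heads

    @property
    def head_dim(self) -> int:
        # Per-head width: 8192 // 64 = 128 with the defaults.
        return self.hidden_size // self.num_attention_heads

    @property
    def queries_per_kv_head(self) -> int:
        # Grouped-query attention group size; 1 means plain multi-head attention.
        return self.num_attention_heads // self.num_key_value_heads


default_shapes = TextShapeSketch()
grouped_shapes = TextShapeSketch(num_key_value_heads=8)
```

With the defaults the group size is 1; setting `num_key_value_heads=8` makes each key/value head serve 8 query heads.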

class easydel.modules.qwen2_vl.qwen2_vl_configuration.Qwen2VLVisionConfig(depth=32, embed_dim=1280, hidden_size=3584, hidden_act='quick_gelu', mlp_ratio=4, num_heads=16, in_channels=3, patch_size=14, spatial_merge_size=2, temporal_patch_size=2, initializer_range=0.02, **kwargs)

Bases: EasyDeLBaseConfig

Configuration class for the vision encoder of the Qwen2-VL multimodal model.

Parameters

depth (int, optional, defaults to 32) – Number of layers in the vision transformer.
embed_dim (int, optional, defaults to 1280) – Dimensionality of the embeddings produced by the vision encoder.
hidden_size (int, optional, defaults to 3584) – Dimensionality of the intermediate representations in the vision transformer.
hidden_act (str, optional, defaults to "quick_gelu") – The non-linear activation function used in the vision transformer.
mlp_ratio (int, optional, defaults to 4) – Expansion factor of the MLP layers: the ratio of the MLP intermediate size to the transformer width.
num_heads (int, optional, defaults to 16) – Number of attention heads in the vision transformer.
in_channels (int, optional, defaults to 3) – Number of input channels for the image (typically 3 for RGB).
patch_size (int, optional, defaults to 14) – Size of the patches that the image is divided into.
spatial_merge_size (int, optional, defaults to 2) – The merge size for spatial dimensions in the vision transformer.
temporal_patch_size (int, optional, defaults to 2) – Size of the temporal patches when processing video input.
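The patch-related parameters determine how many visual tokens an image produces. The sketch below works through that arithmetic under the assumption that the image is cut into `patch_size` × `patch_size` patches and that each `spatial_merge_size` × `spatial_merge_size` block of patches is merged into one token. The function names are hypothetical, and real preprocessing may resize images before patching.

```python
def vision_token_count(height: int, width: int, patch_size: int = 14,
                       spatial_merge_size: int = 2) -> int:
    """Number of merged visual tokens for one image of the given pixel size."""
    block = patch_size * spatial_merge_size
    if height % block or width % block:
        raise ValueError(f"height and width must be multiples of {block}")
    grid_h = height // patch_size  # patches along the vertical axis
    grid_w = width // patch_size   # patches along the horizontal axis
    return (grid_h * grid_w) // (spatial_merge_size ** 2)


def flattened_patch_dim(in_channels: int = 3, temporal_patch_size: int = 2,
                        patch_size: int = 14) -> int:
    """Number of values in one flattened spatio-temporal patch."""
    return in_channels * temporal_patch_size * patch_size * patch_size


# A 448x448 image gives a 32x32 patch grid, merged 2x2 into 256 tokens.
tokens_448 = vision_token_count(448, 448)
# 3 channels * 2 frames * 14 * 14 pixels = 1176 values per patch.
patch_dim = flattened_patch_dim()
```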

base_config_key: str = 'vision_config'

model_type: str = 'qwen2_vl'
