easydel.modules.qwen2_vl.qwen2_vl_configuration
- class easydel.modules.qwen2_vl.qwen2_vl_configuration.Qwen2VLConfig(text_config: Optional[Union[Mapping[str, Any], Qwen2VLTextConfig]] = None, vision_config: Optional[Union[Mapping[str, Any], Qwen2VLVisionConfig]] = None, image_token_id: int = 151655, video_token_id: int = 151656, vision_start_token_id: int = 151652, vision_end_token_id: int = 151653, **kwargs)
Bases: EasyDeLBaseConfig
This is the configuration class that stores the configuration of a Qwen2VLModel. It is used to instantiate a Qwen2-VL model according to the specified arguments, which define the model architecture. Instantiating a configuration with the defaults yields a configuration similar to that of [Qwen/Qwen2-VL-7B-Instruct](https://huggingface.co/Qwen/Qwen2-VL-7B-Instruct).
- Parameters
text_config (Union[Qwen2VLTextConfig, dict], optional) – The config for the text decoder.
vision_config (Union[Qwen2VLVisionConfig, dict], optional) – The config for the vision encoder.
image_token_id (int, optional, defaults to 151655) – The image token index to encode image prompts.
video_token_id (int, optional, defaults to 151656) – The video token index to encode video prompts.
vision_start_token_id (int, optional, defaults to 151652) – The token index to denote start of vision input.
vision_end_token_id (int, optional, defaults to 151653) – The token index to denote end of vision input.
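The four token IDs above can be illustrated with a minimal sketch. It assumes the usual Qwen2-VL layout, in which a run of image placeholder tokens is framed by the vision start and end tokens and the placeholder positions are later filled with vision features. The helper names `frame_image_placeholders` and `image_positions` are hypothetical and are not part of EasyDeL; only the ID values come from the signature.

```python
# Default IDs copied from the Qwen2VLConfig signature.
IMAGE_TOKEN_ID = 151655
VISION_START_TOKEN_ID = 151652
VISION_END_TOKEN_ID = 151653


def frame_image_placeholders(prefix_ids, num_image_tokens, suffix_ids):
    """Hypothetical helper: wrap a run of image placeholders in start/end markers.

    ASSUMPTION: the framing order (start, placeholders, end) follows the
    common Qwen2-VL prompt layout; this page only documents the ID values.
    """
    vision_span = (
        [VISION_START_TOKEN_ID]
        + [IMAGE_TOKEN_ID] * num_image_tokens
        + [VISION_END_TOKEN_ID]
    )
    return list(prefix_ids) + vision_span + list(suffix_ids)


def image_positions(input_ids):
    """Return the indices holding image placeholder tokens."""
    return [i for i, tok in enumerate(input_ids) if tok == IMAGE_TOKEN_ID]


ids = frame_image_placeholders([1, 2], num_image_tokens=3, suffix_ids=[3])
print(ids)
print(image_positions(ids))
```

A video prompt would presumably use video_token_id in place of image_token_id inside the same start/end frame.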
- get_mask_details() -> dict[int, easydel.infra.utils.AttnMaskDetail]
Retrieve attention mask details for each layer in the model.
This method generates a dictionary mapping layer indices to their corresponding attention mask details. If a sliding window is defined, each layer is assigned a sliding window attention mask with the specified size.
- Returns
A dictionary where keys are layer indices (int) and values are AttnMaskDetail objects specifying the attention mask type and size for each layer.
- Return type
dict[int, AttnMaskDetail]
Notes
If self.sliding_window is None, an empty dictionary is returned.
The method iterates over self.num_hidden_layers to assign mask details for each layer.
The attention mask type is set to AttnMaskType.SLIDING when a sliding window is defined.
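The behaviour described in the notes can be sketched in plain Python. This is not the EasyDeL implementation: `AttnMaskType` and `AttnMaskDetail` below are stand-ins defined locally, their field names (`mask_type`, `size`) are assumptions, and the free function takes the two config attributes as arguments instead of reading them from `self`.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class AttnMaskType(Enum):
    # Stand-in for EasyDeL's AttnMaskType; only the member named in the notes.
    SLIDING = "sliding"


@dataclass(frozen=True)
class AttnMaskDetail:
    # Stand-in for easydel.infra.utils.AttnMaskDetail; field names are assumed.
    mask_type: AttnMaskType
    size: int


def get_mask_details(
    num_hidden_layers: int, sliding_window: Optional[int]
) -> dict[int, AttnMaskDetail]:
    """Map each layer index to a sliding-window mask, or return {} if no window is set."""
    if sliding_window is None:
        return {}
    return {
        layer_idx: AttnMaskDetail(mask_type=AttnMaskType.SLIDING, size=sliding_window)
        for layer_idx in range(num_hidden_layers)
    }


details = get_mask_details(num_hidden_layers=4, sliding_window=4096)
print(sorted(details))
print(details[0])
```

Returning an empty dictionary when no window is defined lets callers treat "no entry for this layer" as the default full attention mask.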
- get_partition_rules(*args, **kwargs)
Get the partition rules for the model.
- Returns
The partition rules.
- Return type
tp.Tuple[tp.Tuple[str, PartitionSpec]]
- keys_to_ignore_at_inference: ClassVar = ['past_key_values']
- model_type: str = 'qwen2_vl'
- sub_configs: ClassVar = {'text_config': <class 'easydel.modules.qwen2_vl.qwen2_vl_configuration.Qwen2VLTextConfig'>, 'vision_config': <class 'easydel.modules.qwen2_vl.qwen2_vl_configuration.Qwen2VLVisionConfig'>}
- class easydel.modules.qwen2_vl.qwen2_vl_configuration.Qwen2VLTextConfig(vocab_size: int = 152064, hidden_size: int = 8192, intermediate_size: int = 29568, num_hidden_layers: int = 80, num_attention_heads: int = 64, num_key_value_heads: int | None = None, hidden_act: str = 'silu', max_position_embeddings: int = 32768, initializer_range: float = 0.02, rms_norm_eps: float = 1e-05, use_cache: bool = True, tie_word_embeddings: bool = False, rope_theta: float = 1000000.0, use_sliding_window: bool = False, sliding_window: int = 4096, max_window_layers: int = 80, attention_dropout: float = 0.0, rope_scaling: dict | None = None, rope_parameters: dict | None = None, layer_types: list[str] | None = None, **kwargs)
Bases: EasyDeLBaseConfig
Configuration for the Qwen2-VL text decoder stack.
- base_config_key: str = 'text_config'
- keys_to_ignore_at_inference: ClassVar = ['past_key_values']
- model_type: str = 'qwen2_vl_text'
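As a rough illustration of how the attention-related defaults in the Qwen2VLTextConfig signature relate to each other, the sketch below derives the per-head dimension and the grouped-query ratio. It rests on two assumptions that this page does not state: that the head dimension is hidden_size divided by num_attention_heads, and that num_key_value_heads falls back to num_attention_heads when left as None. The function name `derived_attention_shapes` is hypothetical.

```python
from typing import Optional


def derived_attention_shapes(
    hidden_size: int = 8192,
    num_attention_heads: int = 64,
    num_key_value_heads: Optional[int] = None,
) -> dict:
    """Hypothetical helper: derive attention shapes from text-config values."""
    if hidden_size % num_attention_heads != 0:
        raise ValueError("hidden_size must be divisible by num_attention_heads")
    # ASSUMPTION: None means "one key/value head per query head".
    kv_heads = num_attention_heads if num_key_value_heads is None else num_key_value_heads
    if num_attention_heads % kv_heads != 0:
        raise ValueError("num_attention_heads must be divisible by num_key_value_heads")
    return {
        # ASSUMPTION: head_dim is not a separate config field here.
        "head_dim": hidden_size // num_attention_heads,
        "num_key_value_heads": kv_heads,
        "queries_per_kv_head": num_attention_heads // kv_heads,
    }


defaults = derived_attention_shapes()
grouped = derived_attention_shapes(num_key_value_heads=8)
print(defaults)
print(grouped)
```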
- class easydel.modules.qwen2_vl.qwen2_vl_configuration.Qwen2VLVisionConfig(depth=32, embed_dim=1280, hidden_size=3584, hidden_act='quick_gelu', mlp_ratio=4, num_heads=16, in_channels=3, patch_size=14, spatial_merge_size=2, temporal_patch_size=2, initializer_range=0.02, **kwargs)
Bases: EasyDeLBaseConfig
Configuration class for the vision component of the Qwen2VL model. This class stores the configuration parameters for the vision encoder of the Qwen2VL multimodal model.
- Parameters
depth (int, optional, defaults to 32) – Number of layers in the vision transformer.
embed_dim (int, optional, defaults to 1280) – Dimensionality of the embeddings produced by the vision encoder.
hidden_size (int, optional, defaults to 3584) – Dimensionality of the intermediate representations in the vision transformer.
hidden_act (str, optional, defaults to “quick_gelu”) – The non-linear activation function used in the vision transformer.
mlp_ratio (int, optional, defaults to 4) – Ratio of the MLP intermediate size to the embedding size (embed_dim) in the vision transformer blocks.
num_heads (int, optional, defaults to 16) – Number of attention heads in the vision transformer.
in_channels (int, optional, defaults to 3) – Number of input channels for the image (typically 3 for RGB).
patch_size (int, optional, defaults to 14) – Size of the patches that the image is divided into.
spatial_merge_size (int, optional, defaults to 2) – The merge size for spatial dimensions in the vision transformer.
temporal_patch_size (int, optional, defaults to 2) – Size of the temporal patches when processing video input.
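The patch-related parameters above interact arithmetically, which a small sketch can make concrete. It assumes the usual Qwen2-VL scheme: the input is cut into patch_size × patch_size patches over groups of temporal_patch_size frames, and each spatial_merge_size × spatial_merge_size block of patches is merged into one token for the text decoder. It also assumes the MLP width is embed_dim × mlp_ratio. The function `vision_token_budget` is hypothetical and is not part of EasyDeL.

```python
def vision_token_budget(
    height,
    width,
    frames=2,
    *,
    patch_size=14,
    spatial_merge_size=2,
    temporal_patch_size=2,
    in_channels=3,
    embed_dim=1280,
    mlp_ratio=4,
):
    """Hypothetical helper: derive patch and token counts from the vision config."""
    unit = patch_size * spatial_merge_size
    if height % unit or width % unit:
        raise ValueError(f"height and width must be multiples of {unit}")
    if frames % temporal_patch_size:
        raise ValueError(f"frames must be a multiple of {temporal_patch_size}")

    grid_t = frames // temporal_patch_size
    grid_h = height // patch_size
    grid_w = width // patch_size
    num_patches = grid_t * grid_h * grid_w
    return {
        "grid": (grid_t, grid_h, grid_w),
        "num_patches": num_patches,
        # ASSUMPTION: each merge_size x merge_size block becomes one token.
        "merged_tokens": num_patches // spatial_merge_size**2,
        # Values per flattened patch before the patch embedding.
        "flat_patch_dim": in_channels * temporal_patch_size * patch_size**2,
        # ASSUMPTION: MLP width is embed_dim scaled by mlp_ratio.
        "mlp_hidden_dim": embed_dim * mlp_ratio,
    }


stats = vision_token_budget(448, 448)
print(stats)
```

Under these assumptions, a still image is treated as temporal_patch_size identical frames, so a 448 × 448 image would occupy 256 placeholder positions in the text sequence.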
- base_config_key: str = 'vision_config'
- model_type: str = 'qwen2_vl'