easydel.modules.llava.llava_configuration#

class easydel.modules.llava.llava_configuration.LlavaConfig(vision_config=None, text_config=None, image_token_index=32000, projector_hidden_act='gelu', vision_feature_select_strategy='default', vision_feature_layer=-2, image_seq_length=576, multimodal_projector_bias=True, **kwargs)[source]#

Bases: EasyDeLBaseConfig

This is the configuration class to store the configuration of a [LlavaForConditionalGeneration]. It is used to instantiate an Llava model according to the specified arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a similar configuration to that of the Llava-9B.

e.g. [llava-hf/llava-9b](https://huggingface.co/llava-hf/llava-9b)

Configuration objects inherit from [PretrainedConfig] and can be used to control the model outputs. Read the documentation from [PretrainedConfig] for more information.

Parameters
  • vision_config (Union[AutoConfig, dict], optional, defaults to CLIPVisionConfig) – The config object or dictionary of the vision backbone.

  • text_config (Union[AutoConfig, dict], optional, defaults to LlamaConfig) – The config object or dictionary of the text backbone.

  • image_token_index (int, optional, defaults to 32000) – The image token index to encode the image prompt.

  • projector_hidden_act (str, optional, defaults to “gelu”) – The activation function used by the multimodal projector.

  • vision_feature_select_strategy (str, optional, defaults to “default”) – The feature selection strategy used to select the vision feature from the vision backbone. Can be one of “default” or “full”.

  • vision_feature_layer (Union[int, List[int]], optional, defaults to -2) – The index of the layer to select the vision feature. If multiple indices are provided, the vision feature of the corresponding indices will be concatenated to form the vision features.

  • image_seq_length (int, optional, defaults to 576) – Sequence length of one image embedding.

  • multimodal_projector_bias (bool, optional, defaults to True) – Whether to use bias in the multimodal projector.

get_partition_rules(*args, **kwargs)[source]#

Get the partition rules for distributed training by combining the partition rules from both the text and vision configurations.

This method retrieves the partition rules from the text_config and vision_config components and combines them to create a comprehensive set of rules for the entire multimodal model.

Parameters
  • *args – Variable length argument list to be passed to the text and vision configs.

  • **kwargs – Arbitrary keyword arguments to be passed to the text and vision configs.

Returns

A combined tuple of partition rules from both text and vision configurations.

Return type

tuple

model_type: str = 'llava'#
sub_configs: dict[str, 'PretrainedConfig'] = {'text_config': <class 'easydel.modules.auto.auto_configuration.AutoEasyDeLConfig'>, 'vision_config': <class 'easydel.modules.auto.auto_configuration.AutoEasyDeLConfig'>}#