easydel.modules.aya_vision.__init__
- class easydel.modules.aya_vision.__init__.AyaVisionConfig(vision_config=None, text_config=None, vision_feature_select_strategy='full', vision_feature_layer=-1, downsample_factor=2, adapter_layer_norm_eps=1e-06, image_token_index=255036, **kwargs)
Bases: EasyDeLBaseConfig
This configuration class stores the configuration of an [AyaVisionForConditionalGeneration] model. It is used to instantiate an AyaVision model according to the specified arguments, which define the model architecture. Instantiating a configuration with the defaults yields a configuration similar to that of AyaVision, e.g. [CohereForAI/aya-vision-8b](https://huggingface.co/CohereForAI/aya-vision-8b).
Configuration objects inherit from [PretrainedConfig] and can be used to control the model outputs. Read the documentation from [PretrainedConfig] for more information.
- Parameters
vision_config (Union[AutoConfig, dict], optional, defaults to CLIPVisionConfig) – The config object or dictionary of the vision backbone.
text_config (Union[AutoConfig, dict], optional, defaults to LlamaConfig) – The config object or dictionary of the text backbone.
vision_feature_select_strategy (str, optional, defaults to “full”) – The feature selection strategy used to select the vision feature from the vision backbone. Can be one of “default” or “full”. If “default”, the CLS token is removed from the vision features. If “full”, the full vision features are used.
vision_feature_layer (int, optional, defaults to -1) – The index of the layer to select the vision feature.
downsample_factor (int, optional, defaults to 2) – The downsample factor to apply to the vision features.
adapter_layer_norm_eps (float, optional, defaults to 1e-06) – The epsilon value used for layer normalization in the adapter.
image_token_index (int, optional, defaults to 255036) – The image token index to encode the image prompt.
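The feature-selection parameters above can be illustrated with a small sketch. `AyaVisionConfigSketch` and `select_vision_features` are hypothetical stand-ins written for this example, not EasyDeL classes or functions. The sketch assumes the CLS token sits at position 0 of the token axis and that the strategy string is validated eagerly; neither detail is confirmed by this page.

```python
from dataclasses import dataclass


@dataclass
class AyaVisionConfigSketch:
    """Hypothetical stand-in mirroring the documented defaults (not the EasyDeL class)."""

    vision_feature_select_strategy: str = "full"
    vision_feature_layer: int = -1
    downsample_factor: int = 2
    adapter_layer_norm_eps: float = 1e-6
    image_token_index: int = 255036

    def __post_init__(self):
        # Assumption: only the two documented strategies are accepted.
        if self.vision_feature_select_strategy not in ("default", "full"):
            raise ValueError(
                f"Unknown strategy: {self.vision_feature_select_strategy!r}"
            )


def select_vision_features(hidden_states, config):
    """Pick one layer's features and optionally drop the CLS token.

    hidden_states: list over layers, each a list of per-token feature vectors.
    """
    layer_features = hidden_states[config.vision_feature_layer]
    if config.vision_feature_select_strategy == "default":
        # Assumption: the CLS token is the first token of the sequence.
        return layer_features[1:]
    return layer_features


# Two layers, three tokens each, one feature per token.
hidden_states = [[[0.0], [1.0], [2.0]], [[9.0], [8.0], [7.0]]]
full = select_vision_features(hidden_states, AyaVisionConfigSketch())
default = select_vision_features(
    hidden_states, AyaVisionConfigSketch(vision_feature_select_strategy="default")
)
```

With the default `vision_feature_layer=-1`, both calls read from the last layer; only the "default" strategy shortens the token axis by one.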
- get_partition_rules(*args, **kwargs)
Retrieves the combined partition rules from the text and vision configurations.
- Parameters
*args – Positional arguments passed to the underlying config partition rule methods.
**kwargs – Keyword arguments passed to the underlying config partition rule methods.
- Returns
Combined partition rules from both text and vision models.
- Return type
Tuple
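As a rough illustration of how rules from two sub-configurations might be combined, the sketch below concatenates two tuples of rules. `SubConfigSketch` and `CompositeConfigSketch` are hypothetical classes, the rule format (a regex paired with a placeholder spec string) is an assumption, and so is the text-then-vision ordering. The actual EasyDeL rule objects and ordering may differ.

```python
from typing import Tuple

# Assumed rule shape for illustration: (parameter-name regex, sharding spec label).
Rule = Tuple[str, str]


class SubConfigSketch:
    """Hypothetical sub-config that simply returns a fixed tuple of rules."""

    def __init__(self, rules):
        self._rules = tuple(rules)

    def get_partition_rules(self, *args, **kwargs) -> Tuple[Rule, ...]:
        return self._rules


class CompositeConfigSketch:
    """Hypothetical composite config holding a text and a vision sub-config."""

    def __init__(self, text_config, vision_config):
        self.text_config = text_config
        self.vision_config = vision_config

    def get_partition_rules(self, *args, **kwargs) -> Tuple[Rule, ...]:
        # Forward the same arguments to both sub-configs, then concatenate.
        text_rules = self.text_config.get_partition_rules(*args, **kwargs)
        vision_rules = self.vision_config.get_partition_rules(*args, **kwargs)
        return text_rules + vision_rules


text = SubConfigSketch([(r"embed_tokens/.*", "tp"), (r".*", "replicated")])
vision = SubConfigSketch([(r"patch_embedding/.*", "replicated")])
combined = CompositeConfigSketch(text, vision).get_partition_rules()
```

If rules are matched in order, a catch-all pattern in the first group would shadow later rules, so a real implementation needs to pay attention to ordering.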
- model_type: str = 'aya_vision'
- sub_configs: dict[str, 'PretrainedConfig'] = {'text_config': <class 'easydel.modules.auto.auto_configuration.AutoEasyDeLConfig'>, 'vision_config': <class 'easydel.modules.auto.auto_configuration.AutoEasyDeLConfig'>}
- class easydel.modules.aya_vision.__init__.AyaVisionForConditionalGeneration(*args: Any, **kwargs: Any)
Bases: EasyDeLBaseModule
AyaVision model for conditional text generation based on image inputs. Combines a vision tower and a language model with a multi-modal projector.
- config
Configuration object.
- Type
AyaVisionConfig
- dtype
Data type for computation.
- Type
jnp.dtype
- param_dtype
Data type for parameters.
- Type
jnp.dtype
- precision
JAX precision level.
- Type
jax.lax.PrecisionLike
- rngs
Random number generators.
- Type
nn.Rngs
- get_image_features(pixel_values: Union[Array, ndarray, bool, number]) → Union[Array, ndarray, bool, number]
Extracts and projects image features from the vision tower.
- Parameters
pixel_values (chex.Array) – Input pixel values for the images.
- Returns
Processed image features ready for the language model.
- Return type
chex.Array
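Projected image features are typically consumed by placing them at the positions of image placeholder tokens in the text embedding sequence. The sketch below shows that pattern with plain Python lists. `merge_image_features` is a hypothetical helper, and the merging step itself is an assumption based on how LLaVA-style models commonly work; this page does not describe how EasyDeL performs it. Only the default value of `image_token_index` comes from the documentation above.

```python
IMAGE_TOKEN_INDEX = 255036  # documented default of image_token_index


def merge_image_features(
    input_ids, text_embeds, image_features, image_token_index=IMAGE_TOKEN_INDEX
):
    """Replace embeddings at image-placeholder positions with image features.

    input_ids: token IDs for one sequence.
    text_embeds: one embedding vector per token.
    image_features: one projected feature vector per placeholder token.
    """
    positions = [i for i, tok in enumerate(input_ids) if tok == image_token_index]
    if len(positions) != len(image_features):
        raise ValueError(
            f"{len(positions)} placeholder tokens but "
            f"{len(image_features)} image feature vectors"
        )
    merged = list(text_embeds)  # shallow copy so the input stays untouched
    for pos, feature in zip(positions, image_features):
        merged[pos] = feature
    return merged


input_ids = [5, IMAGE_TOKEN_INDEX, IMAGE_TOKEN_INDEX, 7]
text_embeds = [[0.1], [0.0], [0.0], [0.2]]
image_features = [[1.0], [2.0]]
merged = merge_image_features(input_ids, text_embeds, image_features)
```

The length check makes a mismatch between the prompt's placeholder count and the number of image feature vectors fail loudly rather than silently misalign.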
- init_cache(batch_size, max_length, starts=None, shardings=None, pad_token_id=None)
Initializes and returns a standard (non-paged) Key-Value cache.
This method first creates the necessary metadata using create_cache_metadata and then calls TransformerCache.init_cache to allocate and initialize the cache tensors based on the model’s configuration, dtype, sharding, quantization settings, and provided batch size and maximum length.
- Parameters
batch_size (int) – The batch size for the cache.
max_length (int) – The maximum sequence length the cache needs to support.
starts (int | None) – Optional starting positions for the cache sequences. If provided, it influences the initial cache state. Defaults to None (usually treated as 0).
shardings (dict | None) – Optional dictionary specifying sharding configurations. Note: this argument appears to be unused in the current implementation.
pad_token_id (int | None) – The ID of the padding token. If None, it’s inferred.
- Returns
An initialized standard TransformerCache object.
- Return type
TransformerCache
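The two-step pattern described above (build metadata, then allocate the cache from it) can be sketched without any array library by tracking shapes only. `CacheMetadataSketch`, `LayerCacheSketch`, and `init_cache_sketch` are hypothetical names for this illustration and do not reflect the real `TransformerCache` API. The `[batch, length, kv_heads, head_dim]` layout is also an assumption.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class CacheMetadataSketch:
    """Hypothetical metadata: everything needed to size the cache."""

    batch_size: int
    max_length: int
    num_layers: int
    num_kv_heads: int
    head_dim: int


@dataclass
class LayerCacheSketch:
    """Hypothetical per-layer cache entry, tracking shapes instead of tensors."""

    key_shape: Tuple[int, int, int, int]
    value_shape: Tuple[int, int, int, int]
    index: List[int]  # next write position for each sequence in the batch


def init_cache_sketch(
    metadata: CacheMetadataSketch, starts: Optional[int] = None
) -> List[LayerCacheSketch]:
    # Assumption: a missing start position is treated as 0.
    start = 0 if starts is None else starts
    shape = (
        metadata.batch_size,
        metadata.max_length,
        metadata.num_kv_heads,
        metadata.head_dim,
    )
    return [
        LayerCacheSketch(shape, shape, [start] * metadata.batch_size)
        for _ in range(metadata.num_layers)
    ]


metadata = CacheMetadataSketch(
    batch_size=2, max_length=16, num_layers=3, num_kv_heads=4, head_dim=8
)
cache = init_cache_sketch(metadata)
```

Keeping the metadata separate from the allocation means the same description can be reused to build caches with different start positions.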
- loss_type = 'ForCausalLM'
- prepare_inputs_for_generation(input_ids: Union[Array, ndarray, bool, number], max_length: int, pad_token_id: int, starts: int | None = None, pixel_values: Optional[Union[Array, ndarray, bool, number]] = None, attention_mask: Optional[Union[Array, ndarray, bool, number]] = None)
Prepares inputs for text generation, including pixel values if provided.
- Parameters
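input_ids (chex.Array) – Initial input token IDs.
max_length (int) – Maximum generation length.
pad_token_id (int) – The ID of the padding token.
starts (int | None) – Optional starting positions for the cache sequences. Defaults to None.
pixel_values (Optional[chex.Array]) – Pixel values for image input.
attention_mask (Optional[chex.Array]) – Attention mask.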
- Returns
Model inputs ready for generation.
- Return type
dict
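A minimal sketch of what such a preparation step commonly does is shown below, using nested Python lists in place of arrays. `prepare_inputs_sketch` is a hypothetical function. The returned keys, the mask extension to `max_length`, and the position-ID computation are assumptions modeled on typical causal-LM generation code, not a description of the EasyDeL implementation.

```python
def prepare_inputs_sketch(
    input_ids,
    max_length,
    pad_token_id,
    starts=None,
    pixel_values=None,
    attention_mask=None,
):
    """Build a dict of model inputs for the first generation step."""
    seq_length = len(input_ids[0])
    if attention_mask is None:
        # Assumption: padding positions are exactly those equal to pad_token_id.
        attention_mask = [
            [int(tok != pad_token_id) for tok in row] for row in input_ids
        ]

    # Extend the mask to max_length so future tokens can be attended to.
    extended_mask = [row + [1] * (max_length - seq_length) for row in attention_mask]

    # Position IDs count attended tokens: cumulative sum minus one, floored at 0.
    position_ids = []
    for row in attention_mask:
        running, ids = 0, []
        for m in row:
            running += m
            ids.append(max(running - 1, 0))
        position_ids.append(ids)

    inputs = {
        "attention_mask": extended_mask,
        "position_ids": position_ids,
        "starts": starts,
    }
    # Pixel values are only included when an image is actually provided.
    if pixel_values is not None:
        inputs["pixel_values"] = pixel_values
    return inputs


# One left-padded sequence (pad id 0) that will grow to five tokens.
inputs = prepare_inputs_sketch([[0, 5, 6]], max_length=5, pad_token_id=0)
```

A real implementation would also allocate the key-value cache at this point, for example through `init_cache`.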
- update_inputs_for_generation(model_outputs, model_kwargs)
Updates model inputs for the next step of generation, removing pixel values after the first step.
- Parameters
model_outputs – Outputs from the previous generation step.
model_kwargs – Current keyword arguments for the model.
- Returns
Updated model keyword arguments.
- Return type
dict
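The update step can be sketched as follows. `update_inputs_sketch` is a hypothetical function, and the `past_key_values` and `position_ids` keys are assumptions chosen for illustration. The one behavior taken from the description above is that pixel values are dropped after the first step, presumably because the image information is already held in the cache by then.

```python
def update_inputs_sketch(model_outputs, model_kwargs):
    """Return the keyword arguments for the next decoding step."""
    updated = dict(model_kwargs)  # copy so the caller's dict is not mutated

    # Assumption: the model returns its updated cache under this key.
    updated["past_key_values"] = model_outputs["past_key_values"]

    # Advance each sequence by one position, keeping only the newest position.
    updated["position_ids"] = [
        [row[-1] + 1] for row in model_kwargs["position_ids"]
    ]

    # Images are only fed on the first step; later steps are text-only.
    updated.pop("pixel_values", None)
    return updated


first_step_kwargs = {"position_ids": [[0, 1, 2]], "pixel_values": "image-batch"}
next_kwargs = update_inputs_sketch(
    {"past_key_values": "cache-after-step-1"}, first_step_kwargs
)
```

Calling the function again on `next_kwargs` is safe, since the `pop` with a default does nothing once `pixel_values` is gone.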