easydel.modules.aya_vision.__init__

class easydel.modules.aya_vision.__init__.AyaVisionConfig(vision_config=None, text_config=None, vision_feature_select_strategy='full', vision_feature_layer=-1, downsample_factor=2, adapter_layer_norm_eps=1e-06, image_token_index=255036, **kwargs)

Bases: EasyDeLBaseConfig

This class stores the configuration of an [AyaVisionForConditionalGeneration]. It is used to instantiate an AyaVision model according to the specified arguments, defining the model architecture. Instantiating a configuration with the defaults yields a configuration similar to that of AyaVision, e.g. [CohereForAI/aya-vision-8b](https://huggingface.co/CohereForAI/aya-vision-8b).

Configuration objects inherit from [PretrainedConfig] and can be used to control the model outputs. Read the documentation from [PretrainedConfig] for more information.

Parameters
  • vision_config (Union[AutoConfig, dict], optional, defaults to CLIPVisionConfig) – The config object or dictionary of the vision backbone.

  • text_config (Union[AutoConfig, dict], optional, defaults to LlamaConfig) – The config object or dictionary of the text backbone.

  • vision_feature_select_strategy (str, optional, defaults to “full”) – The feature selection strategy used to select the vision feature from the vision backbone. Can be one of “default” or “full”. If “default”, the CLS token is removed from the vision features. If “full”, the full vision features are used.

  • vision_feature_layer (int, optional, defaults to -1) – The index of the layer to select the vision feature.

  • downsample_factor (int, optional, defaults to 2) – The downsample factor to apply to the vision features.

  • adapter_layer_norm_eps (float, optional, defaults to 1e-06) – The epsilon value used for layer normalization in the adapter.

  • image_token_index (int, optional, defaults to 255036) – The image token index to encode the image prompt.
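The two values of vision_feature_select_strategy can be illustrated with a minimal sketch. It assumes the CLS token sits at index 0 of the token axis, which this page does not state. select_vision_features is a hypothetical helper, not part of EasyDeL, and plain Python lists stand in for arrays.

```python
def select_vision_features(hidden_states, strategy="full"):
    """Pick the vision tokens to forward to the adapter.

    hidden_states: list of per-token feature vectors for one image.
    ASSUMPTION: the CLS token, if present, is the first entry.
    """
    if strategy == "default":
        # Drop the CLS token and keep only the patch tokens.
        return hidden_states[1:]
    if strategy == "full":
        # Keep every token, CLS included.
        return hidden_states
    raise ValueError(f"Unexpected select strategy: {strategy!r}")


# One CLS vector followed by three patch vectors (toy values).
tokens = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0], [3.0, 3.0]]
full = select_vision_features(tokens, "full")
patches_only = select_vision_features(tokens, "default")
```

With the toy input, "full" keeps all four vectors while "default" keeps only the three patch vectors.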

get_partition_rules(*args, **kwargs)

Retrieves the combined partition rules from the text and vision configurations.

Parameters
  • *args – Positional arguments passed to the underlying config partition rule methods.

  • **kwargs – Keyword arguments passed to the underlying config partition rule methods.

Returns

Combined partition rules from both text and vision models.

Return type

Tuple
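
As a rough illustration of what "combined" could mean here, the sketch below concatenates two rule tuples. The ToyConfig class, the rule patterns, the axis names, and the concatenation order are all assumptions made for the example; the actual EasyDeL implementation may merge rules differently.

```python
class ToyConfig:
    """Hypothetical stand-in for a sub-config exposing partition rules."""

    def __init__(self, rules):
        self._rules = tuple(rules)

    def get_partition_rules(self, *args, **kwargs):
        return self._rules


def combined_partition_rules(text_config, vision_config, *args, **kwargs):
    # Forward the same arguments to both sub-configs and join the results.
    text_rules = text_config.get_partition_rules(*args, **kwargs)
    vision_rules = vision_config.get_partition_rules(*args, **kwargs)
    return text_rules + vision_rules


# Each rule pairs a parameter-name regex with a sharding spec (toy values).
text = ToyConfig([(r"embed_tokens/embedding", ("tp", "fsdp"))])
vision = ToyConfig([(r"patch_embedding/kernel", (None, "tp"))])
rules = combined_partition_rules(text, vision)
```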

model_type: str = 'aya_vision'

sub_configs: dict[str, 'PretrainedConfig'] = {'text_config': <class 'easydel.modules.auto.auto_configuration.AutoEasyDeLConfig'>, 'vision_config': <class 'easydel.modules.auto.auto_configuration.AutoEasyDeLConfig'>}

class easydel.modules.aya_vision.__init__.AyaVisionForConditionalGeneration(*args: Any, **kwargs: Any)

Bases: EasyDeLBaseModule

AyaVision model for conditional text generation based on image inputs. Combines a vision tower and a language model with a multi-modal projector.

config

Configuration object.

Type

AyaVisionConfig

dtype

Data type for computation.

Type

jnp.dtype

param_dtype

Data type for parameters.

Type

jnp.dtype

precision

JAX precision level.

Type

jax.lax.PrecisionLike

rngs

Random number generators.

Type

nn.Rngs

get_image_features(pixel_values: Union[Array, ndarray, bool, number]) → Union[Array, ndarray, bool, number]

Extracts and projects image features from the vision tower.

Parameters

pixel_values (chex.Array) – Input pixel values for the images.

Returns

Processed image features ready for the language model.

Return type

chex.Array
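
One common way such projected features are consumed, used by LLaVA-style models, is to write them into the text embedding sequence at the positions holding image_token_index. The sketch below shows that idea with plain Python lists. merge_image_features is a hypothetical helper, and whether EasyDeL performs the merge exactly this way is an assumption.

```python
IMAGE_TOKEN_INDEX = 255036  # default image_token_index from AyaVisionConfig


def merge_image_features(input_ids, text_embeds, image_features):
    """Replace embeddings at image-token positions with image features.

    input_ids: token IDs for one sequence.
    text_embeds: one embedding vector per token.
    image_features: one vector per image placeholder token, in order.
    """
    num_slots = sum(1 for tok in input_ids if tok == IMAGE_TOKEN_INDEX)
    if num_slots != len(image_features):
        raise ValueError("Image token count does not match feature count.")

    features = iter(image_features)
    merged = []
    for tok, embed in zip(input_ids, text_embeds):
        # Image placeholders take the next image feature; text is kept as-is.
        merged.append(next(features) if tok == IMAGE_TOKEN_INDEX else embed)
    return merged


ids = [5, IMAGE_TOKEN_INDEX, IMAGE_TOKEN_INDEX, 9]
embeds = [[0.1], [0.0], [0.0], [0.4]]
merged = merge_image_features(ids, embeds, [[7.0], [8.0]])
```

The explicit count check reflects a practical constraint of this pattern: the prompt must contain exactly as many image placeholder tokens as there are image feature vectors.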

init_cache(batch_size, max_length, starts=None, shardings=None, pad_token_id=None)

Initializes and returns a standard (non-paged) Key-Value cache.

This method first creates the necessary metadata using create_cache_metadata and then calls TransformerCache.init_cache to allocate and initialize the cache tensors based on the model’s configuration, dtype, sharding, quantization settings, and provided batch size and maximum length.

Parameters
  • batch_size (int) – The batch size for the cache.

  • max_length (int) – The maximum sequence length the cache needs to support.

  • starts (int | None) – Optional starting positions for the cache sequences, which set the initial cache state when provided. Defaults to None (usually treated as 0).

  • shardings (dict | None) – Optional dictionary specifying sharding configurations. Note: this argument appears to be unused in the current implementation.

  • pad_token_id (int | None) – The ID of the padding token. If None, it’s inferred.

Returns

An initialized standard TransformerCache object.

Return type

TransformerCache
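
To make the roles of batch_size, max_length, and starts concrete, here is a toy preallocated key-value cache. ToyKVCache is hypothetical and much simpler than TransformerCache: it ignores dtype, sharding, and quantization, and it treats starts as a single integer offset applied to every sequence, which is an assumption.

```python
class ToyKVCache:
    """Hypothetical fixed-size cache for one attention layer."""

    def __init__(self, batch_size, max_length, starts=None):
        self.batch_size = batch_size
        self.max_length = max_length
        # Preallocate one slot per (sequence, position); None marks "empty".
        self.keys = [[None] * max_length for _ in range(batch_size)]
        self.values = [[None] * max_length for _ in range(batch_size)]
        # The next write position begins at the start offset (0 if unset).
        self.index = 0 if starts is None else starts

    def append(self, new_keys, new_values):
        """Write one new key/value per sequence at the current position."""
        if self.index >= self.max_length:
            raise IndexError("Cache is full.")
        for b in range(self.batch_size):
            self.keys[b][self.index] = new_keys[b]
            self.values[b][self.index] = new_values[b]
        self.index += 1


cache = ToyKVCache(batch_size=2, max_length=4)
cache.append(["k0_a", "k0_b"], ["v0_a", "v0_b"])
```

Allocating the full max_length up front keeps the cache shape fixed across decoding steps, which is the usual reason for this layout in compiled JAX code.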

loss_type = 'ForCausalLM'

prepare_inputs_for_generation(input_ids: Union[Array, ndarray, bool, number], max_length: int, pad_token_id: int, starts: int | None = None, pixel_values: Optional[Union[Array, ndarray, bool, number]] = None, attention_mask: Optional[Union[Array, ndarray, bool, number]] = None)

Prepares inputs for text generation, including pixel values if provided.

Parameters
  • input_ids (chex.Array) – Initial input token IDs.

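  • max_length (int) – Maximum generation length.

  • pad_token_id (int) – The ID of the padding token.

  • starts (int | None) – Optional starting positions for the sequences. Defaults to None.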

  • pixel_values (Optional[chex.Array]) – Pixel values for image input.

  • attention_mask (Optional[chex.Array]) – Attention mask.

Returns

Model inputs ready for generation.

Return type

dict
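
A minimal sketch of what such a preparation step typically produces in JAX generation loops: an attention mask padded out to max_length, position IDs derived from it, and the optional pixel_values passed through. The helper name, the returned keys, and the omission of the cache object are assumptions for illustration, not the EasyDeL implementation.

```python
def toy_prepare_inputs(input_ids, max_length, attention_mask=None, pixel_values=None):
    """Build generation inputs for a batch of token-ID lists (toy version)."""
    seq_len = len(input_ids[0])
    if attention_mask is None:
        attention_mask = [[1] * seq_len for _ in input_ids]

    # Extend each mask to max_length so its shape stays fixed while decoding;
    # positions not yet generated are marked as attendable ahead of time.
    extended_mask = [row + [1] * (max_length - seq_len) for row in attention_mask]

    # Position IDs count attended tokens seen so far, starting at 0;
    # leading padded positions are clamped to 0.
    position_ids = []
    for row in attention_mask:
        running, positions = 0, []
        for bit in row:
            running += bit
            positions.append(max(running - 1, 0))
        position_ids.append(positions)

    inputs = {"attention_mask": extended_mask, "position_ids": position_ids}
    if pixel_values is not None:
        inputs["pixel_values"] = pixel_values
    return inputs


prepared = toy_prepare_inputs(
    input_ids=[[11, 12, 13]],
    max_length=5,
    attention_mask=[[0, 1, 1]],
    pixel_values="<image batch placeholder>",
)
```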

update_inputs_for_generation(model_outputs, model_kwargs)

Updates model inputs for the next step of generation, removing pixel values after the first step.

Parameters
  • model_outputs – Outputs from the previous generation step.

  • model_kwargs – Current keyword arguments for the model.

Returns

Updated model keyword arguments.

Return type

dict
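
The behaviour described above, dropping pixel_values once the first step has consumed the image, can be sketched as follows. The dictionary keys past_key_values and position_ids, and the attribute name on model_outputs, are assumptions chosen for the example.

```python
from types import SimpleNamespace


def toy_update_inputs(model_outputs, model_kwargs):
    """Return kwargs for the next decoding step (toy version)."""
    updated = dict(model_kwargs)  # copy so the caller's dict is untouched
    # Carry the cache produced by the previous step forward.
    updated["past_key_values"] = model_outputs.past_key_values
    # Advance to the next position: keep only the last one, plus one.
    updated["position_ids"] = [[row[-1] + 1] for row in model_kwargs["position_ids"]]
    # The image was already encoded in the first step, so drop it.
    updated.pop("pixel_values", None)
    return updated


outputs = SimpleNamespace(past_key_values="cache-after-step-1")
kwargs = {"position_ids": [[0, 1, 2]], "pixel_values": "<image batch placeholder>"}
next_kwargs = toy_update_inputs(outputs, kwargs)
```

Removing the image after the first step avoids re-running the vision tower on every decoded token, since the image information is already held in the cache.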