easydel.infra.base_config


class easydel.infra.base_config.EasyDeLBaseConfig(axis_dims: ~typing.Sequence[int] = (1, -1, 1, 1), dcn_axis_dims: ~typing.Optional[~typing.Sequence[int]] = None, axis_names: ~typing.Sequence[str] = ('dp', 'fsdp', 'tp', 'sp'), attn_mechanism: ~typing.Literal['vanilla', 'flash_attn2', 'splash', 'ring', 'cudnn', 'blockwise', 'sdpa', 'autoregressive_decodeattn'] = 'vanilla', decode_attn_mechanism: ~typing.Literal['vanilla', 'flash_attn2', 'splash', 'ring', 'cudnn', 'blockwise', 'sdpa', 'autoregressive_decodeattn'] = None, blocksize_k: int = 128, blocksize_q: int = 128, blocksize_b: int = 1, partition_axis: ~eformer.escale.partition.manager.PartitionAxis = PartitionAxis(data_parallel_axis='dp', fully_sharded_data_parallel_axis='fsdp', tensor_parallel_axis='tp', sequence_parallel_axis='sp', expert_parallel_axis='ep', batch_axis=('fsdp', 'dp'), sequence_axis='sp', query_sequence_axis='sp', head_axis='tp', kv_head_axis=None, key_sequence_axis='sp', hidden_state_axis='tp', mlp_intermediate_axis='tp', vocab_axis='tp', expert_axis='ep', expert_gate_axis=None, attention_dim_axis=None, attention_kv_dim_axis=None, bias_head_sequence_axis=None, bias_key_sequence_axis=None, decode_batch_axis=('fsdp', 'dp'), decode_query_sequence_axis=None, decode_head_axis='tp', decode_kv_head_axis=None, decode_key_sequence_axis='sp', decode_attention_dim_axis=None, decode_attention_kv_dim_axis=None), shard_attention_computation: bool = True, use_sharded_kv_caching: bool = False, use_sharding_constraint: bool = False, backend: ~typing.Optional[~easydel.infra.etils.EasyDeLBackends] = None, platform: ~typing.Optional[~easydel.infra.etils.EasyDeLPlatforms] = None, easy_method: ~typing.Literal['train', 'serve', 'convert'] = 'train', bits: ~typing.Optional[int] = None, scan_ring_attention: bool = True, scan_attention_layers: bool = False, use_scan_mlp: bool = False, scan_mlp_chunk_size: int = 1024, sequence_axis_name: str = 'sp', gradient_checkpointing: 
~easydel.infra.etils.EasyDeLGradientCheckPointers = EasyDeLGradientCheckPointers.NONE, precompute_masks: bool = True, kv_cache_quantization_method: ~easydel.infra.etils.EasyDeLQuantizationMethods = EasyDeLQuantizationMethods.NONE, kv_cache_quantization_blocksize: int = 64, quantization_method: ~easydel.infra.etils.EasyDeLQuantizationMethods = EasyDeLQuantizationMethods.NONE, quantization_pattern: str = '.*', quantization_blocksize: int = 64, kv_cache_sharding_sequence_axis_name: ~typing.Union[str, ~typing.Tuple[str, ...]] = 'sp', flash_attention_backward_pass_impl: ~typing.Literal['triton', 'xla'] = 'triton', attn_dtype: ~numpy.dtype = <class 'jax.numpy.float32'>, attn_softmax_dtype: ~numpy.dtype = <class 'jax.numpy.float32'>, fcm_max_ratio: float = 0.0, fcm_min_ratio: float = 0.0, hardware_abstraction: bool = False, pallas_m_block_size: int = 128, pallas_k_block_size: int = 128, pallas_n_block_size: int = 128, **kwargs)[source]#

Bases: PretrainedConfig

Initialize the configuration for EasyDeL.

Parameters

axis_dims (tp.Sequence[int]) – Dimensions of the axes. Default is (1, -1, 1, 1).
axis_names (tp.Sequence[str]) – Names of the axes. Default is (“dp”, “fsdp”, “tp”, “sp”).
attn_mechanism (AVAILABLE_ATTENTION_MECHANISMS) – Attention mechanism to use. Default is DEFAULT_ATTENTION_MECHANISM.
decode_attn_mechanism (AVAILABLE_ATTENTION_MECHANISMS) – Attention mechanism to use for decode phase. Default is None.
blocksize_k (int) – Block size for key. Default is 128.
blocksize_q (int) – Block size for query. Default is 128.
blocksize_b (int) – Block size for batch. Default is 1.
partition_axis (PartitionAxis) – Partition axis configuration. Default is PartitionAxis().
shard_attention_computation (bool) – Whether to shard attention computation. Default is True.
use_sharded_kv_caching (bool) – Whether to use sharded key-value caching. Default is False.
use_sharding_constraint (bool) – Whether to use sharding constraint. Default is False.
backend (tp.Optional[EasyDeLBackends]) – Backend to use. Default is None.
platform (tp.Optional[EasyDeLPlatforms]) – Platform to use. Default is None.
easy_method (tp.Literal["train", "serve", "convert"]) – Method to use. Default is EasyMethod.TRAIN.
bits (tp.Optional[int]) – Number of bits for quantization. Default is None.
scan_ring_attention (bool) – Whether to scan ring attention. Default is True.
scan_attention_layers (bool) – Whether to scan attention layers. Default is False.
use_scan_mlp (bool) – Whether to use scan MLP. Default is False.
scan_mlp_chunk_size (int) – Chunk size for scan MLP. Default is 1024.
sequence_axis_name (str) – Name of the attention axis. Default is “sp”.
gradient_checkpointing (EasyDeLGradientCheckPointers) – Gradient checkpointing method. Default is EasyDeLGradientCheckPointers.NONE.
kv_cache_quantization_method (EasyDeLQuantizationMethods) – Key-value cache quantization method. Default is EasyDeLQuantizationMethods.NONE.
kv_cache_quantization_blocksize (int) – Block size for key-value cache quantization. Default is 64.
quantization_method (EasyDeLQuantizationMethods) – Quantization method. Default is EasyDeLQuantizationMethods.NONE.
quantization_pattern (str) – Pattern for quantization. Default is “.*”.
quantization_blocksize (int) – Block size for quantization. Default is 64.
kv_cache_sharding_sequence_axis_name (tp.Union[str, tp.Tuple[str, ...]]) – Name of the key-value cache sharding sequence axis. Default is “sp”.
flash_attention_backward_pass_impl (tp.Literal["triton", "xla"]) – Implementation for flash attention backward pass. Default is “triton”.
attn_dtype (jnp.dtype) – Data type for attention. Default is jnp.float32.
attn_softmax_dtype (jnp.dtype) – Data type for softmax ops in attention. Default is jnp.float32.
fcm_max_ratio (float) – Maximum ratio for FCM. Default is 0.0.
fcm_min_ratio (float) – Minimum ratio for FCM. Default is 0.0.
hardware_abstraction (bool) – Whether to use hardware abstraction. Default is DEFAULT_HARDWARE_ABSTRACTION.
pallas_m_block_size (int) – Block size for Pallas M. Default is DEFAULT_PALLAS_M_BLOCK_SIZE.
pallas_k_block_size (int) – Block size for Pallas K. Default is DEFAULT_PALLAS_K_BLOCK_SIZE.
pallas_n_block_size (int) – Block size for Pallas N. Default is DEFAULT_PALLAS_N_BLOCK_SIZE.
**kwargs – Additional keyword arguments.
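
The three Pallas block-size parameters name tile sizes along the m, k, and n dimensions of a matrix product A(m×k) @ B(k×n) = C(m×n). The pure-Python sketch below only illustrates what tiling along those dimensions means. The function name `blocked_matmul` and the tiny block sizes are hypothetical, and this is not how EasyDeL's Pallas kernels are implemented.

```python
def blocked_matmul(a, b, block_m=2, block_k=2, block_n=2):
    """Multiply a (m x k) by b (k x n) one tile at a time.

    Illustrative only: real Pallas kernels run tiles on accelerator hardware.
    """
    m, k, n = len(a), len(b), len(b[0])
    assert all(len(row) == k for row in a), "inner dimensions must match"
    c = [[0.0] * n for _ in range(m)]
    for i0 in range(0, m, block_m):          # tiles along m
        for j0 in range(0, n, block_n):      # tiles along n
            for p0 in range(0, k, block_k):  # tiles along k, accumulated into c
                for i in range(i0, min(i0 + block_m, m)):
                    for j in range(j0, min(j0 + block_n, n)):
                        acc = 0.0
                        for p in range(p0, min(p0 + block_k, k)):
                            acc += a[i][p] * b[p][j]
                        c[i][j] += acc
    return c


a = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
b = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
result = blocked_matmul(a, b)
```

Larger block sizes mean fewer, bigger tiles; the result is the same for any block size, only the work partitioning changes.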

Raises

Warning – If kv_cache_quantization_method is not NONE and use_sharded_kv_caching is True.
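
A minimal sketch of the documented warning condition, using only the standard library. The helper name `check_kv_cache_options` and the placeholder strings "none" and "8bit" are assumptions made for illustration; the real configuration uses EasyDeLQuantizationMethods members, and the actual warning text may differ.

```python
import warnings


def check_kv_cache_options(kv_cache_quantization_method="none", use_sharded_kv_caching=False):
    """Warn when KV-cache quantization is combined with sharded KV caching.

    Hypothetical helper; strings stand in for EasyDeLQuantizationMethods members.
    """
    if kv_cache_quantization_method != "none" and use_sharded_kv_caching:
        warnings.warn(
            "kv_cache_quantization_method is set while use_sharded_kv_caching is True",
            UserWarning,
            stacklevel=2,
        )
        return False
    return True


# Capture the warning instead of printing it, so the outcome can be inspected.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    ok = check_kv_cache_options("8bit", use_sharded_kv_caching=True)
```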

add_basic_configurations(axis_dims: ~typing.Sequence[int] = <eformer.common_types._Empty object>, dcn_axis_dims: ~typing.Optional[~typing.Sequence[int]] = <eformer.common_types._Empty object>, axis_names: ~typing.Sequence[str] = <eformer.common_types._Empty object>, attn_mechanism: ~typing.Literal['vanilla', 'flash_attn2', 'splash', 'ring', 'cudnn', 'blockwise', 'sdpa', 'autoregressive_decodeattn'] = <eformer.common_types._Empty object>, decode_attn_mechanism: ~typing.Literal['vanilla', 'flash_attn2', 'splash', 'ring', 'cudnn', 'blockwise', 'sdpa', 'autoregressive_decodeattn'] = <eformer.common_types._Empty object>, blocksize_k: int = <eformer.common_types._Empty object>, blocksize_q: int = <eformer.common_types._Empty object>, blocksize_b: int = <eformer.common_types._Empty object>, partition_axis: ~eformer.escale.partition.manager.PartitionAxis = <eformer.common_types._Empty object>, shard_attention_computation: bool = <eformer.common_types._Empty object>, use_sharded_kv_caching: bool = <eformer.common_types._Empty object>, backend: ~typing.Optional[~easydel.infra.etils.EasyDeLBackends] = <eformer.common_types._Empty object>, platform: ~typing.Optional[~easydel.infra.etils.EasyDeLPlatforms] = <eformer.common_types._Empty object>, easy_method: ~typing.Literal['train', 'serve', 'convert'] = <eformer.common_types._Empty object>, bits: ~typing.Optional[int] = <eformer.common_types._Empty object>, scan_ring_attention: bool = <eformer.common_types._Empty object>, scan_attention_layers: bool = <eformer.common_types._Empty object>, use_sharding_constraint: bool = <eformer.common_types._Empty object>, use_scan_mlp: bool = <eformer.common_types._Empty object>, scan_mlp_chunk_size: int = <eformer.common_types._Empty object>, sequence_axis_name: str = <eformer.common_types._Empty object>, gradient_checkpointing: ~easydel.infra.etils.EasyDeLGradientCheckPointers = <eformer.common_types._Empty object>, precompute_masks: bool = <eformer.common_types._Empty object>, 
kv_cache_quantization_method: ~easydel.infra.etils.EasyDeLQuantizationMethods = <eformer.common_types._Empty object>, kv_cache_quantization_blocksize: int = <eformer.common_types._Empty object>, quantization_method: ~easydel.infra.etils.EasyDeLQuantizationMethods = <eformer.common_types._Empty object>, quantization_blocksize: int = <eformer.common_types._Empty object>, quantization_pattern: str = <eformer.common_types._Empty object>, kv_cache_sharding_sequence_axis_name: ~typing.Union[str, ~typing.Tuple[str, ...]] = <eformer.common_types._Empty object>, flash_attention_backward_pass_impl: ~typing.Literal['triton', 'xla'] = <eformer.common_types._Empty object>, attn_dtype: ~numpy.dtype = <eformer.common_types._Empty object>, attn_softmax_dtype: ~numpy.dtype = <eformer.common_types._Empty object>, hardware_abstraction: bool = <eformer.common_types._Empty object>, pallas_m_block_size: int = <eformer.common_types._Empty object>, pallas_k_block_size: int = <eformer.common_types._Empty object>, pallas_n_block_size: int = <eformer.common_types._Empty object>, **kwargs)[source]#

Initializes the basic configuration attributes of the object. It is called when a new instance of the class is created.

Parameters

axis_dims (tp.Sequence[int], optional) – Specify the number of dimensions for each axis. Defaults to (1, -1, 1, 1).
axis_names (tp.Sequence[str], optional) – Set the names of the axes. Defaults to (“dp”, “fsdp”, “tp”, “sp”).
attn_mechanism (AVAILABLE_ATTENTION_MECHANISMS, optional) – Attention mechanism to use. Defaults to DEFAULT_ATTENTION_MECHANISM.
decode_attn_mechanism (AVAILABLE_ATTENTION_MECHANISMS, optional) – Attention mechanism to use for decode phase. Defaults to None.
blocksize_k (int, optional) – block size of key_states. Defaults to 128.
blocksize_q (int, optional) – block size of query_states. Defaults to 128.
blocksize_b (int, optional) – block size of bias. Defaults to 1.
partition_axis (PartitionAxis, optional) – PartitionAxis configuration used for partitioning arrays in EasyDeL. Defaults to PartitionAxis().
shard_attention_computation (bool, optional) – Whether to use shard_map for attention. Defaults to True.
use_sharded_kv_caching (bool, optional) – Whether to use shard_map and sharding for key and value. Defaults to False.
backend (tp.Optional[EasyDeLBackends], optional) – Specify the backend to use. Defaults to None.
platform (tp.Optional[EasyDeLPlatforms], optional) – Specify the platform to use. Defaults to None.
easy_method (tp.Literal["train", "serve", "convert"], optional) – EasyDeL method the configuration is applied for. Defaults to EasyMethod.TRAIN.
bits (tp.Optional[int], optional) – Model bits for quantization. Defaults to None.
scan_ring_attention (bool, optional) – Whether to use scan for ring attention. Defaults to True.
scan_attention_layers (bool, optional) – Whether to use scan for attention layers. Defaults to False.
use_sharding_constraint (bool, optional) – whether to use sharding constraint for the arrays. Defaults to False.
use_scan_mlp (bool, optional) – Determine whether to use scan_mlp or not. Defaults to False.
scan_mlp_chunk_size (int, optional) – Size of chunks in scan MLP. Defaults to 1024.
sequence_axis_name (str, optional) – Name of the attention axis. Defaults to “sp”.
gradient_checkpointing (EasyDeLGradientCheckPointers, optional) – Gradient checkpointing method for the created or loaded module (applied to MLP and attention layers most of the time). Defaults to EasyDeLGradientCheckPointers.NONE.
kv_cache_quantization_method (EasyDeLQuantizationMethods, optional) – key and value quantization type. Defaults to EasyDeLQuantizationMethods.NONE.
kv_cache_quantization_blocksize (int, optional) – size of kv cache quantization. Defaults to 64.
quantization_method (EasyDeLQuantizationMethods, optional) – linear modules quantization type. Defaults to EasyDeLQuantizationMethods.NONE.
quantization_blocksize (int, optional) – size of linear quantization. Defaults to 64.
quantization_pattern (str, optional) – Regular-expression pattern selecting the layers to quantize. Defaults to “.*”.
kv_cache_sharding_sequence_axis_name (tp.Union[str, tp.Tuple[str, ...]], optional) – axis name to target for sharding sequences. Defaults to “sp”.
flash_attention_backward_pass_impl (tp.Literal["triton", "xla"], optional) – Specify the backward pass kernel for flash attention. Defaults to “triton”.
attn_dtype (jnp.dtype, optional) – Data type for attention computations. Defaults to jnp.float32.
attn_softmax_dtype (jnp.dtype, optional) – Data type for softmax in attention op computations. Defaults to jnp.float32.
fcm_max_ratio (float, optional) – Maximum ratio for flash cross attention. Defaults to 0.0.
fcm_min_ratio (float, optional) – Minimum ratio for flash cross attention. Defaults to 0.0.
hardware_abstraction (bool, optional) – Whether to switch to custom Pallas kernels instead of JAX. Defaults to DEFAULT_HARDWARE_ABSTRACTION.
pallas_m_block_size (int, optional) – Block size of the m dimension in matmul for the Pallas kernel A(mk)@B(kn)=C(mn). Defaults to DEFAULT_PALLAS_M_BLOCK_SIZE.
pallas_k_block_size (int, optional) – Block size of the k dimension in matmul for the Pallas kernel A(mk)@B(kn)=C(mn). Defaults to DEFAULT_PALLAS_K_BLOCK_SIZE.
pallas_n_block_size (int, optional) – Block size of the n dimension in matmul for the Pallas kernel A(mk)@B(kn)=C(mn). Defaults to DEFAULT_PALLAS_N_BLOCK_SIZE.
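
The signature above uses an `_Empty` sentinel object as the default for every argument, which lets a method tell "argument not supplied" apart from an explicit `None`. The sketch below shows the general sentinel pattern in plain Python. The names `NOT_GIVEN`, `set_attr_if_given`, and `Config` are hypothetical, and the update policy shown is an assumption, not a description of what `set_attrs_smartly` or eformer actually does.

```python
class _Empty:
    """Sentinel type meaning 'argument was not supplied'."""


NOT_GIVEN = _Empty()


def set_attr_if_given(obj, attr_name, default, new_value=NOT_GIVEN):
    """Hypothetical stand-in illustrating a sentinel-aware attribute update."""
    if not isinstance(new_value, _Empty):
        setattr(obj, attr_name, new_value)  # an explicit value always wins
    elif not hasattr(obj, attr_name):
        setattr(obj, attr_name, default)    # first-time initialization
    # otherwise keep whatever was already stored


class Config:
    pass


cfg = Config()
set_attr_if_given(cfg, "blocksize_q", 128)       # default applied
set_attr_if_given(cfg, "blocksize_q", 128, 256)  # explicit override
set_attr_if_given(cfg, "blocksize_q", 128)       # stored 256 is preserved
set_attr_if_given(cfg, "bits", 8, None)          # None counts as a real value
```

With a plain `None` default, the last call could not be distinguished from "bits was not passed", which is the reason to use a dedicated sentinel.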

attach_custom_arguments(**kwargs)[source]#

static create_mesh(axis_dims: Sequence[int] = (1, -1, 1, 1), axis_names: Sequence[str] = ('dp', 'fsdp', 'tp', 'sp'), dcn_axis_dims: Optional[Sequence[int]] = None, process_is_granule: bool = False, should_sort_granules_by_key: bool = True, allow_split_physical_axes: bool = True, backend: Optional[str] = None)[source]#

The create_mesh function creates a mesh object that can be used to shard arrays.

Returns: A mesh object
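
The default `axis_dims` of (1, -1, 1, 1) contains a -1 entry. A plausible reading, by analogy with array reshaping, is that the -1 axis absorbs all devices not claimed by the other axes. The sketch below implements that interpretation in plain Python; `resolve_axis_dims` is a hypothetical helper and the -1 semantics are an assumption, so check the EasyDeL source before relying on it.

```python
import math


def resolve_axis_dims(axis_dims, num_devices):
    """Replace a single -1 entry so the product of dims equals num_devices."""
    dims = list(axis_dims)
    if dims.count(-1) > 1:
        raise ValueError("at most one axis may be -1")
    known = math.prod(d for d in dims if d != -1)
    if -1 in dims:
        if num_devices % known != 0:
            raise ValueError(f"{num_devices} devices not divisible by {known}")
        dims[dims.index(-1)] = num_devices // known
    elif known != num_devices:
        raise ValueError(f"axis_dims product {known} != {num_devices} devices")
    return tuple(dims)


axis_names = ("dp", "fsdp", "tp", "sp")
shape = resolve_axis_dims((1, -1, 1, 1), num_devices=8)
mesh_shape = dict(zip(axis_names, shape))  # axis name -> number of devices
```

Under this reading, eight devices with the default settings are all assigned to the fsdp axis.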

classmethod from_pretrained(pretrained_model_name_or_path: Union[str, PathLike], cache_dir: Optional[Union[str, PathLike]] = None, force_download: bool = False, local_files_only: bool = False, token: Optional[Union[str, bool]] = None, revision: str = 'main', **kwargs) → PretrainedConfig[source]#

Instantiate a [PretrainedConfig] (or a derived class) from a pretrained model configuration.

Parameters

pretrained_model_name_or_path (str or os.PathLike) –
This can be either:
- a string, the model id of a pretrained model configuration hosted inside a model repo on huggingface.co.
- a path to a directory containing a configuration file saved using the PretrainedConfig.save_pretrained method, e.g., ./my_model_directory/.
- a path or url to a saved configuration JSON file, e.g., ./my_model_directory/configuration.json.
cache_dir (str or os.PathLike, optional) – Path to a directory in which a downloaded pretrained model configuration should be cached if the standard cache should not be used.
force_download (bool, optional, defaults to False) – Whether or not to force to (re-)download the configuration files and override the cached versions if they exist.
resume_download – Deprecated and ignored. All downloads are now resumed by default when possible. Will be removed in v5 of Transformers.
proxies (Dict[str, str], optional) – A dictionary of proxy servers to use by protocol or endpoint, e.g., {‘http’: ‘foo.bar:3128’, ‘http://hostname’: ‘foo.bar:4012’}. The proxies are used on each request.
token (str or bool, optional) – The token to use as HTTP bearer authorization for remote files. If True, or not specified, will use the token generated when running huggingface-cli login (stored in ~/.huggingface).
revision (str, optional, defaults to “main”) –
The specific model version to use. It can be a branch name, a tag name, or a commit id, since we use a git-based system for storing models and other artifacts on huggingface.co, so revision can be any identifier allowed by git.

Tip: To test a pull request you made on the Hub, you can pass revision="refs/pr/<pr_number>".
return_unused_kwargs (bool, optional, defaults to False) –
If False, then this function returns just the final configuration object.

If True, then this function returns a tuple (config, unused_kwargs), where unused_kwargs is a dictionary consisting of the key/value pairs whose keys are not configuration attributes: i.e., the part of kwargs which has not been used to update config and is otherwise ignored.
subfolder (str, optional, defaults to “”) – In case the relevant files are located inside a subfolder of the model repo on huggingface.co, you can specify the folder name here.
kwargs (Dict[str, tp.Any], optional) – The values in kwargs of any keys which are configuration attributes will be used to override the loaded values. Behavior concerning key/value pairs whose keys are not configuration attributes is controlled by the return_unused_kwargs keyword parameter.

Returns

The configuration object instantiated from this pretrained model.

Return type

[PretrainedConfig]

Examples:

>>> from transformers import BertConfig
>>> # The base class PretrainedConfig can't be instantiated directly, so the examples use a
>>> # derived class: BertConfig
>>> config = BertConfig.from_pretrained(
...     "google-bert/bert-base-uncased"
... )  # Download configuration from huggingface.co and cache.
>>> config = BertConfig.from_pretrained(
...     "./test/saved_model/"
... )  # E.g. config (or model) was saved using save_pretrained('./test/saved_model/')
>>> config = BertConfig.from_pretrained("./test/saved_model/my_configuration.json")
>>> config = BertConfig.from_pretrained(
...     "google-bert/bert-base-uncased", output_attentions=True, foo=False
... )
>>> assert config.output_attentions == True
>>> config, unused_kwargs = BertConfig.from_pretrained(
...     "google-bert/bert-base-uncased",
...     output_attentions=True,
...     foo=False,
...     return_unused_kwargs=True,
... )
>>> assert config.output_attentions == True
>>> assert unused_kwargs == {"foo": False}

get_axis_dims() → Sequence[int][source]#

The get_axis_dims function returns a sequence of integers representing the dimensions of each axis.

Parameters: self – Represent the instance of the class
Returns: The dimensions of the axes

get_axis_names() → Sequence[str][source]#

The get_axis_names function returns a list of the names of the axes.

Parameters: self – Represent the instance of the class
Returns: A list of the names of all axes

get_backend() → str[source]#

The get_backend function returns the backend that is currently being used. If no backend has been set, it will return the default JAX backend.

Parameters: self – Bind the method to an object
Returns: The backend platform

get_basic_causal_mask(*args, **kwargs)[source]#

get_basic_frequencies(head_size: Optional[int] = None, rotary_dim: Optional[int] = None, base: Optional[float] = None) → Any[source]#

Get basic frequencies for rotary embeddings.

Parameters

head_size – Size of attention heads (defaults to self.head_dim)
rotary_dim – Dimension for rotary embeddings (defaults to head_size)
base – Base value for frequency computation (defaults to self.rope_theta)

Returns

ModuleCaches instance containing computed frequencies

get_basic_inv_frequencies(head_size: Optional[int] = None, rotary_dim: Optional[int] = None, base: Optional[float] = None, partial_rotary_factor: float = 1.0) → Any[source]#

Get basic inverse frequencies for rotary embeddings.

Parameters

head_size – Size of attention heads (defaults to self.head_dim)
rotary_dim – Dimension for rotary embeddings (defaults to head_size)
base – Base value for frequency computation (defaults to self.rope_theta)

Returns

ModuleCaches instance containing computed frequencies
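
For orientation, the sketch below computes inverse frequencies in the standard rotary-embedding (RoPE) formulation, base ** (-2i / rotary_dim). It assumes EasyDeL follows that standard formula; the function names are hypothetical, the base of 10000.0 is only a common choice (the method itself defaults to self.rope_theta), and the real method returns a ModuleCaches instance rather than a Python list.

```python
def rope_inverse_frequencies(rotary_dim, base=10000.0):
    """Standard RoPE inverse frequencies: base ** (-2i / rotary_dim)."""
    if rotary_dim % 2 != 0:
        raise ValueError("rotary_dim must be even")
    return [base ** (-(2 * i) / rotary_dim) for i in range(rotary_dim // 2)]


def rope_angles(position, inv_freq):
    """Rotation angle for each frequency pair at one token position."""
    return [position * f for f in inv_freq]


inv_freq = rope_inverse_frequencies(rotary_dim=8, base=10000.0)
angles = rope_angles(position=3, inv_freq=inv_freq)
```

Each pair of rotary dimensions gets one frequency, so a rotary dimension of 8 yields four values that decay geometrically from 1.0.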

get_basic_rope(dtype: Union[Array, ndarray, bool, number], head_size: int, rotary_dim: Optional[int] = None, is_neox_style: bool = True, base: Optional[float] = None)[source]#

Get basic rotary position embeddings.

Parameters

dtype – Data type for the embeddings
head_size – Size of attention heads
rotary_dim – Dimension for rotary embeddings (defaults to head_size)
is_neox_style – Whether to use NeoX style embeddings
base – Base value for frequency computation (defaults to self.rope_theta)

Returns

Rotary position embedding function
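
The `is_neox_style` flag conventionally selects how vector elements are paired for rotation: NeoX style pairs element i with element i + d/2, while the alternative (GPT-J style) pairs adjacent elements 2i and 2i + 1. The plain-Python sketch below shows both pairings for a single head vector. It assumes EasyDeL follows this common convention; `apply_rope` is a hypothetical function, not the callable returned by get_basic_rope.

```python
import math


def apply_rope(x, position, base=10000.0, is_neox_style=True):
    """Rotate one head vector x (even length) for a given token position."""
    d = len(x)
    half = d // 2
    inv_freq = [base ** (-(2 * i) / d) for i in range(half)]
    out = [0.0] * d
    for i, f in enumerate(inv_freq):
        # NeoX style pairs element i with i + half; otherwise pair (2i, 2i + 1).
        a, b = (i, i + half) if is_neox_style else (2 * i, 2 * i + 1)
        cos, sin = math.cos(position * f), math.sin(position * f)
        out[a] = x[a] * cos - x[b] * sin
        out[b] = x[a] * sin + x[b] * cos
    return out


vec = [1.0, 2.0, 3.0, 4.0]
neox = apply_rope(vec, position=5, is_neox_style=True)
interleaved = apply_rope(vec, position=5, is_neox_style=False)
```

Both variants are pure rotations, so they preserve the vector norm and leave position 0 unchanged; they differ only in which elements are rotated together.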

get_fcm_mask(batch_size, seq_length, deterministic: bool)[source]#

get_partition_rules(*args, **kwargs)[source]#

Get the partition rules for the model.

Returns: The partition rules.
Return type: tp.Tuple[tp.Tuple[str, PartitionSpec]]
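
The return type is a tuple of (pattern string, PartitionSpec) pairs. A common way to consume such rules is to match each parameter path against the patterns in order and take the first hit. The sketch below illustrates that idea with the standard library only: plain tuples stand in for PartitionSpec, the layer names are made up, and first-match-wins with `re.search` is an assumption rather than confirmed EasyDeL behavior.

```python
import re


def match_partition_rule(rules, param_path):
    """Return the spec of the first rule whose regex matches param_path."""
    for pattern, spec in rules:
        if re.search(pattern, param_path):
            return spec
    raise KeyError(f"no partition rule matches {param_path!r}")


# Plain tuples stand in for PartitionSpec; parameter names are hypothetical.
rules = (
    (r"embed_tokens/embedding", ("tp", ("fsdp", "sp"))),
    (r"self_attn/(q_proj|k_proj|v_proj)/kernel", (("fsdp", "sp"), "tp")),
    (r".*", (None,)),  # catch-all: replicate everything else
)

spec = match_partition_rule(rules, "model/layers/0/self_attn/q_proj/kernel")
fallback = match_partition_rule(rules, "model/norm/kernel")
```

Ordering matters under this scheme: specific patterns go first and a catch-all such as ".*" goes last.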

property granted_freq_max_position_embedding: int#

property granted_mask_max_position_embedding: int#

jax_mesh()[source]#

property mesh#

The mesh property is a helper property that creates a Mesh object from the axis_dims and axis_names attributes of an object, which are assumed to be lists of integers and strings, respectively. The platform attribute is also used if it exists.

Parameters: self – Refer to the object itself
Returns: A JAX Mesh object

property partition_manager: PartitionManager#

read_basics_from_config(config: EasyDeLBaseConfig)[source]#

to_dict() → Dict[str, Any][source]#

Serializes this instance to a Python dictionary.

Returns: Dictionary of all the attributes that make up this configuration instance.
Return type: Dict[str, Any]

class easydel.infra.base_config.EasyDeLBaseConfigDict[source]#: Bases: TypedDict

class easydel.infra.base_config.EasyMethod(TRAIN: 'str' = 'train', SERVE: 'str' = 'serve', EVAL: 'str' = 'serve', CONVERT: 'str' = 'convert')[source]#

Bases: object

CONVERT: str = 'convert'#

EVAL: str = 'serve'#

SERVE: str = 'serve'#

TRAIN: str = 'train'#

classmethod from_dict(data: Dict[str, Any]) → T#: Deserializes a dictionary into a PyTree object.

classmethod from_json(json_str: str) → T#: Deserializes a JSON string into a PyTree object.

replace(**kwargs)#: Creates a new instance with specified fields replaced.

to_dict() → Dict[str, Any]#: Serializes the PyTree object to a dictionary.

to_json(**kwargs) → str#: Serializes the PyTree object to a JSON string.
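
The methods listed above (from_dict, from_json, replace, to_dict, to_json) describe a small immutable-record interface. The sketch below mimics that interface with a standard-library dataclass so the round-trip behavior can be tried without EasyDeL installed. `EasyMethodSketch` is a hypothetical stand-in that copies the documented field defaults; it is not the real class and may differ from it in detail.

```python
import dataclasses
import json


@dataclasses.dataclass(frozen=True)
class EasyMethodSketch:
    """Stand-in mirroring the documented fields; not the real class."""

    TRAIN: str = "train"
    SERVE: str = "serve"
    EVAL: str = "serve"  # documented default: EVAL shares the 'serve' value
    CONVERT: str = "convert"

    def to_dict(self):
        return dataclasses.asdict(self)

    def to_json(self, **kwargs):
        return json.dumps(self.to_dict(), **kwargs)

    @classmethod
    def from_dict(cls, data):
        return cls(**data)

    @classmethod
    def from_json(cls, json_str):
        return cls.from_dict(json.loads(json_str))

    def replace(self, **kwargs):
        # Returns a new instance; the original stays unchanged.
        return dataclasses.replace(self, **kwargs)


methods = EasyMethodSketch()
round_trip = EasyMethodSketch.from_json(methods.to_json())
custom = methods.replace(EVAL="eval")
```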

easydel.infra.base_config.set_attrs_smartly(self, attr_name: str, default: Any, new_attr: Any)[source]#
