easydel.inference.esurge.core.interface#
- class easydel.inference.esurge.core.interface.AttentionSpec(page_size: int, num_kv_heads: int, head_size: int, dtype: numpy.dtype, use_mla: bool)[source]#
Bases: CacheSpec
- head_size: int#
- num_kv_heads: int#
- property page_size_bytes: int#
The size of a page with page_size tokens in bytes.
- Returns
The page size in bytes.
- use_mla: bool#
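To make page_size_bytes concrete, here is a minimal sketch of how the byte size of one page could follow from the constructor fields. The formula is an assumption modeled on common paged KV cache designs, not taken from the EasyDeL source: a factor of 2 for separate key and value tensors, and 1 when use_mla is set. `sketch_page_size_bytes` is a hypothetical helper, and `itemsize` stands in for `numpy.dtype.itemsize`.

```python
def sketch_page_size_bytes(page_size: int, num_kv_heads: int, head_size: int,
                           itemsize: int, use_mla: bool) -> int:
    """Hypothetical estimate of the bytes held by one KV cache page."""
    # Assumption: MLA keeps a single latent tensor per token,
    # while regular attention keeps separate K and V tensors.
    tensors_per_token = 1 if use_mla else 2
    return tensors_per_token * page_size * num_kv_heads * head_size * itemsize


# 16 tokens per page, 8 KV heads of width 128, 2-byte elements (e.g. float16).
print(sketch_page_size_bytes(16, 8, 128, itemsize=2, use_mla=False))  # 65536
```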
- class easydel.inference.esurge.core.interface.CacheGroupSpec(kv_cache_spec: CacheSpec, layer_names: list[str] | None = None)[source]#
Bases: object
Represents a group of model layers that share the same KV cache page table. The KV cache manager treats these layers as a single layer.
- class easydel.inference.esurge.core.interface.CacheGroupsConfig(num_pages: int, kv_cache_groups: list[easydel.inference.esurge.core.interface.CacheGroupSpec])[source]#
Bases: object
The KV cache configuration of a model.
- kv_cache_groups: list[easydel.inference.esurge.core.interface.CacheGroupSpec]#
- num_pages: int#
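The relationship between a configuration, its groups, and the layers in each group can be sketched with stand-in dataclasses. The `Sketch*` classes below are hypothetical and only mirror the field names shown above, and the layer names are made up for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class SketchSpec:  # stand-in for CacheSpec
    page_size: int


@dataclass
class SketchGroup:  # stand-in for CacheGroupSpec
    kv_cache_spec: SketchSpec
    layer_names: list[str] = field(default_factory=list)


@dataclass
class SketchConfig:  # stand-in for CacheGroupsConfig
    num_pages: int
    kv_cache_groups: list[SketchGroup]


config = SketchConfig(
    num_pages=1024,
    kv_cache_groups=[
        SketchGroup(SketchSpec(page_size=16), ["layers.0.attn", "layers.2.attn"]),
        SketchGroup(SketchSpec(page_size=16), ["layers.1.attn"]),
    ],
)

# Map each layer name to the index of the group whose page table it shares.
layer_to_group = {
    name: idx
    for idx, group in enumerate(config.kv_cache_groups)
    for name in group.layer_names
}
print(layer_to_group)
```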
- class easydel.inference.esurge.core.interface.CacheSpec(page_size: int)[source]#
Bases: object
A base class for specifying the KV cache format of one layer.
- max_memory_usage_bytes(*args, **kwargs) int[source]#
The maximum possible memory usage of this KV cache in bytes.
- Returns
The KV cache size in bytes
- classmethod merge(specs: list[Self]) Self[source]#
Merge a list of CacheSpec objects into a single CacheSpec object.
- page_size: int#
- property page_size_bytes: int#
The size of a page with page_size tokens in bytes.
- Returns
The page size in bytes.
- property type_id: str#
The type identifier of this KV cache. Returns different strings for layers with different KV cache types (e.g., a different number of cached tokens, as with full attention vs. sliding window attention, or a different KV cache size per token, as with layers that have different numbers of heads).
- Returns
The type identifier of this KV cache.
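The interplay between type_id and merge can be sketched as follows. This is an illustration under one assumption: that only specs with the same type_id can be merged into one. The `SketchCacheSpec` class and its identifier format are hypothetical, not the EasyDeL implementation.

```python
import copy
from dataclasses import dataclass


@dataclass(frozen=True)
class SketchCacheSpec:
    """Stand-in for CacheSpec; not the EasyDeL class."""
    page_size: int
    page_size_bytes: int

    @property
    def type_id(self) -> str:
        # Specs with the same layout produce the same identifier.
        return f"sketch_{self.page_size}_{self.page_size_bytes}"

    @classmethod
    def merge(cls, specs: list["SketchCacheSpec"]) -> "SketchCacheSpec":
        if not specs:
            raise ValueError("merge needs at least one spec")
        # Assumption: specs with differing type_id values cannot be merged.
        if any(s.type_id != specs[0].type_id for s in specs[1:]):
            raise ValueError("cannot merge specs with different type_id values")
        return copy.deepcopy(specs[0])


merged = SketchCacheSpec.merge([SketchCacheSpec(16, 65536), SketchCacheSpec(16, 65536)])
print(merged.type_id)  # sketch_16_65536
```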
- class easydel.inference.esurge.core.interface.ChunkedLocalAttentionSpec(page_size: int, num_kv_heads: int, head_size: int, dtype: numpy.dtype, use_mla: bool, attention_chunk_size: int)[source]#
Bases: AttentionSpec
- attention_chunk_size: int#
- max_memory_usage_bytes(max_model_len, max_num_batched_tokens, **kwargs) int[source]#
The maximum possible memory usage of this KV cache in bytes.
- Returns
The KV cache size in bytes
- property type_id: str#
The type identifier of this KV cache. Returns different strings for layers with different KV cache types (e.g., a different number of cached tokens, as with full attention vs. sliding window attention, or a different KV cache size per token, as with layers that have different numbers of heads).
- Returns
The type identifier of this KV cache.
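One plausible way to bound the memory of a chunked local attention layer is sketched below. The bound is an assumption borrowed from similar paged-cache designs, not confirmed for EasyDeL: at most one attention chunk plus one batch of new tokens is resident, capped at the model length. `sketch_chunked_local_max_bytes` is a hypothetical helper.

```python
def sketch_chunked_local_max_bytes(page_size: int, page_size_bytes: int,
                                   attention_chunk_size: int,
                                   max_model_len: int,
                                   max_num_batched_tokens: int) -> int:
    """Hypothetical upper bound for a chunked local attention layer."""
    # Assumption: one chunk plus one scheduling step of new tokens stays cached.
    num_tokens = min(attention_chunk_size + max_num_batched_tokens, max_model_len)
    num_pages = -(-num_tokens // page_size)  # ceiling division
    return num_pages * page_size_bytes


# 8192-token chunks, 2048 batched tokens, 16-token pages of 65536 bytes each.
print(sketch_chunked_local_max_bytes(16, 65536, 8192, 32768, 2048))  # 41943040
```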
- class easydel.inference.esurge.core.interface.FullAttentionSpec(page_size: int, num_kv_heads: int, head_size: int, dtype: numpy.dtype, use_mla: bool, sliding_window: int | None = None, attention_chunk_size: int | None = None)[source]#
Bases: AttentionSpec
- attention_chunk_size: int | None = None#
When the hybrid allocator is disabled and the model contains both full attention layers and sliding window attention layers, the sliding window attention layers are treated as full attention in the KV cache manager (pages are allocated for all tokens) but are still computed as sliding window attention in the model runner. In this case, FullAttentionSpec is used and the sliding window size is recorded. Defaults to None, meaning sliding window attention is not used.
- max_memory_usage_bytes(max_model_len, **kwargs) int[source]#
The maximum possible memory usage of this KV cache in bytes.
- Returns
The KV cache size in bytes
- classmethod merge(specs: list[Self]) Self[source]#
Merge a list of FullAttentionSpec objects into a single FullAttentionSpec object.
- property type_id: str#
The type identifier of this KV cache. Returns different strings for layers with different KV cache types (e.g., a different number of cached tokens, as with full attention vs. sliding window attention, or a different KV cache size per token, as with layers that have different numbers of heads).
- Returns
The type identifier of this KV cache.
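For full attention, every token up to the maximum model length needs a cache slot, so the bound can be sketched as a page count times the page size in bytes. This is an illustrative assumption rather than the confirmed EasyDeL formula, and `sketch_full_attention_max_bytes` is a hypothetical helper.

```python
def sketch_full_attention_max_bytes(page_size: int, page_size_bytes: int,
                                    max_model_len: int) -> int:
    """Hypothetical upper bound for a full attention layer."""
    # Pages are allocated for all tokens, so round the token count up to whole pages.
    num_pages = -(-max_model_len // page_size)  # ceiling division
    return num_pages * page_size_bytes


# 1000 tokens need 63 pages of 16 tokens; each page holds 65536 bytes.
print(sketch_full_attention_max_bytes(16, 65536, 1000))  # 4128768
```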
- class easydel.inference.esurge.core.interface.MambaSpec(page_size: int, shapes: tuple[tuple[int, ...], ...], dtype: numpy.dtype, page_size_padded: int | None = None)[source]#
Bases: CacheSpec
- max_memory_usage_bytes(**kwargs) int[source]#
The maximum possible memory usage of this KV cache in bytes.
- Returns
The KV cache size in bytes
- property page_size_bytes: int#
The size of a page with page_size tokens in bytes.
- Returns
The page size in bytes.
- shapes: tuple[tuple[int, ...], ...]#
- property type_id: str#
The type identifier of this KV cache. Returns different strings for layers with different KV cache types (e.g., a different number of cached tokens, as with full attention vs. sliding window attention, or a different KV cache size per token, as with layers that have different numbers of heads).
- Returns
The type identifier of this KV cache.
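For a state-space layer, the page size plausibly depends on the state tensor shapes rather than on a token count. The sketch below assumes the unpadded size is the total element count across shapes times the element size, and that page_size_padded, when given, overrides it. Both points are assumptions, and `sketch_mamba_page_size_bytes` is a hypothetical helper with `itemsize` standing in for `numpy.dtype.itemsize`.

```python
import math
from typing import Optional


def sketch_mamba_page_size_bytes(shapes: tuple[tuple[int, ...], ...],
                                 itemsize: int,
                                 page_size_padded: Optional[int] = None) -> int:
    """Hypothetical page size for a layer whose state consists of fixed-shape tensors."""
    raw = sum(math.prod(shape) for shape in shapes) * itemsize
    if page_size_padded is not None:
        # Assumption: padding may only grow the page, never truncate the state.
        if page_size_padded < raw:
            raise ValueError("page_size_padded is smaller than the unpadded state")
        return page_size_padded
    return raw


shapes = ((4, 64), (8, 16, 16))  # 256 + 2048 elements
print(sketch_mamba_page_size_bytes(shapes, itemsize=4))                          # 9216
print(sketch_mamba_page_size_bytes(shapes, itemsize=4, page_size_padded=10240))  # 10240
```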
- class easydel.inference.esurge.core.interface.SlidingWindowSpec(page_size: int, num_kv_heads: int, head_size: int, dtype: numpy.dtype, use_mla: bool, sliding_window: int)[source]#
Bases: AttentionSpec
- max_memory_usage_bytes(max_model_len, max_num_batched_tokens, **kwargs) int[source]#
The maximum possible memory usage of this KV cache in bytes.
- Returns
The KV cache size in bytes
- sliding_window: int#
- property type_id: str#
The type identifier of this KV cache. Returns different strings for layers with different KV cache types (e.g., a different number of cached tokens, as with full attention vs. sliding window attention, or a different KV cache size per token, as with layers that have different numbers of heads).
- Returns
The type identifier of this KV cache.
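A sliding window layer only needs to keep the most recent tokens, which is why its bound takes max_num_batched_tokens as well as max_model_len. The sketch below assumes a bound used in similar paged-cache designs, not confirmed for EasyDeL: the window minus one token plus one batch of new tokens, capped at the model length, with one extra page because the window need not start on a page boundary. `sketch_sliding_window_max_bytes` is a hypothetical helper.

```python
def sketch_sliding_window_max_bytes(page_size: int, page_size_bytes: int,
                                    sliding_window: int,
                                    max_model_len: int,
                                    max_num_batched_tokens: int) -> int:
    """Hypothetical upper bound for a sliding window attention layer."""
    # Assumption: the last (window - 1) tokens plus the current batch are resident.
    num_tokens = min(sliding_window - 1 + max_num_batched_tokens, max_model_len)
    num_pages = -(-num_tokens // page_size)  # ceiling division
    # One extra page covers a window that straddles a page boundary.
    return (num_pages + 1) * page_size_bytes


# 4096-token window, 2048 batched tokens, 16-token pages of 65536 bytes each.
print(sketch_sliding_window_max_bytes(16, 65536, 4096, 32768, 2048))  # 25231360
```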
- easydel.inference.esurge.core.interface.create_kv_cache_specs_from_config(config: EasyDeLBaseConfig, page_size: int, num_kv_heads: int, head_size: int, dtype: dtype, use_mla: bool = False) list[easydel.inference.esurge.core.interface.CacheGroupSpec][source]#
Convert the output of the model config's get_mask_details() into a list of CacheGroupSpec.
This function reads the attention mask details from the model configuration and creates appropriate cache specifications for each attention type. Layers with the same attention type are grouped together.
- Parameters
config – Model configuration with get_mask_details() method.
page_size – Number of tokens per cache page.
num_kv_heads – Number of key-value attention heads.
head_size – Dimension of each attention head.
dtype – Data type for cache tensors.
use_mla – Whether to use Multi-head Latent Attention.
- Returns
List of CacheGroupSpec, one per attention type found in the config. Falls back to a single FullAttentionSpec if no mask details are available.
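The grouping step described above can be sketched in isolation. The return shape of get_mask_details() is not shown here, so the example assumes a hypothetical mapping from layer index to an attention type string. The `sketch_group_layers` helper, the `layer_N` names, and the `full_attention` label are likewise made up for illustration.

```python
from typing import Optional


def sketch_group_layers(
    mask_details: Optional[dict[int, str]],
) -> list[tuple[str, Optional[list[str]]]]:
    """Group hypothetical layer names by attention type, one group per type."""
    if not mask_details:
        # Fallback: a single full attention group when no details are available.
        return [("full_attention", None)]
    groups: dict[str, list[str]] = {}
    for layer_idx in sorted(mask_details):
        # Layers with the same attention type land in the same group.
        groups.setdefault(mask_details[layer_idx], []).append(f"layer_{layer_idx}")
    return list(groups.items())


details = {0: "sliding", 1: "sliding", 2: "full", 3: "sliding"}
print(sketch_group_layers(details))
print(sketch_group_layers(None))
```

In the real function, each group would carry a cache spec built from page_size, num_kv_heads, head_size, dtype, and use_mla rather than a bare string label.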