easydel.inference.esurge.core.interface

class easydel.inference.esurge.core.interface.AttentionSpec(page_size: int, num_kv_heads: int, head_size: int, dtype: numpy.dtype, use_mla: bool)

Bases: CacheSpec

dtype: dtype
head_size: int
num_kv_heads: int
property page_size_bytes: int

The size in bytes of a page holding page_size tokens.

Returns

The page size in bytes.

use_mla: bool
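
As a rough illustration of what page_size_bytes measures, the sketch below computes a page's footprint from the documented fields. The formula is an assumption rather than something stated on this page: it supposes keys and values are stored as two separate tensors, and that a single latent tensor is stored when use_mla is set. The names attention_page_size_bytes and itemsize are hypothetical and not part of EasyDeL.

```python
def attention_page_size_bytes(
    page_size: int,
    num_kv_heads: int,
    head_size: int,
    itemsize: int,
    use_mla: bool = False,
) -> int:
    """Hypothetical helper: bytes needed for one page of one attention layer."""
    # Assumption: MLA keeps one latent tensor per token; otherwise K and V
    # are stored as two separate tensors.
    tensors_per_token = 1 if use_mla else 2
    return tensors_per_token * page_size * num_kv_heads * head_size * itemsize


# 16 tokens per page, 8 KV heads of size 128, 2-byte elements (e.g. bfloat16).
print(attention_page_size_bytes(16, 8, 128, itemsize=2))  # 65536
```
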
class easydel.inference.esurge.core.interface.CacheGroupSpec(kv_cache_spec: CacheSpec, layer_names: list[str] | None = None)

Bases: object

Represents a group of model layers that share the same KV cache page table. These layers are regarded as one layer in the KV cache manager.

kv_cache_spec: CacheSpec
layer_names: list[str] | None = None
class easydel.inference.esurge.core.interface.CacheGroupsConfig(num_pages: int, kv_cache_groups: list[easydel.inference.esurge.core.interface.CacheGroupSpec])

Bases: object

The KV cache configuration of a model.

kv_cache_groups: list[easydel.inference.esurge.core.interface.CacheGroupSpec]
num_pages: int
class easydel.inference.esurge.core.interface.CacheSpec(page_size: int)

Bases: object

A base class for specifying the KV cache format of one layer.

max_memory_usage_bytes(*args, **kwargs) -> int

The maximum possible memory usage of this KV cache in bytes.

Returns

The KV cache size in bytes

classmethod merge(specs: list[Self]) -> Self

Merge a list of CacheSpec objects into a single CacheSpec object.

page_size: int
property page_size_bytes: int

The size in bytes of a page holding page_size tokens.

Returns

The page size in bytes.

property type_id: str

The type identifier of this KV cache. Returns different strings for layers with different KV cache types (e.g., a different number of cached tokens, as with full attention vs. sliding window attention, or a different KV cache size per token, as with layers that have different numbers of heads).

Returns

The type identifier of this KV cache.
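
The relationship between type_id and merge can be sketched with a toy class. This is not the EasyDeL implementation: ToySpec is a hypothetical stand-in, and the rule that merge only accepts specs sharing one type_id is an assumption about how such a method would typically behave.

```python
import copy
from dataclasses import dataclass


@dataclass(frozen=True)
class ToySpec:
    """Hypothetical stand-in for a CacheSpec subclass."""

    page_size: int
    page_size_bytes: int

    @property
    def type_id(self) -> str:
        # Layers whose cache layout differs must produce different strings.
        return f"toy_{self.page_size}_{self.page_size_bytes}"

    @classmethod
    def merge(cls, specs: list["ToySpec"]) -> "ToySpec":
        # Assumption: only specs of the same type can collapse into one spec.
        if not specs:
            raise ValueError("merge() needs at least one spec")
        if any(s.type_id != specs[0].type_id for s in specs):
            raise ValueError("all specs must share the same type_id")
        return copy.deepcopy(specs[0])


merged = ToySpec.merge([ToySpec(16, 1024), ToySpec(16, 1024)])
print(merged.type_id)  # toy_16_1024
```
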

class easydel.inference.esurge.core.interface.ChunkedLocalAttentionSpec(page_size: int, num_kv_heads: int, head_size: int, dtype: numpy.dtype, use_mla: bool, attention_chunk_size: int)

Bases: AttentionSpec

attention_chunk_size: int
max_memory_usage_bytes(max_model_len, max_num_batched_tokens, **kwargs) -> int

The maximum possible memory usage of this KV cache in bytes.

Returns

The KV cache size in bytes

property type_id: str

The type identifier of this KV cache. Returns different strings for layers with different KV cache types (e.g., a different number of cached tokens, as with full attention vs. sliding window attention, or a different KV cache size per token, as with layers that have different numbers of heads).

Returns

The type identifier of this KV cache.

class easydel.inference.esurge.core.interface.FullAttentionSpec(page_size: int, num_kv_heads: int, head_size: int, dtype: numpy.dtype, use_mla: bool, sliding_window: int | None = None, attention_chunk_size: int | None = None)

Bases: AttentionSpec

attention_chunk_size: int | None = None

When the hybrid allocator is disabled and the model contains both full attention layers and sliding window attention layers, the sliding window layers are treated as full attention in the KV cache manager (pages are allocated for all tokens) but are still computed as sliding window attention in the model runner. In this case, FullAttentionSpec is used and the window size is recorded in sliding_window. sliding_window defaults to None, meaning sliding window attention is not used.

max_memory_usage_bytes(max_model_len, **kwargs) -> int

The maximum possible memory usage of this KV cache in bytes.

Returns

The KV cache size in bytes

classmethod merge(specs: list[Self]) -> Self

Merge a list of FullAttentionSpec objects into a single FullAttentionSpec object.

classmethod merge_window_sizes(window_sizes: set[int]) -> int | None
sliding_window: int | None = None
property type_id: str

The type identifier of this KV cache. Returns different strings for layers with different KV cache types (e.g., a different number of cached tokens, as with full attention vs. sliding window attention, or a different KV cache size per token, as with layers that have different numbers of heads).

Returns

The type identifier of this KV cache.
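
A minimal sketch of two FullAttentionSpec behaviors, under stated assumptions. The first function assumes the memory bound is the number of pages needed to hold max_model_len tokens times the page size in bytes. The second assumes merging window sizes yields None for an empty set, the single value for a one-element set, and an error otherwise. Both function names are hypothetical, and neither body is taken from the EasyDeL source.

```python
from __future__ import annotations


def full_attention_max_bytes(
    max_model_len: int, page_size: int, page_size_bytes: int
) -> int:
    """Hypothetical bound: every token of the longest sequence gets a slot."""
    # Ceiling division: a partially filled last page still occupies a whole page.
    num_pages = -(-max_model_len // page_size)
    return num_pages * page_size_bytes


def merge_window_sizes_sketch(window_sizes: set[int]) -> int | None:
    """Hypothetical merge rule for the sliding_window values of several specs."""
    if not window_sizes:
        return None  # no layer uses a sliding window
    if len(window_sizes) == 1:
        return next(iter(window_sizes))
    # Assumption: layers with different window sizes cannot share one spec.
    raise ValueError(f"cannot merge different window sizes: {sorted(window_sizes)}")


print(full_attention_max_bytes(1000, 16, 65536))  # 63 pages -> 4128768
print(merge_window_sizes_sketch({4096}))          # 4096
```
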

class easydel.inference.esurge.core.interface.MambaSpec(page_size: int, shapes: tuple[tuple[int, ...], ...], dtype: numpy.dtype, page_size_padded: int | None = None)

Bases: CacheSpec

dtype: dtype
max_memory_usage_bytes(**kwargs) -> int

The maximum possible memory usage of this KV cache in bytes.

Returns

The KV cache size in bytes

property page_size_bytes: int

The size in bytes of a page holding page_size tokens.

Returns

The page size in bytes.

page_size_padded: int | None = None
shapes: tuple[tuple[int, ...], ...]
property type_id: str

The type identifier of this KV cache. Returns different strings for layers with different KV cache types (e.g., a different number of cached tokens, as with full attention vs. sliding window attention, or a different KV cache size per token, as with layers that have different numbers of heads).

Returns

The type identifier of this KV cache.
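
One plausible reading of how shapes and page_size_padded interact is sketched below. It assumes the raw page size is the total element count across all state shapes times the element size, and that page_size_padded, when set, overrides the raw size so that it can match the page size of other cache types. The helper name and the itemsize parameter are hypothetical.

```python
from __future__ import annotations

from math import prod


def mamba_page_size_bytes(
    shapes: tuple[tuple[int, ...], ...],
    itemsize: int,
    page_size_padded: int | None = None,
) -> int:
    """Hypothetical helper: bytes for one page of recurrent state."""
    raw = sum(prod(shape) for shape in shapes) * itemsize
    if page_size_padded is None:
        return raw
    # Assumption: padding may only grow the page, never truncate the state.
    if page_size_padded < raw:
        raise ValueError("page_size_padded is smaller than the state itself")
    return page_size_padded


shapes = ((4, 128), (16, 64, 32))  # e.g. a conv state and an SSM state
print(mamba_page_size_bytes(shapes, itemsize=4))  # 133120
```
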

class easydel.inference.esurge.core.interface.SlidingWindowSpec(page_size: int, num_kv_heads: int, head_size: int, dtype: numpy.dtype, use_mla: bool, sliding_window: int)

Bases: AttentionSpec

max_memory_usage_bytes(max_model_len, max_num_batched_tokens, **kwargs) -> int

The maximum possible memory usage of this KV cache in bytes.

Returns

The KV cache size in bytes

sliding_window: int
property type_id: str

The type identifier of this KV cache. Returns different strings for layers with different KV cache types (e.g., a different number of cached tokens, as with full attention vs. sliding window attention, or a different KV cache size per token, as with layers that have different numbers of heads).

Returns

The type identifier of this KV cache.
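
The signature of max_memory_usage_bytes takes both max_model_len and max_num_batched_tokens, which suggests a bound tighter than the full sequence length. The sketch below shows one such bound under assumptions not confirmed by this page: at most sliding_window - 1 past tokens plus one batch of new tokens must be resident, capped at max_model_len, with one extra page because the window need not start on a page boundary. The function name is hypothetical.

```python
def sliding_window_max_bytes(
    sliding_window: int,
    max_model_len: int,
    max_num_batched_tokens: int,
    page_size: int,
    page_size_bytes: int,
) -> int:
    """Hypothetical bound on the cache held by one sliding window layer."""
    # Tokens that may need to be resident while one batch is processed.
    num_tokens = min(sliding_window - 1 + max_num_batched_tokens, max_model_len)
    # Ceiling division, plus one page for a window that straddles a boundary.
    num_pages = -(-num_tokens // page_size) + 1
    return num_pages * page_size_bytes


# Window 4096, batch of 2048 tokens, 16-token pages of 65536 bytes each.
print(sliding_window_max_bytes(4096, 32768, 2048, 16, 65536))  # 385 pages -> 25231360
```
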

easydel.inference.esurge.core.interface.create_kv_cache_specs_from_config(config: EasyDeLBaseConfig, page_size: int, num_kv_heads: int, head_size: int, dtype: dtype, use_mla: bool = False) -> list[easydel.inference.esurge.core.interface.CacheGroupSpec]

Convert the mask details returned by the model config's get_mask_details() into a list of CacheGroupSpec objects.

This function reads the attention mask details from the model configuration and creates appropriate cache specifications for each attention type. Layers with the same attention type are grouped together.

Parameters
  • config – Model configuration with get_mask_details() method.

  • page_size – Number of tokens per cache page.

  • num_kv_heads – Number of key-value attention heads.

  • head_size – Dimension of each attention head.

  • dtype – Data type for cache tensors.

  • use_mla – Whether to use Multi-head Latent Attention.

Returns

A list of CacheGroupSpec objects, one per attention type found in the config. Falls back to a single FullAttentionSpec if no mask details are available.
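
The grouping step described above can be sketched without EasyDeL. The structure returned by get_mask_details() is not shown on this page, so the example takes a plain mapping from layer name to an attention type label as a hypothetical stand-in. The helper name and the labels are likewise illustrative only.

```python
from collections import defaultdict


def group_layers_by_attention_type(
    layer_types: dict[str, str],
) -> dict[str, list[str]]:
    """Hypothetical helper: collect layer names under their attention type."""
    if not layer_types:
        # Fallback described in the text: a single full attention group.
        return {"full_attention": []}
    groups: dict[str, list[str]] = defaultdict(list)
    for layer_name, attention_type in layer_types.items():
        groups[attention_type].append(layer_name)
    return dict(groups)


layer_types = {
    "layers.0": "full_attention",
    "layers.1": "sliding_window",
    "layers.2": "full_attention",
}
print(group_layers_by_attention_type(layer_types))
```

Each resulting group would correspond to one CacheGroupSpec, with the group's layer names as layer_names and a spec of the matching type as kv_cache_spec.
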