easydel.inference.esurge.config
- class easydel.inference.esurge.config.CacheConfig(num_pages: int | None, page_size: int, enable_prefix_caching: bool)
Bases: object
Configuration for the KV (key-value) cache.
Manages memory allocation and caching strategies for attention mechanisms.
- num_pages
Number of GPU pages allocated for the cache (None for automatic).
Type: int | None
- page_size
Size of each cache page, in tokens.
Type: int
- enable_prefix_caching
Enable caching of common prefixes across requests.
Type: bool
Example
>>> config = CacheConfig(
...     num_pages=1000,
...     page_size=16,
...     enable_prefix_caching=True
... )
Note
Page-based allocation allows efficient memory management and sharing of cache blocks between sequences.
- enable_prefix_caching: bool
Whether prefix caching is enabled.
- page_size: int
The size of each cache page, in tokens.
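Since page_size is expressed in tokens, the number of pages a sequence occupies can be estimated with a ceiling division. The following is a minimal sketch of that arithmetic; the helper name pages_needed and the sample prompt lengths are hypothetical and are not part of the eSurge API.

```python
def pages_needed(num_tokens: int, page_size: int) -> int:
    """Pages required to hold num_tokens when each page stores page_size tokens."""
    if page_size <= 0:
        raise ValueError("page_size must be positive")
    # Ceiling division: a partially filled page still occupies a whole page.
    return (num_tokens + page_size - 1) // page_size


# Values taken from the CacheConfig example; the prompt lengths are made up.
page_size = 16
num_pages = 1000
prompt_lengths = [5, 16, 17, 100]

per_request = [pages_needed(n, page_size) for n in prompt_lengths]
total_used = sum(per_request)
pages_left = num_pages - total_used
```

A 17-token sequence needs two pages here, which is the kind of rounding overhead that a smaller page_size reduces at the cost of tracking more pages.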
- class easydel.inference.esurge.config.Config(scheduler_config: SchedulerConfig, cache_config: CacheConfig, speculative_config: easydel.inference.esurge.config.SpeculativeConfig | None = None)
Bases: object
Unified configuration for the eSurge engine.
Combines scheduler and cache configurations into a single object.
- scheduler_config
Configuration for request scheduling.
- cache_config
Configuration for KV cache management.
- speculative_config
Configuration for speculative decoding.
Example
>>> config = Config(
...     scheduler_config=SchedulerConfig(...),
...     cache_config=CacheConfig(...),
...     speculative_config=SpeculativeConfig(num_speculative_tokens=5)
... )
- cache_config: CacheConfig
Nested configuration for the cache.
- scheduler_config: SchedulerConfig
Nested configuration for the scheduler.
- speculative_config: easydel.inference.esurge.config.SpeculativeConfig | None = None
Nested configuration for speculative decoding.
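Because Config pairs a scheduler configuration with a cache configuration, it can help to compare the token capacity of the cache against the scheduler's worst-case demand before building one. The sketch below does that arithmetic with plain integers; the helper names are hypothetical, and nothing here implies that eSurge performs such a check itself.

```python
def kv_token_capacity(num_pages: int, page_size: int) -> int:
    """Total tokens the paged KV cache can hold."""
    return num_pages * page_size


def worst_case_tokens(max_num_seqs: int, max_model_len: int) -> int:
    """Tokens needed if every running sequence reached max_model_len."""
    return max_num_seqs * max_model_len


# Values from the CacheConfig and SchedulerConfig examples on this page.
capacity = kv_token_capacity(num_pages=1000, page_size=16)
worst_case = worst_case_tokens(max_num_seqs=16, max_model_len=8192)

# How many sequences of maximum length the cache could hold at once.
full_length_fit = capacity // 8192
covers_worst_case = capacity >= worst_case
```

With these example values the cache holds far fewer tokens than the worst case, so the settings only work together if most sequences stay well below max_model_len.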
- class easydel.inference.esurge.config.SchedulerConfig(max_num_seqs: int, max_num_batched_tokens: int, max_model_len: int, policy: Literal['priority', 'fcfs'] = 'fcfs', long_prefill_token_threshold: int = 256, chunked_prefill_enabled: bool = False, token_safety_margin: int | None = None, max_num_seq_buckets: tuple[int, ...] | None = None, async_scheduling: bool = False)
Bases: object
Configuration for the request scheduler.
Controls how requests are scheduled and batched for processing.
- max_num_seqs
Maximum number of sequences running simultaneously.
Type: int
- max_num_batched_tokens
Maximum number of tokens processed in a single batch.
Type: int
- max_model_len
Maximum input length the model can handle.
Type: int
- policy
Scheduling policy ('fcfs' for first-come-first-served, 'priority' for priority-based).
Type: Literal['priority', 'fcfs']
- long_prefill_token_threshold
Token count threshold for identifying long prefill requests.
Type: int
- chunked_prefill_enabled
Enable chunked processing of long prefill requests.
Type: bool
Example
>>> config = SchedulerConfig(
...     max_num_seqs=16,
...     max_num_batched_tokens=2048,
...     max_model_len=8192,
...     policy="priority"
... )
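The two policy values can be pictured as two sort orders over the waiting requests. In the sketch below, the Request class, its fields, and the rule that a lower priority number is served first are all assumptions made for illustration; this is not eSurge's scheduler.

```python
from dataclasses import dataclass


@dataclass
class Request:
    # Hypothetical request record, not an eSurge type.
    request_id: str
    arrival: int        # arrival order (smaller means earlier)
    priority: int = 0   # assumed: smaller means more urgent


def order_requests(requests, policy="fcfs"):
    """Return requests in the order the given policy would serve them."""
    if policy == "fcfs":
        return sorted(requests, key=lambda r: r.arrival)
    if policy == "priority":
        # Ties on priority fall back to arrival order.
        return sorted(requests, key=lambda r: (r.priority, r.arrival))
    raise ValueError(f"unknown policy: {policy!r}")


waiting = [Request("a", 0, 2), Request("b", 1, 0), Request("c", 2, 1)]
fcfs_order = [r.request_id for r in order_requests(waiting, "fcfs")]
priority_order = [r.request_id for r in order_requests(waiting, "priority")]
```

Under 'fcfs' the order is a, b, c; under 'priority' with this assumed ordering rule it becomes b, c, a.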
- async_scheduling: bool = False
Enable asynchronous token sampling so that it overlaps with the next forward pass (30-40% latency reduction).
- chunked_prefill_enabled: bool = False
Whether chunked prefilling is enabled.
- long_prefill_token_threshold: int = 256
The token count above which a prefill request is treated as long.
- max_model_len: int
The maximum length of the model's input.
- max_num_batched_tokens: int
The maximum number of tokens processed in a single batch.
- max_num_seq_buckets: tuple[int, ...] | None = None
Optional explicit request-capacity buckets (e.g., (8, 16, 32, 64)).
- max_num_seqs: int
The maximum number of sequences running at the same time.
- policy: Literal['priority', 'fcfs'] = 'fcfs'
The scheduling policy to use: 'priority' or 'fcfs'.
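To make the interaction between long_prefill_token_threshold and chunked_prefill_enabled concrete, the sketch below splits a long prompt into per-step chunks. It assumes the threshold also acts as the maximum chunk size, which this page does not state; the function prefill_chunks is hypothetical.

```python
def prefill_chunks(prompt_len: int, threshold: int, chunked_prefill_enabled: bool):
    """Return the number of prompt tokens processed at each prefill step."""
    # Short prompts, or any prompt when chunking is off, go through in one step.
    if not chunked_prefill_enabled or prompt_len <= threshold:
        return [prompt_len]
    chunks = []
    remaining = prompt_len
    while remaining > 0:
        step = min(remaining, threshold)  # assumed: threshold caps each chunk
        chunks.append(step)
        remaining -= step
    return chunks


chunked = prefill_chunks(600, threshold=256, chunked_prefill_enabled=True)
unchunked = prefill_chunks(600, threshold=256, chunked_prefill_enabled=False)
short = prefill_chunks(200, threshold=256, chunked_prefill_enabled=True)
```

Splitting a long prefill this way lets other requests share each batch instead of waiting behind one large prompt.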
- class easydel.inference.esurge.config.SpeculativeConfig(num_speculative_tokens: int = 0, speculative_model: str | None = None)
Bases: object
Configuration for speculative decoding.
- num_speculative_tokens
Number of speculative tokens to generate.
Type: int
- speculative_model
Path to the speculative model (e.g., Eagle model).
Type: str | None
- num_speculative_tokens: int = 0
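For readers unfamiliar with what num_speculative_tokens controls, the sketch below shows the generic greedy verification step used in speculative decoding: a draft model proposes a fixed number of tokens, and the target model keeps the longest prefix it agrees with. The function accept_draft and the token values are hypothetical, and this describes the general technique rather than eSurge's implementation.

```python
def accept_draft(draft_tokens, target_tokens):
    """Greedy verification of drafted tokens.

    target_tokens holds the target model's own choice at each drafted
    position plus one extra token, so it is one element longer.
    """
    if len(target_tokens) != len(draft_tokens) + 1:
        raise ValueError("target_tokens must be one longer than draft_tokens")
    accepted = []
    for drafted, verified in zip(draft_tokens, target_tokens):
        if drafted != verified:
            # First disagreement: keep the target's token and stop.
            accepted.append(verified)
            return accepted
        accepted.append(drafted)
    # Every draft token matched, so the extra target token comes for free.
    accepted.append(target_tokens[-1])
    return accepted


partial = accept_draft([5, 9, 2, 7, 1], [5, 9, 4, 7, 1, 3])
full = accept_draft([5, 9], [5, 9, 8])
no_draft = accept_draft([], [8])
```

With zero drafted tokens, the default shown above, the step reduces to ordinary one-token-at-a-time decoding.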