easydel.inference.esurge.core.manager#
- class easydel.inference.esurge.core.manager.CacheManager(num_pages: int, kv_cache_groups: list[easydel.inference.esurge.core.interface.CacheGroupSpec], max_model_len: int, enable_caching: bool = True, use_eagle: bool = False)[source]#
Bases: object
- allocate_slots(request: EngineRequest, num_new_tokens: int, num_new_computed_tokens: int = 0, new_computed_pages: easydel.inference.esurge.core.manager.CachePages | None = None, num_lookahead_tokens: int = 0, delay_cache_pages: bool = False) easydel.inference.esurge.core.manager.CachePages | None[source]#
Add slots for a request with new tokens to append.
- Parameters
request – The request to allocate slots for.
num_new_tokens – The number of tokens to allocate, including external tokens. Note that this does not include tokens that have already been computed locally (i.e. new_computed_pages).
num_new_computed_tokens – The number of newly computed tokens that just hit the prefix cache, excluding external tokens.
new_computed_pages – The cached pages for the above new computed tokens.
num_lookahead_tokens – The number of speculative tokens to allocate. This is used by spec decode proposers with kv-cache such as eagle.
delay_cache_pages – Whether to skip caching the pages. This is used by P/D when allocating pages used in a KV transfer which will complete in a future step.
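Pages layout:
```
-----------------------------------------------------------------------
| < computed > | < new computed > |    < new >    | < pre-allocated > |
-----------------------------------------------------------------------
|                  < required >                   |
---------------------------------------------------
|                    < full >                  |
------------------------------------------------
                                  | <new full> |
                                  --------------
```
The following *_pages are illustrated in this layout.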
- Returns
A list of newly allocated pages.
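The layout above can be read as simple page arithmetic. The following is a minimal sketch, not the CacheManager implementation: `layout_page_counts`, `page_size`, and `num_pages_held` are hypothetical names introduced for illustration, and only the computed, new computed, and new segments of the layout are modeled.

```python
def layout_page_counts(
    num_computed_tokens: int,
    num_new_computed_tokens: int,
    num_new_tokens: int,
    num_pages_held: int,
    page_size: int,
) -> tuple[int, int, int]:
    """Toy model of the pages layout (all names here are hypothetical).

    Returns (required pages, full pages, pages still to allocate).
    """
    total_tokens = num_computed_tokens + num_new_computed_tokens + num_new_tokens
    # "required": enough pages to hold every token, rounding up.
    num_required = -(-total_tokens // page_size)
    # "full": pages that are completely filled with tokens.
    num_full = total_tokens // page_size
    # Pages the request already holds do not need to be allocated again.
    num_to_allocate = max(0, num_required - num_pages_held)
    return num_required, num_full, num_to_allocate


counts = layout_page_counts(
    num_computed_tokens=32,
    num_new_computed_tokens=16,
    num_new_tokens=20,
    num_pages_held=3,
    page_size=16,
)
print(counts)
```

With 68 tokens and 16 tokens per page, five pages are required, four of them are full, and two more must be allocated on top of the three already held.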
- cache_pages(request: EngineRequest, num_computed_tokens: int) None[source]#
Cache the pages for the request, if enabled.
- create_empty_page_list() CachePages[source]#
Creates a new CachePages instance with no pages.
- free(request: EngineRequest) None[source]#
Free the pages allocated for the request. We free the pages in reverse order so that the tail pages are evicted first when caching is enabled.
- Parameters
request – The request whose pages should be freed.
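The effect of freeing in reverse order can be sketched with a toy free queue. This assumes, purely for illustration, that freed pages are appended to the back of a queue and that eviction takes pages from the front; `free_request_pages` and the integer page ids are hypothetical and do not reflect the real data structures.

```python
from collections import deque


def free_request_pages(free_queue: deque, request_pages: list[int]) -> None:
    """Return a request's pages to the free queue, tail page first."""
    for page_id in reversed(request_pages):
        free_queue.append(page_id)


free_queue: deque = deque()
# The request held pages 10, 11, 12 in prefix order (12 is the tail page).
free_request_pages(free_queue, [10, 11, 12])

# Eviction takes from the front, so the tail page goes first and the
# pages closest to the start of the prompt survive the longest.
evicted = free_queue.popleft()
print(evicted, list(free_queue))
```

Keeping the head pages alive longest is useful for prefix caching, because a later request can only reuse a page if all pages before it are still cached.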
- free_page_hashes(request: EngineRequest) None[source]#
Discard the page hashes for the request.
NOTE: Unlike free, this method should be called only when the request is finished, not when it is preempted.
- get_computed_pages(request: EngineRequest) tuple[easydel.inference.esurge.core.manager.CachePages, int][source]#
Get the computed (cached) pages for the request. Note that the computed pages must be full.
- Parameters
request – The request to get the computed pages for.
- Returns
A tuple containing:
* a list of pages that are computed for the request
* the number of computed tokens
- Return type
tuple[CachePages, int]
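As a rough illustration of why only full pages can be returned, the sketch below models a prefix cache keyed by chained page hashes. Everything in it is an assumption made for the example: `page_hashes`, `lookup_computed_pages`, the dictionary cache, and the use of Python's built-in `hash` are hypothetical stand-ins, not the hashing scheme used by the library.

```python
def page_hashes(token_ids: list[int], page_size: int) -> list[int]:
    """Hash each full page, chaining in the previous page's hash."""
    hashes = []
    parent = 0
    for i in range(len(token_ids) // page_size):
        chunk = tuple(token_ids[i * page_size:(i + 1) * page_size])
        parent = hash((parent, chunk))
        hashes.append(parent)
    return hashes


def lookup_computed_pages(
    token_ids: list[int], page_size: int, cache: dict[int, int]
) -> tuple[list[int], int]:
    """Return (cached page ids, number of computed tokens) for a prompt."""
    hit_pages = []
    for h in page_hashes(token_ids, page_size):
        page_id = cache.get(h)
        if page_id is None:
            break  # the cached prefix ends at the first miss
        hit_pages.append(page_id)
    return hit_pages, len(hit_pages) * page_size


PAGE_SIZE = 4
# Pretend an earlier request cached two full pages for tokens 1..8.
cache = {h: i for i, h in enumerate(page_hashes([1, 2, 3, 4, 5, 6, 7, 8], PAGE_SIZE))}

full_hit = lookup_computed_pages([1, 2, 3, 4, 5, 6, 7, 8, 9, 10], PAGE_SIZE, cache)
partial_hit = lookup_computed_pages([1, 2, 3, 4, 9, 9, 9, 9], PAGE_SIZE, cache)
print(full_hit, partial_hit)
```

The trailing tokens 9 and 10 in the first prompt do not fill a page, so they are never looked up; the number of computed tokens is always a multiple of the page size in this toy model.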
- get_num_common_prefix_pages(request: EngineRequest, num_scheduled_requests: int) list[int][source]#
Calculate the number of common prefix pages shared by all requests in the RUNNING state for each kv cache group.
The function determines this by selecting any request and iterating through its pages. A page is considered a common prefix page if its ref_cnt equals the total number of requests in the RUNNING state.
NOTE(woosuk): The number of requests in the RUNNING state is greater than or equal to the number of requests scheduled in the current step. This is because the RUNNING state only indicates that: 1. The request has not yet finished, and 2. The request holds its pages unfreed.
While all scheduled requests must be in the RUNNING state, the inverse is not necessarily true. There may be RUNNING requests that are not scheduled in the current step.
This can result in an edge case where the number of common prefix pages is 0, even though all scheduled requests share a common prefix. This occurs because there may be unscheduled RUNNING requests that do not share the common prefix. Currently, this case cannot be easily detected, so the function returns 0 in such cases.
- Parameters
request – Any request in the RUNNING state, used to identify the common prefix pages.
num_running_requests – The total number of requests in the RUNNING state. This can be different from the number of scheduled requests in the current step.
- Returns
The number of common prefix pages for each kv cache group.
- Return type
list[int]
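The reference-count rule and its edge case can be sketched for a single KV cache group as follows. `ToyPage` and `count_common_prefix_pages` are hypothetical names, and stopping at the first page whose count does not match is an assumption based on the word "prefix"; the real method returns one such count per group.

```python
from dataclasses import dataclass


@dataclass
class ToyPage:
    page_id: int
    ref_cnt: int  # how many requests currently hold this page


def count_common_prefix_pages(
    request_pages: list[ToyPage], num_running_requests: int
) -> int:
    """Count leading pages that are held by every RUNNING request."""
    count = 0
    for page in request_pages:
        if page.ref_cnt != num_running_requests:
            break
        count += 1
    return count


# Three requests share the first two pages; the third page is private.
pages = [ToyPage(0, ref_cnt=3), ToyPage(1, ref_cnt=3), ToyPage(2, ref_cnt=1)]

all_share = count_common_prefix_pages(pages, num_running_requests=3)
# Edge case from the note above: a fourth RUNNING request that is not
# scheduled and does not share the prefix makes the count drop to 0.
with_outsider = count_common_prefix_pages(pages, num_running_requests=4)
print(all_share, with_outsider)
```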
- reset_prefix_cache() bool[source]#
Reset prefix cache. This function may be used in RLHF flows to invalidate prefix caching after the weights are updated, or used for resetting prefix caching status for benchmarking.
- Returns
True if the prefix cache is successfully reset, False otherwise.
- Return type
bool
- property usage: float#
Get the KV cache usage.
- Returns
The KV cache usage (between 0.0 and 1.0).
- class easydel.inference.esurge.core.manager.CachePages(pages: tuple[list[easydel.inference.esurge.core.utils.CachePage], ...])[source]#
Bases: object
The allocation result of CacheManager. It serves as the interface between the Scheduler and CacheManager, hiding CacheManager's internal data structures from the Scheduler.
- get_page_ids() tuple[list[int], ...][source]#
Converts the CachePages instance to page_ids.
- Returns
A tuple of lists where
* the outer tuple corresponds to KV cache groups
* each inner list contains the page_ids of the pages in that group
- Return type
tuple[list[int], …]
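The shape of the returned value can be illustrated with stand-in classes. `ToyCachePage` and `ToyCachePages` are hypothetical, and the `page_id` attribute name is an assumption; only the nesting (outer tuple per KV cache group, inner list per page) is taken from the description above.

```python
from dataclasses import dataclass


@dataclass
class ToyCachePage:
    page_id: int  # assumed attribute name, for illustration only


@dataclass
class ToyCachePages:
    # One list of pages per KV cache group.
    pages: tuple[list[ToyCachePage], ...]

    def get_page_ids(self) -> tuple[list[int], ...]:
        """Convert each group's pages to a list of page ids."""
        return tuple([page.page_id for page in group] for group in self.pages)


result = ToyCachePages(
    pages=([ToyCachePage(0), ToyCachePage(1)], [ToyCachePage(7)])
)
page_ids = result.get_page_ids()
print(page_ids)
```

Handing plain integer ids to the caller is what lets a scheduler work with the allocation result without depending on the page objects themselves.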
- new_empty() CachePages[source]#
Creates a new CachePages instance with no pages.
- pages: tuple[list[easydel.inference.esurge.core.utils.CachePage], ...]#