easydel.inference.esurge.scheduler.scheduler

class easydel.inference.esurge.scheduler.scheduler.Scheduler(config: Config, kv_cache_config: CacheGroupsConfig, include_finished_set: bool = False, max_num_seq_buckets: list[int] | None = None)

Bases: SchedulerInterface

add_request(request: EngineRequest) → None

Add a new request to the scheduler’s internal queue.

Parameters: request – The new request being added.

finish_requests(request_ids: str | collections.abc.Iterable[str], finished_status: EngineRequestStatus) → None

Handles the finish signal from outside the scheduler.

For example, the API server can abort a request when the client disconnects.
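Because request_ids accepts either a single string or an iterable of strings, a caller or implementation has to tell the two apart before looping. The following is a minimal sketch of that normalization; normalize_request_ids is a hypothetical helper written for illustration and is not part of the EasyDeL API.

```python
from __future__ import annotations

from collections.abc import Iterable


def normalize_request_ids(request_ids: str | Iterable[str]) -> list[str]:
    """Return the ids as a list, treating a bare string as one id.

    Hypothetical helper for illustration only; not an EasyDeL function.
    """
    # A str is itself iterable, so it is checked first. Otherwise an id
    # such as "req-1" would be split into single characters.
    if isinstance(request_ids, str):
        return [request_ids]
    return list(request_ids)
```

With this shape, an abort triggered by a client disconnect can pass one id, while a bulk cancellation can pass a list, tuple, or generator of ids.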

classmethod from_runner(runner: eSurgeRunner, max_num_batched_tokens: int | None = None, enable_prefix_caching: bool = True) → Scheduler

Create a Scheduler instance from an eSurgeRunner.

This method automatically detects the model’s attention types (full, sliding window, chunked) from the model config and creates appropriate cache specifications.

Parameters

runner – The eSurgeRunner instance.
max_num_batched_tokens – Maximum tokens per batch. Defaults to max_model_len.
enable_prefix_caching – Enable prefix caching for faster inference.

get_num_unfinished_requests() → int

Number of unfinished requests in the scheduler’s internal queue.

get_request_counts() → tuple[int, int]

Returns (num_running_reqs, num_waiting_reqs).

has_finished_requests() → bool

Returns True if there are finished requests that need to be cleared. NOTE: This is different from not self.has_unfinished_requests().

The scheduler maintains an internal list of the requests finished in the previous step. This list is returned from the next call to schedule(), to be sent to the model runner in the next step to clear cached states for these finished requests.

This method checks whether that internal list of finished requests is non-empty. This information is useful for DP (data-parallel) attention.
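The distinction drawn in the note above can be sketched with a toy tracker. This is an illustrative model under stated assumptions, not the scheduler's actual implementation: the class ToyTracker and its methods add, mark_finished, and drain_finished are hypothetical names.

```python
from __future__ import annotations


class ToyTracker:
    """Toy model of the two request sets described above (hypothetical)."""

    def __init__(self) -> None:
        self._unfinished: set[str] = set()
        # Ids finished in the previous step that have not yet been handed
        # to the model runner for cleanup.
        self._finished_pending: set[str] = set()

    def add(self, req_id: str) -> None:
        self._unfinished.add(req_id)

    def mark_finished(self, req_id: str) -> None:
        self._unfinished.discard(req_id)
        self._finished_pending.add(req_id)

    def has_unfinished_requests(self) -> bool:
        return bool(self._unfinished)

    def has_finished_requests(self) -> bool:
        return bool(self._finished_pending)

    def drain_finished(self) -> set[str]:
        # Stands in for the next schedule() call returning the list.
        finished, self._finished_pending = self._finished_pending, set()
        return finished
```

After the only request finishes, both has_finished_requests() and not has_unfinished_requests() are true. Once the pending ids are drained, has_finished_requests() turns false while not has_unfinished_requests() stays true, which is why the two checks are not interchangeable.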

reset_prefix_cache() → bool

Reset the prefix cache for KV cache.

This is particularly required when the model weights are live-updated.

schedule() → SchedulerOutput

Schedule the requests to process in this scheduling step.

The scheduling decision is made at the iteration level. Each scheduling step corresponds to a single forward pass of the model. Therefore, this method is called repeatedly by a busy loop in the engine.

Essentially, the scheduler produces a dictionary of {req_id: num_tokens} that specifies how many tokens to process for each request in this scheduling step. For a new request, num_tokens can be as large as the number of prompt tokens; for a request that is auto-regressively generating tokens one at a time, it is 1. It can also fall anywhere in between with chunked prefills, prefix caching, speculative decoding, and so on.

Additionally, the scheduler also returns useful data about each request or the batch as a whole. The model runner will use this information in preparing inputs to the model.

Returns: A SchedulerOutput object containing information about the scheduled requests.
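The {req_id: num_tokens} idea described above can be illustrated with a toy token-budget loop. This is a simplified sketch under stated assumptions, not the Scheduler's real algorithm: ToyRequest, toy_schedule, and token_budget are hypothetical names, and the sketch ignores KV cache allocation, preemption, prefix caching, and speculative decoding.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class ToyRequest:
    """Hypothetical stand-in for a request; not the EasyDeL EngineRequest."""

    req_id: str
    num_prompt_tokens: int
    num_computed_tokens: int = 0


def toy_schedule(requests: list[ToyRequest], token_budget: int) -> dict[str, int]:
    """Assign each request a token count for one step, within a budget."""
    scheduled: dict[str, int] = {}
    for req in requests:
        if token_budget <= 0:
            break
        remaining_prompt = req.num_prompt_tokens - req.num_computed_tokens
        # Prefill wants the rest of the prompt; decode wants one token.
        wanted = remaining_prompt if remaining_prompt > 0 else 1
        # Capping by the budget yields a chunked prefill for long prompts.
        num_tokens = min(wanted, token_budget)
        scheduled[req.req_id] = num_tokens
        token_budget -= num_tokens
    return scheduled


requests = [
    ToyRequest("decode", num_prompt_tokens=8, num_computed_tokens=8),
    ToyRequest("short", num_prompt_tokens=4),
    ToyRequest("long", num_prompt_tokens=100),
]
plan = toy_schedule(requests, token_budget=16)
# plan == {"decode": 1, "short": 4, "long": 11}
```

Here the decoding request gets 1 token, the short prompt is prefilled in full, and the long prompt receives only the 11 tokens left in the budget, so its prefill would continue in later steps.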

shutdown() → None

Shut down the scheduler.

update_from_output(scheduler_output: SchedulerOutput, model_runner_output: ModelRunnerOutput) → dict[int, easydel.inference.esurge.engine_types.EngineCoreOutputs]

Update the scheduler state based on the model runner output.

This method is called after the model runner has processed the scheduled requests. The model runner output includes the generated token ids, the draft token ids for the next step, and so on. The scheduler uses this information to update its state, check for finished requests, and return the output for each request.

Returns: A dict of client index to EngineCoreOutputs object containing the outputs for each request originating from that client.
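The shape of the return value, a mapping from client index to that client's outputs, can be sketched as a simple grouping step. The names ToyOutput and group_by_client are hypothetical and exist only for this illustration; the real EngineCoreOutputs type carries more fields than shown here.

```python
from __future__ import annotations

from collections import defaultdict
from dataclasses import dataclass


@dataclass
class ToyOutput:
    """Hypothetical per-request output; not the EasyDeL output type."""

    req_id: str
    client_index: int
    new_token_ids: list[int]


def group_by_client(outputs: list[ToyOutput]) -> dict[int, list[ToyOutput]]:
    """Bucket per-request outputs by the client that submitted them."""
    grouped: dict[int, list[ToyOutput]] = defaultdict(list)
    for out in outputs:
        grouped[out.client_index].append(out)
    # Return a plain dict so a missing client raises KeyError on lookup.
    return dict(grouped)
```

Grouping this way lets each client receive only the outputs for its own requests, in the order they were produced.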
