easydel.inference.vsurge.vsurge

easydel.inference.vsurge.vsurge#

class easydel.inference.vsurge.vsurge.vSurge(driver: Union[vDriver, oDriver], vsurge_name: str | None = None)[source]#

Bases: object

Orchestrates the interaction between client requests and the vDriver.

compile()[source]#

async complete(request: vSurgeRequest) → tp.AsyncGenerator[tp.List[ReturnSample]][source]#

Initiates and streams the results of a text completion request.

Creates an ActiveRequest using the plain prompt from the vSurgeRequest, places it on the driver’s prefill queue, and then asynchronously iterates through the results provided by the ActiveRequest’s return_channel.

It handles both client-side and server-side tokenization scenarios, buffering and processing results appropriately before yielding them.

Parameters: request – The vSurgeRequest containing the prompt and generation parameters.
Yields: Processed generation results, similar to the decode method. The format depends on the tokenization mode.
Raises: RuntimeError – If the prefill queue is full when trying to place the request.

count_tokens(text_or_conversation: Union[str, list]) → int[source]#

Counts the number of tokens in a given string or conversation list.

Uses the underlying driver’s processor to tokenize the input and returns the count of tokens.

Parameters: text_or_conversation – Either a single string or a list of message dictionaries (like OpenAI chat format).
Returns: The total number of tokens in the input.
Raises: ValueError – If the input type is invalid or tokenization fails.

classmethod create_odriver(model: Any, processor: Any, storage: Optional[PagedAttentionCache] = None, manager: Optional[HBMPageManager] = None, page_size: int = 128, hbm_utilization: float = 0.6, max_concurrent_prefill: int | None = None, max_concurrent_decodes: int | None = None, prefill_lengths: int | None = None, max_prefill_length: int | None = None, max_length: int | None = None, seed: int = 894, vsurge_name: str | None = None) → vSurge[source]#

classmethod create_vdriver(model: Any, processor: Any, max_concurrent_decodes: int | None = None, prefill_lengths: int | None = None, max_prefill_length: int | None = None, max_length: int | None = None, seed: int = 894, vsurge_name: str | None = None) → vSurge[source]#

Creates a new instance of vSurge with configured vDriver and vEngines.

This class method provides a convenient way to instantiate the vSurge by setting up the necessary prefill and decode engines with the provided model, processor, and configuration parameters.

Parameters

model – The EasyDeLBaseModule instance representing the model.
processor – The tokenizer/processor instance.
max_concurrent_decodes – Maximum number of concurrent decode requests the decode engine can handle.
prefill_lengths – A list of prefill lengths to compile for the prefill engine.
max_prefill_length – The maximum prefill length for the prefill engine.
max_length – The maximum total sequence length for both engines.
seed – The random seed for reproducibility.
vsurge_name – An optional name for the vsurge.

Returns

A new instance of vSurge.

property driver#: Provides access to the underlying vDriver instance.

async generate(prompts: tp.Union[str, tp.Sequence[str]], sampling_params: tp.Optional[tp.Union[SamplingParams, tp.Sequence[SamplingParams]]] = None, stream: bool = False) → tp.Union[tp.List[ReturnSample], tp.AsyncGenerator[tp.List[ReturnSample]]][source]#

Generates text completions concurrently for the given prompts.

Parameters

prompts – A single prompt string or a list of prompt strings.
sampling_params – A single SamplingParams object or a list of SamplingParams objects. If None, default SamplingParams will be used. If a single SamplingParams object is provided with multiple prompts, it will be applied to all prompts. If a list is provided, it must have the same length as the prompts list.
stream – If True, yields results (List[ReturnSample]) from any request as they become available. The list corresponds to one generation step from one request. If False, waits for all requests to complete and returns a list containing one aggregated ReturnSample per prompt.

Returns

An async generator yielding lists of ReturnSample as: steps complete across concurrent requests.
If stream is False: A list of aggregated ReturnSample objects, one for: each input prompt, after all requests have finished.

Return type

If stream is True

Raises

ValueError – If the lengths of prompts and sampling_params lists mismatch.
RuntimeError – If the underlying driver’s queue is full.

process_client_side_tokenization_response(response: list[easydel.inference.vsurge.utils.ReturnSample])[source]#

Processes responses when tokenization is handled client-side.

In this case, the response items (ReturnSample) are typically yielded directly without further server-side processing like detokenization or buffering.

Parameters: response – A list of ReturnSample objects from a single generation step.
Returns: The input list of ReturnSample objects, unchanged.

process_server_side_tokenization_response(response: list[easydel.inference.vsurge.utils.ReturnSample], buffered_response_list: list[list[easydel.inference.vsurge.utils.ReturnSample]]) → list[easydel.inference.vsurge.utils.ReturnSample][source]#

Processes responses when tokenization/detokenization is server-side.

Combines the text and token IDs from the current response and any buffered previous responses for each sample. It then uses the metrics (TPS, generated token count) from the latest response in the sequence for the final output.

Parameters

response – The list of ReturnSample objects from the current step.
buffered_response_list – A list containing lists of ReturnSample objects from previous steps that were buffered.

Returns

A list of tuples, where each tuple represents a completed sample and contains: (decoded_string, all_token_ids, latest_tps, latest_num_generated_tokens).

property processor: Any#: Returns the processor/tokenizer associated with the underlying driver.

should_buffer_response(response: list[easydel.inference.vsurge.utils.ReturnSample]) → bool[source]#

Determines if a response needs buffering for server-side detokenization.

Buffering is needed if any sample in the response ends with a byte token (e.g., “<0xAB>”), as this indicates an incomplete multi-byte character that requires subsequent tokens for proper decoding.

Parameters: response – A list of ReturnSample objects from a single generation step.
Returns: True if buffering is required, False otherwise.

start()[source]#

stop()[source]#

property vsurge_name#

class easydel.inference.vsurge.vsurge.vSurgeMetadata[source]#

Bases: object

Tracks timing information for requests processed by the vsurge.

start_time#: The time when the request processing started.

class easydel.inference.vsurge.vsurge.vSurgeRequest(prompt: str, max_tokens: int, top_p: float = 1.0, top_k: int = 0, min_p: float = 0.0, temperature: float = 0.7, presence_penalty: float = 0.0, frequency_penalty: float = 0.0, repetition_penalty: float = 1.0, metadata: easydel.inference.vsurge.vsurge.vSurgeMetadata | None = None, is_client_side_tokenization: bool = False)[source]#

Bases: object

Represents a request specifically for text completion.

frequency_penalty: float = 0.0#

classmethod from_sampling_params(prompt: str, sampling_params: SamplingParams)[source]#

is_client_side_tokenization: bool = False#

max_tokens: int#

metadata: easydel.inference.vsurge.vsurge.vSurgeMetadata | None = None#

min_p: float = 0.0#

presence_penalty: float = 0.0#

prompt: str#

repetition_penalty: float = 1.0#

temperature: float = 0.7#

top_k: int = 0#

top_p: float = 1.0#

easydel.inference.vsurge.vsurge

Contents

easydel.inference.vsurge.vsurge#