easydel.inference.vsurge.vsurge#
- class easydel.inference.vsurge.vsurge.vSurge(driver: Union[vDriver, oDriver], vsurge_name: str | None = None)[source]#
Bases: object
Orchestrates the interaction between client requests and the vDriver.
- async complete(request: vSurgeRequest) tp.AsyncGenerator[tp.List[ReturnSample]][source]#
Initiates and streams the results of a text completion request.
Creates an ActiveRequest using the plain prompt from the vSurgeRequest, places it on the driver's prefill queue, and then asynchronously iterates through the results provided by the ActiveRequest's return_channel.
It handles both client-side and server-side tokenization scenarios, buffering and processing results appropriately before yielding them.
- Parameters
request – The vSurgeRequest containing the prompt and generation parameters.
- Yields
Processed generation results, similar to the decode method. The format depends on the tokenization mode.
- Raises
RuntimeError – If the prefill queue is full when trying to place the request.
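The request flow described for complete can be sketched with the standard library alone. This is a hypothetical stand-in, not EasyDeL's implementation: ToyRequest, toy_driver, and toy_complete are illustrative names, and asyncio.Queue is assumed here to play the role of both the prefill queue and the return channel.

```python
import asyncio
from dataclasses import dataclass, field


@dataclass
class ToyRequest:
    """Hypothetical stand-in for an active request with a return channel."""
    prompt: str
    return_channel: asyncio.Queue = field(default_factory=asyncio.Queue)


async def toy_driver(prefill_queue: asyncio.Queue) -> None:
    """Pretend driver: takes one request and emits one 'token' per word."""
    request = await prefill_queue.get()
    for word in request.prompt.split():
        await request.return_channel.put([word])
    await request.return_channel.put(None)  # sentinel: generation finished


async def toy_complete(prompt: str, prefill_queue: asyncio.Queue):
    """Place a request on the prefill queue, then stream its results."""
    request = ToyRequest(prompt)
    try:
        prefill_queue.put_nowait(request)
    except asyncio.QueueFull as exc:
        # Mirrors the documented behaviour: a full queue is a RuntimeError.
        raise RuntimeError("prefill queue is full") from exc
    while (step := await request.return_channel.get()) is not None:
        yield step


async def demo() -> list:
    """Run one toy request end to end and collect every streamed step."""
    queue: asyncio.Queue = asyncio.Queue(maxsize=1)
    driver_task = asyncio.create_task(toy_driver(queue))
    steps = [step async for step in toy_complete("hello streaming world", queue)]
    await driver_task
    return steps
```

Calling asyncio.run(demo()) collects one list per generation step. The bounded queue is what makes the RuntimeError path reachable: a second request placed before the driver drains the first would be rejected rather than silently waiting.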
- count_tokens(text_or_conversation: Union[str, list]) int[source]#
Counts the number of tokens in a given string or conversation list.
Uses the underlying driver's processor to tokenize the input and returns the count of tokens.
- Parameters
text_or_conversation – Either a single string or a list of message dictionaries (like OpenAI chat format).
- Returns
The total number of tokens in the input.
- Raises
ValueError – If the input type is invalid or tokenization fails.
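The two accepted input shapes can be illustrated with a small sketch. It is an assumption-laden toy: whitespace_tokenize and toy_count_tokens are hypothetical names, the tokenizer simply splits on whitespace, and a real processor would likely apply a chat template to conversations, which this sketch ignores.

```python
from typing import Callable, Union


def whitespace_tokenize(text: str) -> list:
    """Toy tokenizer (assumption): one token per whitespace-separated word."""
    return text.split()


def toy_count_tokens(
    text_or_conversation: Union[str, list],
    tokenize: Callable[[str], list] = whitespace_tokenize,
) -> int:
    """Count tokens in a string or in an OpenAI-style list of messages."""
    if isinstance(text_or_conversation, str):
        return len(tokenize(text_or_conversation))
    if isinstance(text_or_conversation, list):
        total = 0
        for message in text_or_conversation:
            if not isinstance(message, dict) or "content" not in message:
                raise ValueError("each message must be a dict with a 'content' key")
            total += len(tokenize(message["content"]))
        return total
    # Anything else is rejected, matching the documented ValueError.
    raise ValueError(
        f"unsupported input type: {type(text_or_conversation).__name__}"
    )
```

For example, toy_count_tokens("a b c") counts three toy tokens, and a two-message conversation is counted message by message over each "content" field.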
- classmethod create_odriver(model: Any, processor: Any, storage: Optional[PagedAttentionCache] = None, manager: Optional[HBMPageManager] = None, page_size: int = 128, hbm_utilization: float = 0.6, max_concurrent_prefill: int | None = None, max_concurrent_decodes: int | None = None, prefill_lengths: int | None = None, max_prefill_length: int | None = None, max_length: int | None = None, seed: int = 894, vsurge_name: str | None = None) vSurge[source]#
- classmethod create_vdriver(model: Any, processor: Any, max_concurrent_decodes: int | None = None, prefill_lengths: int | None = None, max_prefill_length: int | None = None, max_length: int | None = None, seed: int = 894, vsurge_name: str | None = None) vSurge[source]#
Creates a new instance of vSurge with configured vDriver and vEngines.
This class method provides a convenient way to instantiate the vSurge by setting up the necessary prefill and decode engines with the provided model, processor, and configuration parameters.
- Parameters
model – The EasyDeLBaseModule instance representing the model.
processor – The tokenizer/processor instance.
max_concurrent_decodes – Maximum number of concurrent decode requests the decode engine can handle.
prefill_lengths – A list of prefill lengths to compile for the prefill engine.
max_prefill_length – The maximum prefill length for the prefill engine.
max_length – The maximum total sequence length for both engines.
seed – The random seed for reproducibility.
vsurge_name – An optional name for the vsurge.
- Returns
A new instance of vSurge.
- property driver#
Provides access to the underlying vDriver instance.
- async generate(prompts: tp.Union[str, tp.Sequence[str]], sampling_params: tp.Optional[tp.Union[SamplingParams, tp.Sequence[SamplingParams]]] = None, stream: bool = False) tp.Union[tp.List[ReturnSample], tp.AsyncGenerator[tp.List[ReturnSample]]][source]#
Generates text completions concurrently for the given prompts.
- Parameters
prompts – A single prompt string or a list of prompt strings.
sampling_params – A single SamplingParams object or a list of SamplingParams objects. If None, default SamplingParams will be used. If a single SamplingParams object is provided with multiple prompts, it will be applied to all prompts. If a list is provided, it must have the same length as the prompts list.
stream – If True, yields results (List[ReturnSample]) from any request as they become available. The list corresponds to one generation step from one request. If False, waits for all requests to complete and returns a list containing one aggregated ReturnSample per prompt.
- Returns
- If stream is True: An async generator yielding lists of ReturnSample as steps complete across concurrent requests.
- If stream is False: A list of aggregated ReturnSample objects, one for each input prompt, after all requests have finished.
- Raises
ValueError – If the lengths of prompts and sampling_params lists mismatch.
RuntimeError – If the underlying driver's queue is full.
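The pairing rules for prompts and sampling_params can be sketched as a small normalization step. This is a minimal illustration under stated assumptions: ToySamplingParams is a hypothetical stand-in for SamplingParams (its fields are invented for the example), and pair_prompts_with_params is not part of the library.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence, Tuple, Union


@dataclass
class ToySamplingParams:
    """Hypothetical stand-in for SamplingParams; fields are illustrative."""
    max_tokens: int = 16
    temperature: float = 0.7


def pair_prompts_with_params(
    prompts: Union[str, Sequence[str]],
    sampling_params: Optional[
        Union[ToySamplingParams, Sequence[ToySamplingParams]]
    ] = None,
) -> List[Tuple[str, ToySamplingParams]]:
    """Return one (prompt, params) pair per prompt, following the rules above."""
    prompt_list = [prompts] if isinstance(prompts, str) else list(prompts)
    if sampling_params is None:
        # No params given: fall back to defaults for every prompt.
        params_list = [ToySamplingParams() for _ in prompt_list]
    elif isinstance(sampling_params, ToySamplingParams):
        # A single object is shared by all prompts.
        params_list = [sampling_params] * len(prompt_list)
    else:
        params_list = list(sampling_params)
        if len(params_list) != len(prompt_list):
            raise ValueError(
                f"got {len(prompt_list)} prompts but {len(params_list)} sampling_params"
            )
    return list(zip(prompt_list, params_list))
```

Normalizing up front means the rest of a concurrent generation loop can treat every request uniformly, and a length mismatch fails before any request reaches the driver.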
- process_client_side_tokenization_response(response: list[easydel.inference.vsurge.utils.ReturnSample])[source]#
Processes responses when tokenization is handled client-side.
In this case, the response items (ReturnSample) are typically yielded directly without further server-side processing like detokenization or buffering.
- Parameters
response – A list of ReturnSample objects from a single generation step.
- Returns
The input list of ReturnSample objects, unchanged.
- process_server_side_tokenization_response(response: list[easydel.inference.vsurge.utils.ReturnSample], buffered_response_list: list[list[easydel.inference.vsurge.utils.ReturnSample]]) list[easydel.inference.vsurge.utils.ReturnSample][source]#
Processes responses when tokenization/detokenization is server-side.
Combines the text and token IDs from the current response and any buffered previous responses for each sample. It then uses the metrics (TPS, generated token count) from the latest response in the sequence for the final output.
- Parameters
response – The list of ReturnSample objects from the current step.
buffered_response_list – A list containing lists of ReturnSample objects from previous steps that were buffered.
- Returns
A list of ReturnSample objects, one per sample, each carrying the combined decoded text, all accumulated token IDs, and the TPS and generated-token count from the latest response.
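The merge described above can be sketched as follows. This is an illustration, not the library's code: ToySample is a hypothetical stand-in for ReturnSample, its field names are assumptions, and text is combined by plain concatenation, whereas a real implementation may re-decode the accumulated token IDs instead.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ToySample:
    """Hypothetical stand-in for ReturnSample; field names are assumptions."""
    text: str
    token_ids: List[int]
    tokens_per_second: float
    num_generated_tokens: int


def merge_buffered(
    response: List[ToySample],
    buffered_response_list: List[List[ToySample]],
) -> List[ToySample]:
    """Combine buffered steps with the current step, sample by sample."""
    steps = buffered_response_list + [response]  # oldest step first
    merged = []
    # zip(*steps) groups the i-th sample of every step together.
    for history in zip(*steps):
        latest = history[-1]
        merged.append(
            ToySample(
                text="".join(s.text for s in history),
                token_ids=[t for s in history for t in s.token_ids],
                # Metrics come from the most recent step only.
                tokens_per_second=latest.tokens_per_second,
                num_generated_tokens=latest.num_generated_tokens,
            )
        )
    return merged
```

With an empty buffer the function returns samples equal to the current response, so the same code path can serve both the buffered and unbuffered cases.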
- property processor: Any#
Returns the processor/tokenizer associated with the underlying driver.
- should_buffer_response(response: list[easydel.inference.vsurge.utils.ReturnSample]) bool[source]#
Determines if a response needs buffering for server-side detokenization.
Buffering is needed if any sample in the response ends with a byte token (e.g., "<0xAB>"), as this indicates an incomplete multi-byte character that requires subsequent tokens for proper decoding.
- Parameters
response – A list of ReturnSample objects from a single generation step.
- Returns
True if buffering is required, False otherwise.
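The byte-token check can be sketched with a regular expression. This is a minimal sketch under assumptions: it operates on plain strings rather than ReturnSample objects, the names BYTE_TOKEN_AT_END and toy_should_buffer are hypothetical, and the pattern assumes byte tokens are written as "<0x" followed by exactly two hexadecimal digits and ">".

```python
import re
from typing import List

# Assumed byte-token spelling: "<0x" + two hex digits + ">" at the very end.
BYTE_TOKEN_AT_END = re.compile(r"<0x[0-9A-Fa-f]{2}>\Z")


def ends_with_byte_token(text: str) -> bool:
    """True if the text ends in a raw byte token such as '<0xC3>'."""
    return BYTE_TOKEN_AT_END.search(text) is not None


def toy_should_buffer(texts: List[str]) -> bool:
    """Buffer the whole step if any sample ends mid-character."""
    return any(ends_with_byte_token(text) for text in texts)
```

As an example of why this matters, "é" is encoded in UTF-8 as the two bytes 0xC3 0xA9, so a step ending in "<0xC3>" cannot be decoded until the token carrying 0xA9 arrives. A byte token in the middle of a sample does not trigger buffering in this sketch, because only the trailing position is checked.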
- property vsurge_name#
- class easydel.inference.vsurge.vsurge.vSurgeMetadata[source]#
Bases: object
Tracks timing information for requests processed by the vsurge.
- start_time#
The time when the request processing started.
- class easydel.inference.vsurge.vsurge.vSurgeRequest(prompt: str, max_tokens: int, top_p: float = 1.0, top_k: int = 0, min_p: float = 0.0, temperature: float = 0.7, presence_penalty: float = 0.0, frequency_penalty: float = 0.0, repetition_penalty: float = 1.0, metadata: easydel.inference.vsurge.vsurge.vSurgeMetadata | None = None, is_client_side_tokenization: bool = False)[source]#
Bases: object
Represents a request specifically for text completion.
- frequency_penalty: float = 0.0#
- classmethod from_sampling_params(prompt: str, sampling_params: SamplingParams)[source]#
- is_client_side_tokenization: bool = False#
- max_tokens: int#
- metadata: easydel.inference.vsurge.vsurge.vSurgeMetadata | None = None#
- min_p: float = 0.0#
- presence_penalty: float = 0.0#
- prompt: str#
- repetition_penalty: float = 1.0#
- temperature: float = 0.7#
- top_k: int = 0#
- top_p: float = 1.0#