easydel.inference.vinference.vinference#

class easydel.inference.vinference.vinference.PromptOutput(*, text: Optional[str] = None, generated_tokens: Optional[int] = None, tokens_per_second: Optional[float] = None, error: Optional[str] = None, finish_reason: Optional[str] = None)[source]#

Bases: BaseModel

Structure for holding the output of a processed prompt.

error: Optional[str]#
finish_reason: Optional[str]#
generated_tokens: Optional[int]#
model_config: ClassVar[ConfigDict] = {'arbitrary_types_allowed': True}#

Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict.

text: Optional[str]#
tokens_per_second: Optional[float]#
class easydel.inference.vinference.vinference.vInference(model: None, processor_class: None, generation_config: Optional[vInferenceConfig] = None, seed: Optional[int] = None, input_partition_spec: Optional[PartitionSpec] = None, max_new_tokens: int = 512, inference_name: Optional[str] = None)[source]#

Bases: object

Class for performing text generation using a pre-trained language model in EasyDeL.

This class handles the generation process, including initialization, precompilation, and generating text in streaming chunks.

property SEQUENCE_DIM_MAPPING#
adjust_kwargs(input_ids: Array, attention_mask: Optional[Array] = None, **model_kwargs)[source]#
count_tokens(messages: List[Dict[str, str]], oai_like: bool = False)[source]#
count_tokens(text: str, oai_like: bool = False)
execute_decode(state: SampleState, *, graphstate: Optional[State[Key, VariableState[Any]]] = None, graphother: Optional[State[Key, VariableState[Any]]] = None, compile_config: Optional[vInferencePreCompileConfig] = None, sampling_params: Optional[SamplingParams] = None, func: Optional[Callable[[Any], SampleState]]) SampleState[source]#
execute_prefill(state: SampleState, *, graphstate: Optional[State[Key, VariableState[Any]]] = None, graphother: Optional[State[Key, VariableState[Any]]] = None, compile_config: Optional[vInferencePreCompileConfig] = None, sampling_params: Optional[SamplingParams] = None, func: Optional[Callable[[Any], SampleState]]) SampleState[source]#

Executes a single generation step with performance monitoring.

generate(input_ids: Array, attention_mask: Optional[Array] = None, *, graphstate: Optional[State[Key, VariableState[Any]]] = None, graphother: Optional[State[Key, VariableState[Any]]] = None, sampling_params: Optional[SamplingParams] = None, **model_kwargs) Generator[Union[SampleState, Any], SampleState, SampleState][source]#

Generates text in streaming chunks with comprehensive input adjustment.

Parameters
  • input_ids – Input token IDs as a JAX array

  • attention_mask – Optional attention mask for the input

  • graphstate (nn.GraphState, optional) – Graph state to use for generation in place of the model's current state.

  • graphother (nn.GraphState, optional) – Other state to use for generation in place of the model's current other state.

  • **model_kwargs – Additional model-specific keyword arguments

Returns

Generator yielding SampleState objects containing generation results and metrics

property inference_name#
classmethod load_inference(path: Union[PathLike, str], model: None, processor_class: None)[source]#
property metrics#
property model#
property model_prefill_length: int#

Calculate the maximum length available for input prefill by subtracting the maximum new tokens from the model’s maximum sequence length.

Returns

The maximum length available for input prefill

Return type

int

Raises

ValueError – If no maximum sequence length configuration is found
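
The calculation described for model_prefill_length can be sketched in plain Python. The helper name `prefill_length` and its arguments are hypothetical stand-ins, not part of EasyDeL; the sketch only illustrates the subtraction and the documented error case.

```python
from typing import Optional


def prefill_length(max_sequence_length: Optional[int], max_new_tokens: int) -> int:
    """Hypothetical helper: tokens left for the prompt after reserving room for generation."""
    if max_sequence_length is None:
        # Mirrors the documented ValueError when no maximum length is configured.
        raise ValueError("no maximum sequence length configuration found")
    return max_sequence_length - max_new_tokens


# With an assumed 4096-token context and the default max_new_tokens of 512:
print(prefill_length(4096, 512))  # 3584
```

Prompts longer than this value would leave fewer than max_new_tokens positions for generation, so callers may want to truncate or reject them before calling generate.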

precompile(config: vInferencePreCompileConfig)[source]#

Precompiles the generation functions for a given batch size and input length.

This function checks if the generation functions have already been compiled for the given configuration. If not, it compiles them asynchronously and stores them in a cache.

Returns

True if precompilation was successful, False otherwise.

Return type

bool
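
The check-then-compile behaviour described for precompile can be illustrated with a small cache. This is a simplified, synchronous sketch under stated assumptions: `CompileConfig`, `CompileCache`, and `fake_compile` are hypothetical names, the cache key is assumed to be a hash of the configuration, and returning True on a cache hit is an assumption rather than confirmed library behaviour.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List


@dataclass(frozen=True)
class CompileConfig:
    """Hypothetical stand-in for vInferencePreCompileConfig."""
    batch_size: int
    prefill_length: int


class CompileCache:
    """Hypothetical cache that compiles each distinct configuration once."""

    def __init__(self, compile_fn: Callable[[CompileConfig], Any]) -> None:
        self._compile_fn = compile_fn
        self._compiled: Dict[int, Any] = {}

    def precompile(self, config: CompileConfig) -> bool:
        key = hash(config)  # assumed keying scheme: one entry per config hash
        if key in self._compiled:
            return True  # assumption: an existing entry counts as success
        try:
            self._compiled[key] = self._compile_fn(config)
        except Exception:
            return False
        return True


calls: List[CompileConfig] = []


def fake_compile(config: CompileConfig) -> Callable[[Any], Any]:
    # Records each compilation so repeated requests can be observed.
    calls.append(config)
    return lambda x: x


cache = CompileCache(fake_compile)
cfg = CompileConfig(batch_size=1, prefill_length=1024)
cache.precompile(cfg)
cache.precompile(cfg)  # second call finds the cached entry
print(len(calls))  # 1
```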

process_prompt(prompt: Union[str, List[str], List[Dict[str, str]]], sampling_params: Union[SamplingParams, Dict], stream: bool = False) Union[PromptOutput, List[PromptOutput], Generator[str, None, SampleState], List[Generator[str, None, SampleState]]][source]#

Processes a prompt (string, list of strings, or OpenAI messages) and generates a response.

Parameters
  • prompt – The input prompt. Can be a single string, a list of strings (processed sequentially), or a list of dictionaries representing OpenAI-style chat messages (processed as a single batch).

  • sampling_params – Configuration for the generation process (temperature, top_p, etc.). Can be a SamplingParams object or a dictionary.

  • stream – If True, yields generated tokens incrementally. If False, returns the complete generation(s) at the end.

Returns
  • If input is str or List[Dict] and stream=False: A PromptOutput object containing the full text and metrics.

  • If input is str or List[Dict] and stream=True: A generator yielding string chunks. The generator's return value (accessible by catching StopIteration) is the final SampleState.

  • If input is List[str] and stream=False: A list of PromptOutput objects.

  • If input is List[str] and stream=True: A list in which each element is a generator as described above for the single-prompt streaming case.

Raises
  • TypeError – If the prompt format is invalid or processor does not support chat templates.

  • ValueError – If tokenization or processing fails.
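
In streaming mode the final state is delivered as the generator's return value, which a plain for loop discards. The sketch below shows one way to capture it using standard Python generator semantics. `fake_stream` and `drain` are hypothetical stand-ins, and a dict takes the place of the SampleState that process_prompt is documented to return.

```python
from typing import Any, Generator, Tuple


def fake_stream() -> Generator[str, None, dict]:
    """Stand-in for the generator returned by process_prompt(..., stream=True)."""
    yield "Hello"
    yield ", world"
    # The real generator is documented to return a SampleState; a dict stands in here.
    return {"generated_tokens": 2}


def drain(stream: Generator[str, None, Any]) -> Tuple[str, Any]:
    """Collect every chunk, then capture the generator's return value."""
    chunks = []
    while True:
        try:
            chunks.append(next(stream))
        except StopIteration as stop:
            # StopIteration.value carries whatever the generator returned.
            return "".join(chunks), stop.value


text, final_state = drain(fake_stream())
print(text)         # Hello, world
print(final_state)  # {'generated_tokens': 2}
```

For a List[str] prompt with stream=True, the same helper could be applied to each generator in the returned list.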

async process_prompts_concurrently(prompts: List[Union[str, List[Dict[str, str]]]], max_concurrent_requests: int, sampling_params: Optional[SamplingParams] = None, stream: bool = False, progress_callback: Optional[Callable[[int, int], None]] = None) Union[List[PromptOutput], AsyncGenerator[Tuple[int, str, Any], None]][source]#

Processes a list of prompts concurrently, supporting both streaming and non-streaming modes.

Parameters
  • prompts – A list of prompts (strings or OpenAI-style message lists).

  • max_concurrent_requests – The maximum number of prompts to process in parallel. If <= 0, processing will be sequential (but still async).

  • sampling_params – Optional sampling parameters to override the default ones for this batch of requests. Passed to process_prompt.

  • stream – If True, returns an async generator yielding tuples of (index, type, data). If False, returns a list of PromptOutput objects.

  • progress_callback – An optional function called after each prompt finishes processing. Receives (completed_count, total_count).

Returns
  • If stream=False: A list of PromptOutput objects containing the full text, metrics, or error information for each prompt, in the original order.

  • If stream=True: An async generator yielding tuples (index, type, data) where:
    • type is 'text' and data is the string chunk.

    • type is 'error' and data is the error string.

    • type is 'final' and data is the final PromptOutput object with metrics.

    The generator finishes when all prompts are processed.

Raises
  • ValueError – If input arguments are invalid.

  • RuntimeError – If an unexpected error occurs during processing.
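
Because streamed events from different prompts can interleave, a consumer typically groups them by index. The following is a minimal sketch of that pattern, assuming only the documented (index, type, data) tuple shape. `fake_events` and `collect` are hypothetical names, and a dict stands in for the final PromptOutput.

```python
import asyncio
from typing import Any, AsyncGenerator, Dict, List, Tuple


async def fake_events() -> AsyncGenerator[Tuple[int, str, Any], None]:
    """Stand-in for process_prompts_concurrently(..., stream=True)."""
    yield (1, "text", "B1")
    yield (0, "text", "A1")
    yield (0, "text", "A2")
    yield (1, "error", "boom")
    yield (0, "final", {"generated_tokens": 2})


async def collect(events: AsyncGenerator[Tuple[int, str, Any], None]):
    """Group interleaved events by prompt index."""
    texts: Dict[int, List[str]] = {}
    errors: Dict[int, str] = {}
    finals: Dict[int, Any] = {}
    async for index, kind, data in events:
        if kind == "text":
            texts.setdefault(index, []).append(data)
        elif kind == "error":
            errors[index] = data
        elif kind == "final":
            finals[index] = data
    joined = {i: "".join(parts) for i, parts in texts.items()}
    return joined, errors, finals


texts, errors, finals = asyncio.run(collect(fake_events()))
print(texts)   # {1: 'B1', 0: 'A1A2'}
print(errors)  # {1: 'boom'}
print(finals)  # {0: {'generated_tokens': 2}}
```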

save_inference(path: Union[PathLike, str])[source]#
property tokenizer#
class easydel.inference.vinference.vinference.vInferenceMetaData(*, inference_name: str, generation_config: vInferenceConfig, precompiled_configs: Dict[int, vInferencePreCompileConfig], in_compiling_process: set, input_partition_spec: PartitionSpec, uuid4: str)[source]#

Bases: BaseModel

generation_config: vInferenceConfig#
in_compiling_process: set#
inference_name: str#
input_partition_spec: PartitionSpec#
model_config: ClassVar[ConfigDict] = {'arbitrary_types_allowed': True}#

Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict.

precompiled_configs: Dict[int, vInferencePreCompileConfig]#
uuid4: str#