easydel.inference.vinference.api_server.__init__#
- class easydel.inference.vinference.api_server.__init__.ChatCompletionRequest(*, model: str, messages: ~typing.List[~easydel.inference.vinference.api_server.api_models.ChatMessage], max_tokens: int = 16, presence_penalty: float = 0.0, frequency_penalty: float = 0.0, repetition_penalty: float = 1.0, temperature: float = 0.0, top_p: float = 1.0, top_k: int = 0, min_p: float = 0.0, suppress_tokens: ~typing.List[int] = <factory>, functions: ~typing.Optional[~typing.List[~easydel.inference.vinference.api_server.api_models.FunctionDefinition]] = None, function_call: ~typing.Optional[~typing.Union[str, ~typing.Dict[str, ~typing.Any]]] = None, tools: ~typing.Optional[~typing.List[~easydel.inference.vinference.api_server.api_models.ToolDefinition]] = None, tool_choice: ~typing.Optional[~typing.Union[str, ~typing.Dict[str, ~typing.Any]]] = None, n: ~typing.Optional[int] = 1, stream: ~typing.Optional[bool] = False, stop: ~typing.Optional[~typing.Union[str, ~typing.List[str]]] = None, logit_bias: ~typing.Optional[~typing.Dict[str, float]] = None, user: ~typing.Optional[str] = None)[source]#
Bases:
BaseModelRepresents a request to the chat completion endpoint. Mirrors the OpenAI ChatCompletion request structure.
- frequency_penalty: float#
- function_call: Optional[Union[str, Dict[str, Any]]]#
- functions: Optional[List[FunctionDefinition]]#
- logit_bias: Optional[Dict[str, float]]#
- max_tokens: int#
- messages: List[ChatMessage]#
- min_p: float#
- model: str#
- model_config: ClassVar[ConfigDict] = {}#
Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].
- n: Optional[int]#
- presence_penalty: float#
- repetition_penalty: float#
- stop: Optional[Union[str, List[str]]]#
- stream: Optional[bool]#
- suppress_tokens: List[int]#
- temperature: float#
- tool_choice: Optional[Union[str, Dict[str, Any]]]#
- tools: Optional[List[ToolDefinition]]#
- top_k: int#
- top_p: float#
- user: Optional[str]#
- class easydel.inference.vinference.api_server.__init__.ChatCompletionResponse(*, id: str = <factory>, object: str = 'chat.completion', created: int = <factory>, model: str, choices: ~typing.List[~easydel.inference.vinference.api_server.api_models.ChatCompletionResponseChoice], usage: ~easydel.inference.vinference.api_server.api_models.UsageInfo)[source]#
Bases:
BaseModelRepresents a non-streaming response from the chat completion endpoint.
- choices: List[ChatCompletionResponseChoice]#
- created: int#
- id: str#
- model: str#
- model_config: ClassVar[ConfigDict] = {}#
Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].
- object: str#
- class easydel.inference.vinference.api_server.__init__.ChatCompletionResponseChoice(*, index: int, message: ChatMessage, finish_reason: Optional[Literal['stop', 'length', 'function_call']] = None)[source]#
Bases:
BaseModelRepresents a single choice within a non-streaming chat completion response.
- finish_reason: Optional[Literal['stop', 'length', 'function_call']]#
- index: int#
- message: ChatMessage#
- model_config: ClassVar[ConfigDict] = {}#
Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].
- class easydel.inference.vinference.api_server.__init__.ChatCompletionStreamResponse(*, id: str = <factory>, object: str = 'chat.completion.chunk', created: int = <factory>, model: str, choices: ~typing.List[~easydel.inference.vinference.api_server.api_models.ChatCompletionStreamResponseChoice], usage: ~easydel.inference.vinference.api_server.api_models.UsageInfo)[source]#
Bases:
BaseModelRepresents a single chunk in a streaming response from the chat completion endpoint.
- choices: List[ChatCompletionStreamResponseChoice]#
- created: int#
- id: str#
- model: str#
- model_config: ClassVar[ConfigDict] = {}#
Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].
- object: str#
- class easydel.inference.vinference.api_server.__init__.ChatCompletionStreamResponseChoice(*, index: int, delta: DeltaMessage, finish_reason: Optional[Literal['stop', 'length', 'function_call']] = None)[source]#
Bases:
BaseModelRepresents a single choice within a streaming chat completion response chunk.
- delta: DeltaMessage#
- finish_reason: Optional[Literal['stop', 'length', 'function_call']]#
- index: int#
- model_config: ClassVar[ConfigDict] = {}#
Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].
- class easydel.inference.vinference.api_server.__init__.ChatMessage(*, role: str, content: Union[str, List[Mapping[str, str]]], name: Optional[str] = None, function_call: Optional[Dict[str, Any]] = None)[source]#
Bases:
BaseModelRepresents a single message in a chat conversation.
- content: Union[str, List[Mapping[str, str]]]#
- function_call: Optional[Dict[str, Any]]#
- model_config: ClassVar[ConfigDict] = {}#
Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].
- name: Optional[str]#
- role: str#
- class easydel.inference.vinference.api_server.__init__.CountTokenRequest(*, model: str, conversation: Union[str, List[ChatMessage]])[source]#
Bases:
BaseModelRepresents a request to the token counting endpoint.
- conversation: Union[str, List[ChatMessage]]#
- model: str#
- model_config: ClassVar[ConfigDict] = {}#
Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].
- class easydel.inference.vinference.api_server.__init__.vInferenceApiServer(inference_map: Union[Dict[str, Any], Any] = None, inference_init_call: Optional[Callable[[], Any]] = None, max_workers: int = 10)[source]#
Bases:
objectFastAPI server for serving vInference instances.
This server provides endpoints mimicking the OpenAI API structure for chat completions, liveness/readiness checks, token counting, and listing available models. It handles both streaming and non-streaming requests asynchronously using a thread pool.
- async chat_completions(request: ChatCompletionRequest)[source]#
Handles chat completion requests (POST /v1/chat/completions).
Validates the request, retrieves the appropriate vInference model, tokenizes the input, and delegates to streaming or non-streaming handlers.
- Parameters
request (ChatCompletionRequest) โ The incoming request data.
- Returns
- The generated response, either
a complete JSON object or a streaming event-stream.
- Return type
Union[JSONResponse, StreamingResponse]
- async completions(request: CompletionRequest)[source]#
Handles completion requests (POST /v1/completions).
Processes the prompt for completion and returns generated text.
- Parameters
request (CompletionRequest) โ The incoming request data.
- Returns
The generated response.
- Return type
Union[JSONResponse, StreamingResponse]
- async count_tokens(request: CountTokenRequest)[source]#
Token counting endpoint (POST /v1/count_tokens).
- fire(host='0.0.0.0', port=11556, metrics_port: Optional[int] = None, log_level='info', ssl_keyfile: Optional[str] = None, ssl_certfile: Optional[str] = None)[source]#
Starts the uvicorn server to run the FastAPI application.
- Parameters
host (str) โ The host address to bind to. Defaults to โ0.0.0.0โ.
port (int) โ The port to listen on. Defaults to 11556.
metrics_port (tp.Optional[int]) โ The port for the Prometheus metrics server. If None, defaults to port + 1. Set to -1 to disable.
log_level (str) โ The logging level for uvicorn. Defaults to โinfoโ.
ssl_keyfile (tp.Optional[str]) โ Path to the SSL key file for HTTPS.
ssl_certfile (tp.Optional[str]) โ Path to the SSL certificate file for HTTPS.
- class easydel.inference.vinference.api_server.__init__.vInferenceChatCompletionClient(base_url: str, max_retries: int = 5, timeout: float = 30.0)[source]#
Bases:
objectClient for interacting with the vInference Chat Completion API endpoint.
This client handles communication with the vInference server, including sending requests, handling responses (streaming or non-streaming), managing retries, and parsing errors.
- create_chat_completion(request: ChatCompletionRequest, extra_headers: Optional[dict] = None) Generator[Union[ChatCompletionStreamResponse, ChatCompletionResponse], None, None][source]#
Sends a chat completion request to the vInference API.
Handles both streaming and non-streaming responses based on the stream attribute in the request object.
- Parameters
request (ChatCompletionRequest) โ The chat completion request object.
extra_headers (tp.Optional[dict]) โ Optional dictionary of extra headers to include in the request. Defaults to None.
- Yields
tp.Union[ChatCompletionStreamResponse, ChatCompletionResponse] โ For streaming requests, yields ChatCompletionStreamResponse objects for each chunk received. For non-streaming requests, yields a single ChatCompletionResponse object.
- Raises
vInferenceAPIError โ If the API returns an error status code or if thereโs an issue parsing the response.
requests.RequestException โ For underlying network connection issues.