easydel.inference.inference_engine_interface


Base interface for EasyDeL inference API servers.

This module provides abstract base classes and utilities for building standardized inference API servers with OpenAI API compatibility.

Classes:

  • ServerStatus: Enum representing server operational states
  • ServerMetrics: Dataclass for tracking server performance metrics
  • EndpointConfig: Configuration for API endpoints
  • ErrorResponse: Standard error response format
  • BaseInferenceApiServer: Abstract base class for inference servers
  • InferenceEngineAdapter: Abstract adapter for different inference engines

class easydel.inference.inference_engine_interface.BaseInferenceApiServer(max_workers: int | None = None, enable_cors: bool = True, cors_origins: list[str] | None = None, max_request_size: int = 10485760, request_timeout: float = 300.0, enable_function_calling: bool = True, default_function_format: FunctionCallFormat = FunctionCallFormat.OPENAI, server_name: str = 'EasyDeL Inference API Server', server_description: str = 'High-performance inference server with OpenAI API compatibility', server_version: str = '2.0.0', enable_auth_ui: bool = True)[source]#

Bases: ABC

Abstract base class for inference API servers.

This interface defines the standard structure and methods that all inference API servers should implement to ensure consistency across different inference modules.

abstract async chat_completions(request: ChatCompletionRequest, raw_request: Request) easydel.inference.openai_api_modules.ChatCompletionResponse | starlette.responses.StreamingResponse | starlette.responses.JSONResponse[source]#

Handle chat completion requests.

Parameters
  • request – The chat completion request

  • raw_request – Raw FastAPI request containing headers

Returns

Chat completion response (streaming or non-streaming)

abstract async completions(request: CompletionRequest, raw_request: Request) easydel.inference.openai_api_modules.CompletionResponse | starlette.responses.StreamingResponse | starlette.responses.JSONResponse[source]#

Handle completion requests.

Parameters
  • request – The completion request

  • raw_request – Raw FastAPI request containing headers

Returns

Completion response (streaming or non-streaming)

abstract async execute_tool(request: Request) JSONResponse[source]#

Execute a tool/function call.

Parameters

request – The tool execution request

Returns

Tool execution result

extract_tools(request: ChatCompletionRequest) list[dict] | None[source]#
fire(host: str = '0.0.0.0', port: int = 11556, workers: int = 1, log_level: str = 'info', ssl_keyfile: str | None = None, ssl_certfile: str | None = None, reload: bool = False) None#

Start the server with the given host, port, worker, logging, SSL, and reload settings.

Parameters
  • host – Host address to bind to

  • port – Port to listen on

  • workers – Number of worker processes

  • log_level – Logging level

  • ssl_keyfile – Path to SSL key file

  • ssl_certfile – Path to SSL certificate file

  • reload – Enable auto-reload for development

abstract async get_metrics(raw_request: Request) JSONResponse[source]#

Get server performance metrics.

Parameters

raw_request – Raw FastAPI request containing headers

Returns

Server metrics information

abstract async get_model(model_id: str, raw_request: Request) JSONResponse[source]#

Get detailed information about a specific model.

Parameters
  • model_id – The model identifier

  • raw_request – Raw FastAPI request containing headers

Returns

Model details

abstract async health_check(raw_request: Request) JSONResponse[source]#

Perform a comprehensive health check.

Parameters

raw_request – Raw FastAPI request containing headers

Returns

Health status information

abstract async list_models(raw_request: Request) JSONResponse[source]#

List available models.

Parameters

raw_request – Raw FastAPI request containing headers

Returns

List of available models with metadata

abstract async list_tools(raw_request: Request) JSONResponse[source]#

List available tools/functions.

Parameters

raw_request – Raw FastAPI request containing headers

Returns

Available tools information

async on_shutdown() None[source]#

Hook for server shutdown.

Override in subclasses to perform cleanup tasks such as saving state, closing connections, or releasing resources. This method is called once when the server shuts down.

async on_startup() None[source]#

Hook for server startup.

Override in subclasses to perform custom initialization tasks such as loading models, establishing connections, or warming up caches. This method is called once when the server starts.
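
As a minimal sketch of the two lifecycle hooks described above, the snippet below overrides them in a subclass. `HookedServer` is a local stand-in that exposes only the hooks (it is not the EasyDeL class), and `CachingServer`, `run_lifecycle`, and the cache contents are hypothetical names chosen for illustration. How the real server schedules these hooks is not shown here.

```python
import asyncio


class HookedServer:
    """Stand-in exposing only the two lifecycle hooks (not the EasyDeL class)."""

    async def on_startup(self) -> None:
        """Default hook: does nothing."""

    async def on_shutdown(self) -> None:
        """Default hook: does nothing."""


class CachingServer(HookedServer):
    """Hypothetical subclass: warms a cache on startup, clears it on shutdown."""

    def __init__(self) -> None:
        self.cache = {}
        self.events = []

    async def on_startup(self) -> None:
        self.cache["greeting"] = "hello"  # e.g. warm up a cache
        self.events.append("startup")

    async def on_shutdown(self) -> None:
        self.cache.clear()  # e.g. release resources
        self.events.append("shutdown")


async def run_lifecycle(server: HookedServer) -> None:
    """Call each hook once: startup first, then shutdown."""
    await server.on_startup()
    await server.on_shutdown()


server = CachingServer()
asyncio.run(run_lifecycle(server))
```

Both hooks are coroutines, so any blocking work done inside them (such as loading a large model) would presumably need to be awaited or moved off the event loop.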

run(host: str = '0.0.0.0', port: int = 11556, workers: int = 1, log_level: str = 'info', ssl_keyfile: str | None = None, ssl_certfile: str | None = None, reload: bool = False) None[source]#

Start the server with the given host, port, worker, logging, SSL, and reload settings.

Parameters
  • host – Host address to bind to

  • port – Port to listen on

  • workers – Number of worker processes

  • log_level – Logging level

  • ssl_keyfile – Path to SSL key file

  • ssl_certfile – Path to SSL certificate file

  • reload – Enable auto-reload for development

class easydel.inference.inference_engine_interface.EndpointConfig(*, path: str, handler: Callable, methods: list[str], summary: str | None = None, tags: list[str] | None = None, response_model: Any = None)[source]#

Bases: BaseModel

Configuration for a FastAPI endpoint.

Defines the structure for registering API endpoints.

path#

URL path for the endpoint

Type

str

handler#

Callable that handles requests

Type

Callable

methods#

HTTP methods supported (GET, POST, etc.)

Type

list[str]

summary#

Brief description of the endpoint

Type

str | None

tags#

Tags for API documentation grouping

Type

list[str] | None

response_model#

Pydantic model for response validation

Type

Any

handler: tp.Callable#
methods: list[str]#
model_config: ClassVar[ConfigDict] = {}#

Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict.

path: str#
response_model: tp.Any#
summary: str | None#
tags: list[str] | None#
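
The fields documented above can be illustrated with a small, dependency-free sketch. The `EndpointConfig` below is a dataclass stand-in carrying the same field names (the real class is a pydantic `BaseModel` with keyword-only arguments); the `health_check` handler, the `/health` path, the summary, the tag, and the `routes` registry are all assumptions made for the example, not values taken from EasyDeL.

```python
from dataclasses import dataclass
from typing import Any, Callable, Optional


@dataclass
class EndpointConfig:
    """Stand-in with the documented fields (the real class is a pydantic model)."""

    path: str
    handler: Callable
    methods: list
    summary: Optional[str] = None
    tags: Optional[list] = None
    response_model: Any = None


def health_check() -> dict:
    """Hypothetical handler used only for this illustration."""
    return {"status": "ready"}


# Path, summary, and tag values are illustrative assumptions.
endpoints = [
    EndpointConfig(
        path="/health",
        handler=health_check,
        methods=["GET"],
        summary="Health check",
        tags=["Health"],
    ),
]

# One simple way to dispatch: a registry keyed by (HTTP method, path).
routes = {(method, e.path): e.handler for e in endpoints for method in e.methods}
result = routes[("GET", "/health")]()
```

Keeping the endpoint description as data, separate from the handler, lets a server build its route table in one pass over a list.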
class easydel.inference.inference_engine_interface.ErrorResponse(*, error: dict[str, str], request_id: str | None = None, timestamp: float = <factory>)[source]#

Bases: BaseModel

Standard error response model.

Provides a consistent error response format across all endpoints.

error#

Dictionary containing error message and type

Type

dict[str, str]

request_id#

Optional unique identifier for the request

Type

str | None

timestamp#

Unix timestamp when the error occurred

Type

float

error: dict[str, str]#
model_config: ClassVar[ConfigDict] = {}#

Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict.

request_id: str | None#
timestamp: float#
class easydel.inference.inference_engine_interface.InferenceEngineAdapter[source]#

Bases: ABC

Abstract adapter interface for different inference engines.

This allows different inference engines (eSurge, vLLM, TGI, etc.) to be used with the same API server interface.

abstract count_tokens(content: str) int[source]#

Count tokens in the given content.

Parameters

content – Text content

Returns

Number of tokens

abstract async generate(prompts: str | list[str], sampling_params: SamplingParams, stream: bool = False) list[ReturnSample] | tp.AsyncGenerator[list[ReturnSample], None][source]#

Generate text from prompts.

Parameters
  • prompts – Input prompts

  • sampling_params – Sampling parameters

  • stream – Whether to stream the response

Returns

Generated samples (list or async generator)

abstract get_model_info() dict[str, Any][source]#

Get information about the loaded model.

Returns

Model information dictionary

abstract property model_name: str#

Get the name of the model.

abstract property processor: Any#

Get the processor/tokenizer for the model.
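
To make the adapter contract concrete, here is a hedged sketch of a toy implementation. The `InferenceEngineAdapter` below is a local stand-in that mirrors the documented abstract members so the snippet runs without EasyDeL installed; `EchoAdapter` is hypothetical, returns plain strings in place of `ReturnSample` objects, ignores `sampling_params`, and leaves streaming out.

```python
import asyncio
from abc import ABC, abstractmethod
from typing import Any


class InferenceEngineAdapter(ABC):
    """Stand-in mirroring the documented abstract members."""

    @abstractmethod
    async def generate(self, prompts, sampling_params, stream=False): ...

    @abstractmethod
    def count_tokens(self, content: str) -> int: ...

    @abstractmethod
    def get_model_info(self) -> dict: ...

    @property
    @abstractmethod
    def model_name(self) -> str: ...

    @property
    @abstractmethod
    def processor(self) -> Any: ...


class EchoAdapter(InferenceEngineAdapter):
    """Hypothetical toy engine: echoes each prompt in upper case."""

    async def generate(self, prompts, sampling_params, stream=False):
        if stream:
            raise NotImplementedError("streaming is omitted from this sketch")
        batch = [prompts] if isinstance(prompts, str) else list(prompts)
        # Plain strings stand in for ReturnSample objects here.
        return [p.upper() for p in batch]

    def count_tokens(self, content: str) -> int:
        return len(content.split())  # whitespace split, not a real tokenizer

    def get_model_info(self) -> dict:
        return {"name": self.model_name, "engine": "echo"}

    @property
    def model_name(self) -> str:
        return "echo-model"

    @property
    def processor(self) -> Any:
        return None  # a real adapter would return its tokenizer/processor


adapter = EchoAdapter()
outputs = asyncio.run(adapter.generate("hello", sampling_params=None))
```

Because every member is abstract, a subclass that omits any of them cannot be instantiated, which is what keeps different engines interchangeable behind the same server interface.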

class easydel.inference.inference_engine_interface.ServerMetrics(total_requests: int = 0, successful_requests: int = 0, failed_requests: int = 0, total_tokens_generated: int = 0, average_tokens_per_second: float = 0.0, uptime_seconds: float = 0.0, start_time: float = <factory>)[source]#

Bases: object

Server performance metrics.

Tracks key performance indicators for the inference server.

total_requests#

Total number of requests received

Type

int

successful_requests#

Number of successfully completed requests

Type

int

failed_requests#

Number of failed requests

Type

int

total_tokens_generated#

Total tokens generated across all requests

Type

int

average_tokens_per_second#

Average generation speed, in tokens per second

Type

float

uptime_seconds#

Seconds elapsed since the server started

Type

float

start_time#

Unix timestamp when server started

Type

float

average_tokens_per_second: float = 0.0#
failed_requests: int = 0#
start_time: float#
successful_requests: int = 0#
total_requests: int = 0#
total_tokens_generated: int = 0#
uptime_seconds: float = 0.0#
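
The sketch below shows one way counters like these might be kept up to date. `ServerMetrics` here is a local dataclass stand-in with the documented fields and defaults; `record_request` and `refresh` are hypothetical helpers, and the rule that the average equals total tokens divided by uptime is an assumption made for illustration, not the documented EasyDeL behavior.

```python
import time
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ServerMetrics:
    """Stand-in with the documented fields and defaults."""

    total_requests: int = 0
    successful_requests: int = 0
    failed_requests: int = 0
    total_tokens_generated: int = 0
    average_tokens_per_second: float = 0.0
    uptime_seconds: float = 0.0
    start_time: float = field(default_factory=time.time)


def record_request(metrics: ServerMetrics, ok: bool, tokens: int = 0) -> None:
    """Hypothetical helper: count one finished request."""
    metrics.total_requests += 1
    if ok:
        metrics.successful_requests += 1
        metrics.total_tokens_generated += tokens
    else:
        metrics.failed_requests += 1


def refresh(metrics: ServerMetrics, now: Optional[float] = None) -> None:
    """Hypothetical helper: recompute the derived fields.

    Assumption: average speed = total tokens / uptime.
    """
    current = time.time() if now is None else now
    metrics.uptime_seconds = current - metrics.start_time
    if metrics.uptime_seconds > 0:
        metrics.average_tokens_per_second = (
            metrics.total_tokens_generated / metrics.uptime_seconds
        )


metrics = ServerMetrics(start_time=100.0)  # fixed start time for a reproducible run
record_request(metrics, ok=True, tokens=50)
record_request(metrics, ok=False)
refresh(metrics, now=110.0)
```

Passing `now` explicitly keeps the example deterministic; a live server would normally let it default to the current time.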
class easydel.inference.inference_engine_interface.ServerStatus(value, names=None, *, module=None, qualname=None, type=None, start=1, boundary=None)[source]#

Bases: str, Enum

Server status enumeration.

Represents the operational state of an inference server.

STARTING#

Server is initializing

READY#

Server is ready to accept requests

BUSY#

Server is processing requests at capacity

ERROR#

Server encountered an error

SHUTTING_DOWN#

Server is gracefully shutting down

BUSY = 'busy'#
ERROR = 'error'#
READY = 'ready'#
SHUTTING_DOWN = 'shutting_down'#
STARTING = 'starting'#
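
Because `ServerStatus` derives from both `str` and `Enum`, its members compare equal to their string values and can be looked up by value. The sketch below reproduces the documented members in a local stand-in so it runs without EasyDeL installed; `accepts_requests` is a hypothetical helper, and treating only `READY` as accepting new work is an assumption.

```python
from enum import Enum


class ServerStatus(str, Enum):
    """Stand-in reproducing the documented members and values."""

    STARTING = "starting"
    READY = "ready"
    BUSY = "busy"
    ERROR = "error"
    SHUTTING_DOWN = "shutting_down"


def accepts_requests(status: ServerStatus) -> bool:
    """Hypothetical helper: only READY is treated as able to take new work."""
    return status is ServerStatus.READY


status = ServerStatus("ready")       # look up a member by its string value
equals_plain_string = status == "ready"  # holds because the enum subclasses str
wire_value = status.value            # the plain string to put in a response body
```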
easydel.inference.inference_engine_interface.create_error_response(status_code: HTTPStatus, message: str, request_id: str | None = None) JSONResponse[source]#

Creates a standardized JSON error response.

Parameters
  • status_code – HTTP status code for the error

  • message – Human-readable error message

  • request_id – Optional request identifier for tracking

Returns

JSONResponse with error details and appropriate status code
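
A rough, framework-free sketch of what such a helper might assemble is shown below. It builds a plain dictionary shaped like the `ErrorResponse` fields instead of a `JSONResponse`; `build_error_payload` is a hypothetical name, and the `"message"` and `"type"` keys inside `error`, as well as using the status name for `"type"`, are assumptions not confirmed by this page.

```python
import time
from http import HTTPStatus
from typing import Optional, Tuple


def build_error_payload(
    status_code: HTTPStatus,
    message: str,
    request_id: Optional[str] = None,
) -> Tuple[int, dict]:
    """Hypothetical helper: return (numeric status, body) for an error response."""
    body = {
        # Key names inside "error" are assumed for illustration.
        "error": {"message": message, "type": status_code.name},
        "request_id": request_id,
        "timestamp": time.time(),
    }
    return status_code.value, body


status, body = build_error_payload(
    HTTPStatus.BAD_REQUEST, "Prompt must not be empty", request_id="req-1"
)
```

In a real server the body would be wrapped in a JSON response carrying the same numeric status code.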