easydel.inference.inference_engine_interface


Base interface for EasyDeL inference API servers.

This module provides abstract base classes and utilities for building standardized inference API servers with OpenAI API compatibility.

Classes:

ServerStatus – Enum representing server operational states
ServerMetrics – Dataclass for tracking server performance metrics
EndpointConfig – Configuration for API endpoints
ErrorResponse – Standard error response format
BaseInferenceApiServer – Abstract base class for inference servers
InferenceEngineAdapter – Abstract adapter for different inference engines

class easydel.inference.inference_engine_interface.BaseInferenceApiServer(max_workers: int | None = None, enable_cors: bool = True, cors_origins: list[str] | None = None, max_request_size: int = 10485760, request_timeout: float = 300.0, enable_function_calling: bool = True, default_function_format: FunctionCallFormat = FunctionCallFormat.OPENAI, server_name: str = 'EasyDeL Inference API Server', server_description: str = 'High-performance inference server with OpenAI API compatibility', server_version: str = '2.0.0', enable_auth_ui: bool = True)[source]#

Bases: ABC

Abstract base class for inference API servers.

This interface defines the standard structure and methods that all inference API servers should implement to ensure consistency across different inference modules.

abstract async chat_completions(request: ChatCompletionRequest, raw_request: Request) → easydel.inference.openai_api_modules.ChatCompletionResponse | starlette.responses.StreamingResponse | starlette.responses.JSONResponse[source]#

Handle chat completion requests.

Parameters

request – The chat completion request
raw_request – Raw FastAPI request containing headers

Returns

Chat completion response (streaming or non-streaming)

abstract async completions(request: CompletionRequest, raw_request: Request) → easydel.inference.openai_api_modules.CompletionResponse | starlette.responses.StreamingResponse | starlette.responses.JSONResponse[source]#

Handle completion requests.

Parameters

request – The completion request
raw_request – Raw FastAPI request containing headers

Returns

Completion response (streaming or non-streaming)

abstract async execute_tool(request: Request) → JSONResponse[source]#

Execute a tool/function call.

Parameters: request – The tool execution request
Returns: Tool execution result

extract_tools(request: ChatCompletionRequest) → list[dict] | None[source]#

fire(host: str = '0.0.0.0', port: int = 11556, workers: int = 1, log_level: str = 'info', ssl_keyfile: str | None = None, ssl_certfile: str | None = None, reload: bool = False) → None#

Start the server with the given host, port, worker, logging, and SSL settings.

Parameters

host – Host address to bind to
port – Port to listen on
workers – Number of worker processes
log_level – Logging level
ssl_keyfile – Path to SSL key file
ssl_certfile – Path to SSL certificate file
reload – Enable auto-reload for development

abstract async get_metrics(raw_request: Request) → JSONResponse[source]#

Get server performance metrics.

Parameters: raw_request – Raw FastAPI request containing headers
Returns: Server metrics information

abstract async get_model(model_id: str, raw_request: Request) → JSONResponse[source]#

Get detailed information about a specific model.

Parameters

model_id – The model identifier
raw_request – Raw FastAPI request containing headers

Returns

Model details

abstract async health_check(raw_request: Request) → JSONResponse[source]#

Perform a comprehensive health check.

Parameters: raw_request – Raw FastAPI request containing headers
Returns: Health status information

abstract async list_models(raw_request: Request) → JSONResponse[source]#

List available models.

Parameters: raw_request – Raw FastAPI request containing headers
Returns: List of available models with metadata

abstract async list_tools(raw_request: Request) → JSONResponse[source]#

List available tools/functions.

Parameters: raw_request – Raw FastAPI request containing headers
Returns: Available tools information

async on_shutdown() → None[source]#

Hook for server shutdown.

Override in subclasses to perform cleanup tasks such as saving state, closing connections, or releasing resources. This method is called once when the server shuts down.

async on_startup() → None[source]#

Hook for server startup.

Override in subclasses to perform custom initialization tasks such as loading models, establishing connections, or warming up caches. This method is called once when the server starts.
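The two lifecycle hooks can be illustrated with a minimal sketch. It does not import EasyDeL: `HookedServer`, `CachingServer`, and `serve_once` are hypothetical stand-ins that only mirror the hook names `on_startup` and `on_shutdown`, and the driver assumes each hook is awaited exactly once around the serving phase, as described above.

```python
import asyncio
from typing import Optional


class HookedServer:
    """Hypothetical stand-in exposing the same two hook names as the base class."""

    async def on_startup(self) -> None:
        """Default hook: do nothing."""

    async def on_shutdown(self) -> None:
        """Default hook: do nothing."""


class CachingServer(HookedServer):
    """Example subclass that acquires a resource on startup and releases it on shutdown."""

    def __init__(self) -> None:
        self.cache: Optional[dict] = None
        self.events: list = []

    async def on_startup(self) -> None:
        # Stand-in for loading a model or warming a cache.
        self.cache = {"warm": "yes"}
        self.events.append("startup")

    async def on_shutdown(self) -> None:
        # Stand-in for closing connections or releasing resources.
        self.cache = None
        self.events.append("shutdown")


async def serve_once(server: HookedServer) -> None:
    """Hypothetical driver: await each hook once, before and after serving."""
    await server.on_startup()
    try:
        await asyncio.sleep(0)  # placeholder for request handling
    finally:
        await server.on_shutdown()


server = CachingServer()
asyncio.run(serve_once(server))
```

Wrapping the serving phase in `try`/`finally` is one way to make sure cleanup still runs if request handling raises; whether the real server does the same is not stated here.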

run(host: str = '0.0.0.0', port: int = 11556, workers: int = 1, log_level: str = 'info', ssl_keyfile: str | None = None, ssl_certfile: str | None = None, reload: bool = False) → None[source]#

Start the server with the given host, port, worker, logging, and SSL settings.

Parameters

host – Host address to bind to
port – Port to listen on
workers – Number of worker processes
log_level – Logging level
ssl_keyfile – Path to SSL key file
ssl_certfile – Path to SSL certificate file
reload – Enable auto-reload for development

class easydel.inference.inference_engine_interface.EndpointConfig(*, path: str, handler: Callable, methods: list[str], summary: str | None = None, tags: list[str] | None = None, response_model: Any = None)[source]#

Bases: BaseModel

Configuration for a FastAPI endpoint.

Defines the structure for registering API endpoints.

path#

URL path for the endpoint

Type: str

handler#

Callable that handles requests

Type: Callable

methods#

HTTP methods supported (GET, POST, etc.)

Type: list[str]

summary#

Brief description of the endpoint

Type: str | None

tags#

Tags for API documentation grouping

Type: list[str] | None

response_model#

Pydantic model for response validation

Type: Any

handler: tp.Callable#

methods: list[str]#

model_config: ClassVar[ConfigDict] = {}#: Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict.

path: str#

response_model: tp.Any#

summary: str | None#

tags: list[str] | None#
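As a rough illustration of how endpoint descriptions with these fields might be consumed, the sketch below uses a standard-library dataclass instead of the real Pydantic model. `EndpointSpec`, `build_route_table`, the `health` handler, and the `/health` path are all hypothetical and chosen only for the example.

```python
from dataclasses import dataclass
from typing import Any, Callable, Optional


@dataclass
class EndpointSpec:
    """Hypothetical stdlib stand-in mirroring the documented EndpointConfig fields."""

    path: str
    handler: Callable
    methods: list
    summary: Optional[str] = None
    tags: Optional[list] = None
    response_model: Any = None


def health() -> dict:
    """Toy handler used only for this illustration."""
    return {"status": "ready"}


def build_route_table(endpoints: list) -> dict:
    """Index handlers by (METHOD, path), the way a router might register them."""
    table = {}
    for ep in endpoints:
        for method in ep.methods:
            table[(method.upper(), ep.path)] = ep.handler
    return table


endpoints = [
    EndpointSpec(
        path="/health",
        handler=health,
        methods=["GET"],
        summary="Health check",
        tags=["Health"],
    ),
]
routes = build_route_table(endpoints)
print(routes[("GET", "/health")]())
```

Keeping the endpoint description as data, separate from the framework call that registers it, lets a subclass add or replace routes without touching the registration loop.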

class easydel.inference.inference_engine_interface.ErrorResponse(*, error: dict[str, str], request_id: str | None = None, timestamp: float = <factory>)[source]#

Bases: BaseModel

Standard error response model.

Provides a consistent error response format across all endpoints.

error#

Dictionary containing error message and type

Type: dict[str, str]

request_id#

Optional unique identifier for the request

Type: str | None

timestamp#

Unix timestamp when error occurred

Type: float

error: dict[str, str]#

model_config: ClassVar[ConfigDict] = {}#: Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict.

request_id: str | None#

timestamp: float#

class easydel.inference.inference_engine_interface.InferenceEngineAdapter[source]#

Bases: ABC

Abstract adapter interface for different inference engines.

This allows different inference engines (eSurge, vLLM, TGI, etc.) to be used with the same API server interface.

abstract count_tokens(content: str) → int[source]#

Count tokens in the given content.

Parameters: content – Text content
Returns: Number of tokens

abstract async generate(prompts: str | list[str], sampling_params: SamplingParams, stream: bool = False) → list[ReturnSample] | tp.AsyncGenerator[list[ReturnSample], None][source]#

Generate text from prompts.

Parameters

prompts – Input prompts
sampling_params – Sampling parameters
stream – Whether to stream the response

Returns

Generated samples (list or async generator)

abstract get_model_info() → dict[str, Any][source]#

Get information about the loaded model.

Returns: Model information dictionary

abstract property model_name: str#: Get the name of the model.

abstract property processor: Any#: Get the processor/tokenizer for the model.
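The adapter contract can be sketched without any real engine. In the example below, `AdapterSketch` is a hypothetical stand-in that repeats the documented member names, and `EchoAdapter` is a toy implementation. Plain strings stand in for `ReturnSample`, and `sampling_params` is ignored; both are assumptions made to keep the sketch self-contained.

```python
import asyncio
from abc import ABC, abstractmethod
from typing import Any


class AdapterSketch(ABC):
    """Hypothetical stand-in with the same member names as InferenceEngineAdapter."""

    @abstractmethod
    async def generate(self, prompts, sampling_params, stream: bool = False): ...

    @abstractmethod
    def count_tokens(self, content: str) -> int: ...

    @abstractmethod
    def get_model_info(self) -> dict: ...

    @property
    @abstractmethod
    def model_name(self) -> str: ...

    @property
    @abstractmethod
    def processor(self) -> Any: ...


class EchoAdapter(AdapterSketch):
    """Toy engine: 'generates' by upper-casing each prompt; tokens are whitespace-split words."""

    async def generate(self, prompts, sampling_params, stream: bool = False):
        batch = [prompts] if isinstance(prompts, str) else list(prompts)
        outputs = [p.upper() for p in batch]
        if not stream:
            return outputs  # non-streaming: the full list at once

        async def chunks():
            # Streaming: yield one single-item list per word, prompt by prompt.
            for text in outputs:
                for word in text.split():
                    yield [word]

        return chunks()

    def count_tokens(self, content: str) -> int:
        return len(content.split())

    def get_model_info(self) -> dict:
        return {"name": self.model_name}

    @property
    def model_name(self) -> str:
        return "echo"

    @property
    def processor(self) -> Any:
        return str.split  # stand-in for a tokenizer


async def demo():
    adapter = EchoAdapter()
    full = await adapter.generate("hello world", sampling_params=None)
    stream = await adapter.generate("hello world", None, stream=True)
    streamed = [chunk async for chunk in stream]
    return full, streamed


full, streamed = asyncio.run(demo())
```

Because `generate` is a coroutine in both modes, the caller awaits it first and only then iterates when `stream=True`, which matches the documented return type of a list or an async generator.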

class easydel.inference.inference_engine_interface.ServerMetrics(total_requests: int = 0, successful_requests: int = 0, failed_requests: int = 0, total_tokens_generated: int = 0, average_tokens_per_second: float = 0.0, uptime_seconds: float = 0.0, start_time: float = <factory>)[source]#

Bases: object

Server performance metrics.

Tracks key performance indicators for the inference server.

total_requests#

Total number of requests received

Type: int

successful_requests#

Number of successfully completed requests

Type: int

failed_requests#

Number of failed requests

Type: int

total_tokens_generated#

Total tokens generated across all requests

Type: int

average_tokens_per_second#

Average generation speed

Type: float

uptime_seconds#

Time since server started

Type: float

start_time#

Unix timestamp when server started

Type: float

average_tokens_per_second: float = 0.0#

failed_requests: int = 0#

start_time: float#

successful_requests: int = 0#

total_requests: int = 0#

total_tokens_generated: int = 0#

uptime_seconds: float = 0.0#
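The fields above can be exercised with a small sketch. `MetricsSketch` repeats the documented fields and defaults; the use of `time.time` as the `start_time` factory and the `record_request` update rule (a running mean of per-request speed) are assumptions for illustration, not the library's documented behavior.

```python
import time
from dataclasses import dataclass, field


@dataclass
class MetricsSketch:
    """Stand-in with the documented ServerMetrics fields and defaults."""

    total_requests: int = 0
    successful_requests: int = 0
    failed_requests: int = 0
    total_tokens_generated: int = 0
    average_tokens_per_second: float = 0.0
    uptime_seconds: float = 0.0
    start_time: float = field(default_factory=time.time)  # assumed factory


def record_request(m: MetricsSketch, tokens: int, elapsed: float, ok: bool = True) -> None:
    """Hypothetical update rule: count the request and refresh the derived fields."""
    if elapsed <= 0:
        raise ValueError("elapsed must be positive")
    m.total_requests += 1
    if ok:
        m.successful_requests += 1
        m.total_tokens_generated += tokens
        # Incremental mean of tokens/second over successful requests.
        speed = tokens / elapsed
        m.average_tokens_per_second += (
            speed - m.average_tokens_per_second
        ) / m.successful_requests
    else:
        m.failed_requests += 1
    m.uptime_seconds = time.time() - m.start_time


metrics = MetricsSketch()
record_request(metrics, tokens=100, elapsed=2.0)           # 50 tokens/s
record_request(metrics, tokens=300, elapsed=3.0)           # 100 tokens/s
record_request(metrics, tokens=0, elapsed=1.0, ok=False)   # failed request
```

After these three calls the sketch holds two successes, one failure, 400 generated tokens, and a mean speed of 75 tokens per second.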

class easydel.inference.inference_engine_interface.ServerStatus(value, names=None, *, module=None, qualname=None, type=None, start=1, boundary=None)[source]#

Bases: str, Enum

Server status enumeration.

Represents the operational state of an inference server.

STARTING#: Server is initializing

READY#: Server is ready to accept requests

BUSY#: Server is processing requests at capacity

ERROR#: Server encountered an error

SHUTTING_DOWN#: Server is gracefully shutting down

BUSY = 'busy'#

ERROR = 'error'#

READY = 'ready'#

SHUTTING_DOWN = 'shutting_down'#

STARTING = 'starting'#
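Because the enum derives from both `str` and `Enum`, its members behave like plain strings in comparisons and JSON serialization. The sketch below redefines a local mirror of the documented members so that it runs without EasyDeL installed; with the library available, the class would be imported instead.

```python
import json
from enum import Enum


class ServerStatus(str, Enum):
    """Local mirror of the documented members and values."""

    STARTING = "starting"
    READY = "ready"
    BUSY = "busy"
    ERROR = "error"
    SHUTTING_DOWN = "shutting_down"


# Look up a member from its string value, e.g. one parsed from a config file.
status = ServerStatus("ready")

# The str mixin makes members compare equal to their raw values.
equals_plain_string = status == "ready"

# Members serialize as their string values without a custom encoder.
payload = json.dumps({"status": ServerStatus.BUSY})
```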

easydel.inference.inference_engine_interface.create_error_response(status_code: HTTPStatus, message: str, request_id: str | None = None) → JSONResponse[source]#

Creates a standardized JSON error response.

Parameters

status_code – HTTP status code for the error
message – Human-readable error message
request_id – Optional request identifier for tracking

Returns

JSONResponse with error details and appropriate status code
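A rough sketch of the body such a helper might produce is shown below. It builds a plain dictionary with the fields documented for `ErrorResponse` (`error`, `request_id`, `timestamp`). The helper name `build_error_body`, the `message` and `type` key names, and the use of the HTTP reason phrase as the error type are assumptions; the real function returns a `JSONResponse` rather than a dictionary.

```python
import json
import time
from http import HTTPStatus
from typing import Optional


def build_error_body(
    status_code: HTTPStatus, message: str, request_id: Optional[str] = None
) -> dict:
    """Hypothetical helper: assemble a body with the documented ErrorResponse fields."""
    return {
        # Key names inside "error" are assumptions made for this sketch.
        "error": {"message": message, "type": status_code.phrase},
        "request_id": request_id,
        "timestamp": time.time(),
    }


body = build_error_body(
    HTTPStatus.BAD_REQUEST, "max_tokens must be positive", request_id="req-123"
)
encoded = json.dumps(body)        # what would be sent as the response body
http_code = HTTPStatus.BAD_REQUEST.value  # numeric status for the response
```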
