easydel.inference.inference_engine_interface
Base interface for EasyDeL inference API servers.
This module provides abstract base classes and utilities for building standardized inference API servers with OpenAI API compatibility.
Classes:
- ServerStatus: Enum representing server operational states
- ServerMetrics: Dataclass for tracking server performance metrics
- EndpointConfig: Configuration for API endpoints
- ErrorResponse: Standard error response format
- BaseInferenceApiServer: Abstract base class for inference servers
- InferenceEngineAdapter: Abstract adapter for different inference engines
- class easydel.inference.inference_engine_interface.BaseInferenceApiServer(max_workers: int | None = None, enable_cors: bool = True, cors_origins: list[str] | None = None, max_request_size: int = 10485760, request_timeout: float = 300.0, enable_function_calling: bool = True, default_function_format: FunctionCallFormat = FunctionCallFormat.OPENAI, server_name: str = 'EasyDeL Inference API Server', server_description: str = 'High-performance inference server with OpenAI API compatibility', server_version: str = '2.0.0', enable_auth_ui: bool = True)
Bases: ABC
Abstract base class for inference API servers.
This interface defines the standard structure and methods that all inference API servers should implement to ensure consistency across different inference modules.
- abstract async chat_completions(request: ChatCompletionRequest, raw_request: Request) -> easydel.inference.openai_api_modules.ChatCompletionResponse | starlette.responses.StreamingResponse | starlette.responses.JSONResponse
Handle chat completion requests.
- Parameters
request – The chat completion request
raw_request – Raw FastAPI request containing headers
- Returns
Chat completion response (streaming or non-streaming)
- abstract async completions(request: CompletionRequest, raw_request: Request) -> easydel.inference.openai_api_modules.CompletionResponse | starlette.responses.StreamingResponse | starlette.responses.JSONResponse
Handle completion requests.
- Parameters
request – The completion request
raw_request – Raw FastAPI request containing headers
- Returns
Completion response (streaming or non-streaming)
- abstract async execute_tool(request: Request) -> JSONResponse
Execute a tool/function call.
- Parameters
request – The tool execution request
- Returns
Tool execution result
- extract_tools(request: ChatCompletionRequest) -> list[dict] | None
- fire(host: str = '0.0.0.0', port: int = 11556, workers: int = 1, log_level: str = 'info', ssl_keyfile: str | None = None, ssl_certfile: str | None = None, reload: bool = False) -> None
Start the server with enhanced configuration.
- Parameters
host – Host address to bind to
port – Port to listen on
workers – Number of worker processes
log_level – Logging level
ssl_keyfile – Path to SSL key file
ssl_certfile – Path to SSL certificate file
reload – Enable auto-reload for development
- abstract async get_metrics(raw_request: Request) -> JSONResponse
Get server performance metrics.
- Parameters
raw_request – Raw FastAPI request containing headers
- Returns
Server metrics information
- abstract async get_model(model_id: str, raw_request: Request) -> JSONResponse
Get detailed information about a specific model.
- Parameters
model_id – The model identifier
raw_request – Raw FastAPI request containing headers
- Returns
Model details
- abstract async health_check(raw_request: Request) -> JSONResponse
Perform a comprehensive health check.
- Parameters
raw_request – Raw FastAPI request containing headers
- Returns
Health status information
- abstract async list_models(raw_request: Request) -> JSONResponse
List available models.
- Parameters
raw_request – Raw FastAPI request containing headers
- Returns
List of available models with metadata
- abstract async list_tools(raw_request: Request) -> JSONResponse
List available tools/functions.
- Parameters
raw_request – Raw FastAPI request containing headers
- Returns
Available tools information
- async on_shutdown() -> None
Hook for server shutdown.
Override in subclasses to perform cleanup tasks such as saving state, closing connections, or releasing resources. This method is called once when the server shuts down.
- async on_startup() -> None
Hook for server startup.
Override in subclasses to perform custom initialization tasks such as loading models, establishing connections, or warming up caches. This method is called once when the server starts.
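The hook pattern described above can be sketched without EasyDeL installed. In the sketch below, `HookedServer`, `MyServer`, and `lifecycle` are hypothetical stand-ins, not part of the library; only the hook names `on_startup` and `on_shutdown` come from the documentation. How the real server schedules the hooks is not shown on this page, so the `lifecycle` driver is an assumption.

```python
import asyncio
from abc import ABC


class HookedServer(ABC):
    """Hypothetical stand-in for BaseInferenceApiServer; only the two hooks are modeled."""

    async def on_startup(self) -> None:
        """Default hook: do nothing."""

    async def on_shutdown(self) -> None:
        """Default hook: do nothing."""


class MyServer(HookedServer):
    def __init__(self) -> None:
        self.events: list[str] = []
        self.cache: dict[str, str] = {}

    async def on_startup(self) -> None:
        # Warm a cache before the first request is served.
        self.cache["greeting"] = "hello"
        self.events.append("startup")

    async def on_shutdown(self) -> None:
        # Release whatever on_startup acquired.
        self.cache.clear()
        self.events.append("shutdown")


async def lifecycle(server: HookedServer) -> None:
    """Call each hook exactly once, the way a server lifespan might (assumed)."""
    await server.on_startup()
    try:
        pass  # requests would be served here
    finally:
        await server.on_shutdown()


server = MyServer()
asyncio.run(lifecycle(server))
print(server.events)
```

Running the shutdown hook in a `finally` block is one way to make sure cleanup still happens if serving raises.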
- run(host: str = '0.0.0.0', port: int = 11556, workers: int = 1, log_level: str = 'info', ssl_keyfile: str | None = None, ssl_certfile: str | None = None, reload: bool = False) -> None
Start the server with enhanced configuration.
- Parameters
host – Host address to bind to
port – Port to listen on
workers – Number of worker processes
log_level – Logging level
ssl_keyfile – Path to SSL key file
ssl_certfile – Path to SSL certificate file
reload – Enable auto-reload for development
- class easydel.inference.inference_engine_interface.EndpointConfig(*, path: str, handler: Callable, methods: list[str], summary: str | None = None, tags: list[str] | None = None, response_model: Any = None)
Bases: BaseModel
Configuration for a FastAPI endpoint.
Defines the structure for registering API endpoints.
- path: str
URL path for the endpoint
- handler: Callable
Callable that handles requests
- methods: list[str]
HTTP methods supported (GET, POST, etc.)
- summary: str | None
Brief description of the endpoint
- tags: list[str] | None
Tags for API documentation grouping
- response_model: Any
Pydantic model for response validation
- model_config: ClassVar[ConfigDict] = {}
Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict.
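To illustrate how these fields fit together, here is a minimal sketch that uses a plain dataclass in place of the real pydantic model. `EndpointSpec`, the two handlers, the paths `/health` and `/v1/models`, and the routing-table step are all assumptions made for illustration; only the field names and defaults are taken from the signature above.

```python
from dataclasses import dataclass
from typing import Any, Callable, Optional


@dataclass
class EndpointSpec:
    """Stand-in with the same fields as EndpointConfig (not the real pydantic model)."""
    path: str
    handler: Callable
    methods: list[str]
    summary: Optional[str] = None
    tags: Optional[list[str]] = None
    response_model: Any = None


def health() -> dict[str, str]:
    # Hypothetical handler.
    return {"status": "ready"}


def list_models() -> dict[str, list]:
    # Hypothetical handler.
    return {"data": []}


endpoints = [
    EndpointSpec(path="/health", handler=health, methods=["GET"],
                 summary="Health check", tags=["Health"]),
    EndpointSpec(path="/v1/models", handler=list_models, methods=["GET"],
                 tags=["Models"]),
]

# Build a (method, path) -> handler table, as a registration loop might.
routes = {(method, e.path): e.handler for e in endpoints for method in e.methods}
print(routes[("GET", "/health")]())
```

Keeping the endpoint description as data separates what is exposed from how the web framework registers it.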
- class easydel.inference.inference_engine_interface.ErrorResponse(*, error: dict[str, str], request_id: str | None = None, timestamp: float = <factory>)
Bases: BaseModel
Standard error response model.
Provides a consistent error response format across all endpoints.
- error: dict[str, str]
Dictionary containing error message and type
- request_id: str | None
Optional unique identifier for the request
- timestamp: float
Unix timestamp when the error occurred
- model_config: ClassVar[ConfigDict] = {}
Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict.
- class easydel.inference.inference_engine_interface.InferenceEngineAdapter
Bases: ABC
Abstract adapter interface for different inference engines.
This allows different inference engines (eSurge, vLLM, TGI, etc.) to be used with the same API server interface.
- abstract count_tokens(content: str) -> int
Count tokens in the given content.
- Parameters
content – Text content
- Returns
Number of tokens
- abstract async generate(prompts: str | list[str], sampling_params: SamplingParams, stream: bool = False) -> list[ReturnSample] | tp.AsyncGenerator[list[ReturnSample], None]
Generate text from prompts.
- Parameters
prompts – Input prompts
sampling_params – Sampling parameters
stream – Whether to stream the response
- Returns
Generated samples (list or async generator)
- abstract get_model_info() -> dict[str, Any]
Get information about the loaded model.
- Returns
Model information dictionary
- abstract property model_name: str
Get the name of the model.
- abstract property processor: Any
Get the processor/tokenizer for the model.
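The adapter contract can be sketched with a toy engine. In the example below, `EngineAdapter` is a local stand-in that mirrors the documented abstract members, and `EchoAdapter` is a hypothetical implementation that returns plain strings instead of `ReturnSample` objects and ignores `sampling_params`. It is meant only to show the streaming versus non-streaming return shapes, not how a real engine is wired in.

```python
import asyncio
from abc import ABC, abstractmethod
from typing import Any


class EngineAdapter(ABC):
    """Stand-in mirroring the documented abstract members of InferenceEngineAdapter."""

    @abstractmethod
    async def generate(self, prompts, sampling_params, stream: bool = False): ...

    @abstractmethod
    def count_tokens(self, content: str) -> int: ...

    @abstractmethod
    def get_model_info(self) -> dict[str, Any]: ...

    @property
    @abstractmethod
    def model_name(self) -> str: ...

    @property
    @abstractmethod
    def processor(self) -> Any: ...


class EchoAdapter(EngineAdapter):
    """Toy engine: 'generates' the upper-cased prompt and tokenizes on whitespace."""

    async def generate(self, prompts, sampling_params, stream: bool = False):
        batch = [prompts] if isinstance(prompts, str) else list(prompts)
        outputs = [p.upper() for p in batch]
        if not stream:
            return outputs

        async def chunks():
            # A real engine would yield many partial results; this yields one.
            yield outputs

        return chunks()

    def count_tokens(self, content: str) -> int:
        return len(content.split())

    def get_model_info(self) -> dict[str, Any]:
        return {"name": self.model_name, "type": "echo"}

    @property
    def model_name(self) -> str:
        return "echo-model"

    @property
    def processor(self) -> Any:
        return str.split


async def demo():
    adapter = EchoAdapter()
    full = await adapter.generate("hi there", sampling_params=None)
    stream = await adapter.generate(["a", "b"], sampling_params=None, stream=True)
    streamed = [chunk async for chunk in stream]
    return full, streamed, adapter.count_tokens("hi there")


result = asyncio.run(demo())
print(result)
```

Because `generate` is a coroutine in both modes, callers await it first and then either use the list directly or iterate the returned async generator.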
- class easydel.inference.inference_engine_interface.ServerMetrics(total_requests: int = 0, successful_requests: int = 0, failed_requests: int = 0, total_tokens_generated: int = 0, average_tokens_per_second: float = 0.0, uptime_seconds: float = 0.0, start_time: float = <factory>)
Bases: object
Server performance metrics.
Tracks key performance indicators for the inference server.
- total_requests: int = 0
Total number of requests received
- successful_requests: int = 0
Number of successfully completed requests
- failed_requests: int = 0
Number of failed requests
- total_tokens_generated: int = 0
Total tokens generated across all requests
- average_tokens_per_second: float = 0.0
Average generation speed
- uptime_seconds: float = 0.0
Time since server started
- start_time: float
Unix timestamp when server started
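As a rough illustration of how such counters might be maintained, the sketch below defines a local dataclass with the same fields and defaults. The `record` helper and its running-mean update rule are assumptions, since the page does not say how the server computes `average_tokens_per_second`. Using `time.time` as the `start_time` factory is likewise assumed from the "Unix timestamp" description.

```python
import time
from dataclasses import dataclass, field


@dataclass
class Metrics:
    """Stand-in with the same fields and defaults as the documented ServerMetrics."""
    total_requests: int = 0
    successful_requests: int = 0
    failed_requests: int = 0
    total_tokens_generated: int = 0
    average_tokens_per_second: float = 0.0
    uptime_seconds: float = 0.0
    start_time: float = field(default_factory=time.time)  # assumed factory


def record(metrics: Metrics, tokens: int, seconds: float, ok: bool = True) -> None:
    """Hypothetical update rule: running mean of per-request tokens/second."""
    metrics.total_requests += 1
    if not ok:
        metrics.failed_requests += 1
        return
    metrics.successful_requests += 1
    metrics.total_tokens_generated += tokens
    rate = tokens / seconds if seconds > 0 else 0.0
    n = metrics.successful_requests
    # Incremental mean: new_avg = old_avg + (x - old_avg) / n
    metrics.average_tokens_per_second += (rate - metrics.average_tokens_per_second) / n


m = Metrics()
record(m, tokens=100, seconds=2.0)          # 50 tokens/s
record(m, tokens=300, seconds=2.0)          # 150 tokens/s
record(m, tokens=0, seconds=0.0, ok=False)  # a failed request
m.uptime_seconds = time.time() - m.start_time
print(m.total_requests, m.successful_requests, m.failed_requests, m.average_tokens_per_second)
```

The incremental mean avoids storing per-request rates; a token-weighted average (total tokens divided by total generation time) would be an equally plausible choice.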
- class easydel.inference.inference_engine_interface.ServerStatus(value, names=None, *, module=None, qualname=None, type=None, start=1, boundary=None)
Bases: str, Enum
Server status enumeration.
Represents the operational state of an inference server.
- STARTING = 'starting'
Server is initializing
- READY = 'ready'
Server is ready to accept requests
- BUSY = 'busy'
Server is processing requests at capacity
- ERROR = 'error'
Server encountered an error
- SHUTTING_DOWN = 'shutting_down'
Server is gracefully shutting down
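Because the enum derives from both `str` and `Enum`, its members behave like plain strings in comparisons and JSON output. The sketch below re-declares the members locally under the hypothetical name `Status` so that it runs without EasyDeL; the member names and values are copied from the listing above.

```python
import json
from enum import Enum


class Status(str, Enum):
    """Stand-in with the same members and values as the documented ServerStatus."""
    STARTING = "starting"
    READY = "ready"
    BUSY = "busy"
    ERROR = "error"
    SHUTTING_DOWN = "shutting_down"


# The str mixin makes members compare equal to their string values.
assert Status.READY == "ready"

# Lookup by value turns a wire-format string back into a member.
state = Status("shutting_down")

# json.dumps emits the plain string value; no custom encoder is needed.
payload = json.dumps({"status": Status.BUSY})
print(state.name, payload)
```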
- easydel.inference.inference_engine_interface.create_error_response(status_code: HTTPStatus, message: str, request_id: str | None = None) -> JSONResponse
Creates a standardized JSON error response.
- Parameters
status_code – HTTP status code for the error
message – Human-readable error message
request_id – Optional request identifier for tracking
- Returns
JSONResponse with error details and appropriate status code
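A minimal sketch of what such a helper might assemble, using only the standard library. `build_error` is a hypothetical function that returns a status code and a JSON string rather than a real `JSONResponse`. The inner keys `message` and `type`, and the use of the status name as the type, are assumptions, since the page only says the `error` dictionary holds a message and a type.

```python
import json
import time
from http import HTTPStatus
from typing import Optional


def build_error(status_code: HTTPStatus, message: str, request_id: Optional[str] = None):
    """Return (status, body) shaped like the documented ErrorResponse fields."""
    body = {
        # Assumed keys: the docs only say the dict holds a message and a type.
        "error": {"message": message, "type": status_code.name},
        "request_id": request_id,
        "timestamp": time.time(),
    }
    return status_code.value, json.dumps(body)


status, raw = build_error(HTTPStatus.NOT_FOUND, "Model 'foo' not found", request_id="req-123")
decoded = json.loads(raw)
print(status, decoded["error"])
```

Returning the same three top-level fields from every endpoint lets clients parse failures with a single code path.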