easydel.inference.esurge.server.api_server#

FastAPI server for eSurge with OpenAI API compatibility.

class easydel.inference.esurge.server.api_server.ErrorResponse(*, error: dict[str, str], request_id: str | None = None, timestamp: float = <factory>)[source]#

Bases: BaseModel

Standard error response model.

error: dict[str, str]#
model_config: ClassVar[ConfigDict] = {}#

Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

request_id: str | None#
timestamp: float#
class easydel.inference.esurge.server.api_server.ServerMetrics(total_requests: int = 0, successful_requests: int = 0, failed_requests: int = 0, total_tokens_generated: int = 0, average_tokens_per_second: float = 0.0, uptime_seconds: float = 0.0, start_time: float = <factory>)[source]#

Bases: object

Server performance metrics.

Tracks aggregate performance statistics for the API server. Updated in real-time as requests are processed.

total_requests#

Total number of requests received.

Type

int

successful_requests#

Number of successfully completed requests.

Type

int

failed_requests#

Number of failed requests.

Type

int

total_tokens_generated#

Cumulative tokens generated across all requests.

Type

int

average_tokens_per_second#

Rolling average generation throughput.

Type

float

uptime_seconds#

Server uptime in seconds.

Type

float

start_time#

Server start timestamp.

Type

float

average_tokens_per_second: float = 0.0#
failed_requests: int = 0#
start_time: float#
successful_requests: int = 0#
total_requests: int = 0#
total_tokens_generated: int = 0#
uptime_seconds: float = 0.0#
class easydel.inference.esurge.server.api_server.ServerStatus(value, names=None, *, module=None, qualname=None, type=None, start=1, boundary=None)[source]#

Bases: str, Enum

Server status enumeration.

Represents the current operational state of the API server. Used for health checks and monitoring.

Values:

STARTING: Server is initializing READY: Server is ready to accept requests BUSY: Server is processing at capacity ERROR: Server encountered an error SHUTTING_DOWN: Server is shutting down gracefully

BUSY = 'busy'#
ERROR = 'error'#
READY = 'ready'#
SHUTTING_DOWN = 'shutting_down'#
STARTING = 'starting'#
easydel.inference.esurge.server.api_server.create_error_response(status_code: HTTPStatus, message: str, request_id: str | None = None) JSONResponse[source]#

Creates a standardized JSON error response.

Parameters
  • status_code – HTTP status code for the error.

  • message – Human-readable error message.

  • request_id – Optional request ID for tracking.

Returns

JSONResponse with error details in OpenAI API format.

class easydel.inference.esurge.server.api_server.eSurgeAdapter(esurge_instance: eSurge, model_name: str)[source]#

Bases: InferenceEngineAdapter

Adapter for eSurge inference engine.

Bridges the synchronous eSurge engine with the async FastAPI server. Implements the InferenceEngineAdapter interface for compatibility with the base API server infrastructure.

count_tokens(content: str) int[source]#

Count tokens using eSurge tokenizer.

Parameters

content – Text to tokenize.

Returns

Number of tokens in the content.

async generate(prompts: str | list[str], sampling_params: SamplingParams, stream: bool = False) Union[list[easydel.inference.esurge.esurge_engine.RequestOutput], AsyncGenerator[RequestOutput, None]][source]#

Generate text using eSurge engine.

Parameters
  • prompts – Input prompt(s) for generation.

  • sampling_params – Generation parameters.

  • stream – Whether to stream results (not implemented).

Returns

List of RequestOutput objects for batch generation.

Raises

NotImplementedError – If stream=True (streaming not supported here).

get_model_info() dict[str, Any][source]#

Get eSurge model information.

Returns

name, type, architecture, max_model_len, and max_num_seqs.

Return type

Dictionary containing model metadata

property model_name: str#

Return the model name.

property processor: Any#

Get the processor/tokenizer for the model.

class easydel.inference.esurge.server.api_server.eSurgeApiServer(esurge_map: dict[str, easydel.inference.esurge.esurge_engine.eSurge] | easydel.inference.esurge.esurge_engine.eSurge, oai_like_processor: bool = True, enable_function_calling: bool = True, tool_parser_name: str = 'hermes', require_api_key: bool = False, admin_key: str | None = None, enable_audit_logging: bool = True, max_audit_entries: int = 10000, storage_dir: str | None = None, enable_persistence: bool = True, auto_save_interval: float = 60.0, auth_worker_client: Any | None = None, max_concurrent_generations: int | None = None, overload_message: str = 'Server is busy, please try again later', refine_sampling_params: Optional[Callable[[SamplingParams, easydel.inference.openai_api_modules.ChatCompletionRequest | easydel.inference.openai_api_modules.CompletionRequest, eSurge], easydel.inference.sampling_params.SamplingParams | None]] = None, refine_chat_request: Optional[Callable[[ChatCompletionRequest], easydel.inference.openai_api_modules.ChatCompletionRequest | None]] = None, **kwargs)[source]#

Bases: BaseInferenceApiServer, ToolCallingMixin, AuthEndpointsMixin

eSurge-specific API server implementation with OpenAI compatibility.

Provides a FastAPI-based REST API server that exposes eSurge engines through OpenAI-compatible endpoints. Supports multiple models, streaming, function calling, and comprehensive monitoring.

Features: - OpenAI API v1 compatibility (/v1/chat/completions, /v1/completions) - Multi-model support with dynamic routing - Streaming responses with Server-Sent Events (SSE) - Function/tool calling support - Real-time metrics and health monitoring - Thread-safe request handling - Production-grade authentication with RBAC, rate limiting, and audit logging

async chat_completions(request: ChatCompletionRequest, raw_request: Request) Any[source]#

Handle chat completion requests.

Main endpoint for /v1/chat/completions. Supports both streaming and non-streaming responses, with optional function calling.

Parameters

request – Chat completion request (with or without tools).

Returns

ChatCompletionResponse for non-streaming. StreamingResponse for streaming. JSONResponse with error on failure.

Raises

HTTPException – For client errors (400, 404).

async completions(request: CompletionRequest, raw_request: Request) Any[source]#

Handle completion requests.

Endpoint for /v1/completions. Simpler text completion without chat formatting.

Parameters

request – Completion request.

Returns

CompletionResponse or StreamingResponse. JSONResponse with error on failure.

Raises

HTTPException – For client errors.

async execute_tool(raw_request: Request) JSONResponse[source]#

Execute a tool/function call.

Placeholder endpoint for tool execution. Implement this method to integrate with actual tool execution systems.

Parameters

raw_request – Tool execution request.

Returns

JSONResponse with NOT_IMPLEMENTED status.

Note

This is a placeholder that should be implemented based on specific tool execution requirements.

generate_api_key(name: str, role: Any = None, **kwargs) tuple[str, Any][source]#

Create and register a new random API key with enhanced features.

Parameters
  • name – Human-readable name for the key.

  • role – Access control role (ApiKeyRole). Defaults to USER.

  • **kwargs – Additional arguments passed to auth_manager.generate_api_key() (description, expires_in_days, rate_limits, quota, permissions, tags, metadata)

Returns

Tuple of (raw_key, metadata). Store raw_key securely - it won’t be retrievable later.

async get_metrics(raw_request: Request) JSONResponse[source]#

Get server performance metrics.

Returns

JSONResponse with comprehensive server metrics including request counts, token statistics, throughput, and status.

async get_model(model_id: str, raw_request: Request) JSONResponse[source]#

Get model details.

Parameters

model_id – Model identifier.

Returns

JSONResponse with model metadata.

Raises

HTTPException – If model not found.

async health_check(raw_request: Request) JSONResponse[source]#

Health check endpoint.

Returns server health status and model information.

Returns

  • status: Current server status

  • timestamp: Current time

  • uptime_seconds: Server uptime

  • models: Loaded model information

  • active_requests: Current request count

Status code 200 if READY, 503 otherwise.

Return type

JSONResponse with

async list_models(raw_request: Request) JSONResponse[source]#

List available models.

OpenAI-compatible model listing endpoint.

Returns

JSONResponse with list of available models and their metadata.

async list_tools(raw_request: Request) JSONResponse[source]#

List available tools/functions for each model.

Returns example tool definitions and supported formats. This is a placeholder that can be extended with actual tools.

Returns

JSONResponse with tool definitions per model.

async on_shutdown() None[source]#

Custom shutdown logic for eSurge.

Called when the FastAPI server shuts down. Cleans up ZMQ workers.

async on_startup() None[source]#

Custom startup logic for eSurge.

Called when the FastAPI server starts. Logs loaded models and sets server status to READY.