easydel.inference.esurge.server.api_server

easydel.inference.esurge.server.api_server#

FastAPI server for eSurge with OpenAI API compatibility.

class easydel.inference.esurge.server.api_server.ErrorResponse(*, error: dict[str, str], request_id: str | None = None, timestamp: float = <factory>)[source]#

Bases: BaseModel

Standard error response model.

error: dict[str, str]#

model_config: ClassVar[ConfigDict] = {}#: Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

request_id: str | None#

timestamp: float#

class easydel.inference.esurge.server.api_server.ServerMetrics(total_requests: int = 0, successful_requests: int = 0, failed_requests: int = 0, total_tokens_generated: int = 0, average_tokens_per_second: float = 0.0, uptime_seconds: float = 0.0, start_time: float = <factory>)[source]#

Bases: object

Server performance metrics.

Tracks aggregate performance statistics for the API server. Updated in real-time as requests are processed.

total_requests#

Total number of requests received.

Type: int

successful_requests#

Number of successfully completed requests.

Type: int

failed_requests#

Number of failed requests.

Type: int

total_tokens_generated#

Cumulative tokens generated across all requests.

Type: int

average_tokens_per_second#

Rolling average generation throughput.

Type: float

uptime_seconds#

Server uptime in seconds.

Type: float

start_time#

Server start timestamp.

Type: float

average_tokens_per_second: float = 0.0#

failed_requests: int = 0#

start_time: float#

successful_requests: int = 0#

total_requests: int = 0#

total_tokens_generated: int = 0#

uptime_seconds: float = 0.0#

class easydel.inference.esurge.server.api_server.ServerStatus(value, names=None, *, module=None, qualname=None, type=None, start=1, boundary=None)[source]#

Bases: str, Enum

Server status enumeration.

Represents the current operational state of the API server. Used for health checks and monitoring.

Values:: STARTING: Server is initializing READY: Server is ready to accept requests BUSY: Server is processing at capacity ERROR: Server encountered an error SHUTTING_DOWN: Server is shutting down gracefully

BUSY = 'busy'#

ERROR = 'error'#

READY = 'ready'#

SHUTTING_DOWN = 'shutting_down'#

STARTING = 'starting'#

easydel.inference.esurge.server.api_server.create_error_response(status_code: HTTPStatus, message: str, request_id: str | None = None) → JSONResponse[source]#

Creates a standardized JSON error response.

Parameters

status_code – HTTP status code for the error.
message – Human-readable error message.
request_id – Optional request ID for tracking.

Returns

JSONResponse with error details in OpenAI API format.

class easydel.inference.esurge.server.api_server.eSurgeAdapter(esurge_instance: eSurge, model_name: str)[source]#

Bases: InferenceEngineAdapter

Adapter for eSurge inference engine.

Bridges the synchronous eSurge engine with the async FastAPI server. Implements the InferenceEngineAdapter interface for compatibility with the base API server infrastructure.

count_tokens(content: str) → int[source]#

Count tokens using eSurge tokenizer.

Parameters: content – Text to tokenize.
Returns: Number of tokens in the content.

async generate(prompts: str | list[str], sampling_params: SamplingParams, stream: bool = False) → Union[list[easydel.inference.esurge.esurge_engine.RequestOutput], AsyncGenerator[RequestOutput, None]][source]#

Generate text using eSurge engine.

Parameters

prompts – Input prompt(s) for generation.
sampling_params – Generation parameters.
stream – Whether to stream results (not implemented).

Returns

List of RequestOutput objects for batch generation.

Raises

NotImplementedError – If stream=True (streaming not supported here).

get_model_info() → dict[str, Any][source]#

Get eSurge model information.

Returns: name, type, architecture, max_model_len, and max_num_seqs.
Return type: Dictionary containing model metadata

property model_name: str#: Return the model name.

property processor: Any#: Get the processor/tokenizer for the model.

class easydel.inference.esurge.server.api_server.eSurgeApiServer(esurge_map: dict[str, easydel.inference.esurge.esurge_engine.eSurge] | easydel.inference.esurge.esurge_engine.eSurge, oai_like_processor: bool = True, enable_function_calling: bool = True, tool_parser_name: str = 'hermes', require_api_key: bool = False, admin_key: str | None = None, enable_audit_logging: bool = True, max_audit_entries: int = 10000, storage_dir: str | None = None, enable_persistence: bool = True, auto_save_interval: float = 60.0, auth_worker_client: Any | None = None, max_concurrent_generations: int | None = None, overload_message: str = 'Server is busy, please try again later', refine_sampling_params: Optional[Callable[[SamplingParams, easydel.inference.openai_api_modules.ChatCompletionRequest | easydel.inference.openai_api_modules.CompletionRequest, eSurge], easydel.inference.sampling_params.SamplingParams | None]] = None, refine_chat_request: Optional[Callable[[ChatCompletionRequest], easydel.inference.openai_api_modules.ChatCompletionRequest | None]] = None, **kwargs)[source]#

Bases: BaseInferenceApiServer, ToolCallingMixin, AuthEndpointsMixin

eSurge-specific API server implementation with OpenAI compatibility.

Provides a FastAPI-based REST API server that exposes eSurge engines through OpenAI-compatible endpoints. Supports multiple models, streaming, function calling, and comprehensive monitoring.

Features: - OpenAI API v1 compatibility (/v1/chat/completions, /v1/completions) - Multi-model support with dynamic routing - Streaming responses with Server-Sent Events (SSE) - Function/tool calling support - Real-time metrics and health monitoring - Thread-safe request handling - Production-grade authentication with RBAC, rate limiting, and audit logging

async chat_completions(request: ChatCompletionRequest, raw_request: Request) → Any[source]#

Handle chat completion requests.

Main endpoint for /v1/chat/completions. Supports both streaming and non-streaming responses, with optional function calling.

Parameters: request – Chat completion request (with or without tools).
Returns: ChatCompletionResponse for non-streaming. StreamingResponse for streaming. JSONResponse with error on failure.
Raises: HTTPException – For client errors (400, 404).

async completions(request: CompletionRequest, raw_request: Request) → Any[source]#

Handle completion requests.

Endpoint for /v1/completions. Simpler text completion without chat formatting.

Parameters: request – Completion request.
Returns: CompletionResponse or StreamingResponse. JSONResponse with error on failure.
Raises: HTTPException – For client errors.

async execute_tool(raw_request: Request) → JSONResponse[source]#

Execute a tool/function call.

Placeholder endpoint for tool execution. Implement this method to integrate with actual tool execution systems.

Parameters: raw_request – Tool execution request.
Returns: JSONResponse with NOT_IMPLEMENTED status.

Note

This is a placeholder that should be implemented based on specific tool execution requirements.

generate_api_key(name: str, role: Any = None, **kwargs) → tuple[str, Any][source]#

Create and register a new random API key with enhanced features.

Parameters

name – Human-readable name for the key.
role – Access control role (ApiKeyRole). Defaults to USER.
**kwargs – Additional arguments passed to auth_manager.generate_api_key() (description, expires_in_days, rate_limits, quota, permissions, tags, metadata)

Returns

Tuple of (raw_key, metadata). Store raw_key securely - it won’t be retrievable later.

async get_metrics(raw_request: Request) → JSONResponse[source]#

Get server performance metrics.

Returns: JSONResponse with comprehensive server metrics including request counts, token statistics, throughput, and status.

async get_model(model_id: str, raw_request: Request) → JSONResponse[source]#

Get model details.

Parameters: model_id – Model identifier.
Returns: JSONResponse with model metadata.
Raises: HTTPException – If model not found.

async health_check(raw_request: Request) → JSONResponse[source]#

Health check endpoint.

Returns server health status and model information.

Returns

status: Current server status
timestamp: Current time
uptime_seconds: Server uptime
models: Loaded model information
active_requests: Current request count

Status code 200 if READY, 503 otherwise.

Return type

JSONResponse with

async list_models(raw_request: Request) → JSONResponse[source]#

List available models.

OpenAI-compatible model listing endpoint.

Returns: JSONResponse with list of available models and their metadata.

async list_tools(raw_request: Request) → JSONResponse[source]#

List available tools/functions for each model.

Returns example tool definitions and supported formats. This is a placeholder that can be extended with actual tools.

Returns: JSONResponse with tool definitions per model.

async on_shutdown() → None[source]#

Custom shutdown logic for eSurge.

Called when the FastAPI server shuts down. Cleans up ZMQ workers.

async on_startup() → None[source]#

Custom startup logic for eSurge.

Called when the FastAPI server starts. Logs loaded models and sets server status to READY.

easydel.inference.esurge.server.api_server

Contents

easydel.inference.esurge.server.api_server#