easydel.inference.esurge.server.api_server#
FastAPI server for eSurge with OpenAI API compatibility.
- class easydel.inference.esurge.server.api_server.ErrorResponse(*, error: dict[str, str], request_id: str | None = None, timestamp: float = <factory>)[source]#
Bases: BaseModel
Standard error response model.
- error: dict[str, str]#
- model_config: ClassVar[ConfigDict] = {}#
Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict.
- timestamp: float#
- class easydel.inference.esurge.server.api_server.ServerMetrics(total_requests: int = 0, successful_requests: int = 0, failed_requests: int = 0, total_tokens_generated: int = 0, average_tokens_per_second: float = 0.0, uptime_seconds: float = 0.0, start_time: float = <factory>)[source]#
Bases: object
Server performance metrics.
Tracks aggregate performance statistics for the API server. Updated in real time as requests are processed.
- total_requests#
Total number of requests received.
- Type
int
- successful_requests#
Number of successfully completed requests.
- Type
int
- failed_requests#
Number of failed requests.
- Type
int
- total_tokens_generated#
Cumulative tokens generated across all requests.
- Type
int
- average_tokens_per_second#
Rolling average generation throughput.
- Type
float
- uptime_seconds#
Server uptime in seconds.
- Type
float
- start_time#
Server start timestamp.
- Type
float
- average_tokens_per_second: float = 0.0#
- failed_requests: int = 0#
- start_time: float#
- successful_requests: int = 0#
- total_requests: int = 0#
- total_tokens_generated: int = 0#
- uptime_seconds: float = 0.0#
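To show how these counters relate to one another, here is a minimal stand-in dataclass with the same field names. The `MetricsSketch` class, the `record` helper, and the throughput formula (lifetime tokens divided by uptime) are assumptions made for illustration; the documentation above does not say how the server actually updates its rolling average.

```python
import time
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class MetricsSketch:
    """Stand-in with the same fields as ServerMetrics (not the real class)."""

    total_requests: int = 0
    successful_requests: int = 0
    failed_requests: int = 0
    total_tokens_generated: int = 0
    average_tokens_per_second: float = 0.0
    uptime_seconds: float = 0.0
    start_time: float = field(default_factory=time.time)

    def record(self, tokens: int, ok: bool, now: Optional[float] = None) -> None:
        """Hypothetical update hook: account for one finished request."""
        now = time.time() if now is None else now
        self.total_requests += 1
        if ok:
            self.successful_requests += 1
            self.total_tokens_generated += tokens
        else:
            self.failed_requests += 1
        self.uptime_seconds = now - self.start_time
        if self.uptime_seconds > 0:
            # Assumed formula: lifetime tokens / uptime, not a windowed average.
            self.average_tokens_per_second = (
                self.total_tokens_generated / self.uptime_seconds
            )


# Fixed timestamps keep the example deterministic.
metrics = MetricsSketch(start_time=100.0)
metrics.record(tokens=50, ok=True, now=110.0)
metrics.record(tokens=0, ok=False, now=120.0)
```

With these inputs, `total_requests` is the sum of `successful_requests` and `failed_requests`, which is the invariant a monitoring dashboard would typically rely on.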
- class easydel.inference.esurge.server.api_server.ServerStatus(value, names=None, *, module=None, qualname=None, type=None, start=1, boundary=None)[source]#
Bases: str, Enum
Server status enumeration.
Represents the current operational state of the API server. Used for health checks and monitoring.
- Values:
STARTING: Server is initializing.
READY: Server is ready to accept requests.
BUSY: Server is processing at capacity.
ERROR: Server encountered an error.
SHUTTING_DOWN: Server is shutting down gracefully.
- BUSY = 'busy'#
- ERROR = 'error'#
- READY = 'ready'#
- SHUTTING_DOWN = 'shutting_down'#
- STARTING = 'starting'#
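Because the enumeration mixes in `str`, its members behave like plain strings in comparisons and JSON output. The sketch below uses a stand-in class, `StatusSketch`, that copies the documented member values; it is not imported from easydel.

```python
import json
from enum import Enum


class StatusSketch(str, Enum):
    """Stand-in mirroring the documented ServerStatus members."""

    STARTING = "starting"
    READY = "ready"
    BUSY = "busy"
    ERROR = "error"
    SHUTTING_DOWN = "shutting_down"


# Members compare equal to their string values because of the str mixin.
is_ready = StatusSketch.READY == "ready"

# json.dumps emits the underlying string, so no custom encoder is needed.
payload = json.dumps({"status": StatusSketch.BUSY})

# Lookup by value recovers the member.
parsed = StatusSketch("shutting_down")
```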
- easydel.inference.esurge.server.api_server.create_error_response(status_code: HTTPStatus, message: str, request_id: str | None = None) JSONResponse[source]#
Creates a standardized JSON error response.
- Parameters
status_code – HTTP status code for the error.
message – Human-readable error message.
request_id – Optional request ID for tracking.
- Returns
JSONResponse with error details in OpenAI API format.
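The body of such a response can be pictured with a small standard-library sketch. The helper name `build_error_payload` and the keys placed inside `error` (`message`, `type`) are assumptions; the documentation only states that `error` is a `dict[str, str]` and that `request_id` and `timestamp` accompany it.

```python
import time
from http import HTTPStatus
from typing import Optional


def build_error_payload(
    status_code: HTTPStatus, message: str, request_id: Optional[str] = None
) -> dict:
    """Hypothetical helper: build a body shaped like ErrorResponse."""
    return {
        # Assumed keys inside `error`; the source only says dict[str, str].
        "error": {"message": message, "type": status_code.name},
        "request_id": request_id,
        "timestamp": time.time(),
    }


body = build_error_payload(HTTPStatus.NOT_FOUND, "Model not found", "req-123")
# The numeric code would be sent as the HTTP status of the response.
http_status = HTTPStatus.NOT_FOUND.value
```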
- class easydel.inference.esurge.server.api_server.eSurgeAdapter(esurge_instance: eSurge, model_name: str)[source]#
Bases: InferenceEngineAdapter
Adapter for eSurge inference engine.
Bridges the synchronous eSurge engine with the async FastAPI server. Implements the InferenceEngineAdapter interface for compatibility with the base API server infrastructure.
- count_tokens(content: str) int[source]#
Count tokens using eSurge tokenizer.
- Parameters
content – Text to tokenize.
- Returns
Number of tokens in the content.
- async generate(prompts: str | list[str], sampling_params: SamplingParams, stream: bool = False) Union[list[easydel.inference.esurge.esurge_engine.RequestOutput], AsyncGenerator[RequestOutput, None]][source]#
Generate text using eSurge engine.
- Parameters
prompts – Input prompt(s) for generation.
sampling_params – Generation parameters.
stream – Whether to stream results (not implemented).
- Returns
List of RequestOutput objects for batch generation.
- Raises
NotImplementedError – If stream=True (streaming not supported here).
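One common way to bridge a blocking engine call into async code is to run it in a worker thread. The sketch below illustrates that general technique only; `blocking_generate` is a hypothetical stand-in for the engine, and nothing here is taken from the adapter's actual implementation.

```python
import asyncio
import time
from typing import List, Union


def blocking_generate(prompts: List[str]) -> List[str]:
    """Hypothetical stand-in for a synchronous engine call."""
    time.sleep(0.01)  # simulate blocking work
    return [f"echo: {p}" for p in prompts]


async def generate(prompts: Union[str, List[str]]) -> List[str]:
    """Normalize the input to a batch and run the blocking call off-loop."""
    batch = [prompts] if isinstance(prompts, str) else list(prompts)
    # asyncio.to_thread keeps the event loop free while the engine runs.
    return await asyncio.to_thread(blocking_generate, batch)


outputs = asyncio.run(generate("hello"))
```

Accepting either a single string or a list and normalizing to a list mirrors the `str | list[str]` type of the `prompts` parameter.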
- get_model_info() dict[str, Any][source]#
Get eSurge model information.
- Returns
Dictionary containing model metadata: name, type, architecture, max_model_len, and max_num_seqs.
- Return type
dict[str, Any]
- property model_name: str#
Return the model name.
- property processor: Any#
Get the processor/tokenizer for the model.
- class easydel.inference.esurge.server.api_server.eSurgeApiServer(esurge_map: dict[str, easydel.inference.esurge.esurge_engine.eSurge] | easydel.inference.esurge.esurge_engine.eSurge, oai_like_processor: bool = True, enable_function_calling: bool = True, tool_parser_name: str = 'hermes', require_api_key: bool = False, admin_key: str | None = None, enable_audit_logging: bool = True, max_audit_entries: int = 10000, storage_dir: str | None = None, enable_persistence: bool = True, auto_save_interval: float = 60.0, auth_worker_client: Any | None = None, max_concurrent_generations: int | None = None, overload_message: str = 'Server is busy, please try again later', refine_sampling_params: Optional[Callable[[SamplingParams, easydel.inference.openai_api_modules.ChatCompletionRequest | easydel.inference.openai_api_modules.CompletionRequest, eSurge], easydel.inference.sampling_params.SamplingParams | None]] = None, refine_chat_request: Optional[Callable[[ChatCompletionRequest], easydel.inference.openai_api_modules.ChatCompletionRequest | None]] = None, **kwargs)[source]#
Bases: BaseInferenceApiServer, ToolCallingMixin, AuthEndpointsMixin
eSurge-specific API server implementation with OpenAI compatibility.
Provides a FastAPI-based REST API server that exposes eSurge engines through OpenAI-compatible endpoints. Supports multiple models, streaming, function calling, and comprehensive monitoring.
Features:
- OpenAI API v1 compatibility (/v1/chat/completions, /v1/completions)
- Multi-model support with dynamic routing
- Streaming responses with Server-Sent Events (SSE)
- Function/tool calling support
- Real-time metrics and health monitoring
- Thread-safe request handling
- Production-grade authentication with RBAC, rate limiting, and audit logging
- async chat_completions(request: ChatCompletionRequest, raw_request: Request) Any[source]#
Handle chat completion requests.
Main endpoint for /v1/chat/completions. Supports both streaming and non-streaming responses, with optional function calling.
- Parameters
request – Chat completion request (with or without tools).
- Returns
ChatCompletionResponse for non-streaming requests, StreamingResponse for streaming requests, or JSONResponse with an error on failure.
- Raises
HTTPException – For client errors (400, 404).
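On the client side, a streamed reply has to be reassembled from Server-Sent Events. The sketch below assumes the stream follows the usual OpenAI convention of `data: <json>` lines terminated by `data: [DONE]`, with text carried in `choices[0].delta.content`. The sample lines are hand-written for illustration, not captured server output.

```python
import json
from typing import Iterable, Iterator


def iter_sse_chunks(lines: Iterable[str]) -> Iterator[dict]:
    """Yield decoded JSON payloads from SSE `data:` lines until [DONE]."""
    for line in lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank separator lines and other SSE fields
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            break
        yield json.loads(data)


# Hand-written sample stream; the chunk shape is an assumption.
sample = [
    'data: {"choices": [{"delta": {"content": "Hel"}}]}',
    "",
    'data: {"choices": [{"delta": {"content": "lo"}}]}',
    "",
    "data: [DONE]",
]
text = "".join(
    chunk["choices"][0]["delta"].get("content", "")
    for chunk in iter_sse_chunks(sample)
)
```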
- async completions(request: CompletionRequest, raw_request: Request) Any[source]#
Handle completion requests.
Endpoint for /v1/completions. Simpler text completion without chat formatting.
- Parameters
request – Completion request.
- Returns
CompletionResponse or StreamingResponse. JSONResponse with error on failure.
- Raises
HTTPException – For client errors.
- async execute_tool(raw_request: Request) JSONResponse[source]#
Execute a tool/function call.
Placeholder endpoint for tool execution. Implement this method to integrate with actual tool execution systems.
- Parameters
raw_request – Tool execution request.
- Returns
JSONResponse with NOT_IMPLEMENTED status.
Note
This is a placeholder that should be implemented based on specific tool execution requirements.
- generate_api_key(name: str, role: Any = None, **kwargs) tuple[str, Any][source]#
Create and register a new random API key with enhanced features.
- Parameters
name – Human-readable name for the key.
role – Access control role (ApiKeyRole). Defaults to USER.
**kwargs – Additional arguments passed to auth_manager.generate_api_key() (description, expires_in_days, rate_limits, quota, permissions, tags, metadata)
- Returns
Tuple of (raw_key, metadata). Store raw_key securely - it won’t be retrievable later.
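The warning about the raw key is typical of systems that persist only a digest of each key. The sketch below shows that general pattern with the standard library; the `sk-` prefix, the metadata layout, and the use of SHA-256 are all assumptions, since the documentation does not describe how auth_manager stores keys.

```python
import hashlib
import secrets
from typing import Tuple


def new_api_key(name: str) -> Tuple[str, dict]:
    """Hypothetical sketch: return (raw_key, metadata) and keep only a hash."""
    raw_key = "sk-" + secrets.token_urlsafe(32)
    metadata = {
        "name": name,
        # Only the digest is kept, so the raw key cannot be recovered later.
        "key_hash": hashlib.sha256(raw_key.encode()).hexdigest(),
    }
    return raw_key, metadata


def verify(raw_key: str, metadata: dict) -> bool:
    """Check a presented key against the stored digest in constant time."""
    digest = hashlib.sha256(raw_key.encode()).hexdigest()
    return secrets.compare_digest(digest, metadata["key_hash"])


raw_key, metadata = new_api_key("ci-bot")
```

In this pattern the caller must save `raw_key` at creation time, because afterwards only `verify` can confirm it.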
- async get_metrics(raw_request: Request) JSONResponse[source]#
Get server performance metrics.
- Returns
JSONResponse with comprehensive server metrics including request counts, token statistics, throughput, and status.
- async get_model(model_id: str, raw_request: Request) JSONResponse[source]#
Get model details.
- Parameters
model_id – Model identifier.
- Returns
JSONResponse with model metadata.
- Raises
HTTPException – If model not found.
- async health_check(raw_request: Request) JSONResponse[source]#
Health check endpoint.
Returns server health status and model information.
- Returns
JSONResponse with:
status: Current server status
timestamp: Current time
uptime_seconds: Server uptime
models: Loaded model information
active_requests: Current request count
Status code 200 if READY, 503 otherwise.
- Return type
JSONResponse
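The documented status-code rule (200 when READY, 503 otherwise) and the listed fields can be sketched as a plain function. `health_payload` is a hypothetical helper, and representing `models` as a list of names is an assumption; the documentation only says the field holds loaded model information.

```python
import time
from http import HTTPStatus
from typing import List, Tuple


def health_payload(
    status: str, start_time: float, models: List[str], active: int
) -> Tuple[int, dict]:
    """Hypothetical helper returning (http_status, body) for a health probe."""
    now = time.time()
    body = {
        "status": status,
        "timestamp": now,
        "uptime_seconds": now - start_time,
        "models": models,
        "active_requests": active,
    }
    # 200 only when ready; every other state maps to 503.
    code = HTTPStatus.OK if status == "ready" else HTTPStatus.SERVICE_UNAVAILABLE
    return int(code), body


ready_code, ready_body = health_payload("ready", time.time() - 5, ["demo"], 0)
busy_code, _ = health_payload("busy", time.time(), ["demo"], 3)
```

Returning 503 for every non-ready state lets load balancers and orchestrator probes take the instance out of rotation without parsing the body.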
- async list_models(raw_request: Request) JSONResponse[source]#
List available models.
OpenAI-compatible model listing endpoint.
- Returns
JSONResponse with list of available models and their metadata.
- async list_tools(raw_request: Request) JSONResponse[source]#
List available tools/functions for each model.
Returns example tool definitions and supported formats. This is a placeholder that can be extended with actual tools.
- Returns
JSONResponse with tool definitions per model.