easydel.inference.esurge.runners.model_runner
eSurge Model Runner - High-performance inference execution engine.
This module implements the core execution logic for the eSurge inference engine, providing efficient model execution with advanced features like paged attention, dynamic batching, and compilation caching.
- Key Components:
ExecutionManager: Manages compiled execution functions for different batch/token configurations
eSurgeRunner: Main runner class that orchestrates model execution
- Architecture:
The module uses a two-stage compilation strategy:
1. Pre-compilation of functions for different token/batch size combinations
2. Runtime selection of the appropriate compiled function based on input shape
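The two stages can be illustrated with a minimal, library-free sketch. It is not the eSurge implementation: the doubling bucket scheme and the helper names (make_buckets, precompile, select_bucket) are assumptions made for illustration, and a padding lambda stands in for a real compiled function.

```python
from bisect import bisect_left


def make_buckets(min_size: int, max_size: int) -> list[int]:
    """Build doubling bucket sizes from min_size up to max_size (assumed scheme)."""
    buckets = []
    size = min_size
    while size < max_size:
        buckets.append(size)
        size *= 2
    buckets.append(max_size)
    return buckets


def precompile(buckets: list[int]) -> dict:
    """Stage 1: build one callable per bucket ahead of time."""
    compiled = {}
    for bucket in buckets:
        # Stand-in for an expensive compile step: each callable pads its
        # input with zeros up to the fixed bucket length.
        compiled[bucket] = lambda tokens, _b=bucket: tokens + [0] * (_b - len(tokens))
    return compiled


def select_bucket(buckets: list[int], num_tokens: int) -> int:
    """Stage 2: pick the smallest precompiled bucket that fits the input."""
    idx = bisect_left(buckets, num_tokens)
    if idx == len(buckets):
        raise ValueError(f"{num_tokens} tokens exceed the largest bucket {buckets[-1]}")
    return buckets[idx]


buckets = make_buckets(256, 2048)
compiled = precompile(buckets)

# A 300-token input is routed to the 512 bucket and padded to that length.
bucket = select_bucket(buckets, 300)
padded = compiled[bucket](list(range(300)))
```

Because every input is rounded up to a shape that was prepared in advance, no compilation has to happen on the request path; the cost is some wasted computation on padding.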
- Performance Features:
Paged attention for efficient KV cache management
Vectorized operations for batch processing
Pre-allocated buffers to minimize memory allocation
Compilation caching to avoid recompilation
Progress logging for long compilation processes
Example
>>> from easydel.infra import EasyDeLBaseModule
>>> from easydel.inference.esurge.runners import eSurgeRunner
>>>
>>> # Initialize model
>>> model = EasyDeLBaseModule.from_pretrained("model-name")
>>>
>>> # Create runner
>>> runner = eSurgeRunner(
... model=model,
... max_model_len=2048,
... max_num_seqs=8,
... hbm_utilization=0.9
... )
>>>
>>> # Compile for different configurations
>>> runner.compile()
>>>
>>> # Execute model
>>> output = runner.execute_model(scheduler_output)
- class easydel.inference.esurge.runners.model_runner.eSurgeRunner(model: EasyDeLBaseModule, hbm_utilization: float = 0.5, page_size: int = 128, max_model_len: int = 8192, min_input_pad: int = 256, max_num_seqs: int = 16, max_num_seq_buckets: list[int] | None = None, use_aot_forward: bool = True, verbose: bool = False, enable_overlap_execution: bool = False, enable_sampler_metrics: bool = False)
Bases: object
High-performance model runner for efficient batched inference.
The eSurgeRunner orchestrates model execution with advanced features:
Paged attention for memory-efficient KV cache management
Dynamic batching with request scheduling
Pre-allocated buffers for zero-copy operations
Vectorized token processing
Compilation caching for different batch/sequence configurations
The runner maintains an internal state of active requests and manages their lifecycle from prompt processing through token generation.
- Architecture:
Request Flow:
1. Scheduler provides requests to execute
2. Runner updates internal state (add/remove requests)
3. Prepares inputs with proper padding and batching
4. Executes model using pre-compiled functions
5. Processes sampled tokens and updates buffers
6. Returns results to scheduler
- Memory Management:
Pre-allocated buffers for common operations
Paged KV cache with configurable page size
Efficient slot mapping for attention
Buffer reuse across batches
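As a rough illustration of paged KV caching and slot mapping, the sketch below maps logical token positions to flat cache slots through a per-sequence page table. The layout (slot = physical page * page size + offset) is a common convention assumed here, not a description of eSurge internals, and the function names are hypothetical.

```python
def pages_needed(num_tokens: int, page_size: int) -> int:
    """Number of pages required to hold num_tokens (ceiling division)."""
    return -(-num_tokens // page_size)


def slot_mapping(page_table: list[int], num_tokens: int, page_size: int) -> list[int]:
    """Map each logical token position to a flat slot index in the paged cache."""
    slots = []
    for pos in range(num_tokens):
        page = page_table[pos // page_size]  # physical page holding this token
        offset = pos % page_size             # position inside that page
        slots.append(page * page_size + offset)
    return slots


# A sequence of 6 tokens with page_size=4 needs two pages; here the
# allocator happened to hand out physical pages 7 and 2.
page_table = [7, 2]
slots = slot_mapping(page_table, num_tokens=6, page_size=4)
print(slots)  # [28, 29, 30, 31, 8, 9]
```

The physical pages do not need to be contiguous, which is what lets sequences grow without reserving max_model_len slots for each one up front.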
- model
The EasyDeL model to run
- metadata
Paged attention metadata
- max_num_seqs
Maximum concurrent sequences
- max_model_len
Maximum sequence length
- executor_manager
Manages compiled functions
- sequence_buffer
Manages active sequences
- requests
Active request states
Example
>>> runner = eSurgeRunner(
...     model=model,
...     max_model_len=2048,
...     max_num_seqs=8,
...     hbm_utilization=0.9,
...     page_size=128
... )
>>>
>>> # Compile for all configurations
>>> runner.compile()
>>>
>>> # Execute requests from scheduler
>>> output = runner.execute_model(scheduler_output)
>>>
>>> # Process results
>>> for req_id, tokens in zip(output.req_ids, output.sampled_token_ids):
...     print(f"Request {req_id}: {tokens}")
- execute_model(scheduler_output: SchedulerOutput) → ModelRunnerOutput
- execute_model_async(scheduler_output: SchedulerOutput) → Future[ModelRunnerOutput]
Execute model asynchronously in a background thread.
This method enables async scheduling by executing the model in a separate thread, allowing the caller to continue scheduling the next batch while the current batch is being processed.
- The async execution workflow:
Submit model execution to thread pool executor
Return immediately with a Future object
Caller can schedule next batch while this executes
Use wait_for_execution(future) to get results when needed
- Parameters
scheduler_output – Scheduling decisions for this iteration
- Returns
Future that will contain the model output when execution completes. Can be waited on using wait_for_execution().
- Return type
Future[ModelRunnerOutput]
- Raises
RuntimeError – If async execution is not enabled (executor not initialized)
Note
This method requires async scheduling to be enabled and the executor to be initialized. Initialize the executor by calling initialize_async_executor() first.
Example
>>> # Initialize async executor first
>>> runner.initialize_async_executor()
>>>
>>> # Execute asynchronously
>>> future = runner.execute_model_async(scheduler_output)
>>>
>>> # Do other work while model executes...
>>> next_schedule = scheduler.schedule()
>>>
>>> # Wait for current execution to finish
>>> output = runner.wait_for_execution(future)
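The overlap pattern described for this method can be tried without a model by using the standard library directly. In the sketch below, fake_execute and fake_schedule are hypothetical stand-ins for model execution and scheduling; only concurrent.futures is real API.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_execute(batch_id: int) -> str:
    """Stand-in for model execution: sleeps briefly, then returns a result."""
    time.sleep(0.05)
    return f"output-{batch_id}"


def fake_schedule(current_batch: int) -> int:
    """Stand-in for the scheduler: decides which batch runs next."""
    return current_batch + 1


executor = ThreadPoolExecutor(max_workers=1)
outputs = []
batch = 0
for _ in range(3):
    future = executor.submit(fake_execute, batch)  # returns immediately
    next_batch = fake_schedule(batch)              # runs while the batch executes
    outputs.append(future.result())                # block only when the result is needed
    batch = next_batch
executor.shutdown()
```

The scheduling call runs between submit and result, so its cost is hidden behind the execution of the current batch.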
- initialize_async_executor() → None
Initialize the thread pool executor for async model execution.
This method creates a single-threaded executor that will be used to run model execution in the background, enabling async scheduling.
- Side Effects:
Creates self._executor as a ThreadPoolExecutor with 1 worker
Existing executor is shut down if present
Note
This should be called before using execute_model_async(). The executor uses a single worker to maintain execution order.
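The executor lifecycle documented here can be mirrored in a small stand-alone class. AsyncRunnerSketch is hypothetical and only imitates the documented behaviour (a single worker, the previous executor shut down on re-initialization, RuntimeError when submitting before initialization); whether the real runner waits for pending work during shutdown is an assumption.

```python
from concurrent.futures import Future, ThreadPoolExecutor


class AsyncRunnerSketch:
    """Hypothetical stand-in that imitates the documented executor lifecycle."""

    def __init__(self) -> None:
        self._executor = None

    def initialize_async_executor(self) -> None:
        # Shut down any previous executor before creating a new one.
        # Waiting for pending work here is an assumption of this sketch.
        if self._executor is not None:
            self._executor.shutdown(wait=True)
        self._executor = ThreadPoolExecutor(max_workers=1)

    def execute_async(self, fn, *args) -> Future:
        if self._executor is None:
            raise RuntimeError("async executor not initialized")
        return self._executor.submit(fn, *args)

    def shutdown(self) -> None:
        if self._executor is not None:
            self._executor.shutdown(wait=True)
            self._executor = None


runner = AsyncRunnerSketch()

# Submitting before initialization fails, as the Raises section describes.
try:
    runner.execute_async(print, "never runs")
    error_message = None
except RuntimeError as exc:
    error_message = str(exc)

runner.initialize_async_executor()

# With a single worker, tasks run one at a time in submission order.
order = []
futures = [runner.execute_async(order.append, i) for i in range(5)]
for f in futures:
    f.result()
runner.shutdown()
```

A single worker trades parallelism for a simple ordering guarantee: batch N always finishes before batch N+1 starts, so per-sequence state is never updated out of order.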
- property mesh
- reset_state() → None
Clear sequence state and request bookkeeping.
Useful when pausing or resetting the runner to ensure no stale pages or request metadata linger between sessions.
- update_model_weights(model: EasyDeLBaseModule | None = None, *, graphdef=None, graphstate=None, graphother=None, reset_state: bool = True) → None
Update the runner’s model weights/graphs and optionally reset state.
- Parameters
model – Optional EasyDeL model instance providing new weights. If omitted, graph components must be supplied explicitly.
graphdef – Optional graphdef override.
graphstate – Optional graphstate override.
graphother – Optional graphother override.
reset_state – When True (default), reinitializes internal buffers and cached requests to ensure the new weights are applied cleanly.
- Raises
RuntimeError – If active requests exist while reset_state is True.
- wait_for_execution(future: Future) → ModelRunnerOutput
Wait for an async execution to complete and return the result.
- Parameters
future – The Future object returned by execute_model_async()
- Returns
The completed model execution output
- Return type
ModelRunnerOutput
Note
This call blocks until the future completes.