easydel.inference.esurge.runners.model_runner

eSurge Model Runner - High-performance inference execution engine.

This module implements the core execution logic for the eSurge inference engine, providing efficient model execution with advanced features like paged attention, dynamic batching, and compilation caching.

Key Components:

ExecutionManager: Manages compiled execution functions for different batch/token configurations
eSurgeRunner: Main runner class that orchestrates model execution

Architecture:

The module uses a two-stage compilation strategy:

1. Pre-compilation of functions for different token/batch size combinations
2. Runtime selection of the appropriate compiled function based on input shape
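The two stages can be illustrated with a minimal, library-independent sketch. It is not eSurge's actual implementation: the helper names `make_token_buckets` and `select_bucket`, the power-of-two bucket spacing, and the stand-in "compiled" closures are all assumptions made for illustration.

```python
from bisect import bisect_left


def make_token_buckets(min_pad: int, max_len: int) -> list[int]:
    """Return doubling padding sizes from min_pad up to max_len (assumed spacing)."""
    buckets = []
    size = min_pad
    while size < max_len:
        buckets.append(size)
        size *= 2
    buckets.append(max_len)  # always include the upper bound
    return buckets


def select_bucket(buckets: list[int], num_tokens: int) -> int:
    """Pick the smallest pre-compiled size that can hold num_tokens."""
    idx = bisect_left(buckets, num_tokens)
    if idx == len(buckets):
        raise ValueError(f"{num_tokens} tokens exceeds largest bucket {buckets[-1]}")
    return buckets[idx]


# Stage 1: "compile" one function per bucket (a stand-in closure, not a real compile).
buckets = make_token_buckets(min_pad=256, max_len=2048)
compiled = {size: (lambda tokens, size=size: f"ran bucket {size}") for size in buckets}

# Stage 2: at runtime, route the input to the matching pre-compiled function.
chosen = select_bucket(buckets, num_tokens=300)
result = compiled[chosen](list(range(300)))
```

With these assumed settings, a 300-token input is padded up to the 512 bucket, so no new compilation is triggered at request time.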

Performance Features:

Paged attention for efficient KV cache management
Vectorized operations for batch processing
Pre-allocated buffers to minimize memory allocation
Compilation caching to avoid recompilation
Progress logging for long compilation processes

Example

>>> from easydel.infra import EasyDeLBaseModule
>>> from easydel.inference.esurge.runners import eSurgeRunner
>>>
>>> # Initialize model
>>> model = EasyDeLBaseModule.from_pretrained("model-name")
>>>
>>> # Create runner
>>> runner = eSurgeRunner(
...     model=model,
...     max_model_len=2048,
...     max_num_seqs=8,
...     hbm_utilization=0.9
... )
>>>
>>> # Compile for different configurations
>>> runner.compile()
>>>
>>> # Execute model
>>> output = runner.execute_model(scheduler_output)

class easydel.inference.esurge.runners.model_runner.eSurgeRunner(model: EasyDeLBaseModule, hbm_utilization: float = 0.5, page_size: int = 128, max_model_len: int = 8192, min_input_pad: int = 256, max_num_seqs: int = 16, max_num_seq_buckets: list[int] | None = None, use_aot_forward: bool = True, verbose: bool = False, enable_overlap_execution: bool = False, enable_sampler_metrics: bool = False)

Bases: object

High-performance model runner for efficient batched inference.

The eSurgeRunner orchestrates model execution with advanced features:

Paged attention for memory-efficient KV cache management
Dynamic batching with request scheduling
Pre-allocated buffers for zero-copy operations
Vectorized token processing
Compilation caching for different batch/sequence configurations

The runner maintains an internal state of active requests and manages their lifecycle from prompt processing through token generation.

Architecture:

Request Flow:

1. Scheduler provides requests to execute
2. Runner updates internal state (add/remove requests)
3. Runner prepares inputs with proper padding and batching
4. Runner executes the model using pre-compiled functions
5. Runner processes sampled tokens and updates buffers
6. Runner returns results to the scheduler

Memory Management:

Pre-allocated buffers for common operations
Paged KV cache with configurable page size
Efficient slot mapping for attention
Buffer reuse across batches
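As a rough illustration of how a paged KV cache maps token positions to cache slots, the sketch below uses the common paged-attention layout in which each sequence holds a table of physical page indices. The function names and the page-table representation are assumptions for illustration and are not taken from the eSurge source.

```python
def pages_needed(num_tokens: int, page_size: int) -> int:
    """Number of fixed-size pages required to hold num_tokens (ceiling division)."""
    return -(-num_tokens // page_size)


def slot_mapping(page_table: list[int], num_tokens: int, page_size: int) -> list[int]:
    """Map each logical token position to a flat physical slot in the cache."""
    slots = []
    for pos in range(num_tokens):
        page = page_table[pos // page_size]  # physical page holding this token
        slots.append(page * page_size + pos % page_size)  # offset inside the page
    return slots


# A sequence of 6 tokens with page_size=4 occupies two (non-contiguous) pages.
page_table = [7, 2]
slots = slot_mapping(page_table, num_tokens=6, page_size=4)
```

Because pages need not be contiguous, a sequence can grow one page at a time without reserving memory for its maximum length up front.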

model: The EasyDeL model to run

metadata: Paged attention metadata

max_num_seqs: Maximum concurrent sequences

max_model_len: Maximum sequence length

executor_manager: Manages compiled functions

sequence_buffer: Manages active sequences

requests: Active request states

Example

>>> runner = eSurgeRunner(
...     model=model,
...     max_model_len=2048,
...     max_num_seqs=8,
...     hbm_utilization=0.9,
...     page_size=128
... )
>>>
>>> # Compile for all configurations
>>> runner.compile()
>>>
>>> # Execute requests from scheduler
>>> output = runner.execute_model(scheduler_output)
>>>
>>> # Process results
>>> for req_id, tokens in zip(output.req_ids, output.sampled_token_ids):
...     print(f"Request {req_id}: {tokens}")

compile(): Compile the model for all token padding sizes.

destroy_kv_cache() → None: Destroy the current ragged KV cache to release memory.

execute_model(scheduler_output: SchedulerOutput) → ModelRunnerOutput

execute_model_async(scheduler_output: SchedulerOutput) → Future[ModelRunnerOutput]

Execute model asynchronously in a background thread.

This method enables async scheduling by executing the model in a separate thread, allowing the caller to continue scheduling the next batch while the current batch is being processed.

The async execution workflow:

Submit model execution to thread pool executor
Return immediately with a Future object
Caller can schedule next batch while this executes
Use wait_for_execution(future) to get results when needed

Parameters

scheduler_output – Scheduling decisions for this iteration

Returns

Future that will contain the model output when execution completes. It can be waited on using wait_for_execution().

Return type

Future[ModelRunnerOutput]

Raises

RuntimeError – If async execution is not enabled (executor not initialized)

Note

This method requires async scheduling to be enabled and the executor to be initialized. Initialize the executor by calling initialize_async_executor() first.

Example

>>> # Initialize async executor first
>>> runner.initialize_async_executor()
>>>
>>> # Execute asynchronously
>>> future = runner.execute_model_async(scheduler_output)
>>>
>>> # Do other work while model executes...
>>> next_schedule = scheduler.schedule()
>>>
>>> # Wait for current execution to finish
>>> output = runner.wait_for_execution(future)

initialize_async_executor() → None

Initialize the thread pool executor for async model execution.

This method creates a single-threaded executor that will be used to run model execution in the background, enabling async scheduling.

Side Effects:

Creates self._executor as a ThreadPoolExecutor with 1 worker
Shuts down any existing executor first

Note

This should be called before using execute_model_async(). The executor uses a single worker to maintain execution order.
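The single-worker pattern described here can be sketched with the standard library alone. `ToyRunner` below is a hypothetical stand-in, not the eSurgeRunner implementation: it only mirrors the documented method names and shows that one worker thread runs submitted steps in submission order, even when earlier steps take longer.

```python
import time
from concurrent.futures import Future, ThreadPoolExecutor
from typing import Optional


class ToyRunner:
    """Hypothetical stand-in that mirrors the documented async pattern."""

    def __init__(self) -> None:
        self._executor: Optional[ThreadPoolExecutor] = None
        self.log: list[int] = []  # order in which steps actually ran

    def initialize_async_executor(self) -> None:
        # Shut down any previous executor before creating a new one.
        if self._executor is not None:
            self._executor.shutdown(wait=True)
        self._executor = ThreadPoolExecutor(max_workers=1)

    def execute_model(self, step: int) -> int:
        time.sleep(0.01 * (3 - step))  # earlier steps deliberately take longer
        self.log.append(step)
        return step * 10

    def execute_model_async(self, step: int) -> Future:
        if self._executor is None:
            raise RuntimeError("async executor not initialized")
        return self._executor.submit(self.execute_model, step)

    def shutdown(self) -> None:
        if self._executor is not None:
            self._executor.shutdown(wait=True)
            self._executor = None


runner = ToyRunner()
runner.initialize_async_executor()
futures = [runner.execute_model_async(step) for step in range(3)]
outputs = [f.result() for f in futures]  # blocks until each step finishes
runner.shutdown()
```

With more than one worker, the shorter later steps could finish first; a single worker keeps execution strictly sequential while still freeing the caller's thread.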

initialize_kv_cache() → None: Reinitialize the ragged KV cache if it has been destroyed.

property mesh

reset_state() → None

Clear sequence state and request bookkeeping.

Useful when pausing or resetting the runner to ensure no stale pages or request metadata linger between sessions.

shutdown() → None: Clean up resources, including the async executor if present.

update_model_weights(model: EasyDeLBaseModule | None = None, *, graphdef=None, graphstate=None, graphother=None, reset_state: bool = True) → None

Update the runner’s model weights/graphs and optionally reset state.

Parameters

model – Optional EasyDeL model instance providing new weights. If omitted, graph components must be supplied explicitly.
graphdef – Optional graphdef override.
graphstate – Optional graphstate override.
graphother – Optional graphother override.
reset_state – When True (default) reinitializes internal buffers and cached requests to ensure the new weights are applied cleanly.

Raises

RuntimeError – If active requests exist while reset_state is True.

wait_for_execution(future: Future) → ModelRunnerOutput

Wait for an async execution to complete and return the result.

Parameters: future – The Future object returned by execute_model_async()
Returns: The completed model execution output
Return type: ModelRunnerOutput

Note

This call blocks until the future completes.
