oDrive/oEngine Example

The oDrive engine is one of the available engine drivers for EasyDeL’s vSurge inference server. It is designed for efficient inference using Paged Attention, a technique that optimizes management of the KV (Key-Value) cache, which stores the key and value tensors computed for earlier tokens so they do not have to be recomputed during sequence generation. By dividing the KV cache into fixed-size “pages,” oDrive can handle variable-length sequences more efficiently, reduce memory fragmentation, and increase the utilization of High Bandwidth Memory (HBM).
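To make the paging idea concrete, here is a minimal, self-contained sketch of a page table that hands out fixed-size pages to sequences as they grow. It is a toy illustration only: the class `PagedKVAllocator` and its methods are hypothetical names, it tracks page ids rather than real tensors, and it is not how oDrive is implemented internally.

```python
import math


class PagedKVAllocator:
    """Toy page table (hypothetical, not an EasyDeL API).

    Maps each sequence id to a list of fixed-size page ids.
    """

    def __init__(self, num_pages, page_size):
        self.page_size = page_size                # tokens per page
        self.free_pages = list(range(num_pages))  # pool of unused page ids
        self.page_table = {}                      # sequence id -> page ids
        self.lengths = {}                         # sequence id -> tokens stored

    def append_tokens(self, seq_id, num_tokens):
        """Reserve just enough pages for `num_tokens` additional tokens."""
        new_len = self.lengths.get(seq_id, 0) + num_tokens
        pages = self.page_table.get(seq_id, [])
        extra = math.ceil(new_len / self.page_size) - len(pages)
        if extra > len(self.free_pages):
            raise MemoryError("KV cache is out of free pages")
        new_pages = [self.free_pages.pop() for _ in range(extra)]
        self.page_table[seq_id] = pages + new_pages
        self.lengths[seq_id] = new_len

    def release(self, seq_id):
        """Return all pages of a finished sequence to the free pool."""
        self.free_pages.extend(self.page_table.pop(seq_id, []))
        self.lengths.pop(seq_id, None)


cache = PagedKVAllocator(num_pages=8, page_size=16)
cache.append_tokens("a", 20)  # 20 tokens need 2 pages
cache.append_tokens("b", 5)   # 5 tokens need 1 page
cache.append_tokens("a", 13)  # sequence "a" now holds 33 tokens, so 3 pages
cache.release("a")            # pages become reusable by other sequences
```

Because a sequence only ever wastes part of its last page, sequences of very different lengths can share one pool without reserving memory for the longest possible sequence up front.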

The oDrive engine is instantiated using the ed.vSurge.create_odriver class method. Below is an example demonstrating how to create an oDrive instance:

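import easydel as ed

# model, processor, prefill_length, page_size, hbm_utilization, and
# max_concurrent_decodes are assumed to be defined earlier.
surge = ed.vSurge.create_odriver(
    model=model,
    processor=processor,
    max_prefill_length=prefill_length,
    prefill_lengths=[prefill_length],
    page_size=page_size,
    hbm_utilization=hbm_utilization,
    # This example reuses the decode limit as the prefill limit.
    max_concurrent_prefill=max_concurrent_decodes,
    max_concurrent_decodes=max_concurrent_decodes,
    seed=877,
    vsurge_name="my_odrive_server",
)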

Parameters:

The create_odriver class method accepts several parameters to configure the oDrive engine:

model: The loaded EasyDeL model instance that will be used for inference.
processor: The tokenizer or processor object required for encoding input prompts and decoding generated tokens.
max_prefill_length: The maximum sequence length allowed during the initial prompt processing (prefill) phase. Prompts longer than this may be truncated.
prefill_lengths: The prefill length or lengths to optimize kernels for, or None. The example above passes a single-element list containing the maximum prefill length. This helps the engine optimize for specific input lengths.
page_size: The size of memory pages used for managing the KV cache. A core parameter for the Paged Attention mechanism, influencing memory allocation granularity.
hbm_utilization: The target utilization ratio for High Bandwidth Memory (HBM) allocated for the KV cache. This helps control memory usage.
max_concurrent_prefill: The maximum number of prefill requests that the engine can process simultaneously. This affects the throughput of initial prompt processing.
max_concurrent_decodes: The maximum number of decoding steps that can be executed concurrently across all active inference requests. This parameter also effectively limits the total number of concurrent requests the driver can handle.
seed: A random seed used for operations within the engine, contributing to reproducibility.
vsurge_name: (Optional) A string identifier assigned to this specific vSurge instance. Defaults to the driver’s name if not provided.
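As a rough guide to how page_size and hbm_utilization interact, the following sketch estimates how many KV cache pages fit in a memory budget. It is back-of-the-envelope arithmetic under stated assumptions, not oDrive's actual sizing logic: the function `estimate_num_pages` is hypothetical, it assumes hbm_utilization is applied to the memory left after the weights are loaded, and it assumes a standard attention layout in which every token stores one key and one value vector per layer and KV head.

```python
def estimate_num_pages(
    free_hbm_bytes,
    hbm_utilization,
    page_size,
    num_layers,
    num_kv_heads,
    head_dim,
    bytes_per_element=2,  # assumption: 16-bit cache entries (bf16/fp16)
):
    """Hypothetical helper: pages that fit in the KV cache budget."""
    # Factor 2 accounts for storing both a key and a value per token.
    bytes_per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_element
    bytes_per_page = page_size * bytes_per_token
    budget = int(free_hbm_bytes * hbm_utilization)
    return budget // bytes_per_page


# Illustrative numbers only (not taken from any specific model or device).
pages = estimate_num_pages(
    free_hbm_bytes=16 * 1024**3,
    hbm_utilization=0.875,
    page_size=128,
    num_layers=32,
    num_kv_heads=8,
    head_dim=128,
)
total_tokens = pages * 128  # tokens the cache can hold across all requests
```

Under these assumptions, a larger page_size means fewer, coarser pages (less bookkeeping, more waste in each sequence's last page), while a higher hbm_utilization enlarges the budget at the cost of headroom for other allocations.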

The oDrive engine handles the underlying inference logic, manages the KV cache efficiently using Paged Attention (configured via page_size and hbm_utilization), and schedules incoming requests based on parameters like max_concurrent_prefill and max_concurrent_decodes.
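The way the two concurrency limits bound admission can be sketched with a toy scheduling step. This is an illustration of the general idea only, assuming each admitted request occupies one decode slot after prefill; the function `schedule_step` and the request names are hypothetical and do not reflect oDrive's real scheduler.

```python
from collections import deque


def schedule_step(waiting, decoding, max_concurrent_prefill, max_concurrent_decodes):
    """Hypothetical scheduler step.

    Moves requests from the `waiting` queue into the `decoding` list,
    admitting at most `max_concurrent_prefill` per step and never letting
    `decoding` grow beyond `max_concurrent_decodes`.
    """
    free_slots = max_concurrent_decodes - len(decoding)
    count = max(0, min(max_concurrent_prefill, free_slots, len(waiting)))
    admitted = [waiting.popleft() for _ in range(count)]
    decoding.extend(admitted)
    return admitted


waiting = deque(["r0", "r1", "r2", "r3", "r4"])
decoding = ["x"]  # one request is already generating tokens

first = schedule_step(waiting, decoding, max_concurrent_prefill=2, max_concurrent_decodes=4)
second = schedule_step(waiting, decoding, max_concurrent_prefill=2, max_concurrent_decodes=4)
third = schedule_step(waiting, decoding, max_concurrent_prefill=2, max_concurrent_decodes=4)
```

In this sketch the first step is capped by the prefill limit, the second by the single remaining decode slot, and the third admits nothing until a running request finishes, which mirrors how max_concurrent_decodes effectively caps the total number of in-flight requests.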
