EasyDeL vInference API Server#

The EasyDeL vInference API Server provides an OpenAI-compatible API for serving language models and multimodal models through JAX and EasyDeL. It supports both text-only and text+image interactions, with streaming, token counting, and other advanced features.

Key Features#

  • OpenAI-Compatible API: Drop-in replacement for OpenAI API clients

  • Text & Multimodal Support: Works with both text-only LLMs and vision-language models

  • Streaming Responses: Progressive generation with minimal latency

  • Token Counting: Calculate token usage for inputs

  • Model Management: Serve multiple models from a single server

  • Performance Metrics: Optional Prometheus-compatible metrics

  • Hardware Optimization: JAX-powered acceleration on GPU/TPU

API Endpoints#

Chat Completions API#

POST /v1/chat/completions

Generate a model response for the given chat conversation.

Text-Only Example Request#

{
  "model": "LLaMA",
  "messages": [
    {
      "role": "user",
      "content": "Explain quantum computing in simple terms"
    }
  ],
  "temperature": 0.7,
  "max_tokens": 500,
  "stream": false
}

Multimodal Example Request#

{
  "model": "multimodal",
  "messages": [
    {
      "role": "user",
      "content": [
        {
          "type": "image",
          "image": "https://example.com/image.jpg"
        },
        {
          "type": "text",
          "text": "Describe this image in detail."
        }
      ]
    }
  ],
  "temperature": 0.8,
  "max_tokens": 300,
  "stream": false
}

Response Format#

{
  "id": "chat-abc123",
  "object": "chat.completion",
  "created": 1677858242,
  "model": "LLaMA",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "Quantum computing is a field that uses quantum mechanics to perform calculations..."
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 14,
    "completion_tokens": 120,
    "total_tokens": 134,
    "tokens_per_second": 42.5,
    "processing_time": 2.82
  }
}

Token Counting API#

POST /v1/count_tokens

Count the number of tokens in a given text or conversation.

Request Example#

{
  "model": "LLaMA",
  "conversation": "Explain quantum computing in simple terms"
}

Response Format#

{
  "model": "LLaMA",
  "count": 7
}

Models API#

GET /v1/models

Get the list of available models on the server.

Response Format#

{
  "object": "list",
  "data": [
    {
      "id": "LLaMA",
      "object": "model",
      "owned_by": "easydel",
      "permission": []
    },
    {
      "id": "multimodal",
      "object": "model",
      "owned_by": "easydel",
      "permission": []
    }
  ]
}

Health Check APIs#

GET /liveness
GET /readiness

Check if the API server is running and ready to receive requests.

Setup Guide#

Prerequisites#

  • JAX with appropriate GPU/TPU backend

  • EasyDeL

  • FastAPI and dependencies

Starting a Text-Only Model Server#

import jax
import easydel as ed
from jax import numpy as jnp
from transformers import AutoTokenizer

# Load the model
model = ed.AutoEasyDeLModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.2-1B-Instruct",
    # Model loading configuration...
)

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-1B-Instruct")
if tokenizer.pad_token_id is None:
    tokenizer.pad_token_id = tokenizer.eos_token_id
tokenizer.padding_side = "left"

# Create inference object
inference = ed.vInference(
    model=model,
    processor_class=tokenizer,
    generation_config=ed.vInferenceConfig(
        max_new_tokens=2048,
        streaming_chunks=64,
        num_return_sequences=1,
    ),
    inference_name="LLaMA",
)

# Precompile for better performance
inference.precompile(
    ed.vInferencePreCompileConfig(
        batch_size=1,
        prefill_length=1024,
    )
)

# Start the API server
ed.vInferenceApiServer(inference).fire(
    port=8000,
    metrics_port=8001,
)

Starting a Multimodal Model Server#

import jax
import easydel as ed
from jax import numpy as jnp
from transformers import AutoProcessor

# Load the processor and model
processor = AutoProcessor.from_pretrained("llava-hf/llava-1.5-7b-hf")
processor.padding_side = "left"

model = ed.AutoEasyDeLModelForImageTextToText.from_pretrained(
    "llava-hf/llava-1.5-7b-hf",
    # Model loading configuration...
)

# Create inference object
inference = ed.vInference(
    model=model,
    processor_class=processor,
    generation_config=ed.vInferenceConfig(
        max_new_tokens=1024,
        streaming_chunks=32,
        num_return_sequences=1,
    ),
    inference_name="multimodal",
)

# Precompile with vision settings
inference.precompile(
    ed.vInferencePreCompileConfig(
        batch_size=1,
        prefill_length=2048,
        vision_included=True,
        vision_batch_size=1,
        vision_channels=3,
        vision_height=336,
        vision_width=336,
    )
)

# Start the API server
ed.vInferenceApiServer(inference).fire(
    port=8000,
    metrics_port=8001,
)

Client Usage Examples#

Text-Only Chat Completion#

import requests

api_url = "http://localhost:8000"
model_id = "LLaMA"
prompt = "Explain quantum computing in simple terms"

response = requests.post(
    f"{api_url}/v1/chat/completions",
    json={
        "model": model_id,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.7,
        "max_tokens": 500
    }
)

result = response.json()
print(result["choices"][0]["message"]["content"])

Streaming Chat Completion#

import json
import requests

api_url = "http://localhost:8000"
model_id = "LLaMA"
prompt = "Write a short story about a robot learning to paint"

response = requests.post(
    f"{api_url}/v1/chat/completions",
    json={
        "model": model_id,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.7,
        "max_tokens": 500,
        "stream": True
    },
    stream=True
)

for line in response.iter_lines():
    if line:
        line_text = line.decode('utf-8')
        if line_text.startswith("data: "):
            data = line_text[6:]
            if data == "[DONE]":
                break
            try:
                json_data = json.loads(data)
                if "choices" in json_data and json_data["choices"]:
                    delta = json_data["choices"][0]["delta"]
                    if "content" in delta:
                        print(delta["content"], end="", flush=True)
            except json.JSONDecodeError:
                pass

Multimodal Chat Completion#

import base64
import requests

api_url = "http://localhost:8000"
model_id = "multimodal"
text_prompt = "Describe this image in detail."
image_path = "path/to/image.jpg"

# Read and encode image as base64
with open(image_path, "rb") as image_file:
    base64_image = base64.b64encode(image_file.read()).decode("utf-8")

response = requests.post(
    f"{api_url}/v1/chat/completions",
    json={
        "model": model_id,
        "messages": [
            {
                "role": "user",
                "content": [
                    {
                        "type": "image",
                        "image": f"data:image/jpeg;base64,{base64_image}"
                    },
                    {
                        "type": "text",
                        "text": text_prompt
                    }
                ]
            }
        ],
        "temperature": 0.8,
        "max_tokens": 300
    }
)

result = response.json()
print(result["choices"][0]["message"]["content"])

Token Counting#

import requests

api_url = "http://localhost:8000"
model_id = "LLaMA"
text = "Explain quantum computing in simple terms"

response = requests.post(
    f"{api_url}/v1/count_tokens",
    json={
        "model": model_id,
        "conversation": text
    }
)

result = response.json()
print(f"Token count: {result['count']}")

Listing Available Models#

import requests

api_url = "http://localhost:8000"

response = requests.get(f"{api_url}/v1/models")
result = response.json()

print("Available models:")
for model in result["data"]:
    print(f"- {model['id']}")

Advanced Configuration#

Creating a Server with Multiple Models#

import easydel as ed

# Initialize your model inference objects
inference1 = ed.vInference(
    # Configuration for first model
    inference_name="model1"
)

inference2 = ed.vInference(
    # Configuration for second model
    inference_name="model2"
)

# Create a server with multiple models
server = ed.vInferenceApiServer({
    "model1": inference1,
    "model2": inference2
})

# Start the server
server.fire(port=8000)

Customizing Sampling Parameters#

import easydel as ed

# Create inference with custom sampling parameters
inference = ed.vInference(
    # Other configuration...
    generation_config=ed.vInferenceConfig(
        max_new_tokens=2048,
        streaming_chunks=64,
        sampling_params=ed.SamplingParams(
            temperature=0.8,
            top_p=0.95,
            top_k=50,
            repetition_penalty=1.1,
            presence_penalty=0.1,
            frequency_penalty=0.1,
        )
    )
)

Performance Considerations#

  • JAX compilation happens on the first inference, expect higher latency initially

  • Precompilation improves first inference latency

  • For multimodal models, optimize vision preprocessing parameters

  • Adjust batch size and max_workers based on available hardware

  • Streaming generation may have slightly higher total latency but better perceived responsiveness