easydel.inference.vwhisper.core

class easydel.inference.vwhisper.core.vWhisperInference(model: WhisperForConditionalGeneration, tokenizer: WhisperTokenizer, processor: WhisperProcessor, inference_config: vWhisperInferenceConfig | None = None, dtype: jax.typing.DTypeLike = jax.numpy.float32)

Bases: object

Speech-to-text inference engine using Whisper models.

vWhisperInference provides a JAX-optimized pipeline for transcribing and translating audio with OpenAI’s Whisper models. It handles long-form audio through automatic chunking and can generate timestamps for subtitle creation.

Features:
  • Audio transcription in multiple languages

  • Translation to English

  • Timestamp generation for subtitles

  • Long-form audio processing with chunking

  • Batch processing for efficiency

  • JAX/XLA acceleration

model

The Whisper model for conditional generation

tokenizer

Tokenizer for text processing

processor

Audio processor for feature extraction

inference_config

Configuration settings

dtype

Data type for computations

graphdef

Model graph definition

graphstate

Model state

Parameters
  • model – Fine-tuned Whisper model for inference.

  • tokenizer – Whisper tokenizer.

  • processor – Whisper processor for audio processing.

  • inference_config – Optional configuration settings.

  • dtype – Data type for JAX computations (default: float32).

Example

>>> engine = vWhisperInference(
...     model=whisper_model,
...     tokenizer=tokenizer,
...     processor=processor
... )
>>> result = engine.generate(
...     "audio.mp3",
...     language="en"
... )
>>> print(result["text"])

generate(audio_input: str | bytes | numpy.ndarray | dict[str, numpy.ndarray | int], chunk_length_s: float = 30.0, stride_length_s: float | list[float] | None = None, batch_size: int | None = None, language: str | None = None, task: str | None = None, return_timestamps: bool | None = None)

Transcribe or translate audio input.

Parameters
  • audio_input (str | bytes | numpy.ndarray | dict[str, numpy.ndarray | int]) – Input audio: a local file path, a URL, raw bytes, a NumPy array, or a dictionary containing the array and its sampling rate.

  • chunk_length_s (float, optional, defaults to 30.0) – Length of audio chunks in seconds.

  • stride_length_s (float or list[float], optional) – Stride length for chunking audio, in seconds. Defaults to chunk_length_s / 6.

  • batch_size (int, optional) – Batch size for processing. Defaults to the batch_size in inference_config.

  • language (str, optional) – Language of the input audio. Defaults to the language in inference_config.

  • task (str, optional) – Task to perform (e.g., “transcribe”, “translate”). Defaults to the task in inference_config.

  • return_timestamps (bool, optional) – Whether to return timestamps with the transcription. Defaults to the return_timestamps in inference_config.
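
The interaction between chunk_length_s and stride_length_s can be illustrated with a little arithmetic. The sketch below is not the library's chunking code: resolve_stride and chunk_windows are hypothetical helpers, and it assumes that a single stride value applies to both sides of a chunk, that a list means [left, right], and that consecutive chunks advance by the chunk length minus both strides. Only the chunk_length_s / 6 default comes from the parameter description above.

```python
from __future__ import annotations


def resolve_stride(
    chunk_length_s: float,
    stride_length_s: float | list[float] | None = None,
) -> tuple[float, float]:
    """Return the (left, right) stride in seconds."""
    if stride_length_s is None:
        # Documented default: one sixth of the chunk length.
        stride_length_s = chunk_length_s / 6
    if isinstance(stride_length_s, (int, float)):
        # Assumption: a single number is used on both sides of a chunk.
        return float(stride_length_s), float(stride_length_s)
    # Assumption: a list is interpreted as [left, right].
    left, right = stride_length_s
    return float(left), float(right)


def chunk_windows(
    total_s: float,
    chunk_length_s: float = 30.0,
    stride_length_s: float | list[float] | None = None,
) -> list[tuple[float, float]]:
    """List overlapping (start, end) windows, in seconds, covering total_s."""
    left, right = resolve_stride(chunk_length_s, stride_length_s)
    # Assumption: each new chunk starts this many seconds after the previous one.
    step = chunk_length_s - left - right
    if step <= 0:
        raise ValueError("chunk_length_s must exceed the combined stride")
    windows = []
    start = 0.0
    while True:
        end = min(start + chunk_length_s, total_s)
        windows.append((start, end))
        if end >= total_s:
            break
        start += step
    return windows


# 70 s of audio with the defaults: 30 s chunks, 5 s stride on each side.
print(chunk_windows(70.0))
```

Under these assumptions, a larger stride gives neighbouring chunks more shared context at the cost of more chunks to decode for the same audio.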

Returns

A dictionary containing the transcribed text (“text”) and optionally other information like timestamps or detected language.

Return type

dict
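
Since timestamps are meant to support subtitle creation, the following sketch shows one way a returned dictionary might be turned into SRT text. Only the "text" key is documented above. The "chunks" key and the layout of each segment (a "timestamp" pair in seconds plus a "text" string) are assumptions made for illustration, the helpers srt_time and to_srt are hypothetical, and the sample result is made up.

```python
def srt_time(seconds: float) -> str:
    """Format a time in seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    total_ms = int(round(seconds * 1000))
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def to_srt(result: dict) -> str:
    """Build SRT text from a result dictionary."""
    # Assumption: timestamped segments live under a hypothetical "chunks" key,
    # each with "timestamp": (start, end) in seconds and a "text" string.
    chunks = result.get("chunks")
    if not chunks:
        # Only "text" is documented, so fall back to the plain transcription.
        return result["text"].strip()
    blocks = []
    for index, chunk in enumerate(chunks, start=1):
        start, end = chunk["timestamp"]
        blocks.append(
            f"{index}\n{srt_time(start)} --> {srt_time(end)}\n{chunk['text'].strip()}"
        )
    return "\n\n".join(blocks)


# Made-up result used purely to exercise the helper.
sample = {
    "text": " Hello there. How are you?",
    "chunks": [
        {"timestamp": (0.0, 1.5), "text": " Hello there."},
        {"timestamp": (1.5, 3.0), "text": " How are you?"},
    ],
}
print(to_srt(sample))
```

Inspect the keys of an actual result before relying on this layout, and adjust the key names to match what the engine returns.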