easydel.inference.vwhisper.core

class easydel.inference.vwhisper.core.vWhisperInference(model: WhisperForConditionalGeneration, tokenizer: WhisperTokenizer, processor: WhisperProcessor, inference_config: vWhisperInferenceConfig | None = None, dtype: jax.typing.DTypeLike = jax.numpy.float32)

Bases: object

Speech-to-text inference engine using Whisper models.

vWhisperInference provides a high-performance pipeline for transcribing and translating audio using OpenAI’s Whisper models, optimized for JAX. It supports long-form audio processing with automatic chunking and can generate timestamps for subtitle creation.

Features:

Audio transcription in multiple languages
Translation to English
Timestamp generation for subtitles
Long-form audio processing with chunking
Batch processing for efficiency
JAX/XLA acceleration

model: The Whisper model for conditional generation

tokenizer: Tokenizer for text processing

processor: Audio processor for feature extraction

inference_config: Configuration settings

dtype: Data type for computations

graphdef: Model graph definition

graphstate: Model state

Parameters

model – Fine-tuned Whisper model for inference.
tokenizer – Whisper tokenizer.
processor – Whisper processor for audio processing.
inference_config – Optional configuration settings.
dtype – Data type for JAX computations (default: float32).

Example

>>> engine = vWhisperInference(
...     model=whisper_model,
...     tokenizer=tokenizer,
...     processor=processor
... )
>>> result = engine.transcribe(
...     "audio.mp3",
...     language="en"
... )
>>> print(result["text"])

Transcribe or translate audio input.

Parameters

audio_input (tp.Union[str, bytes, np.ndarray, tp.Dict[str, tp.Union[np.ndarray, int]]]) – Input audio. Can be a local file path, URL, bytes, numpy array, or a dictionary containing the array and sampling rate.
chunk_length_s (float, optional, defaults to 30.0) – Length of audio chunks in seconds.
stride_length_s (float or list[float], optional) – Stride length for chunking audio, in seconds. Defaults to chunk_length_s / 6.
batch_size (int, optional) – Batch size for processing. Defaults to the batch_size in inference_config.
language (str, optional) – Language of the input audio. Defaults to the language in inference_config.
task (str, optional) – Task to perform (e.g., “transcribe”, “translate”). Defaults to the task in inference_config.
return_timestamps (bool, optional) – Whether to return timestamps with the transcription. Defaults to the return_timestamps in inference_config.
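The interaction between chunk_length_s and stride_length_s can be sketched in plain Python. This is only an illustration of overlapping-window chunking as commonly used for long-form Whisper inference, not EasyDeL's actual implementation: the names Chunk and plan_chunks are hypothetical, a 16 kHz sampling rate is assumed, and only a single float stride (applied to both sides) is handled, not the list form.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    """Hypothetical record describing one window of audio samples."""

    start: int         # index of the first sample in the window
    end: int           # one past the last sample in the window
    stride_left: int   # samples overlapping the previous window
    stride_right: int  # samples overlapping the next window


def plan_chunks(num_samples, sampling_rate=16_000, chunk_length_s=30.0, stride_length_s=None):
    """Plan overlapping windows over an audio signal of num_samples samples."""
    if stride_length_s is None:
        # Documented default: one sixth of the chunk length.
        stride_length_s = chunk_length_s / 6
    chunk_len = round(chunk_length_s * sampling_rate)
    stride = round(stride_length_s * sampling_rate)
    # Each new window advances by the chunk length minus both overlaps.
    step = chunk_len - 2 * stride
    if step <= 0:
        raise ValueError("stride_length_s must be less than half of chunk_length_s")

    chunks = []
    start = 0
    while True:
        end = min(start + chunk_len, num_samples)
        is_first = start == 0
        is_last = end == num_samples
        # The first window has no left overlap; the last has no right overlap.
        chunks.append(Chunk(start, end, 0 if is_first else stride, 0 if is_last else stride))
        if is_last:
            break
        start += step
    return chunks


# 70 seconds of 16 kHz audio with the default 30 s chunks and 5 s stride.
for chunk in plan_chunks(70 * 16_000):
    print(chunk)
```

Under these assumptions, 70 seconds of audio is covered by three windows, each overlapping its neighbor by 10 seconds in total (5 seconds of stride from each side), which gives the decoder context on both sides of every seam.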

Returns

A dictionary containing the transcribed text (“text”) and optionally other information like timestamps or detected language.

Return type

dict
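When timestamps are requested, the returned dictionary can be converted into subtitles. The sketch below assumes the timestamped segments are available as a list of dictionaries with a "timestamp" pair (start, end, in seconds) and a "text" string, following the convention used by similar Whisper pipelines; that layout is an assumption and is not confirmed by this page, so check the actual keys of the result before relying on it. The helper names format_srt_time and chunks_to_srt are hypothetical.

```python
def format_srt_time(seconds):
    """Format a time in seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    total_ms = round(seconds * 1000)
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def chunks_to_srt(chunks):
    """Render timestamped segments as the text of an SRT subtitle file.

    Assumed (unconfirmed) segment layout:
    {"timestamp": (start_s, end_s), "text": "..."}
    """
    blocks = []
    for index, chunk in enumerate(chunks, start=1):
        start, end = chunk["timestamp"]
        if end is None:
            # Some pipelines leave the final end time open; fall back to start.
            end = start
        blocks.append(
            f"{index}\n"
            f"{format_srt_time(start)} --> {format_srt_time(end)}\n"
            f"{chunk['text'].strip()}\n"
        )
    # SRT cues are separated by a blank line.
    return "\n".join(blocks)


# Hand-written sample segments standing in for real model output.
sample_chunks = [
    {"timestamp": (0.0, 2.4), "text": " First segment."},
    {"timestamp": (2.4, 5.0), "text": " Second segment."},
]
print(chunks_to_srt(sample_chunks))
```

With a real result, the list passed to chunks_to_srt would come from whichever key of the returned dictionary holds the timestamped segments.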
