easydel.inference.vwhisper.core
- class easydel.inference.vwhisper.core.vWhisperInference(model: WhisperForConditionalGeneration, tokenizer: WhisperTokenizer, processor: WhisperProcessor, inference_config: vWhisperInferenceConfig | None = None, dtype: jax.typing.DTypeLike = <class 'jax.numpy.float32'>)
Bases: object
Speech-to-text inference engine using Whisper models.
vWhisperInference provides a high-performance pipeline for transcribing and translating audio using OpenAI’s Whisper models, optimized for JAX. It supports long-form audio processing with automatic chunking and can generate timestamps for subtitle creation.
- Features:
Audio transcription in multiple languages
Translation to English
Timestamp generation for subtitles
Long-form audio processing with chunking
Batch processing for efficiency
JAX/XLA acceleration
- model
The Whisper model for conditional generation
- tokenizer
Tokenizer for text processing
- processor
Audio processor for feature extraction
- inference_config
Configuration settings
- dtype
Data type for computations
- graphdef
Model graph definition
- graphstate
Model state
- Parameters
model – Fine-tuned Whisper model for inference.
tokenizer – Whisper tokenizer.
processor – Whisper processor for audio processing.
inference_config – Optional configuration settings.
dtype – Data type for JAX computations (default: float32).
Example
>>> engine = vWhisperInference(
...     model=whisper_model,
...     tokenizer=tokenizer,
...     processor=processor
... )
>>> result = engine.generate(
...     "audio.mp3",
...     language="en"
... )
>>> print(result["text"])
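The class description mentions automatic chunking for long-form audio. As a rough illustration of what overlapping chunks look like, the sketch below computes window boundaries in seconds. It is not EasyDeL's implementation: the helper name `chunk_spans` is hypothetical, it accepts only a single float stride (not a list), and it assumes the stride is applied on both sides of each window. The default stride of `chunk_length_s / 6` mirrors the default documented for `generate`.

```python
def chunk_spans(total_s, chunk_length_s=30.0, stride_length_s=None):
    """Return (start, end) windows in seconds that cover [0, total_s] with overlap.

    Hypothetical helper for illustration only; not part of EasyDeL.
    """
    if stride_length_s is None:
        # Same default as documented for generate(): chunk_length_s / 6.
        stride_length_s = chunk_length_s / 6
    # Assumption: each window overlaps its neighbours by one stride on each side,
    # so consecutive windows start this many seconds apart.
    step = chunk_length_s - 2 * stride_length_s
    if step <= 0:
        raise ValueError("stride_length_s must be less than half of chunk_length_s")

    spans = []
    start = 0.0
    while True:
        end = min(start + chunk_length_s, total_s)
        spans.append((start, end))
        if end >= total_s:
            # The last window reaches the end of the audio.
            break
        start += step
    return spans


if __name__ == "__main__":
    # 70 seconds of audio with the default 30 s window and 5 s stride.
    for start, end in chunk_spans(70.0):
        print(f"{start:6.1f}s -> {end:6.1f}s")
```

With these assumptions, neighbouring windows share 10 seconds of audio, which gives a decoder context on both sides of every seam when the per-chunk outputs are merged.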
- generate(audio_input: str | bytes | numpy.ndarray | dict[str, numpy.ndarray | int], chunk_length_s: float = 30.0, stride_length_s: float | list[float] | None = None, batch_size: int | None = None, language: str | None = None, task: str | None = None, return_timestamps: bool | None = None)
Transcribe or translate audio input.
- Parameters
audio_input (tp.Union[str, bytes, np.ndarray, tp.Dict[str, tp.Union[np.ndarray, int]]]) – Input audio. Can be a local file path, URL, bytes, numpy array, or a dictionary containing the array and sampling rate.
chunk_length_s (float, optional, defaults to 30.0) – Length of audio chunks in seconds.
stride_length_s (float or list[float], optional) – Stride length for chunking audio, in seconds. Defaults to chunk_length_s / 6.
batch_size (int, optional) – Batch size for processing. Defaults to the batch_size in inference_config.
language (str, optional) – Language of the input audio. Defaults to the language in inference_config.
task (str, optional) – Task to perform (e.g., “transcribe”, “translate”). Defaults to the task in inference_config.
return_timestamps (bool, optional) – Whether to return timestamps with the transcription. Defaults to the return_timestamps in inference_config.
- Returns
A dictionary containing the transcribed text (“text”) and optionally other information like timestamps or detected language.
- Return type
dict
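Since timestamps are described as useful for subtitle creation, here is a minimal sketch of turning timestamped segments into SubRip (SRT) text. The layout of the timestamp data is an assumption: this page does not specify it, so the sketch assumes a list of segments, each a dict with a `"timestamp"` pair of `(start, end)` seconds and a `"text"` string. The helper names `format_srt_time` and `to_srt` are hypothetical and not part of EasyDeL; adapt the key names to whatever `generate` actually returns.

```python
def format_srt_time(seconds):
    """Format a time in seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    ms_total = int(round(seconds * 1000))
    hours, rem = divmod(ms_total, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def to_srt(chunks):
    """Build SRT text from segments.

    Assumed (not documented) segment shape:
        {"timestamp": (start_seconds, end_seconds), "text": "..."}
    """
    blocks = []
    for index, chunk in enumerate(chunks, start=1):
        start, end = chunk["timestamp"]
        blocks.append(
            f"{index}\n"
            f"{format_srt_time(start)} --> {format_srt_time(end)}\n"
            f"{chunk['text'].strip()}\n"
        )
    # A blank line separates consecutive subtitle entries.
    return "\n".join(blocks)


if __name__ == "__main__":
    # Hand-written sample segments standing in for real model output.
    sample = [
        {"timestamp": (0.0, 2.5), "text": " Hello there."},
        {"timestamp": (2.5, 5.0), "text": " Welcome back."},
    ]
    print(to_srt(sample))
```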