easydel.inference.vwhisper.__init__

class easydel.inference.vwhisper.__init__.WhisperModel(model_name=None, dtype=<class 'jax.numpy.bfloat16'>)

Bases: object

Singleton wrapper for the Whisper model to avoid reloading.

easydel.inference.vwhisper.__init__.chunk_iter_with_batch(audio_array: ndarray, chunk_length: int, stride_left: int, stride_right: int, batch_size: int, feature_extractor)

Process an audio array into chunks with overlapping strides.

Parameters

audio_array – Input audio array
chunk_length – Length of each chunk in samples
stride_left – Overlap with the preceding chunk, in samples
stride_right – Overlap with the following chunk, in samples
batch_size – Number of chunks to process at once
feature_extractor – Feature extractor to process audio

Yields

Batches of processed audio chunks
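The overlapping-window scheme described above can be sketched in plain Python. This is a minimal illustration, not the EasyDeL implementation: the names iter_chunks and batched are hypothetical, the feature extractor step is omitted, and dropping the overlap on the outer edge of the first and last chunk is an assumption modeled on common chunked speech-recognition pipelines.

```python
from typing import Iterable, Iterator, List, Sequence, Tuple


def iter_chunks(
    audio: Sequence[float],
    chunk_length: int,
    stride_left: int,
    stride_right: int,
) -> Iterator[Tuple[Sequence[float], int, int]]:
    """Yield (chunk, left_overlap, right_overlap) windows over `audio`.

    Hypothetical helper: consecutive windows start `step` samples apart,
    so neighbouring chunks share stride_left + stride_right samples.
    """
    step = chunk_length - stride_left - stride_right
    if step <= 0:
        raise ValueError("chunk_length must exceed stride_left + stride_right")
    total = len(audio)
    for start in range(0, total, step):
        chunk = audio[start : start + chunk_length]
        is_last = start + chunk_length >= total
        # Assumption: the outer edges of the recording carry no overlap.
        left = 0 if start == 0 else stride_left
        right = 0 if is_last else stride_right
        yield chunk, left, right
        if is_last:
            break


def batched(items: Iterable, batch_size: int) -> Iterator[List]:
    """Group items into lists of at most `batch_size` elements."""
    batch: List = []
    for item in items:
        batch.append(item)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch
```

Because the helper only relies on len() and slicing, it accepts a Python list as well as a NumPy array. Feeding iter_chunks(...) into batched(..., batch_size) gives groups of chunks that a feature extractor could then process together.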

easydel.inference.vwhisper.__init__.create_whisper_app(model_name: str = 'openai/whisper-large-v3-turbo', dtype=<class 'jax.numpy.bfloat16'>)

Create a FastAPI app for Whisper transcription.

easydel.inference.vwhisper.__init__.get_decoder_input_ids(model_config, generation_config=None, task=None, language=None, return_timestamps=False)

Helper function to get decoder input IDs for Whisper.
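For orientation, the decoder prompt that Whisper models expect can be sketched with token strings rather than IDs. This is not the body of get_decoder_input_ids: the function name decoder_prompt_tokens is hypothetical, and the real helper resolves numeric IDs from the model and generation configs. The sketch only shows the usual ordering of the special tokens.

```python
from typing import List, Optional


def decoder_prompt_tokens(
    language: Optional[str] = None,
    task: Optional[str] = None,
    return_timestamps: bool = False,
) -> List[str]:
    """Build the Whisper decoder prompt as special-token strings.

    Hypothetical helper for illustration; a real pipeline maps these
    tokens to integer IDs through the tokenizer or generation config.
    """
    tokens = ["<|startoftranscript|>"]
    if language is not None:
        # Language codes are short tags such as "en" or "fr".
        tokens.append(f"<|{language}|>")
    if task is not None:
        if task not in ("transcribe", "translate"):
            raise ValueError(f"unsupported task: {task!r}")
        tokens.append(f"<|{task}|>")
    if not return_timestamps:
        # Without this token the model is free to emit timestamp tokens.
        tokens.append("<|notimestamps|>")
    return tokens
```

Calling decoder_prompt_tokens("en", "transcribe") yields the start token, the language token, the task token, and the no-timestamps token, in that order.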

easydel.inference.vwhisper.__init__.process_audio_input(audio_input: Union[str, bytes, ndarray, Dict[str, Union[ndarray, int]]], feature_extractor)

Process audio input into a numpy array with correct sampling rate.

Parameters

audio_input – Input audio in various formats
feature_extractor – Feature extractor with sampling rate info

Returns

Tuple of (audio_array, stride)

easydel.inference.vwhisper.__init__.run_cli()

CLI entry point for running the vWhisper server.

easydel.inference.vwhisper.__init__.run_server(model_name: str = 'openai/whisper-large-v3-turbo', host: str = '0.0.0.0', port: int = 8000, dtype=<class 'jax.numpy.bfloat16'>)

Run the Whisper FastAPI server.

Parameters

model_name – Name of the Whisper model to use (from HuggingFace)
host – Host to bind the server
port – Port to bind the server
dtype – Data type for the model (default: bfloat16)
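A minimal launch sketch, assuming EasyDeL is installed and that run_server can be imported from the easydel.inference.vwhisper package, as the documented path suggests. The wrapper name start_vwhisper_server is hypothetical. It binds to 127.0.0.1 rather than the default 0.0.0.0 so the server is reachable only from the local machine.

```python
def start_vwhisper_server(host: str = "127.0.0.1", port: int = 8000) -> None:
    """Start the Whisper FastAPI server with the documented default model.

    Hypothetical wrapper: the import is deferred so that loading this
    snippet does not require EasyDeL until the function is called.
    """
    from easydel.inference.vwhisper import run_server

    run_server(
        model_name="openai/whisper-large-v3-turbo",
        host=host,
        port=port,
    )


if __name__ == "__main__":
    start_vwhisper_server()
```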

class easydel.inference.vwhisper.__init__.vWhisperInference(model: Any, tokenizer: Any, processor: Any, inference_config: Optional[vWhisperInferenceConfig] = None, dtype: Union[str, type[Any], numpy.dtype, SupportsDType] = <class 'jax.numpy.float32'>)

Bases: object

Whisper inference pipeline for performing speech-to-text transcription or translation.

Parameters

model (WhisperForConditionalGeneration) – The fine-tuned Whisper model to use for inference.
tokenizer (WhisperTokenizer) – Tokenizer for Whisper.
processor (WhisperProcessor) – Processor for Whisper.
inference_config (vWhisperInferenceConfig, optional) – Inference configuration.
dtype (jax.typing.DTypeLike, optional, defaults to jnp.float32) – Data type for computations.

generate(audio_input: Union[str, bytes, ndarray, Dict[str, Union[ndarray, int]]], chunk_length_s: float = 30.0, stride_length_s: Optional[Union[float, list[float]]] = None, batch_size: Optional[int] = None, language: Optional[str] = None, task: Optional[str] = None, return_timestamps: Optional[bool] = None)

Transcribe or translate audio input.

Parameters

audio_input (tp.Union[str, bytes, np.ndarray, tp.Dict[str, tp.Union[np.ndarray, int]]]) – Input audio. Can be a local file path, URL, bytes, numpy array, or a dictionary containing the array and sampling rate.
chunk_length_s (float, optional, defaults to 30.0) – Length of audio chunks in seconds.
stride_length_s (float or list[float], optional) – Stride length for chunking audio, in seconds. Defaults to chunk_length_s / 6.
batch_size (int, optional) – Batch size for processing. Defaults to the batch_size in inference_config.
language (str, optional) – Language of the input audio. Defaults to the language in inference_config.
task (str, optional) – Task to perform (e.g., “transcribe”, “translate”). Defaults to the task in inference_config.
return_timestamps (bool, optional) – Whether to return timestamps with the transcription. Defaults to the return_timestamps in inference_config.

Returns

A dictionary containing the transcribed text (“text”) and optionally other information like timestamps or detected language.

Return type

dict
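The relationship between chunk_length_s, stride_length_s, and sample counts can be sketched as follows. This is an illustration under stated assumptions, not the library's code: the function name resolve_chunking is hypothetical, a two-element list is assumed to mean [left, right], and the 16 kHz sampling rate is the rate Whisper feature extractors use.

```python
from typing import List, Optional, Tuple, Union


def resolve_chunking(
    chunk_length_s: float = 30.0,
    stride_length_s: Optional[Union[float, List[float]]] = None,
    sampling_rate: int = 16_000,
) -> Tuple[int, int, int]:
    """Convert second-based settings into (chunk_len, stride_left, stride_right).

    Hypothetical helper. The default stride of chunk_length_s / 6 follows
    the documented behaviour of generate().
    """
    if stride_length_s is None:
        stride_length_s = chunk_length_s / 6
    if isinstance(stride_length_s, (int, float)):
        # Assumption: a single value applies to both sides of the chunk.
        stride_length_s = [stride_length_s, stride_length_s]
    left_s, right_s = stride_length_s

    chunk_len = round(chunk_length_s * sampling_rate)
    stride_left = round(left_s * sampling_rate)
    stride_right = round(right_s * sampling_rate)
    if chunk_len <= stride_left + stride_right:
        raise ValueError("chunk length must exceed the combined stride length")
    return chunk_len, stride_left, stride_right
```

With the defaults, a 30 second chunk at 16 kHz is 480000 samples and each stride is 5 seconds, or 80000 samples.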

class easydel.inference.vwhisper.__init__.vWhisperInferenceConfig(batch_size: Optional[int] = 1, max_length: Optional[int] = None, generation_config: Optional[Any] = None)

Bases: object

Configuration class for Whisper inference.

Parameters

batch_size (int, optional, defaults to 1) – Batch size used for inference.
max_length (int, optional) – Maximum sequence length for generation.
generation_config (transformers.GenerationConfig, optional) – Generation configuration object.
logits_processor (optional) – Not used.
return_timestamps (bool, optional) – Whether to return timestamps with the transcribed text.
task (str, optional) – Task for the model (e.g., “transcribe”, “translate”).
language (str, optional) – Language of the input audio.
is_multilingual (bool, optional) – Whether the model is multilingual.

batch_size: Optional[int] = 1

classmethod from_dict(data: Dict[str, Any]) → T

Deserializes a dictionary into a PyTree object.

classmethod from_json(json_str: str) → T

Deserializes a JSON string into a PyTree object.

generation_config: Optional[Any] = None

is_multilingual = None

language = None

logits_processor = None

max_length: Optional[int] = None

replace(**kwargs)

Creates a new instance with specified fields replaced.

return_timestamps = None

task = None

to_dict() → Dict[str, Any]

Serializes the PyTree object to a dictionary.

to_json(**kwargs) → str

Serializes the PyTree object to a JSON string.
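The way per-call arguments fall back to config values, and the copy-on-replace pattern, can be illustrated with a standard-library stand-in. ConfigSketch and resolve are hypothetical names. The sketch uses dataclasses.replace in place of the class's own replace method and does not import EasyDeL.

```python
from dataclasses import asdict, dataclass, replace
from typing import Any, Optional


@dataclass(frozen=True)
class ConfigSketch:
    """Stand-in with a subset of the documented config fields."""

    batch_size: Optional[int] = 1
    task: Optional[str] = None
    language: Optional[str] = None
    return_timestamps: Optional[bool] = None


def resolve(call_value: Any, config_value: Any) -> Any:
    """Prefer an explicit per-call value; otherwise use the config value."""
    return config_value if call_value is None else call_value


base = ConfigSketch(task="transcribe", language="en")

# replace() returns a new object and leaves the original untouched.
translated = replace(base, task="translate")

# A call that passes batch_size=None falls back to the config default.
effective_batch_size = resolve(None, base.batch_size)

# Round trip through a plain dictionary, analogous to to_dict/from_dict.
restored = ConfigSketch(**asdict(translated))
```

Keeping the config immutable and deriving variants through replacement avoids one request's settings leaking into another when a single inference object is shared.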
