Introduction

whisperX

This repository provides fast automatic speech recognition (70x realtime with large-v2) with word-level timestamps and speaker diarization.

  • ⚡️ Batched inference for 70x realtime transcription using whisper large-v2 (see the sketch after this list)
  • 🪶 faster-whisper backend, requires <8GB GPU memory for large-v2 with beam_size=5
  • 🎯 Accurate word-level timestamps using wav2vec2 alignment
  • 👯‍♂️ Multispeaker ASR using speaker diarization from pyannote-audio (speaker ID labels)
  • 🗣️ VAD preprocessing reduces hallucination and enables batching with no WER degradation
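
For reference, here is a minimal sketch of the batched transcription step using the whisperX Python API (the model name, device, and batch size are illustrative; check the repository's README for the current interface):

    import whisperx

    device = "cuda"           # or "cpu"
    batch_size = 16           # reduce if you run out of GPU memory
    compute_type = "float16"  # try "int8" on smaller GPUs

    # 1. Load the faster-whisper-backed model and transcribe with batching
    model = whisperx.load_model("large-v2", device, compute_type=compute_type)
    audio = whisperx.load_audio("audio.mp3")
    result = model.transcribe(audio, batch_size=batch_size)
    print(result["segments"])  # utterance-level segments, not yet word-aligned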

Whisper  is an ASR model developed by OpenAI, trained on a large dataset of diverse audio. Whilst it does produce highly accurate transcriptions, the corresponding timestamps are at the utterance level, not per word, and can be inaccurate by several seconds. OpenAI's whisper does not natively support batching.

Phoneme-Based ASR  A suite of models finetuned to recognise the smallest unit of speech distinguishing one word from another, e.g. the element p in “tap”. A popular example model is wav2vec2.0.

Forced Alignment  refers to the process by which orthographic transcriptions are aligned to audio recordings to automatically generate phone level segmentation.
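
Continuing the sketch above, whisperX exposes this step roughly as follows (function names follow the repository's documented API; details may differ across versions):

    # 2. Align the whisper output against a wav2vec2 phoneme model to get
    # word-level timestamps (forced alignment)
    align_model, metadata = whisperx.load_align_model(
        language_code=result["language"], device=device)
    result = whisperx.align(result["segments"], align_model, metadata,
                            audio, device)
    print(result["segments"])  # each word now carries start/end times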

Voice Activity Detection (VAD)  is the detection of the presence or absence of human speech.
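
whisperX applies VAD internally to cut the audio into speech chunks before batching. As a standalone illustration, a pyannote-audio VAD pipeline can be run like this (a hedged sketch; the model is gated and requires a Hugging Face access token, and the exact API varies with the pyannote.audio version):

    from pyannote.audio import Pipeline

    # Standalone VAD example with pyannote-audio, not whisperX's internal code
    vad = Pipeline.from_pretrained("pyannote/voice-activity-detection",
                                   use_auth_token="YOUR_HF_TOKEN")
    speech = vad("audio.mp3")
    for segment in speech.get_timeline().support():
        print(f"speech from {segment.start:.2f}s to {segment.end:.2f}s")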

Speaker Diarization  is the process of partitioning an audio stream containing human speech into homogeneous segments according to the identity of each speaker.
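
Picking up the whisperX sketch again, diarization attaches a speaker label to each aligned word (the class name follows the repository's README; the underlying pyannote models are gated and need a Hugging Face token):

    # 3. Diarize, then map speaker labels onto the aligned transcript
    diarize_model = whisperx.DiarizationPipeline(use_auth_token="YOUR_HF_TOKEN",
                                                 device=device)
    diarize_segments = diarize_model(audio)
    result = whisperx.assign_word_speakers(diarize_segments, result)
    for seg in result["segments"]:
        print(seg.get("speaker"), seg["text"])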

Implementation

The implementation uses models hosted on Replicate and Hugging Face (the gated pyannote models require a Hugging Face access token).
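
If you would rather not run the models yourself, the pipeline can also be invoked through Replicate's hosted API. The sketch below uses the official replicate Python client (set REPLICATE_API_TOKEN in your environment); the model slug and input keys are hypothetical, so substitute a real whisperX model and its schema from replicate.com:

    import replicate

    # Hypothetical model slug for illustration; look up an actual whisperX
    # model and version hash on replicate.com, and match its input schema.
    output = replicate.run(
        "some-user/whisperx:VERSION_HASH",
        input={
            "audio_file": "https://example.com/audio.mp3",  # URL to your audio
            "diarization": True,                            # hypothetical flag
        },
    )
    print(output)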