Comparing Speech-to-text Models

This page covers the technical details of the speech-to-text models behind the Transcribe service to help users choose the model that best fits their use case.

Overview

The CCV AI Transcribe service uses state-of-the-art speech-to-text and voice activity detection (VAD) models to provide high-quality and fast transcriptions. Currently, we offer a proprietary model, Google Gemini, and an open-source model, OpenAI Whisper, for users to choose from. We are continually adding high-performance transcription models as they become available.

Below is a quick comparison of the two models. Please continue reading for more technical details of each model.

Model                                | Google Gemini | OpenAI Whisper
Word error rate (for English)*       | Unpublished** | 5.7%*
Diarization quality                  | Best          | Better
Multiple language support            | Yes           | Yes
Open source                          | No            | Yes
Runs on Brown-managed infrastructure | No            | Yes
Speed                                | < 5 min/audio hr; multiple audio files uploaded in the same job are transcribed simultaneously | < 5 min/audio hr
Recommendation                       | Better for short audio files; best-in-class transcription and diarization results | Better for long audio files; fast transcription with decent diarization results

* Benchmarked by PicoVoice using the methodology described on this page. WER will vary depending on the corpus the models are tested on.

** Google does not publish audio transcription metrics, but our initial qualitative analysis suggests that Gemini performs better in audio transcription than the other models.
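For context, word error rate is the proportion of word-level substitutions, deletions, and insertions needed to turn a model's output into the reference transcript. The sketch below shows how such a comparison is typically computed with the open-source jiwer package; it is purely illustrative and not part of the Transcribe service, and the sample sentences are made up.

```python
# Minimal WER illustration using the open-source jiwer package.
# Not part of the Transcribe service; it only shows how the metric
# in the comparison table above is typically computed.
from jiwer import wer

reference = "the quick brown fox jumps over the lazy dog"
hypothesis = "the quick brown fox jumped over a lazy dog"  # two substitutions

# WER = (substitutions + deletions + insertions) / words in the reference
error_rate = wer(reference, hypothesis)
print(f"WER: {error_rate:.1%}")  # 2 errors / 9 reference words ≈ 22.2%
```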

The Google Gemini model

Gemini is Google's flagship multimodal large language model and offers state-of-the-art performance for audio transcription.

At the moment, the Gemini model offers best-in-class performance in transcription accuracy, diarization accuracy, and transcription speed, and it is affordable enough that we can offer it for free. It is therefore our recommended model to try first when you use the Transcribe module.
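If you are curious what a Gemini transcription request looks like outside the Transcribe interface, the sketch below uses Google's google-generativeai Python SDK. The model name, prompt wording, and file name are illustrative assumptions, not the exact configuration the Transcribe service uses.

```python
# Hypothetical sketch of transcribing audio with Gemini via Google's
# google-generativeai SDK. Model name, prompt, and file name are
# assumptions, not the Transcribe service's actual configuration.
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")  # assumes you have your own API key

# Upload the audio file through the Gemini Files API
audio_file = genai.upload_file("interview.mp3")

model = genai.GenerativeModel("gemini-1.5-pro")
response = model.generate_content([
    "Transcribe this audio and label each speaker (Speaker 1, Speaker 2, ...).",
    audio_file,
])
print(response.text)
```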

Languages supported by the Gemini model:

  • 🇸🇦 Arabic

  • 🇨🇳 Chinese (Mandarin)

  • 🇳🇱 Dutch

  • 🇪🇸 Spanish

  • 🇮🇹 Italian

  • 🇩🇪 German

  • 🇷🇺 Russian

  • 🇵🇹 Portuguese

  • 🇬🇧 English

  • 🇯🇵 Japanese

  • 🇫🇷 French

  • 🇻🇳 Vietnamese

While the Whisper model supports Korean, Gemini does not support Korean at the moment.

The OpenAI Whisper model

Under the hood, we use the WhisperX implementation to handle transcription tasks. WhisperX first performs voice activity detection (VAD) on the audio file and chunks the audio into smaller speech segments before sending the segments to the Whisper model. In our experience, WhisperX significantly reduces model hallucination and improves accuracy on most transcription tasks.
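A rough sketch of that pipeline, based on the public WhisperX API, looks like the following. The device, model size, batch size, and file name are illustrative choices rather than the Transcribe service's exact settings.

```python
# Illustrative WhisperX pipeline: VAD-based chunking + batched Whisper
# inference. Device, model size, and batch size are assumptions, not the
# exact settings used by the Transcribe service.
import whisperx

device = "cuda"
audio = whisperx.load_audio("interview.mp3")

# load_model wraps Whisper with VAD-based segmentation under the hood
model = whisperx.load_model("large-v3", device, compute_type="float16")

# Transcribe the VAD-derived speech segments in batches
result = model.transcribe(audio, batch_size=16)

for segment in result["segments"]:
    print(f'[{segment["start"]:.1f}s - {segment["end"]:.1f}s] {segment["text"]}')
```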

OpenAI Whisper is a state-of-the-art open-source speech-to-text model first released by OpenAI in late 2022. Since its release, it has been one of the top open-source models for automatic speech recognition (ASR) tasks. The Whisper large-v3 model that the Transcribe service uses was released in September 2023. According to its model card, "[t]he models are trained on 680,000 hours of audio and the corresponding transcripts collected from the internet. 65% of this data (or 438,000 hours) represents English-language audio and matched English transcripts, roughly 18% (or 126,000 hours) represents non-English audio and English transcripts, while the final 17% (or 117,000 hours) represents non-English audio and the corresponding transcript. This non-English data represents 98 different languages." Thus, the model supports almost 100 languages. However, since different amounts of training data were available for different languages, the CCV AI Transcribe service offers options for the following languages:

  • 🇸🇦 Arabic

  • 🇨🇳 Chinese (Mandarin)

  • 🇳🇱 Dutch

  • 🇪🇸 Spanish

  • 🇮🇹 Italian

  • 🇩🇪 German

  • 🇷🇺 Russian

  • 🇵🇹 Portuguese

  • 🇬🇧 English

  • 🇯🇵 Japanese

  • 🇫🇷 French

  • 🇻🇳 Vietnamese

All transcription jobs using the OpenAI Whisper model run on a GPU in a Google Cloud Run container. No calls to third-party APIs happen in this process, so users can be assured that their data does not leave Brown-managed infrastructure.

When the OpenAI Whisper model is selected, speaker diarization (recognizing and tracking different speakers) is performed by pyannote.audio, an open-source model specializing in speaker diarization. Because OpenAI Whisper handles audio transcription only and does not support speaker diarization, both models are run together on Brown-managed services. Although pyannote.audio is one of the best open-source speaker diarization models available, it still trails behind commercial alternatives. Therefore, if the accuracy of speaker diarization is a priority and/or the audio includes many speakers talking over each other, please choose the Google Gemini model for better performance on those tasks.
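A simplified, self-contained sketch of how the diarization step can be combined with the Whisper output is shown below. It is based on the public WhisperX and pyannote.audio APIs rather than the Transcribe service's actual code; the Hugging Face token, model size, and other settings are assumptions.

```python
# Illustrative speaker diarization with pyannote.audio combined with
# WhisperX word alignment. Token, model size, and settings are assumptions,
# not the Transcribe service's exact configuration.
import whisperx

device = "cuda"
audio = whisperx.load_audio("interview.mp3")

model = whisperx.load_model("large-v3", device, compute_type="float16")
result = model.transcribe(audio, batch_size=16)

# Align words to get precise timestamps before assigning speakers
align_model, metadata = whisperx.load_align_model(
    language_code=result["language"], device=device
)
result = whisperx.align(result["segments"], align_model, metadata, audio, device)

# pyannote.audio-based diarization (newer WhisperX versions expose this
# under whisperx.diarize.DiarizationPipeline); requires a Hugging Face token
diarize_model = whisperx.DiarizationPipeline(use_auth_token="YOUR_HF_TOKEN", device=device)
diarize_segments = diarize_model(audio)

# Merge the speaker labels into the transcript segments
result = whisperx.assign_word_speakers(diarize_segments, result)
for segment in result["segments"]:
    print(segment.get("speaker", "UNKNOWN"), segment["text"])
```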
