Speech-to-text Models
This page covers the technical details of the speech-to-text models used by the Transcribe service, to help users choose the model that best fits their use case.
Overview
The CCV AI Transcribe service uses state-of-the-art speech-to-text and voice activity detection (VAD) models to provide high-quality and fast transcriptions. Currently, we offer a proprietary speech-to-text model from Azure AI Speech and an open-source OpenAI Whisper model for users to choose from. We are continually adding high-performance transcription models as they become available.
Below is a quick comparison of the three models. Please read on for more technical details.
| | Google Gemini | OpenAI Whisper | Azure AI Speech |
| --- | --- | --- | --- |
| Rate | Free | Free | Free during pilot; $1.5/audio hr afterwards |
| Word error rate (for English)* | Unpublished** | 5.7% | 5.6% |
| Diarization quality | Best | Better | Better |
| Multiple language support | Yes | Yes | Yes, with better support for regional dialects |
| Open source | No | Yes | No |
| Runs on Brown-managed infrastructure | No | Yes | No |
| Speed | <5 min/audio hr; multiple audio files uploaded in the same job are transcribed simultaneously | <5 min/audio hr | 50-55 min/audio hr |
| Recommendation | Better for short audio files and best-in-class transcription and diarization results | Better for long audio files needing fast transcription with decent diarization results | Better if consistency is a priority and/or the speech includes regional dialects |
* Evaluation performed by Picovoice using the methodology described on this page. WER varies with the corpus the models are tested on.
** Google does not publish audio transcription metrics, but our initial qualitative analysis suggests that Gemini performs better at audio transcription than the other two models.
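For reference, word error rate is the word-level edit distance (substitutions + deletions + insertions) between a model's transcript and a reference transcript, divided by the number of reference words. A minimal sketch of the metric:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # Levenshtein distance over words via dynamic programming.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost)  # substitution or match
    return d[len(ref)][len(hyp)] / len(ref)

print(wer("the quick brown fox", "the quick brown box"))  # 0.25: one error in four words
```

A 5.7% WER thus means roughly one word in eighteen is transcribed incorrectly relative to the reference.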
The Google Gemini model
Gemini is Google's flagship multimodal large language model and offers state-of-the-art performance for audio transcription.
CCV AI Services use the Gemini API through Vertex AI on Google Cloud, which can handle large video/audio files; this is different from the Google Gemini web interface offered directly by Google and available to all Brown community members. Currently, the service can only handle data at Risk Level 2 and below.
At the moment, the Gemini model offers best-in-class transcription accuracy, diarization accuracy, and transcription speed, and it is affordable enough that we can offer it for free. It is therefore our recommended model to try first when you use the Transcribe module.
The OpenAI Whisper model
OpenAI Whisper is a state-of-the-art open-source speech-to-text model first released by OpenAI in late 2022. Since its release, it has been one of the top open-source models for automatic speech recognition (ASR) tasks. The Whisper-large-v3 model that the Transcribe service uses was released in September 2023. According to its model card, "[t]he models are trained on 680,000 hours of audio and the corresponding transcripts collected from the internet. 65% of this data (or 438,000 hours) represents English-language audio and matched English transcripts, roughly 18% (or 126,000 hours) represents non-English audio and English transcripts, while the final 17% (or 117,000 hours) represents non-English audio and the corresponding transcript. This non-English data represents 98 different languages." Thus, the model supports almost 100 languages. However, since different amounts of training data were available for different languages, the CCV AI Transcribe service only offers the languages for which Whisper has a word error rate (WER) below 13%:
🇳🇱 Dutch
🇪🇸 Spanish
🇰🇷 Korean
🇮🇹 Italian
🇩🇪 German
🇹🇭 Thai
🇷🇺 Russian
🇵🇹 Portuguese
🇵🇱 Polish
🇮🇩 Indonesian
🇸🇪 Swedish
🇨🇿 Czech
🇬🇧 English
🇯🇵 Japanese
🇫🇷 French
🇷🇴 Romanian
🇭🇰 Cantonese
🇹🇷 Turkish
🇨🇳 Chinese
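The cutoff above amounts to a simple filter over per-language WER benchmarks. A minimal sketch (the WER numbers below are placeholders, not Whisper's actual benchmark values):

```python
# Hypothetical per-language WER benchmarks, as fractions (illustrative only).
WHISPER_WER = {
    "Spanish": 0.03,
    "English": 0.04,
    "Thai": 0.11,
    "Welsh": 0.30,  # above the cutoff: excluded from the service
}

def supported_languages(wer_by_language: dict, cutoff: float = 0.13) -> list:
    """Return the languages whose WER is below the cutoff, best first."""
    eligible = {lang: w for lang, w in wer_by_language.items() if w < cutoff}
    return sorted(eligible, key=eligible.get)

print(supported_languages(WHISPER_WER))  # ['Spanish', 'English', 'Thai']
```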
All transcription jobs using the OpenAI Whisper model run on GPUs in a Google Cloud Run container. No calls to a third-party API happen in this process, so users are assured that their data does not leave Brown-managed infrastructure.
When the OpenAI Whisper model is selected, speaker diarization (recognizing and tracking different speakers) is performed by pyannote.audio, an open-source model specializing in speaker diarization. Because OpenAI Whisper only transcribes audio and does not support speaker diarization, the two models are run together on Brown-managed services. Although pyannote.audio is one of the best open-source speaker diarization models available, it still trails behind commercial alternatives. Therefore, if speaker diarization accuracy is a priority and/or the audio includes many speakers talking over each other, please choose the Microsoft Azure model for better performance on those tasks.
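Combining the two models works roughly as follows: Whisper produces time-stamped text segments, pyannote.audio produces time-stamped speaker turns, and the two outputs are merged by time overlap. A simplified sketch of that merge step (the segment data and the maximum-overlap rule are illustrative, not the service's exact pipeline):

```python
def assign_speakers(transcript_segments, speaker_turns):
    """Label each transcript segment with the speaker whose turn overlaps it most.

    transcript_segments: list of (start, end, text) tuples from the ASR model.
    speaker_turns: list of (start, end, speaker) tuples from the diarization model.
    """
    labeled = []
    for seg_start, seg_end, text in transcript_segments:
        best_speaker, best_overlap = "unknown", 0.0
        for turn_start, turn_end, speaker in speaker_turns:
            # Length of the intersection of the two time intervals.
            overlap = min(seg_end, turn_end) - max(seg_start, turn_start)
            if overlap > best_overlap:
                best_speaker, best_overlap = speaker, overlap
        labeled.append((best_speaker, text))
    return labeled

segments = [(0.0, 2.0, "Hello there."), (2.1, 4.0, "Hi, how are you?")]
turns = [(0.0, 2.0, "SPEAKER_00"), (2.0, 4.5, "SPEAKER_01")]
print(assign_speakers(segments, turns))
# [('SPEAKER_00', 'Hello there.'), ('SPEAKER_01', 'Hi, how are you?')]
```

When speakers talk over each other, segments overlap multiple turns at once, which is exactly where this kind of merge (and open-source diarization generally) loses accuracy.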
The Azure AI Speech-to-Text model
Azure AI Speech-to-Text is a proprietary transcription service provided by Microsoft Azure. As a proprietary service, its technical details are not published.
Empirically, Azure AI Speech-to-Text offers performance comparable to, if not slightly better than, the open-source Whisper model. The model transcribes at roughly a real-time rate, meaning an hour of speech takes the model about an hour to transcribe. If speed is the main concern, this gives an edge to the Whisper model, which can transcribe and diarize an hour of speech within 5 minutes. However, as a proprietary product offered by a major cloud service provider, the Azure model is more reliable, especially in speaker diarization performance.
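The speed trade-off can be summarized with the per-model processing rates from the comparison table (minutes of processing per hour of audio; the helper itself is illustrative):

```python
# Approximate upper-bound processing time, in minutes per hour of audio.
MINUTES_PER_AUDIO_HOUR = {
    "gemini": 5,    # files in the same job are transcribed simultaneously
    "whisper": 5,
    "azure": 55,    # roughly real-time (50-55 min/audio hr)
}

def estimated_turnaround(model: str, audio_hours: float) -> float:
    """Rough upper-bound turnaround in minutes for a single audio file."""
    return MINUTES_PER_AUDIO_HOUR[model] * audio_hours

print(estimated_turnaround("whisper", 3.0))  # 15.0 minutes for 3 hours of audio
print(estimated_turnaround("azure", 3.0))    # 165.0 minutes
```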
The Azure model not only supports over 100 languages but also supports regional dialects. We provide support for the following languages, for which Azure offers the most complete feature set:
🇩🇰 Danish
🇩🇪 German
🇦🇺 English (Australia)
🇨🇦 English (Canada)
🇬🇧 English (United Kingdom)
🇭🇰 English (Hong Kong SAR)
🇮🇪 English (Ireland)
🇮🇳 English (India)
🇳🇬 English (Nigeria)
🇳🇿 English (New Zealand)
🇵🇭 English (Philippines)
🇸🇬 English (Singapore)
🇺🇸 English (United States)
🇪🇸 Spanish (Spain)
🇲🇽 Spanish (Mexico)
🇫🇮 Finnish
🇨🇦 French (Canada)
🇫🇷 French (France)
🇮🇳 Hindi
🇮🇹 Italian
🇯🇵 Japanese
🇰🇷 Korean
🇳🇴 Norwegian Bokmål
🇳🇱 Dutch
🇵🇱 Polish
🇧🇷 Portuguese (Brazil)
🇵🇹 Portuguese (Portugal)
🇸🇪 Swedish
🇹🇷 Turkish
🇨🇳 Chinese (Mandarin, Simplified)
🇭🇰 Chinese (Cantonese, Traditional)