Speech-to-text Models

This page covers the technical details of the speech-to-text models used by the Transcribe service, to help users choose the right model for their use case.


Overview

The CCV AI Transcribe service uses state-of-the-art speech-to-text and voice activity detection (VAD) models to provide high-quality and fast transcriptions. Currently, we offer a proprietary speech-to-text model from Azure AI Speech and an open-source OpenAI Whisper model for users to choose from. We are continually adding high-performance transcription models as they become available.

Below is a quick comparison between the two models. Please continue reading for more technical details about each model.

| Model | OpenAI Whisper-large-v3 | Azure AI Speech-to-Text |
| --- | --- | --- |
| Rate | Free | Free during pilot; $1.50/audio hr afterwards |
| Word error rate (English)* | 5.7% | 5.6% |
| Diarization quality | Good | Better |
| Multiple language support | Yes | Yes, with better support for regional dialects |
| Open source | Yes | No |
| Runs on Brown-managed infrastructure | Yes | No |
| Speed | < 5 min/audio hr | 50-55 min/audio hr |
| Recommendation | Better for long audio files with less over-talking | Better if diarization quality is a priority and/or the speech includes regional dialects |

* Word error rates measured by PicoVoice using their described methodology. WER will vary based on the corpus on which the models are tested.
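
To make the rate and speed rows concrete, here is a quick back-of-the-envelope calculation in Python using the figures from the table above; the 10-hour batch size is a made-up example.

```python
# Rough cost and turnaround estimates for a batch of recordings, using the
# figures from the comparison table above. Numbers are illustrative only.

AUDIO_HOURS = 10  # total length of your recordings (example value)

# Whisper: free, transcribes an hour of audio in under ~5 minutes.
whisper_minutes = AUDIO_HOURS * 5
whisper_cost = 0.0

# Azure: $1.50 per audio hour after the pilot; roughly real-time speed.
azure_minutes = AUDIO_HOURS * 55  # upper end of the 50-55 min/audio hr range
azure_cost = AUDIO_HOURS * 1.50

print(f"Whisper: ~{whisper_minutes} min, ${whisper_cost:.2f}")
print(f"Azure:   ~{azure_minutes} min, ${azure_cost:.2f}")
```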

The OpenAI Whisper model

OpenAI Whisper is a state-of-the-art open-source speech-to-text model first released by OpenAI in late 2022. Since its release, it has been one of the top open-source models for automated speech recognition (ASR) tasks. The Whisper-large-v3 model that the Transcribe service uses was released in November 2023. According to its model card, "[t]he models are trained on 680,000 hours of audio and the corresponding transcripts collected from the internet. 65% of this data (or 438,000 hours) represents English-language audio and matched English transcripts, roughly 18% (or 126,000 hours) represents non-English audio and English transcripts, while the final 17% (or 117,000 hours) represents non-English audio and the corresponding transcript. This non-English data represents 98 different languages." Thus, the model supports almost 100 languages. However, since different amounts of training data were available for different languages, the CCV AI Transcribe service only offers options for languages for which Whisper has a word error rate (WER) lower than 13%:

  • 🇳🇱 Dutch
  • 🇪🇸 Spanish
  • 🇰🇷 Korean
  • 🇮🇹 Italian
  • 🇩🇪 German
  • 🇹🇭 Thai
  • 🇷🇺 Russian
  • 🇵🇹 Portuguese
  • 🇵🇱 Polish
  • 🇮🇩 Indonesian
  • 🇸🇪 Swedish
  • 🇨🇿 Czech
  • 🇬🇧 English
  • 🇯🇵 Japanese
  • 🇫🇷 French
  • 🇷🇴 Romanian
  • 🇭🇰 Cantonese
  • 🇹🇷 Turkish
  • 🇨🇳 Chinese
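
The Transcribe service runs Whisper for you, so no setup is needed. For readers curious about the underlying model, below is a minimal sketch of what direct use of the open-source whisper package looks like; the file name is a placeholder, and this is an illustration rather than the service's actual code.

```python
# Minimal local use of the open-source Whisper model (pip install openai-whisper).
import whisper

model = whisper.load_model("large-v3")  # same checkpoint the service uses

# Passing a language from the supported list skips auto-detection;
# omit the argument to let Whisper detect the language itself.
result = model.transcribe("interview.mp3", language="nl")

print(result["text"])            # full transcript
for seg in result["segments"]:   # timestamped segments
    print(f'[{seg["start"]:.1f}s - {seg["end"]:.1f}s] {seg["text"]}')
```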

All transcription jobs using the OpenAI Whisper model run on a GPU in a Google Cloud Run container. No calls to a third-party API happen in this process, so users can be assured that their data does not leave Brown-managed infrastructure.

The Azure AI Speech-to-Text model

Empirically, the Azure AI Speech-to-Text model offers performance comparable to the open-source Whisper model, if not slightly better. The model transcribes at roughly a real-time rate: an hour of speech takes the model about an hour to transcribe. If speed is the main concern, this gives an edge to the Whisper model, which can transcribe and diarize an hour of speech within five minutes. However, as a proprietary product offered by a major cloud provider, the Azure model is more reliable, especially in speaker diarization performance.

The Azure model not only supports over 100 languages, but also supports regional dialects. We provide support for the following languages, for which Azure offers the most complete feature set:

  • 🇩🇰 Danish
  • 🇩🇪 German
  • 🇦🇺 English (Australia)
  • 🇨🇦 English (Canada)
  • 🇬🇧 English (United Kingdom)
  • 🇭🇰 English (Hong Kong SAR)
  • 🇮🇪 English (Ireland)
  • 🇮🇳 English (India)
  • 🇳🇬 English (Nigeria)
  • 🇳🇿 English (New Zealand)
  • 🇵🇭 English (Philippines)
  • 🇸🇬 English (Singapore)
  • 🇺🇸 English (United States)
  • 🇪🇸 Spanish (Spain)
  • 🇲🇽 Spanish (Mexico)
  • 🇫🇮 Finnish
  • 🇨🇦 French (Canada)
  • 🇫🇷 French (France)
  • 🇮🇳 Hindi
  • 🇮🇹 Italian
  • 🇯🇵 Japanese
  • 🇰🇷 Korean
  • 🇳🇴 Norwegian Bokmål
  • 🇳🇱 Dutch
  • 🇵🇱 Polish
  • 🇧🇷 Portuguese (Brazil)
  • 🇵🇹 Portuguese (Portugal)
  • 🇸🇪 Swedish
  • 🇹🇷 Turkish
  • 🇨🇳 Chinese (Mandarin, Simplified)
  • 🇭🇰 Chinese (Cantonese, Traditional)
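
Again, the Transcribe service calls Azure on your behalf. Purely for illustration, a direct call with Microsoft's azure-cognitiveservices-speech Python SDK looks roughly like the sketch below; the key, region, and file name are placeholders.

```python
# What a direct call to Azure AI Speech looks like with the Python SDK
# (pip install azure-cognitiveservices-speech).
import azure.cognitiveservices.speech as speechsdk

speech_config = speechsdk.SpeechConfig(subscription="YOUR_KEY", region="YOUR_REGION")
# Locale codes select the regional dialect, e.g. "en-IN" vs "en-US".
speech_config.speech_recognition_language = "en-IN"

audio_config = speechsdk.audio.AudioConfig(filename="interview.wav")
recognizer = speechsdk.SpeechRecognizer(speech_config=speech_config,
                                        audio_config=audio_config)

# recognize_once() returns the first recognized utterance; long files
# would instead use continuous recognition or batch transcription.
result = recognizer.recognize_once()
print(result.text)
```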

When the OpenAI Whisper model is selected, speaker diarization (recognizing and tracking different speakers) is performed by another open-source model specializing in speaker diarization, pyannote.audio. Because OpenAI Whisper only transcribes audio and does not support speaker diarization, both models are run together over Brown-managed services. Although pyannote.audio is one of the best open-source speaker diarization models available, it still trails behind commercial alternatives. Therefore, if the accuracy of speaker diarization is a priority and/or the audio includes many speakers talking over each other, please choose the Microsoft Azure model for better performance on those tasks.
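
The sketch below illustrates the general idea of pairing Whisper with pyannote.audio: Whisper yields timestamped segments, pyannote.audio yields speaker turns, and each segment is assigned the speaker whose turns overlap it most. This is an assumption-laden illustration of the approach, not the service's actual pipeline code; the file name and Hugging Face token are placeholders.

```python
# Illustrative Whisper + pyannote.audio pairing (pip install openai-whisper
# pyannote.audio). Requires accepting the pyannote model's terms on Hugging Face.
import whisper
from pyannote.audio import Pipeline

asr = whisper.load_model("large-v3")
diarizer = Pipeline.from_pretrained("pyannote/speaker-diarization-3.1",
                                    use_auth_token="YOUR_HF_TOKEN")

segments = asr.transcribe("meeting.wav")["segments"]
diarization = diarizer("meeting.wav")

for seg in segments:
    # Pick the speaker whose diarized turns overlap this segment the most.
    overlap = {}
    for turn, _, speaker in diarization.itertracks(yield_label=True):
        start = max(seg["start"], turn.start)
        end = min(seg["end"], turn.end)
        if end > start:
            overlap[speaker] = overlap.get(speaker, 0.0) + (end - start)
    speaker = max(overlap, key=overlap.get) if overlap else "unknown"
    print(f'{speaker}: {seg["text"].strip()}')
```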

The Azure AI Speech-to-Text model is a proprietary transcription service provided by Microsoft Azure. As a proprietary service, its technical details are not published.
