# Selecting the Best AI Model for Transcription

Transcribe provides [multiple AI models](/ai-tools/services/transcribe/comparing-speech-to-text-models.md) for transcription, each with different strengths. This page helps you choose the model that works best for the type of audio or video files you have.

## Recommendations by Specific Requirements:

If you are unsure, we recommend the **OpenAI Whisper** model, which offers the most robust and consistent performance and the widest range of features. If you have specific needs, see our recommended model for each requirement below:

<table><thead><tr><th width="536.9453125">Requirement</th><th width="201.765625">Recommendation</th></tr></thead><tbody><tr><td><strong>Verbatim transcription</strong> (preserve disfluencies such as filler words, stutters, and repetitions)</td><td>Qwen</td></tr><tr><td><strong>Translation</strong> (transcribe audio into text in a different language)</td><td>Gemini</td></tr><tr><td><strong>Chinese/Cantonese</strong> transcription</td><td>Qwen</td></tr><tr><td><strong>Singing voices</strong></td><td>Qwen</td></tr><tr><td><strong>Captions/Subtitles</strong> (SubRip (.srt) or WebVTT (.vtt) format)</td><td>Whisper, Qwen, or Cohere</td></tr><tr><td><strong>No third-party processing</strong> (audio/video media is not processed by a third party such as Google)</td><td>Whisper, Qwen, or Cohere</td></tr></tbody></table>
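
When a file has more than one requirement, the table above can be read as a lookup whose rows are intersected. The following is a minimal Python sketch of that reading, not part of Transcribe itself: the requirement keys (`verbatim`, `no_third_party`, and so on) and the `recommend` helper are hypothetical names chosen for illustration, while the model names come from the table.

```python
# Hypothetical requirement keys; the recommended models are copied from the table above.
RECOMMENDATIONS = {
    "verbatim": ["Qwen"],
    "translation": ["Gemini"],
    "chinese_cantonese": ["Qwen"],
    "singing": ["Qwen"],
    "captions": ["Whisper", "Qwen", "Cohere"],
    "no_third_party": ["Whisper", "Qwen", "Cohere"],
}

# The page suggests OpenAI Whisper when there is no specific requirement.
DEFAULT_MODEL = "Whisper"


def recommend(requirements):
    """Return the models recommended for every listed requirement.

    An empty result means the requirements conflict according to the table.
    """
    if not requirements:
        return [DEFAULT_MODEL]
    unknown = [r for r in requirements if r not in RECOMMENDATIONS]
    if unknown:
        raise KeyError(f"unknown requirement(s): {unknown}")
    # Start from the first requirement's models and keep only those
    # that also appear for each remaining requirement.
    candidates = list(RECOMMENDATIONS[requirements[0]])
    for req in requirements[1:]:
        candidates = [m for m in candidates if m in RECOMMENDATIONS[req]]
    return candidates


if __name__ == "__main__":
    print(recommend(["verbatim", "captions"]))
    print(recommend(["translation", "no_third_party"]))
```

In this sketch, combining `translation` with `no_third_party` yields an empty list, which reflects that the table recommends only Gemini for translation while listing only Whisper, Qwen, and Cohere for avoiding third-party processing.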

## Recommendations by Model:

### Select **Gemini** for

* short audio/video files below approximately 30 minutes in duration
* conversational audio/video files without much background noise or long silent segments, both of which tend to cause the model to hallucinate
* better speaker diarization
* better speed
* accurate transcription text when accurate timestamps are not required
* translation

{% hint style="warning" %}
Note

Although Gemini models technically support audio input of over 9 hours thanks to their long context window, we do NOT recommend them for files longer than 1 hour.
{% endhint %}
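
The duration guidance for Gemini can be summarized as a small check. This is an illustrative sketch only: `gemini_fit` is a hypothetical helper, the 30-minute boundary is approximate on this page, and treating the 30 to 60 minute range as "prefer Whisper or Cohere" is our reading of the lists on this page rather than a stated rule.

```python
def gemini_fit(duration_minutes: float) -> str:
    """Classify how well a file's duration fits the Gemini guidance on this page."""
    if duration_minutes < 0:
        raise ValueError("duration must be non-negative")
    if duration_minutes < 30:
        # Short files (below approximately 30 minutes) are where Gemini is suggested.
        return "recommended"
    if duration_minutes <= 60:
        # Assumption: files over 30 minutes are listed under Whisper/Cohere.
        return "prefer Whisper or Cohere"
    # The warning above advises against Gemini for files longer than 1 hour.
    return "not recommended"


if __name__ == "__main__":
    for minutes in (12, 45, 95):
        print(minutes, gemini_fit(minutes))
```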

### Select the **OpenAI Whisper** or **Cohere Transcribe** model for

* long audio/video files over 30 minutes
* more accurate timestamps
* word-level timestamps
* private transcription without a third-party API
* captions/subtitles in `.srt` or `.vtt`

### Select the **Qwen3-ASR** model for

* verbatim transcription that preserves disfluencies and filler words (such as "you know" and "like")
* more accurate timestamps
* word-level timestamps
* private transcription without a third-party API
* captions/subtitles in `.srt` or `.vtt`
* speed at least 1.5 times faster than Whisper
* better performance on audio with noisy backgrounds
* better performance with singing voices
* better performance with Chinese/Cantonese dialects

