Selecting the best AI model for transcription
Discusses which models to choose for your audio/video files
CCV AI Services provides several AI models for transcription, each with different strengths. This page explains which model is likely to work best for you based on the type of audio or video files you have.
TL;DR:
Below is a general rule of thumb:
Select Gemini when:
I have short audio/video files below 40 minutes in duration
My audio/video files contain normal conversational content without much background noise
I care about speaker alignment
I want results fast
I don't mind if the timestamps are slightly inaccurate
Note
Although Gemini models technically support audio input of over 9 hours thanks to their long context window, we do NOT recommend them for files longer than 1 hour.
Select the OpenAI Whisper model when:
I have long audio/video files over 40 minutes
I want results fast
I want more accurate timestamps
I want transcription to be done without a 3rd party API
Select the Azure Speech-to-Text model when:
I want the most consistent results in accuracy and speaker alignment
My files have lots of background noise in them
I do not mind waiting for a while to get my results back
I do not mind paying something to use the model (it is free for now)
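The rule of thumb above can also be written as a small decision function. This is only an illustrative sketch, not part of CCV AI Services: the names `AudioJob` and `suggest_model` are hypothetical, the order in which the checks run is an assumption, and a file of exactly 40 minutes (which the guidance above does not cover) is arbitrarily treated as short.

```python
from dataclasses import dataclass


@dataclass
class AudioJob:
    """Hypothetical description of one transcription request."""
    duration_minutes: float
    noisy: bool = False                      # lots of background noise
    needs_consistency: bool = False          # most consistent accuracy and speaker alignment
    can_wait: bool = False                   # fine with a slower turnaround
    needs_accurate_timestamps: bool = False
    avoid_third_party_api: bool = False


def suggest_model(job: AudioJob) -> str:
    """Return a model name following the rule of thumb on this page.

    Assumption: Azure is only suggested when the user is willing to wait,
    because the page describes it as the slower option.
    """
    if (job.noisy or job.needs_consistency) and job.can_wait:
        return "Azure Speech-to-Text"
    # Long files, accurate timestamps, or no third-party API point to Whisper.
    if (
        job.duration_minutes > 40
        or job.needs_accurate_timestamps
        or job.avoid_third_party_api
    ):
        return "OpenAI Whisper"
    # Short, reasonably clean conversational audio where speed matters.
    return "Gemini"


if __name__ == "__main__":
    print(suggest_model(AudioJob(duration_minutes=25)))
    print(suggest_model(AudioJob(duration_minutes=90)))
    print(suggest_model(AudioJob(duration_minutes=25, noisy=True, can_wait=True)))
```

Real files often match more than one list, so treat the result as a starting point and compare models on a short sample when in doubt.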