Improving accessibility of audio/video media with Transcribe
This page provides guidance and recommendations on using Transcribe to support creating accessible audio/video media.
While Transcribe is helpful in enhancing audio/video media by automating speech recognition with AI, it currently still CANNOT generate captions/subtitles fully compliant with the WCAG 2.1 Level AA standard. This page provides tips on optimal workflows and the necessary next steps toward compliance. Substantial work is still required after automatic speech recognition.
To determine whether your captions/subtitles are fully compliant with Brown University's Digital Accessibility requirements, please visit the University's digital accessibility page for more information.
Brown University Digital Accessibility requirements mandate that all digital content created by and shared with the Brown community is compliant with the WCAG 2.1 Level AA standard. Using a speech-recognition tool to automatically transcribe speech can be a helpful first step towards making audio/video media accessible. Starting in February 2026, Transcribe is introducing many new features for creating time-aligned audio/video captions/subtitles that can help achieve this goal. This page provides information on how to use Transcribe to create time-aligned captions/subtitles for audio and video files.
Below is a video that demonstrates the quality of the captions generated by Transcribe with the OpenAI Whisper model. Please note that the captions are NOT fully compliant with accessibility standards. Please visit NPR's YouTube channel if you would like to watch the original video.
Captions vs. Subtitles
While the terms "captions" and "subtitles" are often used interchangeably, they do not provide the same information and are designed for different purposes. Subtitles provide a text version of the dialogue only, often in a different language. Essentially, subtitles assume the audience can hear the audio but also needs the dialogue provided as text. Therefore, "captions" are generally what is used for accessibility purposes.
Transcribe adopts the term "captions" only because it can currently only transcribe speech; it cannot perform translation. However, it cannot provide all the information required in captions either, because the speech recognition AI models it uses focus on speech and usually cannot include non-speech information in the output (even though some newer models, such as Gemini, are demonstrating emerging capabilities for capturing non-speech information). We recommend that users add such information themselves.
Does my media need captions?
WCAG 2.1 Level AA requires synchronized, accurate closed captions for all pre-recorded audio in multimedia and live captions for live, time-based media. Since Transcribe currently handles only offline transcription, we assume that you have pre-recorded audio/video files that can be uploaded to Transcribe; live transcription of meetings or events is not supported at the moment.
Examples of videos that need captioning include:
Videos promoting your program or attracting students, participants, and alumni
Videos showcasing curriculum, research, exhibitions, or collections
Videos profiling students, faculty, or researchers
Videos providing instructions for how to apply or register for programs
Videos listing news stories about your department or program
For pre-recorded audio content, a transcript must be provided. A transcript contains the same word-for-word content as captions but is presented in a separate file. It provides a text alternative to the audio presentation and is not synchronized with the audio timeline. A transcript should contain relevant speaker information so that readers can tell who is saying what.
Source: Video Captions and Audio Description page in OIT Knowledge Base
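If you already have a caption file for the same recording (see the next section for how to download one from Transcribe), a transcript can be derived from it. Below is a minimal sketch, assuming an enhanced SRT file; the file names are placeholders.

```python
# Minimal sketch: convert an SRT caption file into a plain-text transcript.
# Assumes an "enhanced SRT" file downloaded from Transcribe (see the next
# section); the file names below are placeholders.
def srt_to_transcript(srt_path: str, txt_path: str) -> None:
    with open(srt_path, encoding="utf-8") as f:
        blocks = f.read().strip().split("\n\n")

    lines = []
    for block in blocks:
        rows = block.splitlines()
        # Keep only the text rows: skip the cue number and the timestamp row.
        text_rows = [r for r in rows
                     if not r.strip().isdigit() and "-->" not in r]
        if text_rows:
            lines.append(" ".join(r.strip() for r in text_rows))

    with open(txt_path, "w", encoding="utf-8") as f:
        f.write("\n".join(lines) + "\n")

srt_to_transcript("interview.srt", "interview_transcript.txt")
```

You would still need to merge consecutive lines from the same speaker and proofread the result before publishing it as the transcript.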
Obtaining captions from Transcribe
To create captions or transcripts, create a job in Transcribe by uploading your audio/video files. If you need time-accurate captions, you must choose the OpenAI Whisper model, the Qwen3-ASR model, or the Cohere Transcribe model. Check the speech-to-text model comparison page for differences between the models.
Once the transcription job finishes, go to the transcription for each file and open the download transcription dialog. There, you have the option to download the transcript as an enhanced .srt file.

Once you choose the "Enhanced SRT captions (.srt)" format, you can:
Choose the maximum number of characters per line and the maximum number of lines per caption.
Choose whether to include the names of speakers in each caption.
We recommend that, for captions in English, each line contain no more than 32 characters and each caption contain no more than 2 lines. For captions in other languages, please follow the guidelines for those languages.
We also recommend checking the "Include speaker names" option. By default, Transcribe performs speaker diarization (distinguishing different speakers) and assigns each speaker a default name such as "Speaker 1". Please check the Viewing/Editing Transcriptions page to see how to edit speaker names.
Speaker diarization is a difficult problem, and even the best speaker diarization systems regularly make mistakes. Please carefully check your generated SRT file for errors.
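If you want a quick automated check of the formatting recommendations above before proofreading, a small script along the following lines can flag cues that exceed the suggested limits. This is only a sketch; the file name is a placeholder, and you should adjust the limits for your language and style guide.

```python
# Minimal sketch: flag caption cues that exceed the recommended limits
# (at most 2 lines per caption, at most 32 characters per line for English).
# The file name is a placeholder. Note that speaker name prefixes such as
# "Speaker 1:" count toward the character limit in this naive check.
MAX_LINES = 2
MAX_CHARS = 32

with open("interview.srt", encoding="utf-8") as f:
    blocks = f.read().strip().split("\n\n")

for block in blocks:
    rows = block.splitlines()
    cue_id = rows[0].strip() if rows else "?"
    # Text rows are everything other than the cue number and timestamp rows.
    text_rows = [r for r in rows if not r.strip().isdigit() and "-->" not in r]
    if len(text_rows) > MAX_LINES:
        print(f"Cue {cue_id}: {len(text_rows)} lines (max {MAX_LINES})")
    for r in text_rows:
        if len(r) > MAX_CHARS:
            print(f"Cue {cue_id}: line has {len(r)} characters (max {MAX_CHARS})")
```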
Power users who prefer to create captions themselves can instead choose the "Word-level timestamps (.json)" option to obtain a JSON file with raw word-level timestamps for use in their own workflows.
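As one illustration of such a workflow, the sketch below groups word-level timestamps into simple SRT cues. The JSON schema shown here (a list of entries with "word", "start", and "end" fields, with times in seconds) is an assumption for illustration and may not match the actual file that Transcribe produces; adapt the field names and grouping logic to your data.

```python
# Minimal sketch: build SRT cues from word-level timestamps.
# The JSON schema (a list of {"word", "start", "end"} entries, times in
# seconds) and the file names are assumptions for illustration only.
import json

def fmt(seconds: float) -> str:
    """Format seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    ms = int(round(seconds * 1000))
    h, rem = divmod(ms, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1000)
    return f"{h:02}:{m:02}:{s:02},{ms:03}"

with open("interview_words.json", encoding="utf-8") as f:
    words = json.load(f)

cues = []
for i in range(0, len(words), 7):          # roughly 7 words per cue
    chunk = words[i:i + 7]
    text = " ".join(w["word"] for w in chunk)
    start, end = chunk[0]["start"], chunk[-1]["end"]
    cues.append(f"{len(cues) + 1}\n{fmt(start)} --> {fmt(end)}\n{text}\n")

with open("interview.srt", "w", encoding="utf-8") as f:
    f.write("\n".join(cues))
```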
Next steps
The captions obtained from Transcribe are, as-is, far from meeting accessibility standards. Transcribing audio/video files with accurate timestamps is a time-consuming process, and Transcribe saves time through automation. However, more work is still needed to make the captions fully compliant.
Proofing
Although AI has become much more accurate in speech recognition, it still makes mistakes. Please carefully check the AI generated files and check potential errors in speech recognition, speaker assignment, and timestamps.
Adding descriptions for non-speech sounds
Transcribe focuses on speech and does not include descriptions of non-speech sounds (for example, music, laughter, or applause). These need to be added manually.
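For example, a non-speech sound can be described in its own cue (or appended to a nearby cue) using bracketed text. The cue numbers, timestamps, and wording below are purely illustrative:

```
12
00:01:04,200 --> 00:01:06,000
[audience applauds]

13
00:01:06,000 --> 00:01:08,500
Speaker 1: Thank you all for coming.
```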
Embedding captions in videos
The SRT files that Transcribe provides can be embedded into the video files that you share.
If you are sharing your videos on YouTube, please follow YouTube's guide on adding caption tracks to your videos.
If you are sharing your videos via Google Drive, please follow Google Drive's guide on adding caption tracks to your videos.
Some video players also support playback with separate caption/subtitle files, and some can embed caption/subtitle files into the video. Please check their websites/user guides for help.
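As one example of embedding captions yourself, if you have ffmpeg installed, a call along these lines adds an SRT file to an MP4 video as a selectable (soft) caption track. The file names are placeholders, and other containers (for example MKV) use different subtitle codecs.

```python
# Minimal sketch: use ffmpeg (if installed) to add an SRT file to an MP4 video
# as a selectable caption track. File names are placeholders; other containers
# (e.g., MKV) use a different subtitle codec than mov_text.
import subprocess

subprocess.run(
    [
        "ffmpeg",
        "-i", "lecture.mp4",          # original video
        "-i", "lecture.srt",          # proofread captions from Transcribe
        "-c", "copy",                 # copy audio/video streams unchanged
        "-c:s", "mov_text",           # convert SRT to the MP4 subtitle format
        "-metadata:s:s:0", "language=eng",
        "lecture_captioned.mp4",
    ],
    check=True,
)
```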