Improving accessibility of audio/video media with Transcribe
This page provides guidance and recommendations on using Transcribe to support creating accessible audio/video media.
While Transcribe is helpful in enhancing audio/video media by automating speech recognition with AI, it currently still CANNOT generate captions/subtitles fully compliant with the WCAG 2.1 Level AA standard. This page provides tips on optimal workflows and the necessary next steps toward compliance. Substantial work is still required after automatic speech recognition.
To determine whether your captions/subtitles are fully compliant with Brown University's Digital Accessibility requirements, please visit the University's digital accessibility page for more information.
Brown University Digital Accessibility requirements mandate that all digital content created by and shared with the Brown community is compliant with the WCAG 2.1 Level AA standard. Using a speech-recognition tool to automatically transcribe speech can be a helpful first step towards making audio/video media accessible. Starting in February 2026, Transcribe is introducing many new features for creating time-aligned audio/video captions/subtitles that can help achieve this goal. This page provides information on how to use Transcribe to create time-aligned captions/subtitles for audio and video files.
Below is a video that demonstrates the quality of the captions generated by Transcribe with the OpenAI Whisper model. Please note that the captions are NOT fully compliant with accessibility standards. Please visit NPR's YouTube channel if you would like to watch the original video.
Captions vs. Subtitles
While the terms "captions" and "subtitles" are often used interchangeably, they do not provide the same information and are designed for different purposes. Subtitles provide a text version of the dialogue only, often in a different language. Essentially, subtitles assume the audience can hear the audio but also needs the dialogue provided as text. Therefore, "captions" are generally what is used for accessibility purposes.
Transcribe adopts the term "captions" only because it can currently only transcribe speech; it cannot perform translation. However, it cannot provide all the information required in captions either, because the speech recognition AI models it uses focus on speech and usually cannot include non-speech information in the output (even though some newer models, such as Gemini, are demonstrating emerging capabilities for capturing non-speech information). We recommend that users add such information themselves.
Does my media need captions?
WCAG 2.1 Level AA requires synchronized, accurate closed captions for all pre-recorded audio in multimedia and live captions for live, time-based media. Since Transcribe currently handles only offline transcription, we assume that you have pre-recorded audio/video files that can be uploaded to Transcribe; live transcription of meetings or events is not supported at the moment.
Examples of videos that need captioning include:
Videos promoting your program or attracting students, participants, and alumni
Videos showcasing curriculum, research, exhibitions, or collections
Videos profiling students, faculty, or researchers
Videos providing instructions for how to apply or register for programs
Videos listing news stories about your department or program
For pre-recorded audio content, a transcript must be provided. A transcript contains the same word-for-word content as captions but is presented in a separate file. It provides a text alternative to the audio presentation and is not synchronized with the audio timeline. A transcript should contain relevant speaker information so that readers can tell who is saying what.
Source: Video Captions and Audio Description page in OIT Knowledge Base
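If you already have a caption file for the same recording (see the next section for how to download one from Transcribe), a transcript can be derived from it. Below is a minimal sketch, assuming an enhanced SRT file; the file names are placeholders.

```python
# Minimal sketch: convert an SRT caption file into a plain-text transcript.
# Assumes an "enhanced SRT" file downloaded from Transcribe (see the next
# section); the file names below are placeholders.
def srt_to_transcript(srt_path: str, txt_path: str) -> None:
    with open(srt_path, encoding="utf-8") as f:
        blocks = f.read().strip().split("\n\n")

    lines = []
    for block in blocks:
        rows = block.splitlines()
        # Keep only the text rows: skip the cue number and the timestamp row.
        text_rows = [r for r in rows
                     if not r.strip().isdigit() and "-->" not in r]
        if text_rows:
            lines.append(" ".join(r.strip() for r in text_rows))

    with open(txt_path, "w", encoding="utf-8") as f:
        f.write("\n".join(lines) + "\n")

srt_to_transcript("interview.srt", "interview_transcript.txt")
```

You would still need to merge consecutive lines from the same speaker and proofread the result before publishing it as the transcript.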
Obtaining captions from Transcribe
To create captions or transcripts, create a job in Transcribe by uploading your audio/video files. If you need time-accurate captions, you must choose the OpenAI Whisper model, the Qwen3-ASR model, or the Cohere Transcribe model. Check the speech-to-text model comparison page for differences between the models.
Once the transcription job finishes, go to the transcription for each file and open the download transcription dialog. There, you have the option to download the transcript as an enhanced .srt file.

Once you choose the "Enhanced SRT captions (.srt)" format, you can:
Choose the maximum number of characters per line and the maximum number of lines per caption.
Choose whether to include the names of speakers in each caption.
We recommend that, for captions in English, each line contain no more than 32 characters and each caption contain no more than 2 lines. For captions in other languages, please follow the guidelines for those languages.
We also recommend checking the "Include speaker names" option. By default, Transcribe performs speaker diarization (distinguishing different speakers) and assigns each speaker a default name such as "Speaker 1". Please check the Viewing/Editing Transcriptions page to see how to edit speaker names.
Speaker diarization is a difficult problem, and even the best speaker diarization systems regularly make mistakes. Please carefully check your generated SRT file for errors.
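If you want a quick automated check of the formatting recommendations above before proofreading, a small script along the following lines can flag cues that exceed the suggested limits. This is only a sketch; the file name is a placeholder, and you should adjust the limits for your language and style guide.

```python
# Minimal sketch: flag caption cues that exceed the recommended limits
# (at most 2 lines per caption, at most 32 characters per line for English).
# The file name is a placeholder. Note that speaker name prefixes such as
# "Speaker 1:" count toward the character limit in this naive check.
MAX_LINES = 2
MAX_CHARS = 32

with open("interview.srt", encoding="utf-8") as f:
    blocks = f.read().strip().split("\n\n")

for block in blocks:
    rows = block.splitlines()
    cue_id = rows[0].strip() if rows else "?"
    # Text rows are everything other than the cue number and timestamp rows.
    text_rows = [r for r in rows if not r.strip().isdigit() and "-->" not in r]
    if len(text_rows) > MAX_LINES:
        print(f"Cue {cue_id}: {len(text_rows)} lines (max {MAX_LINES})")
    for r in text_rows:
        if len(r) > MAX_CHARS:
            print(f"Cue {cue_id}: line has {len(r)} characters (max {MAX_CHARS})")
```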
Power users who prefer to create captions themselves can instead choose the "Word-level timestamps (.json)" option to obtain a JSON file with raw word-level timestamps for use in their own workflows.
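As one illustration of such a workflow, the sketch below groups word-level timestamps into simple SRT cues. The JSON schema shown here (a list of entries with "word", "start", and "end" fields, with times in seconds) is an assumption for illustration and may not match the actual file that Transcribe produces; adapt the field names and grouping logic to your data.

```python
# Minimal sketch: build SRT cues from word-level timestamps.
# The JSON schema (a list of {"word", "start", "end"} entries, times in
# seconds) and the file names are assumptions for illustration only.
import json

def fmt(seconds: float) -> str:
    """Format seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    ms = int(round(seconds * 1000))
    h, rem = divmod(ms, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1000)
    return f"{h:02}:{m:02}:{s:02},{ms:03}"

with open("interview_words.json", encoding="utf-8") as f:
    words = json.load(f)

cues = []
for i in range(0, len(words), 7):          # roughly 7 words per cue
    chunk = words[i:i + 7]
    text = " ".join(w["word"] for w in chunk)
    start, end = chunk[0]["start"], chunk[-1]["end"]
    cues.append(f"{len(cues) + 1}\n{fmt(start)} --> {fmt(end)}\n{text}\n")

with open("interview.srt", "w", encoding="utf-8") as f:
    f.write("\n".join(cues))
```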
Next steps
The captions obtained from Transcribe are, as-is, far from meeting accessibility standards. Transcribing audio/video files with accurate timestamps is a time-consuming process, and Transcribe saves time through automation. However, more work is still needed to make the captions fully compliant.
Proofing
Although AI has become much more accurate in speech recognition, it still makes mistakes. Please carefully check the AI generated files and check potential errors in speech recognition, speaker assignment, and timestamps.
Adding descriptions for non-speech sounds
Transcribe focuses on speech and does not include descriptions of non-speech sounds (for example, music, laughter, or applause). These need to be added manually.
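For example, a non-speech sound can be described in its own cue (or appended to a nearby cue) using bracketed text. The cue numbers, timestamps, and wording below are purely illustrative:

```
12
00:01:04,200 --> 00:01:06,000
[audience applauds]

13
00:01:06,000 --> 00:01:08,500
Speaker 1: Thank you all for coming.
```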
Embedding captions in videos
The SRT files that Transcribe provides can be embedded into the video files that you share.
If you are sharing your videos on YouTube, please follow YouTube's guide on adding caption tracks to your videos.
If you are sharing your videos via Google Drive, please follow Google Drive's guide on adding caption tracks to your videos.
Some video players also support playback with separate caption/subtitle files, and some can embed caption/subtitle files into the video. Please check their websites/user guides for help.
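As one example of embedding captions yourself, if you have ffmpeg installed, a call along these lines adds an SRT file to an MP4 video as a selectable (soft) caption track. The file names are placeholders, and other containers (for example MKV) use different subtitle codecs.

```python
# Minimal sketch: use ffmpeg (if installed) to add an SRT file to an MP4 video
# as a selectable caption track. File names are placeholders; other containers
# (e.g., MKV) use a different subtitle codec than mov_text.
import subprocess

subprocess.run(
    [
        "ffmpeg",
        "-i", "lecture.mp4",          # original video
        "-i", "lecture.srt",          # proofread captions from Transcribe
        "-c", "copy",                 # copy audio/video streams unchanged
        "-c:s", "mov_text",           # convert SRT to the MP4 subtitle format
        "-metadata:s:s:0", "language=eng",
        "lecture_captioned.mp4",
    ],
    check=True,
)
```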