# Improving accessibility of audio/video media with Transcribe

{% hint style="warning" %}
While Transcribe is helpful for enhancing audio/video media by automating speech recognition with AI, it currently still CANNOT generate captions/subtitles fully compatible with the WCAG 2.1 Level AA standard. This page provides tips on optimal workflows and necessary next steps towards compliance. Substantial work is still required after the automatic speech recognition step that Transcribe provides.

To determine whether your captions/subtitles fully comply with Brown University Digital Accessibility requirements, please visit the University's [digital accessibility](https://digital-accessibility.brown.edu/) page.

More helpful links:

* [Video Captions and Audio Description page in OIT Knowledge Base](https://ithelp.brown.edu/kb/articles/video-captions-and-audio-descriptions)
* [Captions/Subtitles page from the Web Accessibility Initiative (WAI) website](https://www.w3.org/WAI/media/av/captions/#captions-and-subtitles)
  {% endhint %}

Brown University Digital Accessibility requirements mandate that all digital content created by and shared with the Brown community is compliant with the WCAG 2.1 Level AA standard. Using a speech-recognition tool to automatically transcribe speech can be a helpful first step towards making audio/visual media accessible. Starting in February 2026, Transcribe is introducing many new features related to creating time-aligned audio/video captioning/subtitles that can help achieve this goal. This page provides information on how to use Transcribe to create time-aligned captions/subtitles for audio and video files.

Below is a video that demonstrates the quality of the captions generated by Transcribe with the OpenAI Whisper model. Please note that the captions are NOT fully compliant with accessibility standards. [Please visit NPR's YouTube channel if you would like to watch the original video.](https://www.youtube.com/watch?v=iBGZtNJAt-M)

{% embed url="<https://drive.google.com/file/d/1Xe129-fcB_TIo4QoZseupXgKh7nasNgt/view?usp=drive_link>" %}

## Captions vs. Subtitles

While the terms "captions" and "subtitles" are often used interchangeably, they do not provide the same information and are designed for different purposes. Subtitles provide a text version of the dialogue only, often in a different language. Essentially, subtitles assume that the audience can hear the audio but needs the dialogue provided as text. Captions, by contrast, are meant for audiences who cannot hear the audio, so they also convey non-speech information. Therefore, captions are generally what is used for accessibility purposes.

Transcribe can provide accurate, time-synced captions for spoken words in audio/video media. However, it cannot provide all the information required for these captions to comply with accessibility standards, because Transcribe uses speech recognition AI models that focus only on speech. It usually cannot include non-speech information in the captions (although some newer models, such as Gemini, show emerging capabilities for capturing non-speech information). We recommend that users add such information themselves, including:

* Description of non-speech sounds, such as "\[intense music playing]", "\[birds chirping]", etc.
* Name of the speaker, especially if it is not immediately clear from the visual content who is speaking
* Any other helpful context for understanding the visual content without sounds.

We also recommend that the captions are easily understood by audiences. For example, for captions in English, each line of the caption should contain no more than 32 characters, and no more than 2 lines should be displayed at the same time.

## Does my media need captions?

WCAG 2.1 Level AA requires synchronized, accurate closed captions for all pre-recorded audio in multimedia and live captions for live, time-based media. Since Transcribe currently handles only offline transcription, we assume that you have pre-recorded audio/video files that can be uploaded to Transcribe. Live transcription of meetings or events is not supported at the moment.

Examples of videos that need captioning include:

* Videos promoting your program or attracting students, participants, and alumni
* Videos showcasing curriculum, research, exhibitions, or collections
* Videos profiling students, faculty, or researchers
* Videos providing instructions for how to apply or register for programs
* Videos listing news stories about your department or program

For pre-recorded audio content, a transcript must be provided. A transcript has the same word-for-word content as captions but is presented in a separate file. It provides a text alternative to the audio presentation and is not synchronized with the audio timeline. A transcript should contain relevant speaker information to show who is saying what.

Source: [Video Captions and Audio Description page in OIT Knowledge Base](https://ithelp.brown.edu/kb/articles/video-captions-and-audio-descriptions)

## Obtaining captions from Transcribe

To create captions or transcripts, [create a job](/ai-tools/services/transcribe/creating-a-job.md) in Transcribe by uploading your audio/video files. **If you need time-accurate captions, you *must* choose the OpenAI Whisper model, the Qwen3-ASR model, or the Cohere Transcribe model.** Check the speech-to-text model comparison page for differences between the models.

Once the transcription job finishes, [go to the transcription for each file](/ai-tools/services/transcribe/viewing-editing-transcriptions.md) and [open the download transcription dialog](/ai-tools/services/transcribe/downloading-transcriptions.md). There, you have the option to [download the transcript as an enhanced `.srt` or `.vtt` file](/ai-tools/services/transcribe/downloading-transcriptions/subtitles.md).

<figure><img src="/files/kQPo05FDvJeiTy45WKUf" alt="A screenshot of the download transcript window, when the &#x22;Enhanced SRT Captions&#x22; option is selected." width="375"><figcaption></figcaption></figure>

Once you choose an enhanced caption format, you can:

1. Choose the number of characters per line, and maximum number of lines per caption.
2. Choose whether to include the names of speakers in each caption.

We recommend that, for captions in English, the number of characters per line does not exceed **32** and the number of lines per caption does not exceed **2**. For captions in other languages, please follow the guidelines for those languages. The same guidance applies whether you download SRT or WebVTT.
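As a quick check after downloading, the line-length recommendation can be verified with a short script. The following is a minimal sketch, assuming the file follows the common SRT layout (a cue number, a timing line, then one or more text lines, with blank lines between cues). The function name `find_caption_violations` is hypothetical and not part of Transcribe, and a speaker-name prefix, if present, is counted as part of the line.

```python
import re

MAX_CHARS_PER_LINE = 32  # recommended limit for English captions
MAX_LINES_PER_CUE = 2    # recommended number of lines shown at once


def find_caption_violations(srt_text: str) -> list[tuple[str, str]]:
    """Return (cue_number, problem) pairs for cues that exceed the limits."""
    problems = []
    # Cues in an SRT file are separated by one or more blank lines.
    for block in re.split(r"\n\s*\n", srt_text.strip()):
        lines = block.splitlines()
        if len(lines) < 3:
            continue  # not a complete cue (number, timing, text)
        cue_number, text_lines = lines[0].strip(), lines[2:]
        if len(text_lines) > MAX_LINES_PER_CUE:
            problems.append((cue_number, f"{len(text_lines)} lines"))
        for line in text_lines:
            if len(line) > MAX_CHARS_PER_LINE:
                problems.append((cue_number, f"line has {len(line)} characters"))
    return problems


# Illustrative sample only; not real Transcribe output.
sample = """1
00:00:01,000 --> 00:00:03,000
Welcome to the program.

2
00:00:03,000 --> 00:00:06,000
This single caption line is much too long to read comfortably.
"""
print(find_caption_violations(sample))  # flags cue 2 only
```

To check a downloaded file, read it with `open(path, encoding="utf-8").read()` and pass the text to the function. Cues that are flagged can be fixed by re-downloading with stricter settings or by editing the file by hand.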

We also recommend checking the "Include speaker names" option. By default, Transcribe performs speaker diarization (distinguishing different speakers) and assigns each speaker a default name like "Speaker 1". Please check the [Viewing/Editing Transcriptions page](/ai-tools/services/transcribe/viewing-editing-transcriptions.md) to see how to edit speaker names.

{% hint style="warning" %}
Speaker diarization is a difficult problem and even the best speaker diarization system regularly makes mistakes. Please carefully check your generated caption file for errors.
{% endhint %}

Power users who prefer to create captions themselves can also choose the "Word-level timestamps (.json)" option, which provides a JSON file with raw word-level timestamps for use in their own workflows.
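For illustration, here is a minimal sketch of how word-level timestamps could be grouped into SRT cues that respect the 32-character, 2-line recommendation. The JSON structure is an assumption: a flat list of objects with hypothetical fields `word`, `start`, and `end` (in seconds). Inspect your downloaded file for the actual structure and adapt the field names accordingly. The helper names `srt_time` and `words_to_srt` are also hypothetical.

```python
import json

MAX_CHARS_PER_LINE = 32
MAX_LINES_PER_CUE = 2


def srt_time(seconds: float) -> str:
    """Format seconds as an SRT timestamp, HH:MM:SS,mmm."""
    ms = round(seconds * 1000)
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"


def words_to_srt(words: list[dict]) -> str:
    """Greedily pack words into lines, and lines into cues."""
    cues = []     # each cue is a list of lines; each line is a list of words
    lines = [[]]  # lines of the cue currently being built
    for w in words:
        current = lines[-1]
        candidate = " ".join(x["word"] for x in current + [w])
        if current and len(candidate) > MAX_CHARS_PER_LINE:
            if len(lines) == MAX_LINES_PER_CUE:
                cues.append(lines)  # cue is full; start a new one
                lines = [[w]]
            else:
                lines.append([w])   # start the next line of this cue
        else:
            current.append(w)
    if lines[0]:
        cues.append(lines)

    out = []
    for number, cue in enumerate(cues, start=1):
        start, end = cue[0][0]["start"], cue[-1][-1]["end"]
        text = "\n".join(" ".join(x["word"] for x in line) for line in cue)
        out.append(f"{number}\n{srt_time(start)} --> {srt_time(end)}\n{text}\n")
    return "\n".join(out)


# Assumed structure, for illustration only.
sample_json = (
    '[{"word": "Hello", "start": 0.0, "end": 0.4},'
    ' {"word": "world", "start": 0.5, "end": 1.25}]'
)
words = json.loads(sample_json)
print(words_to_srt(words))
```

This greedy packing ignores sentence boundaries and speaker changes, so a real workflow would likely also break cues at punctuation, pauses, or speaker turns.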

## Next steps

The captions obtained from Transcribe as-is are far from meeting accessibility standards. Transcribing audio/video files with accurate timestamps is a time-consuming process, and Transcribe saves time through automation. However, more work is still needed to make the captions fully compliant.

### Proofing

Although AI has become much more accurate at speech recognition, it still makes mistakes. Please carefully review the AI-generated files for potential errors in speech recognition, speaker assignment, and timestamps.

### Adding description for non-speech sounds

Transcribe focuses on speech and does not include descriptions of non-speech sounds. They need to be added manually.
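You can add these descriptions in any text editor or caption editor. If you prefer to script it, the sketch below shows one way to insert a bracketed description into an SRT file and renumber the cues. It assumes the common SRT layout with fixed-width `HH:MM:SS,mmm` timestamps. The function name `add_sound_cue` is hypothetical, and the sketch does not check whether the new cue overlaps an existing one.

```python
import re


def add_sound_cue(srt_text: str, start: str, end: str, description: str) -> str:
    """Insert a bracketed non-speech description and renumber all cues."""
    cues = []
    for block in re.split(r"\n\s*\n", srt_text.strip()):
        lines = block.splitlines()
        if len(lines) >= 3 and "-->" in lines[1]:
            cue_start, cue_end = (t.strip() for t in lines[1].split("-->"))
            cues.append((cue_start, cue_end, "\n".join(lines[2:])))
    cues.append((start, end, f"[{description}]"))
    # Fixed-width timestamps sort correctly as plain text.
    cues.sort(key=lambda cue: cue[0])
    return "\n".join(
        f"{number}\n{s} --> {e}\n{text}\n"
        for number, (s, e, text) in enumerate(cues, start=1)
    )


# Illustrative sample only; not real Transcribe output.
base = "1\n00:00:05,000 --> 00:00:07,000\nHello there.\n"
print(add_sound_cue(base, "00:00:01,000", "00:00:04,000", "birds chirping"))
```

After inserting descriptions, replay the video with the updated file to confirm that each description appears when the sound occurs.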

### Embed captions in videos

The SRT and WebVTT files that Transcribe provides can be embedded into video files that you share.

* If you are sharing your videos on YouTube, please follow [YouTube's guide on adding caption tracks to your videos](https://support.google.com/youtube/answer/2734796?hl=en#zippy=%2Cupload-a-file).
* If you are sharing your videos via Google Drive, please follow [Google Drive's guide on adding caption tracks to your videos](https://support.google.com/drive/answer/1372218?hl=en\&co=GENIE.Platform%3DDesktop#zippy=%2Cupload-captions).
* Some video players also support playback with separate caption/subtitle files, and some can embed caption/subtitle files into the video. Please check their websites or user guides for help.

