Don’t Rely on YouTube Transcripts

Let’s establish something first — auto-generated captions are a problem. They almost guarantee a WCAG failure and can leave users more confused (or offended) than when they started.

YouTube creates the transcript from the closed captions of a video (the text that overlays the video, as opposed to burned-in or open captions). If you are using auto-generated captions, your transcripts will already be insufficient (or terrible).

Caption Limitations

Assuming you start with good captions, there are challenges to using them as a transcript.

The BBC has guidelines for creating captions (which those in the UK refer to as subtitles), though YouTube does not support all the recommended text formatting options. Most of those formatting options cannot appear in the generated transcript anyway.

When there is more than one speaker, speaker identification in captions generally happens only when the speaker is introduced. A leading dash then indicates a change in speaker, with the viewer able to see who is talking and so understand to whom they should attribute the words. In more robust captions, position and text style can also indicate the speaker. A transcript offers no visual context, so every change in speaker must name the new speaker.

Captions do not describe the scene visually. It is assumed a closed caption user can see the video and get all the context, including reading signs, charts, graphics, t-shirts, chyrons, expressions, and so on. If the setting and surrounding visuals are critical to understanding the meaning, that must be conveyed in the transcript.

Transcripts

Transcripts are most easily described as a script you could use to re-enact the media, though instead of stage directions they describe the actions, scenes, and/or their outcomes as appropriate.

The transcripts YouTube provides rely on the content of your captions. If you want your captions to be dual-purpose (transcript and captions), then your captions should identify every speaker change and provide a description of each part of the scene that is necessary to understand the context without the visuals.
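As a rough illustration of what dual-purpose captions need to carry, here is a minimal Python sketch that turns a list of caption cues into transcript text. The Cue class and its speaker and description fields are hypothetical and exist only for this example; they are not part of YouTube's caption format or any caption file standard. The sketch assumes you have already recorded who is speaking and which visuals matter for each cue.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Cue:
    """One caption cue. The speaker and description fields are hypothetical,
    invented for this sketch; no caption format is implied."""
    text: str
    speaker: Optional[str] = None      # who is talking in this cue
    description: Optional[str] = None  # visual context a reader cannot see


def cues_to_transcript(cues: list[Cue]) -> str:
    """Merge consecutive cues from one speaker and name the speaker at every change."""
    lines: list[str] = []
    current_speaker: Optional[str] = None
    for cue in cues:
        if cue.description:
            # Scene descriptions get their own bracketed line and end the
            # current run of speech, so the next line names its speaker again.
            lines.append(f"[{cue.description}]")
            current_speaker = None
        if not cue.text:
            continue
        if cue.speaker is not None and cue.speaker == current_speaker:
            # Same speaker as the previous spoken line: continue that line.
            lines[-1] += " " + cue.text
        else:
            # New (or unidentified) speaker: start a labeled line.
            label = cue.speaker or "Unknown speaker"
            lines.append(f"{label}: {cue.text}")
            current_speaker = cue.speaker
    return "\n".join(lines)


if __name__ == "__main__":
    demo = [
        Cue("Welcome to the show.", speaker="Ana"),
        Cue("Today we talk captions.", speaker="Ana"),
        Cue("Thanks for having me.", speaker="Ben"),
        Cue("", description="Ben holds up a chart"),
        Cue("As you can see here.", speaker="Ben"),
    ]
    print(cues_to_transcript(demo))
```

A cue with no speaker is labeled "Unknown speaker" rather than silently merged into the previous line, which makes gaps in speaker identification easy to spot before publishing.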

Unfortunately, YouTube does not seem to understand the difference between captions and transcripts. In its help guide, the section titled "Tips for creating a transcript file" shows you how to add a caption file. The two do not appear to be distinct as far as YouTube is concerned.

The YouTube player with two people talking, their heads clipped by the top of the window so it is not possible to see who is speaking; the captions and transcript signify differing speakers with a leading dash, but no names. — The caption does not identify the speaker, relying on visual cues instead. The transcript mimics the caption, making for a transcript that is hard to follow out of context (without seeing the video).

You Still Cannot Rely on YouTube

Let’s say you have a podcast you are hosting on YouTube (so it’s a vodcast that your audience cannot download) and you took the time to set up your captions to include speaker identification for each change in speaker. Maybe it’s just a couple people in front of microphones with webcams looking up their noses, so you do not need (want) to include scene descriptions. Congratulations, YouTube’s auto-generated transcript might be fine.

Perhaps you embedded this video on your site, where the transcript option is not available, so you link to the YouTube video with a note that the transcript can be found there. All done, right?

Well…

If your users come in on mobile, the mobile YouTube site does not offer any way to get to the transcript. You can test it yourself even without a mobile device handy. Use your browser’s dev tools to mimic a mobile device and head over to https://m.youtube.com/watch?v=UjXCAuWuXdk.

I cannot seem to find transcripts in the mobile app either, though I am happy to be shown where to find them.

If you use YouTube to host your non-podcast vodcast (for example), then you cannot rely on YouTube’s automatic transcript generator, and you definitely cannot rely on its ability to present that transcript to users.

Update: 27 March 2022

YouTube now shows its fake transcripts in the mobile app. Not the mobile site, of course.

So if you can guarantee your audience will always use the latest version of the YouTube app, and never rely on the mobile site, and they will risk tapping the detail area of the video, and then scroll down below the description and discover the SHOW TRANSCRIPT control (that does not announce as a button), and then move through the entire video player yet again (because focus is not managed), and then discover the list of captions is not really a transcript, then sure, you can rely on YouTube to provide access to the collection of captions.

YouTube app on Android in landscape mode with the video and captions on the left and list of captions on the right with the transcript heading. — In landscape mode, the fake transcript scrolls alongside the video. In portrait mode, it appears below.

What to Do

Luckily, the solution is simple. Host the transcript on your own site. Maybe below the embedded video. Hide it behind a <details> / <summary> disclosure if you think it clutters the page too much.
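If your pages are generated by a script, the disclosure markup can be produced there. The following is a minimal Python sketch, assuming the transcript is plain text with one paragraph per line. The function name transcript_details and the default "Transcript" label are my own choices for illustration, not part of any library.

```python
from html import escape


def transcript_details(transcript: str, summary: str = "Transcript") -> str:
    """Wrap a plain-text transcript in a collapsed <details> element.

    Each non-blank line becomes a paragraph. Text is HTML-escaped so that
    characters such as < and & in the transcript cannot break the page.
    """
    paragraphs = [
        f"  <p>{escape(line.strip())}</p>"
        for line in transcript.splitlines()
        if line.strip()
    ]
    return "\n".join(
        [
            "<details>",
            f"  <summary>{escape(summary)}</summary>",
            *paragraphs,
            "</details>",
        ]
    )


if __name__ == "__main__":
    print(transcript_details("Ana: Welcome to the show.\nBen: Thanks for having me."))
```

Place the returned markup directly after the embedded video so the transcript is available on every device, regardless of what YouTube's site or app chooses to show.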