Don’t Rely on YouTube Transcripts
Let’s establish something first — auto-generated captions are a problem. They almost guarantee a WCAG failure and can leave users more confused (or offended) than when they started.
YouTube creates the transcript from the closed captions of a video (the text that overlays the video, as opposed to burned-in or open captions). If you are using auto-generated captions, your transcripts will already be insufficient (or terrible).
Assuming you start with good captions, there are challenges to using them as a transcript.
The BBC has guidelines for creating captions (which those in the UK refer to as subtitles), though YouTube does not support all the recommended text formatting options. Most of those formatting options cannot appear in the generated transcript anyway.
When there is more than one speaker, speaker identification in captions generally happens only when the speaker is introduced. A leading dash then indicates a change in speaker, and the viewer relies on seeing who is talking to know to whom the words should be attributed. In more robust captions, position and text style can also indicate the speaker. A transcript offers no visual context, so it must name the speaker at every change in speaker.
Captions do not describe the scene visually. It is assumed a closed caption user can see the video and get all the context, including reading signs, charts, graphics, t-shirts, chyrons, expressions, and so on. If the setting and surrounding visuals are critical to understanding the meaning, that must be conveyed in the transcript.
Transcripts are most easily described as a script you could use to re-enact the media, though instead of stage directions they describe the actions, scenes, and/or their outcomes as appropriate.
The transcripts YouTube provides rely on the content of your captions. If you want your captions to be dual-purpose (transcript and captions), then your captions should identify every speaker change and provide a description of each part of the scene that is necessary to understand the context without the visuals.
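To make the dual-purpose idea concrete, here is a minimal sketch of turning caption cues into transcript text. It assumes you already know who is speaking in each cue, since a leading dash alone cannot tell you. The `Cue` type, the `cues_to_transcript` function, and the bracket convention for scene descriptions are all hypothetical choices for illustration, not anything YouTube provides.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Cue:
    """One caption cue. A speaker of None marks a scene description."""
    speaker: Optional[str]
    text: str


def cues_to_transcript(cues: list[Cue]) -> str:
    """Name the speaker at every change and keep scene descriptions."""
    paragraphs: list[str] = []
    last_speaker: Optional[str] = None
    for cue in cues:
        if cue.speaker is None:
            # Scene description: put it on its own line in brackets and
            # force the next spoken cue to re-identify its speaker.
            paragraphs.append(f"[{cue.text}]")
            last_speaker = None
        elif cue.speaker == last_speaker:
            # Same speaker continues, so extend the current paragraph.
            paragraphs[-1] += " " + cue.text
        else:
            # Speaker changed, so start a new labeled paragraph.
            paragraphs.append(f"{cue.speaker}: {cue.text}")
            last_speaker = cue.speaker
    return "\n".join(paragraphs)


if __name__ == "__main__":
    demo = [
        Cue("Ana", "Welcome back."),
        Cue("Ana", "Today we talk captions."),
        Cue("Ben", "Thanks for having me."),
        Cue(None, "Ben holds up a chart titled Results"),
        Cue("Ben", "As you can see."),
    ]
    print(cues_to_transcript(demo))
```

The sketch re-labels the speaker after every scene description, on the assumption that a reader without the visuals may have lost track of who was talking.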
Unfortunately, YouTube does not seem to understand the difference between captions and transcripts. In its help guide, “Tips for creating a transcript file” shows you how to add a caption file. The two do not appear to be distinct as far as YouTube is concerned.
You Still Cannot Rely on YouTube
Let’s say you have a podcast you are hosting on YouTube (so it’s a vodcast that your audience cannot download) and you took the time to set up your captions to include speaker identification for each change in speaker. Maybe it’s just a couple people in front of microphones with webcams looking up their noses, so you do not need (want) to include scene descriptions. Congratulations, YouTube’s auto-generated transcript might be fine.
Perhaps you embedded this video on your site, where the transcript option is not available, so you link to the YouTube video with a note that the transcript can be found there. All done, right?
If your users come in on mobile, the mobile YouTube site does not offer any way to get to the transcript. You can test it yourself even without a mobile device handy. Use your browser’s dev tools to mimic a mobile device and head over to https://m.youtube.com/watch?v=UjXCAuWuXdk.
I cannot seem to find transcripts in the mobile app either, though I am happy to be shown where to find them.
If you use YouTube to host your non-podcast vodcast (for example), then you cannot rely on YouTube’s automatic transcript generator, and you definitely cannot rely on its ability to present that transcript to users.
What to Do
Luckily the solution is simple. Host the transcript on your own site. Maybe below the embedded video. Hide it in a <details> element behind a <summary> if you think it clutters the page too much.
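If your pages are generated by a script, the markup might be produced along these lines. This is only a sketch: the function name `transcript_disclosure` and the default label are made up for illustration, and it assumes the transcript arrives as a list of plain-text paragraphs. The `<summary>` sits inside a `<details>` element, which is what gives you the native show/hide behavior.

```python
from html import escape


def transcript_disclosure(paragraphs: list[str],
                          label: str = "Video transcript") -> str:
    """Wrap transcript paragraphs in a <details>/<summary> disclosure."""
    # Escape the text so characters like & and < cannot break the markup.
    body = "\n".join(f"  <p>{escape(p)}</p>" for p in paragraphs)
    return (
        "<details>\n"
        f"  <summary>{escape(label)}</summary>\n"
        f"{body}\n"
        "</details>"
    )


if __name__ == "__main__":
    print(transcript_disclosure(["Ana: Hi & welcome.", "Ben: Thanks."]))
```

Place the resulting markup directly below the embedded video so users do not have to leave the page to find the transcript.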