Speech Viewer Logs of Lies

When sighted users test with a screen reader it is common to rely on the visual output — checking to see where focus goes, confirming that controls behave, watching the spoken output in a text log.

The problem is that what you see in those speech viewers is not always what you hear.

Consider how an emoji is represented in a log versus how it is announced. While visually it may be simple, it can be an audible assault for a screen reader user. 👏 is much easier to skim past when looking at a tweet, but harder when it is announced as graphic clickable emoji colon clapping hands sign.

The same is true for other extended characters, as in the following example…

As captured in VoiceOver for macOS using Safari. I wrote the captions instead of relying on the speech viewer.

When we rely on the display of the speech instead of listening to it, we as developers or testers or authors are unaware that ┏ is announced as box drawings heavy down and right in VoiceOver. Similarly, the related ━, which may look like an em-dash, is announced as box drawings heavy horizontal.

If I just relied on the VoiceOver speech viewer, those captions would have been much easier to type and also completely inaccurate.

Demonstration

Obviously these characters are exceptions. Using extended characters always comes with risk, but sometimes even the most mundane text can have an unexpected announcement. I made a quick demo with common content I see on the web:

See the Pen PoNYREO by Adrian Roselli (@aardrian) on CodePen.

Here is how that content is recorded in the JAWS Speech History window:

JAWS Speech History window showing the following text: “Demo | SR Announcements, heading level 1, First Name:, First Name: Edit, Type in text, Submit > Button, list of 1 items, 📌, visited Link WAI-ARIA 1.1, list end, NASA and IT got together on 8/4/2020 at 10:00am, 👆 what he said.”
Get to this screen by pressing JAWS + Space, and then H. You can then Ctrl + A to select it all, then Ctrl + C to copy it before accidentally closing the window. Your JAWS key will be either Caps Lock or number pad Insert depending on your keyboard set-up.

Here is how that same content sounds when spoken — I have added captions that reflect the full announcement, so please enable them and read along as you listen:

Using JAWS 2019 with Chrome 84 on Windows 10, using only to navigate.
Transcript

Demo vertical bar ess are Announcements heading level one

First Name colon

First Name colon edit type in text

Submit greater button

list of one items

pushpin

visited link why dash ARIA one point one

list end

blank

NASA and it got together on eight slash four slash twenty twenty at ten AM

blank

white up pointing backhand index what he said

I want to impress upon you, dear reader, that I am only using JAWS because it was quickest for me. Each screen reader will have slightly different heuristics for how these are announced. I simply do not have time to record them all. This post is about assumptions regardless of screen reader.

Takeaways

Do not use this as justification to try to override what a screen reader says. I noted this on Stack Overflow, back when I was wasting too much time there. It has come up on the WebAIM list many times. I have to explain it to clients on the regular.

If you read my post Stop Giving Control Hints to Screen Readers then you already know trying to tailor content for screen reader users can be problematic. Similarly, forcing your preferred pronunciation or phrasing is a bad idea and should never be done (unless backed up by user research and a darn good use case).

All that out of the way, I see two risks from being unaware of what a screen reader is actually saying:

  1. QA teams rely on the output matching a speech log;
  2. verbose interfaces (that sometimes hide important cues).

I have worked with QA teams who send something back to developers because it does not exactly audibly match a provided screen reader log. Often that log is from a different screen reader / browser pairing or a customized set-up not used by QA.

Similarly, if as a developer all I do is ensure the extended characters (emoji, math symbols, etc.) show up in the speech viewer but do not confirm how they are announced (if at all), I can create an unusable interface.

Screen readers offer settings that let you control how verbose they are when encountering punctuation, certain characters, abbreviations, acronyms, or even words. A skilled screen reader user may take advantage of those features, but you cannot count on users doing it and you cannot suggest that as a fix.

What to Do?

Pay attention to times, dates, currency, punctuation, special characters, emoji, math symbols, common (and uncommon) abbreviations & acronyms, and so on. Be aware how they announce across all screen reader and browser pairings that will view them.

Other steps:

If you are working with multilingual content (truly multilingual, not just the word panini in a description of your lunch) then be sure the lang attribute and the appropriate values are in place. This will help limit incorrect pronunciation from incorrectly marked-up content.

Update: Later that afternoon…

The problem with hyperbolic headlines is they can lead to overly-broad statements on Twitter, such as when I first shared this and suggested that you might be doing it wrong if not listening to the output. While Steve gave it the appropriate wary emoji, the following response drives home the point:

One of the benefits of the de facto peer review in accessibility is when people point out a flaw in your over-simplified (character-constrained) conclusions. Making sure comments are enabled on your posts and replies allowed on your tweets is a good way to be certain you are not lost in an echo chamber of one.

Similarly, folks can identify other use cases that you failed to mention:

Considering I have seen this happen (QA signs off on a thing, even though thing is never announced due an unnecessary live region interrupting it), it should have occurred to me to include it in the post.

Update: 24 August 2020

This is a great point about JAWS and which likely applies to the other screen readers (namely that text viewers are not QA tools, they are there to support end users).

3 Comments

Reply

There is a solution to these kinds of problem. Certainly all web pages should go through QA and improvements be made so far as possible from their input. But unless you have an expert accessibility consultant on the team (which is not usual), then the next step is then to send the page to an external accessibility audit consultant for them to test.

They will use screen readers on it and listen to what the reader actually says, they won’t look at logs which as pointed out are not helpful. (You also need to invest in having them test on both desktop screen readers and screen readers on mobile devices, as they produce different results.) An experienced audit consultant will also find any other accessibility defects on the page, to help users with other disabilities besides being blind. They have the expertise to do the whole job for you – they do this kind of work full time, not just as a small bit-part among all the other kinds of testing to be done.

Guy Hickling, Accessibility Consultant; . Permalink
Reply

Some very good observations here. I’d like to mention a couple of omissions.

First of all, screen reader announcements are typically handled by a system-level speech synthesiser. There are usually various ‘voices’ to choose between, to represent different dialects or other preferences.

The pronunciation heuristics that each ‘voice’ follows are not the same, even with different voices in the same ‘dialect’ from the same vendor.

Example: We have products for a medical context where the string “IV” is announced as “four” by some voices and “roman four” by other voices using the same screen reader. Both are wrong in our case. (We need the idiomatic abbreviation for “intravenous”, which is simply the announcement of the two letters).

So… testing on different screen readers will not reveal the full scale of the problem – and in any case, there is little that can be done by the content creator (or web developer) to fix it which wont compromise the output on (say) a Braille device. So testing widely might reveal more pronunciation issues, but we still have no way to solve them.

As the article above points out the best practice is not to dictate phonetics, it is to indicate clear semantics, but we lack a mechanism to do this. Sometimes the sequence :) is not a smiley, and sometimes I want “IV” read out as two letters. Sometimes O2 means oxygen, sometimes “O-squared” is intended. How do I specify these things?

The W3 pronunciation task force was set up to address some of these issues from a standards point of view, and I encourage all readers here to support or even contribute to their efforts, but the problem and its solution lies primarily with speech synth vendors, not the AT vendors or the content creators.

The primary speech synth vendors are Microsoft, Apple and Google, all of whom have made speech synthesis a part of their operating systems. I don’t get the impression that the teams involved in these technologies are even aware that their speech synth offerings have these significant accessibility failures, and they won’t find out if we complain only to Freedom Scientific, or NVDA, or blame the web developers.

All three of the main speech synth vendors offer quite sophisticated accessibility features at a system level, but none of them tackle the problem of poorly guessed screen reader pronunciation. The way that time values are announced is particularly uneven, and time values are a rare case where you can indicate which field refers to which value. In practice, hours, minutes and seconds are often announced wrongly – even if you take the trouble to specify a well-formed datetime attribute.

So how do we – as web content and accessibility professionals – apply pressure at the right (and most effective) place to the appropriate teams at Microsoft, Apple and Google to quit their ‘clever’ guessing, and give content creators a modicum of semantic control over how emojis, dingbats, abbreviations, acronyms, units, technical values and the like are intended to be treated?

Brennan Young; . Permalink
Reply

A further point about multilingual support.

The lang attribute is often ignored, especially if the string for pronunciation is found inside an aria attribute, or a live region. Try it! We find dismal results on many browsers. Safari seems to ignore lang altogether – the system speech synth language setting overrides everything in the markup.

Again, knowing that there are problems with screen reader pronunciation is of questionable value if we have no way to fix them. What is the business case for running tests for problems which have no solution?

Brennan Young; . Permalink

Leave a Comment or Response

You may use these HTML tags and attributes: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <s> <strike> <strong>