Implementation experience in the ARIA-AT project has uncovered a large amount of variation in the formatting of the text that screen readers send to text-to-speech engines.
For instance, we've observed text from JAWS such as:
Print Page \u001d Button \u001e
(Where \u001d and \u001e represent the Unicode "group separator" and "record separator", respectively)
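Because these separators are invisible in most terminals and logs, a small debugging helper can make them explicit. This is a hypothetical sketch for illustration only (the function name and escaping scheme are our own, not anything from the specification or the ARIA-AT codebase):

```javascript
// Hypothetical helper: make invisible C0 control characters (such as the
// JAWS group/record separators) visible by escaping them as \uXXXX.
function revealControlCharacters(text) {
  return text.replace(/[\u0000-\u001f]/g, (ch) =>
    "\\u" + ch.charCodeAt(0).toString(16).padStart(4, "0")
  );
}
```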
And text from VoiceOver like:
Print Page
button
You are currently on a button. To click this button, press
Control -Option -Space.
(Note the copious amount of empty space, including a trailing space on the third line)
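A sketch of one way a consumer might collapse that whitespace. The splitting rule is an assumption on our part, not behavior defined by VoiceOver or the specification:

```javascript
// Hypothetical sketch: collapse runs of whitespace (including newlines and
// trailing spaces, as in the VoiceOver sample above) into single spaces.
function collapseWhitespace(text) {
  return text.replace(/\s+/g, " ").trim();
}
```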
To be sure, these examples are not only accurate but also compliant. The specification places no constraints on the way the text is formatted. The relevant language reads:
When the assistive technology would send some text data (a string, without speech-specific markup or annotations) to the Text-To-Speech system, or equivalent for non-speech assistive technology software, run these steps:
However, those examples are not the most intuitive way to express the spoken text. The formatting is important to ARIA-AT, so we've written logic to normalize the text at the application level. Since I expect formatting will also be important to many future consumers of the protocol, this seems like an opportunity for the standard to reduce repeated work.
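To make the kind of normalization under discussion concrete, here is a minimal sketch. It is an illustration of the general approach, not ARIA-AT's actual implementation, and the specific rules (treating control characters as separators, collapsing whitespace) are assumptions:

```javascript
// Minimal sketch of application-level normalization of speech data.
// NOT ARIA-AT's actual code; the rules below are illustrative assumptions.
function normalizeSpeech(text) {
  return text
    // Treat ASCII control characters (e.g. the JAWS group/record
    // separators) as ordinary separators.
    .replace(/[\u0000-\u001f]/g, " ")
    // Collapse runs of whitespace, including newlines, into single spaces.
    .replace(/\s+/g, " ")
    .trim();
}
```

Applied to the two samples above, both the JAWS control characters and the VoiceOver whitespace reduce to plainly readable strings.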
A number of concerns come to mind:
- removing details which have no impact on the vocalized text (e.g. extraneous space, new lines, some punctuation, some capitalization)
- using a data type other than a simple string (e.g. an array of strings, each describing a discrete utterance)
- expressing this in a localizable way (at first blush, Unicode's offerings seem promising)
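To illustrate the second concern, here is one hypothetical shape for an "array of discrete utterances" representation. The splitting rule (breaking on control characters, which include both the JAWS separators and VoiceOver's newlines) is invented for illustration and is not proposed normative behavior:

```javascript
// Hypothetical sketch of the "array of strings" data type: split raw speech
// data into discrete utterances on runs of C0 control characters, which
// cover both the JAWS separators and newline characters.
function toUtterances(text) {
  return text
    .split(/[\u0000-\u001f]+/)
    .map((piece) => piece.trim())
    .filter((piece) => piece.length > 0);
}
```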
Should we expect implementations to eventually improve and "do the right thing" in these regards? Or should we constrain speech data in some way? If so, how?