Apple has always gone out of its way to build features for users with disabilities, and VoiceOver on iOS is an invaluable tool for anyone with a vision impairment — assuming every element of the interface has been manually labeled. But the company just unveiled a brand new feature that uses machine learning to identify and label every button, slider and tab automatically.
Screen Recognition, available now in iOS 14, is a computer vision system that has been trained on thousands of images of apps in use, learning what a button looks like, what icons mean and so on. Such systems are very flexible — depending on the data you give them, they can become expert at spotting cats, facial expressions or, as in this case, the different parts of a user interface.
The result is that in any app now, users can invoke the feature and a fraction of a second later every item on screen will be labeled. And by “every,” they mean every: after all, screen readers need to be aware of everything that a sighted user would see and be able to interact with, from images (which iOS has been able to create one-sentence summaries of for some time) to common icons (home, back) and context-specific ones like “…” menus that appear just about everywhere.
The idea is not to make manual labeling obsolete. Developers know best how to label their own apps, but updates, changing standards and challenging situations (in-game interfaces, for instance) can leave things less accessible than they could be.
I chatted with Chris Fleizach from Apple’s iOS accessibility engineering team, and Jeff Bigham from the AI/ML accessibility team, about the origin of this extremely helpful new feature. (It’s described in a paper due to be presented next year.)
“We looked for areas where we can make inroads on accessibility, like image descriptions,” said Fleizach. “In iOS 13 we labeled icons automatically — Screen Recognition takes it another step forward. We can look at the pixels on screen and identify the hierarchy of objects you can interact with, and all of this happens on device within tenths of a second.”
The idea is not a new one, exactly; Bigham mentioned a screen reader, Outspoken, which years ago attempted to use pixel-level data to identify UI elements. But while that system needed precise matches, the fuzzy logic of machine learning systems and the speed of iPhones’ built-in AI accelerators mean that Screen Recognition is much more flexible and powerful.
It wouldn’t have been possible just a couple of years ago — the state of machine learning and the lack of a dedicated unit for executing it meant that something like this would have been extremely taxing on the system, taking much longer and probably draining the battery all the while.
But once this kind of system seemed possible, the team got to work prototyping it with the help of their dedicated accessibility staff and testing community.
“VoiceOver has been the standard-bearer for vision accessibility for so long. If you look at the steps in development for Screen Recognition, it was grounded in collaboration across teams — Accessibility throughout, our partners in data collection and annotation, AI/ML, and, of course, design. We did this to make sure that our machine learning development continued to push toward an excellent user experience,” said Bigham.
The model was trained by taking thousands of screenshots of popular apps and games, then manually labeling the items in them as one of several standard UI elements. This labeled data was fed to the machine learning system, which soon became proficient at picking out those same elements on its own.
It’s not as simple as it sounds — as humans, we’ve gotten quite good at understanding the intention of a particular graphic or bit of text, and so often we can navigate even abstract or creatively designed interfaces. It’s not nearly as clear to a machine learning model, and the team had to work with it to create a complex set of rules and hierarchies that ensure the resulting screen reader interpretation makes sense.
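To give a sense of what one such rule might look like, here is a minimal Python sketch. It is not Apple's implementation, which the team did not describe in detail; it only illustrates the general problem of turning a pile of detected elements into an order a screen reader can step through. The `Detection` class, the `reading_order` function and the `row_tolerance` value are all hypothetical names and numbers chosen for illustration. The assumption is that a detector has already returned a bounding box and an element type for each item on screen.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Detection:
    """One detected UI element (hypothetical structure, for illustration only)."""
    kind: str      # element type, e.g. "back", "slider", "tab_home"
    x: float       # left edge of the bounding box
    y: float       # top edge of the bounding box
    width: float
    height: float

    @property
    def center_y(self) -> float:
        return self.y + self.height / 2


def reading_order(detections, row_tolerance=0.5):
    """Group detections into rows, then read each row left to right.

    Two elements share a row when their vertical centers differ by no more
    than `row_tolerance` times the height of the first element in that row.
    The tolerance is an assumed value, not a figure from Apple.
    """
    rows = []
    for det in sorted(detections, key=lambda d: d.center_y):
        if rows:
            anchor = rows[-1][0]
            if abs(det.center_y - anchor.center_y) <= row_tolerance * anchor.height:
                rows[-1].append(det)
                continue
        rows.append([det])
    return [det for row in rows for det in sorted(row, key=lambda d: d.x)]


# A made-up screen: a top bar, a slider, and a two-item tab bar,
# listed in the arbitrary order a detector might return them.
sample = [
    Detection("tab_search", 100, 602, 100, 50),
    Detection("more", 300, 24, 40, 40),
    Detection("slider", 20, 200, 300, 30),
    Detection("back", 10, 20, 40, 40),
    Detection("tab_home", 0, 600, 100, 50),
]
ordered = [d.kind for d in reading_order(sample)]
```

Even this toy version shows why simple sorting is not enough: the "…" button and the back button sit a few pixels apart vertically, so a naive top-to-bottom sort could split them across rows or read them in an order no sighted user would follow. A real system would need many more rules than this, for nested containers, overlapping elements and unusual layouts.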
The new capability should help make millions of apps more accessible, or just accessible at all, to users with vision impairments. You can turn it on by going to Accessibility settings, then VoiceOver, then VoiceOver Recognition, where you can turn on and off image, screen and text recognition.
It would not be trivial to bring Screen Recognition over to other platforms, like the Mac, so don’t get your hopes up for that just yet. The principle is sound, but the model itself does not generalize to desktop apps, which are very different from mobile ones. Perhaps others will take on that task; the prospect of AI-driven accessibility features is only just beginning to be realized.