Apple researchers recently unveiled ReALM (Reference Resolution As Language Modeling), an advancement in artificial intelligence designed to improve voice assistant interactions by tackling a key challenge: understanding user references to what’s on their screen (via VentureBeat).
Voice assistants usually struggle to interpret ambiguous user commands, particularly those referencing visual elements on a device’s display. ReALM clears this hurdle by leveraging the power of large language models, which analyze the on-screen content and contextualize user queries, enabling the assistant to pinpoint the specific information being referenced.
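To make the idea concrete, here is a minimal, hypothetical sketch (not Apple’s code; the entity types, prompt wording, and function names are illustrative assumptions) of how reference resolution can be framed as a language-modeling task: candidate on-screen entities are written out as text, and a model is asked which one an ambiguous command refers to.

```python
# Illustrative sketch only: reference resolution framed as a language-modeling task.
# Candidate on-screen entities are serialized as text, and an LLM would be asked
# which entity the user's ambiguous command refers to.

from dataclasses import dataclass

@dataclass
class Entity:
    entity_id: int
    kind: str   # e.g. "phone_number", "address", "business_name" (assumed labels)
    text: str   # the visible text of the on-screen element

def build_prompt(entities: list[Entity], user_query: str) -> str:
    """Serialize candidate entities and the user's query into a single text prompt."""
    lines = ["Candidate entities on screen:"]
    for e in entities:
        lines.append(f"  [{e.entity_id}] ({e.kind}) {e.text}")
    lines.append(f'User request: "{user_query}"')
    lines.append("Which entity id does the request refer to? Answer with the id only.")
    return "\n".join(lines)

entities = [
    Entity(1, "business_name", "Joe's Pizza"),
    Entity(2, "phone_number", "(555) 010-7788"),
    Entity(3, "address", "123 Main St"),
]
print(build_prompt(entities, "call that number"))
# A model fine-tuned for this task would ideally complete the prompt with "2".
```

In this framing, picking the right entity becomes ordinary text completion, which is what lets an off-the-shelf or fine-tuned language model handle it.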
“Being able to understand context, including references, is essential for a conversational assistant,” the Apple research team writes. “Enabling the user to issue queries about what they see on their screen is a crucial step in ensuring a true hands-free experience in voice assistants.”
This innovation hinges on ReALM’s ability to reconstruct the user’s screen. By parsing on-screen elements and their locations, it generates a textual representation that captures the visual layout, translating visual information into a language model’s familiar territory. This approach, combined with fine-tuned language models, surpasses existing systems such as GPT-4 at understanding screen-based references.
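As a rough illustration of that screen-to-text step, the sketch below assumes parsed UI elements arrive with normalized position coordinates from some screen parser; the field names, row-grouping tolerance, and output format are assumptions for illustration, not the paper’s exact encoding. Elements are ordered top-to-bottom and left-to-right, then joined into a plain-text layout a language model can read.

```python
# Minimal sketch of the screen-to-text idea (illustrative, not Apple's implementation):
# parsed UI elements with normalized positions are grouped into visual rows and
# emitted as plain text that preserves the rough layout of the screen.

from dataclasses import dataclass

@dataclass
class ScreenElement:
    text: str
    x: float  # left edge, normalized 0..1
    y: float  # top edge, normalized 0..1

def screen_to_text(elements: list[ScreenElement], row_tolerance: float = 0.02) -> str:
    """Group elements into rows by vertical position, then order each row left-to-right."""
    ordered = sorted(elements, key=lambda e: (e.y, e.x))
    rows: list[list[ScreenElement]] = []
    for el in ordered:
        if rows and abs(el.y - rows[-1][0].y) <= row_tolerance:
            rows[-1].append(el)   # close enough vertically: same visual row
        else:
            rows.append([el])     # start a new row
    return "\n".join(
        "  ".join(e.text for e in sorted(row, key=lambda e: e.x)) for row in rows
    )

elements = [
    ScreenElement("Joe's Pizza", 0.05, 0.10),
    ScreenElement("(555) 010-7788", 0.60, 0.10),
    ScreenElement("Open until 10 PM", 0.05, 0.14),
]
print(screen_to_text(elements))
# Joe's Pizza  (555) 010-7788
# Open until 10 PM
```

A text block like this can then be placed alongside the user’s query, so a reference such as “call that number” can be resolved against what is actually visible.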
The benefits extend beyond convenience. ReALM paves the way for a truly hands-free experience: users can interact with their devices seamlessly, issuing voice commands directly related to what they see on the screen. This is particularly valuable for visually impaired users, or in situations where touching the device is impractical.
Apple researchers acknowledge the limitations of this technology. ReALM relies on automated parsing, which can struggle with complex visual references, like distinguishing between multiple images. Future iterations might incorporate computer vision and multi-modal techniques to address these challenges.
Apple’s upcoming Worldwide Developers Conference (WWDC) on June 10 is expected to serve as a platform for showcasing its AI advancements alongside iOS 18, a major update for iPhones. Speculation also suggests the unveiling of a new large language model framework, an “Apple GPT” chatbot, and a broader integration of AI features across its ecosystem.