Voice AI Agents: Evolution to Emerging Applications
Conversational AI’s Next Phase: We’ve Arrived
Voice AI Agents are ushering in a new era of interactions. After enduring clunky IVRs and awkward conversations with Alexa, talking to a Voice AI agent today is a delight — more human-like, natural, and interactive (call Bland and check it out).
Conversational AI has been around for decades, but the technology has never been better. As a result, voice features keep dropping as part of the multimodal trajectories of AI heavyweights — OpenAI launched GPT-4o’s native speech-to-speech capability during its spring update a few months ago (watch the live demo here), and Meta is in talks with celebrities to release celebrity voice-powered Meta AI assistants.
Traditionally, the conversational AI stack involved pipelining Automatic Speech Recognition (ASR, also known as Speech-to-Text) models, Large Language Models (LLMs), and Text-to-Speech (TTS) models. These models have seen significant performance improvements across metrics over time, and additional techniques, including voice biometrics and voice cloning, have led to richer experiences.
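To make the cascaded stack concrete, here is a minimal sketch of a single conversational turn through an ASR → LLM → TTS pipeline. It uses OpenAI-hosted models purely for illustration; any ASR, LLM, and TTS combination can be swapped in, and the system prompt is an invented example.

```python
# Minimal sketch of the traditional cascaded voice stack:
# ASR (speech-to-text) -> LLM -> TTS (text-to-speech).
# Illustrative only; model names and the system prompt are examples.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def voice_turn(audio_path: str, reply_path: str = "reply.mp3") -> str:
    # 1. ASR: transcribe the caller's audio into text.
    with open(audio_path, "rb") as f:
        transcript = client.audio.transcriptions.create(
            model="whisper-1", file=f
        )

    # 2. LLM: generate a reply from the transcript text.
    completion = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": "You are a helpful voice agent."},
            {"role": "user", "content": transcript.text},
        ],
    )
    reply = completion.choices[0].message.content

    # 3. TTS: synthesize the reply back into speech.
    speech = client.audio.speech.create(model="tts-1", voice="alloy", input=reply)
    with open(reply_path, "wb") as out:
        out.write(speech.content)

    return reply
```

Each hop adds latency, and the audio is reduced to plain text between stages, which is exactly the limitation integrated models address.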
Today’s integrated models like GPT-4o take the technology paradigm one step further and can process audio input and generate audio output directly within a single neural network. This reduces latency and enables better capture of contextual information like tone, emotional expression, background noises, and multiple speakers.
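By contrast, here is a sketch of the integrated approach, in which a single audio-native model consumes speech and returns speech with no intermediate transcript hop. It assumes API access to such a model; the gpt-4o-audio-preview model name reflects one such offering and may differ from what you have access to.

```python
# Sketch of the integrated (speech-in, speech-out) approach: one model
# handles the whole turn, preserving tone and other acoustic context.
# Assumes access to an audio-native model; the model name may vary.
import base64
from openai import OpenAI

client = OpenAI()

def speech_to_speech(audio_path: str, reply_path: str = "reply.wav") -> None:
    with open(audio_path, "rb") as f:
        audio_b64 = base64.b64encode(f.read()).decode()

    completion = client.chat.completions.create(
        model="gpt-4o-audio-preview",
        modalities=["text", "audio"],               # request a spoken reply
        audio={"voice": "alloy", "format": "wav"},  # voice/format of the reply
        messages=[{
            "role": "user",
            "content": [{
                "type": "input_audio",
                "input_audio": {"data": audio_b64, "format": "wav"},
            }],
        }],
    )

    # The spoken reply comes back base64-encoded on the message.
    with open(reply_path, "wb") as out:
        out.write(base64.b64decode(completion.choices[0].message.audio.data))
```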
OpenAI evaluated ASR performance, comparing GPT-4o against Whisper v3, and found a lower Word Error Rate (WER) for GPT-4o across several languages.
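Word Error Rate is the standard ASR metric behind that comparison: the number of word substitutions, deletions, and insertions divided by the number of words in the reference transcript. Here is a small sketch using the open-source jiwer package; the sample strings are invented for illustration.

```python
# Compute Word Error Rate (WER) between a reference transcript and an
# ASR hypothesis: (substitutions + deletions + insertions) / reference words.
import jiwer  # pip install jiwer

reference = "book a doctor appointment for tuesday morning"
hypothesis = "book a doctors appointment on tuesday morning"

print(f"WER: {jiwer.wer(reference, hypothesis):.2%}")  # lower is better
```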
Applications will be built using both approaches, each with advantages depending on the end objective, as newer methods continue to emerge. While there is still a way to go on multilingual capabilities and overall quality, we’ve reached an inflection point in the technology, unlocking an incredible set of previously impractical applications.
Figure: Model evaluation comparing Whisper v3 and GPT-4o (16-shot). Source: OpenAI
Can Single-Modality Voice AI Applications Exist in a Multimodal World?
I believe that single-modality Voice AI Agents are a stepping stone to seamlessly integrated multimodal agents. The tech stack for voice will be an essential part of multimodal applications that integrate visual and other contextual information to provide new experiences.
That said, single-modality voice applications will exist independently for applications that rely primarily on verbal interaction, such as phone calls to schedule doctor appointments. These applications will create huge efficiencies within the enterprise and highly engaging, personalized consumer experiences.
Voice AI Applications: Enterprise and Consumer
Enterprise applications
Voice AI Agents will power dozens of enterprise applications. I've highlighted three key enterprise functions that will be supercharged with Voice AI Agents. Start-ups will build these agents using different technical approaches, each with unique advantages, as highlighted on the y-axis in the image below.
Note: Image adapted from the work of Domenic Donato (ex-DeepMind research engineer and CTO of AssemblyAI, building something new).
Customer Service and Support: The most obvious function to benefit from Voice AI Agents is customer service and support. Enterprises spend a fortune on human-staffed call centers to respond to customer inquiries, troubleshoot, and solve basic problems. Voice AI Agents are perfectly suited to utilize existing company guidelines, policies, and call databases to handle these queries. Scheduling and appointment management are other clear-cut human tasks that Voice AI Agents can replace and scale.
Sales and Lead generation: Voice AI Agents will play a major role in augmenting sales teams. Jobs to be done include conducting initial sales calls, qualifying leads, and scheduling appointments and follow-ups. Deploying Voice AI Agents in sales can be trickier than in customer service and support because of nuances in sales conversations and a lower margin for error. Thus, live deployments in this function will likely be limited to top-of-funnel activities for now, as the technology’s performance continues to improve.
Human Resource Management: Human Resource Management, encompassing recruiting, onboarding, and learning and development, is a high-potential yet often overlooked area. Voice AI agents will enhance top-of-funnel candidate screening, offering a more natural and fluid experience than current video interviewing tools. Onboarding is another function that can be scaled and made more engaging using interactive Voice AI Agents.
Companies building Voice AI Agents can adopt either a function-specific or an industry-specific approach. While some verticals, such as healthcare, e-commerce, and financial services, benefit from domain-specific knowledge, function-specific agents will be deployed horizontally across sectors. Ultimately, the performance of these agents will depend on the depth of their integrations and their ability to either build, or work within, the existing ecosystem of other generative AI-powered products, such as chatbots, knowledge-based search, and post-call summaries.
Consumer applications
There’s no doubt Voice AI Agents will deliver value to enterprises. But are we ready to engage with them as therapists, coaches, tutors, and personal assistants? While Alexa may have fallen short, imagine if Voice AI Agents met our expectations. Though still early, I’ve outlined two live use cases being tested by companies. Let your imagination flow as you explore them!
AI Therapists: There are clear signs of latent demand in existing generative AI products. ChatGPT, Replika, and Character AI are already fielding mental health queries. Voice AI therapists trained on clinically relevant data have the potential to fill the widening supply-demand gap in mental healthcare.
Here’s a scenario: Imagine interacting with a Voice AI-powered therapist — let’s call it “Elysium” — trained on thousands of clinically relevant conversations between a therapist and a patient. During a particularly difficult time in his life, John confided in Elysium about his anxieties. Elysium listened empathetically and used practical, personalized strategies to manage his stress. John found a new sense of peace and self-discovery as he navigated life’s curveballs. (Credit: You.com assisted me in brainstorming this scenario)
AI Coaches: Like therapists, coaches are in high demand but face barriers to access. Voice AI coaches can democratize access, benefiting employees at all levels, not just executive talent.
Imagine a scenario where a Voice-AI-powered coach, “Emma,” is helping Susan prepare for a keynote presentation on demo day. As Susan rehearses her presentation, Emma listens and provides constructive feedback on where she’s excelled and areas that need improvement. Susan continues to practice with Emma and delivers a resoundingly engaging and inspiring presentation as a result. (Credit: You.com assisted me in brainstorming this scenario)
We’re only getting started
Multimodal AI Agents are the future of AI. The tech stack for voice will remain an essential part of a multimodal future, and single-modality Voice AI agents will thrive independently in tasks primarily relying on verbal interactions.
With rapid improvements in the Voice AI stack and the ongoing evolution of the technology, we’re only getting started.
If you are building in this space or have thoughts to share, email me at sanjana@radical.vc or DM me at https://x.com/SanjanaBasu14.
Thumbnail image generated by DALL·E.