Why the Next Video You Watch Might Be Voiced By AI

You might not be able to tell the difference

  • Creators are using AI-powered software to voice characters in videos and podcasts. 
  • Experts say the new software is so realistic that it can be hard to distinguish from human speech. 
  • A filmmaker is creating movies voiced entirely with AI software rather than human actors.

A growing number of programs powered by artificial intelligence (AI) are imitating human voices in everything from podcasts to videos, and experts say the software is remarkably lifelike. 

One amateur filmmaker, Fabian Stelzer, uses AI voice-generation tools to produce the speech for a series of computer-generated animated films that he releases on Twitter.

"AI through deep learning is now proven to be the most accurate approach for speech-to-text and text-to-speech (i.e., talking and hearing)," Todd Mozer, the CEO of the computer communications company Sensory, told Lifewire in an email interview. "It's useful because it works. For speech output, AI can create new voices and new faces to lip sync accurately, or it can replicate known persons." 

Smart Voices

Stelzer writes on Twitter that his film SALT "is a fully AI-generated film verse, where community choices drive a multi-plot story." The plot is hard to discern, but clips on his Twitter account suggest it has science-fiction elements.

Creators like Stelzer have many AI voice software options to choose from. Micmonster, for example, provides an online library of voiceovers. The company offers more than 500 voices in more than 129 languages, with adjustable rate and pitch. With Micmonster and similar programs, all you have to do is type the words you want the characters to speak.

Mozer said that the surge in interest in AI-generated voices is being driven by the increasing use of subtitles in social media videos. "Having the ability to automatically turn spoken words into text is a huge time and cost savings for podcasts and videos where captioning is highly popular," Mozer said. "Using text-to-speech systems can also be deployed in podcasts and videos to convey information more rapidly and with better articulation and no errors or 'Ums.'"

Mozer said that our brains can process speech much faster than most people speak, so we often get bored and distracted. "Having the ability to easily speed up speech, control articulation and pitch, and quickly edit any problems rather than re-recording is a huge advantage," he said.

AI can also generate the content itself, though Mozer cautioned that the technology is immature. "Content can be created by AI as well, but that is still in its infancy and prone to factual inaccuracies," he said. "I don't expect any Academy Awards for AI-produced media for at least four years."

Bob Rogers, the former chief data scientist at Intel and the current CEO of the data science company Oii.ai, told Lifewire via email that AI transformer language models, such as GPT-3, have distilled almost all written communication, in many languages, into a framework that can create naturally flowing communication.

"Often, this communication even makes sense, although they do not currently have much reasoning built into them," he added. "Start a sentence like 'The dog barked when...' and a language model will cheerfully fill in '... when its master comes home,' or '... at the sound of an approaching vehicle' depending on the surrounding context of the question."

The Future of Voice

Tech investor Brian David Crane predicts that deepfakes will become more common as original voices are run through repeated AI iterations to produce similar voice samples, or are mixed to develop unique voices. “With NLP (natural language processing) and AI, voice cloning will be used to improve the personalization of services, even in the podcasting medium,” he added.

Beware of fake voices that seem real, though, Rogers said. He added that it is already possible to type a script and then simulate an individual’s voice, inflection, and video image to deliver that script. 

“Obviously, this means we will need to continue to develop tools to detect synthetic video and audio to keep pace with the technology,” he said.
