Naver’s AI Speech, Less Robotic and More Human
The technology uses only 40 minutes of data to generate natural-sounding, expressive speech

By Jenny Lee WIRED Korea

It is 8:20 a.m. on a congested South Korean subway train. The train car is jam-packed with hundreds of commuters – while many of them are squeezed uncomfortably together with no privacy whatsoever, some who are ready to leave the train are jockeying for position toward the doors.

As the train stops at a station, a monotonous and somewhat quirky voice blares through a loudspeaker: “This stop is Express Bus Terminal. Please stay away from the door for safety.”

That voice – which sounds somewhat human but is in fact a computer-generated simulation – is an example of speech synthesis technology, otherwise known as text-to-speech (TTS), and its use is ubiquitous; synthetic voices can be heard in announcements on public transport and during calls with customer service, and they are also integrated into navigation systems, smartphones, tablets and many other devices.

Not only does speech synthesis offer convenience by reducing the need for human intervention, but it also aids those without the ability to speak, like the late physicist Stephen Hawking, who relied on a computerized voice system to communicate from the 1980s until his death in 2018.

These benefits notwithstanding, many synthetic voices are not really easy on the ear; although they are no longer the plain cacophony of mingled sounds they used to be, there is still a gulf between the most natural-sounding voices – Google’s Assistant, Apple’s Siri and Amazon’s Alexa – and human speech.

South Korea’s largest web portal operator Naver is about to change that, however, with a new technology leveraging deep learning, an artificial intelligence (AI) technique that uses data to “train” neural networks, computer systems modeled after the human brain.

Dubbed natural end-to-end speech synthesis (NES), the technology can produce much more realistic-sounding, expressive speech by training on only a small amount of data – no more than 40 minutes of audio samples, which corresponds to about 400 sentences.

“We have a voice model, which we created with a large amount of existing voice data, and we have sort of coated it, I should say, with 40 minutes of audio recording from a speaker,” said Yi Bong-jun, a developer of Clova Voice NES. “By having huge datasets as a base, our technology can learn to synthesize a realistic-sounding voice – in the style of the speaker – with a smaller (new) dataset.”

Devised by a team of five, this end-to-end technology, which generates speech directly from raw text, greatly simplifies traditional TTS conversion methods. Among these methods is one with an arcane name, the concatenative approach, which combines fragments of pre-recorded audio to generate new speech. Another is the parametric approach, which extracts linguistic features from the written text – including phonemes and durations – and converts them into speech signals using what is called a vocoder, short for voice encoder.

Because these traditional speech synthesis processes involved several steps, a huge amount of resources and labor – 40 to 100 hours of actual voice recordings – was required to produce speech that sounded intelligible yet jarring.
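To put the figures quoted so far side by side, the short calculation below uses only the numbers reported in this article (40 minutes, about 400 sentences, and 40 to 100 hours). The variable names are illustrative, and the result is a rough back-of-the-envelope comparison rather than anything Naver has published.

```python
# Figures as reported in the article; variable names are illustrative.
NES_MINUTES = 40               # audio needed by NES
NES_SENTENCES = 400            # approximate sentence count for that audio
TRADITIONAL_HOURS = (40, 100)  # recording range cited for traditional TTS

# Average length of one recorded sentence, in seconds.
seconds_per_sentence = NES_MINUTES * 60 / NES_SENTENCES

# How many times more audio the traditional pipelines needed.
reduction = tuple(hours * 60 / NES_MINUTES for hours in TRADITIONAL_HOURS)

print(seconds_per_sentence)  # 6.0
print(reduction)             # (60.0, 150.0)
```

In other words, each recorded sentence averages about six seconds, and the traditional pipelines needed roughly 60 to 150 times as much audio.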

“Traditional TTS systems, which were not really end to end, consisted of a multitude of different modules; they first had to convert text to phoneme and then move that on to subsequent stages,” Yi said. “But now the end-to-end model, a single neural network that integrates linguistic and acoustic modules, is all it takes to produce speech waveforms directly from given texts.”

The backbone of NES is the recurrent neural network (RNN)-based sequence-to-sequence learning framework, in which two RNNs – an encoder and a decoder – work together to convert an input sequence into a corresponding output sequence. This framework has been frequently used in machine translation, image caption generation and speech recognition tasks.

Just as humans generally draw on previous knowledge to make reasonable and informed decisions, speech synthesis systems have to rely on not just the current input but also all the inputs before it to process such “sequence data,” a stream of interdependent data. With the ability to retain some memory of what happened earlier in the sequence, RNNs pave the way for the system to gain context. This is further facilitated by a special attention mechanism, which lets the decoder focus on a specific range of the input data.
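For readers curious how a decoder can "focus" on part of its input, here is a minimal sketch of one common form of attention, dot-product attention, written in plain Python. It is a generic textbook illustration and not Naver's implementation; the function names `softmax` and `attend` and the toy vectors are assumptions made for this example.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(decoder_state, encoder_states):
    """Weight each encoder state by its similarity to the decoder state."""
    # Dot product as a similarity score between decoder and each input position.
    scores = [sum(d * e for d, e in zip(decoder_state, enc))
              for enc in encoder_states]
    weights = softmax(scores)
    # Context vector: weighted average of the encoder states.
    dim = len(encoder_states[0])
    context = [sum(w * enc[i] for w, enc in zip(weights, encoder_states))
               for i in range(dim)]
    return weights, context

# Toy example: three input positions, two-dimensional states (hypothetical values).
encoder_states = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
weights, context = attend([0.0, 4.0], encoder_states)
print(weights)
print(context)
```

In this toy case the decoder state lines up with the second input position, so that position receives most of the weight and dominates the context vector handed to the decoder.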

“In RNN, the output from the previous step is fed as input along with the original input to the current step,” said Lee Soo-young, director of the Institute for Artificial Intelligence at the Korea Advanced Institute of Science and Technology (KAIST) in Daejeon, South Korea. “For example, in order to predict tomorrow's stock price, tomorrow's economic indicators as well as today's stock price are taken into account. RNNs are used in speech synthesis because the speech output of the next time step is not only affected by the next input of phonemes or characters but also the output from the current step.”
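The recurrence Lee describes can be sketched in a few lines of plain Python. This is a generic textbook formulation in which the state carried forward from the previous step is combined with the current input; it is not the NES model, and the names `rnn_step` and `run_rnn` as well as the tiny weights are assumptions chosen for illustration.

```python
import math

def rnn_step(x, h_prev, w_in, w_rec, bias):
    """One recurrent step: mix the current input with the previous state."""
    return [
        math.tanh(
            sum(w_in[i][j] * x[j] for j in range(len(x)))
            + sum(w_rec[i][k] * h_prev[k] for k in range(len(h_prev)))
            + bias[i]
        )
        for i in range(len(bias))
    ]

def run_rnn(inputs, w_in, w_rec, bias):
    """Process a sequence, carrying the state from one step to the next."""
    h = [0.0] * len(bias)  # start with an empty memory
    states = []
    for x in inputs:
        h = rnn_step(x, h, w_in, w_rec, bias)
        states.append(h)
    return states

# The same input appears twice (hypothetical one-dimensional example).
inputs = [[1.0], [1.0]]
with_memory = run_rnn(inputs, w_in=[[0.5]], w_rec=[[0.5]], bias=[0.0])
without_memory = run_rnn(inputs, w_in=[[0.5]], w_rec=[[0.0]], bias=[0.0])
print(with_memory)
print(without_memory)
```

With the recurrent weight set to zero, identical inputs produce identical outputs; with it switched on, the second output differs from the first because it also reflects what came before, which is the "memory" the article refers to.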

With NES, Naver hoped to improve the naturalness and intelligibility of synthesized speech, an aim which it seems to have achieved.

“Naver’s AI-powered technology is outstanding, producing some high-quality synthesized audios,” the director said. “It’s also outstanding in that it can create a number of different voices (expressing emotions) as opposed to existing TTS systems.”

Along with a neutral voice, two other speaking styles – happy and sad – can be generated, samples of which are currently available on Naver’s Clova Voice website. With the increased flexibility provided by NES, Yi said the speaking style of synthesized speech can easily be varied by training the model with new data.

But because the quality of the training data dictates the quality of the synthesized speech, this data is for now being recorded primarily by professional voice actors in high-quality studio conditions.

“NES is certainly not without its limits,” Yi said. “There are challenges when using voice data recorded by, say, a child whose pronunciation is not very clear. But we’ll continue to make improvements and build speeches from many kinds of voice data, optimizing for any specific use.”

About 20 speaking styles – ranging from conversational and newscaster styles to styles peppered with regional dialects – will be made available when the open-to-the-public Clova Dubbing service is launched later this month. English, Chinese, Japanese and other language options, in addition to Korean, are in the works. Clova Premium Voice, an NES-based speech synthesis service for businesses, is already on sale.

Kevin Lee, who is in charge of marketing for Naver’s Clova Voice, said that once rolled out, the company’s speech synthesis technology will eventually make its way into banks, customer service centers and many other businesses where synthesized voices are needed. It can also be used in making videos or audiobooks.

“In the foreseeable future, AI-based speech synthesis systems will deliver experiences that are more tailored to each user,” Director Lee said. “People will be able to synthesize their own voice, their friend’s voice or even the voice of their deceased parent. If an audiobook is told in the voice of parents, listening to it will be more enjoyable to their children.”

As with any new technology, concerns are being raised about the prospect of this speech synthesis technology displacing workers with voice-over or sound-recording jobs.

“If this technology gets to the point where it can incorporate breath sounds and use a variety of tones for narration of text longer than short phrases,” said Lee Seung-heon, a voice trainer in Seodaemun-gu, “it will certainly threaten those jobs.”
