For many, vocal communication is easy and effortless. However, that is not the case for those with certain neurological conditions that affect speech. A research group from the University of California, San Francisco has published a “proof of principle” technology that may achieve the goal of generating synthetic speech directly from the brain.

The goal of his lab, notes Edward Chang, MD, professor of neurological surgery at the University of California, San Francisco and senior author on the paper, is to “create technologies to restore communication to patients with severe speech disabilities such as stroke or paralysis or injuries to the vocal tract.”

To do that, the researchers developed a method to decode brain activity into a useful output, in this case speech. The work is published today in Nature in the paper, “Speech synthesis from neural decoding of spoken sentences.”

“This is really impressive work,” says A. Bolu Ajiboye, PhD, assistant professor in the department of biomedical engineering at Case Western Reserve University, noting that this is the first time speech synthesis has been done in this way.

“This is an elegant demonstration of an exciting technological goal—direct neural control of speech utterances for communication,” notes Bijan Pesaran, PhD, faculty at the New York University Center for Neural Science. The novelty of this study, he adds, “is in how it uses new deep learning tools in a way that flexibly constrains the learned representations that transform brain activity into sounds.”

The team worked with five patients who were already undergoing intracranial monitoring, in which electrodes measure brain activity as part of a treatment for epilepsy. The recording method, known as electrocorticography (ECoG), is standard in epilepsy surgery. Here, the researchers used the technology to map the areas of the brain that control the movements of speech.

In the first step of their two-step approach, the researchers recorded cortical activity from the brains of the participants as they read several hundred sentences aloud. The signals they captured reflect the neural activity that controls the movements used to create sound, not the sound itself; the participants were asked simply to read, not to make any specific mouth movements. From these recordings, the authors designed a system that decoded the brain signals responsible for the individual movements of the vocal tract: the lips, tongue, jaw, and larynx. In the second step, they synthesized speech from the decoded movements.
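In broad strokes, this is a two-stage decoder: one learned mapping from neural recordings to articulatory kinematics, and a second from kinematics to acoustic features that can be rendered as audio. The sketch below is a hypothetical, simplified PyTorch version with invented layer sizes and feature counts (256 electrodes, 33 kinematic traces, 32 acoustic features); it illustrates the two-stage structure described above but is not the authors’ code.

```python
import torch
import torch.nn as nn

class NeuralToKinematics(nn.Module):
    """Stage 1 (sketch): map ECoG feature sequences to vocal-tract kinematics."""
    def __init__(self, n_electrodes=256, n_kinematics=33, hidden=128):
        super().__init__()
        self.rnn = nn.LSTM(n_electrodes, hidden, batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * hidden, n_kinematics)

    def forward(self, ecog):               # ecog: (batch, time, n_electrodes)
        h, _ = self.rnn(ecog)
        return self.out(h)                 # (batch, time, n_kinematics)

class KinematicsToAcoustics(nn.Module):
    """Stage 2 (sketch): map decoded kinematics to acoustic features
    that a vocoder could turn into an audio waveform."""
    def __init__(self, n_kinematics=33, n_acoustic=32, hidden=128):
        super().__init__()
        self.rnn = nn.LSTM(n_kinematics, hidden, batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * hidden, n_acoustic)

    def forward(self, kin):
        h, _ = self.rnn(kin)
        return self.out(h)

# Toy forward pass: one second of simulated ECoG at 200 Hz from 256 electrodes.
ecog = torch.randn(1, 200, 256)
kinematics = NeuralToKinematics()(ecog)
acoustics = KinematicsToAcoustics()(kinematics)
print(acoustics.shape)  # torch.Size([1, 200, 32])
```

The point of the intermediate kinematic stage is that it acts as a biologically motivated bottleneck: rather than mapping brain activity straight to audio, the decoder first estimates the vocal-tract movements being commanded.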

“This work brings together ideas from speech motor control and neuroscience to validate a ‘biomimetic’ approach to neural decoding of speech,” notes Jon Brumberg, PhD, assistant professor in the department of speech-language-hearing with a courtesy appointment in electrical engineering and computer science at The University of Kansas. He adds, “The really nice thing about this work is that it uses modern decoding techniques that support our ideas about how speech motor control is represented in the brain, and it was nice to see those ideas supported in the study results.”

Indeed, third-party listeners could readily identify and transcribe the synthesized speech. Listeners transcribed each sentence by selecting from a fixed pool of candidate words. On average, their transcriptions had a 31% or 53% word error rate, depending on the size of the word pool (25 or 50 words, respectively). Interestingly, many of the mistaken words were similar in meaning to the originals, so even at the higher error rates the meaning of the sentence remained intact and understood.

One example: the original sentence “Mum strongly dislikes appetizers” was transcribed as “Mom often dislikes appetizers.” Despite the roughly 50% word error rate, the meaning of the sentence is easily understood.
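For context on how such word error rates are typically scored (the paper’s exact evaluation pipeline may differ), a minimal Python sketch aligns the reference and transcribed words by edit distance and counts substitutions, insertions, and deletions; for the example above it returns 0.5, i.e., 50%.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance divided by reference length."""
    ref, hyp = reference.lower().split(), hypothesis.lower().split()
    # dp[i][j] = edit distance between the first i reference words
    # and the first j hypothesis words
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            substitution = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dp[i][j] = min(substitution, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return dp[len(ref)][len(hyp)] / len(ref)

# Two of four words differ ("mum" -> "mom", "strongly" -> "often"): WER = 0.5
print(word_error_rate("Mum strongly dislikes appetizers",
                      "Mom often dislikes appetizers"))
```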

The Chang lab has a prolific history of decoding phonemes, the elementary units of sound that distinguish one word from another, notes Ajiboye. Brain recording arrays can decode these phonemes with high fidelity. But, he adds, that is very different from decoding whole sentences, which have flow, intonation, and pitch. To do that, the team had to devise a way to decode the kinematics of speech and translate those movements into an acoustic model.

The fact that the model was trained on a fairly limited set of sentences, yet could extrapolate from those words to new sentences, speaks to the generalizability of the system, notes Ajiboye.

In separate tests, a participant spoke sentences and then mimed them, making the same articulatory movements without producing sound. Although synthesis performance for mimed speech was inferior to that for audible speech, the decoder could still synthesize speech that was never audibly spoken.

Many patients with neurological conditions that result in the loss of speech use communication devices with brain–computer interfaces to spell out words letter by letter, but the process is slow, producing roughly 10 words per minute. Chang’s technology, by contrast, works at the rate of natural speech, roughly 120–150 words per minute.

“This work’s focus on the motor-based methods may be a more natural approach for some individuals who lose their speech due to neurological disorders since it can take advantage of their past experience with speaking prior to the onset of their disease or injury,” notes Brumberg.

The work is exciting, but it is not yet the holy grail of a direct thought-to-speech brain interface. Chang notes that their end goal is to “reproduce speech directly from brain activity.” To achieve that, the kinematic stage would have to be removed from the model.

Pesaran is excited about how new electrode technologies could push this type of research forward. He tells GEN that this work “used relatively coarse electrode arrays” and that “FDA approval for far more sophisticated devices is on the horizon.” Higher-resolution electrode arrays could yield more naturalistic speech with larger vocabularies and open the approach to a wider range of patients with other neurological, and even psychiatric, disorders.