MIT News | Massachusetts Institute of Technology
Computer system transcribes words users “speak silently”

Arnav Kapur, a researcher in the Fluid Interfaces group at the MIT Media Lab, demonstrates the AlterEgo project.

MIT researchers have developed a computer interface that can transcribe words that the user verbalizes internally but does not actually speak aloud.

The system consists of a wearable device and an associated computing system. Electrodes in the device pick up neuromuscular signals in the jaw and face that are triggered by internal verbalizations — saying words “in your head” — but are undetectable to the human eye. The signals are fed to a machine-learning system that has been trained to correlate particular signals with particular words.

The device also includes a pair of bone-conduction headphones, which transmit vibrations through the bones of the face to the inner ear. Because they don’t obstruct the ear canal, the headphones enable the system to convey information to the user without interrupting conversation or otherwise interfering with the user’s auditory experience.

The device is thus part of a complete silent-computing system that lets the user undetectably pose and receive answers to difficult computational problems. In one of the researchers’ experiments, for instance, subjects used the system to silently report opponents’ moves in a chess game and just as silently receive computer-recommended responses.

“The motivation for this was to build an IA device — an intelligence-augmentation device,” says Arnav Kapur, a graduate student at the MIT Media Lab, who led the development of the new system. “Our idea was: Could we have a computing platform that’s more internal, that melds human and machine in some ways and that feels like an internal extension of our own cognition?”

“We basically can’t live without our cellphones, our digital devices,” says Pattie Maes, a professor of media arts and sciences and Kapur’s thesis advisor. “But at the moment, the use of those devices is very disruptive. If I want to look something up that’s relevant to a conversation I’m having, I have to find my phone and type in the passcode and open an app and type in some search keyword, and the whole thing requires that I completely shift attention from my environment and the people that I’m with to the phone itself. So, my students and I have for a very long time been experimenting with new form factors and new types of experience that enable people to still benefit from all the wonderful knowledge and services that these devices give us, but do it in a way that lets them remain in the present.”

The researchers describe their device in a paper they presented at the Association for Computing Machinery's Intelligent User Interfaces (IUI) conference. Kapur is first author on the paper, Maes is the senior author, and they're joined by Shreyas Kapur, an undergraduate majoring in electrical engineering and computer science.

Subtle signals

The idea that internal verbalizations have physical correlates has been around since the 19th century, and it was seriously investigated in the 1950s. One of the goals of the speed-reading movement of the 1960s was to eliminate internal verbalization, or “subvocalization,” as it’s known.

But subvocalization as a computer interface is largely unexplored. The researchers’ first step was to determine which locations on the face are the sources of the most reliable neuromuscular signals. So they conducted experiments in which the same subjects were asked to subvocalize the same series of words four times, with an array of 16 electrodes at different facial locations each time.

The researchers wrote code to analyze the resulting data and found that signals from seven particular electrode locations were consistently able to distinguish subvocalized words. In the conference paper, the researchers report a prototype of a wearable silent-speech interface, which wraps around the back of the neck like a telephone headset and has tentacle-like curved appendages that touch the face at seven locations on either side of the mouth and along the jaws.

But in current experiments, the researchers are getting comparable results using only four electrodes along one jaw, which should lead to a less obtrusive wearable device.
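
To make this selection step concrete, here is a minimal sketch, assuming pre-extracted per-trial feature arrays and scikit-learn, of how electrode locations could be ranked by how reliably each one's signal distinguishes subvocalized words. The researchers' actual analysis code is not published, so the function name, data shapes, and classifier choice below are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def rank_electrodes(features: np.ndarray, labels: np.ndarray) -> list:
    """Rank electrode locations by how well each one alone separates the words.

    features: (n_trials, n_electrodes, n_features_per_electrode)
    labels:   (n_trials,) integer word index for each subvocalized trial
    """
    scores = []
    for e in range(features.shape[1]):
        clf = LogisticRegression(max_iter=1000)
        # Score this electrode on its own with 4-fold cross-validation,
        # mirroring the four repetitions of each word in the experiment.
        acc = cross_val_score(clf, features[:, e, :], labels, cv=4).mean()
        scores.append((acc, e))
    # Highest-accuracy electrode locations first
    return [e for _, e in sorted(scores, reverse=True)]
```

Ranking channels one at a time is only a heuristic; combinations of electrodes would still need to be evaluated on real recordings, which is how a reduction from seven locations to four could be justified.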

Once they had selected the electrode locations, the researchers began collecting data on a few computational tasks with limited vocabularies — about 20 words each. One was arithmetic, in which the user would subvocalize large addition or multiplication problems; another was the chess application, in which the user would report moves using the standard chess numbering system.

Then, for each application, they used a neural network to find correlations between particular neuromuscular signals and particular words. Like most neural networks, the one the researchers used is arranged into layers of simple processing nodes, each of which is connected to several nodes in the layers above and below. Data are fed into the bottom layer, whose nodes process them and pass them to the next layer, whose nodes process them and pass them on, and so on. The output of the final layer is the result of a classification task.
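
As a rough illustration of the layered classifier described above, the sketch below maps a feature vector derived from the neuromuscular signals to one of roughly 20 words. The framework (PyTorch), the layer sizes, and the feature length are assumptions made for the example; the article does not specify the researchers' architecture.

```python
import torch
import torch.nn as nn

N_FEATURES = 128   # assumed length of the per-utterance feature vector
N_WORDS = 20       # each application used a vocabulary of about 20 words

model = nn.Sequential(
    nn.Linear(N_FEATURES, 64),  # bottom layer: features go in here
    nn.ReLU(),
    nn.Linear(64, 32),          # hidden layer passes its outputs upward
    nn.ReLU(),
    nn.Linear(32, N_WORDS),     # final layer: one score per word
)

# The index of the largest output is the predicted word.
signals = torch.randn(1, N_FEATURES)          # placeholder input batch
predicted_word = model(signals).argmax(dim=1)
```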

The basic configuration of the researchers’ system includes a neural network trained to identify subvocalized words from neuromuscular signals, but it can be customized to a particular user through a process that retrains just the last two layers.
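
One common way to implement that kind of per-user customization is to freeze a pretrained network and fine-tune only its final layers on a few minutes of the new user's data. The sketch below shows this pattern on the toy model from the previous example; it illustrates the general technique rather than the researchers' actual code, and every name in it is hypothetical.

```python
import torch
import torch.nn as nn

# Stand-in for the layered classifier sketched earlier; in practice it
# would be loaded with weights pretrained on recordings from many users.
model = nn.Sequential(
    nn.Linear(128, 64), nn.ReLU(),
    nn.Linear(64, 32), nn.ReLU(),
    nn.Linear(32, 20),
)

# Freeze every parameter, then unfreeze only the last two linear layers.
for param in model.parameters():
    param.requires_grad = False
for layer in [m for m in model if isinstance(m, nn.Linear)][-2:]:
    for param in layer.parameters():
        param.requires_grad = True

optimizer = torch.optim.Adam(
    (p for p in model.parameters() if p.requires_grad), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

def calibrate(user_signals, user_labels, epochs=10):
    """Fine-tune the unfrozen layers on one user's subvocalization data."""
    for _ in range(epochs):
        optimizer.zero_grad()
        loss_fn(model(user_signals), user_labels).backward()
        optimizer.step()
```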

Practical matters

Using the prototype wearable interface, the researchers conducted a usability study in which 10 subjects spent about 15 minutes each customizing the arithmetic application to their own neurophysiology, then spent another 90 minutes using it to execute computations. In that study, the system had an average transcription accuracy of about 92 percent.

But, Kapur says, the system’s performance should improve with more training data, which could be collected during its ordinary use. Although he hasn’t crunched the numbers, he estimates that the better-trained system he uses for demonstrations has an accuracy rate higher than that reported in the usability study.

In ongoing work, the researchers are collecting a wealth of data on more elaborate conversations, in the hope of building applications with much more expansive vocabularies. “We’re in the middle of collecting data, and the results look nice,” Kapur says. “I think we’ll achieve full conversation some day.”

“I think that they’re a little underselling what I think is a real potential for the work,” says Thad Starner, a professor in Georgia Tech’s College of Computing. “Like, say, controlling the airplanes on the tarmac at Hartsfield Airport here in Atlanta. You’ve got jet noise all around you, you’re wearing these big ear-protection things — wouldn’t it be great to communicate with voice in an environment where you normally wouldn’t be able to? You can imagine all these situations where you have a high-noise environment, like the flight deck of an aircraft carrier, or even places with a lot of machinery, like a power plant or a printing press. This is a system that would make sense, especially because oftentimes in these types of situations people are already wearing protective gear. For instance, if you’re a fighter pilot, or if you’re a firefighter, you’re already wearing these masks.”

“The other thing where this is extremely useful is special ops,” Starner adds. “There’s a lot of places where it’s not a noisy environment but a silent environment. A lot of time, special-ops folks have hand gestures, but you can’t always see those. Wouldn’t it be great to have silent-speech for communication between these folks? The last one is people who have disabilities where they can’t vocalize normally. For example, Roger Ebert did not have the ability to speak anymore because he lost his jaw to cancer. Could he do this sort of silent speech and then have a synthesizer that would speak the words?”


Press Mentions

Smithsonian Magazine

Smithsonian reporter Emily Matchar spotlights AlterEgo, a device developed by MIT researchers to help people with speech pathologies communicate. “A lot of people with all sorts of speech pathologies are deprived of the ability to communicate with other people,” says graduate student Arnav Kapur. “This could restore the ability to speak for people who can’t.”

WCVB-TV’s Mike Wankum visits the Media Lab to learn more about a new wearable device that allows users to communicate with a computer without speaking by measuring tiny electrical impulses sent by the brain to the jaw and face. Graduate student Arnav Kapur explains that the device is aimed at exploring, “how do we marry AI and human intelligence in a way that’s symbiotic.”

Fast Company

Fast Company reporter Eillie Anzilotti highlights how MIT researchers have developed an AI-enabled headset device that can translate silent thoughts into speech. Anzilotti explains that one of the factors that is motivating graduate student Arnav Kapur to develop the device is “to return control and ease of verbal communication to people who struggle with it.”

Quartz reporter Anne Quito spotlights how graduate student Arnav Kapur has developed a wearable device that allows users to access the internet without speech or text and could help people who have lost the ability to speak vocalize their thoughts. Kapur explains that the device is aimed at augmenting ability.

Axios reporter Ina Fried spotlights how graduate student Arnav Kapur has developed a system that can detect speech signals. “The technology could allow those who have lost the ability to speak to regain a voice while also opening up possibilities of new interfaces for general purpose computing,” Fried explains.

After several years of experimentation, graduate student Arnav Kapur developed AlterEgo, a device to interpret subvocalization that can be used to control digital applications. Describing the implications as “exciting,” Katharine Schwab at Co.Design writes, “The technology would enable a new way of thinking about how we interact with computers, one that doesn’t require a screen but that still preserves the privacy of our thoughts.”

The Guardian

AlterEgo, a device developed by Media Lab graduate student Arnav Kapur, “can transcribe words that wearers verbalise internally but do not say out loud, using electrodes attached to the skin,” writes Samuel Gibbs of The Guardian . “Kapur and team are currently working on collecting data to improve recognition and widen the number of words AlterEgo can detect.”

Popular Science

Researchers at the Media Lab have developed a device, known as “AlterEgo,” which allows an individual to discreetly query the internet and control devices by using a headset “where a handful of electrodes pick up the miniscule electrical signals generated by the subtle internal muscle motions that occur when you silently talk to yourself,” writes Rob Verger for Popular Science.

New Scientist

A new headset developed by graduate student Arnav Kapur reads the small muscle movements in the face that occur when the wearer thinks about speaking, and then uses “artificial intelligence algorithms to decipher their meaning,” writes Chelsea Whyte for New Scientist . Known as AlterEgo, the device “is directly linked to a program that can query Google and then speak the answers.”


Related Links

  • Paper: “AlterEgo: A personalized wearable silent speech interface”
  • Arnav Kapur
  • Pattie Maes
  • Fluid Interfaces group
  • School of Architecture and Planning

Related Topics

  • Assistive technology
  • Computer science and technology
  • Artificial intelligence


This Device Can Hear You Talking to Yourself

AlterEgo could help people with communication or memory problems by broadcasting internal monologues

Emily Matchar, Innovation Correspondent

He’s worked on a lunar rover, invented a 3D printable drone, and developed an audio technology to narrate the world for the visually impaired .

But 24-year-old Arnav Kapur’s newest invention can do something even more sci-fi: it can hear the voice inside your head.

Yes, it’s true. AlterEgo , Kapur’s new wearable device system, can detect what you’re saying when you’re talking to yourself, even if you’re completely silent and not moving your mouth.

The technology involves a system of sensors that detect the minuscule neuromuscular signals sent by the brain to the vocal cords and the muscles of the throat and tongue. These signals are sent out whenever we speak to ourselves silently, even if we make no sounds. The device feeds the signals through an A.I., which “reads” them and turns them into words. The user hears the A.I.’s responses through a device that conducts sound through the bones of the skull and ear, making them inaudible to others. Users can also respond out loud using artificial voice technology.

AlterEgo won the “Use it!” Lemelson-MIT Student Prize , awarded to technology-based inventions involving consumer devices. The award comes with a $15,000 cash prize.

“A lot of people with all sorts of speech pathologies are deprived of the ability to communicate with other people,” says Kapur, a PhD candidate at MIT. “This could restore the ability to speak for people who can’t.”


Kapur is currently testing the device on people with communication limitations through various hospitals and rehabilitation centers in the Boston area. These limitations could be caused by stroke, cerebral palsy or neurodegenerative diseases like ALS. In the case of ALS, the disease affects the nerves in the brain and spinal cord, progressively robbing people of their ability to use their muscles, including those that control speech. But their brains still send speech signals to the vocal cords and the 100-plus muscles involved in speaking. AlterEgo can capture those signals and turn them into speech. According to Kapur’s research , the system is about 92 percent accurate.

Kapur remembers testing the device with a man with late-stage ALS who hadn’t spoken in a decade. To communicate, he’d been using an eye-tracking device that allowed him to operate a keyboard with his gaze. The eye-tracking worked, but was time-consuming and laborious.

“The first time [AlterEgo] worked he said, ‘today has been a good, good day,’” Kapur recalls.

The device could also “extend our abilities and cognition in different ways,” Kapur says. Imagine, for example, making a grocery list in your head while you’re driving to the store. By the time you’re inside, you’ve no doubt forgotten a few of the items. But if you used AlterEgo to “speak” the list, it could record it and read back the items to you as you shopped. Now imagine you have dementia. AlterEgo could record your own instructions and give reminders at an appropriate time. Potential uses are nearly endless: you could use the system to talk to smart home devices like the Echo, make silent notes during meetings, send text messages without speaking or lifting a finger. AlterEgo could even one day act as a simultaneous interpreter for languages—you’d think your speech in English and the device would speak out loud in, say, Mandarin.

“In a way, it gives you perfect memory,” Kapur says. “You can talk to a smarter version of yourself inside yourself.”

“I think that they’re a little underselling what I think is a real potential for the work,” says Thad Starner, a professor in Georgia Tech’s College of Computing, speaking to MIT News .

The device, Starner says, could be useful in military operations, such as when special forces need to communicate silently. It could also help people who work in noisy environments, from fighter pilots to firefighters.

Kapur has applied for a patent for AlterEgo and plans to develop it into a commercial device. Right now he’s working on optimizing the hardware to process extremely high volumes of data with minimal delay, and on refining the A.I.

Kapur hopes AlterEgo can help people see A.I. not as a scary, evil force here to steal our identities and our jobs, but as a tool that can improve our everyday lives.

“Somewhere in the last 20 or 30 years we forgot that A.I. was meant to enable people,” he says.

Emily Matchar is a writer from North Carolina. She's contributed to many publications, including the New York Times, the Washington Post, and the Atlantic. She's the author of the novel In the Shadow of the Greenbrier.

AlterEgo is a non-invasive, wearable, peripheral neural interface that allows humans to converse in natural language with machines, artificial intelligence assistants, services, and other people without any voice—without opening their mouth, and without externally observable movements—simply by articulating words internally. Feedback to the user is given through audio, via bone conduction, without disrupting the user's usual auditory perception, which makes the interface closed-loop. This enables a human-computer interaction that is subjectively experienced as completely internal to the human user—like speaking to one's self.

A primary focus of this project is to help support communication for people with speech disorders, including conditions like ALS (amyotrophic lateral sclerosis) and MS (multiple sclerosis). Beyond that, the system has the potential to seamlessly integrate humans and computers—such that computing, the Internet, and AI would weave into our daily life as a "second self" and augment our cognition and abilities.

The wearable system captures peripheral neural signals when internal speech articulators are volitionally and neurologically activated during a user's internal articulation of words. This enables a user to transmit and receive streams of information to and from a computing device or any other person without any observable action, discreetly, without unplugging the user from her environment, and without invading the user's privacy.

AlterEgo: A Personalized Wearable Silent Speech Interface

A. Kapur, S. Kapur, and P. Maes, "AlterEgo: A Personalized Wearable Silent Speech Interface." 23rd International Conference on Intelligent User Interfaces (IUI 2018), pp 43-53, March 5, 2018.

Non-Invasive Silent Speech Recognition in Multiple Sclerosis with Dysphonia

A. Kapur, U. Sarawgi, E. Wadkins, M. Wu, N. Hollenstein, and P. Maes, "Non-Invasive Silent Speech Recognition in Multiple Sclerosis with Dysphonia." Proceedings of Machine Learning Research, vol. 116, 2020. http://proceedings.mlr.press/v116/kapur20a.html

Cornell Chronicle

AI-equipped eyeglasses can read silent speech

By Louis DiPietro, Cornell Ann S. Bowers College of Computing and Information Science

It may look like Ruidong Zhang is talking to himself, but in fact the doctoral student in the field of information science is silently mouthing the passcode to unlock his nearby smartphone and play the next song in his playlist.

It’s not telepathy: It’s the seemingly ordinary, off-the-shelf eyeglasses he’s wearing, called EchoSpeech – a silent-speech recognition interface that uses acoustic-sensing and artificial intelligence to continuously recognize up to 31 unvocalized commands, based on lip and mouth movements.

Ruidong Zhang, a doctoral student in the field of information science, wearing EchoSpeech glasses.

Developed by Cornell’s Smart Computer Interfaces for Future Interactions (SciFi) Lab , the low-power, wearable interface requires just a few minutes of user training data before it will recognize commands and can be run on a smartphone, researchers said.

Zhang is the lead author of “ EchoSpeech: Continuous Silent Speech Recognition on Minimally-obtrusive Eyewear Powered by Acoustic Sensing ,” which will be presented at the Association for Computing Machinery Conference on Human Factors in Computing Systems (CHI) this month in Hamburg, Germany.

“For people who cannot vocalize sound, this silent speech technology could be an excellent input for a voice synthesizer. It could give patients their voices back,” Zhang said of the technology’s potential use with further development.

In its present form, EchoSpeech could be used to communicate with others via smartphone in places where speech is inconvenient or inappropriate, like a noisy restaurant or quiet library. The silent speech interface can also be paired with a stylus and used with design software like CAD, all but eliminating the need for a keyboard and a mouse.

Outfitted with a pair of microphones and speakers smaller than pencil erasers, the EchoSpeech glasses become a wearable AI-powered sonar system, sending and receiving soundwaves across the face and sensing mouth movements. A deep learning algorithm, also developed by SciFi Lab researchers, then analyzes these echo profiles in real time, with about 95% accuracy.
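
As a hedged illustration of how such echo profiles might be classified, the sketch below runs a small one-dimensional convolutional network over a two-microphone echo profile and outputs one of 31 commands. The SciFi Lab's actual architecture, input length, and preprocessing are not described here, so those details are assumptions.

```python
import torch
import torch.nn as nn

N_COMMANDS = 31     # EchoSpeech recognizes up to 31 unvocalized commands
N_MICS = 2          # the glasses carry a pair of microphones
PROFILE_LEN = 256   # assumed number of samples per echo profile

model = nn.Sequential(
    nn.Conv1d(N_MICS, 16, kernel_size=7, padding=3),
    nn.ReLU(),
    nn.MaxPool1d(2),
    nn.Conv1d(16, 32, kernel_size=5, padding=2),
    nn.ReLU(),
    nn.AdaptiveAvgPool1d(1),   # collapse the time axis
    nn.Flatten(),
    nn.Linear(32, N_COMMANDS),
)

echo_profiles = torch.randn(1, N_MICS, PROFILE_LEN)  # placeholder batch
command = model(echo_profiles).argmax(dim=1)
```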

“We’re moving sonar onto the body,” said Cheng Zhang , assistant professor of information science in the Cornell Ann S. Bowers College of Computing and Information Science and director of the SciFi Lab.

“We’re very excited about this system,” he said, “because it really pushes the field forward on performance and privacy. It’s small, low-power and privacy-sensitive, which are all important features for deploying new, wearable technologies in the real world.”

The SciFi Lab has developed several wearable devices that track body , hand and facial movements using machine learning and wearable, miniature video cameras. Recently, the lab has shifted away from cameras and toward acoustic sensing to track face and body movements, citing improved battery life; tighter security and privacy; and smaller, more compact hardware. EchoSpeech builds off the lab’s similar acoustic-sensing device called EarIO , a wearable earbud that tracks facial movements.

Most technology in silent-speech recognition is limited to a select set of predetermined commands and requires the user to face or wear a camera, which is neither practical nor feasible, Cheng Zhang said. There also are major privacy concerns involving wearable cameras – for both the user and those with whom the user interacts, he said.

Acoustic-sensing technology like EchoSpeech removes the need for wearable video cameras. And because audio data is much smaller than image or video data, it requires less bandwidth to process and can be relayed to a smartphone via Bluetooth in real time, said François Guimbretière , professor in information science in Cornell Bowers CIS and a co-author.

“And because the data is processed locally on your smartphone instead of uploaded to the cloud,” he said, “privacy-sensitive information never leaves your control.”

Battery life improves dramatically, too, Cheng Zhang said: about ten hours with acoustic sensing versus 30 minutes with a camera.

The team is exploring commercializing the technology behind EchoSpeech, thanks in part to Ignite: Cornell Research Lab to Market gap funding .

In forthcoming work, SciFi Lab researchers are exploring smart-glass applications to track facial, eye and upper body movements.

“We think glass will be an important personal computing platform to understand human activities in everyday settings,” Cheng Zhang said.

Other co-authors were information science doctoral student Ke Li, Yihong Hao ’24, Yufan Wang ’24 and Zhengnan Lai ‘25. This research was funded in part by the National Science Foundation.

Louis DiPietro is a writer for the Cornell Ann S. Bowers College of Computing and Information Science.

Open access | Published: 13 August 2021

All-weather, natural silent speech recognition via machine-learning-assisted tattoo-like electronics

Youhua Wang, Tianyi Tang, Yin Xu, Yunzhao Bai, Lang Yin, Guang Li, Hongmiao Zhang, Huicong Liu & YongAn Huang

npj Flexible Electronics, volume 5, Article number: 20 (2021)

The internal availability of silent speech serves as a translator for people with aphasia and keeps human–machine/human interactions working under various disturbances. This paper develops a silent speech strategy to achieve all-weather, natural interactions. The strategy requires little specialized skill (unlike sign language) yet accurately transfers high-capacity information in complicated and changeable daily environments. In the strategy, tattoo-like electronics imperceptibly attached to the facial skin record high-quality bio-data of various silent speech, and a machine-learning algorithm deployed on the cloud accurately recognizes the silent speech while reducing the weight of the wireless acquisition module. A series of experiments show that the silent speech recognition system (SSRS) can durably comply with the large deformation (~45%) of faces by virtue of the electricity-preferred tattoo-like electrodes and can recognize up to 110 words covering daily vocabulary with a high average accuracy of 92.64%, simply by use of small-sample machine learning. We successfully apply the SSRS to a 1-day routine, including daily greeting, running, dining, manipulating industrial robots in deafening noise, and expressing in darkness, which shows great promise for real-world applications.

Introduction

Silent speech can offer people with aphasia an alternative way to communicate. More importantly, compared to voice or visual interactions, human–machine interactions using silent speech are versatile enough to work in all-weather surroundings, such as obscured, dynamic, quiet, dark, and noisy environments. Speaking is acquired from babyhood, so silent speech requires less specialized learning and carries more information than most silent alternatives (from typing to sign language to Morse code). Human brains drive the voice through neural signals, so it is effective to infer human intentions by recognizing surface electromyographic (sEMG) signals on the face. Natural silent speech in daily and working life hinges on high-fidelity sEMG acquisition, accurate classification, and imperceptible wearable devices.

sEMG signals are ubiquitously distributed over the skin and have significant spatiotemporal variability 1. The diversity of sEMG occurs even in identical actions 1. Such complexity motivates researchers to develop various classifiers, such as the support vector machine (SVM) 2, deep learning 3, 4, and machine learning 5, 6, to construct the mapping between facial sEMG and silent speech. Silent speech recognition based on EMG can be traced back to the mid-1980s, when Sugie 7 in Japan and Morse 8 in the United States published their research almost simultaneously. Sugie used three-channel electrodes to classify five Japanese vowels, while Morse successfully separated two English words with 97% accuracy. Over the past two decades, the number of classified words has kept growing. In 2003, Jorgensen et al. recognized six independent words with 92% accuracy 9. In 2008, Lee expanded the vocabulary to 60 words, using a hidden Markov model to achieve 87.07% accuracy 10. In 2018, Meltzner et al. recognized >1200 phrases generated from a 2200-word vocabulary with a high accuracy of 91.1% 11. In 2020, Wang et al. used a bidirectional long short-term memory network to recognize ten words with 90% accuracy 12. Although sEMG-based silent speech recognition has made great progress in recent years, most of these works use non-flexible electrodes and sampling equipment with a high sampling rate and precision, are verified only in laboratory environments, and do not evaluate long-term performance. Machine learning and deep learning are both commonly used in previous research. Deep learning needs a large amount of labeled training data, which is tiring and monotonous for participants to provide. Compared with deep learning, classical machine learning performs better with small sample sizes and many classes, and its faster processing makes it more suitable for real-time recognition. Furthermore, it is a trend to reduce the complexity of both the acquisition side and the application side by deploying algorithms on the cloud, which is of great importance for wearable devices 13, 14, 15.

Human faces have complex features, such as geometrically nondevelopable surfaces, softness, dynamic behavior, and large deformation (~45%) 16. However, current inherently planar and rigid electrodes, including wet gels (Ag/AgCl electrodes) 17, invasive silicon needle electrodes 18, and bulk metal electrodes 19, cannot comply with skin textures, forming unstable gaps between skin and electrodes and correspondingly reducing the signal-to-noise ratio. A commercial solution is to employ large-area, strong-adhesion materials (foams, nonwovens, etc.) to wrap the electrodes; however, these auxiliary materials severely constrain the movements of the muscles and cause an uncomfortable experience. Users cannot express their intentions normally when a mass of conventional electrodes is attached to the face. The emergence of lightweight, bendable, stretchable tattoo-like electronics shifts the paradigm of the conventional wearable field and shows great promise in clinical diagnosis, personal healthcare monitoring, and human–machine interaction 1, 20, 21, 22, 23, 24, 25. The skin-like mechanical performance of tattoo-like electronics renders the devices seamlessly conformal with the morphology of the skin. The softness and conformability of tattoo-like electronics not only extend the effective contact area of the skin–device interface, facilitating accurate transmission of bio-signals from the human body to external devices, but also achieve imperceptible wearing. Currently, few researchers other than us have applied tattoo-like electrodes to the acquisition of silent-speech sEMG signals 1, 26, and our previous works only attempted to record a few words, which cannot be extensively implemented in practice.

The strategy proposed in this paper fuses tattoo-like electrodes, a wireless data acquisition (DAQ) module, and a machine-learning algorithm into one all-weather silent speech recognition system (SSRS). The tattoo-like electrodes, made up of ultrathin filamentary serpentines, stay laminated on facial skin even under long-term, large deformation. The wireless DAQ module is a reusable wearable device, providing real-time bio-data transmission from the tattoo-like electrodes to the machine-learning stage. The machine-learning algorithm, suited to multi-label classification of small samples, is deployed on the cloud and used for accurate recognition of 110 daily words. To show the applicability of the SSRS, we apply it to various scenarios close to daily life.

Results and discussion

Design of the SSRS

Figure 1 illustrates the schematics of the all-weather, natural SSRS, which not only helps people communicate naturally in their daily lives but also lets users interact silently in all-weather conditions. Compared with sign language, our SSRS requires no professional training. As shown in Fig. 1a, the SSRS includes four parts: four-channel tattoo-like electronics, a wireless DAQ module, a server-based machine-learning algorithm, and a terminal display of silent speech recognition. Without large sEMG acquisition devices, the user only needs to wear the tattoo-like electronics, assisted by an ear-mounted wireless DAQ module, to capture, process, and transmit the four-channel sEMG signals. The user's real-time sEMG signals are transmitted to a cloud server with powerful computing capability and classified online by a model trained with the machine-learning algorithm. Over a Bluetooth connection, a mobile terminal displays the recognized speech information and plays the audio. The advantages for all-weather, natural use of the SSRS come from long-time wearability, a portable device, and stable, high-accuracy recognition in a variety of scenarios, such as greeting, exercise, repast, work, and dark scenes. Besides, the SSRS uses natural speech, lowering the training cost, and is therefore user-friendly for beginners.

Figure 1

a Schematic illustration of an all-weather, natural SSRS, including four-channel tattoo-like electronics, the wireless DAQ module, the server-based machine-learning algorithm, and the terminal display of recognition, with adaptability in various scenarios. The identifiable drawing is fully consented by the written consent. b The photograph of a participant wearing the SSRS. The identifiable photograph is fully consented by the written consent. c Functional block diagram of the wireless DAQ module. d Confusion matrix of recognition results of the frequently used 110 words. The words are from 13 categories.

Different speech is generated through coordination among facial and neck muscles, so the placement of the electrodes is of critical importance. Four pairs of tattoo-like electrodes are selectively attached on the muscles with significant sEMG signals during silent speech, namely the levator anguli oris (LAO), depressor anguli oris (DAO), buccinator (BUC), and anterior belly of the digastric (ABD), to elevate the accuracy of silent speech recognition. Each channel includes one reference electrode and one working electrode. To guarantee high-fidelity delivery of sEMG through the skin–electrode interface, the tattoo-like epidermal electrodes are designed to be only 1.2 μm thick and are integrated within a skin-like 3M Tegaderm patch (Young's modulus ~7 kPa, 47 μm), which allows them to conform perfectly with the topology of the skin. The electrodes are further patterned as filamentary serpentines to improve elastic stretchability. Specifically, the width, the ribbon-width-to-arc-radius ratio, and the arc angle of the filamentary serpentines are 500 μm, 0.32, and 20°, respectively. According to the mechanics theory of serpentine ribbons 27, this design readily reaches 4% elastic stretchability, much less than the deformation of facial skin. We introduce a so-called "electricity-preferred" method to make the tattoo-like electrodes adequate for large stretch, which is described in "Wearable characterizations of tattoo-like electrodes." The overall size of one electrode is about 18 mm × 32 mm. The tattoo-like electrodes are prepared by the low-cost but high-efficiency "Cut and Paste" methods 24, 28 and the processes are described in the "Methods" section.

Figure 1b shows a picture of a user wearing the tattoo-like electronics and an ear-mounted wireless DAQ module 29. As shown in Supplementary Fig. 1, the method of low-temperature alloy welding effectively increases the strength of the connection, which ensures all-weather use without damage. The specific connection procedure is described in "Methods." The block diagram in Fig. 1c summarizes the system architecture and the overall wireless operation 30. The wireless DAQ module has four signal-collection channels, and each channel is connected to a working electrode and a reference electrode of the tattoo-like electronics. The sEMG signal from each channel is processed by an instrumentation amplifier and an analog filter. A microcontroller unit and a Bluetooth transmission unit then convert and transmit the four-channel signals simultaneously. The DAQ module amplifies the high-fidelity sEMG signals by 1000 times and extracts the effective signals, which carry speech information, through 10–500 Hz band-pass filtering 31. A 10-bit analog-to-digital converter in the microcontroller, operating at a sampling frequency of 500 Hz, digitizes the signals collected from each channel. Tests of the recognition rate at different sampling frequencies (Supplementary Table 1) show that the low sampling frequency of 500 Hz not only meets the requirement of a high recognition rate but also reduces the processing time and power consumption of the SSRS. The Bluetooth transmission unit uses the fifth-generation Bluetooth protocol, which allows continuous data transmission at rates of up to 256 kb/s 32, 33. With the help of a Bluetooth receiver, the mobile terminal receives the recognition information and performs the appropriate interaction in daily applications. The recognition of silent speech is achieved by training on facial sEMG signals with the linear discriminant analysis (LDA) algorithm, which is described in "sEMG-based silent speech recognition by machine learning." Figure 1d shows the confusion matrix of recognition results for the 110 frequently used words in daily life, divided into 13 categories. The high recognition rate of 92.64% can fully meet users' daily communication requirements.
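
For readers who want to experiment with comparable signals in software, the sketch below mimics the front-end conditioning described above (1000x gain, band-pass filtering, 500 Hz sampling, 10-bit quantization) using NumPy and SciPy. The real system performs these steps in analog hardware before the ADC; the 240 Hz upper cutoff here is an assumption forced by the digital Nyquist limit at a 500 Hz sampling rate, and all function names are illustrative.

```python
import numpy as np
from scipy.signal import butter, filtfilt

FS = 500      # sampling frequency, Hz
GAIN = 1000   # amplification factor of the instrumentation amplifier

def condition_semg(raw_uv: np.ndarray) -> np.ndarray:
    """Amplify and band-pass filter one channel of raw sEMG (in microvolts)."""
    b, a = butter(4, [10, 240], btype="bandpass", fs=FS)
    return filtfilt(b, a, raw_uv * GAIN)

def quantize_10bit(signal: np.ndarray, full_scale: float) -> np.ndarray:
    """Mimic the 10-bit ADC by mapping the signal onto 1024 levels."""
    levels = np.clip((signal / full_scale + 0.5) * 1023, 0, 1023)
    return np.round(levels).astype(np.uint16)
```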

Wearable characterizations of tattoo-like electrodes

The mechanical mismatch at the skin–electrode interface constrains the natural deformation of human skin, causing an uncomfortable wearing experience. Figure 2a and Supplementary Fig. 2 compare the mechanical constraint imposed on human skin by tattoo-like electrodes and by commercial gel electrodes under large deformation. No matter how extremely a human face deforms, including opening the mouth, inflating the cheeks, and twitching the mouth toward the left or right, the ultrathin tattoo-like electrodes comply with the deformed skin, while the gel electrodes constrain the skin's movements. The strong driving forces at the skin–gel electrode interface not only decrease wearability but also delaminate the interface. Figure 2b displays the conformability of tattoo-like electrodes on skin textures at different scales; they perfectly match both coarse and fine skin textures. The overlapped curves in Supplementary Fig. 3 indicate that, owing to the excellent conformability of the tattoo-like electrodes, the skin–electrode interface retains robust electrical performance even after suffering large deformation. Figure 2c exhibits micro-optical photographs of the stretchability of the skin–electrode interface. Soft silicone rubber (Young's modulus ~0.35 MPa, close to that of human skin) was used to mimic human skin, and the mimic skin laminated with an ultrathin tattoo-like electrode was stretched by 30%, equal to the elastic limit of human skin. The comparison in Fig. 2c shows that the skin–electrode interface remains intact after stretching. Motion artifact has a critical impact on the signal-to-noise ratio; the robust conformability on complex skin textures (Fig. 2b) and under large deformation (Fig. 2c) suppresses motion artifacts during various silent utterances.

Figure 2

a The wearability of tattoo-like electrodes and gel electrodes when attached on the subject’s face. The identifiable photographs are fully consented by the written consents. b The tattoo-like electrodes conform with skin textures at different scales. The scale bars on the left and right panel are 6 and 2.5 mm, respectively. c The skin–electrode interface before and after being stretched by 30%. The scale bar is 150 μm. d , e The strain distributions of tattoo-like electrodes under horizontal and vertical tensing with 45%. f The resistivity changes of gold ribbons with respect to strain. g The long-term measurement to investigate the change of background noise and impedance. The standard deviation (SD) characterizes the strength of background noises. h Long-term measurement of log detector (LOG) and classification accuracy (CA).

Though research has shown that the elastic limit of human skin is about 30% 25, 34, 35 and that the limit of painless stretching is about 20% 36, 37, the tension of human faces reaches up to 45% 16. Figure 2d, e present the strain contours of the ultrathin tattoo-like electronics axially extended by 45% applied strain in the horizontal and vertical directions. The maximum principal strains are, respectively, 4.1% (horizontal) and 1.8% (vertical), both beyond the yield limit of nano-film gold (0.3%) 37. However, for physiological electrodes, we pay more attention to electrical conductivity. Figure 2f plots the change of electrical resistivity of nano-film gold in the 100 nm Au/10 nm Cr/1.1 μm polyethylene terephthalate (PET) composite with respect to applied strain under uniaxial tension (inset schematics in Fig. 2f). The change of electrical resistivity is only about 5% when stretched to 2%, while it sharply rises to ~30% when stretched to 4%. Supplementary Fig. 4 clearly shows that the parts strained beyond 2% in Fig. 2d are at the inner crests of the serpentine structures. The parts below 2% and those beyond 2% can be treated as parallel circuits (see the schematics in Supplementary Fig. 4). According to the rule for resistance change in a parallel circuit, the parts beyond 4% do not remarkably influence the whole resistance of the ultrathin tattoo-like electronics. Therefore, the ultrathin tattoo-like electrodes are still regarded as effective structures. This approach, which gives priority to electrical performance, keeps the structural design and fabrication of the electrodes simple, and is called the "electricity-preferred" method.

Practical applications of silent speech demand long-term wearing performance, and the background noise and the skin–electrode contact impedance directly determine whether the desired silent speech is collected or not, so we tested both electrical parameters of the ultrathin tattoo-like electrodes during a 10-h wearing period. Commercial gel electrodes (3M) were used as the gold standard for studying long-term electrical performance. Two pairs of ultrathin tattoo-like electrodes and gel electrodes were closely attached on the subject's forearm, and the distance between two tattoo-like electrodes, or two gel electrodes, was set to 7 cm. The subject was required to run for half an hour starting at the eighth hour. The results are illustrated in Fig. 2g. The gel electrodes show stable noise and impedance during the whole measurement, while the noise and impedance of the tattoo-like electrodes gradually decrease. The background noise and the impedance depend strongly on the skin–electrode interface and skin properties 38. The chloride ions contained in gel electrodes freely permeate through the stratum corneum, significantly suppressing noise and impedance. After running, both electrical parameters of the tattoo-like electrodes drop sharply, and the noise becomes even weaker than that of the gel electrodes, mainly because the Tegaderm film prevents the evaporation of sweat, which passes through the stratum corneum and accumulates at the skin–electrode interface, dramatically reducing the noise and impedance. Given that long-term wear can affect the background noise, it is reasonable to consider the effect of daily use on sEMG. Thus, we studied the effect of complicated daily activities on signal features and classification accuracy. Tattoo-like electrodes and gel electrodes were attached on the left and right sides of the face, respectively. The subject was required to dine at hours 2.3, 5, and 9.5 and to run for 20 min at hour 9. Additionally, the room temperature gradually rose from 20 to 30.1 °C over 6 h and then declined to 19.8 °C over 4.5 h, as shown in the upper panel of Fig. 2h. The middle panel of Fig. 2h plots the log detector (LOG) of the first channel with respect to wearing time. The results show that the signal features captured by the tattoo-like electrodes are immune to room-temperature change, running, and dining during the long-term test, while those of the gel electrodes keep fluctuating. The bottom panel of Fig. 2h shows that silently saying "Hello, nice to meet you" at different times achieves a high classification accuracy of 90%, and the average classification accuracy is 95%. In conclusion, the long-term experiments in Fig. 2g, h demonstrate the potential of silent speech for extensive applications.

sEMG-based silent speech recognition by machine learning

To keep normal communication going smoothly, a collection of 110 words and phrases in American Sign Language (ASL), covering the words frequently used in daily life, is selected for recognition (Supplementary Table 2) 39. The collection is divided into 13 categories, including time, place, emotion, health, etc. According to previous research, eight muscles were initially selected (Supplementary Fig. 5) 9, 10, 26, 40. However, the more channels are used, the more power the wireless DAQ module consumes to transmit the data, so it is necessary to pick out an optimal combination. The classification accuracy of the different channel combinations, from one muscle to eight muscles, was calculated. The results in Supplementary Fig. 6 clearly show that the mean classification accuracy gradually increases and approaches 92.1% as the number of channels grows. The number of channels is finally set to four, and the combination with the highest classification accuracy among four channels is selected (Supplementary Fig. 7): LAO, DAO, BUC, and ABD.
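
A minimal sketch of this kind of channel-subset search, assuming pre-extracted per-channel feature arrays and scikit-learn, is shown below. It exhaustively scores every four-muscle combination with cross-validated LDA accuracy, which mirrors the selection idea described above without reproducing the authors' code; the data shapes and names are assumptions.

```python
from itertools import combinations

import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import cross_val_score

def best_channel_subset(features: np.ndarray, labels: np.ndarray, k: int = 4):
    """features: (n_samples, n_channels, n_feats); labels: (n_samples,)."""
    best_acc, best_combo = 0.0, None
    for combo in combinations(range(features.shape[1]), k):
        # Flatten the selected channels into one feature vector per sample.
        X = features[:, combo, :].reshape(len(labels), -1)
        acc = cross_val_score(LinearDiscriminantAnalysis(), X, labels, cv=10).mean()
        if acc > best_acc:
            best_acc, best_combo = acc, combo
    return best_combo, best_acc
```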

The high-quality sEMG for the recognition is immediately recorded by the flexible tattoo-like electronics and transmitted through the wireless DAQ module to the cloud server in real time once participants perform silent speech tasks. The proposed recognition procedure shown in Fig. 3a is deployed on the cloud server and comprises the active segment interception, the training phase (left panel in Fig. 3a ), and online prediction (right panel in Fig. 3a ).

Figure 3

a Recognition flow chart of training phase (left) and online prediction (right). The identifiable photographs are fully consented by the written consents. b Confusion matrix of recognition results of 110 ASL words. c Prediction performance of different classifiers LDA, SVM, and NBM. d Accuracy rate from multiple channels to a single channel.

Active segment interception plays an essential role in distinguishing silent speech-related sEMG from non-silent speech-related sEMG (swallowing, blinking, etc.). Based on our experience and previous research 41, the sEMG absolute amplitude threshold and the threshold on the number of facial muscles activated by silent speech are set to 50 μV and 2, respectively. The active segment interception extracts the sEMG from 800 ms before to 1200 ms after the moment when the sEMG signals exceed both thresholds.
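
The interception rule translates directly into a few lines of array processing. The sketch below, assuming a four-channel sEMG array sampled at 500 Hz, is an illustrative implementation of that rule rather than the authors' own code.

```python
import numpy as np

FS = 500                  # sampling frequency, Hz
AMP_THRESHOLD = 50.0      # microvolts
MIN_ACTIVE_CHANNELS = 2   # at least two facial muscles must be active
PRE, POST = int(0.8 * FS), int(1.2 * FS)   # 800 ms before, 1200 ms after

def intercept_active_segment(semg: np.ndarray):
    """semg: (n_channels, n_samples); returns the active window or None."""
    active = (np.abs(semg) > AMP_THRESHOLD).sum(axis=0) >= MIN_ACTIVE_CHANNELS
    onsets = np.flatnonzero(active)
    if onsets.size == 0:
        return None           # no silent-speech activity detected
    t = onsets[0]             # first sample where both thresholds are exceeded
    return semg[:, max(0, t - PRE): t + POST]
```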

Because the thicknesses of the ultrathin devices and the welding spots differ greatly (1.2 μm and about 300 μm, respectively), there is a large difference in bending stiffness, which can easily cause motion artifacts. Baseline wander of the signals is unavoidable even though violent shaking of the connection between the tattoo-like electronics and the wires is suppressed by the adhesive Tegaderm. To remove the baseline wander, a 4-level wavelet packet with a soft threshold is used to decompose the extracted signals and reconstruct them from the node coefficients of the 2nd to 16th nodes in the 4th layer 42. Fifteen relative wavelet packet energies 43 are extracted from the 15 nodes as frequency-domain features. The denoised signals are then full-wave rectified, and ten time-domain features are extracted from the rectified signals. The definitions of all features are listed in Supplementary Table 3 44, 45, 46, 47. With four channels, each silently spoken word corresponds to a feature vector of 100 features. In the training phase, the silent speech users are required to speak 110 words and repeat each word 10 times. With an additional label column, the dimensions of the feature matrix reach 1100 × 101. The feature matrix is input into an LDA model for training, and tenfold cross-validation is used to evaluate the training effectiveness. To speed up recognition, one-vs-rest is selected as the multi-class strategy. In online prediction, a vector of 100 features is input into the well-trained LDA model to predict the silent speech user's intention.
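
A simplified sketch of this feature-extraction and training pipeline, using PyWavelets and scikit-learn, is given below. The wavelet family, the particular time-domain features, and the helper names are assumptions made for illustration (the paper uses 15 relative wavelet-packet energies plus ten time-domain features per channel), and the baseline-removal reconstruction step is omitted for brevity.

```python
import numpy as np
import pywt
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import cross_val_score
from sklearn.multiclass import OneVsRestClassifier

def channel_features(x: np.ndarray) -> np.ndarray:
    """Frequency- and time-domain features for one channel of one utterance."""
    wp = pywt.WaveletPacket(x, wavelet="db4", maxlevel=4)
    nodes = wp.get_level(4, order="natural")       # 16 fourth-level nodes
    energies = np.array([np.sum(n.data ** 2) for n in nodes])
    rel_energy = energies[1:] / energies.sum()     # drop the lowest band: 15 values
    rect = np.abs(x)                               # full-wave rectification
    time_feats = [rect.mean(), rect.std(), rect.max(),
                  np.sqrt(np.mean(x ** 2)),        # root mean square
                  np.mean(np.abs(np.diff(x)))]     # mean absolute first difference
    return np.concatenate([rel_energy, time_feats])

def build_matrix(trials: np.ndarray) -> np.ndarray:
    """trials: (n_trials, n_channels, n_samples) -> (n_trials, n_features)."""
    return np.array([np.concatenate([channel_features(ch) for ch in trial])
                     for trial in trials])

clf = OneVsRestClassifier(LinearDiscriminantAnalysis())
# accuracy = cross_val_score(clf, build_matrix(trials), labels, cv=10).mean()
```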

For offline recognition, the average classification accuracy of LDA reaches 92.64% for the 110 words (Fig. 3b). Supplementary Fig. 8 shows the classification accuracy of each word. SVM and the naive Bayes model (NBM), two other extensively used machine-learning methods, are compared with LDA in four respects: classification accuracy, F1-score, training speed, and prediction speed. Requiring only a small amount of data is of great importance for users, to avoid monotony and fatigue; recognizing 110 words from 10 samples each is a typical few-shot classification problem. The comparison in Fig. 3c clearly shows that LDA is superior to SVM and NBM in every respect. In conclusion, LDA, with its high classification accuracy and high prediction speed, enables silent speech users to communicate naturally.
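
The comparison protocol is straightforward to reproduce in outline. The sketch below, assuming a 1100 × 100 feature matrix X and a 110-class label vector y, scores LDA, SVM, and a Gaussian naive Bayes model on accuracy and macro-F1 with tenfold cross-validation and times training and prediction; it illustrates the evaluation protocol, not the authors' exact settings.

```python
import time

from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import cross_validate
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

classifiers = {
    "LDA": LinearDiscriminantAnalysis(),
    "SVM": SVC(),
    "NBM": GaussianNB(),
}

def compare(X, y):
    for name, clf in classifiers.items():
        scores = cross_validate(clf, X, y, cv=10,
                                scoring=["accuracy", "f1_macro"])
        t0 = time.perf_counter(); clf.fit(X, y); fit_t = time.perf_counter() - t0
        t0 = time.perf_counter(); clf.predict(X); pred_t = time.perf_counter() - t0
        print(f"{name}: acc={scores['test_accuracy'].mean():.3f}  "
              f"f1={scores['test_f1_macro'].mean():.3f}  "
              f"train={fit_t:.3f}s  predict={pred_t:.3f}s")
```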

Considering the contamination of facial sEMG by eye blinking, the influence of the electrooculogram (EOG) on the SSRS is discussed. Only channel 1 (LAO), closest to the eye, is affected by the EOG. The maximum amplitude of the EOG is 30 μV, which is less than the muscle-activity detection threshold (50 μV). Although the muscle-activity segment detection will not misjudge the EOG signal as a silent speech-related signal, speaking and blinking sometimes happen at the same time. When such a situation occurs, the EOG signal can be eliminated by preprocessing: when the raw signal is preprocessed, the first node in the fourth layer of the wavelet packet decomposition is not used to reconstruct the signal, which is effectively equivalent to applying a 15 Hz high-pass filter (the frequency range of EOG is 0–12 Hz 48). After preprocessing, the maximum amplitude of the EOG is <12 μV (Supplementary Fig. 9), so the effect of the EOG signal on the SSRS can be ignored.

Considering possible extreme conditions during long-term use, such as a wire disconnecting or an electrode being damaged, the SSRS may lose some sEMG channels. All possible scenarios from four channels (normal state) down to only one channel were tested to examine the robustness of LDA (Fig. 3d). When three channels are in good condition, the average classification accuracy can reach >85%. With two working channels, it can reach >70%. Even when only one channel remains intact, the average classification accuracy reaches 42.27%, much higher than chance level for one word out of 110 (0.91%). Therefore, our SSRS has promising applications in extreme conditions.

All-weather demonstration of the SSRS

Figure 4a exhibits five typical scenarios that a user often experiences in daily life: greeting, exercise, repast, working in a noisy environment, and communicating in darkness. The model training and signal recognition of the SSRS are based on cloud servers with powerful computing capabilities, so users only need a mobile phone with basic communication functions. The popularization of fifth-generation communication technology has great potential to further reduce the delay of the SSRS and bring users a more natural interactive experience. With the help of the SSRS, users can not only communicate point to point but also express their intentions to networked services. Some notable capabilities and advantages of the SSRS are demonstrated in detail below.

Figure 4

a Five typical scenarios experienced in daily life. b Wearable and natural communication: (i) the greeting scene; (ii) four-channel sEMG of four representative words; (iii) the recognition rates of eight words in the greeting scene. c All-weather use under dynamic conditions: (i) the exercise scene; (ii) the recognition rates and background noise of five location-related words at four different running speeds; (iii) the recognition rates of five location-related words in four different exercise states. d All-weather use under large deformation: (i) the dining scene; (ii) the recognition rates of five food-related words after different numbers of repetitions, with four kinds of mouth deformation each time; (iii) the recognition rates of five food-related words. e Adaptability in a noisy environment: (i) the noisy working scene; (ii) the ASR and sEMG signals; (iii) comparison of the recognition rates of the two recognition methods under four levels of ambient noise. f Adaptability in a dark environment: (i)–(iii) comparison of the recognition performance of silent speech recognition (left) and American Sign Language (right) as the environment darkens. Written consent was obtained for all identifiable photographs in b–f.

Figure 4bi demonstrates a typical greeting scene in which the user needs to communicate with people in daily life. Figure 4bii shows the real-time sEMG signals collected from the four channels of the SSRS as the user silently pronounces the words “Hello,” “Morning,” “Thanks,” “Goodbye,” and so on. The characteristics of different words can easily be identified from the different channels in real time, showing that the sEMG of the facial muscles carries enough speech information. As shown in Supplementary Video 1, the subject is able to communicate naturally with his friend in silent speech with the help of the SSRS. The confusion matrix in Fig. 4biii indicates that the recognition rate of the eight words in the greeting scene is 95%. The SSRS meets the user’s need for natural communication in three respects. First, the wearable flexible printed circuit and the wireless connection of the SSRS provide convenience and greatly extend the user’s range of activity. Second, the ultrathin tattoo-like electronics collect high-fidelity facial sEMG and can be worn all day long. Finally, the LDA algorithm achieves a high recognition rate of 92.64% across 110 classes, which is more than enough for natural communication in daily life without any sign-language training.

The proposed ultrathin and super-conformal tattoo-like electronics ensure stable acquisition of the sEMG signal even when the user performs strenuous exercise. Figure 4ci demonstrates a typical exercise scene. Comparative experiments on background noise and recognition rate at different running speeds were carried out. We selected five common words (“Home,” “Work,” “School,” “Store,” “Church”) and tested the recognition rates in four motion states: resting (0 m/s), walking (1 m/s), jogging (3 m/s), and running (5 m/s). Each word was repeated ten times by the subject. In Fig. 4cii, the recognition rate in the resting, walking, and jogging states remains as high as ≥96%, and in the running state it is still 86%. The overall recognition rate across the four states is 96% (Fig. 4ciii), which demonstrates the excellent stability of the SSRS. Supplementary Video 2 shows the exercise scene, in which the SSRS still recognizes words correctly while the subject is jogging. This also verifies that the SSRS is not affected by the user’s body shaking and hence has great potential to replace touch control for operating smart devices.

In addition to maintaining a high recognition rate under dynamic conditions, the all-weather SSRS has good tolerance to mouth deformation and muscle fatigue. Figure 4di displays a user dining in a restaurant with the help of the SSRS (Supplementary Video 3). According to statistics, users chew about 400–600 times during a meal. Therefore, the subject was asked to repeat mouth-movement actions 0–200 times, performing four kinds of mouth deformation each time, as seen in Fig. 4dii. The recognition rate of five food-related words (“Pizza,” “Milk,” “Hamburger,” “Hotdog,” “Egg”) remains consistently ≥96%, as shown in Fig. 4diii, and the overall recognition rate across the five repetition counts is 98%.

Compared with automatic speech recognition (ASR), the SSRS tolerates ambient sound well, which matters especially in noisy environments and in places where quiet is required, such as workplaces and public spaces. Figure 4ei shows a noisy industrial environment that the user may experience at work. Comparative experiments on the performance of the SSRS (left) and ASR (right) in a noisy environment are shown in Supplementary Video 4. For the four color-related words tested in Fig. 4eii, the ASR signal is increasingly corrupted as the ambient noise level rises, while the sEMG signal remains unchanged (details in the red dashed box). Figure 4eiii compares the recognition rates of the SSRS and ASR under different levels of ambient noise. When the ambient noise reaches 80 dB, the recognition rate of ASR drops to only 20%, while the recognition rate of the SSRS remains 100%, as shown in Supplementary Fig. 10. This comparison shows that the SSRS has great potential as an effective human–machine interface: people can easily control equipment through the SSRS in a noisy working environment (Supplementary Fig. 11).

One alternative for communication in darkness is smart gloves. In the past few years, some smart gloves have been able to recognize sign language effectively. In 2019, Sundaram et al. proposed a scalable tactile glove that classified 8 gestures with a recognition rate of 89.4% 49 . In 2020, Zhou et al. demonstrated a recognition rate of up to 98.63% by using machine-learning-assisted stretchable sensor arrays to analyze 660 acquired sign-language hand gestures 50 . However, sensor-based gesture recognition differs from the SSRS: mastering sign language takes a long time, which is an obstacle to its use as a speech aid. Compared with sign language, the all-weather SSRS can be used naturally, without any technical threshold and regardless of lighting. Figure 4f compares the recognition performance of the SSRS (left) and ASL (right) in a gradually darkening environment. The subject expresses “Happy,” “Sad,” “Sorry,” “Angry,” and “Love” through the SSRS and ASL, respectively. As the light dims, the ASL can no longer be identified, as seen in Supplementary Video 5, while the SSRS still works well. Therefore, the SSRS could be a better choice than ASL for users in the future.

We have proposed a silent speech strategy by designing an all-weather SSRS that realizes natural silent speech recognition. The ultrathin tattoo-like electronics conform to various skin textures, and the simple but effective electricity-preferred design method enables the filamentary serpentines to bear ~45% extension of the facial skin. Long-term attachment not only decreases the interface impedance but also preserves the features of the sEMG. The wireless DAQ module bridges the tattoo-like electronics and the LDA algorithm. The LDA algorithm is deployed on the cloud to keep the wireless DAQ module lightweight, and it achieves a high recognition rate over 110 words. The one-day routine-life demonstration shows the system’s competence for future all-weather, natural silent speech, including communication for people with aphasia, communication while keeping quiet, and human–machine interaction free from surrounding disturbance.

Manufacturing processes of the tattoo-like electronics

The fabrication process started with the lamination of an ultrathin PET film (1.1-µm thick) on wetted water transfer paper (Huizhou Yibite Technology, China). The composite substrate of PET and water transfer paper was baked in an oven (ZK-6050A, Wuhan Aopusen Test Equipment Inc., China) at 50–60 °C for ~1 h and subsequently at 100–110 °C for ~2 h for adequate drying. Ten-nm-thick chromium (Cr) and 100-nm-thick gold (Au) were deposited on the PET. The film was then cut into the designed pattern by a programmable mechanical cutter (CE6000-40, GRAPHTEC, Japan). Tweezers were used to carefully remove the unnecessary parts of the pattern on the re-wetted water transfer paper. The patterned film was then flipped over using thermally released tape (TRT) (REVALPHA, Nitto, Japan). The TRT was deactivated on a hotplate at ~130 °C for 3 min and stuck to the 3M Tegaderm. Finally, the deactivated TRT was removed to obtain the tattoo-like electronics.

Method of connecting electrodes and wireless DAQ module

The processed electrode was placed on the platform with the Tegaderm layer facing up. Low-temperature welding was then used to connect the pad and the wire: the electrode was peeled off and folded in half along the pad, and a low-temperature alloy was used to weld the wire onto the pad. When the temperature dropped to room temperature, another piece of Tegaderm was used to fix the connection between the pad and the wire.

Design of wireless DAQ module

The wireless DAQ module used an AD8220, an OPA171, an ATmega328P, and a CC2540F256. The AD8220 and OPA171 amplified the original sEMG signal 1000 times. The ATmega328P, with 10-bit precision, was used for analog-to-digital conversion, and the sampling frequency of the microprocessor was set to 500 Hz. The CC2540F256 was used to send and receive data.

Experimental process of EMG signal acquisition

The reference electrode needs to be placed on electrically neutral tissue 51 ; the posterior mastoid, the position closest to the acquisition device, is selected as the reference electrode site. Another 8 electrodes were attached to the 4 designated muscles, with every 2 electrodes targeting 1 muscle and an inter-electrode distance of 2 cm. Before applying the electrodes, the target locations were cleaned with clean water. The wireless DAQ module and the electrodes were connected by wires, and the wireless circuit module was hung on the subject’s ear. The subject was instructed to read each word silently ten times. During the experimental sessions, the subject was asked to avoid swallowing, coughing, and other facial movements unrelated to silent reading.

System environment and parameters of SSRS

The SSRS was built in a Windows 10 environment. The LDA algorithm from the machine-learning toolbox of MATLAB 2019b was used in the SSRS. For real-time recognition, the time window length was 2000 ms, the sliding window was 200 ms, and the sampling frequency was 500 Hz.
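
Although the SSRS itself is implemented in MATLAB, the windowing logic is simple enough to sketch; a minimal Python illustration of the 2000 ms window advanced every 200 ms at 500 Hz (classify and word_features are assumed helpers wrapping the trained LDA model and the feature extraction):

```python
# Sketch of the real-time windowing used by the SSRS: a 2000 ms window
# (1000 samples at 500 Hz) advanced every 200 ms (100 samples).
FS = 500                    # sampling frequency, Hz
WIN = 2 * FS                # 2000 ms window -> 1000 samples
HOP = FS // 5               # 200 ms slide  -> 100 samples

def sliding_windows(buffer):
    """buffer: array of shape (4, n_samples); yields (4, WIN) segments."""
    start = 0
    while start + WIN <= buffer.shape[1]:
        yield buffer[:, start:start + WIN]
        start += HOP

# for segment in sliding_windows(emg_stream):
#     label = classify(word_features(segment))   # classify/word_features: assumed helpers
```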

Ethical information for studies involving human subjects

All experiments involving human subjects were conducted in compliance with Institutional Review Board guidelines and were reviewed and approved by the Ethics Committee of Soochow University (Approval Number: SUDA20210608A01). All participants took part voluntarily and provided written informed consent. Because the SSRS is worn on the users’ faces, a limited number of identifiable images had to be used; the use of all identifiable information was consented to by the users.

Data availability

The data that support the findings of this study are available from the corresponding author upon reasonable request.

Code availability

The custom code and mathematical algorithm that support the findings of this study are available at https://doi.org/10.5281/zenodo.4925493 . The most recent version of this code can be found at https://github.com/xsjzbx/paper_All-weather-natural-silent-speech-recognition-via-ML-assisted-tattoo-like-electronic .

Wang, Y. H. et al. Electrically compensated, tattoo-like electrodes for epidermal electrophysiology at scale. Sci. Adv. 6 , eabd0996 (2020).


Cai, S. et al. SVM-based classification of sEMG signals for upper-limb self-rehabilitation training. Front. Neurorobot. 13 , 31 (2019).


Cote-Allard, U. et al. Interpreting deep learning features for myoelectric control: a comparison with handcrafted features. Front. Bioeng. Biotechnol. 8 , 158 (2020).

Orjuela-Canon, A. D., Ruiz-Olaya, A. F. & Forero, L. Deep neural network for EMG signal classification of wrist position: Preliminary results. In 2017 IEEE Latin American Conference on Computational Intelligence (LA-CCI) 5 pp. (IEEE, 2017).

Jaramillo-Yánez, A., Benalcázar, M. E. & Mena-Maldonado, E. Real-time hand gesture recognition using surface electromyography and machine learning: a systematic literature review. Sensors 20 , 2467 (2020).

Jiang, Y. et al. Shoulder muscle activation pattern recognition based on sEMG and machine learning algorithms. Comput. Biol. Med. 197 , 105721 (2020).


Sugie, N. & Tsunoda, K. A speech prosthesis employing a speech synthesizer - vowel discrimination from perioral muscle activities and vowel production. IEEE Trans. Biomed. Eng. 32 , 485–490 (1985).

Morse, M. S. & Obrien, E. M. Research summary of a scheme to ascertain the availability of speech information in the myoelectric signals of neck and head muscles using surface electrodes. Comput. Biol. Med. 16 , 399–410 (1986).

Jorgensen, C., Lee, D. D. & Agabon, S. Sub Auditory Speech Recognition Based on EMG Signals. In International Joint Conference on Neural Networks 2003 3128–3133 (Institute of Electrical and Electronics Engineers Inc., 2003)

Lee, K. S. EMG-based speech recognition using hidden Markov models with global control variables. IEEE Trans. Biomed. Eng. 55 , 930–940 (2008).

Meltzner, G. S. et al. Development of sEMG sensors and algorithms for silent speech recognition. J. Neural Eng. 15 , 046031 (2018).

Wang, Y. et al. Silent speech decoding using spectrogram features based on neuromuscular activities. Brain Sci. 10 , 442 (2020).

Molina-Molina, A. et al. Validation of mDurance, a wearable surface electromyography system for muscle activity assessment. Front. Physiol. 11 , 606287 (2020).

Peng, Y. H., Wang, X. J., Guo, L., Wang, Y. C. & Deng, Q. X. An efficient network coding-based fault-tolerant mechanism in WBAN for smart healthcare monitoring systems. Appl. Sci. 7 , 18 (2017).

Mehmood, G., Khan, M. Z., Abbas, S., Faisal, M. & Rahman, H. U. An energy-efficient and cooperative fault-tolerant communication approach for wireless body area network. IEEE Access 8 , 69134–69147 (2020).

Hsu, V. M., Wes, A. M., Tahiri, Y., Cornman-Homonoff, J. & Percec, I. Quantified facial soft-tissue strain in animation measured by real-time dynamic 3-dimensional imaging. Plast. Reconstr. Surg. Glob. Open 2 , e211 (2014).

Bracken, D. J., Ornelas, G., Coleman, T. P. & Weissbrod, P. A. High-density surface electromyography: a visualization method of laryngeal muscle activity. Laryngoscope 129 , 2347–2353 (2019).

Kim, S. et al. Integrated wireless neural interface based on the Utah electrode array. Biomed. Microdevices 11 , 453–466 (2009).

Liao, L. D., Wang, I. J., Chen, S. F., Chang, J. Y. & Lin, C. T. Design, fabrication and experimental validation of a novel dry-contact sensor for measuring electroencephalography signals without skin preparation. Sensors 11 , 5819–5834 (2011).

Kim, D. H. et al. Epidermal electronics. Science 333 , 838–843 (2011).

Kim, Y. et al. A bioinspired flexible organic artificial afferent nerve. Science 360 , 998–1003 (2018).

Yang, J. C. et al. Electronic skin: recent progress and future prospects for skin-attachable devices for health monitoring, robotics, and prosthetics. Adv. Mater. 31 , e1904765 (2019).

Son, D. et al. An integrated self-healable electronic skin system fabricated via dynamic reconstruction of a nanostructured conducting network. Nat. Nanotechnol. 13 , 1057–1065 (2018).

Zhou, Y. et al. Multichannel noninvasive human–machine interface via stretchable µm thick sEMG patches for robot manipulation. J. Micromech. Microeng. 28 , 014005 (2018).

Wang, Y. et al. Low-cost, μm-thick, tape-free electronic tattoo sensors with minimized motion and sweat artifacts. npj Flex. Electron. 2 , 6 (2018).

Liu, H. C. et al. An epidermal sEMG tattoo-like patch as a new human-machine interface for patients with loss of voice. Microsyst. Nanoeng. 6 , 16 (2020).

Widlund, T., Yang, S. X., Hsu, Y. Y. & Lu, N. S. Stretchability and compliance of freestanding serpentine-shaped ribbons. Int. J. Solids Struct. 51 , 4026–4037 (2014).

Yang, X. et al. “Cut-and-paste” method for the rapid prototyping of soft electronics. Sci. China Technol. Sci. 62 , 199–208 (2019).

Vesa, E. P. & Ilie, B. Equipment for SEMG signals acquisition and processing. In International Conference on Advancements of Medicine and Health Care through Technology , MEDITECH 2014 187–192 (Springer Verlag, 2014).

Ferreira, J. M. & Lima, C. Distributed system for acquisition and processing the sEMG signal. In 1st International Conference on Health Informatics , ICHI 2013 335–338 (Springer, 2013).

Alemu, M., Kumar, D. K. & Bradley, A. Time-frequency analysis of SEMG−with special consideration to the interelectrode spacing. IEEE Trans. Neural Syst. Rehabil. Eng. 11 , 341–345 (2003).

Sheikh, M. U., Badihi, B., Ruttik, K. & Jantti, R. Adaptive physical layer selection for bluetooth 5: measurements and simulations. Wirel. Commun. Mob. Comput. 2021 , 1–10 (2021).

Sun, D. Z., Sun, L. & Yang, Y. On secure simple pairing in bluetooth standard v5.0-Part II: Privacy analysis and enhancement for low energy. Sensors 19 , 3259 (2019).

Lee, K. et al. Mechano-acoustic sensing of physiological processes and body motions via a soft wireless device placed at the suprasternal notch. Nat. Biomed. Eng. 4 , 148–158 (2020).

Koh, A. et al. A soft, wearable microfluidic device for the capture, storage, and colorimetric sensing of sweat. Sci. Transl. Med. 8 , 366ra165 (2016).

Hyoyoung, J. et al. NFC-enabled, tattoo-like stretchable biosensor manufactured by cut-and-paste method. In 2017 39th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC) 4094–4097 (IEEE, 2017).

Tian, L. et al. Large-area MRI-compatible epidermal electronic interfaces for prosthetic control and cognitive monitoring. Nat. Biomed. Eng. 3 , 194–205 (2019).

Huigen, E., Peper, A. & Grimbergen, C. A. Investigation into the origin of the noise of surface electrodes. Med. Biol. Eng. Comput. 40 , 332–338 (2002).

Vicars, W. First 100 signs: American Sign Language (ASL). http://www.lifeprint.com/asl101/pages-layout/concepts.htm (2002).

Meltzner, G. S. et al. Speech recognition for vocalized and subvocal modes of production using surface EMG signals from the neck and face. In INTERSPEECH 2008 - 9th Annual Conference of the International Speech Communication Association 2667–2670 (International Speech Communication Association, 2008).

Hooda, N., Das, R. & Kumar, N. Fusion of EEG and EMG signals for classification of unilateral foot movements. Biomed. Signal Process. Control 60 , 101990 (2020).

Mithun, P., Pandey, P. C., Sebastian, T., Mishra, P. & Pandey, V. K. A wavelet based technique for suppression of EMG noise and motion artifact in ambulatory ECG. Annu. Int. Conf. IEEE Eng. Med. Biol. Soc. 2011 , 7087–7090 (2011).


Xiao et al. Classification of surface EMG signal using relative wavelet packet energy. Comput Methods Prog. Biomed. 79 , 189–195 (2005).

Pancholi, S. & Joshi, A. M. Electromyography-based hand gesture recognition system for upper limb amputees. Electron. Lett. 3 , 1–4 (2019).

Yikang, Y. et al. A Multi-Gestures Recognition System Based on Less sEMG Sensors. In 2019 IEEE 4th International Conference on Advanced Robotics and Mechatronics (ICARM) 105–110 (IEEE, 2019).

Too, J., Abdullah, A. R. & Saad, N. M. Classification of hand movements based on discrete wavelet transform and enhanced feature extraction. Int. J. Adv. Comput Sci. Appl. 10 , 83–89 (2019).

Savur, C. & Sahin, F. Real-Time American Sign Language Recognition System Using Surface EMG Signal. In 2015 IEEE 14th International Conference on Machine Learning and Applications (ICMLA) 497–502 (IEEE, 2015).

Halder, S. et al. Online artifact removal for brain-computer interfaces using support vector machines and blind source separation. Comput. Intell. Neurosci. 2007 , 82069–82069 (2007).

Sundaram, S. et al. Learning the signatures of the human grasp using a scalable tactile glove. Nature 569 , 698–702 (2019).

Zhou, Z. et al. Sign-to-speech translation using machine-learning-assisted stretchable sensor arrays. Nat. Electron. 3 , 571–578 (2020).

Luca, C. D. Surface electromyography: detection and recording. DelSys Incorporated 10 , 1–10 (2002).


Acknowledgements

This research was supported by the National Natural Science Foundation of China (grant nos. 51925503, U1713218) and the Program for HUST Academic Frontier Youth Team.

Author information

These authors contributed equally: Youhua Wang, Tianyi Tang, Yin Xu.

Authors and Affiliations

State Key Laboratory of Digital Manufacturing Equipment and Technology, Huazhong University of Science and Technology, Wuhan, China

Youhua Wang, Yunzhao Bai, Lang Yin & YongAn Huang

Flexible Electronics Research Center, Huazhong University of Science and Technology, Wuhan, China

School of Mechanical and Electric Engineering, Jiangsu Provincial Key Laboratory of Advanced Robotics, Soochow University, Suzhou, China

Tianyi Tang, Yin Xu, Hongmiao Zhang & Huicong Liu

State Key Laboratory of Industrial Control Technology, Institute of Cyber Systems and Control, Zhejiang University, Hangzhou, China


Contributions

The design, preparation, and characterizations of tattoo-like electronics were completed by Y.W., Y.B. and L.Y.; the typical scenarios of SSRS were carried out by T.T. and Y.X.; the cloud-based machine-learning algorithm was developed by Y.X., T.T. and G.L.; Y.A.H., H.L., H.Z., Y.W., T.T. and Y.X. contributed to the writing of the manuscript; Y.A.H., H.L. and H.Z. supervised the overall research.

Corresponding authors

Correspondence to Hongmiao Zhang, Huicong Liu or YongAn Huang.

Ethics declarations

Competing interests.

The authors declare no competing interests.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Supplementary Information, Supplementary Video 1, Supplementary Video 2, Supplementary Video 3, Supplementary Video 4, Supplementary Video 5

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/ .


About this article

Cite this article.

Wang, Y., Tang, T., Xu, Y. et al. All-weather, natural silent speech recognition via machine-learning-assisted tattoo-like electronics. npj Flex Electron 5 , 20 (2021). https://doi.org/10.1038/s41528-021-00119-7


Received : 08 April 2021

Accepted : 30 June 2021

Published : 13 August 2021

DOI : https://doi.org/10.1038/s41528-021-00119-7




Code for voicing silent speech from EMG. Official repository for the papers "Digital Voicing of Silent Speech" at EMNLP 2020 and "An Improved Model for Voicing Silent Speech" at ACL 2021. Also includes code for converting silent speech to text.

dgaddy/silent_speech


Voicing Silent Speech

This repository contains code for synthesizing speech audio from silently mouthed words captured with electromyography (EMG). It is the official repository for the papers Digital Voicing of Silent Speech at EMNLP 2020, An Improved Model for Voicing Silent Speech at ACL 2021, and the dissertation Voicing Silent Speech . The current commit contains only the most recent model, but the versions from prior papers can be found in the commit history. On an ASR-based open vocabulary evaluation, the latest model achieves a WER of approximately 36%. Audio samples can be found here .

The repository also includes code for directly converting silent speech to text. See the section labeled Silent Speech Recognition .

The EMG and audio data can be downloaded from https://doi.org/10.5281/zenodo.4064408 . The scripts expect the data to be located in an emg_data subdirectory by default, but the location can be overridden with flags (see the top of read_emg.py ).

Force-aligned phonemes from the Montreal Forced Aligner have been included as a git submodule, which must be updated using the process described in "Environment Setup" below. Note that there will not be an exception if the directory is not found, but logged phoneme prediction accuracies of 100% are a sign that the directory has not been loaded correctly.

Environment Setup

We strongly recommend running in Anaconda. To create a new environment with all required dependencies, run conda env create -f environment.yml from the repository root (the environment file name is assumed here).

This will install with CUDA 11.8.

You will also need to pull the git submodules for Hifi-GAN and the phoneme alignment data, for example with git submodule update --init --recursive .

Use the following commands to download pre-trained DeepSpeech model files for evaluation. It is important that you use DeepSpeech version 0.7.0 model files for evaluation numbers to be consistent with the original papers. Note that more recent DeepSpeech packages such as version 0.9.3 can be used as long as they are compatible with version 0.7.x model files.

(Optional) Training will be faster if you re-run the audio cleaning, which will save re-sampled audio so it doesn't have to be re-sampled every training run.

Pre-trained Models

Pre-trained models for the vocoder and transduction model are available at https://doi.org/10.5281/zenodo.6747411 .

To train an EMG to speech feature transduction model, use

where hifigan_finetuned/checkpoint is a trained HiFi-GAN generator model (optional). At the end of training, an ASR evaluation will be run on the validation set if a HiFi-GAN model is provided.

To evaluate a model on the test set, use

By default, the scripts now use a larger validation set than was used in the original EMNLP 2020 paper, since the small size of the original set gave WER evaluations a high variance. If you want to use the original validation set you can add the flag --testset_file testset_origdev.json .

HiFi-GAN Training

The HiFi-GAN model is fine-tuned from a multi-speaker model to the voice of this dataset. Spectrograms predicted from the transduction model are used as input for fine-tuning instead of gold spectrograms. To generate the files needed for HiFi-GAN fine-tuning, run the following with a trained model checkpoint:

The resulting files can be used for fine-tuning using the instructions in the hifi-gan repository. The pre-trained model was fine-tuned for 75,000 steps, starting from the UNIVERSAL_V1 model provided by the HiFi-GAN repository. Although the HiFi-GAN is technically fine-tuned for the output of a specific transduction model, we found it to transfer quite well and shared a single HiFi-GAN for most experiments.

Silent Speech Recognition

This section is about converting silent speech directly to text rather than synthesizing speech audio. The speech-to-text model uses the same neural architecture but with a CTC decoder, and achieves a WER of approximately 28% (as described in the dissertation Voicing Silent Speech ).

You will need to install the ctcdecode library (version 1.0.3) in addition to the libraries listed above to use the recognition code. (This package cannot be built under Windows.)

And you will need to download a KenLM language model, such as this one from DeepSpeech:

Pre-trained model weights can be downloaded from https://doi.org/10.5281/zenodo.7183877 .

To train a model, run

To run a test set evaluation on a saved model, use


A continuous silent speech recognition system for AlterEgo, a silent speech interface



Silent Speech Decoding Using Spectrogram Features Based on Neuromuscular Activities

1 State Key Laboratory of Industrial Control Technology, Institute of Cyber Systems and Control, Zhejiang University, Hangzhou 310027, China; king_wy@zju.edu.cn (Y.W.); drystan@zju.edu.cn (M.Z.); 21932107@zju.edu.cn (R.W.); gao_han@zju.edu.cn (H.G.)

2 Department of Computer Science and Technology, School of Mechanical Electronic and Information Engineering, China University of Mining and Technology, Beijing 100083, China; [email protected]

Zhiyuan Luo

3 Computer Learning Research Centre, Royal Holloway, University of London, Egham Hill, Egham, Surrey TW20 0EX, UK; [email protected]

Silent speech decoding is a novel application of the Brain–Computer Interface (BCI) based on articulatory neuromuscular activities, reducing difficulties in data acquisition and processing. In this paper, spatial features and decoders that can be used to recognize the neuromuscular signals are investigated. Surface electromyography (sEMG) data are recorded from human subjects in mimed speech situations. Specifically, we propose to utilize transfer learning and deep learning methods by transforming the sEMG data into spectrograms that contain abundant information in the time and frequency domains and are regarded as channel-interactive. For transfer learning, a model of Xception pre-trained on a large image dataset is used for feature generation. Three deep learning methods, Multi-Layer Perceptron, Convolutional Neural Network and bidirectional Long Short-Term Memory, are then trained using the extracted features and evaluated for recognizing the articulatory muscles’ movements on our word set. The proposed decoders successfully recognized the silent speech, and bidirectional Long Short-Term Memory achieved the best accuracy of 90%, outperforming the other two algorithms. Experimental results demonstrate the validity of spectrogram features and deep learning algorithms.

1. Introduction

Research on Brain–Computer Interfaces (BCI) has a long history [ 1 ] and has attracted more attention for its extensive potential in the fields of neural engineering, clinical rehabilitation, daily communication and many other possible applications [ 2 , 3 , 4 ]. A typical non-invasive BCI uses electroencephalography (EEG) as it is inexpensive and easy to implement [ 5 ]. However, the difficulty in data processing still remains for practical use. One promising approach to address the challenge is the neuromuscular decoding from articulatory muscles [ 6 ]. Surface Electromyography (sEMG) captures neuromuscular activities in a non-invasive way like EEG. Besides, it only requires a few channels for signal processing due to the neural pathway from the brain to muscle acting as a primary filter and encoder [ 7 , 8 , 9 ].

In the accessible area around the face, surface electrodes are placed on articulatory muscles to obtain speech-related sEMG, both in vocal and silent speech [ 6 , 10 , 11 , 12 ]. Some other techniques are also used in the silent speech recording. Video and ultrasound imaging can record the movements of visible or invisible speech articulators straightforwardly [ 13 , 14 ]. However, they do not work in purely silent speech without any articulator motion.

The primary use of sEMG for silent speech recognition dates back to the mid-1980s, when Sugie in Japan [15] and Morse in the United States [16] independently demonstrated that sEMG signals contain speech-related information. Using simple thresholding techniques, Sugie utilized a three-channel electrode to distinguish five Japanese vowels, verifying that they could run in a pilot real-time system [15]. Later, Morse obtained linguistic information from muscle activities of the neck and head, successfully distinguishing two words [16]. In the following years, the word count expanded to ten with an accuracy of 70% [17]; however, when it increased to 17, the accuracy dropped to only 35% [18]. In 2001, Chan reported recognizing 10 English numbers from sEMG during speech, using a wavelet-transform feature set with linear discriminant analysis [19]. Later on, a group of researchers utilized sEMG to identify six commands to control an aircraft [20]. Szu-Chen studied continuous audible speech recognition using sEMG, achieving a 32% error rate by decomposing the signal into different feature spaces in the time domain [21]. In 2014, Wand used four unipolar and two bipolar electrodes to capture sEMG and achieved a best average silent speech error rate of 34.7%, with zero-crossing rate, mean value and signal power as features [9]. Early in 2018, Kapur reported a wearable silent speech interface that obtained good accuracy of around 90% using a convolutional neural network [6]. Later, Meltzner demonstrated that silent speech was recognized with high accuracy using vocal speech features on a large data set [22].

Although multiple electrodes lead to multi-channel sEMG, previous studies mostly focus on channel-wise features extracted on a channel-by-channel basis, while correlations between channels are ignored. Speech is produced by the synergistic work of the vocal system, and articulatory neuromuscular activities occur along with these physiological processes [23, 24, 25]. Even in silent speech, such signals can be recorded, and a synergistic mechanism exists among the muscles. Therefore, synergistic features from multi-channel sEMG are considered for recognizing different words. Xception, originally designed for image classification [26, 27, 28, 29], is utilized to process spectrograms of multi-channel sEMG to explore this spatial correlation.

In this paper, multi-channel sEMG of silent speech are recorded. Xception is exploited to extract spatial correlative features. Three deep learning methods, Multi-Layer Perceptron (MLP), Convolutional Neural Network (CNN) and bidirectional Long Short-Term Memory (bLSTM), are evaluated to decode silent speech.

2. Silent Speech Data

2.1. Capturing Speech-Related sEMG

Studying the relationships between vocalization and the articulatory muscles, we select suitable electrode positions around the face [6, 8, 30, 31, 32, 33], as shown in Figure 1. Channels 2 and 5 use bipolar derivation to improve the common-mode rejection ratio (CMRR), while the others are unipolar. Channels 1 and 2 record the levator anguli oris, while channel 4 captures both the extrinsic tongue muscles and the anterior belly of the digastric. Channels 3, 5 and 6 record the platysma, the extrinsic tongue muscles and the lateral pterygoid, respectively. In addition, two reference electrodes are placed on the mastoids behind the ears.

Figure 1.

Recording sites around the face and neck. These dedicated positions form an articulator muscular net to decode the silent speech. The sites are cleaned by gel to ensure the impedance is lower than 5 k Ω between electrodes and skin surface.

There is no articulator motion in silent speech, so the amplitude of the sEMG is generally below 1 mV, smaller than normal sEMG. The frequency band of silent-speech sEMG is always no more than 300 Hz. In our data recording system, the bandwidth is approximately 5 kHz and a 24-bit analog-to-digital converter (ADC) is used. Two resistor–capacitor (RC) filters, a direct current (DC) filter and a 5 kHz low-pass filter, are used to eliminate the DC bias and high-frequency interference, respectively. sEMG data are recorded at a sampling rate of 1000 Hz.

Seven students with normal vision and oral expression skills, having no history of mental illness and neurological diseases, 20 to 25 years old (average 22, four males and three females), are recruited as subjects. The experiment named “BCI research based on transmitted neural signals” has been approved by Ethics and Human and Animal Protection Committee of Zhejiang University (Ethical Approval: ZJUEHAPC2019-CSEA01), and strictly follows the Declaration of Helsinki. All collected data are only used for data analysis and the privacy of the participants are firmly protected.

The six-channel sEMG is recorded while the subjects are trained to imagine speaking the labelled words displayed on a computer screen one by one in a defined sequence, which is the meaning of silent speech in this paper. In our experiments, ten Chinese words are selected, namely ’噪’, ’1#’, ’2#’, ’前’, ’后’, ’左’, ’右’, ’快’, ’慢’, ’停’, which mean ’null’, ’No.1’, ’No.2’, ’forward’, ’backward’, ’left’, ’right’, ’accelerate’, ’decelerate’, ’stop’ in English, respectively. In total, 69,296 valid samples of the ten words are recorded, and the label distribution varies, as shown in Table 1. Figure 2 illustrates a valid six-channel sEMG example.

Figure 2.

An example of six-channel surface electromyography (sEMG) when imagining to speak ‘decelerate’ in Chinese.

Table 1. Valid samples.

Label     ‘0’    ‘1’    ‘2’    ‘3’    ‘4’    ‘5’    ‘6’    ‘7’    ‘8’    ‘9’
Word      噪     1#     2#     前     后     左     右     快     慢     停
Samples   7964   6707   6814   6978   6593   6510   6682   6883   7614   6524

2.2. Preprocessing

An 8th-order Butterworth bandpass filter (0.15∼300 Hz) was applied to remove the DC bias and high-frequency content of the sEMG. The power-line frequency of 50 Hz and its harmonics were filtered by a comb notch filter [6, 34, 35, 36]. The filtered sEMG is then obtained, as shown in Figure 3b.
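
A minimal sketch of this filtering stage in Python (the notch quality factor, the set of harmonics and the use of zero-phase filtering are illustrative assumptions):

```python
# Sketch of the preprocessing described above: an 8th-order Butterworth band-pass
# (0.15-300 Hz) plus notch filters at 50 Hz and its harmonics, at fs = 1000 Hz.
from scipy.signal import butter, sosfiltfilt, iirnotch, filtfilt

def bandpass_and_notch(sig, fs=1000):
    sos = butter(8, [0.15, 300], btype="bandpass", fs=fs, output="sos")
    out = sosfiltfilt(sos, sig)                       # zero-phase band-pass
    for f0 in (50, 100, 150, 200, 250):               # power-line frequency and harmonics
        b, a = iirnotch(f0, Q=30, fs=fs)              # Q = 30 is an illustrative choice
        out = filtfilt(b, a, out)
    return out
```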

Figure 3.

Preprocessing of sEMG. ( a ) An example of raw sEMG, corresponding to channel 2 in Figure 2 ; ( b ) The sEMG filtered by Butterworth (0.15∼300 Hz) and notch (50 Hz) filters; ( c ) Quadratic Variation Reduction (QVR)-processed sEMG, where the most amplitude change is less than 1 mV.

In order to remove the baseline drift, the Quadratic Variation Reduction (QVR) [37] method is applied:

$$z = \tilde{z} - \left(I + \lambda D^{\mathsf{T}} D\right)^{-1}\tilde{z}, \tag{1}$$

where $\tilde{z}$ and $z$ denote the signal before and after QVR, $\lambda$ is a constant ($\lambda = 100$), $I$ represents the identity matrix and $D$ is the $(n-1)\times n$ first-difference matrix

$$D = \begin{pmatrix} 1 & -1 & & & \\ & 1 & -1 & & \\ & & \ddots & \ddots & \\ & & & 1 & -1 \end{pmatrix}, \tag{2}$$

where $n$ is the length of $\tilde{z}$.

In Equation (1), $(I + \lambda D^{\mathsf{T}} D)$ is a symmetric, positive-definite, tridiagonal matrix, so the linear system can be solved efficiently. The effect is shown in Figure 3c, where it can be seen that most of the baseline wander is removed.
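
A minimal sketch of QVR as written in Equation (1), using a sparse tridiagonal solve (the subtraction of the smoothed component and λ = 100 follow the reconstruction above; scipy is assumed available):

```python
# Sketch of Quadratic Variation Reduction baseline removal:
# z = z_tilde - (I + lambda * D^T D)^{-1} z_tilde, with D the first-difference matrix.
import numpy as np
from scipy import sparse
from scipy.sparse.linalg import spsolve

def qvr(z_tilde, lam=100.0):
    n = len(z_tilde)
    D = sparse.diags([np.ones(n - 1), -np.ones(n - 1)], offsets=[0, 1], shape=(n - 1, n))
    A = sparse.identity(n) + lam * (D.T @ D)       # symmetric, positive-definite, tridiagonal
    smooth = spsolve(A.tocsc(), z_tilde)           # low-frequency (baseline) component
    return z_tilde - smooth                        # baseline-removed sEMG
```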

3. Processing Methods

In order to extract time–frequency features effectively, the original six-channel sEMG in the time domain was transformed into the frequency domain, creating a spectrogram that is represented as an image. The state-of-the-art model Xception was selected for extracting image features, which were then decoded by MLP, CNN and bLSTM, respectively. Figure 4 describes the processes used to decode the sEMG.

Figure 4.

Silent speech decoding. ( a ): The neuromuscular activities are captured by surface electrodes; ( b ): All data are transformed into spectrograms by short-time Fourier transform (STFT); ( c ): Transfer learning method is used to extract features from spectrograms; ( d ): Neural networks decode multi-channel sEMG using the extracted features.

3.1. Spectrogram Images

The spectrogram of a signal sequence is the visual representation of the magnitude of the time-dependent Fourier Transform (FT) versus time, also known as the short-time Fourier transform (STFT) [ 38 , 39 , 40 ]. It describes the spectral details in time-frequency domain.

The spectrogram was calculated by Equation (3) [38], with the parameters [window, window length, sample rate, overlap, FFT length] specified as [Hanning, 512, 1000 Hz, 50%, 64]. An example of a spectrogram image is shown in Figure 5. The per-channel images are associated with each other, reflecting the spatial relationships of the sEMG in the frequency domain. Inspired by short video streams, the images were treated as a fixed-size video; silent speech decoding then becomes a video classification problem, which is explored with deep learning methods.
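
A sketch of the per-channel spectrogram computation (the Hann window, 512-sample segments, 50% overlap and 1000 Hz sampling follow the listed parameters; the log scaling and the handling of the stated 64-point FFT length are assumptions):

```python
# Sketch: per-channel spectrogram image via the short-time Fourier transform.
import numpy as np
from scipy.signal import spectrogram

def channel_spectrogram(sig, fs=1000):
    f, t, Sxx = spectrogram(sig, fs=fs, window="hann", nperseg=512, noverlap=256)
    return 10 * np.log10(Sxx + 1e-12)   # log-magnitude image (frequency x time)
```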

Figure 5.

An example of a spectrogram image.

3.2. Feature Extraction

To explore sEMG spatial features, transfer learning with Xception is used. Xception is a deep learning image classifier built from depthwise separable convolution layers with residual connections, pre-trained on a large-scale image dataset [26]. After the input, a pointwise (1 × 1) convolution is followed by separate 3 × 3 convolutions, applied to non-overlapping sections of the output channels without average pooling, whose outputs are then concatenated and fed forward [26, 27]. The model demonstrates a strong ability to generalize to images outside the original dataset via transfer learning, such as feature extraction and fine-tuning. Fine-tuning is done by training all weights with a smaller learning rate, removing and updating some biased weights from the original network.

The spectrogram images have various shapes and are scaled to 299 × 299. The Xception model outputs 1000 features for each image; therefore, 1000 × 6 = 6000 features are obtained for one sEMG sample. All samples are processed with Xception to generate a large feature set.
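
A sketch of this feature-generation step using the pre-trained Keras Xception model (the resizing and batching details are illustrative):

```python
# Sketch: build a 6000-dimensional feature vector for one sEMG sample by running
# the six per-channel spectrogram images through pre-trained Xception.
import numpy as np
import tensorflow as tf
from tensorflow.keras.applications.xception import Xception, preprocess_input

xception = Xception(weights="imagenet", include_top=True)    # 1000-dimensional output per image

def sample_features(spectrogram_images):
    """spectrogram_images: six (H, W, 3) arrays, one per sEMG channel."""
    feats = []
    for img in spectrogram_images:
        x = tf.image.resize(img, (299, 299)).numpy()          # scale to the Xception input size
        x = preprocess_input(x[np.newaxis, ...])               # Xception-specific preprocessing
        feats.append(xception.predict(x, verbose=0)[0])        # 1000 features per image
    return np.concatenate(feats)                               # 6 x 1000 = 6000 features
```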

3.3. Decoder Design

Three deep learning methods, namely MLP, CNN and bLSTM, are explored using the above feature set. Their structures and parameters are described in this section [41, 42, 43]. Figure 6 illustrates our decoding process, where parts (c)–(g) represent the structures and components common to the three models, except that different hidden layers and parameters are used in each model.

Figure 6.

Decoding processes. ( a ) Spectrogram images. ( b ) Feature set extracted by Xception. ( c ) Input layer of neural networks. ( d ) Hidden layers of neural networks. ( e ) Fully connected dense layer. ( f ) Softmax layer as the output layer. ( g ) The predicted labels we obtain from the models.

Multi-Layer Perceptron (MLP) is a common Artificial Neural Network (ANN). In addition to the input and output layers, there can be multiple hidden layers. MLP can also be thought of as a directed graph consisting of multiple layers, each fully connected to the next layer [ 44 , 45 ].

Figure 7 illustrates the MLP structure where the ’dense’ layer connects each input unit with each output unit of the layer to learn and update the weights. ’Dropout’ regularization is used to help prevent overfitting as it randomly drops out input units with a fixed rate during parameter tuning [ 46 ]. ‘Softmax’ calculates predicted label probabilities at the output layer and then outputs the label with the maximum probability. The loss function defined in this method is cross-entropy loss.

Figure 7.

MultiLayer Perceptron (MLP) architecture to decode silent speech. A feature vector goes through the layers and a digit (from 0 to 9) is output.

Each hidden layer uses a non-linear activation function to enhance the performance of the neural network and solve linearly inseparable problems. Commonly used activation functions are sigmoid, tanh, and the rectified linear unit (ReLU). ReLU is used in the MLP because its derivative is always 1 in the positive interval, alleviating the vanishing- and exploding-gradient problems. In addition, ReLU converges much faster than sigmoid and tanh.
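
A sketch of an MLP decoder consistent with this description (the layer widths are assumptions; dropout 0.2, ReLU, the 10-way softmax, cross-entropy loss and the Adam optimizer follow the text and Table 2):

```python
# Sketch of an MLP decoder for the 6000-dimensional Xception feature vectors.
import tensorflow as tf
from tensorflow.keras import layers, models

def build_mlp(input_dim=6000, n_classes=10):
    model = models.Sequential([
        layers.Input(shape=(input_dim,)),
        layers.Dense(512, activation="relu"),
        layers.Dropout(0.2),
        layers.Dense(256, activation="relu"),
        layers.Dropout(0.2),
        layers.Dense(n_classes, activation="softmax"),
    ])
    model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])
    return model
```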

A Convolutional Neural Network (CNN) features local connections and shared weights, making it very popular and successful in image classification problems. The core operation of a CNN is mathematical convolution with filters: the convolution is applied to the input data to produce a feature map, and specifically designed filters can extract features via convolution [47, 48, 49].

The CNN structure is shown in Figure 8 , where two convolutional layers (Conv1 and Conv2) with different filters are used to create specific feature maps. The pooling layer provides downsampling to reduce the size of features and also helps prevent overfitting. Max pooling that calculates the maximum value for each patch is used in our CNN architecture.

Figure 8.

Convolutional Neural Network (CNN) architecture to decode silent speech.

In the neural networks, the output of the first layer feeds into the second layer, and the output of the second layer feeds into the third, and so on. When the parameters of a layer change, so does the distribution of inputs to subsequent layers [ 50 ], which is described as an internal covariate shift. These shifts in input distribution can be problematic for neural networks, especially deep neural networks that have a large number of layers [ 51 ]. Batch normalization, a technique to standardize the inputs to a layer and reduce unwanted shifts to speed up training [ 52 ], is used in the CNN model.
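
A sketch of a CNN decoder consistent with this description (reshaping the 6000 Xception features into a 6 × 1000 × 1 input and the filter counts are assumptions; two convolutional layers, max pooling, batch normalization, dropout 0.5 and the Adadelta optimizer follow the text and Table 2):

```python
# Sketch of a CNN decoder with two convolutional layers, pooling and batch normalization.
import tensorflow as tf
from tensorflow.keras import layers, models

def build_cnn(n_channels=6, feats_per_channel=1000, n_classes=10):
    model = models.Sequential([
        layers.Input(shape=(n_channels, feats_per_channel, 1)),
        layers.Conv2D(32, (3, 3), padding="same", activation="relu"),   # Conv1
        layers.BatchNormalization(),
        layers.MaxPooling2D(pool_size=(2, 2)),
        layers.Conv2D(64, (3, 3), padding="same", activation="relu"),   # Conv2
        layers.BatchNormalization(),
        layers.MaxPooling2D(pool_size=(2, 2)),
        layers.Flatten(),
        layers.Dense(128, activation="relu"),
        layers.Dropout(0.5),
        layers.Dense(n_classes, activation="softmax"),
    ])
    model.compile(optimizer="adadelta", loss="sparse_categorical_crossentropy", metrics=["accuracy"])
    return model
```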

3.3.3. bLSTM

Recurrent Neural Network (RNN) is well-known for processing sequence data, and has made many significant accomplishments in natural language processing applications. Unlike MLP and CNN, the output of each hidden layer in RNN are stored as memory and can be considered as another input, by which it allows information to persist [ 47 , 49 , 53 ]. However, RNNs suffer from short-term memory. If a sequence is long enough, they will have a hard time carrying information from earlier time steps to later ones. During back propagation, the gradient vanishing in RNN is a serious problem when learning long-term dependencies [ 53 ]. The gradient shrinks in back propagation and it does not contribute much to learning if it becomes extremely small [ 54 ].

LSTM, a special kind of RNN, addresses this issue by considering that both memory and input operations are addition only. As a result, it is capable of learning long-term dependencies [ 49 ]. The core concept of LSTM is the cell state and its various gates. The cell state acts as a transport highway that transfers relative information all the way down the sequence chain. LSTM has the ability to remove or add information by gates. There are the forget gate, input gate and output gate to regulate the flow of information inside the LSTM unit and learn which data in a sequence is important to keep or dismiss.

bLSTM, including forward LSTM and backward LSTM, captures bidirectional semantic dependencies [ 44 , 54 ]. For six-channel sEMG, bLSTM tends to be a suitable classifier as it can effectively model bidirectional dependencies. Figure 9 shows details of the bLSTM architecture, consisting of three bidirectional layers, two dense layers and one softmax output layer.
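
A sketch of a bLSTM decoder matching this description (treating the six 1000-feature channel vectors as a length-6 sequence, and the unit counts, are assumptions; three bidirectional layers, two dense layers, a softmax output, dropout 0.2 and the RMSprop optimizer follow the text and Table 2):

```python
# Sketch of a bLSTM decoder with three bidirectional layers, two dense layers and softmax.
import tensorflow as tf
from tensorflow.keras import layers, models

def build_blstm(n_channels=6, feats_per_channel=1000, n_classes=10):
    model = models.Sequential([
        layers.Input(shape=(n_channels, feats_per_channel)),
        layers.Bidirectional(layers.LSTM(128, return_sequences=True)),
        layers.Bidirectional(layers.LSTM(128, return_sequences=True)),
        layers.Bidirectional(layers.LSTM(128)),
        layers.Dense(128, activation="relu"),
        layers.Dropout(0.2),
        layers.Dense(64, activation="relu"),
        layers.Dense(n_classes, activation="softmax"),
    ])
    model.compile(optimizer="rmsprop", loss="sparse_categorical_crossentropy", metrics=["accuracy"])
    return model
```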

Figure 9.

Bidirectional Long Short-Term Memory (bLSTM) architecture to decode silent speech.

4.1. Decoder Optimization

For model training and testing purposes, the data set is randomly split at the ratio training:validation:test = 7:2:1. The structures and parameters of MLP, CNN and bLSTM are optimized based on a series of trials. A number of experiments were run to explore optimal hyperparameters, including the dropout rate, learning rate, and network depth. Figure 10a shows that the best dropout rate for MLP and bLSTM is 0.2, while for CNN it is 0.5. The learning rate controls how much the weights in neural networks are adjusted with respect to the loss gradient [55, 56]. To explore a better initial learning rate in the decaying scheme, further experiments were implemented; Figure 10b indicates that an initial learning rate of 1 × 10⁻³ suits all three methods.

Figure 10.

Dropout and learning rate optimization.

Different depths of these networks are also tried while the other parameters remain the same. More layers lead to decreased prediction performance or overfitting, whereas fewer layers may not be sufficiently trained. Figure 7, Figure 8 and Figure 9 provide more details of the final topologies with suitable depths.

4.2. Decoding Results

The features are trained, validated and tested by MLP, CNN and bLSTM, respectively. The key hyperparameters of the deep learning models are displayed in Table 2 . The same initial learning rate, activation function and batch size are used in the three decoders, while different optimizers and dropout rates are applied.

Table 2. Hyperparameters.

Model    Optimizer   Dropout   Learning Rate   Activation   Batch Size
MLP      adam        0.2       1 × 10⁻³        ReLU         32
CNN      adadelta    0.5       1 × 10⁻³        ReLU         32
bLSTM    rmsprop     0.2       1 × 10⁻³        ReLU         32

MLP, CNN and bLSTM are implemented in Keras (on top of TensorFlow), which offers many flexible functional APIs to build and optimize deep learning structures and parameters. Batch normalization is applied for all models to obtain smaller training and validation loss. In particular, the function ‘ReduceLROnPlateau’ is called to reduce the learning rate, with factor = 0.2, patience = 20 and min_lr = 0.5 × 10⁻⁶. In early stopping, patience is set to 80, which means training is stopped if the loss does not decrease after 80 epochs.
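
The two callbacks can be sketched as follows (restore_best_weights is an added assumption; the other parameters follow the text):

```python
# Sketch of the learning-rate schedule and early stopping described above.
import tensorflow as tf

callbacks = [
    tf.keras.callbacks.ReduceLROnPlateau(monitor="val_loss", factor=0.2,
                                         patience=20, min_lr=0.5e-6),
    tf.keras.callbacks.EarlyStopping(monitor="val_loss", patience=80,
                                     restore_best_weights=True),   # assumption, not stated in the text
]
# model.fit(X_train, y_train, validation_data=(X_val, y_val),
#           batch_size=32, callbacks=callbacks)
```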

Figure 11 shows how the learning rate changes with epochs during training. All three models are initialized with the same learning rate, which is then decayed at different epochs. bLSTM takes more than 250 epochs to train, MLP takes fewer, and CNN consumes the fewest epochs.

Figure 11.

Learning rate of prediction in decoders.

Training profiles are provided in Figure 12 . Figure 12 a,d give the training details of MLP, where the accuracy becomes stable around 150 epochs and the validation loss stays about 0.45. In Figure 12 b,e, CNN training achieves a little better validation results than MLP but a large number of epochs is required. bLSTM shows the best validation accuracy of 0.92 in Figure 12 c and the lowest validation loss of 0.26 in Figure 12 f, however, its computational efficiency is not as good as those of MLP and CNN since bLSTM needs a large number of epochs to complete the training. The validation performance lines generally follow the training processes, which means the models are generally well-trained without obvious overfitting or underfitting.

Figure 12.

Training profile on the feature set by three deep learning models. ( a ) and ( d ): training on MLP. ( b ) and ( e ): training on CNN. ( c ) and ( f ): training on bLSTM. Both training and validation results are shown in the above sub-figures.

The accuracy of MLP, CNN and bLSTM on the test set is 0.85, 0.87 and 0.90, respectively. Both training and test results indicate that bLSTM achieves the best performance among the three methods, though it takes a longer time to train.

The confusion matrix is computed to show more prediction details on the test set, as shown in Figure 13. Labels 0 and 8 achieve the highest accuracy in all test predictions, while labels 1, 5 and 6 have relatively low accuracy. Except for label 5, the accuracy of all other labels increases from Figure 13a to Figure 13c. Samples are more likely to be classified as labels 0 or 8. In addition, all three decoders have equal difficulty in distinguishing label 4 from label 6, which may be caused by similar neuromuscular activities.

Figure 13.

Confusion matrices of the three decoders.

5. Discussion

The number of valid samples varies across labels, due to (1) the impedance between an electrode and the skin surface changing between experiments, even for the same participant; (2) inherent differences in the speech intention of participants; and (3) different responses of the neuromuscular activities to individual words during silent speech recording. The data set shown in Table 1 is still acceptable even though a small label imbalance exists, because each label is fully trained. Impedance reduction and preprocessing-algorithm optimization will be studied further to increase the rate of valid samples.

MLP, CNN and bLSTM are trained and applied to decode the sEMG on the same platform (Intel i5-7400 CPU @ 3 GHz). bLSTM obtains the highest accuracy, around 0.92, with the largest time consumption of almost 10 h. For CNN, the performance is not as good as bLSTM (0.88), but it consumes the least time (6 h) for proper model training. Though MLP takes less time (8 h) than bLSTM, its accuracy of 0.87 is the worst among the three models. The bidirectional structure of bLSTM generates better decoding results than MLP and CNN; therefore, bLSTM suits silent speech recognition when training time is less important. In test experiments, it takes no more than 50 ms to predict a new sEMG sample with any of the three models, which means predictions are fast enough for a real-time system.

The output of silent speech decoding can take two forms, text and synthetic speech [9, 57], depending on the practical requirements. The speech pattern appears only in sEMG form, whether speech is audible or silent, so privacy is ensured for the subject.

Silent speech decoding as investigated in this paper is promising for several applications: medical prostheses that help people with speech disabilities, hands-free peripheral device control, and communication in private or noisy environments [22, 57, 58, 59]. The single-word accuracy is not yet high enough for practical use, and communication also requires more complicated expression than single words. Semantic dependency may help silent speech recognition in such applications, so phrases or even sentences may need to be researched.

Currently, 10 electrodes (2 for ground, 2 pairs of bipolar and 4 monopolar electrodes) are needed for 6 channels, and an integrated electrode array will be developed to improve wearability. Furthermore, the electrode positions and channel number might be optimized to improve performance and simplify the data-collection device. Online learning is another possible direction for future research, as it is useful for data augmentation.

6. Conclusions

This paper demonstrates that spectrogram features combined with deep learning models can be applied to the silent speech decoding task, with bLSTM outperforming the other methods. The result analysis indicates that the synergic information hidden in multi-channel sEMG provides useful features for recognition. It is suggested that this synergic exploration of silent speech decoding be extended from single words to phrases or even sentences.
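
As a hedged illustration of the kind of spectrogram feature extraction the conclusion refers to, the sketch below converts a multi-channel sEMG recording into per-frame log-spectrogram features with SciPy; the sampling rate, window length and overlap are assumed values, not the parameters used in the paper.

```python
# Illustrative spectrogram feature extraction for multi-channel sEMG.
# Sampling rate, window length and overlap are assumptions for this sketch.
import numpy as np
from scipy.signal import spectrogram

FS = 1000                           # sEMG sampling rate in Hz (assumed)
emg = np.random.randn(6, 2 * FS)    # placeholder: 6 channels, 2 s of signal

channel_features = []
for channel in emg:
    f, t, Sxx = spectrogram(channel, fs=FS, nperseg=128, noverlap=64)
    channel_features.append(np.log(Sxx + 1e-10))   # log power: (freq_bins, frames)

# Stack channels and put time first: (frames, channels * freq_bins)
X = np.concatenate(channel_features, axis=0).T
print(X.shape)   # a frame sequence ready for a model such as the bLSTM sketch above
```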

Acknowledgments

The authors would like to thank all participants who contributed to our project; special thanks go to Wei Zhang, Huiyan Li, and Qing Ai for their kind help with data collection.

Author Contributions

Conceptualization, Y.W. and G.L.; methodology, M.Z. and Z.L.; software, Y.W. and Z.M.; validation, M.Z.; formal analysis, Y.W. and M.Z.; investigation, M.Y. and M.Z.; resources, Z.L.; data curation, M.Z., R.W. and H.G.; writing—original draft preparation, M.Z., R.W. and H.G.; writing—review and editing, G.L. and M.Z.; visualization, G.L.; supervision, Y.W.; project administration: G.L.; funding acquisition, G.L. and Y.W. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by Zhejiang University Education Foundation Global Partnership Fund and the Fundamental Research Funds for the Central Universities.

Conflicts of Interest

The authors declare no conflict of interest and do not hold relevant patents.

Digital Voicing of Silent Speech

David Gaddy, Dan Klein


[Digital Voicing of Silent Speech](https://aclanthology.org/2020.emnlp-main.445) (Gaddy & Klein, EMNLP 2020)

  • David Gaddy and Dan Klein. 2020. Digital Voicing of Silent Speech. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 5521–5530, Online. Association for Computational Linguistics.

EchoSpeech: Continuous Silent Speech Recognition on Minimally-obtrusive Eyewear Powered by Acoustic Sensing


Cited By

  • Zhang Q Lan Y Guo K Wang D (2024) Lipwatch: Enabling Silent Speech Recognition on Smartwatches using Acoustic Sensing Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies 10.1145/3659614 8 :2 (1-29) Online publication date: 15-May-2024 https://dl.acm.org/doi/10.1145/3659614
  • Yang Y Chen T Huang Y Guo X Shangguan L (2024) MAF: Exploring Mobile Acoustic Field for Hand-to-Face Gesture Interactions Proceedings of the CHI Conference on Human Factors in Computing Systems 10.1145/3613904.3642437 (1-20) Online publication date: 11-May-2024 https://dl.acm.org/doi/10.1145/3613904.3642437
  • Pandey L Arif A (2024) MELDER: The Design and Evaluation of a Real-time Silent Speech Recognizer for Mobile Devices Proceedings of the CHI Conference on Human Factors in Computing Systems 10.1145/3613904.3642348 (1-23) Online publication date: 11-May-2024 https://dl.acm.org/doi/10.1145/3613904.3642348

Index Terms: Computing methodologies; Artificial intelligence; Natural language processing; Speech recognition; Human-centered computing; Human computer interaction (HCI); Interaction techniques; Gestural input; Ubiquitous and mobile computing; Ubiquitous and mobile computing systems and tools

Recommendations

HPSpeech: Silent Speech Interface for Commodity Headphones

We present HPSpeech, a silent speech interface for commodity headphones. HPSpeech utilizes the existing speakers of the headphones to emit inaudible acoustic signals. The movements of the temporomandibular joint (TMJ) during speech modify the ...

Lipwatch: Enabling Silent Speech Recognition on Smartwatches using Acoustic Sensing

Silent Speech Interfaces (SSI) on mobile devices offer a privacy-friendly alternative to conventional voice input methods. Previous research has primarily focused on smartphones. In this paper, we introduce Lipwatch, a novel system that utilizes acoustic ...

Small-vocabulary speech recognition using a silent speech interface based on magnetic sensing

This paper reports on word recognition experiments using a silent speech interface based on magnetic sensing of articulator movements. A magnetic field was generated by permanent magnet pellets fixed to relevant speech articulators. Magnetic field ...


Author Tags

  • Acoustic Sensing
  • Silent Speech Recognition
  • Smart Glasses

  • Dong X Chen Y Nishiyama Y Sezaki K Wang Y Christofferson K Mariakakis A (2024) ReHEarSSE: Recognizing Hidden-in-the-Ear Silently Spelled Expressions Proceedings of the CHI Conference on Human Factors in Computing Systems 10.1145/3613904.3642095 (1-16) Online publication date: 11-May-2024 https://dl.acm.org/doi/10.1145/3613904.3642095
  • Cai Z Ma Y Lu F (2024) Robust Dual-Modal Speech Keyword Spotting for XR Headsets IEEE Transactions on Visualization and Computer Graphics 10.1109/TVCG.2024.3372092 30 :5 (2507-2516) Online publication date: May-2024 https://doi.org/10.1109/TVCG.2024.3372092
  • Sofronievski B Kiprijanovska I Stankoski S Sazdov B Kjosev J Nduka C Gjoreski H (2024) Efficient Real-time On-the-edge Facial Expression Recognition using Optomyography Smart Glasses 2024 International Conference on Intelligent Environments (IE) 10.1109/IE61493.2024.10599896 (49-55) Online publication date: 17-Jun-2024 https://doi.org/10.1109/IE61493.2024.10599896
  • Zhang R Chen H Agarwal D Jin R Li K Guimbretière F Zhang C (2023) HPSpeech: Silent Speech Interface for Commodity Headphones Proceedings of the 2023 ACM International Symposium on Wearable Computers 10.1145/3594738.3611365 (60-65) Online publication date: 8-Oct-2023 https://dl.acm.org/doi/10.1145/3594738.3611365
  • Sun R Zhou X Steeper B Zhang R Yin S Li K Wu S Tilsen S Guimbretiere F Zhang C (2023) EchoNose: Sensing Mouth, Breathing and Tongue Gestures inside Oral Cavity using a Non-contact Nose Interface Proceedings of the 2023 ACM International Symposium on Wearable Computers 10.1145/3594738.3611358 (22-26) Online publication date: 8-Oct-2023 https://dl.acm.org/doi/10.1145/3594738.3611358
  • Gemicioglu T Winters R Wang Y Gable T Tashev I (2023) TongueTap: Multimodal Tongue Gesture Recognition with Head-Worn Devices Proceedings of the 25th International Conference on Multimodal Interaction 10.1145/3577190.3614120 (564-573) Online publication date: 9-Oct-2023 https://dl.acm.org/doi/10.1145/3577190.3614120




Rashida Tlaib Holds ‘War Criminal’ Sign During Netanyahu Speech

By Nia Prater

On Wednesday, Israeli prime minister Benjamin Netanyahu gave a defiant address to a joint session of Congress, defending his country’s actions in Gaza and urging lawmakers to continue the United States’s longstanding support of Israel.

“If you remember one thing from this speech, remember this: Our enemies are your enemies. Our fight is your fight, and our victory will be your victory!” Netanyahu said to applause.

But though the prime minister spoke of unity between the two nations, his presence in Washington, D.C., was notably divisive, prompting thousands of protesters to march through the streets of the capital, as well as boycotts of Netanyahu’s speech, mostly by Democratic lawmakers.

Representative Rashida Tlaib, the only Palestinian American member in the House, attended the speech after criticizing congressional leadership’s decision to invite him to speak. Tlaib remained seated through the entirety of Netanyahu’s address as her colleagues frequently stood and applauded his words. At times, she was seen holding up a small black sign with white lettering that read “GUILTY OF GENOCIDE” on one side and “WAR CRIMINAL” on the other.

Tlaib’s action prompted reactions from her colleagues and a talk from a House floor manager, according to Punchbowl News.

House Speaker Mike Johnson had warned members ahead of Netanyahu’s planned appearance that any potential disruption during his speech could result in arrest by the sergeant at arms. In the end, the only protests were silent. Three people in the House gallery wearing shirts that read “Seal The Deal Now” — referring to a possible deal to release the hostages currently being held by Hamas — were removed from the chamber by U.S. Capitol Police. According to reports, they are family members of some of the hostages.

Axios reports that nearly half of congressional Democrats declined to attend the speech, including former House Speaker Nancy Pelosi, Senator Bernie Sanders, and Congresswoman Alexandria Ocasio-Cortez. Some who did attend made sure to make their opposition to Netanyahu known. New York representative Jerry Nadler called the speech a “cynical stunt” in a statement. Senate Majority Leader Chuck Schumer, who has called for Netanyahu to be replaced, did not applaud when the prime minister entered the House chamber, and the two men did not shake hands when they crossed paths.


Middle East Crisis: Harris Expresses Support for Israel but Says She ‘Will Not Be Silent’ About Palestinian Suffering


  • Antigovernment protests in Tel Aviv calling for the end of the conflict and for a hostage deal. Abir Sultan/EPA, via Shutterstock
  • A Palestinian woman being carried into Nasser Hospital after strikes in Khan Younis, in southern Gaza. Bashar Taleb/Agence France-Presse — Getty Images
  • Palestinian children receiving aid at Nasser Hospital in Khan Younis. Bashar Taleb/Agence France-Presse — Getty Images
  • Israeli bulldozers demolishing a Palestinian home under construction in the Israeli-occupied West Bank. Hazem Bader/Agence France-Presse — Getty Images
  • The funeral of an Israeli soldier in Tel Aviv. Abir Sultan/EPA, via Shutterstock

Harris offers support for Israel but calls out Palestinians’ plight after Netanyahu meeting.

Vice President Kamala Harris offered Prime Minister Benjamin Netanyahu strong support for Israel’s right to defend itself from terrorism on Thursday but declared that “far too many innocent civilians” had died in Gaza and that “I will not be silent” about their suffering.

In what amounted to her debut on the world stage since her rapid ascension as the presumptive Democratic nominee for president, Ms. Harris sought to strike a balance and capture what she called “the complexity” of the strife in the Middle East. But while she did not stray from President Biden on policy, she struck a stronger tone on the plight of Palestinians.

“What has happened in Gaza over the past nine months is devastating,” she told reporters after meeting with Mr. Netanyahu at the White House complex. “The images of dead children and desperate, hungry people fleeing for safety, sometimes displaced for the second, third or fourth time — we cannot look away in the face of these tragedies, we cannot allow ourselves to become numb to the suffering, and I will not be silent.”

She noted that she had also met with the families of Israeli hostages held by Hamas since its Oct. 7 terrorist attack and expressed distress for their anguish, making a point of reciting the names of each of the hostages with U.S. citizenship. “I’ve told them each time they are not alone, and I stand with them,” she said. “And President Biden and I are working every day to bring them home.”

In a sign of the changing order in Washington since Mr. Biden withdrew from the presidential race on Sunday, Ms. Harris offered the only substantive comments after Mr. Netanyahu met separately with each of them. She pressed for the conclusion of a long-delayed cease-fire deal to end the war and bring the hostages home.

Many were watching Ms. Harris, given her new role. Over the nine months since the Hamas attack, she has largely stuck close to the president’s position, although at times she has sounded more empathetic about the suffering in Gaza, leading some to conclude that she might not be as supportive of Mr. Netanyahu’s war as Mr. Biden has been.

Republicans criticized Ms. Harris for not attending the prime minister’s address to Congress on Wednesday while keeping a previously scheduled out-of-town commitment, although they had no criticism for Senator JD Vance of Ohio, their own Republican vice-presidential nominee, for also skipping the speech, citing a scheduling conflict.

Clearly determined not to let herself be painted into a corner, Ms. Harris made a point of denouncing the “despicable acts by unpatriotic protesters” who burned a flag and defaced statues with anti-Israel slogans outside the Capitol on Wednesday.

“I condemn any individuals associating with the brutal terrorist organization Hamas, which has vowed to annihilate the State of Israel and kill Jews,” she said in a written statement issued hours before her meeting with Mr. Netanyahu. “Pro-Hamas graffiti and rhetoric is abhorrent, and we must not tolerate it in our nation.”

The administration’s support of Israel’s war effort, even with the qualms Mr. Biden has expressed about the civilian toll and his suspension of a shipment of munitions, had been a thorny issue for his re-election campaign. He has faced criticism from some Democrats for not exerting more pressure on Mr. Netanyahu to limit the carnage and end the fighting.

The contrast between the prime minister’s meetings with Ms. Harris and Mr. Biden on Thursday was striking. The president greeted Mr. Netanyahu cordially in the Oval Office. “Well, welcome back, Mr. Prime Minister,” Mr. Biden said as the two sat down for what would be a 90-minute meeting. “We’ve got a lot to talk about. I think we should get to it.”

While the two have been at odds over the conduct of the war for months, Mr. Biden offered no thoughts about the situation on the ground while reporters were in the room and instead turned the floor over to Mr. Netanyahu, who used the opportunity to express gratitude now that the president is winding up his long political career.

“Mr. President, we’ve known each other for 40 years, and you’ve known every Israeli prime minister for 50 years, from Golda Meir,” Mr. Netanyahu told him. “So from a proud Jewish Zionist to a proud Irish American Zionist, I want to thank you for 50 years of public service and 50 years of support for the state of Israel. And I look forward to discussing with you today and working with you in the months ahead on the great issues before us.”

Mr. Biden grinned at the reference to him as an “Irish American Zionist” and then said he looked forward to their discussions as well. “By the way, that first meeting with Prime Minister Golda Meir, and she had an assistant sitting next to me, a guy named Rabin,” he said, referring to Yitzhak Rabin, who would later become prime minister. “That’s how far back it goes. I was only 12 then.”

Ms. Harris, by contrast, was polite but businesslike in greeting Mr. Netanyahu in her ceremonial office in the Eisenhower Executive Office Building next to the White House, and the two offered no statements in front of the cameras as they began their 40-minute meeting. When she emerged afterward to make her comments, she did so by herself, and the Israelis were surprised by her tone.

She expressed solidarity with Israel, reiterating her “unwavering commitment” to its existence and its security, and she condemned Hamas as a “brutal terrorist organization” that had started the war when it “massacred 1,200 innocent people, including 44 Americans” and “committed horrific acts of sexual violence.”

“Israel has a right to defend itself,” she said, then added pointedly, “and how it does so matters.”

John F. Kirby, a White House spokesman, played down any differences between the president and vice president on Gaza. “She’s been a full partner in our policies in the Middle East,” he told reporters before either meeting.

Mr. Biden and Mr. Netanyahu met with families of hostages held by Hamas amid renewed confidence about prospects for a cease-fire deal that would release their loved ones. Some of the hostage relatives said as they left the White House that they were convinced the American and Israeli leaders both felt urgency to bring the war to an end so that those captured during the Oct. 7 attack could come home.

“We feel probably more optimistic than we have since the first round of releases in late November, early December,” said Jonathan Dekel-Chen, the father of Sagui Dekel-Chen, who lived on the kibbutz Nir Oz.

“We got an absolute commitment from the Biden administration and from Prime Minister Netanyahu that they understand the urgency of this moment now to waste no time and to complete this deal, as it currently stands with as little change as humanly possible within,” he added.

Rachel Goldberg-Polin, whose son Hersh Goldberg-Polin was grievously injured during the attack but seen in a video released in April, said Mr. Biden’s decision to give up his re-election bid would not diminish his ability to influence events in the region.

“On the contrary, I actually think it allows the president to be laser focused on the things that are true priorities to him,” she said. “And saving human beings, cherished human beings, 115 of them, eight of whom are U.S. citizens, is one paramount issue for him.”

Mr. Kirby said that the negotiators “are closer now, we believe, than we’ve been before” but that there were still gaps. He did not blame Israel in particular for resisting. “The Israelis already have made many compromises to get us to this point,” he said. “Hamas through their interlocutors have made compromises to get us to this point. And yet we’re still not there. So there’s still a need for compromise.”

The White House meeting came a day after Mr. Netanyahu used his address to a joint meeting of Congress to denounce critics of Israel, particularly left-wing protesters he termed “useful idiots.” Police used pepper spray outside the Capitol to push back thousands of protesters, some of whom burned an American flag and marred statues with slogans like “Hamas is coming.” On Thursday, protesters were kept at a distance from the White House by a new wall of fencing beyond the normal gates as they shouted upon Mr. Netanyahu’s arrival.

Mr. Netanyahu planned to hedge his bets by making a trip to Florida to visit former President Donald J. Trump at his Mar-a-Lago estate on Friday. But Mr. Trump, who has soured on Mr. Netanyahu after their initially strong alliance, may not offer the message the prime minister wants to hear.

In an interview on Fox News on Thursday, the former president said that Israel should wrap up the war soon because it has yielded bad public relations for the country. Israel should “finish up and get it done quickly,” Mr. Trump said, “because they are getting decimated with this publicity.”

Zach Montague contributed reporting.

— Peter Baker. Peter Baker covers the White House and served briefly as the lead correspondent in Jerusalem. He first encountered Prime Minister Benjamin Netanyahu on the White House driveway after a visit with President Bill Clinton in 1996.

Trump urges Netanyahu to end the war in Gaza ahead of Friday meeting.

Republicans in Congress applauded often when Israel’s prime minister, Benjamin Netanyahu, spoke at the Capitol on Wednesday. But the Republican nominee for president, Donald J. Trump, appeared less impressed with Israel’s messaging the next day.

Israel must end the war in Gaza “and get it done quickly,” Mr. Trump said in an interview on Fox News on Thursday. He argued that Israel was “getting decimated” by negative publicity over its conduct of the war, set off by the Oct. 7 Hamas-led attack on Israel. Since then, more than 39,000 Palestinians have been killed, according to Gazan health authorities, and the war has wreaked widespread disease, hunger and destruction.

“Israel is not very good at public relations,” Mr. Trump said.

The comments came a day before a scheduled meeting on Friday between the former president and Mr. Netanyahu at Mar-a-Lago, Mr. Trump’s private residence and club in Palm Beach, Fla. But it is not clear that the Israeli prime minister — who praised the former president in his congressional address — would agree with Mr. Trump about wrapping up the conflict.

In his speech to U.S. lawmakers, Mr. Netanyahu vowed that Israel would fight until Hamas was eradicated. He did not say what many Israelis, especially the relatives of hostages in Gaza, wanted to hear: that he would close a cease-fire deal with Hamas to end the war and return about 115 people taken from Israel on Oct. 7 who remain in Gaza, several dozen of whom are believed to be dead.

On Monday, the Israeli military announced that two of the remaining hostages were dead. On Thursday, the Israeli military announced that five hostages’ bodies had been found in tunnels in an operation in Khan Younis and returned to Israel from Gaza.

The steady drumbeat of bad news about the captives underscores the urgency of a deal for the hostages’ relatives, some of whom met with Mr. Netanyahu in Washington this week, including at a gathering with President Biden at the White House on Thursday. They expressed optimism about the possibility of a deal when they emerged from the meeting, and told reporters in a briefing that Mr. Netanyahu understood the urgency of the need for a cease-fire.

It is a point that Mr. Trump may make at Mar-a-Lago, too, telling Mr. Netanyahu what he told Fox News: “Finish up.”

— Ephrat Livni


Israeli forces have retrieved from Gaza the bodies of 5 people killed on Oct. 7.

Israeli forces retrieved the bodies of five Israelis held in Gaza, the Israeli military said on Thursday, amid growing international pressure for a cease-fire deal that would involve the release of the remaining captives.

The bodies were found on Wednesday in a tunnel shaft in a Khan Younis zone that Israel previously designated as a humanitarian area where Gazan civilians could go to avoid the fighting and to receive aid, the Israeli military said. The shaft was nearly 220 yards long and more than 20 yards underground, with several rooms, the military said.

Israel has said that Hamas has exploited the “humanitarian zone” to launch rockets at Israel, as well as use it for other military purposes. Aid groups have lamented that Israel has occasionally struck the area, despite telling Gazans they would be safer there. There was no immediate response from Hamas.

Israel has been carrying out a new operation in Khan Younis this week, using tanks and fighter jets to strike what it has described as Hamas infrastructure in the southern Gaza city. Rear Adm. Daniel Hagari, the Israeli military spokesman, told reporters the renewed offensive aimed in part to “enable the operation” to retrieve the bodies.

Dozens of people have been killed during the Israeli assault on Khan Younis, the Gazan Health Ministry has reported. Many also fled their homes as the Israeli bombardment intensified, while others elected to stay, hoping they would be safer in their houses than in tents. Admiral Hagari said that Israeli forces had killed “many terrorists.”

The five people whose bodies were recovered — Maya Goren, 56; Tomer Ahimas, 20; Kiril Brodski, 19; Oren Goldin, 33; and Ravid Katz, 51 — were killed during the Hamas-led attacks on Oct. 7 and were taken back to Gaza to be held as bargaining chips, Israeli officials said. They are considered hostages by the Israeli government.

Mr. Brodski and Mr. Ahimas were soldiers who fell during the attacks, while the other three were civilians.

Ms. Goren was a teacher from Nir Oz, one of the hardest-hit communities near the Gaza border; her husband was also killed on Oct. 7. Mr. Katz, also from Nir Oz, was a father of three children. The body of Mr. Goldin, a member of a nearby village’s civil response squad, was taken, along with that of his brother-in-law Tal Haimi, whose body is still in Gaza.

The Israeli military said that intelligence — including information from detained Palestinian militants — had guided forces to the tunnel.

More than 250 people were abducted during the Hamas-led attack on Oct. 7, according to Israel, and 105 were released during a brief cease-fire in November. Israeli officials say 115 hostages remain in Gaza, including roughly 40 who are presumed dead.

The return home of the hostages’ remains in body bags added to the domestic political pressure on Prime Minister Benjamin Netanyahu to end the war, even as he was visiting Washington and in a speech to Congress gave a full-throated defense of Israel’s military operations in Gaza.

“The war in Gaza could end tomorrow if Hamas surrenders, disarms and returns all the hostages,” Mr. Netanyahu said during his address to Congress on Wednesday. “But if they don’t, Israel will fight until we destroy Hamas’s military capabilities and its rule in Gaza and bring all our hostages home.”

Mr. Netanyahu did not refer to the current proposal backed by the Biden administration and the United Nations Security Council. Under that deal, Israel would ultimately agree to a permanent cease-fire with Hamas and withdraw its forces from Gaza in exchange for the release of all hostages.

Nissim Kalderon, whose brother Ofer was abducted on Oct. 7, accused Mr. Netanyahu of hesitating to reach a deal for political reasons. Mr. Netanyahu’s coalition government depends on hard-line parties who support permanent Israeli control of Gaza, effectively ruling out a cease-fire with Hamas.

“I expected, hoped, wished that you would open your speech with ‘We have a signed deal.’ But again and again, you’re not doing what you should have done 292 days ago,” Mr. Kalderon said at a rally in Tel Aviv on Wednesday night. “Bring your citizens home.”

At least six Israeli relatives of hostages were arrested in the House gallery by Capitol Police during Mr. Netanyahu’s speech as they wore bright yellow T-shirts calling on him to reach an agreement to free their loved ones.

“Benjamin Netanyahu spoke for 54 minutes and he did not mention once the need to seal the deal,” said Gil Dickmann, whose cousin Carmel Gat was abducted from the Israeli border community of Be’eri. “That’s what he needs to do, sign the deal and release all the hostages now.”

Rawan Sheikh Ahmad contributed reporting.

— Aaron Boxerman reporting from Jerusalem

The U.S. has sent thousands of bombs and missiles to Israel, a report found.

In his address to Congress on Wednesday, Prime Minister Benjamin Netanyahu of Israel asked for the speedy delivery of more American weapons to help his country prevail over Hamas in the war in Gaza. “Give us the tools faster, and we’ll finish the job faster,” he said.

The prime minister’s plea for more weapons, faster, comes despite enormous transfers of American military hardware over the last 10 months.

A tally of publicly known deliveries, compiled this week by the Jewish Institute for National Security of America, shows that more than 20,000 unguided bombs, an estimated 2,600 guided bombs and 3,000 precision missiles — as well as aircraft, ammunition and air defenses — are among the American weapons that have already been shipped since Oct. 7.

Many of the arms shipments that the United States has sent to Israel since the war began in October are classified or have been otherwise kept secret. Nonetheless, what had been delivered by March alone amounts to “an enormous number and variety of weapons, which have played a vital role in helping Israel defend itself,” an analysis by the Foundation for Defense of Democracies found this past spring.

With Americans divided over U.S. support for the war, and the domestic defense industry already stretched thin by the war in Ukraine, some defense officials and weapons experts have predicted that arms shipments to Israel could soon level off, or be phased out over the next decade.

Human rights groups and some U.S. lawmakers have demanded that the United States stop supplying weapons that could be used by Israel in potential war crimes, though security experts and some members of Congress have argued that ending American military aid would make Israel more vulnerable to attacks by Iran and its regional proxies.

In May, the State Department concluded that Israel had most likely violated humanitarian standards by failing to protect civilians in Gaza, but it did not find specific instances that would justify withholding American military aid.

Earlier this year, the Biden administration halted the shipment of 1,800 2,000-pound bombs to Israel after concerns swelled that such explosives had killed thousands of Palestinian civilians in southern Gaza. President Biden said in May that he would block the delivery of weapons that could be fired into densely populated areas of Rafah, in southern Gaza, where more than a million Palestinians were sheltering.

But this month, Mr. Biden loosened some of the restrictions to allow the delivery of 1,700 500-pound bombs that were part of the paused shipment of 2,000-pound bombs.

The seeming inconsistency has prompted some weapons experts to explore how — or whether — Israel can become less reliant on the United States, in part by building up its own weapons industry. Such development would cost Israel tens of billions of dollars, the defense foundation’s analysis said, for a country that already spends around 4.5 percent of its gross domestic product on defense — more than any NATO country.

“It seems unlikely that Israel could attain across-the-board weapons and munitions self-sufficiency anytime soon (and some say ever),” the analysis said.

— Lara Jakes

Israelis contrast Netanyahu’s speech in Congress with the grim reality at home.

For many Israelis, it wasn’t what Prime Minister Benjamin Netanyahu said, it was what he didn’t say.

In his speech to Congress on Wednesday, Mr. Netanyahu cast the war in Gaza as a battle for the survival of the Jewish state, a view widely shared across Israel.

But many Israelis want their leader to agree to a cease-fire that would allow for the release of the 115 remaining hostages in Gaza, at any cost. While Mr. Netanyahu spoke of “intensive efforts” to secure the release of the captives, he did not publicly embrace a proposed truce deal being negotiated.

In Israel, the dissonance between the repeated applause from U.S. lawmakers during his address and a grimmer domestic reality was apparent on the front pages of Thursday’s Hebrew-language newspapers, which were dominated by news that the military had recovered the bodies of several hostages from the Palestinian enclave.

Yedioth Ahronoth, a popular mainstream daily, split its front page horizontally, devoting the top half to portraits of four captives whose bodies were recovered, and the bottom half to the speech. A fifth body was subsequently identified by the Israeli authorities, who said all five had been killed on Oct. 7, during the Hamas-led assault on southern Israel that prompted the war.

The visit to Washington by Mr. Netanyahu, who arrived at the White House on Thursday afternoon for meetings with President Biden and Vice President Kamala Harris, was intended to shore up support for the war both at home and abroad.

But there is a widespread sense of government failure in Israel as the war has dragged on, with the fighting having expanded to multiple fronts and the leadership offering little vision for what comes next.

“It was a speech devoid of disappointments or good tidings,” wrote Ben-Dror Yemini in Thursday’s Yedioth Ahronoth. “Never, ever was there such a large chasm between high words and contradictory actions.”

In Israel, critics of Mr. Netanyahu have accused him of putting his political survival above the fate of the hostages. Two far-right parties that he relies on for his governing coalition have threatened to quit should he agree to a deal on terms that they would deem a surrender to Hamas.

Seeking better terms, Mr. Netanyahu has delayed the departure of an Israeli negotiating team that was meant to set out from Israel on Thursday for talks with mediators in Qatar. An Israeli official with knowledge of the talks said only that the team would depart for Qatar sometime after Mr. Netanyahu’s meeting with Mr. Biden, without specifying a new date.

The Hostages Families Forum, a grass-roots organization advocating for the captives’ release, declared a “crisis of trust” in a statement on Thursday, accusing Mr. Netanyahu of obstructing a deal.

“This foot-dragging is a deliberate sabotage of the chance to bring our loved ones back,” the forum said in its statement, adding, “It effectively undermines the negotiations and indicates a serious moral failure.”

— Isabel Kershner Reporting from Jerusalem

Gaza’s death toll was largely accurate in the early days of the war, a study finds.

A new study analyzing the first 17 days of Israel’s bombardment in the Gaza Strip found that the Gaza Ministry of Health’s death toll, a subject of debate at the time, was reliable.

The study, conducted by Airwars, a British organization that assesses claims of civilian harm in conflicts, added to previous research suggesting that the Health Ministry’s figures in the early days of the war were credible.

In late October, the Health Ministry published the names of about 7,000 people who had been killed in the first 17 days of the war. Of the thousands of Israeli airstrikes and other explosions during that time period, only a fraction — 350 events — were analyzed by Airwars for the study released Wednesday. Airwars said it was able to independently identify 3,000 names, most of which matched the ministry’s list.

As a result, Airwars said, it felt confident the ministry’s casualty reporting system at the beginning of the war was reliable and that it was working to analyze additional strikes and explosions.

Airwars reported that more recent ministry figures had become less accurate after the destruction of the territory’s health system.

The war has, however, clearly devastated the civilian population in Gaza. On Wednesday, the ministry, whose death toll does not distinguish between civilians and combatants, said that more than 39,000 people had been killed.

The ministry is ultimately overseen by Hamas, and Israeli officials have expressed skepticism about its accuracy. Early in the war, before the Health Ministry released its list, President Biden said he had “no confidence in the number that the Palestinians are using,” though he and other American officials have since expressed more confidence in them, urging Israel to do more to protect civilians.

Israel says that it tries to avoid civilian casualties, but notes that Hamas often bases its forces in densely populated urban areas.

Airwars focused its research only on the early days of the conflict. It said that there were many other strikes and explosions apart from the nearly 350 it documented during the period.

About 75 percent of the names documented by Airwars appeared on the Health Ministry’s October list, a rate that showed that “both capture a large fraction of the underlying reality,” said Mike Spagat, a professor at Royal Holloway College at the University of London who reviewed the findings and advised on the research process.

Many international officials and experts familiar with the way the Health Ministry verifies deaths in Gaza — drawing on information from morgues and hospitals across the territory — say its numbers are generally reliable. But there is evidence that the quality of the data has declined, as infrastructure has collapsed in many parts of the territory. In December, after many hospitals had closed, the Health Ministry announced it was supplementing its hospital and morgue-based tally with “reliable media sources.”

In its analysis, Airwars verified that at least some militants were included on the list of those killed in the first three weeks of the war. Israel’s military said in July that it had killed or captured around 14,000 combatants in Gaza since the war began, a number that cannot be independently confirmed.

In one instance, an Israeli airstrike on Oct. 19 targeted and killed Maj. Gen. Jihad Muheisan, commander of the Hamas-run National Security Forces, along with 18 members of his family, including nine children and six women, Airwars found. General Muheisan and all but one of the 18 were included on the Health Ministry’s list.

Because Airwars only analyzed incidents in which civilians were reportedly harmed, researchers said they could not estimate how many militants were included on the Health Ministry’s list.

Other studies have also backed the reliability of the ministry’s early death toll.

Johns Hopkins researchers found that there was no evidence that it was inflated through early November. And researchers from the London School of Hygiene and Tropical Medicine who analyzed ID numbers from the October list found there was “no obvious reason” to doubt the data.

Airwars used the same methodology in its Gaza analysis as it has for conflicts in Iraq, Syria, Ukraine, Libya and others, said Emily Tripp, the group’s director.

The pace of those killed in Gaza in October stands out, she said. Airwars tracked more allegations of harm to civilians in October than in any month in its decade of monitoring, which includes the U.S.-led fight against Islamic State and Russia’s bombardment of Syria, according to the report. About a quarter of those allegations involved at least 10 civilians reportedly killed, a much higher share than in the other conflicts it has monitored.

“We have, per incident, more people dying than we’ve seen in any other campaign,” Ms. Tripp said. “The intensity is greater than anything else we’ve documented.”

— Lauren Leatherby

This was the message that Netanyahu took to Congress.

Israel’s leader traveled some 5,000 miles and did not give an inch.

Addressing a joint meeting of Congress on Wednesday, Prime Minister Benjamin Netanyahu pushed back on condemnations of Israel’s prosecution of the war in Gaza and lavished praise and thanks on the United States for its support.

He offered a retort to harsh international criticism that Israel had done far too little to protect civilian lives in Gaza and was starving the population there. And he remained defiant in the face of the global pressure over a conflict that has killed tens of thousands of Palestinians, giving little hint that Israel would back down from the fight anytime soon.

Here are some of the highlights.

He name-checked both Biden and Trump.

Mr. Netanyahu was careful to walk a middle path, thanking both Democrats and Republicans, including President Biden and the Republican presidential nominee, Donald J. Trump, for their support.

“I know that America has our back,” he said. “And I thank you for it. All sides of the aisle. Thank you, my friends.”

He expressed particular appreciation for Mr. Biden’s “heartfelt support for Israel after the savage attack” led by Hamas on Oct. 7. But he also made a point of praising Mr. Trump, who as president was more receptive to some of his expansionist policies.

He denied that Israel was starving Gazans.

The prosecutor of the International Criminal Court has requested arrest warrants for war crimes and crimes against humanity for Mr. Netanyahu and the leaders of Hamas. But Mr. Netanyahu rejected accusations by the court’s prosecutor that Israel was deliberately cutting off food to Gazans.

“Utter, complete nonsense, a complete fabrication,” he declared.

Israel, he said, has enabled more than 40,000 aid trucks to enter Gaza during the war.

However, U.N. aid officials say Israel is responsible for most obstacles to getting aid to desperate Palestinians. Mr. Netanyahu said members of Hamas were stealing the goods.

He rejected blame for the heavy civilian loss.

More than 39,000 people have been killed in Gaza during the war, according to the Gaza health authorities, who do not distinguish between combatants and civilians. But Mr. Netanyahu again rejected Israeli responsibility. He denied deliberately targeting noncombatants and said the Israel Defense Forces had worked hard to protect them.

“The I.D.F. has dropped millions of fliers, sent millions of text messages and hundreds of thousands of phone calls to get Palestinian civilians out of harm’s way,” he said.

But those directives often confuse Gaza civilians who struggle to find any safe place to shelter amid the incessant airstrikes and bombardments that have lasted for more than nine months.

Mr. Netanyahu again blamed Hamas, saying it “does everything in its power to put Palestinian civilians in harm’s way” by using schools, hospitals and mosques for military operations.

International law requires combatants to avoid using such “civilian objects” for military objectives. But Israel’s critics say that Hamas’s use of civilian sites does not absolve Israel of its obligations under international law to protect civilians, nor does it explain the scale of death and destruction.

He played up diversity in Israeli society.

During the speech, Mr. Netanyahu called on a few Israeli soldiers in the audience to stand up, including one of Ethiopian descent and another who is Bedouin, citing their heroism and their important role in the Israeli military. It appeared to be an effort to convey that Israel and its military are not homogenous.

“The Muslim soldiers of the I.D.F. fought alongside their Jewish, Christian and other comrades in arms with tremendous bravery,” Mr. Netanyahu said.

Ethiopian Jews and Bedouins in Israel are often marginalized, but the prime minister offered a different portrayal.

He sketched out a vague vision of peace.

The Israeli prime minister has been accused by critics in Israel and some diplomats of dragging his feet in reaching a cease-fire deal with Hamas to end the bloodshed, possibly to preserve his own political longevity.

But Mr. Netanyahu said “a new Gaza could emerge” if Hamas was defeated and Gaza “demilitarized and de-radicalized,” adding that Israel “does not seek to resettle Gaza.”

He turned to past world conflicts to make his case, noting that the approach of demilitarization and de-radicalization was used in Germany and Japan after World War II.

There is broad concern, however, that in Gaza the trauma of the war will yield a new generation of radicalization.

The common enemy? Iran, he said.

“If you remember one thing, one thing from this speech, remember this: Our enemies are your enemies,” Mr. Netanyahu said. “Our fight is your fight. And our victory will be your victory.”

Iran, he said, wants to impose “radical Islam” on the world and sees the United States as its greatest enemy because it is “the guardian of Western civilization and the world’s greatest power.”

He argued that Iran-backed militias like Hamas, Hezbollah in Lebanon and the Houthis in Yemen, whatever their aggression against Israel, are actually fighting a different war.

“Israel is merely a tool,” Mr. Netanyahu said. “The main war, the real war, is with America.”

Israeli forces press forward in Khan Younis. At least 30 people are reported killed in 24 hours.

At least 30 people were killed and dozens more injured over a 24-hour period on Wednesday and Thursday in the Gaza Strip, local health officials said, as the Israeli military pushed deeper into parts of Khan Younis that it had previously designated as humanitarian zones for civilians fleeing the fighting.

The Israeli military, which began a renewed offensive in the southern Gaza city of Khan Younis earlier this week, said it was targeting Hamas forces whom it accused of embedding fighters among civilians.

Many of the victims were taken to the Nasser Medical Complex in Khan Younis, where photos taken by a photographer for Agence France-Presse showed bloodied children being rushed in for care.

Mohammad Saqer, the director of nursing at Nasser Hospital, said he had treated three children for severe blast wounds, which he said were most likely from bombardment. Dr. Saqer, who has worked at the medical center for 18 years, said few shipments of medicine and fuel were arriving at the hospital, making treatment difficult.

“So many dead, so many wounded, not enough beds,” Dr. Saqer said. “The situation’s disastrous. We’re rationing electricity, turning off air conditioning, trying to save what we can.”

Patients at the facility have been forced to share beds, and the hospital was “under enormous strain as the killing, wounding and maiming of people continues relentlessly in southern Gaza,” the aid group Doctors Without Borders wrote on social media earlier in the week.

The United Nations said that 150,000 people fled Khan Younis on Monday alone, the day the renewed Israeli offensive began, and that “large-scale displacement” from the area was ongoing.

In Al-Mawasi, the coastal town where the Israeli military ordered Khan Younis residents to go, there is “no space for even a single tent due to the overwhelming number of people desperate for safety,” the Palestinian Red Crescent said. The group said that one of its ambulances came under fire on Thursday as medics were trying to assist injured civilians.

⭕️The Israeli occupation forces directly targeted a Palestine Red Crescent ambulance with live bullets today in the city of #KhanYunis while the crew was evacuating an injured person. #NotATarget #IHL #Gaza pic.twitter.com/6AflIicCto — PRCS (@PalestineRCS) July 25, 2024

Fighting in recent days has centered around three towns near the city of Khan Younis — Bani Suaila, Al Zanna and Al Qarara. On Wednesday, the Israeli military discovered the bodies of five Israelis in Al Qarara who had been killed in Hamas’s Oct. 7 attack on Israel. The bodies were found in a tunnel used by militants.

“Hamas exploited the humanitarian area and used it to hold our hostages captive,” the military said in a statement on social media. Hamas did not issue a response on its social media channels.

Israeli officials say 115 hostages remain in Gaza, including roughly 40 who are presumed dead.

The military said on Thursday that Hamas had launched several rockets toward Israel from the humanitarian area in Khan Younis earlier in the day. The rockets did not reach Israel, and at least one hit a U.N.-run school in Al Qarara, killing two people and injuring several others, the military said.

Schools have not been operating during the war and most of them have become shelters for displaced people. UNRWA, the United Nations’ main relief group for Palestinians, which runs schools in the territory, did not confirm the attack.

The Israeli military said its forces operating in Khan Younis had killed dozens of Hamas militants over the past day and struck more than 60 terror targets.

Gaza’s health ministry said Israeli military strikes on areas in eastern Khan Younis killed at least 14 people early Thursday, with airstrikes reported in southern Gaza and tanks advancing in central Rafah.

Mahmoud Basal, a spokesman for the Palestinian Civil Defense, said Israeli forces had killed at least 17 people on Thursday in Deir al-Balah, in central Gaza. In Khan Younis, he said, Israeli snipers shot and killed at least one person as he moved down Salah al-Din Street, Gaza’s main north-south route. The Israeli military did not immediately respond to a request for comment about the incident.

Anushka Patil and Aaron Boxerman contributed reporting.

— Anjana Sankar


Title: LipLearner: Customizable Silent Speech Interactions on Mobile Devices

Abstract: Silent speech interface is a promising technology that enables private communications in natural language. However, previous approaches only support a small and inflexible vocabulary, which leads to limited expressiveness. We leverage contrastive learning to learn efficient lipreading representations, enabling few-shot command customization with minimal user effort. Our model exhibits high robustness to different lighting, posture, and gesture conditions on an in-the-wild dataset. For 25-command classification, an F1-score of 0.8947 is achievable only using one shot, and its performance can be further boosted by adaptively learning from more data. This generalizability allowed us to develop a mobile silent speech interface empowered with on-device fine-tuning and visual keyword spotting. A user study demonstrated that with LipLearner, users could define their own commands with high reliability guaranteed by an online incremental learning scheme. Subjective feedback indicated that our system provides essential functionalities for customizable silent speech interactions with high usability and learnability.
Comments: Conditionally accepted to the ACM CHI Conference on Human Factors in Computing Systems 2023 (CHI '23)
Subjects: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
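
The abstract above describes contrastive lipreading representations that allow commands to be registered from only a few examples. As a hedged illustration of the general few-shot pattern, and not LipLearner's actual pipeline, the sketch below enrolls each command by averaging a handful of clip embeddings into a prototype and classifies new clips by cosine similarity; the encoder is a random stand-in for a pretrained contrastive model, and all names and shapes are hypothetical.

```python
# Illustrative few-shot command classification on top of a pretrained encoder.
# This is a nearest-prototype sketch, not LipLearner's actual pipeline; the
# encoder, embedding size and command names are assumptions for the example.
import numpy as np

EMBED_DIM = 256

def encode(clip: np.ndarray) -> np.ndarray:
    """Stand-in for a pretrained lipreading encoder (e.g. trained contrastively).
    Here it just projects the flattened clip with a fixed random matrix."""
    rng = np.random.default_rng(42)
    W = rng.standard_normal((clip.size, EMBED_DIM))
    v = clip.ravel() @ W
    return v / np.linalg.norm(v)

def register_commands(shots: dict[str, list[np.ndarray]]) -> dict[str, np.ndarray]:
    """Average the few enrollment embeddings per command into a prototype."""
    protos = {}
    for name, clips in shots.items():
        embs = np.stack([encode(c) for c in clips])
        p = embs.mean(axis=0)
        protos[name] = p / np.linalg.norm(p)
    return protos

def classify(clip: np.ndarray, protos: dict[str, np.ndarray]) -> str:
    """Pick the command whose prototype has the highest cosine similarity."""
    v = encode(clip)
    return max(protos, key=lambda name: float(v @ protos[name]))

# One-shot enrollment of two hypothetical commands, then a query.
rng = np.random.default_rng(0)
shots = {"play_music": [rng.random((20, 64))], "take_photo": [rng.random((20, 64))]}
protos = register_commands(shots)
print(classify(shots["play_music"][0], protos))   # -> "play_music"
```

In a real system, the encoder would be the pretrained contrastive lipreading model, and prototypes could keep improving as additional examples arrive, in the spirit of the incremental learning scheme the abstract mentions.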


Harris pushes Netanyahu to ease suffering in Gaza: 'I will not be silent'


U.S. Vice President Harris meets with Israeli PM Netanyahu in Washington


Reporting by Steve Holland and Jeff Mason; Additional reporting by Trevor Hunnicutt and Daphne Psaledakis; Editing by Heather Timmons, Howard Goller, Cynthia Osterman and Don Durfee


Jeff Mason is a White House Correspondent for Reuters. He has covered the presidencies of Barack Obama, Donald Trump and Joe Biden and the presidential campaigns of Biden, Trump, Obama, Hillary Clinton and John McCain. He served as president of the White House Correspondents’ Association in 2016-2017, leading the press corps in advocating for press freedom in the early days of the Trump administration. His and the WHCA's work was recognized with Deutsche Welle's "Freedom of Speech Award." Jeff has asked pointed questions of domestic and foreign leaders, including Russian President Vladimir Putin and North Korea's Kim Jong Un. He is a winner of the WHCA's “Excellence in Presidential News Coverage Under Deadline Pressure" award and co-winner of the Association for Business Journalists' "Breaking News" award. Jeff began his career in Frankfurt, Germany as a business reporter before being posted to Brussels, Belgium, where he covered the European Union. Jeff appears regularly on television and radio and teaches political journalism at Georgetown University. He is a graduate of Northwestern University's Medill School of Journalism and a former Fulbright scholar.


Israeli forces quit east Khan Younis, Palestinians recover dozens of bodies

Thousands of Palestinians returned to their homes in the ruins of Gaza's main southern city Khan Younis on Tuesday, after Israeli forces ended a week-long incursion there which they said aimed to prevent Islamist armed group Hamas from regrouping.


COMMENTS

  1. Subvocalization

    Subvocalization, or silent speech, is the internal speech typically made when reading; it provides the sound of the word as it is read. This is a natural process when reading, and it helps the mind to access meanings to comprehend and remember what is read, potentially reducing cognitive load. This inner speech is characterized by minuscule movements in the larynx and other muscles involved ...

  2. Silent speech interface

    A silent speech interface is a device that allows speech communication without the sound people make when they vocalize. As such, it is a form of electronic lip reading: the computer identifies the phonemes an individual pronounces from non-auditory sources of information about their speech movements.

  3. Computer system transcribes words users "speak silently"

    In the conference paper, the researchers report a prototype of a wearable silent-speech interface, which wraps around the back of the neck like a telephone headset and has tentacle-like curved appendages that touch the face at seven locations on either side of the mouth and along the jaws.

  4. This Device Can Hear You Talking to Yourself

    Yes, it's true. AlterEgo, Kapur's new wearable device system, can detect what you're saying when you're talking to yourself, even if you're completely silent and not moving your mouth ...

  5. Overview ‹ AlterEgo

    AlterEgo is a non-invasive, wearable, peripheral neural interface that allows humans to converse in natural language with machines, artificial intelligence assistants, services, and other people without any voice, without opening their mouth, and without externally observable movements, simply by articulating words internally.

  6. AI-equipped eyeglasses can read silent speech

    Outfitted with a pair of microphones and speakers smaller than pencil erasers, the EchoSpeech glasses become a wearable AI-powered sonar system, sending and receiving soundwaves across the ... The silent speech interface can also be paired with a stylus and used with design software like CAD, all but eliminating the need for a keyboard and a mouse.

  7. All-weather, natural silent speech recognition via machine-learning

    Silent speech can offer people with aphasia an alternative way to communicate. More importantly, compared with voice or visual interactions, human-machine interactions using silent ...

  8. Silent speech interfaces

    A silent speech interface (SSI) is a system enabling speech communication to take place when an audible acoustic signal is unavailable. By acquiring sensor data from elements of the human speech production process - from the articulators, their neural pathways, or the brain itself - an SSI produces a digital representation of speech which ...

  9. Silent Speech Interfaces for Speech Restoration: A Review

    This review summarises the status of silent speech interface (SSI) research. SSIs rely on non-acoustic biosignals generated by the human body during speech production to enable communication whenever normal verbal communication is not possible or not desirable. In this review, we focus on the first case and present latest SSI research aimed at providing new alternative and augmentative ...

  10. Voicing Silent Speech (PDF)

    Another possible sensor for silent speech input is electromagnetic articulography, or EMA, which uses magnets attached to the lips and tongue to track their movement (Wrench and Richmond, 2000).

  11. End-To-End Silent Speech Recognition with Acoustic Sensing

    Silent speech interfaces (SSI) have been an exciting area of recent interest. In this paper, we present a non-invasive silent speech interface that uses inaudible acoustic signals to capture people's lip movements when they speak. We exploit the speaker and microphone of the smartphone to emit signals and listen to their reflections, respectively. The extracted phase features of these ... A minimal sketch of this kind of phase-based sensing appears after this list.

  12. Silent Speech Interfaces for Speech Restoration: A Review (PDF)

    ... speech from non-acoustic (silent) biosignals generated during speech production. A well-known form of silent speech communication is lip reading. A variety of sensing modalities have been investigated to capture speech-related biosignals, such as vocal tract imaging [21]-[23], electromagnetic articulography ...

  13. Digital Voicing of Silent Speech (PDF)

    Figure 2 of the paper defines the three components of the data used in the model: audio from vocalized speech (A_V), EMG from vocalized speech (E_V), and EMG from silent speech (E_S). The vocalized signals A_V and E_V are collected simultaneously and so are time-aligned, while the silent signal E_S is a separate recording of the same utterance without vocalization. A minimal alignment sketch based on this setup appears after this list.

  14. GitHub

    This repository contains code for synthesizing speech audio from silently mouthed words captured with electromyography (EMG). It is the official repository for the papers Digital Voicing of Silent Speech at EMNLP 2020, An Improved Model for Voicing Silent Speech at ACL 2021, and the dissertation Voicing Silent Speech. The current commit contains only the most recent model, but the versions from ...

  15. A continuous silent speech recognition system for AlterEgo, a silent speech interface

    In this thesis, I present my work on a continuous silent speech recognition system for AlterEgo, a silent speech interface. By transcribing residual neurological signals sent from the brain to speech articulators during internal articulation, the system allows one to communicate without the need to speak or perform any visible ...

  16. Silent Speech Decoding Using Spectrogram Features Based on Neuromuscular Activities

    Silent speech decoding is a novel application of the Brain-Computer Interface (BCI) based on articulatory neuromuscular activities, reducing difficulties in data acquisition and processing. In this paper, spatial features and decoders that can be used to recognize the neuromuscular signals are investigated. Surface electromyography (sEMG) ... A spectrogram-feature sketch in this spirit appears after this list.

  17. Digital Voicing of Silent Speech [arXiv:2010.02960]

    In this paper, we consider the task of digitally voicing silent speech, where silently mouthed words are converted to audible speech based on electromyography (EMG) sensor measurements that capture muscle impulses. While prior work has focused on training speech synthesis models from EMG collected during vocalized speech, we are the first to ...

  18. EchoSpeech: Continuous Silent Speech Recognition on Minimally-obtrusive ...

    Cited within: Non-Invasive Silent Speech Recognition in Multiple Sclerosis with Dysphonia. In Proceedings of the Machine Learning for Health NeurIPS Workshop (Proceedings of Machine Learning Research, Vol. 116), Adrian V. Dalca, Matthew B.A. McDermott, Emily Alsentzer, Samuel G. Finlayson, Michael Oberst, Fabian Falck, and Brett Beaulieu-Jones (Eds.). ...

  19. EarSSR: Silent Speech Recognition via Earphones

    As the most natural and convenient way to communicate with people, speech is always preferred in Human-Computer Interactions. However, voice-based interaction still has several limitations. It raises privacy concerns in some circumstances and the accuracy severely degrades in noisy environments. To address these limitations, silent speech recognition (SSR) has been proposed, which leverages ...

  20. [2302.05907] LipLearner: Customizable Silent Speech Interactions on ...

    Silent speech interfaces are a promising technology that enables private communication in natural language. However, previous approaches only support a small and inflexible vocabulary, which leads to limited expressiveness. We leverage contrastive learning to learn efficient lipreading representations, enabling few-shot command customization with minimal user effort. Our model exhibits high ...
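
Several of the entries above, in particular the AlterEgo transcription thesis and the sEMG spectrogram-decoding paper, describe the same basic recipe: record neuromuscular signals while words are articulated silently, convert the signals into time-frequency features, and train a model that maps those features to words. The Python sketch below is a minimal, illustrative version of that recipe, not the pipeline used by any of the systems listed here; the sampling rate, electrode count, window length, vocabulary, and the choice of logistic regression are all assumptions, and the data are random stand-ins for real recordings.

    # Illustrative sketch only: classify a small, closed vocabulary of silently
    # articulated words from surface EMG (sEMG) windows. All parameters below
    # are assumptions for illustration, not any published system's settings.
    import numpy as np
    from scipy.signal import spectrogram
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    FS = 1000          # assumed sEMG sampling rate, Hz
    N_CHANNELS = 7     # assumed number of electrodes (e.g., face/jaw sites)
    WORDS = ["yes", "no", "up", "down"]   # hypothetical closed vocabulary

    def emg_features(window):
        """Turn one multi-channel sEMG window (n_channels x n_samples)
        into a flat feature vector of log-spectrogram energies."""
        feats = []
        for ch in window:
            _, _, sxx = spectrogram(ch, fs=FS, nperseg=128, noverlap=64)
            feats.append(np.log(sxx + 1e-8).ravel())
        return np.concatenate(feats)

    # Random arrays standing in for labelled recordings: one 1-second window
    # per silently articulated word.
    rng = np.random.default_rng(0)
    X = np.stack([emg_features(rng.standard_normal((N_CHANNELS, FS)))
                  for _ in range(200)])
    y = rng.integers(0, len(WORDS), size=200)

    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)
    clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    print("held-out accuracy (random data, so about chance):", clf.score(X_te, y_te))

With real labelled recordings in place of the random arrays, the same structure of windowed features plus a classifier is the smallest experiment one could run before moving to sequence models for continuous transcription.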
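
"Digital Voicing of Silent Speech" pairs each silent EMG recording (E_S) with a vocalized recording of the same utterance, whose EMG (E_V) and audio (A_V) are time-aligned with each other. One simple way to give the silent recording audio targets is to align its EMG features to the vocalized EMG features with dynamic time warping (DTW) and copy the audio frames along the warping path. The sketch below illustrates that idea with plain numpy on toy arrays; the feature dimensions are assumptions, and this is not the paper's published training procedure.

    # Illustrative sketch only: borrow audio targets for silent EMG by aligning
    # it to vocalized EMG of the same utterance with plain dynamic time warping.
    import numpy as np

    def dtw_path(a, b):
        """Plain DTW between two feature sequences a (n x d) and b (m x d);
        returns, for every frame of a, the index of an aligned frame in b."""
        n, m = len(a), len(b)
        dist = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)
        cost = np.full((n + 1, m + 1), np.inf)
        cost[0, 0] = 0.0
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                cost[i, j] = dist[i - 1, j - 1] + min(cost[i - 1, j],
                                                      cost[i, j - 1],
                                                      cost[i - 1, j - 1])
        # Backtrack from (n, m) to recover the warping path.
        i, j, pairs = n, m, []
        while i > 0 and j > 0:
            pairs.append((i - 1, j - 1))
            step = np.argmin([cost[i - 1, j - 1], cost[i - 1, j], cost[i, j - 1]])
            if step == 0:
                i, j = i - 1, j - 1
            elif step == 1:
                i -= 1
            else:
                j -= 1
        aligned = np.zeros(n, dtype=int)
        for ai, bj in pairs:
            aligned[ai] = bj          # last write wins; adequate for a sketch
        return aligned

    # Toy stand-ins: frame-level features for silent EMG (E_S) and vocalized
    # EMG (E_V), plus audio features A_V that are frame-aligned with E_V.
    rng = np.random.default_rng(1)
    e_silent = rng.standard_normal((90, 16))    # 90 frames of silent-EMG features
    e_vocal = rng.standard_normal((120, 16))    # 120 frames of vocalized-EMG features
    a_vocal = rng.standard_normal((120, 80))    # audio features aligned with e_vocal

    idx = dtw_path(e_silent, e_vocal)
    audio_targets_for_silent = a_vocal[idx]     # borrowed targets, one per silent frame
    print(audio_targets_for_silent.shape)       # (90, 80)

A real system would align learned feature representations over much longer recordings; the toy arrays here only show the shape of the alignment step.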
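
The acoustic-sensing entries (the smartphone-based interface, EchoSpeech, and EarSSR) share one physical principle: a speaker emits an inaudible tone, a microphone records its reflection off the moving articulators, and the motion shows up as a shift in the received signal's phase. The sketch below recovers such a phase track with quadrature (I/Q) demodulation followed by a simple low-pass filter, on a simulated signal; the 20 kHz carrier, 48 kHz sampling rate, and single-tone setup are illustrative assumptions and do not describe any particular system's design.

    # Illustrative sketch only: recover a phase track from a simulated
    # near-ultrasonic reflection using quadrature (I/Q) demodulation.
    import numpy as np

    FS = 48_000        # assumed audio sampling rate, Hz
    F_C = 20_000       # assumed near-ultrasonic carrier, Hz
    DURATION = 0.5     # seconds

    t = np.arange(int(FS * DURATION)) / FS
    # Simulated received signal: the emitted tone reflected off a slowly moving
    # surface (e.g., the lips), which appears as a time-varying phase offset.
    true_phase = 0.8 * np.sin(2 * np.pi * 3 * t)          # 3 Hz articulator motion
    received = np.cos(2 * np.pi * F_C * t + true_phase)
    received += 0.05 * np.random.default_rng(2).standard_normal(t.size)  # noise

    # Mix with cosine/sine at the carrier, then low-pass with a short moving
    # average to remove the high-frequency mixing products.
    i_raw = received * np.cos(2 * np.pi * F_C * t)
    q_raw = -received * np.sin(2 * np.pi * F_C * t)
    kernel = np.ones(240) / 240                           # ~5 ms moving average
    i_lp = np.convolve(i_raw, kernel, mode="same")
    q_lp = np.convolve(q_raw, kernel, mode="same")

    phase = np.unwrap(np.arctan2(q_lp, i_lp))             # recovered phase track
    print("phase std (radians):", phase.std())            # varies with articulator motion

Practical systems often use multiple tones or chirps rather than a single carrier, and feed the resulting phase or range features to a recognizer, but this demodulation step is the common starting point.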
