Published:

Rea F., Kothig A., Grasse L., Tata M.
Published in the International Journal of Humanoid Robotics (IJHR), 2021.

Abstract

Humans make extensive use of auditory cues to interact with other humans, especially in challenging real-world acoustic environments where multiple distinct acoustic events mix together into a complex auditory scene. The ability to separate and localize mixed sounds in such scenes remains a demanding skill for binaural robots, which must disambiguate and interpret the environment with only two sensors. At the same time, robots that interact with humans should be able to gain insight about the speakers in the environment, such as how many speakers are present and where they are located. For this reason, the speech signal is distinctly important among the auditory stimuli commonly found in human-centered acoustic environments. In this paper, we propose a Bayesian method of selectively processing acoustic data that exploits the characteristic amplitude envelope dynamics of human speech to infer the location of speakers in a complex auditory scene. Our goal is to demonstrate the effectiveness of this speech-specific temporal-dynamics approach and to measure how it compares with more traditional methods based on amplitude detection alone.
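As a rough illustration of the idea (a sketch, not the authors' implementation), the Python snippet below extracts a low-frequency speech envelope and folds envelope-gated evidence into a Bayesian posterior over candidate source azimuths. The 10 Hz cutoff, the speech-likeness gate, and the Gaussian likelihood standing in for binaural evidence are all assumptions made for this example.

import numpy as np
from scipy.signal import butter, sosfiltfilt, hilbert

def speech_envelope(audio, fs, cutoff_hz=10.0):
    # Amplitude envelope (magnitude of the analytic signal) low-pass
    # filtered below ~10 Hz, roughly the syllabic-rate band of speech.
    env = np.abs(hilbert(audio))
    sos = butter(4, cutoff_hz, btype="low", fs=fs, output="sos")
    return sosfiltfilt(sos, env)

def bayesian_update(prior, likelihood):
    # One multiplicative Bayesian update over a grid of azimuths.
    posterior = prior * likelihood
    return posterior / posterior.sum()

fs = 16_000
t = np.arange(fs) / fs
# Toy "speech": a 200 Hz carrier amplitude-modulated at 4 Hz.
signal = np.sin(2 * np.pi * 200 * t) * (0.5 + 0.5 * np.sin(2 * np.pi * 4 * t))

angles = np.linspace(-90, 90, 181)                 # candidate azimuths (deg)
posterior = np.full(angles.size, 1 / angles.size)  # uniform prior

env = speech_envelope(signal, fs)
if env.std() / env.mean() > 0.2:  # crude gate: envelope dynamics look speech-like
    # Stand-in likelihood: pretend binaural cues suggest 30 deg +/- 10 deg.
    likelihood = np.exp(-0.5 * ((angles - 30.0) / 10.0) ** 2)
    posterior = bayesian_update(posterior, likelihood)

print("MAP azimuth estimate:", angles[np.argmax(posterior)])

In a real pipeline the likelihood would come from binaural cues computed on the robot's two microphones, and the gate would suppress updates driven by non-speech sounds such as the pink-noise distractor used in the experiments.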

Images

iCub humanoid robot (full robot at the Italian Institute of Technology; head at the University of Lethbridge).
Processing pipeline for extracting features from the input audio.
Processing pipeline for computing Bayesian updates of the probable sound source.
Room layout for experiments.
Localizing a human voice with a single pink-noise distractor present. The left panel uses the Amplitude-Only approach; the right panel uses the Envelope-Dynamics approach. The white track denotes the angular position of the head over time.
Representative audio signal with the low-frequency envelope overplotted in red, and the average low-frequency envelope periodograms of the speech samples used in the experiments.
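The last figure pairs a speech waveform with the average periodogram of its low-frequency envelope. The following self-contained sketch shows one way such a periodogram could be computed; the toy 4 Hz-modulated signal and all parameters are assumptions for illustration, not the paper's settings.

import numpy as np
from scipy.signal import butter, sosfiltfilt, hilbert, periodogram

fs = 16_000
t = np.arange(2 * fs) / fs
# Toy speech-like signal: a 200 Hz carrier amplitude-modulated at 4 Hz.
x = np.sin(2 * np.pi * 200 * t) * (0.5 + 0.5 * np.sin(2 * np.pi * 4 * t))

# Low-frequency envelope, as in the earlier sketch.
sos = butter(4, 10.0, btype="low", fs=fs, output="sos")
env = sosfiltfilt(sos, np.abs(hilbert(x)))

# The envelope is band-limited below 10 Hz, so plain subsampling down to
# 160 Hz is alias-safe and concentrates resolution in the 0-10 Hz band.
env_ds = env[::100]
f, pxx = periodogram(env_ds - env_ds.mean(), fs=fs / 100)
print(f"Dominant envelope modulation: {f[np.argmax(pxx)]:.2f} Hz")  # ~4 Hz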

How to cite?

@article{rea2021speech,
  title={Speech envelope dynamics for noise-robust auditory scene analysis in robotics},
  author={Rea, Francesco and Kothig, Austin and Grasse, Lukas and Tata, Matthew},
  journal={International Journal of Humanoid Robotics},
  volume={17},
  number={06},
  pages={2050023},
  month={January},
  year={2021},
  publisher={World Scientific},
  url={http://dx.doi.org/10.1142/S0219843620500231},
  doi={10.1142/S0219843620500231}
}

Downloads