Written by: Audiotelligence
Blind Source Separation: Making Your Voice Stand Out from the Crowd
Blind Source Separation, or BSS as it’s commonly known, is a familiar term in the world of audio research. Many research papers have been written on it, but not many people, even in the world of audio, fully understand how it works, or how, used appropriately, it can work in conjunction with other audio techniques to produce clear speech in noisy environments.
So – what is BSS and what can it do?
To explain this, let’s take a step back and imagine a room where someone is being interviewed in front of a video camera. Just as what we can see in the room is the visual scene, the sounds within the room make up the acoustic scene. In our imaginary interview setting, you might hear the interviewee’s voice, the voice of the interviewer, perhaps background hum from air conditioning, a clock ticking and some noise from outside. All of these are sound sources within an acoustic scene.
Microphones placed in our imaginary room will pick up a mix of sounds coming from all these sources. This can result in muddled sound and barely intelligible speech, especially if two people are talking at the same time or if the background noise is loud. What we’re trying to do with BSS is to take that mix and separate it out into the individual sources. So, if the interviewer and interviewee both talk at once, BSS can separate the two voices into individual streams of sound. At the same time, BSS will help to reduce any background noise to make those voices more intelligible. Thus, not only does it separate the dominant sources from each other, it also delivers a significant amount of ambient noise reduction: between 5 and 12 dB, depending on the number of microphones used.
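To put those decibel figures in perspective, here is a small sketch of the standard conversion from a decibel reduction to a power ratio. The function name is ours, chosen for illustration; only the 5 and 12 dB figures come from the text above.

```python
def db_to_power_ratio(db: float) -> float:
    """Convert a level difference in decibels to a ratio of powers."""
    return 10 ** (db / 10)


# The range of ambient noise reduction quoted in the article.
for reduction_db in (5, 12):
    ratio = db_to_power_ratio(reduction_db)
    print(f"{reduction_db} dB reduction: noise power divided by about {ratio:.1f}")
```

In other words, a 5 dB reduction cuts the noise power to roughly a third, and a 12 dB reduction cuts it to roughly a sixteenth.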
Achieving a successful commercial BSS solution is not easy. At Audiotelligence, our method was to approach it as a Bayesian probability problem using the statistics of the sources. Most sources are “heavy tailed”: they have large outliers that occur reasonably frequently. The central limit theorem tells us that a mixture of independent sources tends to be more Gaussian, and so a lot less heavy tailed, than any individual source. The microphones pick up exactly such a mixture of sound sources. Our BSS algorithm transforms the mic data back to the sources by restoring this heavy tailed nature. This can’t be just any old transformation: various mathematical properties have to be preserved as well for it to be effective.
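The statistical idea can be illustrated with a toy experiment. This is a minimal sketch and not our production algorithm: it assumes Laplace-distributed noise as a stand-in for heavy tailed sources, a made-up two-microphone mixing matrix, and excess kurtosis as a simple measure of how heavy the tails are (a Gaussian scores zero). It only shows that mixing makes signals more Gaussian; it does not perform any separation.

```python
import numpy as np


def excess_kurtosis(x):
    """Excess kurtosis: about 0 for Gaussian data, positive for heavy tails."""
    x = np.asarray(x, dtype=float)
    centred = x - x.mean()
    variance = np.mean(centred ** 2)
    return np.mean(centred ** 4) / variance ** 2 - 3.0


rng = np.random.default_rng(0)
n_samples = 200_000

# Two independent heavy tailed sources (Laplace noise is an assumption
# made for illustration, not a model of real speech).
sources = rng.laplace(size=(2, n_samples))

# Hypothetical mixing matrix: each microphone hears both sources.
mixing = np.array([[0.7, 0.3],
                   [0.4, 0.6]])
mics = mixing @ sources

source_kurt = [excess_kurtosis(s) for s in sources]
mic_kurt = [excess_kurtosis(m) for m in mics]

print("sources:    ", np.round(source_kurt, 2))
print("microphones:", np.round(mic_kurt, 2))
```

The microphone signals come out with a lower excess kurtosis than either source, which is the "more Gaussian" effect described above. A separation algorithm built on this idea looks for the transformation of the microphone signals that pushes the outputs back towards the heavy tailed statistics of the sources.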
The great thing about BSS is that it not only does a great job of separating sound sources on its own, it also has lots of synergies with other techniques. It is best viewed as a valuable component in the whole audio stack: you can choose the best techniques to combine with it, depending on what problem you need to solve.
For example, because the output from BSS is just the one source that is of interest, plus some ambient noise, that output is perfect for processing with noise reduction technology. Noise reduction on its own cannot solve the problem of separating sources but if you add it to BSS, it will reduce the ambient noise even further. You will achieve much better overall signal quality from the combination of BSS and noise reduction than by using noise reduction on its own.
Voice assistance is becoming a ubiquitous technology. It’s not just for smart speakers: more and more devices which respond to voice commands are being made. Voice assistance needs automatic speech recognition (ASR) to work, and BSS combines very well with ASR. BSS lets the device ‘hear’ just the voice giving it the commands by separating out that source of sound from other competing voices. If there are other people talking when you are giving your device a command, ASR finds it very hard to work out which of the two streams of sound the device should pay attention to. In that use case, BSS will give the device a good, highly intelligible single source of sound. That allows the ASR to do its processing and the device should then obey you!
Finally, you must always be thinking about the end user. Our overarching objective in developing our BSS was to preserve the intelligibility of speech. It’s important that the result of your audio processing is clear natural speech – with no artefacts – that can be easily understood by real people. Testing results with metrics such as Signal to Interference Ratio and Signal to Distortion Ratio is a good place to start, but nothing validates an algorithm better than actually listening to the sound it produces.
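As a rough illustration of what a metric such as Signal to Interference Ratio (SIR) measures, here is a simplified sketch. It assumes you already know which part of an output is the wanted voice and which part is leftover interference, something that is only true in a simulation; practical evaluation toolkits estimate these components from reference recordings. The function name and the test tones are ours, chosen for illustration.

```python
import numpy as np


def power_ratio_db(wanted, unwanted):
    """Ratio of the energy in `wanted` to the energy in `unwanted`, in dB."""
    wanted = np.asarray(wanted, dtype=float)
    unwanted = np.asarray(unwanted, dtype=float)
    return 10.0 * np.log10(np.sum(wanted ** 2) / np.sum(unwanted ** 2))


# One second of synthetic audio at 16 kHz (illustrative values only).
t = np.arange(16000) / 16000
target = np.sin(2 * np.pi * 220 * t)               # the voice we want
interference = 0.1 * np.sin(2 * np.pi * 330 * t)   # residue of a competing source

sir_db = power_ratio_db(target, interference)
print(f"SIR: {sir_db:.1f} dB")
```

Here the interference has a tenth of the amplitude of the target, so the SIR is about 20 dB. The same ratio with distortion in place of interference gives a simplified Signal to Distortion Ratio. Numbers like these are useful for tracking progress, but they do not replace listening.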
Your audio research must be focussed on real world use, which means leaving the lab and getting out into the same situations in which your tech will be used. Make recordings in restaurants, cafés, homes, offices – and then listen to them. But don’t simply rely on your own ears: everyone hears things slightly differently. Get as wide a variety of people as possible to listen, test and evaluate the results. They will tell you how useful your tech was for them – and achieving a result which your users like, and which solves their problem, is after all the whole point of any audio research.