Spatial audio in the browser

Get your headphones: this is a demo of spatial audio rendered in the browser, with head tracking via your webcam. For best results, sit about 2.5 feet (0.75 meters) from the camera.

What is it?

The following track is a four-part Bach chorale played by synthesized brass instruments, with the four instruments virtualized at different locations: front-left, front-right, rear-left, and rear-right. Press the button to start face tracking (and grant permission to use your webcam). The four instruments are fixed in space, and you can move and angle your head to get closer to one or another.

How does it work?

Humans use several cues to tell where sound is coming from: interaural time difference (sound hits the closer ear first), interaural intensity difference (sound is attenuated more at the distant ear), and pinna filtering (the outer ear selectively attenuates and emphasizes different frequencies depending on the sound source direction). I modeled these effects as time- and location-varying digital filters.

Interaural time and intensity difference

A sound coming from directly in front of the listener arrives at both ears at the same time, but a sound coming from the listener’s left arrives at the left ear first. Introducing a matching delay at the right ear creates the perception of a sound coming from the left. The precise amount of delay depends on the angle: for a sound directly to the left, it’s about 0.7 milliseconds; for a sound at 45 degrees to the left, about 0.4 milliseconds.
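
Those delays are well approximated by Woodworth’s spherical-head formula, ITD = (r/c)(θ + sin θ). Here is a minimal sketch; the head radius is a commonly used average, and assuming this particular formula is my choice, not necessarily what the demo uses:

```typescript
// Woodworth's spherical-head approximation of the interaural time difference.
// azimuth is the source angle in radians: 0 = straight ahead, PI/2 = directly left.
const SPEED_OF_SOUND = 343;  // m/s at room temperature
const HEAD_RADIUS = 0.0875;  // m, a commonly used average

function interauralTimeDelay(azimuth: number): number {
  // Delay to apply at the far (right) ear for a source on the left.
  return (HEAD_RADIUS / SPEED_OF_SOUND) * (azimuth + Math.sin(azimuth));
}

interauralTimeDelay(Math.PI / 2); // about 0.00066 s (~0.7 ms): directly left
interauralTimeDelay(Math.PI / 4); // about 0.00038 s (~0.4 ms): 45 degrees left
```

In the Web Audio API, that value could drive a DelayNode’s delayTime parameter on the far-ear channel.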

Sounds from the left are perceived as louder at the left ear and softer at the right ear. Part of that difference is the head shadowing the far ear, and part is that sound naturally attenuates over distance: amplitude falls off inversely with distance, so the nearer ear receives a slightly stronger signal.
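
The distance part of that cue is easy to sketch. Assuming ear and source positions are tracked as 3-vectors in meters (the names and the clamp below are illustrative, not from the demo):

```typescript
// Per-ear gain from the inverse-distance law: amplitude ~ 1/r.
type Vec3 = [number, number, number];

function distanceGain(source: Vec3, ear: Vec3, referenceDistance = 1): number {
  const d = Math.hypot(
    source[0] - ear[0],
    source[1] - ear[1],
    source[2] - ear[2],
  );
  // Clamp so a source right at the ear can't produce unbounded gain.
  return referenceDistance / Math.max(d, 0.1 * referenceDistance);
}
```

The head-shadow part is frequency-dependent, so it is better handled by the HRTF filters described next than by a plain gain.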

Pinna filtering

The pinnae (outer ears) attenuate and emphasize different frequencies depending on the direction of the sound source, and this effect plays a very large role in human sound localization. Researchers can record it by placing microphones in the ear canals of a synthetic head and playing test signals from different directions in an anechoic chamber. The measurements are collected in something called a “head-related transfer function” (HRTF). They capture the head shadow effect as well, not just the pinnae effect.
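
For a single measured direction, the browser can apply an HRTF pair directly with the Web Audio API’s ConvolverNode. A sketch, assuming hrirLeft and hrirRight (hypothetical names) hold one direction’s measured impulse responses at the context’s sample rate:

```typescript
// Render a mono source through one measured HRTF direction by convolving
// it with the left/right head-related impulse responses (HRIRs).
function makeBinauralChain(
  ctx: AudioContext,
  source: AudioNode,
  hrirLeft: Float32Array,
  hrirRight: Float32Array,
): AudioNode {
  const ir = ctx.createBuffer(2, hrirLeft.length, ctx.sampleRate);
  ir.copyToChannel(hrirLeft, 0);
  ir.copyToChannel(hrirRight, 1);

  const convolver = ctx.createConvolver();
  convolver.normalize = false; // keep the measured levels intact
  convolver.buffer = ir;

  source.connect(convolver);
  return convolver; // connect this to ctx.destination
}
```

A ConvolverNode’s impulse response is fixed, though, so covering arbitrary, continuously changing directions needs something more, which is where the interpolation below comes in.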

An HRTF captures only finitely many directions. I wanted to be able to play a synthetic sound source from any location, not just one where the HRTF was measured, so I trained a (very small) neural network to interpolate the effects of the HRTF to positions that weren’t in the original data. Given a direction, the network outputs a digital filter that simulates the effect of the pinnae from that direction. It has to be a very small neural network because it runs in the browser, and often: CD-quality audio is generated at 44100 Hz, which means the digital filters simulating the pinnae run 44100 times per second. The network itself doesn’t have to run quite so frequently, but it still can’t be too computationally intensive.
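
To make the shape of that pipeline concrete, here is a sketch rather than the actual network: a one-hidden-layer MLP with placeholder sizes and untrained weights maps a direction to FIR filter taps, and the FIR filter is what runs once per sample:

```typescript
// Sketch of the interpolation idea: a tiny MLP maps (azimuth, elevation) to
// FIR taps approximating the HRIR for that direction. Sizes and weights are
// placeholders; the real ones would come from training on the measured HRTF.
const HIDDEN = 16;
const TAPS = 32;

interface Mlp {
  w1: Float32Array; // HIDDEN x 2, row-major
  b1: Float32Array; // HIDDEN
  w2: Float32Array; // TAPS x HIDDEN, row-major
  b2: Float32Array; // TAPS
}

function hrirFromDirection(net: Mlp, azimuth: number, elevation: number): Float32Array {
  const h = new Float32Array(HIDDEN);
  for (let i = 0; i < HIDDEN; i++) {
    const z = net.w1[i * 2] * azimuth + net.w1[i * 2 + 1] * elevation + net.b1[i];
    h[i] = Math.max(0, z); // ReLU
  }
  const taps = new Float32Array(TAPS);
  for (let t = 0; t < TAPS; t++) {
    let z = net.b2[t];
    for (let i = 0; i < HIDDEN; i++) z += net.w2[t * HIDDEN + i] * h[i];
    taps[t] = z;
  }
  return taps;
}

// The FIR filter runs once per audio sample (44100 times per second);
// the network only re-runs when the head has moved enough to matter.
// history is a ring buffer of recent input samples; pos indexes the newest.
function firSample(taps: Float32Array, history: Float32Array, pos: number): number {
  let y = 0;
  for (let t = 0; t < TAPS; t++) {
    y += taps[t] * history[(pos - t + history.length) % history.length];
  }
  return y;
}
```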

License/copyright details

Copyright © 2001 The Regents of the University of California. All Rights Reserved

Disclaimer

THE REGENTS OF THE UNIVERSITY OF CALIFORNIA MAKE NO REPRESENTATION OR WARRANTIES WITH RESPECT TO THE CONTENTS HEREOF AND SPECIFICALLY DISCLAIM ANY IMPLIED WARRANTIES OR MERCHANTABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE.

Further, the Regents of the University of California reserve the right to revise this software and/or documentation and to make changes from time to time in the content hereof without obligation of the Regents of the University of California to notify any person of such revision or change.

Use of Materials

The Regents of the University of California hereby grant users permission to reproduce and/or use materials available therein for any purpose - educational, research or commercial. However, each reproduction of any part of the materials must include the copyright notice, if it is present. In addition, as a courtesy, if these materials are used in published research, this use should be acknowledged in the publication. If these materials are used in the development of commercial products, the Regents of the University of California request that written acknowledgment of such use be sent to:

CIPIC - Center for Image Processing and Integrated Computing, University of California, 1 Shields Avenue, Davis, CA 95616-8553