Using OrchPlay to investigate musical instrument identification


Using the Tristan prelude from OrchPlay to investigate musical instrument identification under realistic acoustical scenarios

by Simon Jacobsen
April 28th, 2023

Introduction

The identification of musical instruments is an important topic in both music information retrieval (MIR) and music cognition. While MIR strives to build algorithms that can uncover all instruments in complex sound mixtures, studies on human performance have been limited to isolated tones or controlled mixture conditions. As part of my PhD thesis, I will model musical instrument identification under realistic acoustical conditions.

What do I mean by realistic conditions? Picture a symphony concert in the Elbphilharmonie in Hamburg, Germany. You are sitting in the best seat of the house – arguably any seat is the best seat in this concert hall – and you close your eyes while listening to the music, which you have not heard before. While you are exposed to the sounds coming from the stage and their reflections off the walls, you are gripped by strong emotions in the music. And you think: “That’s a beautiful oboe sound! And the accompaniment… Wait, what instruments am I hearing? Is that a French horn or a bassoon?” Maybe you are not actually asking these questions – but you could. And they are reasonable.

Now, doing scientific research during a live orchestral performance in a concert hall would be a difficult and rather unconventional endeavor. So it would be great to simulate such an environment. The OrchPlayMusic library (https://www.orchplaymusic.com/) provides just that: valuable opportunities for diving deeper into the world of orchestration and timbre. In my case, it is the vehicle for in-depth scientific research on music perception and musical scene analysis. In this project update I describe how I incorporate OrchPlay into my research project. It covers the outcomes of research activities conducted during my ACTOR exchange at McGill University’s Music Perception and Cognition Lab, as well as follow-up implementations in the Music Perception and Processing Lab at the Carl von Ossietzky University of Oldenburg, Germany, up to the present. I plan to follow up with a series of updates on this project in the future.

Motivation

Instrument Identification and Blend

When talking about instrument identification, there are different ways to assess listener performance. There is the “Was the instrument in the mix?” approach, which focuses primarily on recognition, as the actual type of instrument is not necessarily important. A more general approach would be to simply play a mixture and then ask the listener to list all instruments that were heard; this would be the ultimate challenge in a full orchestra setting. In my project, I take a middle path between these approaches. Following a cued melodic line that is acoustically and timbrally ambiguous, the correct instrument(s) representing this melody should be identified within a subsequent sound mixture. Depending on the underlying orchestration, the cued melody does not necessarily correspond to a single target instrument; it could very well represent multiple target instruments playing in unison. In this sense, the identification task could be used as an objective measure of blend between individual instruments. As suggested by Sandell (1995), blending between sounds can be described as timbral heterogeneity, timbral augmentation, or timbral emergence. Whereas the first two cases leave the participating instruments identifiable, the third should result in complete blending of the sounds, making them unidentifiable. These perceptual degrees of blend should correlate inversely with identification performance.

Here are four audio examples from the OrchPlay rendition of Wagner’s prelude to “Tristan and Isolde” to simulate the identification task. Each stimulus starts with the target melody based on pure tones followed by the orchestral mixture. For reference, the isolated target instrument(s) is (are) then also played back. All target instruments contained in the mixture are also listed at the end of this report.

Room Acoustics

But instrument timbre is not the only focus of my research. In striving to cover the complete acoustic communication loop, room acoustics, and with it the reverberation that provides binaural cues to the listener, will be another focus of the project. To quantify the effects of the room in which the music is played (we are talking about a concert hall that can house a full symphony orchestra), three room-acoustical parameters will be investigated. As a measure of the effective size of the room, different T60 reverberation times will be used; the T60 is the time it takes for the sound level to decay by 60 dB after the source has stopped. The two other parameters are the distance of the listener to the stage, as a measure of the direct-to-reverberant ratio, and the spatial configuration of the instruments on stage, which provides different degrees of binaural cues from early reflections.
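As a back-of-the-envelope illustration of how T60 relates to room geometry and absorption, here is a minimal Python sketch using Sabine’s classic formula. The hall dimensions and absorption coefficients are invented illustration values, not measurements of any real hall and not the parameters used in the simulations.

```python
# Minimal sketch: estimating T60 for a shoebox-shaped hall with Sabine's formula.
# All dimensions and absorption coefficients are made-up illustration values.

def sabine_t60(length_m, width_m, height_m, alpha_walls, alpha_floor, alpha_ceiling):
    """Return the Sabine estimate T60 = 0.161 * V / A (seconds)."""
    volume = length_m * width_m * height_m
    wall_area = 2 * (length_m + width_m) * height_m
    floor_area = length_m * width_m
    # equivalent absorption area in m^2 (sum of surface area times absorption coefficient)
    absorption = (wall_area * alpha_walls
                  + floor_area * alpha_floor
                  + floor_area * alpha_ceiling)
    return 0.161 * volume / absorption

# Example: a large hall with mostly reflective surfaces and an absorbing (occupied) floor.
print(f"T60 ≈ {sabine_t60(50, 30, 20, 0.15, 0.8, 0.15):.2f} s")
```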

Audio Material

Simulated Score

To perform the identification task and retain control over individual instruments, multichannel audio of the entire score is needed. The OrchPlayMusic library provides individual stereo instrument tracks. The instruments’ audio is highly realistic and stems from a combination of sound libraries that provide recordings of individual instrument sounds across the entire pitch range, dynamic range, and articulation palette of each instrument. Different sound libraries use different microphone techniques, such as close miking, A/B miking, or a combination of the two. Some instruments sound better with close miking (mostly woodwinds) and others with A/B miking (brass and strings). The resulting “dry” instrument tracks were then mixed in a digital audio workstation (DAW) to add reverb, equalize individual instruments, and apply other dynamic and tuning effects, yielding a hybrid acoustical-digital rendition of the score that sounds pleasing and recreates a concert hall recording.

Manipulation of Audio Tracks

For this project, the multichannel audio is used to render each instrument in a virtual acoustic environment (VAE), as described in more detail in the following section on acoustical rendering. The idea is to render and manipulate the room acoustics, i.e., the reverberation, using a room-acoustic simulation tool. But there is a problem: since the audio from OrchPlay already includes reverb, simulating a concert hall with these stimuli would add further reverb on top of the existing one. This does not reflect a physically and acoustically realistic scenario. To overcome this issue of “double reverb”, changes had to be made to the mix for the individual instruments. First, the selected reverb, in this case that of the Grand Hall of the Berliner Philharmonie, was removed for all instruments. Second, in an attempt to ensure correct sound latencies between instruments in the room simulation, the additional instrument-dependent digital delays added in the DAW were also removed; in the stereo recording, these delays create the impression of distance between the instruments. The resulting tracks now featured the original recordings with their different miking techniques, along with the added spectral, dynamic, and panning alterations.

It turned out that further steps had to be taken to make the stimuli ready for processing. Since the spatial location of the instruments is governed by the room-acoustic simulation software, the initial panning of the stereo audio had to be removed and the tracks mixed down to mono signals. Furthermore, the digital delay was added back in, since instrument entrances appeared scattered in time during first tests in the VAE; keeping the delay from the original tracks resolved this issue.
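For readers who want to reproduce this kind of track preparation, here is a minimal sketch of the down-mix and delay step in Python; the file names and the delay value are hypothetical placeholders, not the actual OrchPlay track names or delays.

```python
# Minimal sketch: down-mix a stereo instrument track to mono and re-apply an
# instrument-dependent delay. File names and delay value are placeholders.
import numpy as np
import soundfile as sf  # assumed available (pip install soundfile)

audio, fs = sf.read("horn1_dry_stereo.wav")      # shape: (n_samples, 2)
mono = audio.mean(axis=1)                        # equal-weight L/R down-mix

delay_ms = 12.0                                  # placeholder per-instrument delay
delay_samples = int(round(delay_ms * 1e-3 * fs))
delayed = np.concatenate([np.zeros(delay_samples), mono])

sf.write("horn1_dry_mono_delayed.wav", delayed, fs)
```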

Limitations and Challenges

There are still some limitations to the current set of stimuli that must be addressed in the future. For one, it has to be investigated how the recordings with different miking techniques affect the perception of the individual instruments, since some recordings already include early reflections; their audibility and superposition should be tested in psychoacoustic experiments. Another problem involves the string section, which was recorded in ensemble playing. Ideally, individual solo instruments would be rendered to make up each section and then placed in the room-acoustic simulation, but this is generally not a viable option, since it removes the tutti effect of the strings and hardly sounds realistic in the end. For the simulation, however, this means that each section is represented by a single sound source that does not fill a large portion of the virtual stage. Simply duplicating the sources to add a spatial impression would result in audible artifacts, since the sound sources are identical and thus coherent. The superposition of slightly delayed copies arising from the spatial configuration would also lead to comb-filter effects, especially for the direct sound and early reflections. Keeping the stereo image for the strings overcame this issue and added the perception of a spatially spread source in a first implementation, as described in the next sections.
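The comb-filter argument can be made concrete with a short sketch: summing a signal with a delayed copy of itself has the magnitude response |1 + exp(-j*2*pi*f*tau)|, which produces notches at odd multiples of 1/(2*tau). The 5 ms delay below is just an illustrative value.

```python
# Minimal sketch of why duplicating a coherent source causes comb filtering.
import numpy as np

tau = 0.005   # 5 ms path-length difference between the two copies (illustrative)

def comb_gain(f_hz):
    """Magnitude response of signal + delayed copy: |1 + exp(-j*2*pi*f*tau)|."""
    return np.abs(1 + np.exp(-2j * np.pi * f_hz * tau))

print("gain at 100 Hz (notch):", comb_gain(100.0))   # ~0: cancellation
print("gain at 200 Hz (peak): ", comb_gain(200.0))   # ~2: reinforcement
```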

 

Figure 1. Position of the loudspeakers and the listener in the Dark-Lab.

 

Acoustical Rendering

Simulation Software

As mentioned in the previous part, the stimuli are rendered in a virtual acoustic environment (VAE) that creates a scene with concert hall acoustics. The Toolbox for Acoustic Scene Creation and Rendering (TASCAR; https://github.com/HoerTech-gGmbH/tascar/) is used to build the direct sound path and room-specific combinations of early reflections and late reverberation utilizing feedback-delay networks (FDNs) (Grimm et al., 2019). Simply speaking, it uses a geometric model of a concert hall with absorption and reflection coefficients for the walls and ceiling and computes the reflections up to a given order; late reverberation is added as diffuse sound. The individual instruments can be freely placed within the room, as can the listener, and TASCAR can even adjust all sound paths during playback to simulate moving sources or receivers in real time. To render a whole orchestra, the instrument sound sources are placed on the stage in a three-dimensional coordinate system. The virtual listener can likewise be placed at specific coordinates, for instance in the audience or even among the instruments on stage; in that sense, sound perception from the musician’s perspective can be simulated as well.
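To make the geometry of the direct sound path concrete, here is a minimal sketch, independent of TASCAR’s actual scene format, of placing sources on a virtual stage and deriving the distance-dependent propagation delay and spherical-spreading attenuation for a given listener position. All coordinates are invented illustration values, not the actual scene layout.

```python
# Minimal sketch of the geometry behind the direct sound path: each source gets a
# distance-dependent delay (c = 343 m/s) and 1/r amplitude attenuation.
import math

SPEED_OF_SOUND = 343.0  # m/s

stage = {                      # (x, y, z) in metres, hypothetical positions
    "oboe1":   (-1.0,  8.0, 1.2),
    "horn1":   ( 3.5, 10.0, 1.2),
    "timpani": ( 0.0, 12.0, 1.0),
}
listener = (0.0, 0.0, 1.2)     # a seat in the audience

for name, pos in stage.items():
    r = math.dist(pos, listener)            # source-listener distance
    delay_ms = 1e3 * r / SPEED_OF_SOUND     # propagation delay of the direct sound
    gain = 1.0 / r                          # spherical spreading (relative amplitude)
    print(f"{name:8s} r = {r:5.2f} m, delay = {delay_ms:5.2f} ms, gain = {gain:.3f}")
```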

Speaker layout

The VAE is currently played back through a rectangular setup of 16 loudspeakers (Figure 1). Twelve loudspeakers are spaced with a separation angle of 30 degrees such that four of them sit at 0, 90, 180, and –90 degrees. To the front (0 degrees), where the stage is rendered in the scene, four additional loudspeakers fill the gaps at 15, 45, –45, and –15 degrees to decrease the separation angle to 15 degrees. The full set of loudspeaker azimuths is thus 0, 15, 30, 45, 60, 90, 120, 150, 180, –150, –120, –90, –60, –45, –30, and –15 degrees. The setup was calibrated to a reference level of 70 dB SPL pink noise for both the direct sound of individual speakers and the combined diffuse field. Different playback options can be selected, including nearest-speaker panning (NSP), vector-base amplitude panning (VBAP), and higher-order two-dimensional Ambisonics. The listener is placed in the center of the array for optimal room-acoustic playback. In the future, the VAE will be played back through an even larger array of 48 loudspeakers, which will allow more precise localization of instruments when using the NSP method.
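As an illustration of the VBAP option mentioned above, the following sketch computes power-normalised gains for a source panned between two adjacent front loudspeakers of the described layout. It is the generic textbook 2-D VBAP formulation, not the specific implementation used in TASCAR, and the 22-degree source direction is just an example.

```python
# Minimal sketch of 2-D vector-base amplitude panning (VBAP) for a source
# between two loudspeakers of the front array (here 15 and 30 degrees).
import numpy as np

def unit(az_deg):
    """Unit vector in the horizontal plane for a given azimuth in degrees."""
    a = np.deg2rad(az_deg)
    return np.array([np.cos(a), np.sin(a)])

def vbap_2d(source_az, spk_az_1, spk_az_2):
    """Return power-normalised gains for the loudspeaker pair enclosing the source."""
    L = np.column_stack([unit(spk_az_1), unit(spk_az_2)])  # 2x2 base of speaker directions
    g = np.linalg.solve(L, unit(source_az))                # p = g1*l1 + g2*l2
    return g / np.linalg.norm(g)                           # keep total power constant

# A source at 22 degrees azimuth, panned between the speakers at 15 and 30 degrees:
print(vbap_2d(22.0, 15.0, 30.0))
```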

Leveling

The next steps included leveling the individual instruments and finding a suitable playback option for the tutti strings. For leveling, a particular instrument was taken as a reference within each group, and the other instruments were then adjusted relative to it; every instrument group was leveled separately. For example, within the brass the horns, which have a dominant and characteristic role in the prelude, were set to a realistic sound level first, and the trombone, trumpets, and tuba were then adjusted accordingly. The same could be done for the woodwinds; however, only their global sound level was adjusted, since their relative levels were already highly optimized for ensemble playing and blend. For the tutti strings, before adjusting the levels, a different playback solution was investigated, which entailed the aforementioned splitting of the left and right channels of the stereo recording. Instead of down-mixing to mono, as was done for the woodwinds, brass, and timpani, the separated stereo channels are placed so as to create the spatial impression of, e.g., a section of first violins spanning from the far left to the center of the stage as seen from the audience. The method should create this spatial, enveloping effect for the listener without producing audible artifacts like comb filtering, essentially recreating the original stereo image within the scene. First tests in the lab confirmed this choice to be a good solution for “faking” the tutti strings, although it does not necessarily represent the physical truth within the VAE. Nonetheless, for now it provides a good and convincing playback option. Still, the implications for identification tasks specifically involving the strings must be considered in the future.
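For completeness, here is a minimal sketch of what the relative leveling amounts to numerically: one instrument per group serves as the 0 dB reference and the others are scaled by dB offsets relative to it. The offsets and signals below are hypothetical placeholders, not the values used in the actual mix.

```python
# Minimal sketch of relative levelling within an instrument group.
import numpy as np

def db_to_gain(db):
    """Convert a level offset in dB to a linear amplitude factor."""
    return 10.0 ** (db / 20.0)

# horn1 is the reference (0 dB); offsets for the rest of the brass group are placeholders
brass_offsets_db = {"horn1": 0.0, "trumpet1": -3.0, "trombone1": -2.0, "tuba": -4.0}

tracks = {name: np.random.randn(48000) for name in brass_offsets_db}  # stand-in audio
levelled = {name: db_to_gain(brass_offsets_db[name]) * sig for name, sig in tracks.items()}
```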

Conclusion and Outlook

This project update described first steps and implementations using the prelude to Tristan and Isolde by Richard Wagner as a case study for my PhD research project on modeling the identification of musical instruments under realistic acoustical conditions. Instrument stimuli from OrchPlay were used and manipulated to allow room-acoustic rendering in a virtual acoustic environment (VAE). Using the simulation software TASCAR, early reflections and late reverberation give the realistic impression of a concert hall. Source positioning and leveling for all instruments already demonstrated a successful incorporation and realistic recreation of the orchestral simulations from OrchPlay.

The project thus features a practical implementation of OrchPlay for scientific research on musical scene analysis. Not only the use of orchestral stimuli in psychoacoustic experiments but also the option of room-acoustical rendering in a surround-sound environment shows the capabilities and challenges of applying OrchPlay in such a setting. It is a first attempt at combining the software with external room-acoustic simulations, which could open up the opportunity for surround-sound playback directly through OrchPlay.

In terms of the course of my research project, the current setup provides the foundation for future experiments with musicians and non-musicians as well as normal-hearing and hearing-impaired participants. Identification performance, but also sound quality ratings, will provide insights into the mechanisms of musical scene analysis and how musical and room-acoustical elements shape this process in the auditory system.

Figure 2. Félix and Simon working on the virtual orchestra scene in the Dark-Lab at the University of Oldenburg. 

Acknowledgements

I would like to thank Stephen McAdams, the Music Perception and Cognition Lab and the ACTOR program for having me in Montreal in October 2022. Special thanks to Félix Baril for providing the OrchPlay sounds, helping me with stimuli manipulations and coming to Oldenburg for a joint venture in mixing the instruments in the virtual scene.

Reference List

  • Grimm, G., Luberadzka, J., & Hohmann, V. (2019). A toolbox for rendering virtual acoustic environments in the context of audiology. Acta Acustica united with Acustica, 105(3), 566–578.

  • Sandell, G. J. (1995). Roles for spectral centroid and other factors in determining “blended” instrument pairings in orchestration. Music Perception, 13(2), 209–246. https://doi.org/10.2307/40285694

Resolution of instruments in audio examples

  • Audio Example 1, target instrument(s): oboe

  • Audio Example 2, target instrument(s): 2 bassoons (unison)

  • Audio Example 3, target instrument(s): 2 French horns (loco and 8vb)

  • Audio Example 4, target instrument(s): 2 oboes (unison), English horn, 2 clarinets (unison), 2 bassoons (unison 8vb), 2 French horns (loco and 8vb)
