AUDITORY MODELLING FOR ASSESSING ROOM ACOUSTICS
Jasper van Dorp Schuitman


Auditory modelling for assessing room acoustics

DISSERTATION for the degree of doctor at Delft University of Technology, by authority of the Rector Magnificus, Prof. ir. K.C.A.M. Luyben, chairman of the Board for Doctorates, to be defended in public on Thursday 15 September 2011 at 10:00, by Jasper VAN DORP SCHUITMAN, ingenieur in physics, born in Zwijndrecht.

This dissertation has been approved by the promotors: Prof. dr. ir. A. Gisolf and Prof. dr. ir. D. de Vries.

Composition of the doctoral committee:
Rector Magnificus, chairman
Prof. dr. ir. A. Gisolf, Technische Universiteit Delft, promotor
Prof. dr. ir. D. de Vries, RWTH Aachen University, promotor
Prof. dr. ir. P. Kruit, Technische Universiteit Delft
Prof. dr. A.G. Kohlrausch, Technische Universiteit Eindhoven
Prof. dr. rer. nat. M. Vorländer, RWTH Aachen University
Prof. dr. ir. S. van de Par, Carl von Ossietzky Universität Oldenburg
Dr. E. Özcan Vieira, Technische Universiteit Delft
Prof. dr. H.P. Urbach, Technische Universiteit Delft, reserve member

ISBN
Copyright © 2011 by J. van Dorp Schuitman, Laboratory of Acoustical Imaging and Sound Control, Faculty of Applied Sciences, Delft University of Technology, Delft, The Netherlands. All rights reserved. No part of this publication may be reproduced, stored in a retrieval system or transmitted in any form or by any means, electronic, mechanical, photocopying, recording or otherwise, without the prior written permission of the author, J. van Dorp Schuitman, Faculty of Applied Sciences, Delft University of Technology, P.O. Box 5046, 2600 GA Delft, The Netherlands.

This research is supported by the Dutch Technology Foundation STW, which is part of the Netherlands Organisation for Scientific Research (NWO) and partly funded by the Ministry of Economic Affairs, Agriculture and Innovation (project number DTF.7459).

Typesetting system: LaTeX. Printed in The Netherlands by Ipskamp Drukkers, Enschede.

Contents

Symbols and notation

1 Introduction: The quality of room acoustics; Problem statement; Thesis outline

2 Room acoustics: Sound waves in enclosed spaces; The Helmholtz equation; Spherical waves; A point source in a bounded medium; Impulse response measurements; Room acoustical parameters; Reverberance; Clarity; Intimacy; Loudness; Apparent Source Width; Listener Envelopment; Brilliance; Warmth; Speech intelligibility; Support; Influence of digital filtering; Shortcomings of current methods; Spatial fluctuations; Measuring in empty halls; The influence of signal properties; Properties of the human auditory system; The perception of loudness; Revised objective parameters; Solving problems with spatial fluctuations; Improved measures related to spaciousness; Other improved parameters; Summary and discussion

3 The AMARA method: Psycho-acoustics; The human auditory system; Binaural hearing; Applying auditory modelling in room acoustics; The advantages of auditory modelling; Room acoustical attributes; Previous work; The binaural auditory model; Monaural model; Examples of masking threshold predictions; Binaural extension; Objective parameters from the model outputs; Model output examples; Splitting of the output stream; Direct and reverberant streams for ITD; Objective parameters; Reverberance; Clarity; Apparent Source Width; Listener Envelopment; Summary and discussion

4 Simulating room acoustics: A simulator for shoebox-shaped rooms; Computation in frequency bands; Initial estimation of the energy decay; Simulating the direct sound; Early reflections; Late reverberation; Diffuse reverberation; Total response; Simulating different receivers; Virtual microphones; Binaural response; Usage of the room simulator; Evaluation of a new measure for speech intelligibility; Summary and discussion

5 Listening tests: Listening test method; Rating procedure; Test setup; Instructions for the subjects; Level normalization; Determining the SPL of the audio samples; Re-distribution of the results; Statistical analysis of the results; Listening test I (Rooms, Subjects, Results, Analysis of the results); Listening test II (Rooms, Subjects, Results, Analysis of the results); Listening test III (Rooms, Subjects, Results, Analysis of the results); Listening test IV (Rooms, Subjects, Results, Analysis of the results); The effect of SPL differences on the results; Summary and discussion

6 Optimization and validation of the model: Model optimization using a genetic algorithm; Optimization of the free parameters in the model; The genetic algorithm; Results of the optimization procedure; Model validation; Mapping of objective parameters onto perceptual attributes; Summary and discussion

7 Practical aspects of the AMARA method: Applying the model in practice; A typical measurement setup (1); A typical measurement setup (2); Software implementation; Minimum signal length; Robustness to noise; Test rooms; Robustness of the indirect method; Robustness of the direct method; Influence of SPL errors; Stimulus types; Voice stimuli; Instrument stimuli; Ensemble stimuli; Suitable test signals; Using microphones instead of a dummy head; Influence of the dummy head position; Influence of the dummy head orientation; Summary and discussion

8 Conclusions and outlook: Conclusions; Recommendations for further research; Outlook

A The auditory model: A.1 Model input; A.2 Outer and middle-ear transfer function; A.3 Basilar membrane; A.4 Absolute threshold of hearing; A.5 Inner hair cells; A.6 Adaptation; A.7 Scaling; A.8 Binaural processor; A.9 Envelope extraction

B The Replaygain algorithm: B.1 Equal loudness filter; B.2 RMS calculation; B.3 Estimating the loudness level; B.4 Calibration with a reference level

C Sound absorption through air

D Listening test instructions

E Statistical analyses: E.1 N-way ANOVA (E.1.1 Factors; E.1.2 Main effects and interaction effects; E.1.3 Performing an N-way ANOVA; E.1.4 Presentation of the ANOVA results); E.2 Tukey HSD

F The Gauss-Newton algorithm

Bibliography
Summary
Samenvatting
About the author
Dankwoord


Symbols and notation

Table 1: Quantities
a: Slope of decay (dB/s), sigmoid constant, sound transmission coefficient
b: Boundary index, sigmoid constant, bandwidth constant
c: Velocity of sound (m/s)
d: Mean free path (m)
df: Degrees of freedom
f: Frequency (Hz)
f_s: Sampling frequency (Hz)
h: Impulse response, relative humidity (%)
i: Repeat index, chromosome index
j: Attribute index, adaptation loop index
k: Wave number (m^-1), frequency band index
m: Sample index, Gammatone filter index
n: Sample index, boundary crossing index, interval index
p: Pressure (Pa)
q: Frequency band index, filter constant, Tukey's q-value
r: Distance (m), correlation coefficient, filter constant, residual
r (vector): Location vector
s: Scattering coefficient, source signal
t: Time (s)
u: Mirror source index, filter constant
v: Particle velocity (m/s)
w: Pink noise signal, weighting window
w_s: White noise signal
x: Distance (m)
x: x-position (m), time signal
y: y-position (m), time signal
z: z-position (m), frequency band index
A: Total amount of sound absorption, scaling factor
ASW: Apparent Source Width
ATH: Absolute Threshold of Hearing
B: Bandwidth (Hz)
BMLD: Binaural Masking Level Difference (dB)
BR: Bass Ratio
C_50, C_80: Clarity index (dB)
E: Energy (J)
D_50: Definition
EDT: Early Decay Time (s)
ERB: Equivalent Rectangular Bandwidth
F: Fitness value, F-ratio
G: Sound strength (dB), angular sensitivity
GLL: Late Lateral Sound Level (dB)
H: Transfer function
IACC_E: Early Interaural Cross-Correlation
IACC_L: Late Interaural Cross-Correlation
ILD: Interaural Level Difference
ITD: Interaural Time Difference
ITDG: Initial Time Delay Gap (s)
JND: Just Noticeable Difference
K: Bulk modulus (Pa)
L: (Average) signal level (MU), SPL (dB)
L_N: Loudness (phon)
L_RMS: RMS level (dB)
L_S: Loudness (sone)
LEV: Listener Envelopment (dB)
LF: (Early) lateral energy fraction
M: Number of observations
MS: Mean sum of squares
N: Gaussian noise, number of reflections, chromosomes or intervals
N_s: Number of reflections per sample
O_s: Maximum order of specular reflections
P: Pressure (Pa), room acoustical parameter
RT: Reverberation time (s)
S: Surface area (m^2), source signal, scaled room acoustical parameter, sum of squares
SNR: Signal-to-noise ratio
SPL: Sound Pressure Level (dB)
SS: Sum of squares
ST_early: Early Support
ST_late: Late Support
STI: Speech Transmission Index
T: Time (s), temperature (K)
T_20/30/60: Reverberation time (s)
T_S: Centre time (s)
TR: Treble Ratio
V: Room volume (m^3)
α: Absorption coefficient
β: Weighting constant
Δα: Level attenuation step (dB)
γ: Logarithmic frequency ratio, step length
δ: Amount of fluctuation
ε: Threshold (MU)
ζ: Mask
θ: Angle (elevation)
µ: Weighting constant, Armijo condition constant
ν: Weighting constant, Gammatone filter order
π: Pi (constant)
ρ: Mass density (kg/m^3)
σ: Standard deviation
τ: Time shift (s), time constant (s)
φ: Angle (azimuth)
ψ: Starting phase
ω: Angular frequency (rad/s)
Ψ: Step size, model output

Table 3: Discrete and continuous signals
x(t): Continuous signal
x[n]: Discrete signal

Table 4: Scalars, vectors and matrices
a: Scalar
a (bold): Vector
A (bold): Matrix

Table 5: Functions
β: Incomplete regularized beta function
δ: Dirac delta function
Q: Tukey's studentized range (q)

Table 6: Mathematical operations
x ∗ y: y convolved with x
x ∗^-1 y: y deconvolved with x
max: Maximum
Re: Real part
x*: Complex conjugate of x
x^T: Transpose of x

Table 7: Operators
H: Hilbert transform
F: Fourier transform
F^-1: Inverse Fourier transform

Table 8: Various acronyms
3IFC: 3-Interval Forced-Choice
AAC: Advanced Audio Coding
AMARA: Auditory Modeling for Assessing Room Acoustics
ANOVA: Analysis of Variance
BRIR: Binaural Room Impulse Response
CS: Central Spectrum
EC: Equalization-Cancellation
EI: Excitation-Inhibition
GUM: Guide to the expression of Uncertainties in Measurements
HRTF: Head-Related Transfer Function
IIR: Infinite Impulse Response
JND: Just Noticeable Difference
MLS: Maximum-Length Sequence
MP3: MPEG-1 Audio Layer 3
MU: Model Units
QESTRAL: Quality Evaluation of Spatial Transmission using an Artificial Listener
SQAM: Sound Quality Assessment Material


1 Introduction

"Acoustics are not a science, not even an art, but a roll of the dice" (Bernard Holland)

1.1 The quality of room acoustics

Ever since groups of people gathered in spaces to communicate, enjoy theater, listen to a musical performance, etc., designers not only had to deal with the construction and visual aspects but also with the way sound waves behave in the room. Even in ancient times, the Greeks and the Romans used basic knowledge of acoustics to improve the speech intelligibility of their theaters. However, it was not until the late 19th century that a scientific approach to room acoustics was applied for the first time. In 1895, the Harvard physicist Wallace Clement Sabine founded the science of architectural acoustics when he performed experiments to derive an estimate of the reverberation time, which is the time it takes for the sound pressure level in a room to decay by 60 dB after the sound source stops. Using rooms filled with varying amounts of cushions, carpets and student bodies, a stopwatch and a sound source in the form of an organ pipe with a fixed frequency of 512 Hz, he found that the reverberation time increases with increasing room volume and is inversely proportional to the amount of sound-absorbing material. After weeks of experimenting, he came up with the famous Sabine formula for the reverberation time:

T_{60} = 0.161 \frac{V}{\alpha S}    (1.1)

where V is the room volume, S the total wall area and α the average absorption coefficient in the room. The formula is still used very often today to make an initial estimate of the reverberation time of a room, although its applicability

is somewhat limited to medium-sized auditoria, it assumes the wave field to be diffuse, and it does not take air absorption into account. However, it was the first objective parameter describing the acoustics of a room and it marked the start of modern room acoustical science. Sabine was asked to work as an acoustical consultant during the design of the Boston Symphony Hall. This hall opened in October 1900 and it is still considered to be one of the three best concert halls in the world, together with the Concertgebouw in Amsterdam (1888) and the Vienna Grosser Musikvereinssaal (1870). Sabine's findings were published in his Collected Papers on Acoustics [Sabine, 1922].

Figure 1.1: Boston Symphony Hall (photo: Michael J. Lutch).

The praise for a concert hall like the Boston Symphony Hall of course raises the question whether it is possible to qualify what are good and bad acoustics, and whether one room can be acoustically better than another. Obviously, this is a very complex topic, since quality in the context of sound is a highly subjective and multidimensional phenomenon. Perhaps this is best illustrated by a quote of conductor George Szell in reference to the interior of the Philharmonic Hall: "How can one make beautiful music in a blue room?" Although this might seem a little extreme, it has been found that colour can indeed have an influence on how people judge the loudness of a sound source, for example [Menzel et al., 2009]. Another example of the subjective nature of acoustics, and how it also depends on context, is the comment by vocalist Janne Schra of the Dutch jazz-pop band Room Eleven after performing in the universally acclaimed Concertgebouw in Amsterdam: "The acoustics were terrible. For me, the intensification should come from the music itself, not from the building or the size of the event" (OOR magazine, Jan/Feb 2009). Given the complexity of the topic it is necessary to define the term quality carefully first. A very suitable definition has been proposed by Jekosch [2004]:

Definition 1. Quality: The physical nature of an entity with regards to its ability

to fulfill predetermined and fixed requirements.

Definition 2. Physical nature: The totality of the features of an entity and the values assigned to these features.

Definition 3. Entity: The material or immaterial item under investigation.

Jekosch related these definitions to product-sound quality, but without modification they apply to room acoustics as well. Following the definition by Jekosch, quality can be broken down into features:

Definition 4. Quality features: Properties for recognizing or distinguishing between entities.

In other words, in order to be able to assess the acoustical quality of a room objectively, one has to know the various features that together contribute to the overall acoustical quality (perceptual attributes) and find objective parameters that correlate with those features. The groundbreaking work of Sabine was a first step in this direction, since he found that the amount of reverberance in a room can be an indicator for why some rooms are preferred over others. Subsequently, he also proposed a measure for the amount of reverberance in the form of the reverberation time. One could say that after the work of Sabine progress in this direction has been very slow; for a very long time the reverberation time was the only quantity considered. In the first volume of the Journal of the Acoustical Society of America (JASA), MacNair proposed a method to determine the necessary amount of acoustical absorption in a room such that the sensation level due to the reverberation decays uniformly across frequencies [MacNair, 1930]. This was carried out by relating this sensation level to a, not clearly defined, physical measure for loudness as a function of frequency. More progress was made in the 1950s by Thiele, who added definition as another important attribute of room acoustics [Thiele, 1953]. Definition is a measure for how well the different components in the sound, e.g. notes in music, stand apart from each other. Thiele proposed objective parameters which are related to the perceived amount of definition, based on energy ratios between early and late arriving sound energy. In the 1960s, Beranek [1962] added a range of important parameters to the list, like the Initial Time Delay Gap (ITDG, the time between the direct sound and the first major reflection), which is related to what Beranek calls intimacy. A hall has a high acoustical intimacy when the audience has the feeling that the music is being played in a small hall and feels connected with the performers [Mitchinson, 21]. Beranek also found that loudness is an important aspect of acoustics [Beranek, 1962]. One of the first large-scale experiments was performed by Schroeder et al. [1974], who surveyed 22 European halls and found a coupling between various perceptual attributes and objective parameters. Parallel work was carried out in Berlin by Plenge et al. [1975], who also found that the acoustic quality of a hall can be described by a set of objective parameters. One of the major differences between these

two studies was the fact that Schroeder et al. used binaural recordings of the reproduction of an orchestra playing (using two loudspeakers), while Plenge et al. used binaural recordings of a real orchestra (the Berlin Philharmonic Orchestra) playing in the halls. Similar work was performed by Beranek in the 1990s [Beranek, 1996] and early 2000s [Beranek, 2003]. Other important work has been performed by Ando [1983], Barron [1988] and many others. This research all resulted in a long list of objective parameters describing various aspects of room acoustics, of which a historical overview can be found in [Lacatis et al., 2008]. Today, consensus has been reached on a small set of common objective parameters which are mostly used to describe the acoustics of a room, and most of them are standardized in ISO standard 3382 [ISO, 2009, 2008]. Although the work described above involves mostly concert halls for classical music, the findings apply to all kinds of rooms where the acoustics are of importance. Of course, the parameters have different optimal values depending on the application (speech, music, etc.). Also, for some applications certain attributes may be more important than others. For example, speech intelligibility is the most important attribute in classrooms, while it is of virtually no importance in a symphony hall.

1.2 Problem statement

Although a set of established parameters is now widely used in the field of room acoustics, it is also known that they have some significant shortcomings. As will be discussed in Chapter 2, the values for the objective parameters can fluctuate severely over small spatial measurement intervals, while the perceptual attributes for which these parameters should be predictors remain constant. Furthermore, in some cases the objective parameters simply do not correlate highly enough with their perceptual counterparts. Also, in the current methods the properties of the type of sources for which a room is intended are barely taken into account. Finally, because of the nature of the methods, measurements are often carried out in empty halls, while the presence of an audience has a big influence on some aspects of the acoustics. These uncertainties and inaccuracies are very much undesirable, of course, as was also stated so bluntly by New York Times critic Bernard Holland. The quote at the start of this chapter originates from an article by him, which was published after the re-opening of Carnegie Hall in New York. The hall had been renovated, and apparently the acousticians had failed to preserve the much-appreciated acoustics it had before [Mitchinson, 21]. The previously mentioned shortcomings are closely related to the fact that the established parameters mostly describe physical attributes of a room, while the perception of these attributes is what is most important. Therefore, in this thesis a method for assessing objective parameters related to various room acoustical attributes will

be proposed: Auditory Modelling for Assessing Room Acoustics (AMARA). This method is based on simulating the human auditory system. From the auditory model, room acoustical parameters can be extracted. These parameters can be used by acousticians, architects and other people who deal with room acoustics to make statements about various acoustical attributes of a room, or to compare different rooms in terms of their acoustics. This method also takes properties of the source signal into account, since it extracts the acoustical parameters directly from arbitrary binaural recordings. This means that rooms can be tested for multiple applications; for example, the acoustics of a theatre can be tested for both speech and chamber music. Finally, the method can be used to assess the acoustics in occupied rooms, so it is no longer necessary to measure in empty rooms. The acoustics of a concert hall can be assessed in a concert situation, for example.

1.3 Thesis outline

In Chapter 2, an overview will be given of the established objective parameters, their measurement methods and their shortcomings. Chapter 3 presents an auditory model from which new, perception-based objective parameters related to room acoustics can be derived, together with methods to determine these parameters. Other, comparable research in this field will also be discussed. In order to optimize and validate the auditory model, various listening tests were conducted. A room acoustics simulator used in these tests is presented in Chapter 4, followed by a description of the listening tests and their results in Chapter 5. Chapter 6 discusses the procedures to optimize and validate the auditory model using the listening test results. Chapter 7 deals with the application of the model, such as its robustness in terms of measurement noise and other practical aspects. The results will be summarized and discussed in Chapter 8. In order to give a better overview of the various stages in this research, Figures 1.2, 1.3 and 1.4 show the stages for the development, optimization and validation of the model, respectively. The figures also show the corresponding chapter numbers.

Figure 1.2: The various stages in the development of the auditory model, including the corresponding chapter numbers.

Figure 1.3: The various stages in the optimization of the auditory model, including the corresponding chapter numbers.

Figure 1.4: The various stages in the validation of the auditory model, including the corresponding chapter numbers.


2 Room acoustics

And though I may speak of sound-waves
In the course of this discussion
I shall do so under protest
With this frank asseveration
That sonorous undulations
Are a work of pure invention
Brilliantly imaginary
Having not the least foundation
Either in the laws of nature
Or the principles of science
(A. Wilford Hall)

The science of room acoustics deals with how sound waves behave in rooms and how the sound is perceived by humans as a result of this. This chapter starts with the theory behind sound waves in enclosed spaces, followed by methods for measuring objective parameters which are related to various perceptual acoustical attributes. These methods have some shortcomings, which will be explained at the end of the chapter, together with some recent attempts in the literature to improve them.

2.1 Sound waves in enclosed spaces

2.1.1 The Helmholtz equation

Consider a three-dimensional homogeneous, isotropic fluid in which acoustic waves can propagate. In such a medium, Newton's second law of motion describes the relation between the space derivative of the pressure p and the time derivative of the particle velocity v at a given point r and time t as follows:

\nabla p(\mathbf{r}, t) = -\rho \frac{\partial \mathbf{v}(\mathbf{r}, t)}{\partial t}    (2.1)

where ρ is the mass density of the fluid. Hooke's law relates the space derivative of the particle velocity to the time derivative of the pressure:

\nabla \cdot \mathbf{v}(\mathbf{r}, t) = -\frac{1}{K} \frac{\partial p(\mathbf{r}, t)}{\partial t}    (2.2)

where K is the bulk modulus of the fluid. Equations 2.1 and 2.2 can be combined by taking the divergence of 2.1:

\nabla^2 p(\mathbf{r}, t) - \frac{1}{c^2} \frac{\partial^2 p(\mathbf{r}, t)}{\partial t^2} = 0    (2.3)

Equation 2.3 is called the wave equation, in which c = \sqrt{K/\rho} is the propagation speed of sound in the fluid. By taking the Fourier transform, this equation can be rewritten as the Helmholtz equation:

\nabla^2 P(\mathbf{r}, \omega) + k^2 P(\mathbf{r}, \omega) = 0    (2.4)

where ω denotes the angular frequency and k = ω/c is the wave number. The pressure of a wave field in a homogeneous, isotropic fluid can be derived by solving Eq. 2.3 or 2.4; the particle velocity then follows from Eq. 2.1.

2.1.2 Spherical waves

The wave equation will now be solved for a special case. Imagine a point source (monopole) at position:

\mathbf{r} = \mathbf{r}_0    (2.5)

Such a source transmits spherical sound waves, as shown in Fig. 2.1. In this case, the Helmholtz equation in the frequency domain reads:

\nabla^2 P(\mathbf{r}, \omega) + k^2 P(\mathbf{r}, \omega) = -4\pi \delta(\mathbf{r} - \mathbf{r}_0) S(\omega)    (2.6)

with S(ω) the spectrum of the source. Equation 2.6 has the solution:

P(\mathbf{r}, \omega) = S(\omega) \frac{\exp(-jk|\mathbf{r} - \mathbf{r}_0|)}{|\mathbf{r} - \mathbf{r}_0|}    (2.7)

Or, in the time domain:

p(\mathbf{r}, t) = \frac{s\left(t - |\mathbf{r} - \mathbf{r}_0| / c\right)}{|\mathbf{r} - \mathbf{r}_0|}    (2.8)

Equation 2.8 shows that the spherical wave field of a monopole source propagates with a velocity c and that its amplitude is inversely proportional to the distance between source and receiver.

Figure 2.1: A point source in a three-dimensional unbounded medium transmits spherical waves.

2.1.3 A point source in a bounded medium

If the point source is placed in a medium with boundaries, reflections of sound waves at these boundaries will occur. This is illustrated in Fig. 2.2, where a wave field is simulated for a monopole source transmitting a pulse at t = 0 in a rectangularly shaped (2D) room. The room has dimensions 20 × 10 m and the source is located at (x, y) = (2, 3) m, with the origin located in the middle of the room. Because of the reflections, a complex wave pattern emerges after some time. In the example of Fig. 2.2, the walls of the room are fully reflective, i.e. all energy of the incoming wave is reflected and no energy is absorbed by the wall material. In practice, a fraction of the energy will always be absorbed. This fraction is defined by the sound absorption coefficient α, which depends on the wall material and is often a function of frequency and the angle of incidence. Databases with absorption coefficients for various materials exist, like the α-database from the Physikalisch-Technische Bundesanstalt (PTB) in Braunschweig, Germany [PTB, 2006]. Table 2.1 lists some example values for common materials. The total amount of sound absorption A in a room can be expressed as:

Figure 2.2: An example of a monopole source placed in a rectangular room. The wave field in the room is simulated for different times after the source transmits a pulse. The spherical waves reflect at the boundaries of the room, introducing a complex pattern of sound waves after some time.
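As a minimal numerical sketch of Eq. (2.8), the fragment below delays and attenuates a source signal to obtain the free-field pressure at a receiver position. The sampling rate, the nearest-sample rounding of the propagation delay and the pulse example are assumptions of this sketch only, not part of the theory above.

```python
import numpy as np

def monopole_response(src_signal, fs, r_src, r_rec, c=343.0):
    """Free-field pressure at r_rec for a monopole at r_src, Eq. (2.8):
    the source signal is delayed by |r - r0| / c and scaled by 1 / |r - r0|.
    The delay is rounded to whole samples here for simplicity."""
    dist = np.linalg.norm(np.asarray(r_rec) - np.asarray(r_src))
    delay = int(round(dist / c * fs))          # propagation delay in samples
    p = np.zeros(len(src_signal) + delay)
    p[delay:] = src_signal / dist              # 1/r amplitude decay
    return p

# Example: a short pulse observed 10 m from the source
fs = 48000
s = np.zeros(fs // 10); s[0] = 1.0             # approximate Dirac pulse
p = monopole_response(s, fs, r_src=(0, 0, 0), r_rec=(10, 0, 0))
print("arrival time:", np.argmax(np.abs(p)) / fs, "s")   # about 0.029 s
```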

A = \sum_i S_i \alpha_i    (2.9)

where S_i and α_i are the area and absorption coefficient of each surface in the room, respectively. The unit of A is the Sabin, named after Wallace Clement Sabine (see Chapter 1), and is equal to the total surface area of open window (for which α = 1) in m².

Table 2.1: Frequency-dependent, random-incidence absorption coefficients α for some common materials, as taken from the PTB database [PTB, 2006]. Coefficients are given per octave band from 125 Hz to 4 kHz for: heavy carpet on concrete; wooden floor; concrete block (painted); concrete block (plastered); gypsum board; drapes, heavy velour; windows, window glass; rockwool (50 mm, 70 kg/m³); audience (2.0 persons/m²).

On a perfectly flat surface the angle of the reflected wave will be equal to that of the incoming wave (specular reflection). However, due to surface irregularities, part of the energy of the wave may also reflect diffusely in other directions (scattering), as shown in Fig. 2.3. There are two widely used coefficients which characterize this non-specular reflection of sound waves [Cox and D'Antonio, 2004]:

(1) Diffusion coefficient: This is a measure for the uniformity of the scattered energy. It is standardized in AES standard AES-4id-2001 (r2007) [AES, 2007] and can be used to measure the quality of a diffuser. A diffuser is a device which is designed to scatter sound intentionally in non-specular directions.

(2) Scattering coefficient: This coefficient is defined in ISO standard 17497-1 [ISO, 2004] as:

s = \frac{E_{\mathrm{scat}}}{E_{\mathrm{spec}} + E_{\mathrm{scat}}}    (2.10)

with E_scat the total energy scattered in non-specular directions and E_spec the energy reflected in the specular direction.
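To make Eq. (2.9) and Sabine's formula (Eq. 1.1) concrete, here is a small worked sketch for a hypothetical shoebox room. All dimensions and absorption coefficients below are assumed, illustrative values, not data from the PTB database or from the thesis.

```python
# Worked example of Eqs. (2.9) and (1.1): total absorption and Sabine
# reverberation time for a hypothetical 20 x 10 x 8 m shoebox room.
# The absorption coefficients are illustrative assumptions only.
surfaces = [
    # (area in m^2, absorption coefficient alpha)
    (20 * 10, 0.10),     # floor
    (20 * 10, 0.20),     # ceiling
    (2 * 20 * 8, 0.05),  # long walls
    (2 * 10 * 8, 0.05),  # short walls
]
V = 20 * 10 * 8                                   # room volume in m^3
A = sum(S * alpha for S, alpha in surfaces)       # Eq. (2.9), in Sabin
T60 = 0.161 * V / A                               # Eq. (1.1), in seconds
print(f"A = {A:.1f} Sabin, T60 = {T60:.2f} s")    # A = 84.0 Sabin, T60 ~ 3.1 s
```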

Figure 2.3: When an incident sound wave with energy E_i reflects on an irregular surface, a part is reflected specularly with energy (1 − s)(1 − α)E_i and a part is scattered with energy s(1 − α)E_i. An amount αE_i of the energy is absorbed.

2.2 Impulse response measurements

When determining the wave field in a particular point, the room can be considered as a Linear Time-Invariant (LTI) system. This means that the pressure p_R(t) at an arbitrary receiver location R inside the room can be written as a convolution:

p_R(t) = h_{SR}(t) \ast s(t)    (2.11)

with h_{SR}(t) the impulse response of the room from the source S to the receiver R, and s(t) the source signal. Or, in the frequency domain:

P_R(\omega) = H_{SR}(\omega) S(\omega)    (2.12)

Equations 2.11 and 2.12 show that in terms of acoustics, for a given source and receiver combination, a room can be completely characterized by its impulse response. An example impulse response, measured in the Concertgebouw in Amsterdam, is shown in Fig. 2.4. In room acoustics, three basic components of an impulse response are generally distinguished:

(1) Primary sound: The primary sound consists of the direct (non-reflected) sound from the source, together with very early reflections which typically arrive up to 20 ms after the direct sound. Because of the integration properties of the human auditory system, these very early reflections are perceived as a reinforcement of the direct sound [Haas, 1951].

(2) Early reflections: The part of the impulse response roughly between 20 and 80 to 100 ms after the arrival of the direct sound consists of early reflections. As will be shown later, early reflections can contribute to clarity and spaciousness.

(3) Late reverberation: In most rooms, after about 100 ms the impulse response becomes diffuse and can no longer be described in terms of discrete reflections. This part is called late reverberation.

Figure 2.4: An example of an impulse response, measured in the Concertgebouw in Amsterdam. The first 500 ms are shown.

It is straightforward to determine the impulse response by using a Dirac delta pulse as input signal (s(t) = δ(t)):

p_R(t) = h_{SR}(t) \ast s(t) = h_{SR}(t) \ast \delta(t) = h_{SR}(t)    (2.13)

However, it is physically impossible to generate a Dirac delta pulse in practice, since it is infinitely short. In the past, impulse response measurements were carried out with approximations of a delta pulse, like exploding balloons and air gun shots. It is, however, obvious that these methods have problems with reproducibility and frequency band limitations. Therefore, today other methods are used which have a high reproducibility and can use the full frequency bandwidth, like techniques based on Maximum-Length Sequences (MLS) [Rife and Vanderkooy, 1989] or sine sweeps. The sine sweep method was first introduced by Berkhout et al. [Berkhout et al., 1980]. Recently, Farina proposed some improvements on the method [Farina, 2007]. An example of a sine sweep (or "swept sine") is shown in Fig. 2.5. It is based on the following equation:

x_{\mathrm{sweep}}(t) = \sin\left[ \frac{2\pi f_0 T}{\gamma} \left( \exp\left( \gamma \frac{t}{T} \right) - 1 \right) \right]    (2.14)

with γ a constant depending on the lower and upper frequency limits f_0 and f_1:

\gamma = \ln\left( \frac{f_1}{f_0} \right)    (2.15)

Equation 2.14 represents a sine wave with a frequency that increases exponentially from f_0 to f_1 over a time T.

Figure 2.5: Left: the first 3 seconds of a swept sine with a frequency increasing from 10 Hz to 24 kHz, in the time domain. Right: a spectrogram (Short-Time Fourier Transform) of the swept sine (in dB).

When using a swept sine as source signal s(t), the impulse response from source position S to receiver R can easily be obtained from a measurement P_R(ω) in the frequency domain:

H_{SR}(\omega) = S(\omega)^{-1} P_R(\omega)    (2.16)

provided that the inverse of S(ω) exists, which is the case for a sine sweep over the full frequency range. The operation in Eq. 2.16 is called deconvolution. Two advantages of using a sine sweep as a source signal were already mentioned: high reproducibility and excitation of the full bandwidth. A third advantage is the fact that a sine sweep is always full-scale in the time domain, which is advantageous for reaching a high Signal-to-Noise Ratio (SNR). Finally, exponential sine sweeps (see Eq. 2.14) have more energy at lower frequencies. This is useful in practice, since the level of measurement and background noise is generally higher in that frequency range.
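The following sketch generates an exponential sweep according to Eqs. (2.14) and (2.15) and recovers an impulse response by the spectral division of Eq. (2.16). The small regularization constant eps and the toy two-reflection "room" used as a sanity check are assumptions of this sketch, not part of the method described above.

```python
import numpy as np

def exp_sweep(f0, f1, T, fs):
    """Exponential sine sweep, Eq. (2.14), with gamma = ln(f1/f0), Eq. (2.15)."""
    t = np.arange(int(T * fs)) / fs
    gamma = np.log(f1 / f0)
    return np.sin(2 * np.pi * f0 * T / gamma * (np.exp(gamma * t / T) - 1.0))

def deconvolve(recording, sweep, eps=1e-8):
    """Estimate h_SR by frequency-domain division, Eq. (2.16).
    eps regularizes bins where the sweep carries (almost) no energy."""
    n = len(recording) + len(sweep) - 1
    R = np.fft.rfft(recording, n)
    S = np.fft.rfft(sweep, n)
    H = R * np.conj(S) / (np.abs(S) ** 2 + eps)
    return np.fft.irfft(H, n)

# Round trip through a toy 'room' (direct sound plus two reflections)
fs = 48000
s = exp_sweep(50.0, 20000.0, T=3.0, fs=fs)
h_true = np.zeros(fs // 2); h_true[[0, 4800, 9600]] = [1.0, 0.5, 0.25]
p = np.convolve(s, h_true)
h_est = deconvolve(p, s)[:len(h_true)]
print(h_est[[0, 4800, 9600]])   # approximately 1.0, 0.5, 0.25 (band-limited estimate)
```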

2.3 Room acoustical parameters

Based on the definition of quality in Chapter 1, it was explained that in order to assess the acoustical quality of a room, the set of perceptual attributes that together contribute to the overall quality has to be known, together with objective parameters related to these attributes. In the following subsections a total of ten attributes are discussed that are well accepted as being important for the impression of room acoustics, together with established corresponding objective parameters that can be obtained from impulse response measurements. Almost all of these attributes are mentioned in books on acoustics; see for example [Beranek, 1996], [Kuttruff, 2000] and [Rossing, 2007]. For most parameters the definitions, calculation methods and measurement procedures are standardized in ISO standard 3382 and its appendices [ISO, 2009, 2008].

2.3.1 Reverberance

In Section 2.1, it was explained that sound waves in enclosed spaces reflect at the surfaces present in the room. Depending on the amount of energy that is absorbed by these surfaces and on the dimensions of the room, it takes a certain time for the sound to decay after the source stops. The amount of reverberance (or "liveness") that is perceived in a room is related to this effect. As discussed in Chapter 1, Sabine found that the Reverberation Time (RT or T_60) is closely related to the perceived amount of reverberance. It is defined as the time it takes for the sound level to decay by 60 dB after the sound source stops. The reverberation time can be obtained from the impulse response by applying backward integration, as proposed by Schroeder [Schroeder, 1965]:

h_i^2(t) = \int_t^{\infty} [h_{SR}(\tau)]^2 \, d\tau = \int_{\infty}^{t} [h_{SR}(\tau)]^2 \, d(-\tau)    (2.17)

The reverberation time T can be calculated from the slope a of h_i^2(t) (when plotted logarithmically), as shown in Fig. 2.6. This slope is expressed in dB/s and is determined by fitting a straight line through (a part of) the energy decay. The reverberation time follows as T = 60/a. In practice, a dynamic range of 60 dB will probably not be reached during an impulse response measurement due to measurement and background noise. Therefore, ISO standard 3382 [ISO, 2009] proposes T_30, which is obtained by computing the slope between −5 and −35 dB decay using linear least-squares fitting. Likewise, T_20 is obtained using the part between −5 and −25 dB decay. Finally, the ISO standard defines the Early Decay Time (EDT) as the time obtained from the first 10 dB of the decay. According to the standard, EDT is more related to the perception of reverberance, while the reverberation times T_x, with x equal to 20, 30 or 60, are more related to the physical properties of the room. The reverberation times and early decay time are usually measured in octave bands or one-third octave bands. To arrive at a single number, the standard advises to

average over mid frequencies (the 500 and 1000 Hz octave bands). Some typical ranges for the reverberation time are shown in Table 2.2. The Just Noticeable Difference (JND) for EDT and the reverberation time is 5% relative to the value [ISO, 2009; Barron, 2005].

Figure 2.6: The reverberation time can be obtained from an impulse response by estimating the slope of the backward-integrated curve. Here, the slope is estimated by fitting a line through the part of the decay between −5 and −35 dB.

Table 2.2: Typical values for the reverberation time for different types of rooms (taken from [Boone et al., 1995]). Room types covered: living room (0.5 s), cinema (0.7 to 1.0 s), theatre, chamber music hall, opera house, concert hall (classical music), church.

2.3.2 Clarity

In room acoustics, clarity describes how well different components of a signal can be perceived. Late reverberation can blur the signal, making it harder to perceive details in the signal. Early reflections may contribute to the overall loudness of the signal relative to the background, thus improving clarity. Therefore the ISO standard proposes the clarity index as follows:

C_{t_e} = 10 \log_{10} \left( \frac{\int_0^{t_e} h_{SR}^2(t)\,dt}{\int_{t_e}^{\infty} h_{SR}^2(t)\,dt} \right) \ \mathrm{[dB]}    (2.18)

For speech, t_e is set at 50 ms (C_50), while for music 80 ms (C_80) is used. The clarity index for speech, C_50, is closely related to the definition parameter D_50, which was developed by Thiele [1953] and mentioned in Chapter 1:

D_{50} = \frac{\int_0^{50\,\mathrm{ms}} h_{SR}^2(t)\,dt}{\int_0^{\infty} h_{SR}^2(t)\,dt}    (2.19)

so:

C_{50} = 10 \log_{10} \left( \frac{D_{50}}{1 - D_{50}} \right) \ \mathrm{[dB]}    (2.20)

The clarity index is usually averaged over the 500 and 1000 Hz bands. Values for the clarity index typically lie within a −5 to +5 dB range, and according to the ISO standard, its just noticeable difference is believed to be around 1 dB [ISO, 2009]. For speech applications high clarity values are preferred, while for classical music lower values are acceptable.

Another commonly used measure for clarity is the centre time T_S [Kürer, 1971], which is equal to the centre of gravity of the squared impulse response:

T_S = \frac{\int_0^{\infty} t\, h_{SR}^2(t)\,dt}{\int_0^{\infty} h_{SR}^2(t)\,dt} \ \mathrm{[s]}    (2.21)

However, as discussed by Bradley, T_S is strongly related to the energy decay and may therefore be closer to being a measure of reverberance than of clarity [Bradley, 21]. Bradley argues that energy ratio-based measures like C_50 and C_80 may be better indicators of clarity.
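To illustrate Eq. (2.17), the decay-slope fitting, and Eq. (2.18), the sketch below estimates T30, EDT and C80 from a broadband impulse response. The octave-band filtering and band averaging prescribed by ISO 3382-1 are omitted here for brevity, and the synthetic exponential decay at the end is only a sanity check, not a measured response.

```python
import numpy as np

def schroeder_db(h):
    """Backward-integrated energy decay curve, Eq. (2.17), in dB re its start."""
    edc = np.cumsum(h[::-1] ** 2)[::-1]
    return 10 * np.log10(edc / edc[0])

def decay_time(edc_db, fs, lo, hi, factor):
    """Fit a line to the EDC between `lo` and `hi` dB and extrapolate to -60 dB.
    T30: lo=-5, hi=-35, factor=2; EDT: lo=0, hi=-10, factor=6."""
    idx = np.where((edc_db <= lo) & (edc_db >= hi))[0]
    slope, _ = np.polyfit(idx / fs, edc_db[idx], 1)   # dB per second
    return factor * (lo - hi) / abs(slope)

def clarity(h, fs, te_ms=80):
    """Clarity index C_te, Eq. (2.18)."""
    n = int(te_ms * 1e-3 * fs)
    return 10 * np.log10(np.sum(h[:n] ** 2) / np.sum(h[n:] ** 2))

# Toy exponential decay with T60 = 2.0 s as a sanity check
fs, T60 = 48000, 2.0
t = np.arange(int(3 * fs)) / fs
h = np.random.randn(len(t)) * 10 ** (-3 * t / T60)    # energy drops 60 dB per T60
edc = schroeder_db(h)
print("T30 ~", round(decay_time(edc, fs, -5, -35, 2), 2), "s")   # ~2.0 s
print("EDT ~", round(decay_time(edc, fs, 0, -10, 6), 2), "s")    # ~2.0 s
print("C80 =", round(clarity(h, fs, 80), 1), "dB")               # roughly -5 dB here
```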

2.3.3 Intimacy

As explained in Chapter 1, Beranek defined intimacy as the degree to which the audience has the feeling that the music is being played in a small hall and feels connected with the performers. He also found that intimacy is related to the Initial Time Delay Gap (ITDG), which is the time between the arrival of the direct sound and the arrival of the first major reflection. When the first major reflection arrives shortly after the direct sound, the room is perceived as small. When a concert hall is very wide, the ITDG becomes large and the room will lack intimacy. In concert halls for symphonic repertoire the ITDG typically lies between 15 and 30 ms [Beranek, 1996].

2.3.4 Loudness

In acoustics, loudness is defined as the perceived level of a signal. Due to reverberance in a room, the perceived sound level of a source is louder compared with a free-field situation. This is expressed in the sound strength G:

G = 10 \log_{10} \left( \frac{\int_0^{\infty} h_{SR}^2(t)\,dt}{\int_0^{\infty} h_{10}^2(t)\,dt} \right) \ \mathrm{[dB]}    (2.22)

where h_10(t) is an impulse response as measured in a free field at 10 m distance. Typical values found for G are in the range of −2 to +10 dB and the reported just noticeable difference is 1 dB [ISO, 2009]. Loudness in a room acoustical context should not be confused with loudness in a signal context. For signals like speech or music, loudness is most often expressed in sone:

L_S = 2^{(L_N - 40)/10} \ \mathrm{[sone]}    (2.23)

where L_N is the loudness level in phon. L_N is based on equal loudness curves: a tone with a certain frequency and with level L_N phon has the same perceived loudness as a tone with a frequency of 1 kHz and a sound pressure level of L_N dB. Equation 2.23 is defined for L_N > 40 phon. For time-varying sounds, the perception of signal level is a complex topic. Various models exist to calculate loudness for time-varying signals; see for example [Rennies et al., 2010]. A widely used algorithm for estimating the loudness of arbitrary signals is the ReplayGain algorithm [Robinson, 2001]. This algorithm is, for example, used to normalize the loudness of digital audio files (like MP3s) to avoid big differences in the perceived level of played files on media players.
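A small worked example of Eq. (2.23), assuming nothing beyond the formula itself:

```python
def phon_to_sone(L_N):
    """Eq. (2.23): loudness in sone from loudness level in phon (valid for L_N > 40)."""
    return 2 ** ((L_N - 40) / 10)

# 40 phon corresponds to 1 sone; every +10 phon doubles the loudness
for L_N in (40, 50, 60, 70):
    print(L_N, "phon ->", phon_to_sone(L_N), "sone")   # 1, 2, 4, 8 sone
```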

2.3.5 Apparent Source Width

Due to early reflections from lateral directions, decorrelation of the sound field between both ears might occur. This results in the effect of a sound source being perceived as wider than the visual, physical size of the source [Keet, 1968]. This broadening effect is expressed as the Apparent Source Width (ASW); it is therefore a measure for spaciousness. There are two commonly used parameters which are related to ASW. The first one is the (early) lateral energy fraction (LF):

\mathrm{LF} = \frac{\int_{5\,\mathrm{ms}}^{80\,\mathrm{ms}} h_{f8}^2(t)\,dt}{\int_{0}^{80\,\mathrm{ms}} h_{SR}^2(t)\,dt}    (2.24)

with h_f8(t) the impulse response as measured using a figure-of-eight microphone with its null pointed toward the source. LF has values in the range 0 to 1, where higher values indicate more lateral energy and thus a broader-sounding source. In concert halls for classical music, values in the range 0.05 to 0.35 are generally found. This parameter should be averaged over the 125 to 1000 Hz octave bands and its JND is assumed to be 0.075 [ISO, 2009].

A second parameter believed to be related to ASW is one minus the Early Interaural Cross-Correlation, IACC_E:

1 - \mathrm{IACC_E} = 1 - \max_{\tau} \left| \frac{\int_0^{80\,\mathrm{ms}} h_L(t)\, h_R(t+\tau)\,dt}{\sqrt{\int_0^{80\,\mathrm{ms}} h_L^2(t)\,dt \int_0^{80\,\mathrm{ms}} h_R^2(t)\,dt}} \right|    (2.25)

where the time shift τ is varied between −1 and +1 ms. h_L(t) and h_R(t) are the impulse responses for the left and right ear respectively, as measured using an artificial head. IACC_E is a measure for the decorrelation between both ears in the early part of the impulse response. The JND for IACC is assumed to be 0.075 [ISO, 2009], although different values can be found in the literature (see for example [Blau, 2002]). The relevant frequency bands and the reliability of IACC are currently topics of discussion [Witew et al., 2010b]. According to Beranek, averaging over the 500, 1000 and 2000 Hz bands yields the best results [Beranek, 1996]. In concert hall acoustics, it is generally believed that ASW should preferably be as high as possible [Beranek, 1996].

2.3.6 Listener Envelopment

The effect of feeling inside and surrounded by the reverberant sound of the room is called Listener Envelopment (LEV). Like ASW, LEV is an aspect of spaciousness. It has been found to be mainly related to late arriving lateral energy [Bradley and Soulodre, 1995]. ISO 3382-1 mentions two predictors for envelopment. The first one is the late lateral sound level GLL:

\mathrm{GLL} = 10 \log_{10} \left( \frac{\int_{80\,\mathrm{ms}}^{\infty} h_{f8}^2(t)\,dt}{\int_0^{\infty} h_{10}^2(t)\,dt} \right) \ \mathrm{[dB]}    (2.26)

GLL is generally averaged over the 125 to 1000 Hz octave bands, with typical values in the range of −14 to +1 dB. Its JND is as yet unknown. Contrary to the other parameters specified in ISO 3382-1, GLL should be averaged over the four octave bands as follows:

\mathrm{GLL_{avg}} = 10 \log_{10} \left( \frac{1}{4} \sum_{n=1}^{4} 10^{\mathrm{GLL}_n / 10} \right) \ \mathrm{[dB]}    (2.27)
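Both interaural cross-correlation measures (the early variant of Eq. 2.25 and the late variant introduced below) share the same computation, differing only in the integration window. The sketch below is one possible implementation, not the thesis code; the binaural impulse responses and the sampling rate are assumed inputs.

```python
import numpy as np

def one_minus_iacc(h_left, h_right, fs, t_start, t_end=None, max_shift_ms=1.0):
    """1 - IACC over [t_start, t_end) seconds, with |tau| <= max_shift_ms."""
    i0 = int(t_start * fs)
    i1 = len(h_left) if t_end is None else int(t_end * fs)
    hl, hr = h_left[i0:i1], h_right[i0:i1]
    norm = np.sqrt(np.sum(hl ** 2) * np.sum(hr ** 2))
    max_lag = int(max_shift_ms * 1e-3 * fs)
    iacf = []
    for lag in range(-max_lag, max_lag + 1):        # interaural cross-correlation
        if lag >= 0:
            num = np.sum(hl[:len(hl) - lag] * hr[lag:])
        else:
            num = np.sum(hl[-lag:] * hr[:len(hr) + lag])
        iacf.append(num / norm)
    return 1.0 - np.max(np.abs(iacf))

# Usage with hypothetical binaural room impulse responses h_L, h_R:
# early = one_minus_iacc(h_L, h_R, fs, 0.0, 0.080)   # 1 - IACC_E, Eq. (2.25)
# late  = one_minus_iacc(h_L, h_R, fs, 0.080)        # 1 - IACC_L, Eq. (2.28)
```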

Another common measure for LEV is one minus the Late Interaural Cross-Correlation, IACC_L:

1 - \mathrm{IACC_L} = 1 - \max_{\tau} \left| \frac{\int_{80\,\mathrm{ms}}^{\infty} h_L(t)\, h_R(t+\tau)\,dt}{\sqrt{\int_{80\,\mathrm{ms}}^{\infty} h_L^2(t)\,dt \int_{80\,\mathrm{ms}}^{\infty} h_R^2(t)\,dt}} \right|    (2.28)

with −1 ms < τ < +1 ms. Like the Early Interaural Cross-Correlation IACC_E, the Late Interaural Cross-Correlation has values in the range 0 to 1, with a reported JND of 0.075 [ISO, 2009].

2.3.7 Brilliance

A sound is called brilliant if it is bright, clear and rich in harmonics. The treble frequencies should be prominent and decay slowly [Beranek, 1996]. Brilliance is therefore related to the perceptual attribute timbre. It can be evaluated by calculating the Treble Ratio (TR) [Rossing, 2007]:

\mathrm{TR} = \frac{RT_{2000} + RT_{4000}}{RT_{500} + RT_{1000}}    (2.29)

where RT_f is the measured reverberation time in the octave band with centre frequency f (in Hz). The treble ratio is not often mentioned in the literature and not much is known about typical values. Bradley reports TR values for three classical concert halls: the Concertgebouw in Amsterdam, the Vienna Grosser Musikvereinssaal and the Boston Symphony Hall [Bradley, 1991]. In [Bradley, 1991], the values for TR are in the range of 0.77 (Vienna) to 0.95 (Boston). The JND for the treble ratio is unknown.

2.3.8 Warmth

In a musical context, warmth is defined as liveness of the bass, or fullness of the bass tones relative to the mid-frequency tones. Like brilliance, it is related to timbre. A common predictor for warmth is the Bass Ratio (BR) [Beranek, 1996; Rossing, 2007]:

\mathrm{BR} = \frac{RT_{125} + RT_{250}}{RT_{500} + RT_{1000}}    (2.30)

In concert halls and opera houses, values for BR are typically in the range 0.9 to 1.5 [Beranek, 1996]; its JND is unknown. Note that both the Treble Ratio TR and the Bass Ratio BR are not specified in ISO 3382-1.
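A small sketch of Eqs. (2.29) and (2.30); the octave-band reverberation times below are made-up values used purely for illustration.

```python
# Hypothetical octave-band reverberation times in seconds (illustrative only)
RT = {125: 2.2, 250: 2.1, 500: 2.0, 1000: 1.9, 2000: 1.7, 4000: 1.4}

TR = (RT[2000] + RT[4000]) / (RT[500] + RT[1000])   # Eq. (2.29), treble ratio
BR = (RT[125] + RT[250]) / (RT[500] + RT[1000])     # Eq. (2.30), bass ratio
print(f"TR = {TR:.2f}, BR = {BR:.2f}")              # TR = 0.79, BR = 1.10
```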

2.3.9 Speech intelligibility

In rooms where speech is important, like classrooms, lecture halls, theaters and airports, the intelligibility of speech should be high. Different measures for speech intelligibility exist, of which the Speech Transmission Index (STI) is the most well-known. It is defined in IEC standard 60268-16 [IEC, 2003] and was developed by Steeneken and Houtgast [Steeneken and Houtgast, 1980]. The STI predicts the speech intelligibility by evaluating how the transmission chain (a microphone, a loudspeaker, the room, etc.) affects the modulation depth of the speech signal. The result is a value between 0 (completely unintelligible) and 1 (perfect intelligibility). For speech purposes, intelligibility is obviously a very important aspect and, therefore, it is mentioned in this section. However, since the focus of this project is on the perception of acoustics for different kinds of stimuli, an in-depth discussion of the STI is outside the scope of this thesis and the reader is referred to the above references for detailed information.

2.3.10 Support

During musical performances, it is important for musicians to be able to hear each other sufficiently and to get response from the room. This effect is called support, and it was first investigated by Gade in two papers published in Acustica [Gade, 1989a,b]. In pop music situations, musicians often hear each other through separate monitor loudspeakers close by. In orchestras this is almost never the case, and reflections from the room itself can help to provide sufficient support. Today, two different parameters are used in practice: the Early Support ST_early and the Late Support ST_late. ST_early is defined in ISO 3382-1 as:

\mathrm{ST_{early}} = 10 \log_{10} \left( \frac{\int_{20\,\mathrm{ms}}^{100\,\mathrm{ms}} h_{SR}^2(t)\,dt}{\int_0^{10\,\mathrm{ms}} h_{SR}^2(t)\,dt} \right) \ \mathrm{[dB]}    (2.31)

Likewise, ST_late is given by:

\mathrm{ST_{late}} = 10 \log_{10} \left( \frac{\int_{100\,\mathrm{ms}}^{1000\,\mathrm{ms}} h_{SR}^2(t)\,dt}{\int_0^{10\,\mathrm{ms}} h_{SR}^2(t)\,dt} \right) \ \mathrm{[dB]}    (2.32)

Both parameters should be arithmetically averaged over the 250 to 2000 Hz octave bands [ISO, 2009]. ST_early is relevant for ensemble playing: it determines how well the musicians in an orchestra can hear each other, and its values are typically in the range of −24 to −8 dB [ISO, 2009]. ST_late predicts the amount of response from the room that is experienced by the musicians, with typical values in the range of −24 to −10 dB.

For the computation of support, the impulse response h_SR(t) should be measured on the orchestra platform, with the source and microphone close together.

2.4 Influence of digital filtering

The (digital) filters that are used to filter the impulse response when evaluating the acoustical parameters in different frequency bands should be chosen very carefully, since the temporal and spectral properties of a filter may influence the results significantly, as shown in [Huszty et al., 2008]. ISO standard 3382 states that for reliable results, the following condition should be fulfilled:

B \, T_{60} > 16    (2.33)

with B the bandwidth of the filter (in Hz). Furthermore, the quality of a filter is defined in ANSI standard S1.11 [ANSI, 2004]. In [Huszty et al., 2008] it is recommended to use filters of the highest quality class possible (class 0) for measuring room acoustical parameters. It can be shown that for filtering in octave bands, 3rd-order Butterworth IIR filters satisfy this criterion for all frequency bands of interest. To compensate for the filter delay, ISO 3382 proposes to time-window the broadband impulse response by determining the arrival of the direct sound. This is the time where the signal first rises significantly above the background noise, but is still more than 20 dB below the maximum. The early and late components of the response should be filtered separately, and the integration periods should be increased to account for the filter delay [ISO, 2009].
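As an illustration of the octave-band filtering discussed above, the sketch below designs a 3rd-order Butterworth band-pass filter with SciPy. The band edges at f_c/√2 and f_c·√2 and the causal sosfilt call are choices of this sketch; whether such a filter meets a given ANSI S1.11 class would still have to be verified.

```python
import numpy as np
from scipy.signal import butter, sosfilt

def octave_band_filter(x, fc, fs, order=3):
    """Band-pass filter x in the octave band centred at fc (Hz)."""
    lo, hi = fc / np.sqrt(2), fc * np.sqrt(2)       # nominal octave band edges
    sos = butter(order, [lo, hi], btype="bandpass", fs=fs, output="sos")
    return sosfilt(sos, x)                          # causal IIR filtering

# Example: filter an impulse response h into the 1 kHz octave band
# (h and fs are assumed to be available, e.g. from a sweep measurement)
# h_1k = octave_band_filter(h, fc=1000.0, fs=fs)
```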

2.5 Shortcomings of current methods

Despite the fact that a list of well-established room acoustical attributes and related objective parameters now exists (and partly even is specified in an ISO standard), it is also recognized that there are some issues with the current methods for assessing the acoustical qualities of a room. These shortcomings will be discussed in the next sections, followed by attempts reported in the literature to overcome them.

2.5.1 Spatial fluctuations

First, it has been found that some objective parameters may fluctuate significantly (i.e. more than one JND) when measured at small spatial intervals, while the perceptual attributes remain constant. For example, Nielsen et al. [1998] reported variations in C_80 up to 3.2 dB within a single seat. This value is much higher than the JND for C_80 (1 dB). De Vries et al. [De Vries et al., 2001] examined the variability in measures for spaciousness by measuring impulse responses in the Concertgebouw in Amsterdam using a linear microphone array with a distance of Δx = 5 cm between microphone positions. Figure 2.7 shows these parameters (LF and 1-IACC_E), calculated by the present author from the measurements by De Vries et al., together with 1-IACC_L. The values for the monaural parameters EDT, T_20 and C_80 are shown in Fig. 2.8. As can be seen in the figures, for some parameters the spatial fluctuations are quite large. To evaluate how large these variations are in a perceptual sense, the maximum variation observed within 50 cm intervals is calculated. This interval length is chosen because it is comparable to the width of one seat. The results are shown in Figs. 2.9 and 2.10. Table 2.3 shows the fraction of measurement positions where the observed variation exceeds one JND for each of the six parameters. The table shows that for all parameters (except for 1-IACC_L) variations are observed that are larger than one JND at more than 1/4 of the measurement positions along the microphone array. As De Vries et al. stated in their paper (with respect to the variations in measures for spaciousness), this does not agree with the fact that the perceptual attributes, for which these objective parameters should be predictors, remain constant over these small intervals.

Table 2.3: The fraction of measurement positions at which variations observed within a 50 cm interval exceed one JND, for the measurement carried out in the Concertgebouw.
LF: 43%; (1-IACC_E): 67%; (1-IACC_L): 12%; EDT: 95%; T_20: 32%; C_80: 28%

De Vries et al. explain that the fluctuations in LF and IACC_E are due to local wave interferences to which the human auditory system apparently is insensitive. These interferences may also be the cause of the fluctuations observed for the other parameters. It is known that EDT fluctuates more within a room than the reverberation time, because EDT is calculated from the very early part of the impulse response, which varies more when the measurement position is slightly shifted than later parts of the impulse response [Rossing, 2007]. The fluctuations that occur make it difficult, if not impossible, to evaluate the acoustical qualities of (a certain part of) a room based on a single source and receiver location alone. In ISO 3382-1, variations throughout the room are taken into account by recommending to perform measurements for multiple source and microphone positions. This does, however, not account for the large fluctuations over short intervals.
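The analysis summarized in Table 2.3 can be sketched as follows: for every position along the array, take the largest difference of a parameter within a 50 cm window around that position and count how often it exceeds one JND. This is a reconstruction of the described procedure, not the original analysis code; the centred window and its handling at the array ends are assumptions.

```python
import numpy as np

def fraction_exceeding_jnd(values, spacing_m, jnd, window_m=0.5):
    """Fraction of positions where the max difference of `values` within a
    window of `window_m` metres around the position exceeds one JND.
    `values` holds a parameter (e.g. C80) sampled along a line array with
    `spacing_m` metres between microphones."""
    values = np.asarray(values, dtype=float)
    half = int(round(window_m / 2 / spacing_m))      # half-window in positions
    exceed = 0
    for i in range(len(values)):
        window = values[max(0, i - half): i + half + 1]
        if np.max(window) - np.min(window) > jnd:
            exceed += 1
    return exceed / len(values)

# Hypothetical usage: C80 sampled every 5 cm along the array, JND = 1 dB
# frac = fraction_exceeding_jnd(c80_values, spacing_m=0.05, jnd=1.0)
```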

Figure 2.7: The room acoustical parameters LF, (1-IACC_E) and (1-IACC_L) as a function of measurement location, obtained from impulse response measurements carried out in the Concertgebouw in Amsterdam using a linear microphone array. The parameters are computed in compliance with ISO standard 3382-1. (a) Lateral energy fraction LF (averaged over the 125 to 1000 Hz bands); (b) (1-IACC_E) (averaged over the 500, 1000 and 2000 Hz bands); (c) (1-IACC_L) (averaged over the 500, 1000 and 2000 Hz bands).

Figure 2.8: The room acoustical parameters EDT, T_20 and C_80 as a function of measurement location, obtained from impulse response measurements carried out in the Concertgebouw in Amsterdam using a linear microphone array. The parameters are computed in compliance with ISO standard 3382-1. (a) Early Decay Time EDT (averaged over the 500 and 1000 Hz bands); (b) reverberation time T_20 (averaged over the 500 and 1000 Hz bands); (c) clarity index C_80 (averaged over the 500 and 1000 Hz bands).

Figure 2.9: The maximum differences observed within 50 cm intervals for the room acoustical parameters LF, (1-IACC_E) and (1-IACC_L) as a function of measurement location, obtained from impulse response measurements carried out in the Concertgebouw in Amsterdam. The dashed lines denote the Just Noticeable Difference (JND) for each parameter.

Figure 2.10: The maximum differences observed within 50 cm intervals for the room acoustical parameters EDT, T_20 and C_80 as a function of measurement location, obtained from impulse response measurements carried out in the Concertgebouw in Amsterdam. The dashed lines denote the Just Noticeable Difference (JND) for each parameter.

From Figs. 2.7 and 2.9 it is clear that for the parameters related to spaciousness the largest variations are encountered near the centre of the room. This is why Okano et al. recommend to measure parameters related to ASW at least one meter from the centre line [Okano et al., 1998]. However, it can be shown, using the data from De Vries et al., that even when this criterion is fulfilled, fluctuations are observed within one seat that are (much) larger than one JND; see Figs. 2.9 and 2.10.

2.5.2 Measuring in empty halls

A second problem with the current objective parameters is the fact that they are mostly measured in empty rooms, i.e. in the absence of an audience. However, the occupancy of a room can have a large effect on room acoustical parameters [Beranek, 1996]. Nevertheless, measuring in empty halls is common practice because of the methods involved: equipment has to be positioned on the stage, which is often not practical in a concert situation, and the type of stimuli that is necessary for the measurement process (noise, sine waves, pulses) is unpleasant for people to listen to. When a room is occupied, the sound absorption will generally be higher. This means that the reverberation time will usually be overestimated when measured in an empty hall, and parameters related to clarity will be different. Objective parameters related to spaciousness are less affected by occupancy [Barron, 2005]. Methods to correct the reverberation time for occupancy have been proposed by Hidaka et al. [2001] and Bradley [1991], amongst others. Although the approaches are different, they yield similar results in practice [Barron, 2005]. According to Barron, in most concert halls the seats are well upholstered, and therefore the occupancy does not have a very large effect, leading to accurate corrections for the acoustical parameters. Not all rooms have these kinds of seats, however, and measuring in the occupied state will be necessary in those cases.

2.5.3 The influence of signal properties

As discussed in Section 2.3, the established objective room acoustical parameters are determined using impulse responses. It has been shown, however, that the perception of various acoustical attributes may greatly depend on the signal properties of the sound source(s), something that is not taken into account when only impulse responses are considered [Kahle and Jullien, 1995; Lokki et al., 21]. For example, during a long note from a musical instrument like a pipe organ or cello, the direct sound from the source may perceptually mask the reflections in the room. This makes it difficult to perceive any reverberance at all. On the other hand, when percussive instruments are playing, the room reflections are easily heard in the pauses between hits. It is obvious that the reverberation time alone will not be able to predict these two effects, since it is a constant depending on the impulse response(s) of the room only. To overcome this, sometimes a distinction is made between running reverberance (heard while the music is playing) and stopped reverberance (heard after the music has stopped playing) [Griesinger, 1992b,a]. Stopped reverberance is related to the reverberation time. Running reverberance is in practice the most important

of the two and is believed to be more related to the Early Decay Time EDT. This is because pauses between notes are often short, and therefore the earliest part of the room impulse response will be the most important. However, as Griesinger points out, for running reverberance the absolute level of the reverberance is even more important than the length of the decay [Griesinger, 1999, 1992b].

Other attributes, like ASW and LEV, also depend on the spectral and temporal properties of the source. Literature on this subject includes [Griesinger, 1999] and [Mason et al., 2005]. In order to get a feeling of spaciousness, interaural fluctuations have to be perceived. As shown in [Griesinger, 1999], these fluctuations arise only above a minimum time delay between the signals arriving at the left and right ear, and this minimum delay depends on, for example, the bandwidth of the signal.

2.5.4 Properties of the human auditory system

A final problem of the established methods for deriving objective parameters related to room acoustics is that they are very limited in terms of which aspects of the human auditory system are taken into account. Parameters related to spaciousness and/or based on energy fractions use time limits to mimic the integration properties of the human ear, and the varying sensitivity to different frequencies is taken into account by averaging within certain frequency limits. Other major properties of the auditory system, like non-linearities, spectral and temporal masking effects, phase-locking, filtering in critical bands, binaural interaction, etc., are hardly, or not at all, taken into account. This inevitably leads to a strong simplification of perception and will affect the accuracy of the predictors.

2.5.5 The perception of loudness

In Section 2.3.4, it was already discussed that loudness is a complex topic, and this is especially true in a room acoustical context. For the evaluation of environmental noise it is generally well accepted as a rule of thumb that an increase of 10 dB in the sound pressure level corresponds to a perception of the sound being twice as loud. However, as shown by De Vries and Han [1983], even a difference in SPL between two rooms as small as 4 dB can result in very different perceptual judgments regarding the loudness: from "much too weak" to "full orchestra very loud and harsh".

2.6 Revised objective parameters

As a result of the shortcomings of the established objective parameters and their measurement methods, reports can be found in the literature of situations in which the parameters show very poor correlation with perceptual results. Examples are

[Barron, 1988], [Farina, 2001], [Hess et al., 2003] and [Lokki et al., 2010]. This means that, apparently, the existing objective parameters are sometimes poor predictors of their perceptual counterparts, and that in these cases it is impossible to give an accurate description of the acoustical qualities of a room using the established methods. This is why various revised parameters can be found in the literature. Most of this research concerns parameters related to spaciousness. Some of these corrections will be discussed below.

2.6.1 Solving problems with spatial fluctuations

In the paper by De Vries et al. [2001], an attempt is made to overcome the problems with parameters related to ASW caused by wave interference, by applying post-processing using beamforming techniques before calculating lateral energy fractions. It is shown that the resulting predictor for ASW (a modified version of LF) suffers less from spatial fluctuations than LF and IACC_E.

Recently, Witew et al. [2010a] proposed a method to derive the minimum number of measurement positions needed to describe the acoustics of auditoria. Using the Guide to the expression of Uncertainty in Measurement (GUM), they analyze the changes in certain parameters due to microphone displacements. By relating these changes to the known JNDs for these parameters, a conclusion can be drawn about the degree of detail that is required in order to sufficiently characterize the measurement position in a room (seat, row, audience area, etc.). In [Witew et al., 2010a], this method is applied to determine the required accuracy for the clarity index C80, when measured in the Concertgebouw in Amsterdam. The resulting minimum accuracy is 0.3 m, when the JND value of 1 dB (as stated in ISO 3382-1) is considered. This implies, for example, that a C80 value measured in a single seat is not representative for a complete row.

2.6.2 Improved measures related to spaciousness

It has been found that ASW does not depend on the decorrelation between both ears or on lateral energy alone, but also on the absolute sound level. Evidence for this level dependence was already found by Barron and Marshall [1981]. Generally, ASW seems to increase at higher sound levels. Okano et al. found that especially the low-frequency components of the sound are important for ASW [Okano et al., 1998]. They proposed to add the sound strength for low frequencies (< 500 Hz), G_low, into the equation.

Besides ASW, LEV also appears to be related not only to the decorrelation between both ears or to the amount of lateral energy. Recently, it has been found that late-arriving energy from vertical directions and from behind also contributes to LEV [Furuya et al., 2005; Soulodre et al., 2003a]. Soulodre et al. proposed a new measure for LEV,

which has recently been turned into a more practical form by Beranek [2008]:

    LEV_calc = 0.5 · G_late,mid + 10 log10(1 − IACC_L,mid)  [dB],      (2.34)

where:

    G_late,mid = G_mid − 10 log10(1 + 10^(C80/10))  [dB].      (2.35)

In Eqs. 2.34 and 2.35, "mid" means the average over the 500 and 1000 Hz octave bands.

Another discussion around measures related to spaciousness concerns the time limit that separates the impulse response into an early part (contributing to ASW) and a late part (related to LEV). As explained in Section 2.3, ISO 3382-1 sets this limit to 80 ms, but other values have been proposed, for example 105 ms [Soulodre et al., 2003b] and 150 ms [Griesinger, 1999]. This probably has to do with temporal masking effects in the human auditory system; depending on the loudness of the early part it takes time for the neurons to relax. Therefore the time limit may shift, depending on the signal properties and the absolute loudness level (more on this will be discussed in Chapter 3). Morimoto [2002] states that a fixed time limit is practical, but not efficient. He proposes to take the precedence effect (or law of the first wave front) into account. Reflections with an energy below the curve of this law contribute to ASW and reflections above this curve contribute to LEV. Recent research on parameters related to LEV is summarized in [Nyberg and Berg, 2008].

2.6.3 Other improved parameters

Besides revised parameters related to spaciousness, attempts can also be found in the literature to improve parameters related to reverberance and clarity, for example. It is outside the scope of this thesis to discuss all these attempts; some suggestions for improvement and open questions can be found in [Bradley, 2011]. All of the corrections above may solve some of the problems that currently exist, but they are still impulse response-based. This means that particular properties of the source signal are still not taken into account and measurements will still mostly take place in unoccupied rooms. Finally, some known aspects of the human auditory system, like non-linearity and phase-locking, are still not considered.

2.7 Summary and discussion

A lot of research has been carried out in the field of room acoustics to arrive at a list of important room acoustical attributes and objective parameters related to these attributes. In this chapter these attributes and parameters have been discussed,

but it has also been explained that they have some major shortcomings. Some parameters are very dependent on the measurement position and may fluctuate severely when measured at closely spaced microphone positions. Furthermore, measurements mostly have to be carried out in unoccupied rooms, while the presence of an audience has a significant influence on some parameters. Also, properties of the source signal are not taken into account, although it is known that various types of signals lead to a different perception of the acoustics. Finally, some major aspects of the human auditory system are not considered in the current methods. A list of revised parameters can be found in the literature, which solve some - but not all - of these shortcomings. Therefore, a different approach is needed. In the next chapter, a new method will be introduced with which room acoustics can be evaluated objectively. This new method is based on psycho-acoustic principles and works with arbitrary binaural input signals instead of impulse responses. It is, therefore, able to solve all of the problems with the current methods as discussed in this chapter.

3 The AMARA method

Science is nothing but perception. (Plato, 427 - 347 BC)

In this chapter, the AMARA method is presented. An auditory model will be introduced that is able to process arbitrary binaural audio signals. The chapter starts with a brief overview of psycho-acoustics and the human auditory system. The advantages of using auditory modelling in room acoustics are discussed, followed by an overview of previous work in this direction by other authors. Next, the binaural model is presented, including some examples of masking thresholds as predicted by the model under various masking conditions. The effect of different room acoustical environments on the model outputs is also examined. From these model outputs, objective parameters are obtained that are related to various room acoustical perceptual attributes; these are discussed in the last part of this chapter.

3.1 Psycho-acoustics

Psycho-acoustics is the psychological or behavioural study of hearing [Plack, 2005]. In this field, the responses of subjects to certain sounds are investigated, and from the results an attempt is made to describe auditory events quantitatively. Often, a model is developed using the data of listening tests, which is able to predict the human responses. Such a model will then be capable of acting as an artificial listener. This way, sound can be judged objectively on certain aspects, without the need to bring a large group of people into the laboratory.

Psycho-acoustical experiments are most often carried out under well-controlled conditions, using headphones, trained subjects, carefully selected stimuli, etcetera. Because of these controlled conditions, particular aspects of the human auditory system can be investigated in detail, although the limitations of psycho-acoustical methods

always have to be taken into account. For example, psycho-acoustic attributes are dependent on the context in which they are measured and, therefore, the perceptual results will never be unbiased [Blauert and Guski, 2009]. Also, perception is multi-modal, as was discussed in Chapter 1. This sometimes makes it difficult to apply pure psycho-acoustic results to real-life situations [Guski and Blauert, 2009].

Nevertheless, psycho-acoustics has helped a lot over the years to increase knowledge of the human auditory system and perception. This knowledge can be used to improve audio coding algorithms, for example. Due to spectral and temporal masking effects, certain levels of quantization noise in digital audio may be inaudible, often making it possible to quantize a signal with fewer bits than usual (conventionally, digital audio for Hi-Fi applications is quantized at 16 or 24 bits per sample). This principle is utilized in audio compression algorithms like the famous MP3 (MPEG-1 Audio Layer 3) [ISO/IEC, 1993] and, more recently, MPEG-4 AAC (Advanced Audio Coding) [ISO/IEC, 2009].

3.2 The human auditory system

Figure 3.1: A schematic version of the anatomy of the human auditory system (original graphics: C. L. Brockmann).

In this section, the anatomy of the human auditory system and the way sound is translated into signals towards the brain are briefly discussed. The most important stages of the system will be explained. Figure 3.1 shows the anatomy of the periphery of the human auditory system. Sound waves enter the auditory system through the pinna, which is shaped in such a way that the sound is spectrally modified depending on the direction of incidence of the sound. Besides the pinna, also other parts of the human head and torso colour

the sound, improving the ability of humans to localize a source. From the pinna, sound is led through the ear canal towards the eardrum. The ear canal is shaped like a tube of approximately 2.5 cm length and therefore has resonant properties, effectively acting as a band-pass filter. The eardrum is a thin membrane which moves in response to pressure changes in the ear canal. Three little bones are attached to this membrane: the malleus, incus and stapes. These bones transfer the pressure changes in the ear canal to pressure changes in the cochlea through a small area called the oval window (see Fig. 3.1). Because this window is small compared to the eardrum, the pressure is amplified by approximately a factor of 20 [Plack, 2005].

The cochlea is a thin, fluid-filled tube about 3.5 cm long with an average diameter of about 2 mm (it is thicker at the base, near the oval window, and its width decreases towards the apex). It looks like a snail shell because it is coiled up. Inside the cochlea, a membrane is located which is called the basilar membrane. In response to pressure changes in the fluid inside the cochlea, this membrane will start to vibrate. The width of the basilar membrane varies along its length and increases towards the apex. Because of this, the local mass density of the basilar membrane also increases towards the end of the cochlea, with the result that each part of the membrane is most sensitive to a certain frequency. At the base, the membrane responds to higher frequencies, while low frequencies excite parts of the membrane located near the apex. A sine wave with a certain frequency will excite an area of the basilar membrane with a finite width. Therefore, the basilar membrane can be thought of as a series of auditory bandpass filters with a certain bandwidth, called critical bands.

The movements of the basilar membrane are translated into signals by the inner hair cells, which sit between the basilar membrane and a second membrane: the tectorial membrane. When the membranes move, tiny hair-like structures attached to the inner hair cells will move sideways relative to each other. Displacements of these stereocilia cause the inner hair cell to release chemical neurotransmitter, resulting in electrical activity (neural spikes) in the neurons attached to the hair cell. Bigger displacements of the basilar membrane lead to more neurotransmitter being released, resulting in more electrical activity. Note that the hair cells only react to displacements in the upward direction, i.e. no neurotransmitter is released when the basilar membrane moves towards the center of the cochlea. This way, the neural firing is locked to a particular phase of the basilar membrane motion (phase locking). As a result, the timing of the neural firing is related to the period of the input signal.

Finally, the signals are transmitted from the neurons through the auditory nerve, a bundle of nerve fibres. This nerve comprises some 30,000 fibres, of which the majority are attached to the inner hair cells [Plack, 2005]. To each hair cell, approximately 20 nerve fibres are attached. Since each hair cell is located at a different position on the basilar membrane, the fibres are tuned to different frequencies.

53 5 The AMARA method Even when no sound enters the ear, most nerve fibers show a background level of spikes, called spontaneous activity. When a sound with a constant sound level starts, the neurons show a sudden peak in activity after which the activity decays to a steady-state level. Also, when the sound stops, a level of activity below the level of spontaneous activity can be observed for some time. This behaviour in response to onsets and offsets 1 is called adaptation. Higher sound levels result in higher steadystate levels of the neural activity. However, there is a maximum for the amount of neural activity; at very high sound levels the neural activity starts to saturate. In Fig. 3.1, a second set of nerve fibers is shown: the vestibular nerve. These nerve fibers are used to send positional information from the semicircular canals. The semi-circular canals serve as the body s balance organ. The auditory nerve and vestibular nerve join together to form the vestibulocochlear nerve, which transmits all signals from the auditory system to the brain. There, the signals are processed in various ways, finally leading to the perception of sound. 3.3 Binaural hearing For the localization of sound events the human auditory system makes use of various cues. Two of these cues are Interaural Time Differences (ITDs) and Interaural Level Differences (ILDs). When a plane sound wave arrives at a certain azimuth angle φ relative to the median plane, there will be a difference in the arrival times of the wave between the left and right ear, see Fig Also, due to shadowing effects of the head a difference in intensity may be experienced. By evaluating these cues across frequencies, the human auditory system is able to estimate where the sound source is located [Blauert, 1996]. When a source is located right in front or behind the listener, there will be no ILD or ITD cues since the arrival times for the left and right ear are exactly equal in that case. Still, humans are able to distinguish between frontal and backward directions by evaluating the colouration effects that result from reflections inside the pinna. This colouration is heavily dependend on the shape of the pinna and, therefore, differs from person to person [Blauert, 1996]. When a source moves out of the horizontal plane (i.e. the elevation angle differs from zero), the elevation angle is also estimated by the auditory system by evaluating the colouration effects caused by the pinna, together with colouration as a result of shoulder reflections [Blauert, 1996]. 1 In the field of psycho-acoustics, the ending of a stimulus or signal component is generally called an offset.

54 3.4 Applying auditory modelling in room acoustics 51 ΔL φ Figure 3.2: A plane wave arriving at the ears from a certain azimuth angle φ will generate ITD and ILD cues because of a difference in the lengths of the paths to both ears and because of head shadowing effects. 3.4 Applying auditory modelling in room acoustics The advantages of auditory modelling As was mentioned in Section 3.1, mathematical models exist that simulate the human auditory system. These models vary from very complex, attempting to mimic each part of the auditory system in detail, to basic, simulating only some parts and/or using rough approximations of the physical reality. The model complexity needed depends on the application for which the model has been designed. For example, in audio compression algorithms (MP3, AAC, etc.) a more complex model will be able to predict the audibility thresholds for quantization noise very accurately, making it possible to quantize the digital audio in the least number of bits possible. On the other hand, such a model might be too computationally expensive to be implemented in portable devices. In Chapter 2, the problems with the current methods for assessing the acoustical qualities of a room were explained. From this discussion it can be argued that including auditory modelling might solve most of the shortcomings: (1) Depending on the complexity of the model, all aspects of the human auditory system which are known to be important for the perception of room acoustics can be taken into account. The current methods based on impulse responses do not account for some of these aspects, as was discussed in Chapter 2. (2) Mathematical models of the auditory system are able to process arbitrary, reallife audio signals. This means that if such a model is used to make predictions

55 52 The AMARA method about the perception of the acoustics of a room, this analysis can be performed while the properties of the (source) signal are taken into account. It was argued in Chapter 2 that this is important; an accurate and complete description of the acoustics of a room cannot be formed based on impulse responses alone. (3) Using an auditory model, room acoustical measurements can be performed in occupied halls. Since such a model will accept arbitrary signals, there is no need to measure impulse responses using synthetic signals. Instead, a recording can be made of an actual performance, for example, after which the acoustics can be judged based on the particular application (string quartet, opera, symphony orchestra, pop music, drama, etc.) Room acoustical attributes Because of the potential advantages of applying auditory modelling in room acoustics, a (binaural) model will be presented in this thesis, which will be used to derive objective parameters related to room acoustical attributes. In this project it was chosen to focus on the following four attributes (see Section 2.3 for a detailed description of each attribute): (1) Reverberance (2) Clarity (3) Apparent Source Width (ASW) (4) Listener Envelopment (LEV) Conclusions on which attributes are believed to be the most important for the perception of room acoustics vary from author to author, but these four parameters are most commonly mentioned, together with loudness (see for example [Jordan, 1981], [Bradley, 199], [Farina, 21], [Bradley, 25], [Barron, 25] and [Skålevik, 21]). Overall loudness (or sound strength) was not included, because it was preferred to keep the number of attributes down. Furthermore, the perception of sound level is a complex topic that extends far beyond the context of room acoustics. Loudness applies to all kinds of audio signals and because of its complexity it is a topic of ongoing research (see for example [Fastl et al., 29] and [Glasberg and Moore, 21]). Nevertheless, in Chapter 8 it will be discussed how the proposed model can potentially be used to estimate loudness, as a topic for further research Previous work Before the auditory model will be introduced, other cases in the literature on acoustics will be discussed where auditory modelling is applied to assess room acoustical

56 3.4 Applying auditory modelling in room acoustics 53 qualities. In 1995, David Griesinger mentioned a hearing model for predicting the perceived amount of reverberation [Griesinger, 1995]. No model details are given in the paper, but Griesinger explained that it consists of a series of 1/3 octave band filters, followed by level detectors to find onsets and offsets. It was claimed that the model is able to predict the amount of reverberance that is being masked by an anechoic input signal. Lee et al. [21] developed new parameters related to reverberance, based on the perceived decay rate of the room impulse response. The perceived decay rate was calculated using the Dynamic Loudness Model by Chalupper and Fastl [22]. Lee et al. calculate two parameters from the perceived loudness decay: EDT N and T N. EDT N is calculated from a regression line between the peak loudness and half the peak loudness, which is consistent with the rule of thumb that a decay of 1 db corresponds to half the perceived loudness. T N was calculated using a regression line of the loudness decay function over.78 of the peak loudness to.178 of the peak loudness. Lee et al. found that these loudness-based parameters are better predictors for the perceived reverberance than the conventional parameters. The parameters succesfully predict the effect that the perceived amount of reverberance increases when the sound pressure level is higher [Lee et al., 21]. Tapio Lokki and Matti Karjalainen proposed a basic hearing model to visualize room impulse responses in an auditorily motivated way [Lokki and Karjalainen, 2, 22]. In their model, the input signal is first filtered with a frequency-weighting filter, to simulate level sensitivity as a function of frequency. Next, the filtered signal is sent through a Gammatone filter bank (see [Patterson et al., 1992]) to simulate the frequency resolution of the basilar membrane. To mimic the behaviour of the inner hair cells, the absolute values of the Gammatone filter bank outputs are taken. This is followed by compression and a sliding temporal integrator because of the limited time resolution of the auditory system. To visualize the results properly, mapping is applied in the form of de-compressing and logarithmic scaling. Lokki and Karjalainen admit that the model is quite basic, but also claim that it is a useful tool to analyze room impulse responses, since it better respects the frequency and time resolution of human hearing than a one-third octave band spectrum. Their analysis method is purely visual; no parameters are obtained from the model results to analyze room acoustics objectively. Lately, the focus within auditory modelling for room acoustical purposes is on spatial attributes like Apparent Source Width and Listener Envelopment instead of the monaural attributes reverberance and clarity. This probably has two reasons: (1) As discussed in Chapter 2, no consensus has been reached yet on objective parameters that predict the perception of spatial aspects of a sound field accurately. (2) With the recent increase of popularity for multi-channel audio systems (like

57 54 The AMARA method Dolby Surround), the need for objective methods to assess the quality of reproduced, spatial audio is of growing importance. An early paper where auditory modelling is applied to assess spatial qualities in room acoustics was presented by Bilsen [1994]. In this paper a Central Spectrum (CS) model is proposed, which is based on the Jeffress model [Jeffress, 1948]. This model estimates the ITD using so-called tapped delay lines. These delay lines represent delayed versions of the neural spike trains from the left- and right ear. Through coincidence detectors, the interaural time differences are converted into a place of maximum neural excitation. This procedure is repeated for each frequency band separately, performing effectively a kind of discrete cross-correlation within each band [Bilsen, 1994]. From the CS, which represents the amount of neural excitation as a function of frequency and ITD, an objective measure was derived for the Apparent Source Width. This measure was based on the modulation depth in the Central Spectrum, with weighting applied according to the importance of the different frequency bands with respect to the perception of ASW. From listening test results, where the subjects had to evaluate ASW for dichotic white noise signals, it was shown that this measure predicts ASW in rooms quite well. It was also found that for artificial signals the measure based on the CS outperforms IACC [Bilsen, 1994]. When impulse responses measured in real rooms were used, both IACC and the new measure performed well. By now, it is well understood that the perception of spaciousness is primarily related to the fluctuations in ITD and ILD over time, see for example [Blauert and Lindemann, 1986a,b], [Lindemann, 1986a,b], [Griesinger, 1999], [Mason, 22], [Becker, 22] and [Hess, 26]. It is believed that ITD is the dominant cue in this respect, see for example [Wightman and Kistler, 1992] and [Griesinger, 1992b]. Blauert and Lindemann [1986a] proposed a binaural model including bandpass filtering and evaluation of the ITD and ILD as a function of time using a cross-correlation method. For the running cross-correlation calculation, different integration constants were used as a function of frequency band. Finally, the standard deviations for ITD and ILD were used as a predictor for auditory spaciousness (no distinction between ASW and LEV was made). Becker used the concept of evaluating ASW through the fluctuations in ITD, and proposed a binaural model for this purpose [Becker, 22]. In this model, the middleear is simulated using a 3 rd order Bessel low-pass filter with a border frequency of 12 Hz. Furthermore, the model includes a filter bank of 36 Roex filters [Glasberg and Moore, 199]. It also simulates the transduction from mechanical waves to neural pulses using a model as proposed by Meddis [1988]. After applying the model to binaural signals, the interaural time difference ITD as a function of time was determined using two methods: an extended correlation method and a subtraction method. The fluctuations in ITD were used as a prediction for ASW. The results show good correlation with listening test results, where subjects had to eval-

58 3.5 The binaural auditory model 55 uate ASW for white noise stimuli (low-pass filtered at different cutoff frequencies), convolved with measured impulse responses. Unfortunately the results were not compared with conventional measures like IACC. Mason et al. also proposed the usage of a binaural hearing model when evaluating spaciousness in a paper from 24 [Mason et al., 24]. Mason agrees with Griesinger [1997] that it makes sense to evaluate acoustics using real-life signals instead of impulse responses. The binaural model which Mason proposes in [Mason et al., 24] is capable of calculating the cross-correlation in an auditorily motivated way for real-life binaural signals. No results were presented in the paper, but in his PhD thesis Mason showed that measures based on ITD fluctuations show good correlation with listening test results [Mason, 22]. In a later paper, Mason et al. discussed representative signals that can be used for calculating the IACC [Mason et al., 25]. They concluded that for the use of the impulse response, wide-band impulsive signals or continuous tonal signals are insufficient; signals with temporal and spectral properties similar to common musical signals have to be used. At the 125 th AES Convention in San Fransisco in 28, a series of papers was presented on a new framework for evaluating the quality of spatial audio reproduction: [Rumsey et al., 28], [Conetta et al., 28], [Jackson et al., 28] and [Dewhirst et al., 28]. This framework is called QESTRAL: Quality Evaluation of Spatial Transmission and Reproduction using An Artificial Listener. A model has been developed based on the work of Supper [25]. The model includes the division of the input signal into critical bands, envelope smoothing, calculation of Interaural Level Difference ILD and ITD, loudness weighting and the combination of ILD and ITD for source localization. From the model outputs, various metrics are extracted, like intensity, entropy, ILD and ITD standard deviations and more [Jackson et al., 28]. Using a regression model, listening test results are used to combine the various metrics and obtain a measure for overall spatial quality. The model results closely match the listening test results [Dewhirst et al., 28]. In conclusion, other work has been carried out in the field of auditory modelling for assessing (room) acoustics. However, most of these approaches either have focused only on spatial attributes, or only include visual inspection of processed room impulse responses. Therefore, in this thesis, a new method for assessing room acoustics using auditory modelling will be proposed which focuses on four important attributes related to room acoustics and includes the extraction of objective parameters related to these attributes. 3.5 The binaural auditory model This section will discuss the binaural auditory model that will be used in the AMARA method, to derive auditorily motivated objective parameters related to the perception of room acoustics. Based on the shortcomings of the established ob-

jective parameters and the revised parameters discussed in Chapter 2, and based on the previous work by other authors on applying auditory modelling in room acoustics (Section 3.4.3), it can be argued that such a model should at least have the following properties:

(1) Accurate modelling of temporal and spectral masking. Depending on the temporal (for forward masking) and spectral (for simultaneous masking) content of the stimulus, late parts of the impulse response may be masked, affecting perceptual attributes such as envelopment and reverberance.

(2) Taking non-linearity into account. The human auditory system behaves as a non-linear system, which results in some perceptual attributes being dependent on the sound pressure level.

(3) Including binaural interaction. The auditory system assesses spatial aspects of the sound field by analyzing the relation between the two signals arriving at the left and right ear.

A model that features all of the above aspects is the binaural model proposed by Breebaart [Breebaart, 2001; Breebaart et al., 2001a,b,c], which is basically a binaural extension of the monaural model by Dau et al. [1996a,b]. The model has been shown to predict various psycho-acoustic effects accurately, like monaural and binaural masking [Breebaart, 2001] and localization in sound reproduction [Nelson et al., 2008]. Therefore, this model is used as a starting point in this research. First, the monaural model will be discussed, followed by an explanation of the part that models binaural interaction. Where the author has made modifications with respect to the original model version(s), this is mentioned clearly in the text. A more detailed description of the model implementation can be found in Appendix A.

3.5.1 Monaural model

Figure 3.3: A schematic version of the monaural part of the auditory model. All the blocks are discussed in the text.

Figure 3.3 shows a schematic version of the monaural part of the auditory model that will be used in this thesis. The monaural part is based on the auditory model proposed by Dau et al. [1996a,b]. A sound signal enters the model from the left, and its output Ψ is shown on the right side. Below, the various stages of the model are discussed.

Model input
The model accepts the sound pressure of audio signals as input. As explained before, the model is non-linear, meaning that the input should be scaled. For a stationary input signal, an input RMS level of L_RMS dB should correspond to a true SPL of L_RMS dB.

Outer and middle-ear filter
As explained in Section 3.2, the ear canal has resonant properties. Combined with the middle ear, this can be modelled effectively by a time-invariant bandpass filter with cutoff frequencies at 1 kHz and 4 kHz and a roll-off of 6 dB/octave on either side.

Basilar membrane
The cochlea and basilar membrane are most often modelled by a series of (linear or non-linear) bandpass filters to simulate the frequency resolution of the membrane. In this case, the basilar membrane is modelled by a gammatone filterbank [Patterson et al., 1992]. The filter bandwidth corresponds to the equivalent rectangular bandwidth (ERB) of the auditory filters [Glasberg and Moore, 1990]. The filterbank contains two filters per ERB. Note that Breebaart proposed using third-order gammatone filters, but here a fourth-order filterbank is used (as proposed by Jepsen [2006]), since it yields better results.

Inner hair cells
The behaviour of the inner hair cells is simulated by applying half-wave rectification followed by a 5th-order low-pass filter with its cutoff frequency at 770 Hz. For frequencies below this cutoff frequency, the negative phase of the waveform is lost as a result of the half-wave rectification. For high frequencies (approximately above 2 kHz), nearly all phase information is lost and only the envelope of the signal is preserved [Breebaart et al., 2001a].
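To make the peripheral stages described above more concrete, the sketch below implements the outer/middle-ear band-pass and the inner hair-cell stage in Python. It is a simplified illustration, not the thesis implementation: the gammatone filterbank that sits between the two stages is omitted, and Butterworth filters are assumed as stand-ins (a first-order band-pass between 1 kHz and 4 kHz gives roughly 6 dB/octave slopes; a 5th-order low-pass at 770 Hz follows the half-wave rectification).

```python
# Simplified sketch of the outer/middle-ear filter and the inner hair-cell
# stage; filter types are assumptions, the gammatone filterbank is omitted.
import numpy as np
from scipy.signal import butter, lfilter

def outer_middle_ear(x, fs):
    # Band-pass between 1 kHz and 4 kHz with roughly 6 dB/octave roll-off.
    b, a = butter(1, [1000.0, 4000.0], btype="bandpass", fs=fs)
    return lfilter(b, a, x)

def inner_hair_cell(x, fs):
    # Half-wave rectification followed by a 5th-order low-pass at 770 Hz:
    # fine structure survives at low frequencies, only the envelope remains
    # at high frequencies.
    rectified = np.maximum(x, 0.0)
    b, a = butter(5, 770.0, btype="low", fs=fs)
    return lfilter(b, a, rectified)

if __name__ == "__main__":
    fs = 44100
    t = np.arange(fs) / fs
    tone = np.sin(2 * np.pi * 1000.0 * t)   # 1 kHz test tone
    y = inner_hair_cell(outer_middle_ear(tone, fs), fs)
    print(y[:5])
```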

Absolute Threshold of Hearing
To model the absolute threshold of hearing (ATH), thresholding is applied to the outputs of the inner hair cells according to:

    Y'(t, f) = Y(t, f) − ε(f)   if Y(t, f) ≥ ε(f),
    Y'(t, f) = 0                if Y(t, f) < ε(f),      (3.1)

where ε(f) is a frequency-dependent threshold. This thresholding method differs from the method proposed by Breebaart [2001], see Section 3.5.2.

Adaptation
It was explained in Section 3.2 that neurons behave in an adaptive way in response to onsets and offsets present in the input signal. Therefore, Dau et al. [1996a,b] included a chain of five adaptation loops in their model, with time constants of τ = 5, 129, 253, 376 and 500 ms. For stationary signals, the input-output characteristic of the adaptation stage is logarithmic to a good approximation, as shown in [Dau et al., 1996a]. Sudden onsets and offsets will lead to overshoots and undershoots, respectively, which corresponds to the behaviour of the neurons. However, it has been found that the response of the chain of adaptation loops to onsets and offsets is too strong in some cases; therefore, Münkner [1993] proposed to include an overshoot limitation algorithm that compresses the output of each adaptation loop at high levels. In the present research, however, better results are obtained without this overshoot limitation algorithm, and therefore it is omitted. After the adaptation loops, the output is scaled to Model Units (MU) such that a stationary input level of 100 dB leads to an output level of 100 MU, and a stationary input of zero (silence) results in an output level of 0 MU. Due to the different method for simulating the Absolute Threshold of Hearing at low input levels, this scaling method also differs from the one proposed by Breebaart [2001], as will be explained in Appendix A.

Smoothing
Finally, the output signal is smoothed by applying a first-order low-pass filter with its cutoff frequency at 8 Hz [Dau et al., 1996a].

Model output
The monaural model outputs a time-frequency representation Ψ(t, f) of the input signal.
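As an illustration of Eq. 3.1 and of the adaptation stage, the sketch below applies the frequency-dependent threshold and then a chain of divisive adaptation loops. The loop formulation used here (each loop divides its input by a low-pass-filtered copy of its own output) is one common way of writing Dau-type adaptation loops; the exact discretisation, initialisation, overshoot handling and MU scaling of the implementation in Appendix A are not reproduced, so treat this only as a minimal sketch under those assumptions.

```python
# Minimal sketch of Eq. 3.1 plus a chain of divisive adaptation loops.
# Assumes a non-negative input (the output of the inner hair-cell stage).
import numpy as np

def apply_threshold(y, eps):
    """Eq. 3.1: subtract the frequency-dependent threshold and clip at zero."""
    return np.maximum(y - eps, 0.0)

def adaptation_chain(x, fs, taus=(0.005, 0.129, 0.253, 0.376, 0.500)):
    out = np.asarray(x, dtype=float).copy()
    for tau in taus:
        g = 1.0 - np.exp(-1.0 / (tau * fs))    # one-pole low-pass coefficient
        state = max(np.sqrt(out[0]), 1e-5)     # assumed steady-state initialisation
        y = np.empty_like(out)
        for n in range(out.size):
            y[n] = out[n] / state              # divide by the adapted state
            state += g * (y[n] - state)        # let the state track the loop output slowly
        out = y
    return out

# For a stationary input the five-loop chain converges to roughly x**(1/32),
# i.e. a strongly compressive (approximately logarithmic) characteristic;
# onsets and offsets produce overshoots and undershoots, as described above.
```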

3.5.2 Examples of masking threshold predictions

A common experiment in psycho-acoustics is to find audibility thresholds using a so-called three-interval forced-choice (3IFC) procedure. In such a test, a subject is presented with three intervals: one which contains the target signal (e.g. a pure tone) in the presence of a masker (e.g. noise), and two intervals containing the masker alone. The subject has to indicate which interval contains the signal. Based on the subject's answer, the signal level is adjusted up or down. Mostly, the level is adjusted following a two-down one-up rule [Levitt, 1971]: after two subsequent correct answers the signal level is decreased, while the level is increased after one wrong answer. The initial step size with which the level is adjusted is 8 dB. The step size is halved after each second reversal, until a step size of 1 dB is reached. Finally, the test is continued for another eight reversals. The median of the signal levels at these last eight reversals is taken as the audibility threshold. It can be shown that this method estimates the signal level at which the probability of being correct is 70.7% [Levitt, 1971].

The auditory model presented in this chapter can be used as an artificial observer in such a test, meaning that it can be used to predict audibility thresholds in a virtual 3IFC procedure. In this case, the limited resolution of the human ear, as well as other effects like memory effects, need to be simulated. This is carried out by corrupting the model outputs with additive Gaussian noise:

    Ψ^N(t, f) = Ψ(t, f) + N(t, f),      (3.2)

where N(t, f) is a Gaussian noise process with a mean of 0 and a level-independent standard deviation σ_N. The superscript N indicates that Ψ is corrupted by internal noise.

When used in a virtual 3IFC procedure, the model is extended with a decision device. Within each trial, this device compares each of the three intervals with a template, which is defined as:

    Ψ^N_template = Ψ^N_sig+mask − Ψ^N_mask.      (3.3)

Here, Ψ^N_sig+mask is the model output for the target signal plus the masker, where the signal is presented at a sufficiently high level, and Ψ^N_mask is the model output for the masker alone. Therefore, Ψ^N_template can be thought of as the model output for the target signal in the presence of the masker. Since the model is non-linear, this template will not be equal to the model output for the target signal without the presence of the masker:

    Ψ^N_template = (Ψ^N_sig+mask − Ψ^N_mask) ≠ Ψ^N_sig.      (3.4)
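The adaptive two-down one-up track described above translates directly into code. In the sketch below, detect(level) stands for a single simulated 3IFC trial at the given signal level (for example, running the model plus the correlation-based decision device and returning True when the target interval is picked correctly); it is a placeholder rather than part of the thesis implementation, and the starting level is arbitrary.

```python
# Minimal sketch of the two-down one-up staircase: initial step 8 dB, step
# halved after every second reversal down to 1 dB, then eight more reversals;
# the threshold is the median level of those last eight reversals.
import statistics

def two_down_one_up(detect, start_level=60.0, start_step=8.0,
                    final_step=1.0, final_reversals=8):
    level, step = start_level, start_step
    direction = None                      # +1 = level going up, -1 = going down
    correct_in_a_row = 0
    reversal_levels = []
    counted = 0                           # reversals reached at the final step size
    while counted < final_reversals:
        if detect(level):
            correct_in_a_row += 1
            move = -step if correct_in_a_row == 2 else 0.0
        else:
            correct_in_a_row = 0
            move = step
        if move != 0.0:
            new_direction = 1 if move > 0 else -1
            if direction is not None and new_direction != direction:
                reversal_levels.append(level)             # a reversal at this level
                if step > final_step:
                    if len(reversal_levels) % 2 == 0:     # every second reversal
                        step = max(step / 2.0, final_step)
                else:
                    counted += 1
            direction = new_direction
            level += move
            if correct_in_a_row == 2:
                correct_in_a_row = 0
    return statistics.median(reversal_levels[-final_reversals:])
```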

The decision device also calculates Ψ^N_i for each of the three intervals:

    Ψ^N_i = Ψ^N_i,mask − Ψ^N_mask,      (3.5)

where Ψ^N_i,mask is the model output for the complete signal in that interval, including the masker. So, for one of the intervals Ψ^N_i contains the signal in the presence of the masker (corrupted by internal noise), and for the other two intervals Ψ^N_i contains internal noise only. Finally, the decision device determines the correlation coefficients between the template Ψ^N_template and each of the three Ψ^N_i. The interval that yields the highest correlation coefficient is considered to be the one containing the target signal. It can be shown that the correlation coefficient between the expected signal (template) and the received signal is a monotonic function of the likelihood ratio. Therefore, detectors that are based on correlation are called optimal detectors [Dau et al., 1996a].

In the next sections, some examples are given of how the model can be used to predict (monaural) audibility thresholds for various masking conditions.

Absolute Threshold of Hearing
The Absolute Threshold of Hearing (ATH) is defined as the SPL at which pure tones are just audible. In the model by Breebaart [2001], the ATH is simulated by adding Gaussian noise with a mean of 0 and a frequency-independent standard deviation of 3 MU just before the half-wave rectification process. To test this method of simulating the ATH, a virtual 3IFC test is conducted. In this test, the model is presented with three intervals in each trial: one containing a 1 s pure tone of a particular frequency and at a certain SPL, and two intervals containing 1 s of silence. For each trial, the decision device has to find the interval containing the tone. The threshold is then found using the previously mentioned adaptive two-down one-up procedure.

Figure 3.4 shows the resulting ATH curve as predicted by the model using the Breebaart method of adding frequency-independent noise, compared with the ATH curve from the literature. As can be seen, the ATH predicted by the Breebaart model follows the theoretical curve within a couple of dB over a large frequency range of roughly 100 Hz to 10 kHz. However, for very low and very high frequencies, the predicted values no longer follow the theoretical ones. Also, the Breebaart model does not predict the dip in the ATH around 3 kHz. The difference between the predicted and theoretical values for low frequencies (< 100 Hz) can possibly be explained by the fact that at low frequencies the original shape of the (sine) tone is preserved and the movement of the basilar membrane is much slower than for high-frequency tones. The start of each period in the sine wave can then be seen as an onset, which may lead to overshoots in the

output due to the neural adaptation. These overshoots make it easier for the model to detect the tones in the internal noise, effectively lowering the ATH.

Figure 3.4: The Absolute Threshold of Hearing (ATH) as predicted by the auditory model in a virtual 3IFC procedure. The dashed line represents the ATH curve as found in the literature [Terhardt, 1979], the solid line denotes the model predictions. Here, the Breebaart method was used, i.e. Gaussian noise was added prior to the inner hair cell simulator to include the ATH in the model.

To improve the simulation of the ATH, another method is proposed here, as discussed in Section 3.5.1: instead of adding noise, a frequency-dependent threshold ε(f) is applied prior to the adaptation stage. The values for ε(f) are tuned in one-third octave bands and can be found in Appendix A. Figure 3.5 shows the resulting ATH curve. It can be seen that it closely matches the true ATH curve as found in the literature.

Figure 3.5: The Absolute Threshold of Hearing (ATH) as predicted by the auditory model in a virtual 3IFC procedure. The dashed line represents the ATH curve as found in the literature [Terhardt, 1979], the solid line denotes the model predictions. Here, a new method was used by applying a frequency-dependent threshold prior to the adaptation stage.

Simultaneous masking
Simultaneous masking occurs when a target signal is masked by another signal while both signals are presented simultaneously. The amount of masking depends on the spectral properties of both the target signal and the masker. The phase of the signals relative to each other may also have an effect, due to interference.

The monaural model is used to predict audibility thresholds in a virtual 3IFC procedure for a simultaneous masking experiment. The target signal consisted of a 200 ms sine wave, starting at t_0,sig = 200 ms. The masker was a 600 ms white noise signal (bandwidth 0 - 15 kHz) starting at t_0,mask = 0 s and presented at an SPL of 70 dB. The masker consisted of frozen noise, meaning that the same noise masker was used within each trial and interval. The opposite is called running noise; in that case a new random noise signal is generated each time [Dau et al., 1996b].

Figure 3.6 shows the resulting thresholds as a function of target signal frequency as predicted by the monaural model. As can be seen, for lower frequencies (< 1 kHz) a fine structure in the thresholds can be observed, due to interference between the target signal and the masker. For frequencies higher than 1 kHz another effect is dominant: the width of the auditory filters increases with frequency and, therefore, the amount of masker energy that falls into one auditory filter also increases. It is therefore not surprising that the SPL of the target signal needs to increase with frequency in order to remain audible.

Figure 3.6: Masked thresholds of a 200 ms pure tone in a frozen noise masker as a function of signal frequency. The masker was a 600 ms white noise signal (0 - 15 kHz) presented at a level of 77 dB SPL. The solid line represents the model predictions, the dashed line denotes the informal listening test results of one subject.

In Fig. 3.6, the results of an informal listening test following a 3IFC procedure with one subject (the author) are also shown. In this test, the signals were presented through Beyerdynamic DT 770 Pro headphones connected to an RME Fireface

400 audio interface running at 44.1 kHz and controlled by a PC. This was not a formal test, since no extensive calibration of the headphones was performed and the number of subjects is obviously too small. Still, it can be seen that the trends in both curves are quite similar. Furthermore, the curves differ by a few dB at most.

Forward masking
Due to adaptation in the response of the neurons attached to the inner hair cells, the neurons need some time to relax following the offset of a sound. As a result, a sound event can mask a target sound, even when the target occurs some time after the first event has ended. This effect is called forward masking. Due to the inclusion of the adaptation loops, the auditory model is able to simulate this effect.

To test this, another virtual 3IFC procedure was conducted. This time, the target signal was a short (10 ms) pure tone, preceded by a frozen white noise masker (20 - 5000 Hz) with a length of 200 ms. The offset of the target signal was adjusted from -10 ms to +50 ms in steps of 10 ms. The 3IFC procedure was repeated for each offset value to find the masked threshold as a function of the offset. The results for the masked thresholds as predicted by the model in this procedure are shown in Fig. 3.7. Also, the results of a listening test with one subject (the author) are shown. This informal test was conducted under the same conditions as described in the previous section.

Figure 3.7: Masked thresholds of a 10 ms, 1 kHz tone and a frozen noise masker as a function of signal offset relative to the end of the masker. The masker was a 200 ms white noise signal (20 - 5000 Hz) presented at a level of 80 dB SPL. The solid line represents the model predictions, the dashed line denotes the informal listening test results of one subject.

From the figure, it can be seen that, due to the forward masking effect, the masked threshold for the target signal decreases with increasing offset relative to the end of the

masker, as expected. The thresholds predicted by the model are close to the thresholds found in the informal listening test.

3.5.3 Binaural extension

As explained in the previous section, the monaural model is able to predict various aspects of masking. In the binaural case, i.e. when the signals presented at the left and right ear are not equal, so-called Binaural Masking Level Differences (BMLDs) may occur. For example, when a subject is presented with a sinusoid which is interaurally out-of-phase, together with an in-phase sinusoidal masker of the same frequency, static ITDs and/or ILDs are created. Due to these ITDs and ILDs, the detection threshold for the out-of-phase signal is lower than for an in-phase signal [Yost, 1972]. It has been found that BMLDs can also occur for dynamically varying ILDs [Grantham and Robinson, 1977] and ITDs [Grantham and Wightman, 1978].

Breebaart extended the monaural model with a binaural processor in order to account for BMLDs [Breebaart, 2001]. This processor is based on the Equalization-Cancellation (EC) theory proposed by Durlach [1963]. According to this theory, the signals arriving at the left and right ear are adjusted in time delay and interaural level difference in such a way that the masker waveforms are equalized (Equalization), followed by a subtraction of one signal from the other (Cancellation). This often leads to a higher signal-to-masker ratio and, therefore, to a positive BMLD.

Figure 3.8 schematically shows the binaural processor as used by Breebaart [2001]. In the horizontal direction, the left and right channel signals are delayed by different times by passing through a tapped delay line (triangles). In the vertical direction, the signals are attenuated in discrete steps Δα by a chain of attenuators (rectangles). Attached to these tapped attenuator lines are EI-type (Excitation-Inhibition) elements (circles), so each of these EI-type elements has a characteristic ITD and ILD. Inside these elements, the inhibitory signal is subtracted from the excitatory signal (for example the right signal from the left signal), with an output of zero if the inhibitory signal is stronger than the excitatory signal. The output of each EI-type element i for each time sample n and frequency band k is given by:

    E_i[n, k] = (1/f_s) Σ_{m=−∞}^{∞} ( 10^(α_i/40) Ψ_L[m + τ_i f_s/2, k] − 10^(−α_i/40) Ψ_R[m − τ_i f_s/2, k] )² w[m − n],      (3.6)

where α_i and τ_i are the characteristic level and time differences of the EI element, respectively, and Ψ_L and Ψ_R are the left- and right-ear monaural model outputs (taken right after the adaptation stage). w[m − n] is a sliding temporal integrator, which is used to simulate a finite temporal resolution:

Figure 3.8: A schematic version of the structure of the binaural processor based on EC theory.

    w[m − n] = 1/(2 τ_w) · exp( −|m − n| / (τ_w f_s) ),      (3.7)

where τ_w = 30 ms. As can be seen, w[m − n] is a double-sided exponential window.

The outputs of the EI-type elements together form a pattern of EI activity as a function of characteristic ITD and ILD. A minimum output will be found at the EI-type element whose characteristic ITD and ILD match those of the input signals. In [Breebaart, 2001], it was shown that, by using the outputs of the binaural processor as an extra input to the optimal detector stage (see Section 3.5.2), BMLDs can be predicted accurately in various binaural masking experiments.

However, as discussed in Section 3.4.3, ITD is the dominant cue with respect to the perception of spaciousness. Therefore, it was chosen to focus on ITD to evaluate spaciousness in this project. This means that it is not necessary to find the ILD for each sample n and frequency band k. If the following substitutions are made:

    Ψ'_L = Ψ_L[m + τ_i f_s/2, k]      (3.8)

and

    Ψ'_R = Ψ_R[m − τ_i f_s/2, k],      (3.9)

then Eq. 3.6 can be rewritten as:

    E_i[n, k] = (1/f_s) Σ_{m=−∞}^{∞} ( 10^(α_i/20) Ψ'_L² + 10^(−α_i/20) Ψ'_R² − 2 Ψ'_L Ψ'_R ) w[m − n].      (3.10)

The maximum value of the ITD will be around ±700 µs, depending on the dimensions of the head [Middlebrooks, 1999]. Therefore, it can safely be assumed that τ_i << τ_w, leading to the following approximations:

    Σ_{m=−∞}^{∞} 10^(α_i/20) Ψ'_L² w[m − n] ≈ Σ_{m=−∞}^{∞} 10^(α_i/20) Ψ_L[m, k]² w[m − n]      (3.11)

and

    Σ_{m=−∞}^{∞} 10^(−α_i/20) Ψ'_R² w[m − n] ≈ Σ_{m=−∞}^{∞} 10^(−α_i/20) Ψ_R[m, k]² w[m − n].      (3.12)

This means that the first two terms in the summation in Eq. 3.10 are independent of τ_i, while the third term is independent of α_i. Therefore, the ITD can be obtained as follows:

    ITD[n, k] = argmin_{τ_i} { E_i[n, k] } = argmax_{τ_i} { Σ_{m=−∞}^{∞} Ψ'_L Ψ'_R w[m − n] }.      (3.13)

The term Ψ'_L Ψ'_R, summed with the sliding window, is the running cross-correlation between Ψ_L and Ψ_R. Due to this simplification, the ITD can be calculated much faster than when the full EC algorithm, including ILD, is used. The complete binaural model, including the binaural processor based on cross-correlation, is shown in Fig. 3.10.
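As an illustration of this simplified binaural processor, the sketch below evaluates Eq. 3.13 by brute force for a single frequency band: at every analysis instant the lag that maximises the cross-correlation, weighted with the double-sided exponential window of Eq. 3.7, is selected as the ITD. The window truncation, the hop size and the fact that only the right channel is shifted (by the full lag instead of half a lag per channel) are simplifications introduced here for clarity; this is not the thesis implementation.

```python
# Minimal sketch of Eq. 3.13: running ITD from the windowed cross-correlation
# of the left and right model outputs in one frequency band.
import numpy as np

def running_itd(psi_l, psi_r, fs, tau_w=0.030, max_itd=0.0007, hop=441):
    psi_l = np.asarray(psi_l, dtype=float)
    psi_r = np.asarray(psi_r, dtype=float)
    max_lag = int(round(max_itd * fs))              # search range of about +/- 700 us
    half = int(round(5 * tau_w * fs))               # truncate the window at +/- 5 tau_w
    m = np.arange(-half, half + 1)
    w = np.exp(-np.abs(m) / (tau_w * fs)) / (2.0 * tau_w)   # Eq. 3.7
    itds = []
    for n in range(half + max_lag, len(psi_l) - half - max_lag, hop):
        seg_l = psi_l[n - half:n + half + 1]
        best_lag, best_val = 0, -np.inf
        for lag in range(-max_lag, max_lag + 1):
            seg_r = psi_r[n - half - lag:n + half + 1 - lag]
            val = np.sum(seg_l * seg_r * w)         # windowed cross-correlation
            if val > best_val:
                best_val, best_lag = val, lag
        itds.append(best_lag / fs)                  # best lag converted to seconds
    return np.array(itds)
```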

Figure 3.9: A schematic version of the binaural processor used within this research, where the ITD as a function of time is determined by calculating the running cross-correlation.

Figure 3.10: A schematic version of the complete auditory model, including the binaural processor.

3.6 Objective parameters from the model outputs

In the previous section, it has been shown that the proposed binaural auditory model (Fig. 3.10) has all the properties that were listed as being necessary for making an accurate prediction of various aspects of room acoustics. In this section, new objective parameters are proposed, which can be determined from the outputs of the binaural model. The parameters are related to the four perceptual attributes listed in Section 3.4.2: reverberance, clarity, ASW and LEV. The proposed equations are based on the theory of the perception of room acoustics; their predictive potential will be investigated later in this thesis.

3.6.1 Model output examples

Before the new objective parameters are explained, it is useful to look at some examples of model results in different room acoustical environments. Three different (real) environments will be considered:

(1) A dry room with a very short reverberation time: T20 = 0.2 s.
(2) A moderately reverberant room with a moderate reverberation time: T20 = 1.7 s.
(3) A highly reverberant room with a very long reverberation time: T20 = 10.1 s.

The model outputs are examined for two different stimuli: a solo male speech sample and a solo cello sample. These stimuli were chosen because they are very different in terms of temporal structure and spectral content. Figures 3.11 and 3.12 show the time domain waveform and frequency spectrum of both samples. The male speech sample is taken from the Sound Quality Assessment Material (SQAM) CD [EBU, 1988], while the solo cello sample was downloaded from the website of the room acoustical simulation software package ODEON. Both samples are anechoic, meaning that they were recorded in an acoustically dry environment such as an anechoic room. This makes it possible to apply measured or simulated impulse responses from various rooms to the samples through convolution, without any influence from the room in which the samples were originally recorded.

Figure 3.11: The time domain waveform (the part from 1 to 5 s, left) and long-term frequency spectrum (right) of the male speech sound sample. The spectrum is smoothed over 1/3 octave bands.

As can be seen from the figures, the time-domain waveforms and long-term frequency spectra of both signals look quite different. For example, the male speech sample shows more distinct components, with pauses in between them, whereas the solo cello sample looks more continuous. Both stimuli were convolved with the measured impulse responses for the three environments mentioned above. The resulting sound samples were normalized to the same loudness level using the Replaygain algorithm (see Appendix B). For all samples, the model outputs Ψ_L, Ψ_R and ITD are examined for the frequency band with center frequency 500 Hz. The results are shown in Figs. 3.13, 3.14 and 3.15.
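The preparation of these test samples, i.e. convolving an anechoic recording with a measured binaural room impulse response and bringing the results to a common level, can be sketched as follows. The soundfile package, the file names and the plain RMS equalisation (used here as a simple stand-in for the Replaygain-based loudness normalisation of Appendix B) are assumptions made for illustration only.

```python
# Minimal sketch: convolve a mono anechoic sample with a two-channel (binaural)
# room impulse response and write a roughly level-matched stereo file.
import numpy as np
import soundfile as sf
from scipy.signal import fftconvolve

def make_binaural_sample(anechoic_path, brir_path, out_path, target_rms=0.05):
    x, fs = sf.read(anechoic_path)                 # mono anechoic recording
    brir, fs_h = sf.read(brir_path)                # binaural RIR, shape (N, 2)
    assert fs == fs_h, "sample rates must match"
    left = fftconvolve(x, brir[:, 0])
    right = fftconvolve(x, brir[:, 1])
    y = np.stack([left, right], axis=1)
    y *= target_rms / np.sqrt(np.mean(y ** 2))     # crude level normalisation
    sf.write(out_path, y, fs)

# Example (hypothetical file names):
# make_binaural_sample("speech_anechoic.wav", "hall_brir.wav", "speech_hall.wav")
```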

Figure 3.12: The time domain waveform (the part of 1 to 5 s, left) and long-term frequency spectrum (right) of the solo cello sound sample. The spectrum is smoothed over 1/3 octave bands.

Three main effects of the environment on the model outputs can be observed in the figures:

(1) In the acoustically dry room (Fig. 3.13), the signal components are more distinct from each other. The more reverberant a room is, the more the signal components tend to smear together in the model outputs (Figs. 3.14 and 3.15).
(2) Reverberation seems to have an effect on the overall output level; the more reverberant a room is, the lower the overall output level is, even though all samples were normalized to the same loudness level.
(3) Almost no fluctuation in ITD as a function of time can be seen in the acoustically dry case (Fig. 3.13), whereas ITD fluctuates severely in the highly reverberant room (Fig. 3.15).

The first two effects are also found when comparing both samples: in the case of the cello sample, the various components are less distinct and the overall output level for \Psi is lower compared with the solo speech case. Effects 1 and 2 can be explained from the non-linear behaviour of the model. As explained earlier in this chapter, the model is most sensitive at low input levels due to the adaptation stage. As a result, overshoots will occur at the onsets of signal components. The more reverberant a room is, the more stationary a signal is, with higher energy in between the components due to smearing effects. This results in a lower overall output level and a less pronounced response to onsets. This also explains the differences between the cello and speech stimuli for the outputs \Psi. The speech sample contains parts where the level reaches silence (Fig. 3.11), whereas for the cello sample the components (i.e. notes) are connected together, making the signal more stationary (Fig. 3.12).

Figure 3.13: The model outputs \Psi_L, \Psi_R and ITD for the frequency band with center frequency 500 Hz. Two stimuli were used: male speech (left) and solo cello (right) stimuli in an acoustically dry room. This room has a reverberation time of T20 = 0.2 s.

Figure 3.14: The model outputs \Psi_L, \Psi_R and ITD for the frequency band with center frequency 500 Hz. Two stimuli were used: solo male speech (left) and solo cello (right) stimuli in a moderately reverberant room. This room has a reverberation time of T20 = 1.7 s.

Figure 3.15: The model outputs \Psi_L, \Psi_R and ITD for the frequency band with center frequency 500 Hz. Two stimuli were used: solo male speech (left) and solo cello (right) stimuli in a highly reverberant room. This room has a reverberation time of T20 = 10.1 s.

Therefore, the speech sample will generally yield a higher model output with higher, more distinct peaks.

The third effect is a result of laterally arriving energy. In reverberant rooms, reflections arriving from lateral directions result in an ITD which fluctuates over time. For example, if the direct sound is located exactly on-axis with respect to the median plane, a static ITD of 0 ms will be found in an anechoic situation. When a dominant reflection arrives from the right hand side, a positive ITD may occur (if ITD is defined as the difference between the arrival time at the right ear and the arrival time at the left ear). A dominant reflection from the opposite direction leads to an ITD with opposite sign. If the reflection comes exactly from a direction corresponding to the axis through both ears, a maximum value for the ITD will be found. As explained in Section 3.5.3, this maximum will be around ±700 µs.

Splitting of the output stream

According to authors like Griesinger [1999] and Rumsey [2002], there is evidence that the human brain analyzes an incoming sound stream by splitting it up into two different streams: one belonging to the sound source(s) (Griesinger refers to this as the foreground stream) and one related to the environment (background stream). In experiments for finding room acoustical attributes, it has been found that human subjects are indeed able to describe source and room properties separately [Berg and Rumsey, 2001]. The non-linear properties of the model and the resulting behaviour in response to onsets and offsets of signal components can be used to split the model outputs \Psi_L and \Psi_R into these two streams.

In order to perform this splitting, a basic peak detection algorithm was implemented which labels a part of an output stream \Psi as a peak if it is above a threshold \Psi_{min} for at least a certain time T_{min}, see Fig. 3.16. After the peak detection process, the parts of the output streams that are labeled as peaks are grouped together to form the direct stream. The parts that are not labeled as peaks are labeled as indirect sound and are assigned to the reverberant stream. In both streams, unassigned parts are set to zero. Therefore, the sum of the direct and reverberant streams yields the original signal. The direct stream is considered to be related to the sound sources and the reverberant stream to the room/environment. Notice the similarity with Griesinger's theory regarding foreground and background streams [Griesinger, 1999].

The threshold \Psi_{min} is calculated for each frequency band with center frequency f_c:

\Psi_{min}(f_c) = \mu_\Psi\, L_\Psi(f_c), \qquad (3.14)

where L_\Psi(f_c) is the average absolute level of the model output for the frequency band:

Figure 3.16: When a model output \Psi is above a threshold value \Psi_{min} for a time T \geq T_{min}, this part of the output stream is labeled as a peak by the peak detector.

L_\Psi(f_c) = \frac{1}{T} \int_0^T \big| \Psi(t, f_c) \big| \, dt, \qquad (3.15)

where T is the total length of the output \Psi. In Eq. 3.14, \mu_\Psi is a frequency-independent constant. The value of this constant will be discussed in Chapter 6.

Since sudden offsets will lead to an undershoot in the model output, these undershoots should also be detected and assigned to the direct stream. Therefore, the peak detection is repeated in the negative direction to detect dips. The peak detection algorithm will label a part of the model output stream as a dip when the output level is below a threshold \Psi_{min,dip} for at least a time T \geq T_{min}. Because these undershoots are generally smaller in amplitude than the overshoots that result from onsets, a different constant \mu_{\Psi,dip} is used to determine this threshold:

\Psi_{min,dip}(f_c) = -\mu_{\Psi,dip}\, L_\Psi(f_c). \qquad (3.16)

The constant \mu_{\Psi,dip} will be further discussed in Chapter 6.

Figures 3.17, 3.18 and 3.19 show example results for the direct and reverberant output streams when the peak detection algorithm is applied to the right-channel model outputs \Psi_R of Figs. 3.13, 3.14 and 3.15, respectively. As can be seen, more reverberation leads to more energy in the reverberant stream, as expected.
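A minimal sketch of this peak-detection-based splitting for a single frequency band is given below. The constants mu (for peaks), mu_dip (for dips) and t_min are the free parameters discussed above, whose values are left to Chapter 6; the negative dip threshold follows the reconstruction of Eq. 3.16 used here, and the function name is an illustrative choice.

```python
import numpy as np

def split_streams(psi, fs, mu, mu_dip, t_min):
    """Split one model output Psi[n] (one band) into direct and reverberant
    streams using the threshold/duration rule of Eqs. 3.14-3.16 (sketch)."""
    n_min = int(round(t_min * fs))          # minimum peak/dip duration in samples
    level = np.mean(np.abs(psi))            # L_Psi, discrete form of Eq. 3.15
    thr_peak = mu * level                   # Psi_min, Eq. 3.14
    thr_dip = -mu_dip * level               # Psi_min,dip, Eq. 3.16

    direct_mask = np.zeros(len(psi), dtype=bool)
    for above in (psi > thr_peak, psi < thr_dip):   # overshoots, then undershoots
        edges = np.flatnonzero(np.diff(above.astype(int)))
        bounds = np.concatenate(([0], edges + 1, [len(psi)]))
        for start, stop in zip(bounds[:-1], bounds[1:]):
            if above[start] and (stop - start) >= n_min:
                direct_mask[start:stop] = True      # long enough: direct stream

    direct = np.where(direct_mask, psi, 0.0)
    reverberant = np.where(direct_mask, 0.0, psi)   # the two streams sum to psi
    return direct, reverberant, direct_mask
```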

Figure 3.17: The split output streams \Psi_{R,direct} and \Psi_{R,reverberant} for the frequency band with center frequency 500 Hz. Two stimuli were used: solo male speech (left) and solo cello (right) stimuli in an acoustically dry room. This room has a reverberation time T20 = 0.2 s. See Fig. 3.13 for the original, un-split model output stream.

Direct and reverberant streams for ITD

The direct and reverberant streams found using the peak detection algorithm are also used to split the ITD values into two streams: one for the direct output of the model (ITD_{DIR}) and one for the reverberant part (ITD_{REV}):

ITD_{DIR}[m, k] = \zeta_{DIR}[m, k]\, ITD[m, k], \qquad (3.17)

and:

ITD_{REV}[m, k] = \big(1 - \zeta_{DIR}[m, k]\big)\, ITD[m, k]. \qquad (3.18)

In these two equations, ITD[m, k] is the ITD value for time frame m and frequency band k. \zeta_{DIR}[m, k] acts as a mask; it is equal to one if both monaural outputs \Psi_L[m, k] and \Psi_R[m, k] are assigned to the direct stream, and it is zero if both outputs are assigned to the reverberant stream.

Figure 3.18: The split output streams \Psi_{R,direct} and \Psi_{R,reverberant} for the frequency band with center frequency 500 Hz. Two stimuli were used: solo male speech (left) and solo cello (right) stimuli in a moderately reverberant room. This room has a reverberation time T20 = 1.7 s. See Fig. 3.14 for the original, un-split model output stream.

If only one of the two model outputs is assigned to the direct stream, \zeta_{DIR} = 0.5. So, \zeta_{DIR} also acts as a weighting matrix: more weight is given to ITD values when direct energy is detected in both ears simultaneously, and less weight when direct energy is detected in only one of the ears.
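The ITD splitting of Eqs. 3.17 and 3.18 then reduces to a simple weighting by the mask described above. The sketch below assumes boolean per-frame direct-stream masks for the two ears (for instance obtained from the peak detector); it is an illustration, not the thesis implementation.

```python
import numpy as np

def split_itd(itd, dir_mask_l, dir_mask_r):
    """Split ITD[m] of one band into ITD_DIR and ITD_REV (Eqs. 3.17-3.18, sketch)."""
    zeta = 0.5 * (dir_mask_l.astype(float) + dir_mask_r.astype(float))
    # zeta = 1 when both ears are direct, 0.5 when only one is, 0 when neither is
    itd_dir = zeta * itd
    itd_rev = (1.0 - zeta) * itd
    return itd_dir, itd_rev
```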

Figure 3.19: The split output streams \Psi_{R,direct} and \Psi_{R,reverberant} for the frequency band with center frequency 500 Hz. Two stimuli were used: solo male speech (left) and solo cello (right) stimuli in a highly reverberant room. This room has a reverberation time T20 = 10.1 s. See Fig. 3.15 for the original, un-split model output stream.

Objective parameters

In the following sections, proposals will be made for obtaining objective parameters from the various model outputs, by combining psycho-acoustical theories with the findings from the previous section on how the model outputs are affected by different room acoustical environments. The proposals include some free parameters which need to be optimized, such as certain weighting factors and frequency limits. This optimization will be performed in Chapter 6.

Reverberance

As discussed in Section 2.3.1, the amount of reverberance is related to the effect that, due to the reflections in a room, it takes some time for the sound level to decay after a sound source has stopped generating sound. Based on the example results in the previous section, the following parameter is proposed as a predictor for reverberance based on the model outputs:

P_{REV} = L_{REV} = \frac{1}{B_f} \int_{f_0}^{f_1} \frac{1}{T} \int_0^T \sqrt{ \Psi_{L,rev}(t, f)^2 + \Psi_{R,rev}(t, f)^2 } \; dt \, df, \qquad (3.19)

where f_0 and f_1 are the lower and upper frequency limits over which P_{REV} is evaluated, and B_f = f_1 - f_0 is the corresponding frequency bandwidth. \Psi_{L,rev} and \Psi_{R,rev} are the reverberant output streams for the left- and right-ear channels, respectively. L_{REV} can be considered as being the mean reverberant stream level.

In practice, the outputs \Psi are available at discrete time samples n and for frequency bands k. Therefore, the integrals in Eq. 3.19 become summations:

L_{REV} = \frac{1}{K} \frac{1}{N} \sum_{k=k_0}^{k_1} \sum_{n=0}^{N-1} \sqrt{ \Psi_{L,rev}[n, k]^2 + \Psi_{R,rev}[n, k]^2 }, \qquad (3.20)

where N is the total number of time samples, k_0 and k_1 are the lowest and highest frequency bands involved in the calculation, and K = k_1 - k_0 + 1 is the resulting number of frequency bands.

Clarity

In Section 2.3.2, it was explained that the current objective parameter for predicting the perceived amount of clarity is based on the ratio between early energy and late energy in the impulse response, with a certain time limit separating early from late. This time limit depends on the stimulus type: 50 ms for speech and 80 ms for music. Based on the model outputs, the following new clarity predictor is proposed:

P_{CLA} = \frac{L_{DIR}}{L_{REV}}, \qquad (3.21)

where L_{DIR} is defined in a way similar to Eq. 3.20:

L_{DIR} = \frac{1}{K} \frac{1}{N} \sum_{k=k_0}^{k_1} \sum_{n=0}^{N-1} \sqrt{ \Psi_{L,dir}[n, k]^2 + \Psi_{R,dir}[n, k]^2 }. \qquad (3.22)

The frequency limits k_0 and k_1 are equal to the ones used to calculate L_{REV}. This basically makes P_{CLA} a ratio between the mean direct and reverberant stream levels. Contrary to the current Clarity Index parameter, no assumptions are made in terms of time limits; based on the peak detection it is decided how much energy is assigned to the direct stream and how much to the reverberant stream. This way, forward masking effects (when the direct sound masks the reverberant energy or vice versa) can also be taken into account.
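A compact sketch of how L_REV, L_DIR and the resulting predictors could be computed from the split streams is given below, assuming the outputs are stored as (bands x samples) arrays. The square root over the summed squared channels follows the reconstruction of Eqs. 3.20 and 3.22 used here; function names and the array layout are illustrative.

```python
import numpy as np

def stream_level(psi_l, psi_r, k0, k1):
    """Mean stream level over bands k0..k1 (Eqs. 3.20 / 3.22, sketch).
    psi_l, psi_r: arrays of shape (bands, samples) for one stream."""
    band = slice(k0, k1 + 1)
    return np.mean(np.sqrt(psi_l[band] ** 2 + psi_r[band] ** 2))

def p_rev(psi_l_rev, psi_r_rev, k0, k1):
    """Reverberance predictor P_REV = L_REV (Eq. 3.19)."""
    return stream_level(psi_l_rev, psi_r_rev, k0, k1)

def p_cla(psi_l_dir, psi_r_dir, psi_l_rev, psi_r_rev, k0, k1):
    """Clarity predictor P_CLA = L_DIR / L_REV (Eq. 3.21)."""
    return (stream_level(psi_l_dir, psi_r_dir, k0, k1)
            / stream_level(psi_l_rev, psi_r_rev, k0, k1))
```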

Apparent Source Width

The established objective parameters for predicting Apparent Source Width (ASW) are the (Early) Lateral Fraction LF and the Early Interaural Cross-Correlation IACC_E (Section 2.3.5). In Section 2.6, it was discussed that various authors recently found that the absolute sound pressure level (especially at low frequencies) also contributes to ASW. With these recent findings in mind, the following objective parameter is proposed as a predictor for ASW:

P_{ASW} = \mu_{ASW}\, L_{LOW} + \log_{10}\!\left\{ 1 + \nu_{ASW}\, \sigma_{ITD,DIR} \cdot 10^3 \right\}, \qquad (3.23)

where \sigma_{ITD,DIR} is defined as:

\sigma_{ITD,DIR} = \frac{1}{Q} \sum_{q=q_0}^{q_1} \sqrt{ \frac{1}{M - 1} \sum_{m=0}^{M-1} \left( ITD_{DIR}[m, q] - \overline{ITD}_{DIR,q} \right)^2 }, \qquad (3.24)

where ITD_{DIR}[m, q] is the ITD_{DIR} value for time frame m and frequency band q, and \overline{ITD}_{DIR,q} is the average ITD_{DIR} value for frequency band q. Equation 3.24 is equal to the unbiased standard deviation of the ITD in the direct stream, averaged over the frequency bands q_0 to q_1. L_{LOW} is defined as the average level of the output streams for low frequencies:

L_{LOW} = \frac{1}{K_z} \frac{1}{N} \sum_{z=z_0}^{z_1} \sum_{n=0}^{N-1} \sqrt{ \Psi_L[n, z]^2 + \Psi_R[n, z]^2 }. \qquad (3.25)

In Eq. 3.25, z_0 and z_1 are the lower and upper frequency bands defining the low frequency range. In Eq. 3.23, \mu_{ASW} and \nu_{ASW} are weighting constants. Optimal values for these constants will be discussed in Chapter 6. Since the model output L_{LOW} has a logarithmic character (see Section 3.5.1), it was decided to also use the logarithm of the ITD fluctuations in Eq. 3.23.

Listener Envelopment

As discussed in Section 2.3.6, the currently established objective parameters for predicting Listener Envelopment (LEV) are the late lateral sound level GLL and the Late Interaural Cross-Correlation IACC_L. In Section 2.6 it was explained that the overall late sound strength G_{late} has recently been found to play a role as well. Therefore, the following parameter is proposed for predicting LEV from the model outputs:

P_{LEV} = \mu_{LEV}\, L_{REV} + \log_{10}\!\left\{ 1 + \nu_{LEV}\, \sigma_{ITD,REV} \cdot 10^3 \right\}. \qquad (3.26)

Here, \mu_{LEV} and \nu_{LEV} are weighting constants which need to be tuned (further discussion in Chapter 6). L_{REV} is the mean reverberant level as defined in Eq. 3.20 and \sigma_{ITD,REV} is defined as:

\sigma_{ITD,REV} = \frac{1}{Q} \sum_{q=q_0}^{q_1} \sqrt{ \frac{1}{M - 1} \sum_{m=0}^{M-1} \left( ITD_{REV}[m, q] - \overline{ITD}_{REV,q} \right)^2 }, \qquad (3.27)

where ITD_{REV}[m, q] is the ITD_{REV} value for time frame m and frequency band q, and \overline{ITD}_{REV,q} is the average ITD_{REV} value for frequency band q. Equation 3.27 is equal to the unbiased standard deviation of the ITD in the reverberant stream, averaged over the frequency bands q_0 to q_1.

Summary and discussion

In this chapter, a non-linear, binaural model was introduced. The model has two inputs for the left and right ear signals and processes these both monaurally (resulting in \Psi_L and \Psi_R as a function of time and frequency) and binaurally (resulting in the Interaural Time Difference ITD as a function of time and frequency). The outputs are split into streams related to the direct sound field and to the reverberant sound field using a splitting algorithm based on peak detection. In the last sections, objective parameters were proposed that can be determined from the model results. The parameters should be related to the perceptual room acoustical attributes reverberance, clarity, Apparent Source Width and Listener Envelopment. The performance of these parameters will be optimized and validated in Chapter 6.

4 Simulating room acoustics

Reality is merely an illusion, albeit a very persistent one. (Albert Einstein, 1879 - 1955)

To optimize and validate the auditory model and the objective room acoustical parameters proposed in Chapter 3, it is necessary to apply the model in environments with a wide range of acoustical properties. In order to do so, one could perform impulse response measurements in a large number of rooms with varying acoustical properties. However, it turns out that in practice room acoustical attributes are highly correlated. For example, rooms with much reverberation most often have low clarity values and high values for attributes related to spaciousness. To be able to control the attributes more independently, a room acoustical simulator was developed in this project. This simulator is able to simulate room impulse responses for shoebox-shaped rooms. The simulator can also be used to generate rooms with extreme values for the acoustical attributes, which cannot easily be found in the real world. In Chapter 5, this simulator will be used, together with impulse responses measured in real rooms, to perform listening tests. In these tests various aspects of room acoustics will be investigated perceptually.

4.1 A simulator for shoebox-shaped rooms

The goal is to simulate acoustical environments with varying acoustical properties. It is, therefore, not necessary to accurately simulate rooms with very complex geometries. A set of simulated impulse responses with different properties can also be obtained by simulating rooms with a shoebox shape, see Fig. 4.1. The benefit of using this geometry is that room impulse responses can be simulated relatively easily, by using a mirror image source technique [Allen and Berkley, 1979]. In this technique, each reflection is modelled by recursively mirroring the source in each of

the room boundaries. This is shown schematically in Fig. 4.2.

Figure 4.1: A schematic version of a shoebox-shaped room of volume L x W x H with a source (S) and receiver (R).

Other room acoustical simulation techniques exist, like ray tracing [Krokstad et al., 1968], beam tracing [Heckbert and Hanrahan, 1984], Finite Element Methods (FEM) [Craggs, 1994] and WRW¹ [Berkhout et al., 1999]. These methods are all suitable for more complex situations than mirror image modelling techniques. However, since the application here is limited to shoebox-shaped rooms, it was decided to use mirror image source modelling for reasons of simplicity and computational effort. The goal in this case is not to arrive at a very accurate prediction of the acoustics of a room. Instead, a tool is needed with which room impulse responses with varying properties can be generated. Because the results will also be used in listening tests, the room impulse responses should not sound artificial.

In the mirror image source technique the number of reflections increases rapidly with increasing reflection order. Therefore, this technique is often combined with statistical modelling, see for example [Savioja et al., 1999]. In such a case, the early reflections, for example up to an order of two, are simulated using the mirror image source technique, while the late reverberation for higher reflection orders is modelled using decaying noise signals. This keeps the number of calculations needed down. In practical situations the late reverberation tail is diffuse anyway. This procedure is followed here. The different stages of the modelling technique will be discussed in the next sections.

¹ In WRW, the W stands for Wave propagation and the R for Reflection. The method is derived from a multi-scattering algorithm from seismic imaging, which is based on wave theory.

Figure 4.2: Simulating a room impulse response by using a mirror image source technique. In this figure, a 2D version of the technique is shown with mirror image sources up to an order of two. This figure is based on a figure in [Vogel, 2003].
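As an illustration of the mirror image source technique, the sketch below enumerates image source positions and their reflection counts for a shoebox room with one corner in the origin. It is a simplified, Allen-and-Berkley-style sketch; the indexing scheme and function name are assumptions, not the simulator's actual code.

```python
import numpy as np

def image_sources(src, room, order):
    """Mirror-image source positions for a shoebox room (sketch).
    src: (x, y, z) source position; room: (L, W, H); order: max reflection order.
    Returns a list of (position, number_of_boundary_crossings)."""
    sx, sy, sz = src
    L, W, H = room
    sources = []
    rng = range(-order, order + 1)
    for nx in rng:
        for ny in rng:
            for nz in rng:
                if abs(nx) + abs(ny) + abs(nz) > order:
                    continue
                # mirroring an odd number of times flips the source coordinate
                x = nx * L + (sx if nx % 2 == 0 else L - sx)
                y = ny * W + (sy if ny % 2 == 0 else W - sy)
                z = nz * H + (sz if nz % 2 == 0 else H - sz)
                sources.append(((x, y, z), abs(nx) + abs(ny) + abs(nz)))
    return sources
```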

Computation in frequency bands

To arrive at realistic sounding room impulse responses, it is necessary to carry out all calculations for various frequency bands. The absorption coefficients of the room boundaries generally depend on frequency and, as will be shown in the next section, this is also true for the influence of absorption by air. Since absorption coefficients in the literature are mostly specified in octave bands in the range 125 Hz to 16 kHz, it was decided to use FIR filters with these center frequencies and bandwidths. The filters are constructed such that they sum up to a flat frequency response (corresponding to a delta pulse in the time domain). This is achieved by using cosine-shaped band-pass filters, for which the down-going ramp of a filter crosses the up-going ramp of the next filter exactly at the -6 dB point. The FIR filters with the lowest and highest center frequencies are designed as low-shelf and high-shelf filters, respectively. The individual filters are shown in the time and frequency domain in Fig. 4.3. Figure 4.4 shows the total sum of the filters.

In the following, when a time signal is specified as x_k(t) this means that the original signal x(t) has been convolved with an FIR filter h_k(t) from Fig. 4.3 for frequency band k (with center frequency f_k):

x_k(t) = h_k(t) * x(t). \qquad (4.1)

Figure 4.3: The eight linear-phase FIR filters used to perform the room acoustical simulation within octave bands. Left: time domain, right: frequency domain.
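A possible construction of such a complementary, cosine-shaped octave filter bank is sketched below: the amplitude responses are defined on a logarithmic frequency axis, crossfade between adjacent center frequencies (crossing at -6 dB), use shelf-like responses for the outer bands, and are turned into linear-phase FIR filters by inverse FFT. The tap count and the exact crossfade placement are assumptions of this sketch, not the thesis' filter design.

```python
import numpy as np

def octave_filter_bank(fs, n_taps=1024,
                       centers=(125, 250, 500, 1000, 2000, 4000, 8000, 16000)):
    """Linear-phase FIR band filters whose amplitude responses sum to one
    (cosine crossfades on a log-frequency axis, -6 dB at the crossover).
    Returns an array of shape (bands, n_taps)."""
    freqs = np.fft.rfftfreq(n_taps, 1.0 / fs)
    logf = np.log2(np.maximum(freqs, 1e-6))
    logc = np.log2(np.asarray(centers, dtype=float))
    bank = []
    for k, lc in enumerate(logc):
        amp = np.zeros_like(logf)
        if k == 0:
            amp[logf <= lc] = 1.0                     # shelf below the lowest centre
        else:                                         # up-going ramp from the previous centre
            lo = logc[k - 1]
            m = (logf >= lo) & (logf <= lc)
            amp[m] = 0.5 - 0.5 * np.cos(np.pi * (logf[m] - lo) / (lc - lo))
        if k == len(logc) - 1:
            amp[logf >= lc] = 1.0                     # shelf above the highest centre
        else:                                         # down-going ramp towards the next centre
            hi = logc[k + 1]
            m = (logf >= lc) & (logf <= hi)
            amp[m] = 0.5 + 0.5 * np.cos(np.pi * (logf[m] - lc) / (hi - lc))
        h = np.roll(np.fft.irfft(amp, n_taps), n_taps // 2)  # zero phase -> linear phase
        bank.append(h)
    return np.array(bank)   # np.sum(bank, axis=0) is a delayed delta pulse
```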

Figure 4.4: The sum of the eight linear-phase FIR filters used to perform the room acoustical simulation within octave bands. Left: time domain, right: frequency domain.

Initial estimation of the energy decay

To make an initial estimation of the reverberation time in a shoebox-shaped room, the famous Sabine formula (Eq. 1.1) could be used. However, a sound wave may travel a long distance in a room after a number of reflections, and sound absorption by air is significant in those cases, especially at high frequencies. Therefore, for an accurate prediction of the acoustics, air absorption also needs to be taken into account. The Sabine formula does not include air absorption, but a correction can be added to solve this [Vorländer, 2008]. This correction will be explained in the following.

Equations for determining the amount of sound absorption by air per meter, as a function of frequency and meteorological conditions, are specified in an ISO standard [ISO, 1993]. These equations are also presented in Appendix C.

An initial estimation of the energy decay can be made by first considering the mean free path \bar{d} [Kosten, 1960]:

\bar{d} = \frac{4V}{S}, \qquad (4.2)

where V is the volume of the room and S the total boundary surface area. The mean free path is basically the average of the free paths between two subsequent wall reflections, provided that the enclosure is filled homogeneously and isotropically with paths. It follows from Eq. 4.2 that for a certain mirror image source u with a distance r_u to the receiver, the expected number of reflections (boundary crossings) N_u corresponding to this mirror source is equal to:

N_u = \frac{r_u}{\bar{d}} = \frac{r_u S}{4V}. \qquad (4.3)

If the original source is a monopole transmitting a Dirac delta pulse \delta(t), then the expected sound pressure at the receiver due to a mirror image source u is given by:

h_{u,k}(t) = \big[1 - \bar{\alpha}[k]\big]^{\frac{r_u S}{8V}} \, \big[1 - \alpha_{air}[k]\big]^{\frac{r_u}{2}} \, \frac{\delta_k(t - r_u/c)}{r_u}. \qquad (4.4)

In Eq. 4.4, k is the number of the frequency band and c the velocity of sound. \alpha_{air}[k] is the absorption coefficient of air at frequency f_k and under certain meteorological conditions. Note that the first exponent in Eq. 4.4 is equal to the value in Eq. 4.3 divided by two, because \alpha is defined as the amount of energy absorbed. \bar{\alpha}[k] is the weighted average wall absorption coefficient, defined as:

\bar{\alpha}[k] = \frac{\sum_b \alpha_b[k] S_b}{\sum_b S_b}, \qquad (4.5)

where S_b and \alpha_b[k] are the surface area and absorption coefficient of boundary b, respectively.

It can be shown that the temporal reflection density increases with the square of t (see for example [Kuttruff, 2000]):

\frac{dN}{dt} = \frac{4\pi c^3 t^2}{V}. \qquad (4.6)

From Eqs. 4.4 and 4.6, it follows that the expected total mean-square pressure at a time t is equal to:

h^2_{rms,k}(t) = \frac{4\pi c}{V} \, \big(1 - \bar{\alpha}[k]\big)^{\frac{ctS}{4V}} \, \big(1 - \alpha_{air}[k]\big)^{ct}. \qquad (4.7)

An estimate of the reverberation time T_{60} follows from a decay of 60 dB with respect to the initial mean-square pressure:

\frac{p^2_{rms,k}(T_{60,k})}{p^2_{rms,k}(0)} = 10^{-6}. \qquad (4.8)

When this equation is solved, T_{60} for each frequency band k is given by [Vorländer, 2008]:

T_{60}[k] = \frac{-24 V \ln(10)}{c \left[ S \ln(1 - \bar{\alpha}[k]) + 4V \ln(1 - \alpha_{air}[k]) \right]}. \qquad (4.9)

Note that if absorption by air is neglected (\alpha_{air}[k] = 0) and we substitute \ln(1 - \bar{\alpha}[k]) \approx -\bar{\alpha}[k] (valid for \bar{\alpha}[k] \ll 1) and c = 343 m/s, we arrive at the Sabine formula (Eq. 1.1):

T_{60}[k] \approx \frac{-24 V \ln(10)}{c \left[ S \, (-\bar{\alpha}[k]) + 0 \right]} \approx 0.161 \, \frac{V}{\bar{\alpha}[k]\, S}. \qquad (4.10)
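The reverberation-time estimate of Eq. 4.9 (as reconstructed above, with natural logarithms) can be evaluated directly; the sketch below shows this for per-band absorption data. The function name and argument layout are illustrative.

```python
import numpy as np

def estimate_t60(volume, surface, alpha_mean, alpha_air, c=343.0):
    """Initial T60 estimate per band including air absorption (Eq. 4.9, sketch).
    alpha_mean, alpha_air: per-band mean wall absorption and per-metre air
    absorption, both as energy fractions."""
    alpha_mean = np.asarray(alpha_mean, dtype=float)
    alpha_air = np.asarray(alpha_air, dtype=float)
    denom = c * (surface * np.log(1.0 - alpha_mean)
                 + 4.0 * volume * np.log(1.0 - alpha_air))
    return -24.0 * volume * np.log(10.0) / denom

# With alpha_air = 0 and small alpha_mean this approaches Sabine's 0.161 V / (alpha S):
# estimate_t60(2000.0, 1000.0, [0.2], [0.0])  ->  about 1.4 s (Sabine gives about 1.6 s)
```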

Simulating the direct sound

The direct sound component h_{0,k}(t) of the impulse response in the frequency band k with center frequency f_k can be calculated as:

h_{0,k}(t) = \big[1 - \alpha_{air}[k]\big]^{\frac{r_0}{2}} \, \frac{\delta_k(t - r_0/c)}{r_0}, \qquad (4.11)

with r_0 the distance from the source to the receiver. The total direct sound component can be obtained by summing over all K frequency bands:

h_{dir}(t) = \sum_{k=1}^{K} h_{0,k}(t). \qquad (4.12)

Early reflections

The early reflections up to a certain order are modelled using the mirror image source technique (see Fig. 4.2). The contribution of each mirror source u to the total sound pressure for frequency band k is given by:

h_{u,k}(t) = \left( \prod_{n=1}^{N_u} \big[1 - \alpha_n[k]\big]^{\frac{1}{2}} \right) \big[1 - \alpha_{air}[k]\big]^{\frac{r_u}{2}} \, \frac{\delta_k(t - r_u/c)}{r_u}, \qquad (4.13)

where N_u is the total number of boundary crossings (i.e. reflections) for this mirror source, and \alpha_n[k] is the absorption coefficient of boundary b = n for frequency band k. r_u is the distance from the mirror source to the receiver. The early part of the impulse response can now be computed by summing over all frequency bands k and all contributing mirror image sources u:

h_{early}(t) = \sum_{k=1}^{K} \sum_{u=1}^{U} h_{u,k}(t), \qquad (4.14)

with U the total number of contributing mirror image sources.
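For a single image source, the band-dependent gain of Eq. 4.13 combines the wall reflections, the air absorption and the 1/r spreading; a small sketch is given below. The argument names are illustrative, and the band-filtered pulse delta_k is represented only by its arrival sample.

```python
import numpy as np

def image_source_band_gain(r_u, alphas_crossed, alpha_air):
    """Amplitude of one mirror-image source in one frequency band (Eq. 4.13, sketch).
    r_u            : distance from the image source to the receiver (m)
    alphas_crossed : wall absorption coefficients alpha_n[k] of the crossed boundaries
    alpha_air      : air absorption per metre (energy fraction) for this band"""
    wall_gain = np.prod([np.sqrt(1.0 - a) for a in alphas_crossed])  # (1 - alpha_n)^(1/2) product
    air_gain = (1.0 - alpha_air) ** (r_u / 2.0)                      # pressure attenuation by air
    return wall_gain * air_gain / r_u                                # 1/r spreading of a monopole

def arrival_sample(r_u, fs, c=343.0):
    """Sample index at which the band-filtered pulse delta_k(t - r_u / c) is placed."""
    return int(round(fs * r_u / c))
```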

Late reverberation

The late part of the response is modelled by a statistical method. This method starts with a white noise process w_s(t) with an RMS value of w_{s,rms} = 1. Next, the expected number of reflections per time sample \langle N_s(t) \rangle (sample^{-1}) is calculated. This value follows from the reflection density (Eq. 4.6) as:

\langle N_s(t) \rangle = \frac{1}{f_s} \frac{4\pi c^3}{V} t^2, \qquad (4.15)

where f_s is the sampling frequency. Figure 4.5 shows an example curve for \langle N_s \rangle as a function of time for a room with dimensions 10 x 20 x 10 m and a sampling frequency of 16 kHz.

Figure 4.5: An example curve for the expected number of reflections per sample \langle N_s(t) \rangle as a function of time. The first 500 ms of the curve are plotted. This example is for a room with dimensions 10 x 20 x 10 m and a sampling frequency of f_s = 16 kHz.

As can be seen in Fig. 4.5, in the first part of the response the expected number of reflections per time sample is less than one. To model this, values of w_s(t) at discrete samples n (n = f_s t_n) are randomly set to zero according to:

w'_s[n] = \begin{cases} 0 & \text{if } w_1[n] > \langle N_s[n] \rangle \\ w_s[n] & \text{otherwise}, \end{cases} \qquad (4.16)

with w_1[n] a discrete-time white noise process with random values between 0 and 1. This results in reflections at random times, while the reflection density follows the theoretical curve.

In Fig. 4.5 it can be seen that after some time the expected number of reflections per sample becomes greater than one. To model this, while making sure that the

total mean-square pressure follows the theoretical curve (Eq. 4.7), w'_s(t) needs to be multiplied with the square root of the reflection density:

w''_s(t) = \begin{cases} \sqrt{\langle N_s(t) \rangle} \; w'_s(t) & \text{if } \langle N_s(t) \rangle > 1 \\ w'_s(t) & \text{otherwise}. \end{cases} \qquad (4.17)

Finally, w''_s(t) is filtered in frequency bands using the method described in the section on computation in frequency bands, resulting in filtered versions w''_{s,k}(t).

To compute the energy decay within each frequency band k, the average energy transmission per meter E_k is calculated as:

E_k = \big(1 - \bar{\alpha}[k]\big)^{\frac{S}{4V}} \, \big(1 - \alpha_{air}[k]\big). \qquad (4.18)

As can be seen, Eq. 4.18 is based on the mean free path \bar{d}; on average a sound wave will reflect at a boundary 1/\bar{d} times per meter. The final statistical response per frequency band, including the 1/r decay of a monopole, is now given by:

h_{s,k}(t) = \big(\sqrt{E_k}\big)^{r} \, \frac{w''_{s,k}(t)}{r} = \big(\sqrt{E_k}\big)^{ct} \, \frac{w''_{s,k}(t)}{ct}. \qquad (4.19)

Figure 4.6 shows an example impulse response and its energy decay, obtained using the statistical method as proposed in this section. As can be seen, the resulting mean-square pressure closely follows the theoretical energy decay.

Figure 4.6: An example of an impulse response obtained using the statistical method presented in this section. This response is simulated for a room with dimensions 10 x 20 x 10 m and an average wide-band absorption coefficient of \bar{\alpha} = 0.2. Air absorption is not taken into account for this example. The left graph shows the time domain response and the right graph the SPL as a function of time. The dashed line shows the theoretical SPL decay for a room with these properties.

Finally, the total statistical part of the impulse response can be obtained using:

h_{late}(t) = \begin{cases} 0 & \text{if } t < T_c \\ \sum_{k=1}^{K} h_{s,k}(t) & \text{if } t \geq T_c, \end{cases} \qquad (4.20)

where T_c is the crossover time between the early and late parts of the response, given by:

T_c = T_0 + \frac{O_s \bar{d}}{c}, \qquad (4.21)

with T_0 the arrival time of the direct sound and O_s the maximum order of specular reflections calculated for the early part of the response.

Diffuse reverberation

Besides specular reflection, the effect of scattering (non-specular reflections, see Chapter 2) is also modelled. As for the late reverberation, this is carried out using a statistical method. Again, a white noise signal w_d(t) with an RMS value of 1 is used as the starting point. Scattered reflections may arrive at the receiver at any time. Therefore, the white noise signal is only multiplied by the square root of the reflection density (no discrete-time values are set to zero):

w'_d(t) = \sqrt{\langle N_s(t) \rangle} \; w_d(t). \qquad (4.22)

As for the late reverberation, w'_d is filtered in frequency bands and multiplied with the decay curve within each band:

h_{d,k}(t) = \big(\sqrt{E_k}\big)^{r} \, \frac{w'_{d,k}(t)}{r} = \big(\sqrt{E_k}\big)^{ct} \, \frac{w'_{d,k}(t)}{ct}. \qquad (4.23)

Figure 4.7 shows an example of such a diffuse impulse response h_d(t) for a room with the same properties as for Fig. 4.6. The total diffuse response is obtained from the following equation:

h_{diff}(t) = \begin{cases} 0 & \text{if } t < T_0 \\ \sum_{k=1}^{K} h_{d,k}(t) & \text{if } t \geq T_0. \end{cases} \qquad (4.24)

Total response

The different contributions described in the previous sections are summed together to form the total modelled room impulse response:

Figure 4.7: An example of an impulse response obtained using the diffuse method presented in this section. This response is simulated for a room with dimensions 10 x 20 x 10 m and an average wide-band absorption coefficient of \bar{\alpha} = 0.2. Air absorption is not taken into account for this example. Even though the amplitude of the diffuse response is lower than that of the response in Fig. 4.6, the SPL of both responses is equal since the reflection density of the diffuse response is higher.

h(t) = h_{dir}(t) + \sqrt{1 - s} \left[ h_{early}(t) + h_{late}(t) \right] + \sqrt{s} \, h_{diff}(t). \qquad (4.25)

Weighting is applied between the diffuse part and the other reflections based on the scattering coefficient s (see Section 2.1.3). Note that the scattering coefficient in Eq. 4.25 is not exactly equal to the scattering coefficient defined in Chapter 2; the s in Eq. 4.25 should be seen as a bulk scattering coefficient for the room, rather than the scattering coefficient of a particular surface.

4.2 Simulating different receivers

The method for modelling room impulse responses presented in the previous sections does not take into account any direction-dependent sensitivity of the receiver, i.e. it is assumed that the receiver has an omni-directional sensitivity pattern. In the next sections, it will be explained how different receiver types can be modelled.

Virtual microphones

In the room impulse response simulator presented in this chapter, the 3D polar pattern of a virtual microphone is given by:

G(\phi, \theta) = \beta + (1 - \beta) \cos(\phi) \cos(\theta), \qquad (4.26)

with \phi the azimuth, \theta the elevation angle and \beta a constant between 0 and 1. The value of \beta determines the polar sensitivity. Figure 4.8 shows some of the possible polar patterns for common microphone types (taken from [Eargle, 2006]).

Using Eq. 4.26, each early reflection, as well as the direct sound, is weighted according to its angles of incidence at the receiver's position:

h_{dir,mic}(t) = \sum_{k=1}^{K} G(\phi_0, \theta_0) \, h_{0,k}(t), \qquad (4.27)

and

h_{early,mic}(t) = \sum_{k=1}^{K} \sum_{q=1}^{Q} G(\phi_q, \theta_q) \, h_{q,k}(t). \qquad (4.28)

The late and diffuse parts of the impulse response (h_{late,mic} and h_{diff,mic}, respectively) are obtained by choosing the angles of incidence randomly at each sample position. The total response using a virtual microphone with polar pattern G can now be computed as:

h_{mic}(t) = h_{dir,mic}(t) + \sqrt{1 - s} \left[ h_{early,mic}(t) + h_{late,mic}(t) \right] + \sqrt{s} \, h_{diff,mic}(t). \qquad (4.29)

Binaural response

Since the auditory model presented in this thesis is binaural, it is necessary to simulate binaural room impulse responses (BRIRs). For this purpose, a database with Head-Related Transfer Functions (HRTFs) is available at Delft University of Technology. The HRTFs were measured in an anechoic room using a dummy head developed by the Institute of Technical Acoustics (ITA) in Aachen, Germany [Schmitz, 1995]. In the database, HRTFs are available for different azimuths (\phi = 0 to 355 deg, in steps of 5 deg) and elevation angles (\theta = -70 to 70 deg, in steps of 10 deg). The direct and early components of the binaural impulse response are now obtained as follows:

h^c_{dir,hrtf}(t) = \sum_{k=1}^{K} h^c_{hrtf}(\tilde{\phi}_0, \tilde{\theta}_0) * h_{0,k}(t), \qquad (4.30)

and

(a) Figure-of-eight: \beta = 0. (b) Hypercardioid: \beta = 0.25. (c) Supercardioid: \beta = 0.37. (d) Cardioid: \beta = 0.5. (e) Subcardioid: \beta = 0.7. (f) Omnidirectional: \beta = 1.

Figure 4.8: Microphone polar sensitivity patterns for various values of \beta in the range 0 to 1. The polar patterns are plotted in 3D, where the vector (x, y, z) = (1, 0, 0) points in the on-axis direction (\phi = 0, \theta = 0). These results are for six common microphone types, as found in [Eargle, 2006].
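Equation 4.26 is straightforward to evaluate; the sketch below shows it together with how it would weight the direct sound and the early reflections (the variable names in the comment are illustrative).

```python
import numpy as np

def mic_gain(azimuth, elevation, beta):
    """First-order microphone sensitivity G(phi, theta) of Eq. 4.26.
    beta = 0 gives a figure-of-eight, 0.5 a cardioid, 1 an omni."""
    return beta + (1.0 - beta) * np.cos(azimuth) * np.cos(elevation)

# e.g. weight the direct sound and each early reflection by its angle of incidence:
# h_dir_mic = mic_gain(phi_0, theta_0) * h_dir
# h_early_mic = sum of mic_gain(phi_q, theta_q) * h_q over all image sources q
```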

Figure 4.9: The ITA dummy head (original photo: M.M. Boone).

h^c_{early,hrtf}(t) = \sum_{k=1}^{K} \sum_{q=1}^{Q} h^c_{hrtf}(\tilde{\phi}_q, \tilde{\theta}_q) * h_{q,k}(t), \qquad (4.31)

where h^c_{hrtf}(\phi, \theta) is the time-domain HRTF corresponding to the angles \phi and \theta, and c is the channel: left (L) or right (R). A nearest-neighbour procedure is used to find the angles in the database corresponding to the exact angles of incidence; this is denoted by the tilde (~) above the angles in Eqs. 4.30 and 4.31.

The late and diffuse components of the response are calculated using Eqs. 4.20 and 4.24, where no convolution with HRTFs is applied (random incidence). For both the late and diffuse parts, the calculations are performed twice, resulting in two uncorrelated versions of each: h^{(1)}_{late}(t), h^{(2)}_{late}(t), h^{(1)}_{diff}(t) and h^{(2)}_{diff}(t). This is used to simulate decorrelation between the left and right ear:

h^L_{late,hrtf}(t) = h^{(1)}_{late}(t)
h^R_{late,hrtf}(t) = \sqrt{1 - s} \, h^{(1)}_{late}(t) + \sqrt{s} \, h^{(2)}_{late}(t), \qquad (4.32)

and

h^L_{diff,hrtf}(t) = h^{(1)}_{diff}(t)
h^R_{diff,hrtf}(t) = \sqrt{1 - s} \, h^{(1)}_{diff}(t) + \sqrt{s} \, h^{(2)}_{diff}(t). \qquad (4.33)

As can be seen, the scattering coefficient s is used to apply decorrelation between the left and right ear. If s = 0 (no scattering), the left and right ear signals will be exactly equal. In the case of s = 1 (full scattering), both channels will be completely uncorrelated. Furthermore, since h^{(1)}_{xx} and h^{(2)}_{xx} are uncorrelated, the total energy of h_{xx,hrtf} will be independent of s.
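The decorrelation of Eqs. 4.32 and 4.33 amounts to mixing two independent tails with sqrt(1 - s) and sqrt(s) weights, which keeps the energy independent of s; a sketch (with illustrative names, following the square-root reconstruction used above) is given below.

```python
import numpy as np

def decorrelate_tail(h1, h2, s):
    """Left/right late (or diffuse) tails with scattering-dependent interaural
    correlation (Eqs. 4.32-4.33, sketch). h1 and h2 are two independently
    generated, uncorrelated statistical tails; s is the bulk scattering coefficient."""
    left = h1
    right = np.sqrt(1.0 - s) * h1 + np.sqrt(s) * h2
    # E{right^2} = (1-s) E{h1^2} + s E{h2^2}: the energy does not depend on s,
    # while the correlation between left and right scales with sqrt(1-s).
    return left, right
```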

The left and right channels of the total binaural response are given by:

h^L_{hrtf}(t) = h^L_{dir,hrtf}(t) + \sqrt{1 - s} \left[ h^L_{early,hrtf}(t) + h^L_{late,hrtf}(t) \right] + \sqrt{s} \, h^L_{diff,hrtf}(t)
h^R_{hrtf}(t) = h^R_{dir,hrtf}(t) + \sqrt{1 - s} \left[ h^R_{early,hrtf}(t) + h^R_{late,hrtf}(t) \right] + \sqrt{s} \, h^R_{diff,hrtf}(t). \qquad (4.34)

4.3 Usage of the room simulator

The simulator for shoebox-shaped rooms, as presented in this chapter, was used to generate binaural and monaural room impulse responses for conducting listening tests, which will be discussed in Chapter 5. Furthermore, the simulator was used a couple of times outside this project. One example will briefly be discussed in the next section.

Evaluation of a new measure for speech intelligibility

In [Schlesinger et al., 2009], [Schlesinger et al., 2010] and [Boone et al., 2010], the room simulator was used to evaluate a new, binaural and speech-based measure for speech intelligibility. It was shown that this new Binaural STI, as proposed by Schlesinger et al., successfully predicts the masking level differences in binaural situations (see Section 3.5.3).

Figure 4.10 shows two graphs from [Schlesinger et al., 2009]. In order to generate these graphs, both monaural room impulse responses (using a virtual omnidirectional microphone) and BRIRs (using a virtual dummy head) were simulated at 225 receiver positions in a room of dimensions 15 x 20 x 5 m, from which STI values were derived. The reverberation time in this virtual room was 1.25 s for low to mid frequencies. The left graph shows the conventional monaural STI values. The right graph shows the new binaural STI. The binaural STI values are generally higher than the monaural STI. This can be explained by the spatial release from masking in speech intelligibility, which can be as high as 12 dB when the direction of incidence of a single masker differs from that of the target talker [Bronkhorst, 2000]. In a room, reflections can act as maskers of the direct sound and generally arrive from different directions than the target. The human auditory system is able to suppress this reverberation and, therefore, a binaural advantage with respect to speech intelligibility in a room is expected [Schlesinger et al., 2009].

4.4 Summary and discussion

In this chapter an algorithm was presented for simulating room impulse responses for virtual, shoebox-shaped rooms. The algorithm uses a mirror image source model

Figure 4.10: A top view of the Speech Transmission Index STI as a function of receiver position in a virtual room with dimensions 15 x 20 x 5 m. The room impulse responses were simulated using the simulator for shoebox-shaped rooms as presented in this chapter. The left figure shows the conventional monaural STI. The right figure shows the results for the binaural STI as proposed by Schlesinger et al. [2009].

for the early part of the response, whereas the late part is modelled using a statistical process. Various microphone sensitivity patterns are supported, such that the most common microphone types can be simulated. Using a database of measured head-related transfer functions, binaural room impulse responses can also be simulated for auralization purposes. The room simulator is used in this project to simulate room impulse responses for testing the auditory model and the resulting parameters presented in Chapter 3. The algorithm was also used by other authors to generate virtual room acoustical environments.

5 Listening tests

After silence, that which comes nearest to expressing the inexpressible is music. (Aldous Huxley, 1894 - 1963)

To optimize and validate the AMARA method presented in Chapter 3, perceptual data is needed on the four room acoustical attributes the method predicts: reverberance, clarity, Apparent Source Width (ASW) and Listener Envelopment (LEV). The perceptual results can then be compared with the method predictions to see if there is sufficient correlation. In total, four listening tests were conducted to collect perceptual data related to these room acoustical attributes. In these tests, subjects had to assign ratings to the four attributes for different rooms. The four listening tests were:

I: A listening test with nine virtual rooms simulated using the simulator for shoebox-shaped rooms presented in Chapter 4.
II: A test with eight virtual rooms that are more uncommon/unrealistic compared with the rooms used in test I, in order to make the various attributes more independent.
III: A test with ten real rooms using measured room impulse responses.
IV: A test with the same rooms as III; only this time the audio samples were normalized to the same loudness level.

5.1 Listening test method

Rating procedure

In order to investigate whether the model output parameters are accurate predictors of the corresponding room acoustical attributes, it is necessary to collect perceptual results for a variety of rooms. Several methods exist for conducting listening tests for this purpose, like paired comparison and direct evaluation tests. In a paired comparison test, the subject has to compare the samples in a pairwise fashion, for example by indicating which sample sounds the most reverberant within each pair. From the test results, it is possible to derive scale measures for the different attributes [Wickelmaier and Schmid, 2004]. The paired comparison method leads to a very high discriminability, since the subjects can carefully compare all the samples against each other. However, such a test can take a very long time for each subject [Parizet et al., 2005]. When conducting a direct evaluation test, the subject is presented with all the samples at once. The subject is allowed to listen to all the samples and rate them directly, for example in terms of the perceived amount of reverberance. Although this method is much faster than the paired comparison approach, it leads to a lower discriminability between the different samples [Parizet et al., 2005].

Chevret and Parizet [2007] proposed a listening test method that is basically a mixed form of a paired comparison test and direct evaluation. In this method, the subject has to assign ratings to a set of samples directly, but he or she is also allowed to sort the samples based on the current ratings. After the samples are sorted by rating, the subject is encouraged to fine-tune the ratings by comparing the samples pairwise, sort the samples again, etc. In [Chevret and Parizet, 2007], it was shown that this method provides results similar to a paired comparison test, but requires much less time. Because of the benefits of this mixed procedure, it was used in this project for all listening tests.

In each test, the subjects were asked to rate sets of samples in terms of a certain room acoustical attribute, like reverberance or clarity. The sample sets consisted of a particular anechoic recording convolved with a binaural room impulse response that was either measured or generated using the simulator for shoebox-shaped rooms presented in Chapter 4. The rating was performed on a continuous scale ranging from 'very low' to 'very high'.

Test setup

Each of the four tests was carried out in the listening room of the Laboratory of Acoustics of the Faculty of Applied Sciences at the Delft University of Technology. Since the audio samples are binaural, they were presented to the subject through

headphones (Beyer Dynamic DT-770 Pro). The headphones were connected to an RME Fireface 400 audio interface running at a sampling frequency of 44.1 kHz, which was controlled by a PC running the Windows XP operating system.

Test software

The listening test software itself was newly developed within the scope of this research, using the C++ programming language in combination with the Qt library. Figure 5.1 shows an example of its graphical interface as presented to the subjects. The samples are aligned vertically, and the subject can listen to each of them by pressing the corresponding Play button with the mouse. Playback can be stopped at any time by pressing this button a second time. The rating can be adjusted for each sample by moving the corresponding slider. Within each screen, the subject was asked to rate one particular attribute, for example reverberance.

Figure 5.1: An example of the graphical interface of the listening test software used in this project.

By pressing the Sort samples button, the samples are sorted from high (top) to low

(bottom) based on the ratings given by the subject. After this, it is possible to make adjustments to the ratings. The order of the screens is randomized, i.e. the subjects are asked to rate a set of signals for a random stimulus / attribute combination. Also, within one screen, the order of the samples is randomized initially. When a subject has rated all samples for all stimulus / attribute combinations, the rating results are stored in a text file on the computer. This file can be used for further processing of the test results. The test was designed as a double-blind listening test, meaning that both the test conductor and the subjects do not know the order of the rooms, stimuli and attributes.

Stimuli

Each of the four listening tests includes two different stimuli: a male speech sample and a solo cello sample. These stimuli were chosen because of their difference in spectral and temporal properties and were already used in Chapter 3. Both stimuli have a length of 10 seconds. The male speech sample is taken from the Sound Quality Assessment Material CD (SQAM) [EBU, 1988]. It contains a part of an anechoic recording of a man reading a part of The Stage Coach by Washington Irving. The cello sample was downloaded from the website of the room acoustical simulation software package ODEON. It mostly contains slow (long) notes. Time domain and frequency domain graphs of both source stimuli can be found in Figures 3.11 and 3.12.

Tested attributes

The attributes that the subjects had to rate in the test were the same as the attributes discussed in Section 3.6.4: reverberance, clarity, ASW and LEV (see the mentioned section for more details on these attributes).

Length of the test

In total, each test took about 20 to 30 minutes per subject, where each subject was presented with 4 attributes x 2 stimuli = 8 different attribute/stimulus combinations in several sound fields (rooms).

Instructions for the subjects

To get acquainted with the room acoustical attributes reverberance, clarity, ASW and LEV, the subjects received instructions before the actual listening test started. These instructions were the same for all four listening tests and included text descriptions of the attributes as well as audio examples. The complete text can be found in Appendix D.

Level normalization

For listening tests I, II and IV, the audio samples were all normalized to the same loudness level within one test, using the Replaygain algorithm (see Appendix B). Since this algorithm estimates the perceived loudness of a sample, the sound pressure levels of the normalized samples can still differ. As will turn out, this is indeed the case, although the differences in SPL are small, with a maximum difference of 3 dB within one test.

Determining the SPL of the audio samples

Since the auditory model used in this research is non-linear, its output will depend on the input SPL. Therefore, the amplitude of the input signals needs to be adjusted according to the sound pressure level at the ear (see Chapter 3). To determine the SPL of the audio samples, the pressure at the entrance of the ear p(t) is calculated as:

p(t) = h_h(t) * x(t), \qquad (5.1)

where x(t) is the raw wave data of the audio sample (with values in the range -1 to +1) and h_h is the impulse response from the headphones to the eardrum. This impulse response was obtained by mounting the headphones on the ITA dummy head, as shown in Fig. 5.2. Next, a pink noise signal was played back through the headphones and the dummy head's outputs were recorded back into the computer using the RME Fireface 400 audio interface. This procedure was repeated 3 times; for each repeat the headphones were taken off the dummy head and mounted back on, to test the effect of small changes in the positioning of the headphones. For each repeat i, the amplitude of the transfer function H_{h,i} is calculated as:

\big| H_{h,i}(f) \big| = 10^{L_{ref}/20} \, \frac{\big| Y_i(f) \big|}{\big| X(f) \big|}, \qquad (5.2)

with Y_i(f) the amplitude of the dummy head output at repeat i and frequency f. X(f) is the result of a reference measurement obtained by recording the outputs

of the dummy head when a loudspeaker reproducing pink noise is pointed directly at one of the ears. Due to this approach, H_{h,i}(f) will correspond to the transfer function from the headphones to the entrance of the ear, in the presence of the dummy head. To convert the amplitudes of the digital waveforms to a pressure value, a correction factor L_{ref} is included in Eq. 5.2. This factor was determined by measuring the pressure level of the pink noise signal as reproduced by the loudspeaker, using a B&K 2236 SPL meter. It was found that in this case L_{ref} = 16 dB.

Figure 5.2: The ITA dummy head with the Beyer Dynamic DT 770 Pro headphones.

Note that, when obtaining the amplitude of the transfer function using Eq. 5.2, it is assumed that the loudspeaker has a sufficiently flat frequency response, i.e. any peaks and dips in the frequency response of the loudspeaker should be negligible compared with the response of the headphones.

Figure 5.3 shows the results for H_{h,i}(f) for all repeats. As can be seen, the spread in the results between measurement repeats is rather large, especially for the right ear of the dummy head. The cause of this larger spread for the right ear has not been looked into, but it is probably due to mechanical differences between the left and right shells of the headphones. Since it is likely that similar variations can be found for human ears, this has to be kept in mind when reporting the estimated sound (pressure) level of the samples. As can be seen from the figures, small displacements of the headphones can lead to changes in the SPL of 20 dB at one particular frequency. When the spectra are averaged over all frequencies, differences up to 5 dB are found.

Inside the shells of the headphones, reflections will occur whose arrival times at the position of the ear depend on the exact positioning of the headphones. Therefore, it is impossible to take the phase information into account when an average transfer function is determined. Instead, only colouration effects are taken

Figure 5.3: The results for the headphone-to-eardrum transfer functions H_{h,i}(f) (shown logarithmically, in dB) for all 3 measurement repeats, in 1/3 octave bands. The left and right figures show the results for the left and right ear, respectively.

into account, and an average amplitude spectrum is calculated by taking the mean of H_{h,i}(f) over all repeats i for the left and right ear together:

H_h(f) = \big\langle H_{h,i}(f) \big\rangle_i. \qquad (5.3)

To find the final response h_h(t), the average amplitude spectrum H_h(f) needs to be transformed back to the time domain. As explained, only colouration effects introduced by the headphones are taken into account and, therefore, a minimum phase signal is constructed out of H_h(f) to yield the time domain response h_h(t). This will lead to an impulse response that is as short in time as possible. The minimum phase response is obtained as follows:

h_h(t) = \mathrm{Re}\left[ \mathcal{F}^{-1}\left\{ e^{\left( \mathcal{H}\{ \log( H_h(f) ) \} \right)^{*}} \right\} \right], \qquad (5.4)

where \mathcal{H} denotes the Hilbert Transform, \mathcal{F}^{-1} the inverse Fourier transform and * means taking the complex conjugate. Figure 5.4 shows the result for H_h(f) as well as h_h(t).
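Equation 5.4 expresses the minimum-phase construction through the Hilbert transform of the log-magnitude; an equivalent and common way to implement it is the real-cepstrum (homomorphic) method sketched below. The grid layout of the magnitude spectrum and the function name are assumptions of this sketch, not the thesis' implementation.

```python
import numpy as np

def minimum_phase_from_magnitude(mag, n_fft):
    """Construct a minimum-phase impulse response from a magnitude spectrum
    (real-cepstrum method). `mag` holds the magnitude on the full, circular
    FFT grid of length n_fft, with the usual conjugate symmetry."""
    log_mag = np.log(np.maximum(mag, 1e-12))      # avoid log(0)
    cep = np.fft.ifft(log_mag).real               # real cepstrum
    # fold the anti-causal part onto the causal part (homomorphic windowing)
    win = np.zeros(n_fft)
    win[0] = 1.0
    win[1:n_fft // 2] = 2.0
    if n_fft % 2 == 0:
        win[n_fft // 2] = 1.0
    h_min_spec = np.exp(np.fft.fft(cep * win))    # minimum-phase spectrum
    return np.fft.ifft(h_min_spec).real
```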

Figure 5.4: Left: the mean result for the headphone-to-eardrum transfer functions H_h(f) (in dB), in 1/3 octave bands. Right: the resulting headphone-to-eardrum impulse response (minimum phase).

Re-distribution of the results

In the listening tests, the subjects rated the various room acoustical attributes on a continuous scale ranging from 'very low' to 'very high'. A disadvantage of such a scale is the fact that the subjects will interpret the scale differently. For example, what a person perceives as very high clarity depends on various subjective factors. This problem is often minimized by using so-called low and high anchors. These are samples with very low and high values for the attribute under consideration. For example, when assessing the overall audio quality of a certain audio coding algorithm, the unprocessed sample is often used as a high anchor, while the low anchor consists of a sample that is filtered with a low-pass filter with a rather low cutoff frequency, for example at 3.5 kHz [Nagel et al., 2010].

For the listening tests conducted in the context of this research, it would be a hard task to collect samples that can act as low or high anchors. This has to do with the fact that the subjects rate four different attributes for each set of samples. This means that samples should be available that have extremely low, or extremely high, values for all four attributes. Therefore, it was chosen not to use anchors in the test, but to re-distribute the test results for each subject instead, according to ITU Recommendation BS [ITU-R, 2003]:

z_{i,a} = \frac{x_{i,a} - \bar{x}_{i,a}}{\sigma_{i,a}} \, \sigma_a + \bar{x}_a, \qquad (5.5)

where z_{i,a} are the re-distributed results for subject i and attribute a, x_{i,a} are the results for subject i, \bar{x}_{i,a} is the mean result for this subject and \sigma_{i,a} is the standard deviation. \bar{x}_a and \sigma_a are the mean and standard deviation over all subjects for attribute a, respectively. Using Eq. 5.5, the individual results of the subjects are re-scaled to the same distribution.
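The re-distribution of Eq. 5.5 is a per-subject rescaling to the grand mean and standard deviation; a minimal sketch (with an assumed subjects-by-samples array layout) is given below.

```python
import numpy as np

def redistribute(ratings):
    """Re-distribute per-subject ratings to a common mean and standard
    deviation (Eq. 5.5, sketch). `ratings` has shape (subjects, samples)
    and holds the raw ratings for one attribute."""
    ratings = np.asarray(ratings, dtype=float)
    grand_mean = ratings.mean()
    grand_std = ratings.std()
    subj_mean = ratings.mean(axis=1, keepdims=True)
    subj_std = ratings.std(axis=1, keepdims=True)
    return (ratings - subj_mean) / subj_std * grand_std + grand_mean
```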

Statistical analysis of the results

In order to analyze the listening test results, two-way Analyses of Variance (ANOVAs) were carried out for each attribute within each test. The results of these ANOVAs give insight into which factors have a significant influence on the ratings at the p < 0.05 significance level. The main factors are room and stimulus. The interaction effect between these main factors is also included in the ANOVAs. If the interaction effect room x stimulus is significant, the effect room will depend on the size of the effect stimulus. For more details on the two-way ANOVA procedure, the reader is referred to Appendix E.

5.2 Listening test I

Rooms

In listening test I, it was chosen to work with virtual rooms, which were simulated using the simulator for shoebox-shaped rooms (see Chapter 4). A total of nine rooms were simulated with varying properties, such that a wide range of (conventional) parameter values is available. Table 5.1 gives an overview of the nine virtual rooms, including values for common objective parameters at the specific listener positions considered, as calculated according to the ISO 3382 norm [ISO, 2009]. In order to calculate these parameters, the room impulse response was simulated three times for each room: binaurally (for calculating parameters related to the interaural cross-correlation), using a virtual microphone with an omni-directional sensitivity pattern (for reverberation times, clarity indices and the sound strength), and using a virtual figure-of-eight microphone (for determining LF). As can be seen in the table, the range of values for the (conventional) room acoustical parameters is rather large for this test.

Subjects

In total, five subjects participated in this listening test. The participants were all working at the Laboratory of Acoustics of the TU Delft, and can be considered experts in the field of acoustics. They were familiar with the four different attributes.

Results

The results of listening test I are shown graphically in Figs. 5.5 (cello stimulus) and 5.6 (speech stimulus). The ratings are scaled to a 0 to 1 range according to Table 5.2. For example, room VR1 is rated close to 'very low' (0.0) in terms of reverberance for both samples, whereas room VR4 is rated close to 'normal' (0.5). Note that values < 0 and > 1 are found in the figures; this is due to the re-distribution of the data (Section 5.1.6).

Table 5.1: An overview of the (virtual) rooms used in listening test I. The first two columns list the room codes as well as the type of room after which they were modelled: VR1 (anechoic room), VR2 (office), VR3 (auditorium), VR4 (auditorium), VR5 (concert hall), VR6 (concert hall), VR7 (concert hall), VR8 (concert hall) and VR9 (cathedral). The table also lists common conventional room acoustical parameter values for each room: T20 (s), EDT (s), C50 (dB), C80 (dB), 1-IACC_E, LF, 1-IACC_L, LEV_calc (dB) (*) and G (dB) (*). All parameters are calculated according to the ISO 3382 standard (see Section 2.3). The last two columns show the SPL (dB) of the resulting speech and cello audio samples used in the listening test after applying level normalization using the Replaygain algorithm. (*) LEV_calc and G were calculated before the audio samples were normalized to the same loudness level using the Replaygain algorithm, hence they include the original loudness differences between the rooms. (**) For room VR1, there was too little energy in the late part of the response to calculate IACC_L. Therefore, 1-IACC_L was set to a value of 0.0 in this case.

Figure 5.5: Boxplots of the listening test results for test I (cello sample). From top to bottom: reverberance, clarity, apparent source width and listener envelopment. In the boxes, the central mark denotes the median, while the lower and upper bounds of the boxes denote the 25th (q_1) and 75th (q_3) percentiles, respectively. The length of the whiskers is equal to 1.5 times the box size (q_3 - q_1). If the data is normally distributed, the whiskers will cover 99.3% of the data.

Figure 5.6: Boxplots of the listening test results for test I (speech sample). From top to bottom: reverberance, clarity, apparent source width and listener envelopment. In the boxes, the central mark denotes the median, while the lower and upper bounds of the boxes denote the 25th (q_1) and 75th (q_3) percentiles, respectively. The length of the whiskers is equal to 1.5 times the box size (q_3 - q_1). If the data is normally distributed, the whiskers will cover 99.3% of the data.

Table 5.2: The possible rating values in the listening tests and the corresponding labels at the sliders used in the user interface. Note that the scale is continuous, so any value in between the mentioned values is possible.

Rating value    Slider label
0.00            Very low
0.25            Low
0.50            Normal
0.75            High
1.00            Very high

Analysis of the results

The results of the ANOVAs for test I are shown in Table 5.3. As can be seen in the table, the factor room was highly significant for all attributes; for the reverberance attribute, for example, the ANOVA result for the room factor was [F(8, 72), p < 0.001]. In this test, the stimulus type had a significant effect for reverberance only. From a Tukey HSD test (see Appendix E), it follows that the mean perceived amount of reverberance y_rev is significantly lower for the cello stimulus (y_rev = 2.08) than for the speech stimulus (y_rev = 2.28) at the p < 0.05 level.

The test results in Figures 5.5 and 5.6 show that the wide range of room types included in the test led to a wide range in the ratings for the various attributes as well. All attributes have mean values that roughly span the range 0 to 1 (very low to very high). This will be useful for the validation of the model (Chapter 6), since the model can be tested over the full range of attribute values. It can also be noticed that the boxes are not very large, indicating that the spread in the subjects' answers is small.
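As an illustration of the analysis described above, the sketch below shows how a two-way ANOVA with interaction can be computed with the statsmodels package; the data file and column names (rating, room, stimulus) are assumptions for the example, not the actual data format used in this research.

```python
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

# df holds one row per individual rating, e.g. columns:
# 'rating' (re-distributed score), 'room' (VR1..VR9), 'stimulus' (speech/cello)
df = pd.read_csv("test1_reverberance.csv")  # illustrative file name

# Two-way ANOVA with the main factors room and stimulus
# and their interaction room:stimulus
model = ols("rating ~ C(room) * C(stimulus)", data=df).fit()
anova_table = sm.stats.anova_lm(model, typ=2)
print(anova_table)  # sum of squares, d.f., F and p per source
```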

Table 5.3: Two-way ANOVA results for the ratings from listening test I for the four room acoustical attributes. Bold values denote values that are significant at the p < 0.05 level. (For each attribute (reverberance, clarity, ASW and LEV), the table lists the sum of squares, the degrees of freedom, the mean square, the F value and the p value for the sources room, stimulus, room × stimulus, error and total. The factor room is significant at p < 0.001 for all four attributes.)

5.3 Listening test II

Rooms

In listening test II, virtual rooms were used to obtain binaural room impulse responses, just like for test I. However, for this test an attempt was made to make the four room acoustical attributes more independent of each other (i.e. less correlated) than in the previous test. This is a difficult task, since in practice the room acoustical attributes are almost never completely independent. For example, a room with a high amount of reverberance often has low clarity and high values for attributes related to spaciousness. In the rating results of listening test I, it can clearly be seen that the attributes follow the same trend (Figs. 5.5 and 5.6). The attributes were made more independent in test II by simulating more unrealistic rooms, for example room VR13, which has a high reverberation time of T_20 = 1.75 s but side walls that are completely absorbing. This results in a low value for (1-IACC_E) of 0.07 and possibly in a low Apparent Source Width, due to the lack of reflected energy arriving from lateral directions.

Table 5.4 shows information about the eight virtual rooms used in listening test II. The spread in the ISO 3382 parameters presented in this table is smaller than the spread found in the previous listening test (Table 5.1). For example, the reverberation time T_20 is in the range 1.21 to 1.98 s, with differences between rooms of less than 0.5 s. This made the rating process more difficult for the subjects, something that was confirmed informally by the subjects.

Subjects

In listening test II, the same five expert subjects who took part in the first test participated (see the Subjects section of listening test I).

Results

The results of listening test II are shown graphically in Figs. 5.7 (cello stimulus) and 5.8 (speech stimulus).

Analysis of the results

The results of the ANOVAs for test II are shown in Table 5.5. Also in this case, the factor room was significant for all attributes. The factor stimulus was significant for two attributes: reverberance [F(1, 64), p = 0.0321] and ASW [F(1, 64), p = 0.0063].

Table 5.4: An overview of the (virtual) rooms used in listening test II. The first two columns list the room names as well as the type of room after which they were modelled. The table also lists some common, conventional room acoustical parameter values for each room; all parameters are calculated according to the ISO standard (see Section 2.3). The last two columns show the SPL of the resulting audio samples used in the listening test, after applying level normalization using the Replaygain algorithm. (*) LEV_calc and G were calculated before the audio samples were normalized to the same loudness level using the Replaygain algorithm, hence they include the original loudness differences between the rooms.
(Columns: Code, Type, T_20 (s), EDT (s), C_50 (dB), C_80 (dB), 1-IACC_E, LF, 1-IACC_L, LEV_calc (dB) (*), G (dB) (*), and sample SPL (dB) for the speech and cello stimuli. Rooms: VR10, VR11 and VR12 are three positions in the same concert hall; VR13 to VR17 are additional concert halls.)

Figure 5.7: Boxplots of the listening test results for test II (cello sample). From top to bottom: reverberance, clarity, apparent source width and listener envelopment. In the boxes, the central mark denotes the median, while the lower and upper bounds of the boxes denote the 25th (q_1) and 75th (q_3) percentiles, respectively. The length of the whiskers is equal to 1.5 times the box size (q_3 - q_1). If the data is normally distributed, the whiskers will cover 99.3% of the data.

Figure 5.8: Boxplots of the listening test results for test II (speech sample). From top to bottom: reverberance, clarity, apparent source width and listener envelopment. In the boxes, the central mark denotes the median, while the lower and upper bounds of the boxes denote the 25th (q_1) and 75th (q_3) percentiles, respectively. The length of the whiskers is equal to 1.5 times the box size (q_3 - q_1). If the data is normally distributed, the whiskers will cover 99.3% of the data.

Tukey HSD analyses revealed that reverberance was rated significantly lower on average for cello (y_rev = 2.15) than for speech (y_rev = 2.39), while ASW was rated higher for cello (y_ASW = 2.43) than for speech (y_ASW = 2.12).

Compared with the results of listening test I, the variances in the ratings for test II are higher. This is most probably due to the higher degree of difficulty of the test. Also, the value ranges of the ratings for the four attributes are smaller than those of the previous test. Furthermore, some unrealistic virtual rooms were included in test II, like room VR13, which has a high reverberation time of T_20 = 1.75 s but a rather low value for the early interaural cross-correlation (1-IACC_E = 0.07), due to completely absorbing side walls. If the perceptual results for this room are examined, it is found that the speech and cello stimuli are rated quite differently: for the speech stimulus, reverberance and ASW are rated around 0.4 and 0.3 respectively, while values of around 0.6 (reverberance) and 0.7 (ASW) are found for the cello stimulus. From a Tukey HSD test it follows that these differences in perceived reverberance and ASW are statistically significant at the p < 0.05 level. From the low value for 1-IACC_E, a low value for ASW would be expected. However, this is not found in the results, especially not for the cello stimulus, for which ASW is rated quite high. Apparently, a lack of energy arriving from lateral directions does not necessarily lead to a low ASW, and the interaural cross-correlation is not able to predict this effect.

Table 5.5: Two-way ANOVA results for the ratings from listening test II for the four room acoustical attributes. Bold values denote values that are significant at the p < 0.05 level. (For each attribute (reverberance, clarity, ASW and LEV), the table lists the sum of squares, the degrees of freedom, the mean square, the F value and the p value for the sources room, stimulus, room × stimulus, error and total.)

5.4 Listening test III

Rooms

A variety of real rooms was used in listening test III. The impulse responses of these rooms were measured using an omni-directional microphone (type B&K 4134) and an ITA dummy head. The binaural room impulse responses, as measured using the dummy head, were used to generate the binaural samples for the listening test and were also used in the calculation of IACC_E, IACC_L and LEV_calc. The responses measured using the omni-directional microphone were used to calculate the other room acoustical parameters (and are also included in the calculation of LEV_calc, see Eq. 2.34).

The list of rooms includes the auditorium of the TU Delft four times; in Table 5.7, these measurements are listed as AUDx. They were performed using different settings of the electro-acoustical ACS (Acoustic Control Systems) system present in the room. This system alters the acoustics of a room electronically by using multiple microphones and loudspeakers installed in the room [Berkhout, 1988; De Vries et al., 2007]. The numbers in the room codes in the table represent the corresponding preset setting of the ACS system, as listed in Table 5.6. Table 5.6 lists all preset settings available in this system, although in the listening test only presets 1, 5, 6 and 8 were used. At setting 0, the system is in standby mode and the reverberation time is only determined by the natural reflections of the room; the natural reverberation time of the room at this setting was measured to be 1.5 s. The impulse responses for these settings were not all measured at the same receiver location, as can be seen in Fig. 5.9. AUD5 was measured quite close to the source, at a distance of 6.95 m. AUD1 and AUD6 were measured in the back of the room at 20.95 m distance, while AUD8 was measured at an off-axis position at 13.3 m distance.

Subjects

For listening test III, a group of 15 participants was available. The group included a mixture of expert and non-expert listeners.

Results

Figures 5.10 and 5.11 show the results for listening test III.

Figure 5.9: A schematic view of the auditorium in the aula at the TU Delft. The figure also shows the position of the source and the receiver positions used to measure the impulse responses for AUD1, 5, 6 and 8.

Table 5.6: A list of the preset settings of the ACS system installed in the auditorium at the aula of Delft University of Technology and the corresponding reverberation times, as specified by the manufacturer of the system. (Presets 0 to 8, in order: Standby mode - natural reflections (1.5 s); Early reflections only; Chamber music / operetta; Opera / small ensemble; Orchestra; Orchestra (2 s); Orchestra, romantic repertoire (2.25 s); Church; Cathedral (5 s).)

Analysis of the results

The two-way ANOVA results for listening test III are shown in Table 5.8. Again, the factor room had a significant effect for all four room acoustical attributes. The stimulus type had a significant effect in two cases: clarity [F(1, 280) = 7.378, p = 0.0084] and LEV [F(1, 280), p = 0.0169]. The interaction effect room × stimulus was significant for reverberance, clarity and LEV.

It is interesting to take a closer look at the results for the AUDx rooms, i.e. the TU Delft auditorium with the ACS system active at various settings.

Table 5.7: An overview of the rooms used in listening test III. The first two columns list the room names as well as the type of room. The table also lists some common, conventional room acoustical parameter values for each room; all parameters are calculated according to the ISO standard (see Section 2.3). The last two columns show the SPL of the resulting audio samples used in the listening test. Unlike in the other listening tests, the loudness level of the samples was not normalized using the Replaygain algorithm in this test. (*) For rooms AR and LR, there was too little energy in the late part of the response to calculate IACC_L. Therefore, 1-IACC_L was set to a value of 0.0 for these rooms.
(Columns: Code, Type, T_20 (s), EDT (s), C_50 (dB), C_80 (dB), 1-IACC_E, 1-IACC_L, LEV_calc (dB), G (dB), and sample SPL (dB) for the speech and cello stimuli. Rooms: AR anechoic room (*), LR listening room (*), SW skyway, HW hallway, AUD1 auditorium (ACS preset 1), AUD5 auditorium (ACS preset 5), AUD6 auditorium (ACS preset 6), SC staircase, AUD8 auditorium (ACS preset 8), RC reverberation chamber.)

Figure 5.10: Boxplots of the listening test results for test III (cello sample). From top to bottom: reverberance, clarity, apparent source width and listener envelopment. In the boxes, the central mark denotes the median, while the lower and upper bounds of the boxes denote the 25th (q_1) and 75th (q_3) percentiles, respectively. The length of the whiskers is equal to 1.5 times the box size (q_3 - q_1). If the data is normally distributed, the whiskers will cover 99.3% of the data. The + symbols denote values that are considered to be outliers.

Figure 5.11: Boxplots of the listening test results for test III (speech sample). From top to bottom: reverberance, clarity, apparent source width and listener envelopment. In the boxes, the central mark denotes the median, while the lower and upper bounds of the boxes denote the 25th (q_1) and 75th (q_3) percentiles, respectively. The length of the whiskers is equal to 1.5 times the box size (q_3 - q_1). If the data is normally distributed, the whiskers will cover 99.3% of the data. The + symbols denote values that are considered to be outliers.

The measured reverberation times listed in Table 5.7 for AUD1, AUD6 and AUD8 are very close to the target values reported by the manufacturer (see Table 5.6). For AUD5, the measured reverberation time of 1.67 s is lower than the target value of 2 s. This might be due to the fact that the measurement for AUD5 was performed closer to the source than the other three measurements, see Fig. 5.9.

From a multi-comparison Tukey HSD test for the speech stimulus, the following significant differences in the perceived amount of reverberance are found:

- The perceived amount of reverberance for AUD1 is significantly lower than for AUD5, 6 and 8.
- The perceived amount of reverberance for AUD5 is significantly higher than for AUD1.
- The perceived amount of reverberance for AUD6 is significantly higher than for AUD1 and significantly lower than for AUD8.
- The perceived amount of reverberance for AUD8 is significantly higher than for AUD1 and AUD6.

For the cello stimulus the following results are found:

- The perceived amount of reverberance for AUD1 is significantly lower than for AUD8.
- The perceived amount of reverberance for AUD5 is not significantly different from any of the other results.
- The perceived amount of reverberance for AUD6 is significantly lower than for AUD8.
- The perceived amount of reverberance for AUD8 is significantly higher than for AUD1 and AUD6.

It is interesting to see that the perceived amount of reverberance for AUD5 and AUD6 is not significantly different for either stimulus, something that is not surprising given the small difference in target reverberation times for settings 5 and 6 (2 s and 2.25 s, respectively). However, as mentioned before, the measured reverberation times for AUD5 and AUD6 are quite different, and it seems that this objective measure does not predict perception very well in this case.

Another problem with the reverberation time is found when the perceptual results for AUD8 and SC (the staircase) are compared. For both stimuli, speech and cello, the perceived amount of reverberance for AUD8 is significantly lower than for SC, while the reverberation time of the former is almost one second higher.

It seems that this difference is better reflected by EDT than by T_20.

For the clarity attribute, only one significant difference was found: the perceived amount of clarity for the speech stimulus was significantly lower for AUD8 than for AUD6. All the other differences were not significant.

Tukey HSD tests were also used to check for significant effects of the ACS system on spatial aspects of the sound field (ASW and LEV). For the cello stimulus, no significant differences were found between any of the tested settings for either attribute. For the speech stimulus, ASW was significantly higher for AUD5 than for AUD1 and AUD6. For LEV and the speech stimulus, the following differences were found:

- The perceived amount of LEV for AUD1 is significantly lower than for AUD5 and AUD8.
- The perceived amount of LEV for AUD5 is significantly higher than for AUD1 and AUD6.
- The perceived amount of LEV for AUD6 is significantly lower than for AUD5.
- The perceived amount of LEV for AUD8 is significantly higher than for AUD1.

From the above, it can be concluded that the use of the ACS system mainly affects the perceived amount of reverberance and seems to have little effect on clarity and on attributes related to spaciousness. This is probably due to the fact that the system does not compensate for the natural reflections of the room, which arrive at the listener earlier than those of the larger room that is simulated by the system. Perceptually, the natural reflections of the room might therefore be dominant in terms of attributes related to spaciousness. Interestingly, the amount of reverberance is not significantly higher for AUD8 than for AUD5, although the reverberation time of the former is much higher (4.81 s versus 1.67 s). Examining Table 5.7, it can be seen that EDT is (much) lower than T_20 for AUD1, 5, 6 and 8. As discussed above, the ACS system mostly affects the late reverberation tail rather than the early reflection part of the impulse response. This might also explain the difference between the perceptual results and the measured reverberation times: for real-life signals, the early reflections may have a larger impact on the perceived amount of reverberance than the late reverberation, due to masking effects. This is also the reason why EDT is considered a better predictor for reverberance [ISO, 2009].
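A sketch of such a multi-comparison Tukey HSD test using statsmodels, here for the reverberance ratings of the AUDx rooms; the data file and column names are illustrative assumptions, not the actual data format used in this research.

```python
import pandas as pd
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# df: one row per rating, with columns 'rating' and 'room' (AR, LR, ..., AUD8, RC)
df = pd.read_csv("test3_reverberance_speech.csv")  # illustrative file name
subset = df[df["room"].isin(["AUD1", "AUD5", "AUD6", "AUD8"])]

# Pairwise comparisons of the mean ratings at the 5% family-wise level
result = pairwise_tukeyhsd(endog=subset["rating"],
                           groups=subset["room"],
                           alpha=0.05)
print(result.summary())  # mean differences, confidence intervals, reject yes/no
```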

Table 5.8: Two-way ANOVA results for the ratings from listening test III for the four room acoustical attributes. Bold values denote values that are significant at the p < 0.05 level. (For each attribute (reverberance, clarity, ASW and LEV), the table lists the sum of squares, the degrees of freedom, the mean square, the F value and the p value for the sources room, stimulus, room × stimulus, error and total.)

5.5 Listening test IV

Rooms

For test IV, the same set of rooms was used as for test III, as listed in Table 5.7. However, this time the samples were normalized to the same loudness level using the Replaygain algorithm, just like for tests I and II. This way, it can be investigated whether loudness differences have a significant effect on the perceptual results. Table 5.9 shows the SPL of the samples used in test IV. Again, as for tests I and II, the SPL differs slightly between samples and between stimulus types, because SPL is not the same as the loudness level as determined by the Replaygain algorithm.

Table 5.9: The SPL of the different samples used in test IV. (Columns: Room, and sample SPL (dB) for the speech and cello stimuli; rooms AR, LR, SW, HW, AUD1, AUD5, AUD6, SC, AUD8 and RC.)

Subjects

As for test III, 15 subjects participated in test IV. The group included a mixture of expert and non-expert listeners.

Results

The rating results for test IV are shown graphically in Figs. 5.12 and 5.13.

Analysis of the results

The results of the ANOVAs for test IV are shown in Table 5.10. The room factor is again significant for all four attributes. The stimulus factor only has a significant effect on ASW: [F(1, 280), p < 0.001].

Figure 5.12: Boxplots of the listening test results for test IV (cello sample). From top to bottom: reverberance, clarity, apparent source width and listener envelopment. In the boxes, the central mark denotes the median, while the lower and upper bounds of the boxes denote the 25th (q_1) and 75th (q_3) percentiles, respectively. The length of the whiskers is equal to 1.5 times the box size (q_3 - q_1). If the data is normally distributed, the whiskers will cover 99.3% of the data. The + symbols denote values that are considered to be outliers.

Figure 5.13: Boxplots of the listening test results for test IV (speech sample). From top to bottom: reverberance, clarity, apparent source width and listener envelopment. In the boxes, the central mark denotes the median, while the lower and upper bounds of the boxes denote the 25th (q_1) and 75th (q_3) percentiles, respectively. The length of the whiskers is equal to 1.5 times the box size (q_3 - q_1). If the data is normally distributed, the whiskers will cover 99.3% of the data. The + symbols denote values that are considered to be outliers.

The interaction effect room × stimulus is significant for clarity and ASW. The next section will discuss the differences between tests III and IV.

Table 5.10: Two-way ANOVA results for the ratings from listening test IV for the four room acoustical attributes. Bold values denote values that are significant at the p < 0.05 level. (For each attribute (reverberance, clarity, ASW and LEV), the table lists the sum of squares, the degrees of freedom, the mean square, the F value and the p value for the sources room, stimulus, room × stimulus, error and total.)

5.6 The effect of SPL differences on the results

As discussed in the previous section, the setup of test IV was equal to that of test III, except that in test IV all samples were normalized to the same loudness level, whereas in test III the natural loudness differences that existed between the rooms were maintained. To test whether significant differences were present in the results of both tests, an ANOVA was performed on all results of tests III and IV with test as an extra factor, in addition to the factors room and stimulus that were used in the individual tests. The results of the ANOVA are summarized in Table 5.11. From the table, it follows that the main factor test is never significant; i.e. on average there was no difference in the ratings between tests III and IV. However, various interaction effects are significant, like room × test, which is significant for all four attributes. To investigate this further, post-hoc Tukey HSD analyses were performed for all rooms, stimuli and attributes to find significant differences between both tests. The results of these Tukey HSD tests are shown in Tables 5.12 and 5.13 for the speech and cello stimuli, respectively. In these tables, significant differences are indicated by either a + or a - symbol: a plus symbol means a significant positive difference (the average result for test IV was higher than for test III) and a minus symbol represents a significant negative difference.

As is clear from these results, SPL differences do sometimes result in significant differences in the ratings for all four attributes. However, no general conclusions can be drawn on the way SPL differences affect the results: a positive SPL difference may lead to a significant positive difference in rating, a significant negative difference, or no significant difference at all. Generally, S_REV and S_CLA are rated higher when the SPL is higher (if the difference in the ratings between listening tests III and IV is significant at all), although there is one case in which S_REV is rated significantly lower while the SPL difference is larger than zero. For the parameters related to spaciousness, S_ASW and S_LEV, the relation is less clear; positive SPL differences can lead to both significantly lower and significantly higher results for these parameters.
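A sketch of the combined analysis, extending the two-way model of the individual tests with the extra factor test; again, the data file and column names are illustrative assumptions.

```python
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Combined ratings of tests III and IV for one attribute:
# columns 'rating', 'room', 'stimulus' and 'test' (III or IV)
df = pd.read_csv("tests_III_IV_reverberance.csv")   # illustrative file name

# Main effects plus all two-way interactions (room x stimulus,
# room x test and stimulus x test), as listed in Table 5.11
model = ols("rating ~ (C(room) + C(stimulus) + C(test))**2", data=df).fit()
print(sm.stats.anova_lm(model, typ=2))
```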

Table 5.11: Three-way ANOVA results for the ratings from listening tests III and IV combined. Bold values denote values that are significant at the p < 0.05 level. (For each attribute (reverberance, clarity, ASW and LEV), the table lists the sum of squares, the degrees of freedom, the mean square, the F value and the p value for the sources room, stimulus, test, room × stimulus, room × test, stimulus × test, error and total.)

Table 5.12: The results of the post-hoc Tukey HSD tests between tests III and IV for the speech stimulus. The table lists all ten rooms as well as the difference in SPL between both tests for those rooms (SPL IV minus SPL III). A 0 denotes a difference in the ratings which is not significant at the p < 0.05 level. A + denotes a significant difference where the corresponding attribute is on average rated higher for test IV than for test III, whereas in the case of a -, the average rating for test IV is significantly lower than for test III. Note that in only one case (RC) the SPL for test III was higher than for test IV. (Columns: Room, SPL difference (dB), Reverberance, Clarity, ASW, LEV; rooms AR, LR, SW, HW, AUD1, AUD5, AUD6, SC, AUD8 and RC.)

Table 5.13: The results of the post-hoc Tukey HSD tests between tests III and IV for the cello stimulus. The table lists all ten rooms as well as the difference in SPL between both tests for those rooms (SPL IV minus SPL III). A 0 denotes a difference in the ratings which is not significant at the p < 0.05 level. A + denotes a significant difference where the corresponding attribute is on average rated higher for test IV than for test III, whereas in the case of a -, the average rating for test IV is significantly lower than for test III. Note that in only one case (RC) the SPL for test III was higher than for test IV. (Columns: Room, SPL difference (dB), Reverberance, Clarity, ASW, LEV; rooms AR, LR, SW, HW, AUD1, AUD5, AUD6, SC, AUD8 and RC.)

5.7 Summary and discussion

Four listening tests were conducted to collect perceptual data on the room acoustical attributes reverberance, clarity, ASW and LEV. In these tests, groups of subjects rated these attributes while listening to samples, which were presented binaurally using headphones. The samples varied in their acoustical properties because anechoically recorded signals were convolved with different binaural room impulse responses. These impulse responses were either simulated for virtual rooms (tests I and II) or measured in real rooms using a dummy head (tests III and IV). Two stimulus types were used, male speech and solo cello, to test the effect of stimulus type on the results. The tests were conducted using a hybrid form between direct evaluation and a paired-comparison method; it has been shown in the literature that this leads to accurate results while the total time needed to perform the test is relatively short. Furthermore, all tests had a double-blind design.

From the statistical analyses performed on the listening test results, it was found that significant differences in the ratings between the rooms were always present, for all attributes and stimuli. For some attribute / test combinations (but not all), the stimulus type had a significant effect on the ratings, or the interaction effect room × stimulus was significant, meaning that the stimulus needs to be taken into account in those cases where the acoustics need to be evaluated, objectively or perceptually. When the listening test results are compared with conventional room acoustical parameters as defined in ISO 3382 [ISO, 2009], some examples are already found where the objective parameters do not predict perception very well, or not at all. This will be further discussed in Chapter 6.

To investigate the effect of SPL on the perceptual results, tests III and IV were conducted with the same set of (measured) room impulse responses. In test III, the original loudness differences between the rooms were maintained, while for test IV all samples were normalized to the same loudness level using the Replaygain algorithm (see Appendix B). From statistical analyses, it was found that the SPL differences between tests III and IV indeed affect the perceptual results significantly for some attributes and rooms, but no clear correlation between the SPL difference and this effect was found. The SPLs of the samples in test IV were generally higher than in test III, but as a result both positive and negative significant differences are found within one attribute. These results show that it is important to take the absolute sound pressure level of a signal into account when evaluating the acoustics of a room; this is something which is included in the AMARA method.


6 Optimization and validation of the model

It doesn't matter how beautiful your theory is, it doesn't matter how smart you are. If it doesn't agree with experiment, it's wrong. (Richard P. Feynman)

In this chapter, the auditory model presented in Chapter 3 will be optimized and validated by comparing the listening test results from Chapter 5 with the objective parameters obtained from the model outputs. If the model is a good predictor for these perceptual attributes, the correlation coefficients between the ratings and the model output results will be high. This chapter starts with the optimization of the free parameters in the model by means of a genetic algorithm, and finishes with a description of how the objective parameters obtained from the model can be mapped onto the perceptual attributes.

6.1 Model optimization using a genetic algorithm

Optimization of the free parameters in the model

The auditory model used in this research has a total of 13 free parameters. A list of these parameters is shown in Table 6.1, together with a short description and the corresponding equations in which they appear. These parameters could be tuned manually to yield the best results. However, in this research a more systematic approach is chosen in the form of a Genetic Algorithm (GA). Genetic algorithms form a class of algorithms that are capable of optimizing non-linear, multi-modal problems with a high number of free parameters, like the auditory model presented here [Holland, 1975; Weise, 2009]. These algorithms search the solution space by simulating evolution (survival of the fittest).

An explanation of the algorithm will be presented in the next section.

Table 6.1: A list of the 13 free parameters in the auditory model. The table lists the parameter names, a short description and references to the equations (and one figure) in which the parameters are used. (Parameters: µ_Ψ and µ_Ψ,dip, the relative levels of the peak and dip thresholds; T_min, the minimum width of a peak or dip; k_0 and k_1, the minimum and maximum frequency band in the stream splitting procedure; q_0 and q_1, the minimum and maximum frequency band in the ITD fluctuation calculation; z_0 and z_1, the minimum and maximum frequency band in the low-frequency level calculation; µ_ASW, ν_ASW, µ_LEV and ν_LEV, weighting factors used in the ASW and LEV calculations.)

The genetic algorithm

The starting point of a genetic algorithm is a pool of N chromosomes. Each chromosome consists of a sequence of genes representing the parameters to be optimized, as shown in Fig. 6.1. Initially, random values are assigned to the genes in each chromosome. After the creation of the initial pool of chromosomes, a fitness value is calculated for each chromosome. The fitness of a chromosome is a single value which represents how well the particular set of parameter values performs; it can be defined depending on the problem that needs to be optimized. Obviously, the goal is to find a chromosome with a fitness that is as high as possible.

It is unlikely that the initial pool will contain a chromosome with a fitness value that exactly corresponds to a global or local maximum. Therefore, the algorithm continues with a process in which parents are selected for generating new offspring. Different methods exist for this selection process. In this project, it was chosen to use a linear ranking method. For this method, the chromosomes are ordered in a ranking from their highest to lowest fitness values F(i). Next, the probability for a chromosome to be selected as a parent depends linearly on its rank. This way, it is more likely that chromosomes with a high fitness are selected, although there is also a non-zero probability that chromosomes with a low rank are selected.

Figure 6.1: The genetic algorithm optimization procedure starts with a pool of chromosomes representing sequences of genes (parameters to be optimized). In this example, the pool includes six chromosomes (A to F) consisting of five genes each (1 to 5). When two parents are selected, a new chromosome (child) will be generated using a random crossover mechanism (see Fig. 6.2 for an example).

Figure 6.2: Two parent chromosomes, in this case A and D, are used to generate a new chromosome (child) through a random crossover mechanism.

The step of selecting two parents and constructing a child chromosome is repeated N - 1 times to form the next generation. The chromosome with the highest fitness value is always transferred directly to the new generation. This procedure is called elitism and prevents the possibility that the maximum fitness of the next generation is lower than that of the current generation. Together with the chromosome with the maximum fitness, the N - 1 newly generated child chromosomes form a new generation of N chromosomes.
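The selection and crossover steps described above can be sketched as follows; this is a minimal illustration assuming chromosomes are stored as NumPy arrays of gene values, not the implementation used in this research.

```python
import numpy as np

rng = np.random.default_rng(42)

def select_parent(pool, fitness):
    """Linear ranking selection: the selection probability grows linearly
    with the rank, so the fittest chromosome is most likely to be chosen,
    but low-ranked chromosomes still have a non-zero probability."""
    n = len(pool)
    order = np.argsort(fitness)          # indices from worst to best
    ranks = np.empty(n)
    ranks[order] = np.arange(1, n + 1)   # worst gets rank 1, best gets rank n
    probabilities = ranks / ranks.sum()
    return pool[rng.choice(n, p=probabilities)]

def crossover(parent_a, parent_b):
    """Random (uniform) crossover: each gene is taken from one of the parents."""
    take_from_a = rng.random(parent_a.shape) < 0.5
    return np.where(take_from_a, parent_a, parent_b)
```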

If the procedure of selecting parents and generating new generations of child chromosomes were repeated many times as-is, the algorithm might get stuck in a local maximum. To avoid this, mutation is allowed, meaning that some genes get a new, random value. The genes to which mutation is applied are randomly selected with a probability that is equal to a certain mutation ratio (0 - 1). Figure 6.3 shows the mutation process graphically.

Figure 6.3: In the mutation process, randomly selected genes get a new, random value assigned to them. In this example, the second gene of the child chromosome from Fig. 6.2 gets a new value by mutation.

The whole process of parent selection, child generation through crossover, and mutation is repeated iteratively until some criterion is met. For example, the process could stop when the highest fitness value of the current generation is above a certain threshold, or when a maximum number of iterations has been carried out. The complete genetic algorithm is shown in Fig. 6.4.

Genetic algorithms are especially useful for optimizing non-linear problems like the auditory model used in the AMARA method. However, they also have some disadvantages, such as:

- Genetic algorithms converge towards an optimum, but it is never certain whether this optimum is global or local. The likelihood that the algorithm converges to a local optimum depends on the fitness landscape (the shape of the fitness function).
- Within each generation, the fitness function needs to be calculated for each chromosome. For complex problems, this can be computationally very expensive, leading to slow convergence.

Despite these disadvantages, the genetic algorithm presented in this section was used to optimize the parameters listed in Table 6.1.
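Combining these steps with mutation, elitism and a fixed number of generations gives a minimal version of the loop shown in Fig. 6.4. The sketch below reuses the select_parent and crossover helpers from the previous sketch; fitness_fn stands for an evaluation of the auditory model for one set of parameter values, and the default population size, generation count and bounds are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(42)

def run_ga(fitness_fn, n_genes, pop_size=100, n_generations=100,
           mutation_rate=0.05, bounds=(0.0, 1.0)):
    """Minimal genetic algorithm with elitism and random mutation
    (uses select_parent and crossover from the previous sketch)."""
    low, high = bounds
    pool = rng.uniform(low, high, size=(pop_size, n_genes))
    for _ in range(n_generations):
        fitness = np.array([fitness_fn(c) for c in pool])
        elite = pool[np.argmax(fitness)].copy()   # elitism: best chromosome survives
        children = [elite]
        while len(children) < pop_size:
            child = crossover(select_parent(pool, fitness),
                              select_parent(pool, fitness))
            mutate = rng.random(n_genes) < mutation_rate
            child[mutate] = rng.uniform(low, high, size=mutate.sum())
            children.append(child)
        pool = np.array(children)
    fitness = np.array([fitness_fn(c) for c in pool])
    return pool[np.argmax(fitness)], float(fitness.max())
```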

Figure 6.4: A flowchart of the iterative procedure used in a genetic algorithm. The decision to stop the procedure is based on the criterion shown at the end of the chain, which can be based on a maximum number of iterations, for example.

Results of the optimization procedure

The genetic algorithm was used to optimize the parameters in Table 6.1. The perceptual results from listening test III (see Chapter 5) were used as a training set for the GA. It was decided to define the fitness of a chromosome i as follows:

    F(i) = \frac{\bar{r}_j(i) + \min_j [r_j(i)]}{2}, \qquad (6.1)

in which r_j(i) is the correlation coefficient between the model predictions for acoustical attribute j, computed with the parameter values of chromosome i, and the corresponding perceptual ratings, and \bar{r}_j(i) is the mean correlation coefficient, averaged over the four attributes. The perceptual results of the two stimulus types (speech and cello) were combined in the GA, so in total there is one correlation coefficient for each of the four attributes. \min_j [r_j(i)] is the lowest value found for these four attributes. So, Eq. 6.1 takes into account the overall performance of the auditory model for a certain set of free parameter values, in the form of the average correlation coefficient.

In the context of this research, it is important that the auditory model performs well overall, but it is also important that it works well for each of the four room acoustical attributes individually. Therefore, to reduce the possibility that a very low correlation coefficient for one of the four acoustical attributes is compensated for by high correlation coefficients for the other three attributes, the minimum correlation coefficient is also included in the equation. The fitness values as calculated using Eq. 6.1 have a value between 0 and 1.

The number of chromosomes for the genetic algorithm was set to N = 100, with 13 genes in each chromosome (one for each parameter). The mutation rate was set to 5%. It was decided to stop the iterative procedure after 100 iterations. Figure 6.5 shows the maximum and mean fitness of the resulting 100 generations of chromosomes. The maximum fitness quickly increases towards a value around 0.8. After about 20 iterations the result does not change much anymore, although minor improvements do occur. After 100 iterations the fitness value is equal to

    F(i_max) = \frac{\bar{r}_j(i_max) + \min_j [r_j(i_max)]}{2} = 0.862, \qquad (6.2)

where i_max is the number of the chromosome with the maximum fitness value. The mean fitness value for the whole population of chromosomes after the final iteration was lower, as can be seen from the dashed curve in Fig. 6.5. Table 6.2 shows the resulting parameter values after 100 iterations of the genetic algorithm.

Figure 6.5: The fitness values as a function of generation number (iteration) for the genetic algorithm. The solid line shows the maximum fitness value found in each generation, and the dashed line denotes the average over all chromosomes.
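A small sketch of the fitness of Eq. 6.1, given the four correlation coefficients between the model predictions and the ratings; the numbers in the example are made up.

```python
import numpy as np

def fitness(correlations):
    """Fitness of a chromosome (Eq. 6.1): the mean of the four attribute
    correlation coefficients, averaged with the lowest one."""
    r = np.asarray(correlations, dtype=float)   # [r_rev, r_cla, r_asw, r_lev]
    return (r.mean() + r.min()) / 2.0

# Illustrative values only:
print(fitness([0.90, 0.85, 0.88, 0.80]))  # 0.82875
```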

Table 6.2: The result of the free parameter optimization using the genetic algorithm. (Among the optimized values: T_min = 63.1 ms; k_0 = 5 (168 Hz); k_1 = 20 (1.84 kHz); q_0 = 9 (387 Hz); q_1 = 20 (1.84 kHz); z_0 = 5 (168 Hz); z_1 = 9 (387 Hz). The table also lists the optimized values of the thresholds µ_Ψ and µ_Ψ,dip and of the weighting factors µ_ASW, ν_ASW, µ_LEV and ν_LEV.)

6.2 Model validation

To evaluate the predictive potential of the objective room acoustical parameters from the auditory model, the correlation coefficients between the listening test results from Chapter 5 and these output parameters were calculated. The results are shown in Tables 6.3, 6.4, 6.5 and 6.6 for the four room acoustical attributes separately. As can be seen from the tables, for almost all test / stimulus combinations the parameters from the model yield higher correlation coefficients than the conventional parameters. In fifteen cases, the conventional parameters show an insignificant result (at the p < 0.05 level); since in total 64 correlation coefficients were calculated between the conventional parameters and the perceptual ratings, these fifteen insignificant results correspond to 23%. The new parameters yield insignificant results in only three cases (9%). In all three of these cases, the conventional parameters also failed to give significant results, except for one result in Table 6.6, where LEV_calc yields a significant correlation while the new parameter does not. This means that the model is not only able to predict these four room acoustical attributes for most test / stimulus combinations, but that it even has a higher predictive value than the conventional parameters in most cases.

Some interesting observations can also be made for the conventional parameters. For example, EDT is not necessarily a better predictor for reverberance than T_20 in these tests, even though ISO 3382 recommends EDT over the reverberation time as a predictor for reverberance [ISO, 2009]; the correlation coefficients of these two parameters are very close. Furthermore, in the case of the clarity attribute, C_80 yields correlation coefficients that are (slightly) higher than those for C_50, even for the speech stimulus type, for which C_50 is the recommended parameter according to the ISO standard [ISO, 2009].

Finally, the early lateral energy fraction LF does not correlate at all with ASW in these tests. For tests III and IV, LF values were not available, since no measurements using a figure-of-eight microphone were carried out.
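The correlation coefficients and their significance can be evaluated as sketched below with scipy.stats.pearsonr; the rating and model-output vectors in the example are made up and only illustrate the procedure.

```python
import numpy as np
from scipy.stats import pearsonr

# Mean perceptual rating per room and the corresponding model output,
# e.g. for reverberance in one test / stimulus combination (values made up)
ratings = np.array([0.05, 0.20, 0.35, 0.50, 0.55, 0.60, 0.70, 0.80, 0.95])
p_rev   = np.array([0.02, 0.25, 0.30, 0.45, 0.60, 0.58, 0.75, 0.78, 0.90])

r, p_value = pearsonr(ratings, p_rev)
significant = p_value < 0.05
print(f"r = {r:.2f}, p = {p_value:.4f}, significant: {significant}")
```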

Table 6.3: The correlation coefficients between the listening test results and the objective parameters for the room acoustical attribute reverberance. Two conventional objective parameters were evaluated: the reverberation time T_20 and the early decay time EDT. P_REV is the objective parameter related to reverberance resulting from the auditory model presented in this thesis. For each row (test / stimulus combination: tests I to IV, cello and speech), the highest correlation coefficient is shown in bold. Values marked with (*) are not significant at the p < 0.05 level.

Table 6.4: The correlation coefficients between the listening test results and the objective parameters for the room acoustical attribute clarity. Two conventional objective parameters were evaluated: the clarity indices C_50 and C_80. P_CLA is the objective parameter related to clarity resulting from the auditory model presented in this thesis. For each row (test / stimulus combination: tests I to IV, cello and speech), the highest correlation coefficient is shown in bold. Values marked with (*) are not significant at the p < 0.05 level.

Table 6.5: The correlation coefficients between the listening test results and the objective parameters for the room acoustical attribute ASW. Two conventional objective parameters were evaluated: one minus the early interaural cross-correlation coefficient, 1-IACC_E, and the early lateral energy fraction LF. P_ASW is the objective parameter related to ASW resulting from the auditory model presented in this thesis. For each row (test / stimulus combination: tests I to IV, cello and speech), the highest correlation coefficient is shown in bold. Values marked with (*) are not significant at the p < 0.05 level; for tests III and IV, no LF values are available (N/A).

Table 6.6: The correlation coefficients between the listening test results and the objective parameters for the room acoustical attribute LEV. Two conventional objective parameters were evaluated: one minus the late interaural cross-correlation coefficient, 1-IACC_L, and Beranek's LEV_calc. P_LEV is the objective parameter related to LEV resulting from the auditory model presented in this thesis. For each row (test / stimulus combination: tests I to IV, cello and speech), the highest correlation coefficient is shown in bold. Values marked with (*) are not significant at the p < 0.05 level.

6.3 Mapping of objective parameters onto perceptual attributes

In the previous section, it was shown that the objective parameters resulting from the auditory model correlate significantly (and highly) with the perceptual results from the listening tests in most cases. This means that these parameters can be used to predict perceptual room acoustical attributes. In order to do so, the objective results should somehow be mapped onto the perceptual attributes. When a closer look is taken at the listening test results in Chapter 5, it can be seen that the perceptual attributes tend to saturate at the extremes; i.e. the largest differences are found at medium levels, while at very low and very high levels the results lie closer together. A function that models this effect is the so-called sigmoid function:

    y(x) = \frac{1}{1 + \exp(-a(x - b))}. \qquad (6.3)

The constants a and b determine the slope and the zero-point of the sigmoid function, respectively. Figure 6.6 shows an example for a = 1 and b = 0.

Figure 6.6: An example of a sigmoid function. This example follows the curve y(x) = 1/(1 + exp(-x)).

The objective results can now be mapped onto the listening test results by fitting such a sigmoid function through the data for each attribute. This fitting procedure is carried out using a Gauss-Newton non-linear least-squares procedure [Fletcher, 1987], which iteratively optimizes a and b while giving more weight to data points with a lower uncertainty (standard deviation). An explanation of this algorithm is given in Appendix F. The results of listening test I were used for the fitting procedure, because the range in the perceptual data is the largest for that test.

This way, the lowest and highest values of the P_X results map to values close to 0 and 1, respectively, making sure that the full scale of the sigmoid function is used. The resulting values of a and b for the four room acoustical attributes are shown in Table 6.7. Figure 6.7 shows the resulting sigmoid functions.

Table 6.7: The sigmoid function constants a and b for each of the four objective room acoustical parameters from the auditory model (P_REV, P_CLA, P_ASW and P_LEV). These constants were optimized iteratively using a Gauss-Newton algorithm.

Using the fitting results in Table 6.7, the objective parameters P_X from the model can now be mapped onto a 0 - 1 perceptual scale using

    S_X = \frac{1}{1 + \exp(-a_X (P_X - b_X))}, \qquad (6.4)

where a_X and b_X are the sigmoid constants for parameter X. The results of this re-mapping procedure are shown graphically in Fig. 6.8; the figure captions also show the resulting correlation coefficients between the objective parameters S_X and the perceptual data. The scaled parameters S_X can easily be interpreted, since they map directly onto the perceptual scale presented in Table 5.2. For example, if S_REV = 0.2, this translates to the observation: "the amount of reverberance is low".
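A sketch of the sigmoid fit and the mapping of Eqs. 6.3 and 6.4. Instead of the Gauss-Newton procedure of Appendix F, the example uses scipy's curve_fit for the weighted non-linear least-squares step; the data points, the starting values and the resulting constants are illustrative only, not the values of Table 6.7.

```python
import numpy as np
from scipy.optimize import curve_fit

def sigmoid(p, a, b):
    """Eq. 6.3 / 6.4: map an objective parameter P_X onto a 0 - 1 scale."""
    return 1.0 / (1.0 + np.exp(-a * (p - b)))

# Illustrative data: model outputs P_REV, mean ratings and their std per room
p_rev  = np.array([0.5, 1.0, 1.6, 2.1, 2.7, 3.3, 3.9, 4.6, 5.5])
rating = np.array([0.03, 0.10, 0.25, 0.45, 0.55, 0.65, 0.80, 0.90, 0.97])
sigma  = np.array([0.05, 0.06, 0.08, 0.10, 0.09, 0.08, 0.07, 0.06, 0.05])

# Weighted non-linear least squares: points with a small standard
# deviation get more weight, as in the Gauss-Newton fit of the thesis.
(a_rev, b_rev), _ = curve_fit(sigmoid, p_rev, rating,
                              p0=(1.0, 2.5), sigma=sigma)

s_rev = sigmoid(3.0, a_rev, b_rev)   # scaled parameter S_REV for P_REV = 3.0
```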

Figure 6.7: Results of fitting sigmoid functions through the objective and perceptual results for all four room acoustical attributes: (a) reverberance (r = 0.72), (b) clarity (r = 0.65), (c) ASW (r = 0.83) and (d) LEV (r = 0.86). The error bars denote the standard deviations in the perceptual ratings. Table 6.7 lists the sigmoid constants a and b for each of the four curves.

Figure 6.8: Results of mapping the objective parameters resulting from the auditory model onto the perceptual results using sigmoid functions and the corresponding constants from Table 6.7: (a) reverberance (r = 0.69), (b) clarity (r = 0.68), (c) ASW (r = 0.84) and (d) LEV (r = 0.84).

6.4 Summary and discussion

In this chapter, the optimization and validation of the auditory model presented in this thesis were discussed. The free parameters in the model, which need to be tuned, were optimized using a genetic algorithm; the results from listening test III were used as a training data set for this optimization procedure. After 100 iterations of the genetic algorithm, a set of optimal values was found for the free parameters.

Next, the model was validated using this set of parameters, by evaluating the correlation coefficients between the listening test results and the objective room acoustical parameters resulting from the model. It was found that the model results correlate highly with the perceptual results and, in most cases, perform better than the conventional objective parameters as defined in ISO 3382.

Finally, the objective parameters P_X resulting from the model were scaled to a perceptual scale (0 - 1) by fitting sigmoid functions through the data. The scaled parameters S_X resulting from this fitting procedure map directly onto the perceptual scale used in the listening tests, so that they are easy to interpret. In the next chapter, some aspects of the auditory model that are important for practical application will be discussed.


7 Practical aspects of the AMARA method

Most audio scientists are fortunate enough to confront their problems in dim windowless laboratories, well away from the public eye; acousticians are not so lucky. (Paul Mitchinson)

In the previous chapter, it was shown that the objective parameters resulting from the auditory model correlate highly with the listening test results from Chapter 5. By fitting sigmoid functions through the curves of objective parameters versus perceptual data, it was possible to map the objective parameters directly onto a perceptual scale from 0 to 1. This makes the objective parameters easy to interpret in practice. However, more things need to be considered when the auditory model is used in practical situations. In this chapter, an attempt is made to cover the most important practical aspects, like the minimum length of a stimulus and the minimum signal-to-noise ratio necessary to yield accurate results.

7.1 Applying the model in practice

A typical measurement setup (1)

A typical measurement setup that can be used to derive objective room acoustical parameters using the auditory model is shown in Fig. 7.1. In this example, a loudspeaker is used as the sound source. The sound is recorded binaurally using a dummy head (like the ITA head discussed in Chapter 4). This setup provides two options:

The indirect method
By performing the measurement method presented in Chapter 2, binaural room impulse responses (BRIRs) can be measured with this setup, using sweep signals, for example. From these BRIRs, binaural audio signals can be constructed by convolving the responses with anechoic recordings. The resulting signals can then be processed by the method presented in this thesis to derive the four objective parameters related to room acoustics. This method is called the indirect method because BRIRs have to be measured or simulated before the objective parameters can be determined.

The direct method
As a second option, the anechoic recordings mentioned above can be played back directly through the loudspeaker. Provided that the whole measurement system is linear and time-invariant and the signal-to-noise ratio is high (see Section 7.4), the binaural recordings obtained this way will be equal to the ones resulting from the convolution process of the first option. This approach is called the direct method.

The indirect method is the more flexible of the two: the measured BRIRs can be convolved with any anechoic audio recording, meaning that the objective room acoustical parameters can be determined for a variety of stimulus types, while only one measurement has to be performed for each source / receiver combination.

A typical measurement setup (2)

An advantage of the AMARA method is the fact that it works on arbitrary binaural recordings. Therefore, it is also possible to use a natural sound source instead of a loudspeaker. Such a measurement setup is shown schematically in Fig. 7.2. This allows objective parameters to be determined during a musical performance, for example. This way, measurements can be taken when an audience is present in the room, including the resulting changes in the acoustics (see Chapter 2), without bothering the audience with artificial or pre-recorded audio signals. Furthermore, the directional characteristics of the source are automatically taken into account using this setup. Obviously, this setup only allows for the direct method (see the previous section).

Because this second measurement setup uses natural sound sources, reproducibility will be lower than with the first setup. Also, it is unlikely that all stimulus types the experimenter might want to test (different voices, different instruments, etc.) will be available during one performance. Therefore, multiple measurements in different situations will be necessary, which makes the assessment a lengthy process. If reproducibility and time are important, it is better to use the setup presented in the previous section.

Figure 7.1: A typical measurement setup for applying the auditory model in practice to determine objective parameters related to room acoustics. This setup uses a loudspeaker as the sound source (diagram components: computer, digital audio interface, audio amplifier, loudspeaker, dummy head with left and right channels).

Figure 7.2: A typical measurement setup for applying the auditory model in practice to determine objective parameters related to room acoustics. This setup uses a natural sound source: a violin in this example (diagram components: source, dummy head with left and right channels, digital audio interface, computer).

7.2 Software implementation

In the context of this research, a software library was written in C++, which can be used to develop software based on the method presented in this thesis. The software library consists of a cross-platform C++ class. It has been compiled and tested successfully on various Windows and Linux versions. By default, the library sets the model parameters according to the optimized values given in Chapter 6. However, the library includes functions for modifying these parameters.

Figure 7.3a shows how the software library could be used to develop software for assessing objective parameters related to room acoustics. The software library accepts arbitrary binaural audio recordings as input and outputs the four objective parameters defined in Chapter 3. Also, the library can easily be linked by developers to existing measurement or simulation software, such that the objective parameters resulting from the model can be determined in addition to the conventional ones. Figure 7.3b shows the schematics of such a design.

7.3 Minimum signal length

Since the method presented here works on arbitrary binaural audio input material instead of BRIRs, it is important to specify what the minimum length of the input signal should be to yield accurate objective results about the acoustics of a room. To test this, 10 audio signals were constructed using the anechoic speech sample used in the listening tests (Chapter 5) and the BRIRs from listening test III. The length of the resulting signals was varied by truncating them at times T_trunc ranging from 4 to 18 seconds. For each T_trunc, the four objective room acoustical parameters were calculated. The results are shown in Fig. 7.4.

The figure shows that all four parameters are (more or less) stable when the audio signals are at least 10 seconds long. This leads to the conclusion that the minimum signal length for using the method is 10 seconds.
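To make this test concrete, the sketch below shows how such a truncation sweep could be scripted. The Parameters struct, the Analyzer callable and the 2-second step size are assumptions made for illustration; the actual interface of the C++ library described in Section 7.2 is not documented here.

```cpp
#include <cstddef>
#include <cstdio>
#include <functional>
#include <vector>

// Hypothetical result type, for illustration only; the real library's
// types and function names may differ.
struct Parameters { double rev, cla, asw, lev; };

// Any callable that maps a binaural recording (left/right channels,
// sampling rate fs in Hz) to the four S_X parameters, e.g. a thin
// wrapper around the library.
using Analyzer = std::function<Parameters(const std::vector<double>&,
                                          const std::vector<double>&,
                                          double)>;

// Recompute the parameters for increasing truncation times, as in the
// minimum-signal-length test: truncate, analyze, print.
void truncationSweep(const Analyzer& analyze,
                     const std::vector<double>& left,
                     const std::vector<double>& right,
                     double fs)
{
    for (double tTrunc = 4.0; tTrunc <= 18.0; tTrunc += 2.0) {
        const std::size_t n = static_cast<std::size_t>(tTrunc * fs);
        if (n > left.size() || n > right.size()) break;
        const std::vector<double> l(left.begin(),  left.begin()  + static_cast<std::ptrdiff_t>(n));
        const std::vector<double> r(right.begin(), right.begin() + static_cast<std::ptrdiff_t>(n));
        const Parameters p = analyze(l, r, fs);
        std::printf("T_trunc = %4.1f s: S_REV %.2f  S_CLA %.2f  S_ASW %.2f  S_LEV %.2f\n",
                    tTrunc, p.rev, p.cla, p.asw, p.lev);
    }
}
```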

Figure 7.3: A schematic example of how the software library developed within this project can be used to develop software for determining objective room acoustical parameters. The software library accepts arbitrary binaural audio recordings as input and outputs values for the four objective parameters related to reverberance, clarity, ASW and LEV, respectively (figure a; diagram: binaural recording, software library, objective parameters). The software library can also be linked to BRIR measurement or simulation software, making it possible to determine objective parameters using the auditory model inside those software packages (figure b; diagram: anechoic audio file and BRIR, measurement/simulation software, software library, objective parameters).

Figure 7.4: The objective parameters S_REV, S_CLA, S_ASW and S_LEV as a function of sample truncation time T_trunc (panels: (a) reverberance, (b) clarity, (c) ASW, (d) LEV; x-axis: truncation time in seconds). The plot symbols correspond to the ten rooms used in listening test III: AR, LR, SW, HW, AUD1, AUD5, AUD6, SC, AUD8 and RC.

7.4 Robustness to noise

Signals measured in rooms will always have a finite signal-to-noise ratio, which might influence the values of the objective parameters resulting from the auditory model. For measuring room impulse responses for assessing conventional room acoustical parameters, ISO specifies a minimum SNR of 45 dB [ISO, 2009]. To find the minimum SNR needed to yield accurate results using the auditory model, the objective parameters were obtained at various SNRs and in different acoustical environments.

7.4.1 Test rooms

In order to test the robustness of the method in different acoustical environments, a set of 100 virtual rooms was constructed using the simulator for shoebox-shaped rooms (Chapter 4). The different properties of the rooms, like the boundary absorption coefficients and the room dimensions, were varied randomly to obtain a range of different room acoustical properties. Inside each room, random source and receiver locations were defined, at which the parameters will be obtained. Table 7.1 shows the ranges for some of the conventional room acoustical parameters in the virtual rooms. The ranges for the new objective parameters S_X, as obtained using the male speech stimulus used in the listening tests, are shown in Table 7.2. In Section 7.1, two different approaches for determining the objective parameters have been discussed: the indirect method and the direct method. Both approaches will be tested on their robustness below.

Table 7.1: The minimum, maximum and mean values of some conventional room acoustical parameters (T_20 in s, C_80 in dB, IACC_E and IACC_L) for the 100 random rooms used in this section.

7.4.2 Robustness of the indirect method

In order to test the robustness of the indirect method, virtual BRIR measurements were carried out in the virtual test rooms at various SNRs. These BRIR measurements were performed using an exponential sweep signal s(t) (Chapter 2). The length of the sweep signal was chosen to be 10 seconds, which is equal to the stimulus length that will be used to test the robustness of the direct method (see below). In order to make a fair comparison, the lengths of both stimuli were chosen to be equal.

Table 7.2: The minimum, maximum and mean values of the four room acoustical parameters S_X (S_REV, S_CLA, S_ASW and S_LEV) as determined using the auditory model for the 100 random rooms used in this section. These values were obtained using the male speech stimulus used in the listening tests.

To simulate a measurement using this sweep signal in a virtual room, s(t) was convolved with the binaural room impulse response h_{SR}(t):

p(t) = h_{SR}(t) \ast s(t).   (7.1)

To model a finite SNR, this virtual measurement should be corrupted by noise. In order to do so, a pink noise signal w(t) is added to the measurement result. This pink noise signal has a 1/f spectrum for frequencies above 20 Hz and is normalized to an RMS amplitude of 1. The resulting total signal is called p̃(t):

\tilde{p}(t) = p(t) + p_{\mathrm{RMS}} \cdot 10^{-\mathrm{SNR}/20} \, w(t),   (7.2)

where p_RMS is the RMS amplitude of p(t) and SNR is the target signal-to-noise ratio in dB. The new BRIR h̃_{SR}(t) is now obtained by deconvolving p̃(t) with s(t), as explained in Chapter 2:

\tilde{h}_{SR}(t) = \tilde{p}(t) \ast^{-1} s(t).   (7.3)

Next, the audio sample ỹ_ind(t) is constructed by convolving h̃_{SR}(t) with an anechoically recorded audio signal x(t):

\tilde{y}_{\mathrm{ind}}(t) = \tilde{h}_{SR}(t) \ast x(t),   (7.4)

where the subscript "ind" stands for indirect. In this test, x(t) was the male speech sample used in the listening tests (see Chapter 5). Finally, ỹ_ind(t) is used as input for the auditory model, resulting in the objective parameters S_X.

For each room, the SNR was set to 20, 30, 40 and 50 dB successively. For each SNR, the results were compared with the noise-free results (i.e. SNR = ∞ dB). The results are shown in Fig. 7.5 for each objective parameter separately. In these graphs, the + symbols denote outliers (see Section for the definition of an outlier).
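Equation 7.2 (and Eq. 7.6 below) amounts to scaling a unit-RMS noise signal by p_RMS · 10^(-SNR/20) before adding it to the clean signal. A minimal sketch of that step is given below; the pink noise buffer itself is assumed to be generated elsewhere, already normalized to unit RMS and at least as long as the clean signal.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Root-mean-square amplitude of a signal.
double rms(const std::vector<double>& x)
{
    double sum = 0.0;
    for (double v : x) sum += v * v;
    return x.empty() ? 0.0 : std::sqrt(sum / static_cast<double>(x.size()));
}

// Corrupt a clean signal with noise at a target SNR (in dB), following
// Eq. 7.2: p~(t) = p(t) + p_RMS * 10^(-SNR/20) * w(t), where w(t) is a
// unit-RMS (pink) noise buffer generated elsewhere.
std::vector<double> addNoiseAtSnr(const std::vector<double>& clean,
                                  const std::vector<double>& unitRmsNoise,
                                  double snrDb)
{
    const double gain = rms(clean) * std::pow(10.0, -snrDb / 20.0);
    std::vector<double> noisy(clean.size());
    for (std::size_t i = 0; i < clean.size(); ++i)
        noisy[i] = clean[i] + gain * unitRmsNoise[i];
    return noisy;
}
```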

From the results, it follows that the indirect method is quite robust. For SNRs of 30 dB and higher, the absolute error in the objective parameters is smaller than 0.1 for all four parameters, in all the tested rooms.

Figure 7.5: The absolute error in the objective parameters S_X for various SNRs when using the indirect method (panels: (a) S_REV, (b) S_CLA, (c) S_ASW, (d) S_LEV; x-axis: SNR in dB). The + symbols correspond to values which are considered to be outliers.

7.4.3 Robustness of the direct method

To test the robustness of the direct method it is not necessary to perform virtual BRIR measurements in the virtual test rooms. Instead, the starting point is a clean measurement y(t), constructed by convolving the noise-free BRIR h_{SR}(t) with the anechoically recorded audio signal x(t) (again, the male speech signal with a length of 10 seconds was used):

y(t) = h_{SR}(t) \ast x(t).   (7.5)

Note that this procedure looks like the indirect method: the input signal is obtained by convolving an anechoic signal with a BRIR. However, the output signal obtained this way is exactly equal to that of a simulation of the direct method, since the shoebox simulator outputs noise-free BRIRs and since linearity is assumed. This extra step is only necessary because the shoebox simulator is designed to output BRIRs rather than ordinary binaural signals.

To model a finite SNR in the direct method, noise should be added to y(t) to yield the corrupted measurement ỹ(t). Note that this part of the procedure differs from the one in the previous section, where noise was added during the virtual BRIR measurement instead. Again, a pink noise signal w(t) is added:

\tilde{y}_{\mathrm{dir}}(t) = y(t) + y_{\mathrm{RMS}} \cdot 10^{-\mathrm{SNR}/20} \, w(t),   (7.6)

where y_RMS is the RMS amplitude of y(t) and the subscript "dir" corresponds to direct. This ỹ_dir(t) is used as input for the auditory model, and again the results S_X are compared with the noise-free parameters. The results are shown in Fig. 7.6. Note that different scales are used for the y-axes in the figure.

As can be seen, the direct method is much less robust than the indirect method. This is because ỹ_dir(t) will effectively contain more noise than ỹ_ind(t). By measuring BRIRs using sweep signals, the effective SNR is higher than the true SNR, because the signal energy is smeared out over time, resulting in a relatively higher signal level after the deconvolution process [Berkhout et al., 1980].

7.5 Influence of SPL errors

The auditory model used in this research is non-linear, meaning that its results depend on the sound pressure level of the input signal (Chapter 3). As a consequence, the measurement system used to obtain the input signal(s) should be calibrated. The question arises how accurate this calibration should be. In order to test this, the objective parameters S_X are determined for the 100 virtual test rooms using the male speech signal, where the signal SPL is intentionally altered in the range -5 to +5 dB. The absolute errors in the four parameters S_X for these SPL errors are plotted in Fig. 7.7.

From the results, it can be concluded that S_REV and S_LEV are most sensitive to SPL errors. This is not surprising, because these parameters depend, S_REV completely and S_LEV partially, on the average level of the reverberant stream L_REV (Eq. 3.2). Generally, the reverberant stream has a low amplitude, and it is at low amplitudes where the auditory model becomes most sensitive to SPL changes (Section 3.5.1). Hence, small SPL changes may lead to large changes in L_REV and thus in the parameters related to reverberance and LEV.
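Because the model is level-dependent, an SPL error of ΔL dB is equivalent to scaling the calibrated input pressure signal by a factor of 10^(ΔL/20) before analysis; this is how the intentional level offsets in this test can be introduced. A minimal sketch, assuming the signal is held in a simple sample buffer:

```cpp
#include <cmath>
#include <vector>

// Apply an (intentional) SPL error of splErrorDb decibels to a calibrated
// pressure signal by scaling it with 10^(error/20). Used here only to probe
// how sensitive the level-dependent auditory model is to calibration errors.
void applySplError(std::vector<double>& pressure, double splErrorDb)
{
    const double gain = std::pow(10.0, splErrorDb / 20.0);
    for (double& sample : pressure) sample *= gain;
}
```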

Figure 7.6: The absolute error in the objective parameters S_X for various SNRs when using the direct method (panels: (a) S_REV, (b) S_CLA, (c) S_ASW, (d) S_LEV; x-axis: SNR in dB). The + symbols correspond to values which are considered to be outliers.

The S_CLA parameter also depends on L_REV, because it is equal to the ratio L_DIR/L_REV (Eq. 3.21). However, S_CLA is less sensitive to SPL changes, since amplitude changes in L_REV will be partially compensated for by changes in L_DIR. For example, if L_REV is higher, L_DIR will generally also be higher. Because the model is non-linear, however, the increase in L_REV will generally not be equal to the increase in L_DIR. This explains why S_CLA is still somewhat sensitive to SPL changes.

It can be concluded from the figure that a measurement system should be calibrated with an accuracy of about 1 dB to keep the errors in the parameters S_X smaller than 0.05.

7.6 Stimulus types

The AMARA method depends on the stimulus type, as does the perception of room acoustics, as shown in Chapter 5. The question arises which kinds of audio signals are suitable to be used as test signals and whether perhaps signal types exist that are not suitable.

Figure 7.7: The absolute error in the objective parameters S_X for SPL errors in the range -5 to +5 dB (panels: (a) S_REV, (b) S_CLA, (c) S_ASW, (d) S_LEV; x-axis: SPL error in dB). The + symbols correspond to values which are considered to be outliers. Note that for readability different scales were used along the y-axes of the graphs.

To look further into this topic, three categories of audio signals are used as inputs for the model: voice stimuli, instrument stimuli and ensemble stimuli. The measured BRIRs for the 10 rooms used in listening test III were used as test rooms in this section.

7.6.1 Voice stimuli

In the voice category, three different stimuli were tested: male speech, female speech and female singer. All three signals were recorded anechoically. Figure 7.8 shows spectrograms of the three stimuli. The male speech sample was the same sample as the one used in the listening tests. The female speech signal was downloaded from the University of Sydney website and the female singer signal from the ODEON website.

Figure 7.9 shows the results for the four objective parameters S_X for the three voice signals. Several remarkable things can be noted from the results:

(1) The difference in the parameter results between the male and female speech is small.

(2) The parameter results for the female singer sample differ from the other two. For example, S_REV is consistently lower for the female singer sample. This can be explained by the fact that the female singer sample contains mostly high-frequency (> 500 Hz) and almost no low-frequency (< 500 Hz) content. Rooms are generally more reverberant in the lower frequencies. Furthermore, the female singer sample is less staccato and more continuous compared to the two speech samples, as can be seen in Fig. 7.8. Therefore, it is likely that, compared with the speech samples, more reverberant energy is being masked by the direct sound. An informal listening test by the author confirmed that the female singer sample indeed sounded much less reverberant than the male speech samples for the various rooms.

(3) As a result of the lower amount of reverberance in the singer stimulus, the clarity is higher compared with the other two stimuli. Less masking of the direct sound by reverberant sound will occur, making it easier to hear the different direct signal components.

(4) Besides S_REV, S_ASW is also (much) lower for the female singer sample for all rooms. This might also have to do with the lack of low-frequency content (as discussed in Chapter 2, it has been found that low-frequency energy contributes to ASW). In the informal listening test the female singer sample indeed sounded less broad.

Figure 7.8: Spectrograms (in dB) of the three voice stimuli: (a) male speech, (b) female speech, (c) female singer.

Figure 7.9: The objective parameters S_X for the three voice stimuli (male speech, female speech, singer) and the 10 rooms used in listening test III (panels: (a) S_REV, (b) S_CLA, (c) S_ASW, (d) S_LEV; rooms: AR, LR, SW, HW, AUD1, AUD5, AUD6, SC, AUD8, RC).

7.6.2 Instrument stimuli

Three stimuli were tested in the instrument category: cello, flute and guitar. The cello sample was the same as the one used in the listening tests, and the flute and guitar samples were both taken from the University of Sydney website (see the previous section). They originate from the Music for Archimedes CD. The guitar was an acoustic (classical/Spanish) guitar, playing a mixture of chords and individual notes. Figure 7.10 shows the spectrograms of the three stimuli. The results for the objective parameters for the three instrument samples are shown in Fig. 7.11. The following things are notable from the results:

(1) There is not much difference in the results between the cello and flute samples. S_REV is slightly lower for the flute samples, which can be explained by the lower amount of low-frequency content. This also has an effect on S_ASW, as expected.

(2) The guitar samples have much higher S_REV values, except for the anechoic room (AR) and the reverberation chamber (RC). This is probably due to the fact that the guitar is more broadband and more staccato compared to the cello and the flute. An informal listening test by the author confirmed this difference. For example, for the third room (SW), almost no reverberation is perceived for the cello and flute samples, as predicted by the model. In the guitar sample, much more reverberation is perceived, although the difference does not seem as large as predicted by S_REV.

(3) Clarity is lower for the guitar samples in all cases. As with the female singer stimulus discussed in the previous section, this is a consequence of the amount of reverberance for this particular stimulus, which is high in this case. The clarity values for the cello and flute stimuli are similar.

(4) The guitar stimulus also shows higher values for S_LEV. Because of the staccato playing style, more reverberant energy will be perceived in between the direct components, as discussed above. It is the amount of energy in these parts of the signal, and the resulting ITD fluctuations, which determine the amount of perceived LEV (see Chapter 2).

Figure 7.10: Spectrograms (in dB) of the three musical instrument stimuli: (a) cello, (b) flute, (c) guitar.

Figure 7.11: The objective parameters S_X for the three instrument stimuli (cello, flute, guitar) and the 10 rooms used in listening test III (panels: (a) S_REV, (b) S_CLA, (c) S_ASW, (d) S_LEV; rooms: AR, LR, SW, HW, AUD1, AUD5, AUD6, SC, AUD8, RC).

7.6.3 Ensemble stimuli

Four ensemble stimuli (signals containing multiple musical instruments) were tested: bassoon duo, jazz combo, orchestra and pop music. The bassoon duo stimulus was downloaded from the ODEON website (see Section 7.6.1). The jazz combo and orchestra stimuli were taken from the website of the University of Sydney. The title of the jazz piece is unknown; the orchestral piece is an excerpt of the Fifth Symphony by Shostakovich. Both samples originate from the Denon Professional Test CD 2. The pop music stimulus is an excerpt from "Stuck in a moment you can't get out of" by U2, from the album "All that you can't leave behind". The spectrograms of the four stimuli are shown in Fig. 7.12.

Figure 7.13 shows the objective parameter results for the ensemble stimuli. The following is notable from the results:

(1) The jazz and pop music stimuli result in higher values for S_REV and lower values for S_CLA. For the jazz stimulus, this can be explained by its staccato nature (see Fig. 7.12). The pop music stimulus shows much higher values for S_REV and lower S_CLA values. This is most probably due to the fact that the pop music stimulus is the only stimulus not recorded anechoically; the dry sample already contained reverberation, applied when the song was recorded and mixed. This reverberation will add to the reverberation of the virtual rooms.

(2) For the S_ASW parameter, there is not much difference between the different stimuli. However, a strange result was found for the pop music stimulus in the RC (reverberation chamber) room; S_ASW is almost zero in this case, while it is close to one for the other three stimuli in this particular room. A closer examination of this stimulus/room combination revealed that the resulting sample had such a high amount of reverberation that all the energy was assigned to the reverberant stream by the auditory model. As a result, there was no energy present in the detected direct stream, which means that it was impossible to define the amount of ITD fluctuation in this direct stream (see Eq. 3.24).

(3) The values for S_LEV are quite similar across the stimuli, although the jazz combo and pop music stimuli show slightly higher values. For the jazz stimulus, this can be explained by the staccato playing style, which leads to more space in between the direct components (see also the previous section and the discussion regarding the guitar stimulus). For the pop music stimulus, the higher value of S_LEV is probably a result of the (stereo) reverberation applied to the recording, possibly combined with other audio effects like artificial widening of the stereo image.

Figure 7.12: Spectrograms (in dB) of the four ensemble stimuli: (a) bassoon duo, (b) jazz combo, (c) orchestra, (d) pop music.

Figure 7.13: The objective parameters S_X for the four ensemble stimuli (bassoon duo, jazz, orchestra, U2) and the 10 rooms used in listening test III (panels: (a) S_REV, (b) S_CLA, (c) S_ASW, (d) S_LEV; rooms: AR, LR, SW, HW, AUD1, AUD5, AUD6, SC, AUD8, RC).

7.6.4 Suitable test signals

From the previous results, it can be concluded that different stimulus types indeed yield different results for the various S_X parameters. Almost all of these differences can be explained from a perceptual point of view. The results for the pop music sample show that it is important to use anechoic recordings; the audio processing applied to this sample had an influence on the parameters. However, there could be cases where the use of processed recordings might be a good choice, for example when a sound reproduction system needs to be tested in a room under realistic conditions.

When different rooms with very similar room acoustical properties need to be compared using the AMARA method, it is best to use the same source stimulus in each room. This way, differences in the objective parameters will only be a result of acoustical differences between the rooms and not of differences in the source signal properties. Therefore, using a database with signals that are representative for different signal classes might be practical. This will be further discussed in Chapter 8.

7.7 Using microphones instead of a dummy head

The measurement methods that can be used to obtain objective parameters using the AMARA method were discussed in Section 7.1. The method involves recording the sound field through a dummy head. Since dummy heads are quite expensive, such a device will not always be available. Therefore, a relevant question is whether it is possible to use just a pair of microphones, one for each ear, to record the sound field adequately for the purpose of parameter extraction. In order to mimic the directional properties of the human head and ears, a possible choice for the type of microphone would be one with a cardioid sensitivity pattern.

Figure 7.14: In this section, it is examined whether it is possible to use a pair of cardioid microphones for recording a sound field for obtaining objective acoustical parameters, as an alternative to using a dummy head.

To test the alternative measurement method using a pair of cardioid microphones, the impulse responses of the 100 virtual test rooms (as discussed in Section 7.4.1) were simulated using two of these microphones at the position of the ears of the dummy head, as shown in Fig. 7.14.

The distance between these microphones was set to 20 cm. Each of the resulting 100 impulse response pairs was used to construct stereo samples by convolving the responses with the anechoically recorded male speech sample. From the resulting stereo audio samples, the four objective parameters S_X were determined and compared with the ones obtained using the dummy head.

The absolute errors (differences) between the parameters obtained using both methods are plotted in Fig. 7.15. As can be seen, the error is much larger for the parameters related to spaciousness (S_ASW and S_LEV) than for the other two. Apparently, the ITD fluctuations measured using the cardioid microphones are too different from the ones obtained using the dummy head, which leads to errors in S_ASW and S_LEV which are larger than 0.5 on average. Therefore, it is not recommended to use microphones instead of a dummy head when parameters related to spaciousness need to be obtained. If only S_REV and S_CLA need to be measured, it might even be possible that one omnidirectional microphone is sufficient for recording the sound field, since no binaural information is necessary for determining these two parameters. However, this has not been tested so far.

Figure 7.15: The absolute error (difference) between the objective parameters S_X (S_REV, S_CLA, S_ASW, S_LEV) as obtained by determining the room impulse responses using a pair of cardioid microphones versus using a dummy head. The plus symbols denote values which are considered to be outliers.
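For the virtual microphones in this test, each arriving wavefront has to be weighted with a directivity gain. A common first-order model is assumed below purely for illustration; the thesis' shoebox simulator may implement microphone directivity differently.

```cpp
#include <cmath>

// Gain of an ideal first-order cardioid microphone for a wave arriving at
// angle theta (in radians) from the microphone's main axis:
// g(theta) = 0.5 * (1 + cos(theta)). A real cardioid microphone deviates
// from this, especially at high frequencies.
double cardioidGain(double theta)
{
    return 0.5 * (1.0 + std::cos(theta));
}

// In an image-source simulation, each image source's contribution to the
// virtual microphone signal would be weighted by this gain, evaluated for
// its direction of incidence relative to the microphone axis.
```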

7.8 Influence of the dummy head position

In Section 2.5.1, the values of various conventional room acoustical parameters, as obtained from room impulse responses that were measured along a microphone array in the Concertgebouw in Amsterdam, were presented. From the results it was clear that some of these parameters suffer from severe fluctuations as a function of offset, often more than one just noticeable difference within one seat position. In this section, it will be examined whether the parameters as obtained by the auditory model also show spatial fluctuations and, if so, how these fluctuations compare to those of the conventional parameters. Since the measurement in the Concertgebouw was also carried out using a KEMAR dummy head [IEC, 1990], the new objective parameters can be determined at the same positions as the conventional ones.

The results for all four S_X parameters are presented in Fig. 7.16. When comparing these results with Figs. 2.7 and 2.8, it can be said that the conventional parameters seem to fluctuate more as a function of offset than the new parameters. One way to test this formally is to express the amount of fluctuation δ_fluc as a function of offset for a parameter X by its standard deviation relative to its mean absolute value:

\delta_{\mathrm{fluc}} = \frac{\sigma_X}{|\overline{X}|} \cdot 100\%.   (7.7)

Table 7.3: The amount of spatial fluctuation, as expressed using δ_fluc, for the conventional versus the new objective parameters related to reverberance, clarity, ASW and LEV.

Attribute      Conventional parameter         New parameter
Reverberance   (T_20)         1.9%            (S_REV)   7.2%
Clarity        (C_80)         40%             (S_CLA)   7.5%
ASW            (1 - IACC_E)   13%             (S_ASW)   2.7%
LEV            (1 - IACC_L)   2.9%            (S_LEV)   3.2%

Table 7.3 shows the δ_fluc values for the various parameters. For reverberance, δ_fluc is lower for the conventional parameter (T_20) than for the new parameter S_REV. For clarity and ASW, the new parameters have lower δ_fluc values than the conventional ones, while for LEV the values are of the same order (although the value for S_LEV is slightly higher).

However, there are some issues with using δ_fluc as a quantity for making comparisons. First of all, if lower δ_fluc values are considered to be better, implicitly the assumption is made that the corresponding parameter should be constant over the full range of measured offset positions. This is something that is not necessarily true, perceptually speaking. Secondly, the just noticeable differences for the conventional parameters are sometimes absolute instead of relative to the parameter value, as was discussed in Chapter 2. In fact, of the four conventional parameters tested, only T_20 has a just noticeable difference that depends on the parameter value. Preferably, a measure for the amount of fluctuation should take JNDs, and how they relate to the absolute parameter value, into account.
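A small sketch of how δ_fluc in Eq. 7.7 can be evaluated for a series of parameter values measured along the array. The use of the population standard deviation (dividing by N) and the reading of "mean absolute value" as the absolute value of the mean are assumptions; the thesis does not spell out these details.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// delta_fluc (Eq. 7.7): the standard deviation of a parameter over all
// measurement positions, relative to the absolute value of its mean,
// expressed in percent.
double deltaFluc(const std::vector<double>& values)
{
    if (values.empty()) return 0.0;
    const double n = static_cast<double>(values.size());

    double mean = 0.0;
    for (double v : values) mean += v;
    mean /= n;

    double var = 0.0;
    for (double v : values) var += (v - mean) * (v - mean);
    const double sigma = std::sqrt(var / n);

    return sigma / std::fabs(mean) * 100.0;
}
```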

Figure 7.16: The room acoustical parameters S_X as a function of measurement location (panels: (a) S_REV, (b) S_CLA, (c) S_ASW, (d) S_LEV; x-axis: offset in m), obtained from impulse response measurements carried out in the Concertgebouw in Amsterdam using a dummy head.

Unfortunately, values for the just noticeable differences for the S_X parameters are as yet unknown (see Chapter 8 for a further discussion on this matter). On the other hand, the first issue with δ_fluc can be solved by assuming that the parameters should be constant within the offset range of one seat when the measurements are performed along a linear array. As in Chapter 2, the width of one seat was set to 50 cm. To evaluate the amount of fluctuation within one seat, δ_seat was defined as:

\delta_{\mathrm{seat}} = \frac{1}{N} \sum_{n=1}^{N} \frac{\sigma_{X_n}}{|\overline{X}_n|} \cdot 100\%,   (7.8)

in which X_n is the subset of parameter values obtained from the room impulse responses measured in interval n, and N is the total number of intervals (seats).

Table 7.4 shows the results for δ_seat. From these results, basically the same conclusions can be drawn as for δ_fluc. For clarity and ASW, the S_X parameters show less fluctuation, relatively speaking. For reverberance and LEV, the parameters show more fluctuation, although the δ_seat values are still smaller than 4%. Still, information about the JNDs of the new parameters is necessary to be able to compare the amount of spatial fluctuation between the conventional and the new method quantitatively. However, even without knowledge of the JNDs of the new parameters, it can be predicted that the new parameters probably suffer less from spatial fluctuations than the conventional ones. If the new parameters fluctuated more, the errors between the parameters and the listening test results would be higher and, as a result, the correlation coefficients between the objective and the perceptual data would be lower. This is not the case; the new parameters show higher correlation results in most cases.

Table 7.4: The amount of spatial fluctuation, as expressed using δ_seat, for the conventional versus the new objective parameters related to reverberance, clarity, ASW and LEV.

Attribute      Conventional parameter         New parameter
Reverberance   (T_20)         1.4%            (S_REV)   3.3%
Clarity        (C_80)         19%             (S_CLA)   3.8%
ASW            (1 - IACC_E)   5.2%            (S_ASW)   1.7%
LEV            (1 - IACC_L)   1.9%            (S_LEV)   2.2%
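Similarly, Eq. 7.8 averages the relative standard deviation over all 50 cm intervals; a sketch is given below. As before, the population standard deviation and the use of the absolute value of the per-interval mean are assumptions made for illustration.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// delta_seat (Eq. 7.8): the average, over all seat-wide intervals, of the
// relative standard deviation of the parameter within each interval.
// 'perSeat' holds one vector of parameter values per 50 cm interval.
double deltaSeat(const std::vector<std::vector<double>>& perSeat)
{
    if (perSeat.empty()) return 0.0;

    double sum = 0.0;
    for (const std::vector<double>& seat : perSeat) {
        const double n = static_cast<double>(seat.size());

        double mean = 0.0;
        for (double v : seat) mean += v;
        mean /= n;

        double var = 0.0;
        for (double v : seat) var += (v - mean) * (v - mean);
        const double sigma = std::sqrt(var / n);

        sum += sigma / std::fabs(mean) * 100.0;
    }
    return sum / static_cast<double>(perSeat.size());
}
```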

7.9 Influence of the dummy head orientation

Besides the measurement of binaural room impulse responses at closely spaced positions along a line, it is also possible to perform measurements for different orientations of the dummy head. In [Witew et al., 2010b], the early interaural cross-correlation coefficient (1 - IACC_E) was determined for a range of angles at a fixed position in three different auditoria (see Table 7.5). The binaural room impulse responses were measured using a FABIAN dummy head. This dummy head was developed by the Technical University of Berlin and can be rotated in azimuth and elevation above its torso, making it possible to make the measurement for different angles (azimuth and elevation) a fully automated process [Lindau et al., 2007]. The measurements were also carried out by the University of Berlin.

Table 7.5: Information on the three auditoria used in this section (volume V in m^3 and reverberation time RT in s).

No.  Auditorium
1    Site of the TU-Berlin WFS system
2    TUB medium/large lecture hall
3    Recording room of Teldex Studio Berlin

In [Witew et al., 2010b], (1 - IACC_E) was calculated for various azimuth angles (with a fixed elevation of zero degrees), after which an analysis was carried out of the uncertainties in this coefficient with respect to the orientation of the head, using the Guide to the Expression of Uncertainty in Measurement (GUM, see Chapter 2). From this, it was found that an uncertainty in the head's azimuth of 1 to 2 degrees is allowed to keep the error in (1 - IACC_E) smaller than one JND. The exact value for this maximum angle deviation depends on which JND is used as a reference. Different JND values for the early interaural cross-correlation coefficient are found in the literature, ranging from 0.04 to 0.075 [Witew et al., 2010b].

Figures 7.17, 7.18 and 7.19 show the results for (1 - IACC_E) as a function of azimuth for the three rooms, respectively. For completeness, the values of (1 - IACC_L) are also shown. As can be seen in the graphs, (1 - IACC_L) is quite constant as a function of angle, and its value is consistently high (close to the maximum of 1). Since only binaurally measured data is available, conventional parameters other than the ones related to the interaural cross-correlation could not be determined.
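Since the cross-correlation based measures are the only conventional parameters available here, it may help to recall how they are obtained. The sketch below computes the IACC over a given time window of a binaural impulse response in the usual ISO 3382 style (normalized interaural cross-correlation, maximized over lags of up to ±1 ms); taking t = 0 at the arrival of the direct sound, IACC_E uses the window 0 to 80 ms, and (1 - IACC_E) is the spaciousness-related quantity plotted in the figures. The brute-force lag search and the exact windowing conventions are simplifications, not the implementation used for the measurements.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Interaural cross-correlation coefficient over [t1, t2] seconds of a
// binaural impulse response, maximized over lags |tau| <= 1 ms.
// For IACC_E use t1 = 0.0 and t2 = 0.08 (t = 0 at the direct sound).
double iacc(const std::vector<double>& left,
            const std::vector<double>& right,
            double fs, double t1, double t2)
{
    const std::size_t n1 = static_cast<std::size_t>(t1 * fs);
    const std::size_t n2 = std::min({ static_cast<std::size_t>(t2 * fs),
                                      left.size(), right.size() });
    const int maxLag = static_cast<int>(0.001 * fs);  // +/- 1 ms

    // Normalization: energies of both channels over the same window.
    double eL = 0.0, eR = 0.0;
    for (std::size_t n = n1; n < n2; ++n) {
        eL += left[n] * left[n];
        eR += right[n] * right[n];
    }
    const double norm = std::sqrt(eL * eR);
    if (norm == 0.0) return 0.0;

    // Maximum of the absolute normalized cross-correlation over all lags.
    double best = 0.0;
    for (int lag = -maxLag; lag <= maxLag; ++lag) {
        double sum = 0.0;
        for (std::size_t n = n1; n < n2; ++n) {
            const long long m = static_cast<long long>(n) + lag;
            if (m >= 0 && m < static_cast<long long>(right.size()))
                sum += left[n] * right[static_cast<std::size_t>(m)];
        }
        best = std::max(best, std::fabs(sum / norm));
    }
    return best;  // (1 - iacc) is the spaciousness-related measure.
}
```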

In the paper, the values for S_ASW (there called ASW_model) for the three different rooms as a function of azimuth were also presented. The results for the stimulus types male speech and cello and the three rooms are shown in Figures 7.20, 7.21 and 7.22. The results for S_REV, S_CLA and S_LEV are also added to the figures. Note that the results for S_ASW in the figures are slightly different from the ones in [Witew et al., 2010b]. This is because the auditory model used to calculate the results for this thesis included some improvements over the one used in the earlier paper.

As expected, the monaural parameters S_REV and S_CLA are almost constant over the complete range of angles. The parameters related to spaciousness, S_ASW and S_LEV, tend to have a minimum at the zero-degree angle, where the dummy head is exactly facing the sound source. For off-axis angles, both parameters increase. This is also expected, because in most situations the decorrelation between the ears will increase with increasing angle; one ear will receive increasingly more direct sound, while the other ear receives increasingly more indirect sound from boundary reflections. This increasing decorrelation is likely to lead to higher ASW and LEV. ASW seems more affected by head rotation than LEV. This is explained by the fact that ASW depends on the amount of decorrelation in the early reflections, while LEV is affected by the late reflections. When the head rotates, the early reflection pattern is more likely to change than the late reflection pattern, which will be mostly diffuse anyway.

When comparing the new parameters S_X with the conventional ones based on the IACC, it can be seen that there is not much difference in terms of angular dependence. Making a quantitative comparison will be a difficult task, since it is unknown how the perceptual attributes should change as a function of angle for these particular rooms. Therefore, measures like δ_fluc and δ_seat, as proposed in the previous section, cannot be defined. Here too, perceptual data for these rooms and angles, as well as knowledge of the JNDs of the parameters, are necessary to make a proper quantitative comparison between the performance of the conventional parameters and the new ones. This is a proposed topic for further research, see Chapter 8.

From the graphs, it can be seen that S_REV and S_LEV are higher for the speech stimulus than for the cello stimulus in all three rooms. This can be explained by the differences in the properties of both source signals; the cello stimulus is more continuous, while the speech stimulus is more discontinuous. As a result, more reverberance and LEV will be perceived in between the direct signal components of the speech samples. There is then also a higher probability that direct signal components are being masked by the reverberant energy, which is why S_CLA is lower for the speech samples in all cases.

The parameter that changes the most as a function of angle is S_ASW. This is probably due to the fact that it is the only parameter that is mostly dependent on the early reflection pattern. The pattern of the early reflections is more likely to change as a result of head rotation than, for example, the amount of reverberant energy. What is remarkable from the figures is that the speech samples seem more sensitive to changes in angle with respect to S_ASW than the cello samples, at least for rooms 1 and 2. This can be explained by the fact that the direct signal components of the cello samples will usually be longer in time than those of the more discontinuous speech samples (Section 3.6.2). Hence, the direct signal components of the cello samples are more likely to contain late, diffuse reverberation besides energy belonging to the first few orders of reflections in a room. The diffuse reverberation is less likely to change as a function of angle, and this explains why ASW will change less for the cello samples.

Figure 7.17: The early and late interaural cross-correlation coefficients, (a) (1 - IACC_E) and (b) (1 - IACC_L), as a function of azimuth (in degrees), as measured in auditorium 1.

Figure 7.18: The early and late interaural cross-correlation coefficients, (a) (1 - IACC_E) and (b) (1 - IACC_L), as a function of azimuth (in degrees), as measured in auditorium 2.

Figure 7.19: The early and late interaural cross-correlation coefficients, (a) (1 - IACC_E) and (b) (1 - IACC_L), as a function of azimuth (in degrees), as measured in auditorium 3.

Figure 7.20: The four S_X parameters ((a) S_REV, (b) S_CLA, (c) S_ASW, (d) S_LEV) as a function of azimuth (in degrees), as measured in auditorium 1. Solid line: male speech stimulus, dashed line: cello stimulus.

Figure 7.21: The four S_X parameters ((a) S_REV, (b) S_CLA, (c) S_ASW, (d) S_LEV) as a function of azimuth (in degrees), as measured in auditorium 2. Solid line: male speech stimulus, dashed line: cello stimulus.

Figure 7.22: The four S_X parameters ((a) S_REV, (b) S_CLA, (c) S_ASW, (d) S_LEV) as a function of azimuth (in degrees), as measured in auditorium 3. Solid line: male speech stimulus, dashed line: cello stimulus.

7.10 Summary and discussion

Various aspects of applying the AMARA method in practice have been discussed in this chapter. In the first section, it was shown that the method allows for two measurement setups. First, a loudspeaker can be used as the sound source and a dummy head as the receiver. Using this configuration, the objective room acoustical parameters S_X can be determined in two ways: directly, by playing back a certain stimulus through the loudspeaker, or indirectly, by measuring a binaural room impulse response first. This BRIR can be convolved with various anechoically recorded audio samples to test a room for different applications. A second possible measurement setup does not make use of a loudspeaker; instead, a natural sound source is recorded directly (this measurement setup does not allow for the indirect method). The first setup is more flexible because various source types can be tested quite easily. Furthermore, a much higher reproducibility is expected than when natural sound sources are used.

In the second section, it was discussed how the complete algorithm can be implemented in software, both in new applications and in existing ones. A software library was developed with all the necessary functions to obtain the objective parameters from a binaural audio input signal, which makes it easy to link the code to other software projects.

Next, the minimum input signal length for obtaining accurate results was determined by calculating the parameters for different signal lengths. It was found that a minimum signal length of 10 seconds is required.

The robustness of the method to noise was also tested. This was carried out by determining the objective parameters for a large number (100) of simulated rooms while the level of noise was varied. From a comparison with the noise-free results, the conclusion can be drawn that the indirect method is much more robust to noise than the direct method. Therefore, it is recommended to apply the indirect method (i.e. to convolve anechoically recorded samples with measured BRIRs) when the SNR is low. However, if the objective parameters need to be obtained in an occupied room, the direct method will mostly be preferred. In that case, it is important that measures are taken to reach the highest SNR possible to make sure that reliable results are obtained.

Since the SPL of the audio signals at the input of the auditory model needs to be known, it is important that the complete measurement system is calibrated. In Section 7.5, the influence of errors in the true SPL value on the objective parameters was examined. The results showed that the average absolute error in the parameters rises above 0.05 for SPL errors larger than ±1 dB. This probably makes the calibration of the measurement system the most critical factor for obtaining accurate results. Therefore, calibration needs to be performed very accurately. Furthermore, ways should be sought to make the auditory model less sensitive to SPL errors; this will be further discussed in Chapter 8.

The objective room acoustical parameters as derived using the auditory model are content-specific, and therefore different types of stimuli were tested in Section 7.6. The stimuli were divided into three groups: voice, instrument and ensemble. In some cases, large differences in the objective parameters were found between the stimulus types, but all of these differences could be explained by considering the temporal and spectral properties of the stimuli and how these properties interact with the acoustics of a room. Informal listening tests confirmed the perceptual differences between the stimuli in those cases. There was one case in which the value for S_ASW was clearly wrong. For that particular audio sample (pop music in a reverberation chamber), the amount of reverberance was extremely high, resulting in the model assigning all energy to the reverberant stream. This left no energy in the direct stream, which made S_ASW solely dependent on the low-frequency level L_LOW. This explains the wrong value for S_ASW, which is much too low. This is, however, an extreme case in which a highly reverberant room was used in combination with a stimulus that was not anechoic. It is highly unlikely that such an error will occur in practice.

In order to apply the AMARA method, a dummy head is used to record the sound field. It was investigated whether this dummy head can be replaced by a pair of (cardioid) microphones as a cheaper alternative. This alternative was tested by calculating the objective parameters for the 100 virtual test rooms using a setup consisting of two (virtual) microphones with a cardioid sensitivity pattern. When the results were compared with the ones obtained using the dummy head, it was found that the errors in the monaural parameters S_REV and S_CLA are quite small, whereas the errors in the parameters related to spaciousness (S_ASW and S_LEV) are much larger. Apparently, a setup using two cardioid microphones does not sufficiently simulate the binaural properties of the human head for obtaining accurate results for these two parameters. In future research, various other microphone setups and types should be tested in order to find cheaper alternatives which do give accurate results.

A last practical aspect that was examined was the fluctuation of the parameters as a function of dummy head position and orientation, and how these fluctuations compare with those of the conventional room acoustical parameters. It was already shown earlier in this thesis, using results from a microphone array measurement in the Concertgebouw in Amsterdam, that some conventional parameters can fluctuate severely over small measurement offset ranges, often by more than their corresponding JNDs. Binaural data, as measured using a dummy head, was also available from the same measurement. Therefore, it was possible to calculate the new S_X parameters for the same positions in the hall and compare the amount of fluctuation. Unfortunately, a true quantitative comparison is impossible, since the JNDs for the S_X parameters have not yet been obtained through experiments. This information is necessary to make statements about what amount of fluctuation is acceptable when evaluated over a certain range of measurement positions. Nevertheless, the δ_seat values were compared, representing the average relative amount of fluctuation within one seat.

The δ_seat results for the new parameters were higher than those for the conventional parameters for the reverberance and LEV attributes. However, even though these δ_seat values were higher, they were still smaller than 5%. Furthermore, even without knowing the JND values for the new parameters, it can be concluded that it is very likely that the new parameters suffer less from spatial fluctuation than the conventional parameters, because the correlation coefficients between these new parameters and the perceptual data are generally higher (Chapter 6).

To test the influence of the dummy head orientation, the new parameters were calculated for sets of BRIRs measured by the Technical University of Berlin in three different rooms using the FABIAN dummy head. This device can rotate the head in azimuth and elevation, thus making it possible to perform BRIR measurements for various angles automatically. The conventional parameters (1 - IACC_E) and (1 - IACC_L) were calculated for these angles of the head, as well as the four new S_X parameters. Also in this case, it is impossible to make a fair comparison, since no perceptual data is available on how much all these parameters should fluctuate as a function of angle. The results look plausible, however, both for the conventional parameters and for the new ones. It can also be seen from the S_X results that how much a parameter changes as a function of angle can depend on the stimulus type. This is especially true for S_ASW. This probably has to do with the fact that ASW is mostly dependent on the early reflections, and it is known that the amount of early reflections that is masked (or not masked) depends on the stimulus type, as explained in Section 7.9.


8 Conclusions and outlook

"[...] finally I made this discovery. A room to have good acoustics must be either long or broad, high or low, of wood or stone, round or square, and so forth..." (Charles Garnier)

8.1 Conclusions

The current methods for assessing room acoustical quality involve the measurement, or simulation, of room impulse responses. From these responses, objective parameters can be calculated that are supposed to be related to various aspects of the perception of room acoustics. For example, the reverberation time and early decay time should be related to reverberance, the clarity indices to clarity, etc. A set of these parameters is now well-established and described in the ISO 3382 standard. However, various problems exist with this method, which often result in a low correlation between the objective results and the perception of the acoustics of a room. These problems include the fact that, because of the nature of the stimuli needed to perform such a measurement and because of measurement times, the impulse responses are often measured in unoccupied rooms, while the acoustics can change dramatically when a room is occupied. Furthermore, the signal type(s) for which the room acoustical qualities need to be assessed are not taken into account; only the clarity index includes two different time limits between early and late arriving energy, to distinguish between applications for speech and for music. Finally, some important aspects of the human auditory system, like non-linearity and masking effects, are not considered.

In Chapter 3, a new method for measuring the acoustical qualities of a room objectively was proposed. This method is called AMARA: Auditory Modelling for Assessing Room Acoustics. It is based on a binaural, non-linear model of the human auditory system.

This auditory model is based on a binaural model proposed by Breebaart [2001] and accepts arbitrary binaural recordings as input signals. The model simulates the most important stages of the auditory system, like the transfer function of the middle ear, the basilar membrane, neural adaptation and binaural interaction. The model presented in this thesis includes some modifications of this original auditory model, as well as an extension with which objective parameters can be determined related to the following perceptual aspects of room acoustics: reverberance, clarity, apparent source width (ASW) and listener envelopment (LEV). This extension includes an algorithm that is able to split the output streams of the model into separate streams belonging to the direct sound (source) and the reverberant sound (environment), respectively. The average levels in the direct and reverberant output streams are used to calculate the new objective parameters related to reverberance and clarity. Binaural interaction is included in the form of a running cross-correlation, which is used to calculate the amount of ITD fluctuation over time. The latter is used, together with the monaural output streams, to determine the new objective parameters related to ASW and LEV.

In total, four listening tests were conducted in order to validate the AMARA method and its resulting objective parameters. These listening tests, as well as the perceptual results, have been presented in Chapter 5. In these tests, groups of subjects had to rate sets of audio samples on the four room acoustical attributes that should be predicted by the new objective parameters. The audio samples consisted of anechoic audio recordings which were convolved with BRIRs from a variety of rooms. In all tests, two different anechoic recordings were used: male speech and cello music. Listening tests I and II were set up using sets of simulated BRIRs for virtual rooms. The binaural impulse responses were simulated using a simulator for shoebox-shaped rooms, which was developed during the course of this project. An overview of this simulator was given in Chapter 4. Tests III and IV used a set of room impulse responses which were measured in real rooms using a dummy head. In test III, all the resulting audio samples were normalized to the same (estimated) loudness level using the Replaygain algorithm. For test IV, the audio samples included the original SPL/loudness differences that existed between the real rooms.

By analyzing the perceptual results statistically, various cases were found where the signal type (speech or music) had a significant effect on the ratings of the room acoustical attributes. These results emphasize the need for the AMARA method, which does take the signal type into account, since it derives objective parameters from arbitrary binaural recordings after they have been processed by the auditory model. Furthermore, by analyzing the results of tests III and IV, it was found that the SPL of an audio sample indeed has a significant effect on the perceptual results in some cases. Since the auditory model is non-linear, the absolute level of the input signal is taken into account in the new method, which is not true for (most of) the conventional room acoustical parameters. It must be noted, however, that no clear correlation was found between the SPL differences and how exactly they influence the perceptual results.

A novel procedure for optimizing the free parameters in the AMARA method, like various frequency limits, was discussed in Chapter 6. This optimization procedure was based on a genetic algorithm (GA) and used the perceptual results of listening test III as a training data set. Afterwards, the optimized model was also tested on the results of the other three tests. The results were promising; in almost all the situations tested, the new parameters correlated highly with the perceptual results. It was demonstrated that in most cases the new parameters performed better than the conventional ones. These results show that the new method successfully solves most problems with the conventional method. For example, more important aspects of the human auditory system are taken into account. Furthermore, the signal type, and how the signal perceptually interacts with the acoustics of a room, is automatically taken into account.

It was also shown in Chapter 6 that the new objective parameters can be mapped onto a perceptual scale from 0 to 1 by fitting sigmoid functions through the data. The resulting scaled parameters are called S_REV, S_CLA, S_ASW and S_LEV. They correspond to the parameters related to reverberance, clarity, ASW and LEV, respectively. Because they map directly onto the scale which was used in the listening tests, a result for any of the S_X parameters is easy to translate into a verbal scale (very low, low, etc.).

In Chapter 7, different practical aspects of applying the new method have been discussed. Two different measurement methods were presented. The first method is called the direct method. In this method, the objective parameters are obtained directly from a binaural recording, either by recording a natural sound source (violin, singer, etc.), or by playing back an anechoic recording through a loudspeaker placed inside the room. The second method is called the indirect method. When this method is used, BRIRs are measured or simulated first, which are convolved with anechoic recordings afterwards. The resulting audio samples can be used as input for the new method, which results in the objective parameters for that particular signal type in the room where the BRIR was measured.

It was discussed that the indirect method has a number of advantages over the direct method. First, the indirect method is more robust to measurement noise. Furthermore, a BRIR measurement has to be performed only once for each source/receiver location; different signal types can be analyzed afterwards. Finally, no issues with reproducibility will occur in general when the indirect method is used. The direct method also has a few advantages: it allows for assessing the objective parameters during a performance, for example. Another advantage is that it takes the directivity of the sound source(s) into account automatically. Furthermore, this method makes it more convenient to perform measurements in fully occupied halls, because it is not necessary to use artificial test signals and to position loudspeakers on the stage.

It was also shown in Chapter 7 that a minimum signal length of ten seconds is required to obtain accurate results, at least for the set of BRIRs used in listening test III and the male speech stimulus. It is expected that this required signal length holds for other signal types and rooms too. However, there will be signals which may need to be processed for a longer time to obtain results that are representative for the type of signal as a whole. This would be true, for example, for a musical piece containing parts that are very different, like slow parts with long, sustained notes versus faster, more staccato parts. In this case it is recommended to use a part of such a piece containing both kinds of styles for obtaining representative objective parameters. Alternatively, the parameters could be calculated for the different musical parts separately. This way, a room can be tested for different playing styles within a certain musical genre, for example.

In general, it is recommended to obtain the parameters for different signals. As was shown in Chapter 7, different signals yield different values, as is expected from the results of the listening tests. When the resulting parameter values are compared with the temporal and spectral properties of the signals, together with informal listening tests performed by the author, it is found that the AMARA method successfully predicts how certain signal types are perceived in a variety of rooms. One signal type (pop music with a certain amount of reverberation already present in the original recording) turned out to be problematic when it was convolved with a BRIR measured in a reverberation chamber (T_20 = 1.12 s). In this particular case, all the energy was assigned to the reverberant stream by the auditory model, leading to problems in the calculation of S_ASW. For a proper evaluation of the acoustics of a room and for making a fair comparison between signal types, it is highly recommended to use anechoic source stimuli only.

One of the problems with the conventional method for assessing room acoustical parameters highlighted in Chapter 2 was the fact that some parameters can fluctuate severely over small spatial intervals, whereas this is not expected from a perceptual point of view. Therefore, in Chapter 7, the new objective parameters were calculated for an array measurement which was performed in the Concertgebouw in Amsterdam. The amount of spatial fluctuation is quite low: within one seat (50 cm interval), all four S_X parameters fluctuate by less than 4% on average. For S_CLA and S_ASW, the amount of fluctuation is lower compared with the conventional parameters, and for S_REV and S_LEV it is (slightly) higher. However, to make a fair comparison, it is best to include the Just Noticeable Difference (JND) for each parameter in the analysis. For example, the amount of spatial fluctuation δ could be defined relative to the parameter's JND. As long as the JND values for the S_X parameters are unknown, this comparison cannot be carried out.

Not only the influence of the dummy head position on the parameter results was examined in Chapter 7, but also the influence of the dummy head orientation for a fixed location. The way in which the parameters change as a function of azimuth looks plausible, although here too information on the corresponding JNDs is necessary in order to make statements about how well the results predict perception.

Still, the results from Chapter 6 show that in almost all cases the new parameters correlate better with the perceptual results from the listening tests than the conventional parameters. It is therefore expected that the new parameters fluctuate less relative to their corresponding JNDs. If they fluctuated more than the conventional parameters, which would mean that they suffer from a higher error as a function of the measurement position, the correlation with the perceptual results would automatically be lower.

Below, the main conclusions regarding the AMARA method are summarized:

- From the results of four listening tests, it was found that the objective parameters resulting from the AMARA method correlate better with the perceptual results than the conventional ones in most cases.
- An additional advantage of the new method is that the type of source signal, and how this influences the perception of the acoustics of a room, are automatically taken into account. Furthermore, measurements can be performed in occupied rooms by recording the sound field using an artificial head during a performance, for example.
- Various aspects of the practical application of the AMARA method were examined. For example, it was discussed that objective parameters can be obtained using the AMARA method in two ways:
  - Directly, by processing binaural recordings from real sources, like a human talking, or from an audio sample played back through a loudspeaker positioned in the room.
  - Indirectly, by measuring (or simulating) a binaural room impulse response first, using a loudspeaker / artificial head setup. This BRIR can be convolved with various audio signals afterwards in order to derive the S X parameters for these signals.
- Also, some basic requirements for obtaining accurate results were given:
  - The input signal should have a length of at least 10 seconds.
  - If the SNR is low, the indirect method is preferred since it is more robust than the direct method.
  - The complete measurement system should be calibrated carefully, preferably with an accuracy of ±1 dB.
  - The use of anechoic source signals is recommended, unless the effects of using pre-processed signals on the acoustics need to be tested explicitly.
- In terms of the amount of (relative) fluctuations within small measurement intervals (5 cm), the AMARA method performs better than the conventional method for the parameters related to clarity and ASW (for an array measurement performed in the Concertgebouw in Amsterdam). The amount of fluctuation for the other two parameters, related to reverberance and LEV, was slightly higher.

8.2 Recommendations for further research

As explained in the previous section, the JND values for the four S X parameters are as yet unknown. Knowledge of these JNDs will be very valuable, because it allows for comparing the amount of spatial fluctuation of the new parameters with that of the old ones. Furthermore, different rooms can be compared more easily, or predictions can be made as to whether people will hear a difference in the acoustics when a particular element (like an absorber or diffuser) is placed in a room. It is, therefore, recommended to perform tests to derive the JNDs for S REV, S CLA, S ASW and S LEV. A possible approach to finding these values experimentally can be found in [Witew et al., 25].

Another open question, following from the results in Chapters 5 and 6, is how the SPL affects the perceptual attributes and whether or not this effect is simulated correctly by the auditory model. Given the statistical results from Tables 5.12 and 5.13, it might be interesting to conduct listening tests for quantifying the influence of SPL differences on the various attributes, especially ASW and LEV. Next, the perceptual results could be compared with the predictions from the auditory model.

As was already discussed in Section 8.1, the calibration of the measurement system with respect to the SPL needs to be accurate (within ±1 dB) to avoid large errors in the objective parameters S X. Of all the effects tested in Chapter 7, SPL errors have the largest effect on the results. From the calculation of the playback levels in the listening tests (Chapter 5), it was found that calibrating the playback level for a setup involving headphones is very difficult. However, when a measurement is performed to determine the new parameters using the AMARA method (i.e., when no listening test is carried out), no headphones are required. In that case, the setup can be calibrated relatively easily using an SPL-measurement device, for example. Still, it might be interesting to look into this aspect in more detail. This research could include finding methods for accurate calibration of measurement systems for assessing room acoustical quality. But more importantly, from the results of experiments on the effect of SPL on the objective and perceptual results, as described above, conclusions can be drawn on whether or not the effect of SPL on the parameters corresponds to perceptual effects. If not, it will be necessary to modify the model accordingly. For example, as was discussed in Chapter 3, the overshoots at the outputs of the neural adaptation stage due to signal onsets are too large in some cases, at least for predicting certain sound masking effects. So far, applying methods for reducing

this overshoot effect, like the one described in [Münkner, 1993], did not improve the results in this project. Perhaps another method can be found that works better within the context of the method presented in this thesis.

Once it is known if (and how) the model is able to simulate the effect of SPL differences accurately, it can also be examined if the model can be used to estimate the perceived loudness level of a signal. This means that a method is needed for using the absolute output level of the model to predict the perceived loudness level, for example by calculating the average output level over certain time frames. Important questions that need to be answered in this respect are:

- Which parts of the output streams contribute to the perceived loudness? Direct, reverberant or both?
- Over which frequency bands should the perceived loudness be calculated?
- How should the loudness be averaged over the left and right ear channels? Squared sum, mean, etc.?

Moreover, it might be interesting to check how the auditory model and the resulting parameters perform compared with other auditory modelling approaches. However, most alternative approaches to room acoustical assessment using auditory modelling discussed in Section are either focused on spatial attributes only, or on the inspection of processed impulse responses (sometimes only visually). Moreover, most approaches work with impulse responses or artificial signals, rather than real-life binaural signals. However, the AMARA approach could still be applied to other existing auditory models for extracting the four parameters proposed in this thesis. For example, the binaural Lindemann model [Lindemann, 1986a] could be used, because it is also binaural. However, a nonlinear adaptation stage was not included in the Lindemann model. This means that the resulting parameters will no longer be level-dependent.

Probably the most interesting and recent candidate for a comparison is the QESTRAL framework discussed in Section 3.4.3, since this framework was also designed for extracting parameters from binaural signals. Nevertheless, the model used in QESTRAL also lacks a nonlinear adaptation stage. If the lack of a nonlinear stage in the Lindemann and QESTRAL models turns out to be problematic, including the nonlinear adaptation stage from the Dau model [Dau et al., 1996a] would be a possible solution for improvement. In any case, when different auditory models are used, the tuned parameters of the AMARA method will most likely need re-tuning in order to obtain the best results.

Finally, it could be tested whether the parameter extraction in the AMARA method is robust when the original Dau and Breebaart auditory models are used. If so, the parameter extraction part of the method could be implemented as a module working

on the output signals of these other models. Since the differences between the Dau and Breebaart models and the model used in this thesis are small, it is expected that the parameter extraction could also be applied to these models. However, it is also expected that the results will be somewhat worse, since the model used in this thesis and the parameter extraction were optimized together in order to yield the highest correlation with perceptual results.

8.3 Outlook

It is clear that for the AMARA method to gain acceptance in the acoustical world, a paradigm shift will be necessary. Acoustic consultants generally work in environments in which the stakes are high and large amounts of money are involved. As a result, acousticians are often not keen to change their way of working. They have worked with the established, conventional parameters for years and are familiar with most of their strengths and weaknesses. Furthermore, the clients often have knowledge of these parameters and their desired values for certain applications too. Clients will, therefore, often specify in what ranges they demand the acoustical parameters to be. However, there is ample evidence, in this thesis but also in various other places in the literature, that the conventional parameters have major shortcomings.

Based on the findings in this thesis, it can be concluded that the new parameters are an improvement over the conventional ones in various ways. Because of this, it might be appealing to acousticians to, at least, utilize the new and the old method in parallel. The conventional parameters can be used to get a general idea of the acoustics in a room. Furthermore, if a client demands certain values for T30 and C80, for example, then these parameters need to be measured anyway. This will most often be the case. Once the conventional parameters have been determined, the new S X parameters can be used to examine the acoustics of a room further for various types of signals. When room impulse response measurements are carried out anyway (using one or more receiver positions), it is relatively easy to obtain the S X parameters as well using the indirect method, provided that a dummy head is available. The new parameters might also be useful when strange values are found for the conventional parameters for a particular source / receiver combination. In this case, the S X parameters can be used to investigate such a result. If the S X parameters also show unexpected values, then the acoustics of the room for that source / receiver combination are probably such that they have an audible effect on the perception.

Besides the field of room acoustics, different applications for the AMARA method are possible. For example, it could be used to assess the quality of reproduction of a multichannel audio system, like surround or Wave Field Synthesis systems. Instead of having to measure impulse responses for each loudspeaker to each receiver location, simulating the sound field etc., the quality of reproduction can be assessed directly

using the direct method. For multichannel audio systems, especially the parameters related to ASW and LEV will be interesting, since they correspond to spatial aspects of the sound field.

It is very important how the results for the new parameters are reported. For example, the following statement will be insufficient: "In room A the parameter S REV was measured. The result is S REV = 0.54." This statement does not specify for which signal type S REV was determined, which is very important since the new parameters are content-specific. Instead, it is necessary that the results are reported as follows (also including the source and receiver locations): "In room A, the parameter S REV was measured for a source at location (x_s, y_s, z_s) and a receiver at location (x_r, y_r, z_r). For a male speech sample (with a length of 10 seconds), the result was equal to S REV = 0.54." If several signal types are being tested, then the various results could be presented in a table, for example.

Finally, if rooms are being compared acoustically for a certain signal type, it will be best to use the exact same (anechoic) audio signal as the source for each measurement. As shown in Chapter 7, male and female speech yield (slightly) different results, for example. This is expected perceptually, but when rooms are acoustically very similar, it will be difficult to tell if these differences in the parameter results are due to differences in the acoustics of the rooms or to differences in the source stimulus. Because of this, an option would be to introduce a database consisting of standard test signals, with one or more anechoic signals for each category: speech, singers, strings, etc. It is very important that each test signal is representative of a certain class of audio signals, both in its spectral and in its temporal properties. Next, developers of impulse response measurement software, or any other software package related to room acoustics, should be able to include these signals in their software.

Hopefully, people in the field of acoustics are willing to include the AMARA method in their daily work, such that more and more data becomes available on typical ranges for the new parameters for particular types of rooms and signals, preferred ranges for these situations, suitable test signals, etc. As soon as the method has proved itself as a valuable addition to the conventional ones, or maybe even as a replacement, it might become common practice to express the acoustics of a room in terms of S X values only.


A The auditory model

In this appendix, all the elements of the auditory model presented in Chapter 3 will be discussed in terms of a discrete-time implementation. This implementation is based on the work of Breebaart [21], with some modifications.

A.1 Model input

The model accepts discrete-time input signals x[n], with n the sample number. The signals are sampled at a sampling frequency f_s. Throughout this thesis, a sampling frequency of f_s = 44.1 kHz was used. The input signal should be scaled such that an RMS of 1 corresponds to a sound pressure level of 0 dB.

A.2 Outer and Middle-ear transfer function

The combined outer and middle-ear transfer function is modelled by a bandpass filter with its cutoff frequencies at 1 kHz and 4 kHz:

y[n] = (1 - q)\,r\,x[n] - (1 - q)\,r\,x[n-1] + (q + r)\,y[n-1] - q\,r\,y[n-2],    (A.1)

where

q = 2 - \cos(2\pi \cdot 4000/f_s) - \sqrt{(\cos(2\pi \cdot 4000/f_s) - 2)^2 - 1},    (A.2)

and

r = 2 - \cos(2\pi \cdot 1000/f_s) - \sqrt{(\cos(2\pi \cdot 1000/f_s) - 2)^2 - 1}.    (A.3)
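To make the recursion in Eqs. A.1-A.3 concrete, the sketch below (in Python, purely as an illustration; the function names are my own and not part of the thesis) applies the outer/middle-ear bandpass filter sample by sample, using the 1 kHz and 4 kHz cutoffs stated above.

```python
import numpy as np

def pole_coefficient(f_cut, fs):
    """Filter coefficient of Eqs. A.2 / A.3 for a given cutoff frequency."""
    c = np.cos(2.0 * np.pi * f_cut / fs)
    return 2.0 - c - np.sqrt((c - 2.0) ** 2 - 1.0)

def outer_middle_ear(x, fs=44100.0):
    """Bandpass filter of Eq. A.1 with the 1 kHz and 4 kHz cutoffs stated above."""
    q = pole_coefficient(4000.0, fs)   # pole for the upper cutoff
    r = pole_coefficient(1000.0, fs)   # pole for the lower cutoff
    y = np.zeros(len(x), dtype=float)
    for n in range(len(x)):
        x1 = x[n - 1] if n >= 1 else 0.0
        y1 = y[n - 1] if n >= 1 else 0.0
        y2 = y[n - 2] if n >= 2 else 0.0
        y[n] = (1 - q) * r * x[n] - (1 - q) * r * x1 + (q + r) * y1 - q * r * y2
    return y
```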

A.3 Basilar membrane

The filter characteristics of the basilar membrane were simulated using a bank of fourth-order gammatone filters at ERB spacing [Glasberg and Moore, 199]. For each filter m = 1...M, a center frequency f_c is defined:

f_c = \frac{e^{0.11 m} - 1}{0.00437}.    (A.4)

Each filter is given by:

h[n] = A\,(n/f_s)^{\nu - 1}\, e^{-2\pi b n/f_s} \cos(2\pi f_c n/f_s + \psi),    (A.5)

where A is a scaling factor, b a constant controlling the filter bandwidth, ψ the starting phase and ν the gammatone filter order. A is given by:

A = \frac{2\,(2\pi b)^{\nu}}{(\nu - 1)!}.    (A.6)

For each filter, the 3 dB bandwidth is given by:

B_{3dB} = 2b\,\sqrt{2^{1/\nu} - 1},    (A.7)

which is translated into an ERB bandwidth using:

B_{ERB} = \frac{\pi b\,(2\nu - 2)!\,2^{-(2\nu - 2)}}{((\nu - 1)!)^2}.    (A.8)

According to Glasberg and Moore [199], the bandwidth of the auditory filters is given by:

B = 24.7\,(0.00437 f_c + 1).    (A.9)

The constant b as a function of center frequency now follows from B = B_{ERB}:

b(f_c) = \frac{24.7\,(0.00437 f_c + 1)\,((\nu - 1)!)^2}{\pi\,(2\nu - 2)!\,2^{-(2\nu - 2)}}.    (A.10)

The basilar membrane filters can now be implemented in the time domain by first shifting the input signal by f_c Hz:

y[n] = x[n]\,e^{-2\pi j f_c n/f_s}.    (A.11)

Next, the signal is low-pass filtered according to:

y[n] = \left(1 - e^{-2\pi b/f_s}\right) x[n] + e^{-2\pi b/f_s}\, y[n-1].    (A.12)

To achieve a third order low-pass filter, this step is repeated three times. The filter h[n] now follows from up-shifting the signal in frequency:

h[n] = 2\,\mathrm{Re}\left\{ x[n]\,e^{2\pi j f_c n/f_s} \right\}.    (A.13)

A.4 Absolute threshold of hearing

The absolute threshold of hearing (ATH) is modelled by thresholding the signal:

y[n] = x[n]      if x[n] ≥ ε(f_c)
y[n] = ε(f_c)    if x[n] < ε(f_c),    (A.14)

where ε(f_c) is a frequency-dependent threshold value. It was shown in Chapter 3 that this method works better in simulating the ATH than the method proposed by Breebaart, in which noise with a frequency-independent level is added to the outputs of the gammatone filterbank to simulate the ATH. Figure A.1 shows the values for ε as a function of frequency. The values were tuned at 1/3 octave band intervals. Linear interpolation is applied to calculate ε for other frequencies.

Figure A.1: The threshold values ε (in MU, as a function of frequency in Hz) that are applied prior to the adaptation stage to simulate the Absolute Threshold of Hearing in the proposed model.
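The following sketch illustrates one basilar-membrane channel and the subsequent ATH flooring (Eqs. A.10-A.14). It is a simplified illustration with my own function names, not the reference implementation; the number of low-pass passes follows the text above literally.

```python
import numpy as np
from math import factorial, pi

def bandwidth_b(fc, nu=4):
    """Gammatone damping b(fc) matched to the ERB of Eq. A.9 (see Eq. A.10)."""
    return (24.7 * (0.00437 * fc + 1) * factorial(nu - 1) ** 2
            / (pi * factorial(2 * nu - 2) * 2.0 ** (-(2 * nu - 2))))

def gammatone_channel(x, fc, fs=44100.0, passes=3):
    """One channel of the filterbank via Eqs. A.11-A.13: shift down by fc,
    apply a one-pole low-pass `passes` times, then shift back up."""
    b = bandwidth_b(fc)
    n = np.arange(len(x))
    y = x * np.exp(-2j * np.pi * fc * n / fs)        # Eq. A.11: frequency shift
    a = np.exp(-2 * np.pi * b / fs)                  # low-pass coefficient
    for _ in range(passes):                          # Eq. A.12, applied repeatedly
        for i in range(1, len(y)):
            y[i] = (1 - a) * y[i] + a * y[i - 1]
    return 2.0 * np.real(y * np.exp(2j * np.pi * fc * n / fs))   # Eq. A.13

def apply_ath(x, eps_fc):
    """Eq. A.14: floor the channel signal at the threshold value for this band."""
    return np.maximum(x, eps_fc)
```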

A.5 Inner hair cells

The inner hair cells are modelled by applying half-wave rectification:

y[n] = x[n]    if x[n] > 0
y[n] = 0       otherwise.    (A.15)

The half-wave rectification is followed by a 5th order low-pass filter to simulate the loss of phase locking for frequencies above 1 kHz. This was carried out by applying the following filter five times subsequently:

y[n] = (1 - u)\,x[n] + u\,y[n-1],    (A.16)

with u given by:

u = 2 - \cos(2\pi \cdot 2000/f_s) - \sqrt{(\cos(2\pi \cdot 2000/f_s) - 2)^2 - 1}.    (A.17)

The resulting 5th order low-pass filter will have a cutoff frequency of 770 Hz.

A.6 Adaptation

Neural adaptation was simulated using a chain of five adaptation loops. Each adaptation loop j = 1...5 uses the output of the previous loop as input and is given by:

y[n] = \left(1 - e^{-1/(\tau_j f_s)}\right) \frac{x[n]}{y[n-1]} + e^{-1/(\tau_j f_s)}\, y[n-1],    (A.18)

where τ_j is the time constant for loop j, given by 5, 50, 129, 253 and 500 ms, respectively. For stationary signals, the output level L_o of the adaptation stage in the steady state will be approximately equal to:

L_o ≈ L_i^{1/32},    (A.19)

with L_i the level of the input signal.
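As an illustration of the adaptation stage, the sketch below chains the five divisive loops of Eq. A.18. The start-up value of each loop is an assumption on my part (the thesis does not state it); here it is initialised near the steady-state value so that the first samples do not produce an artificial onset.

```python
import numpy as np

TAU = [0.005, 0.050, 0.129, 0.253, 0.500]   # the five loop time constants of Eq. A.18

def adaptation_loops(x, fs=44100.0, floor=1e-5):
    """Chain of five divisive adaptation loops (Eq. A.18)."""
    y = np.asarray(x, dtype=float)
    for tau in TAU:
        a = np.exp(-1.0 / (tau * fs))
        out = np.empty_like(y)
        prev = np.sqrt(max(y[0], floor))    # start-up state (assumed, see lead-in)
        for n in range(len(y)):
            prev = (1.0 - a) * y[n] / max(prev, floor) + a * prev
            out[n] = prev
        y = out
    return y
```

For a constant input x the fixed point of each loop is sqrt(x), so five loops give the x^(1/32) compression of Eq. A.19.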

A.7 Scaling

The outputs of the adaptation stage are scaled such that a stationary input signal with a sound pressure level of 100 dB results in an output level of 100 MU in the steady state:

y[n] = \frac{100\,\left(x[n] - \varepsilon(f_c)\right)}{(10^5)^{1/32} - \varepsilon(f_c)}.    (A.20)

Note that this scaling method differs from Breebaart's, which uses constant scaling factors for all frequency bands.

A.8 Binaural processor

The procedure described above is carried out for both the left and the right ear input channels. The binaural processor also differs from the one proposed by Breebaart, as discussed in Section …. In the binaural processor, the outputs for both channels are first filtered with a double-sided exponential window w[n] to simulate binaural sluggishness:

w[n] = e^{-0.5\,|n|/(\tau_b f_s)},    (A.21)

where τ_b = 30 ms.

Next, the ITD values as a function of time are found by evaluating the cross-correlation between the left and right ear output signals y_L[n] and y_R[n]:

r[n] = \sum_{k=0}^{K-1} y_L[k - n]\, y_R[k].    (A.22)

The ITD is then given by ITD = n_max / f_s, where n_max is the sample n for which r[n] has its maximum.

A.9 Envelope extraction

The final monaural outputs of the model are determined by extracting the envelopes of the outputs for the left and right ear channels. The envelopes result from applying a first order low-pass filter with a time constant of 2 ms:

y[n] = \left(1 - e^{-1/(\tau_e f_s)}\right) x[n] + e^{-1/(\tau_e f_s)}\, y[n-1],    (A.23)

with τ_e = 2 ms.
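The exact indexing of the cross-correlation in Eq. A.22 is difficult to recover from the typeset equation, so the sketch below only illustrates the general idea: weight both channels with the sluggishness window of Eq. A.21 and convert the lag of the cross-correlation maximum into an ITD. The ±1 ms search range and all names are my own assumptions, not values from the thesis.

```python
import numpy as np

def itd_from_frame(yl, yr, fs=44100.0, tau_b=0.030, max_lag_s=0.001):
    """Estimate the ITD of one analysis frame from the binaural model outputs.

    Generic cross-correlation sketch (not the thesis implementation verbatim):
    both channels are weighted with a double-sided exponential window of time
    constant tau_b (Eq. A.21) and the lag of the correlation maximum is
    returned in seconds."""
    n = len(yl)
    t = np.arange(n) - n // 2
    w = np.exp(-0.5 * np.abs(t) / (tau_b * fs))   # binaural sluggishness window
    a, b = yl * w, yr * w
    max_lag = int(max_lag_s * fs)
    lags = np.arange(-max_lag, max_lag + 1)
    r = np.array([np.sum(a[max(0, -m):n - max(0, m)] * b[max(0, m):n - max(0, -m)])
                  for m in lags])
    return lags[int(np.argmax(r))] / fs
```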


B The Replaygain algorithm

This appendix discusses the Replaygain algorithm as proposed by Robinson [21]. It was developed to estimate the loudness level of audio signals. The results can be used to normalize a set of audio files, for example an MP3 collection, to the same perceived loudness level. Replaygain is widely used in software media players (natively or through plug-ins) and is even supported by some hardware audio players. In this thesis, the Replaygain algorithm was used to normalize the loudness level of the audio samples used in listening tests I, II and IV (see Chapter 5). Below, the various steps in the algorithm will be explained.

B.1 Equal loudness filter

The algorithm starts with applying an equal loudness filter to the input signal. This filter is shown graphically in the frequency domain in Fig. B.1. The filter is designed as a 10th order Yule-Walker filter combined with a 2nd order Butterworth low-pass filter. A MATLAB implementation, as well as the raw filter coefficients, can be downloaded from the Replaygain website.

B.2 RMS calculation

After the equal loudness filter is applied, the (stereo) signal is divided into K blocks of 50 ms length. For each block k, the RMS pressure is calculated as:

p_{RMS,k} = \sqrt{\frac{1}{N} \sum_{n=n_0}^{n_1} \frac{x_L^2[n] + x_R^2[n]}{2}},    (B.1)

where n_0 and n_1 are the start and end sample of the block, respectively, and N is the number of samples in the block. x_L and x_R are the left and right time signals.
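A minimal sketch of the block-wise RMS of Eq. B.1 (Python for illustration; 50 ms blocks as stated above, function name my own). The equal-loudness pre-filter of Section B.1 is assumed to have been applied to the inputs already.

```python
import numpy as np

def block_rms(x_left, x_right, fs=44100, block_s=0.050):
    """RMS pressure per block (Eq. B.1), averaging the two stereo channels."""
    n_block = int(round(block_s * fs))
    n_full = (len(x_left) // n_block) * n_block
    xl = np.asarray(x_left[:n_full], dtype=float).reshape(-1, n_block)
    xr = np.asarray(x_right[:n_full], dtype=float).reshape(-1, n_block)
    return np.sqrt(np.mean((xl ** 2 + xr ** 2) / 2.0, axis=1))
```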

Figure B.1: The equal loudness filter as used by the Replaygain algorithm (magnitude response H(f) in dB as a function of frequency in Hz).

B.3 Estimating the loudness level

To obtain a single value representing the loudness, the RMS results for the blocks k = 1...K are sorted from lowest to highest. From practical results, Robinson found that picking the RMS value at 95% of this sorted list (i.e., the 95th percentile) as the loudness level leads to the best results [Robinson, 21].

B.4 Calibration with a reference level

As a reference, Replaygain uses a pink noise signal with an RMS pressure of 2 dB. The Replaygain value is then defined as the inverse of the estimated loudness of a signal relative to the estimated loudness of this pink noise signal. In other words, if the Replaygain value of a signal is -6 dB, then the signal should be attenuated by 6 dB to yield a loudness that is equal to that of the 2 dB pink noise signal.
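The last two steps can be sketched as follows; `ref_loudness_db` is a placeholder for the loudness estimate of the calibration pink-noise signal described above, and the function names are mine.

```python
import numpy as np

def loudness_estimate_db(rms_blocks):
    """Section B.3: the RMS value at 95% of the sorted list, expressed in dB."""
    sorted_rms = np.sort(np.asarray(rms_blocks, dtype=float))
    idx = min(len(sorted_rms) - 1, int(0.95 * len(sorted_rms)))
    return 20.0 * np.log10(sorted_rms[idx] + 1e-12)

def replaygain_value_db(rms_blocks, ref_loudness_db):
    """Section B.4: gain (in dB) that brings the signal to the reference loudness.
    ref_loudness_db is the loudness estimate of the calibration pink-noise signal."""
    return ref_loudness_db - loudness_estimate_db(rms_blocks)
```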

C Sound absorption through air

In Chapter 4, a room acoustical simulator for shoebox-shaped rooms was proposed. The simulator also accounts for sound absorption through air. In this appendix, the equations for calculating the absorption through air are given, as specified by ISO standard [ISO, 1993].

According to the ISO standard, the sound transmission coefficient through air a in [dB/m] for frequency f is given by:

a = 8.686\, f^2 \left[ 1.84 \cdot 10^{-11} \left(\frac{p_a}{p_{ref}}\right)^{-1} \left(\frac{T}{T_0}\right)^{1/2} + y \right].    (C.1)

In Eq. C.1, p_a is the ambient atmospheric pressure in kPa and p_ref is the reference pressure (p_ref = 101.325 kPa). T is the ambient atmospheric temperature (in Kelvin) and T_0 is a reference temperature of T_0 = 293.15 K. y is given by:

y = \left(\frac{T}{T_0}\right)^{-5/2} \left( 0.01275\, e^{-2239.1/T} \left( f_{r,o} + \frac{f^2}{f_{r,o}} \right)^{-1} + z \right),    (C.2)

in which f_{r,o} denotes the oxygen relaxation frequency, given by:

f_{r,o} = \frac{p_a}{p_{ref}} \left( 24 + 4.04 \cdot 10^4\, h\, \frac{0.02 + h}{0.391 + h} \right),    (C.3)

with h the molar concentration of water vapor:

h = h_r\, \frac{p_{ref}}{p_a}\, 10^{-6.8346\,(T_{01}/T)^{1.261} + 4.6151},    (C.4)

with T_{01} = 273.16 K the triple-point isotherm temperature and h_r the relative humidity (%). In Eq. C.2, z is defined as:

z = 0.1068\, e^{-3352/T} \left( f_{r,n} + \frac{f^2}{f_{r,n}} \right)^{-1},    (C.5)

with f_{r,n} the nitrogen relaxation frequency, given by:

f_{r,n} = \frac{p_a}{p_{ref}} \left(\frac{T}{T_0}\right)^{-1/2} \left( 9 + 280\, h\, e^{-4.17\left((T/T_0)^{-1/3} - 1\right)} \right).    (C.6)

Using Eqs. C.1 to C.6, the transmission through air in dB/m can be calculated as a function of temperature T (K), atmospheric pressure p_a (kPa) and relative humidity h_r (%). Figure C.1 shows example values for a as a function of frequency. As can be seen in the figure, the amount of absorption increases for increasing frequency f.

Figure C.1: The sound transmission through air a in dB/m, plotted for frequencies between 1 kHz and 16 kHz. This example curve is calculated for T = … K, p_a = … kPa and h = 5%.
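For reference, Eqs. C.1-C.6 (as reconstructed above from the ISO standard) translate into a single function; the default arguments are only example conditions, not necessarily the ones used for Fig. C.1.

```python
import numpy as np

def air_attenuation_db_per_m(f, temperature_k=293.15, p_a_kpa=101.325, rel_humidity=50.0):
    """Atmospheric attenuation a(f) in dB/m following Eqs. C.1-C.6."""
    f = np.asarray(f, dtype=float)
    T, T0, T01, p_ref = temperature_k, 293.15, 273.16, 101.325
    # Molar concentration of water vapour, Eq. C.4.
    h = rel_humidity * (p_ref / p_a_kpa) * 10.0 ** (-6.8346 * (T01 / T) ** 1.261 + 4.6151)
    # Oxygen and nitrogen relaxation frequencies, Eqs. C.3 and C.6.
    f_ro = (p_a_kpa / p_ref) * (24.0 + 4.04e4 * h * (0.02 + h) / (0.391 + h))
    f_rn = ((p_a_kpa / p_ref) * (T / T0) ** -0.5
            * (9.0 + 280.0 * h * np.exp(-4.17 * ((T / T0) ** (-1.0 / 3.0) - 1.0))))
    # Eqs. C.5, C.2 and C.1.
    z = 0.1068 * np.exp(-3352.0 / T) / (f_rn + f ** 2 / f_rn)
    y = (T / T0) ** -2.5 * (0.01275 * np.exp(-2239.1 / T) / (f_ro + f ** 2 / f_ro) + z)
    return 8.686 * f ** 2 * (1.84e-11 * (p_ref / p_a_kpa) * (T / T0) ** 0.5 + y)
```

At room temperature and moderate humidity this gives attenuations of the order of a few dB per kilometre at 1 kHz, rising steeply towards higher frequencies, in line with Fig. C.1.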

D Listening test instructions

Below, the complete text of the instructions given to the participants of the listening tests can be found.

Introduction

Thank you for participating in this listening test! In this test you will be asked to rate certain subjective attributes of room acoustics while listening to audio samples. First, the different attributes you will be asked to rate will be briefly explained, including some audio examples. The test procedure will be explained later on.

Clarity

Clarity is the degree to which discrete sounds in a musical performance stand apart from one another. If clarity is high, it is easy to spot individual notes in a musical piece, or individual phonemes in speech. Below you will find two audio examples (click the play and stop buttons to control playback).

[Audio example: High clarity]
[Audio example: Low clarity]

Note: Clarity should not be confused with the Dutch word helderheid.

Reverberance

When a room is highly reverberant, the sound will persist in the room for a long time even when a note has suddenly stopped. In a room with less reverberance, the sound will decay rapidly. Below, you can find two examples.

[Audio example: High reverberance]
[Audio example: Low reverberance]

Apparent source width & Envelopment

In acoustics, the perception of spaciousness is usually separated into two parts. The first part is related to the source; the source may sound narrow (in the extreme case it is as if the sound is coming from a point). On the contrary, the source can

also sound very wide, in which case it is also hard to locate. The second aspect of spaciousness is related to the environment. The sound is called enveloping when a perception of being surrounded by the sound occurs, because it is coming from all directions. Below, you will find three audio examples including pictures to illustrate these effects.

Figure D.1: (a) Low ASW, low envelopment; (b) High ASW, low envelopment; (c) High ASW, high envelopment.

[Audio example: Low ASW, low envelopment]
[Audio example: High ASW, low envelopment]
[Audio example: High ASW, high envelopment]

Test procedure

The test procedure will be as follows. The test software will look like the figure below. In the example interface you are asked to rate the clarity for a set of audio samples. You can listen to the individual samples (in any order) by clicking the

Play buttons. You can stop playback by pressing this button again (playback will also stop when the sample has finished). For each sample, you can rate the clarity by moving the sliders next to the Play buttons with your mouse. As can be seen, the scale runs from very low to very high. By pressing the Sort samples button, the samples will be ordered from highest ranking (top) to lowest ranking (bottom). This way, you can easily listen to the samples again in a pairwise fashion and check if you are satisfied with your ratings. You can make adjustments if you wish, press the Sort samples button again, etcetera.

Tip: An efficient way of performing this test is to make a quick categorization first, by grouping samples roughly as being very low, low, etc. After that, you can press the sort button and make fine adjustments.

When you are satisfied with your final ratings, you can press the Next button to go to the next test screen.


E Statistical analyses

This appendix will discuss the statistical analyses that are used in analyzing the results from the listening tests (see Chapter 5).

E.1 N-way ANOVA

An Analysis of Variance (ANOVA) is a statistical tool which can be used to find statistically significant differences between the means of multiple populations. The null hypothesis (H_0) of an ANOVA is that all means are equal. The result of an ANOVA is a p-value (p_val), which is equal to the probability that the variations between the populations may have been caused by chance. If this value is lower than a pre-defined significance level (for example p_val < 0.05 or p_val < 0.01), the null hypothesis is rejected.

E.1.1 Factors

Within ANOVA, each independent variable in a test is called a factor. For example, in the listening tests in this thesis (see Chapter 5), the factors included room and stimulus type. If only one independent variable is included, a one-way ANOVA is performed; two independent variables will result in a two-way ANOVA, etc.

E.1.2 Main effects and interaction effects

When the main effects are examined, an ANOVA is performed for one independent variable at a time. So, only the effect of one particular factor on the means is considered. When more than one independent variable is included, interaction effects between factors can also be examined. When an interaction effect between two factors is significant, the effect of one of those factors will have an influence on the other

factor. In such a case, the main effect of either of these two factors cannot be discussed without information on the other factor.

E.1.3 Performing an N-way ANOVA

The steps for performing an N-way ANOVA, including first-order interaction effects, will be explained. Consider a data matrix X, with N independent variables (factors) n = 1...N. The procedure starts with calculating the total sum of squares SS_t:

SS_t = \sum_k X_k^2 - \frac{\left(\sum_k X_k\right)^2}{K},    (E.1)

where k = 1...K are the indices for all the values in matrix X. Next, for each factor P, a matrix X_P is constructed, with the various levels j of the variable horizontally (columns) and the various observations i at each level vertically (rows). Then, for each level j, the sum of squares is calculated:

SS_{P,j} = \sum_i X_{P,ij}^2 - \frac{\left(\sum_i X_{P,ij}\right)^2}{M_P},    (E.2)

with M_P the number of observations (rows) for factor P. The sum of squares within groups for factor P is then calculated as:

SS_{P,wg} = \sum_j SS_{P,j}.    (E.3)

The sum of squares between groups for factor P can now be determined:

SS_{P,bg} = SS_t - SS_{P,wg}.    (E.4)

Also, the number of degrees of freedom between groups for factor P can be defined as:

df_{P,bg} = M_P - 1.    (E.5)

Using Eqs. E.4 and E.5, the mean sum of squares between groups for factor P can be obtained:

MS_{P,bg} = \frac{SS_{P,bg}}{df_{P,bg}}.    (E.6)

An ANOVA with N independent variables will include N_i = \frac{1}{2}(N^2 - N) first-order interaction effects. For each interaction effect Q × R between factors Q and R, a sum of squares can be defined. Let SS_{qr} be the sum of squares for the levels q and r of factors Q and R, respectively:

SS_{qr} = \sum_i \sum_j X_{qr,ij}^2 - \frac{\left(\sum_i \sum_j X_{qr,ij}\right)^2}{M_{qr,ij}},    (E.7)

where X_{qr,ij} are all the observations where q = i and r = j, and M_{qr,ij} is the number of observations for which this is true. The total sum of squares within groups for the interaction between Q and R can now be obtained by summing SS_{qr} over all possible levels for Q and R:

SS_{QR,wg} = \sum_q \sum_r SS_{qr}.    (E.8)

The sum of squares between groups for the interaction effect Q × R is now equal to the total sum of squares minus SS_{QR,wg} and the sums of squares of the individual main factors:

SS_{QR,bg} = SS_t - SS_{QR,wg} - SS_{Q,bg} - SS_{R,bg}.    (E.9)

The number of degrees of freedom between groups for interaction effect Q × R is equal to:

df_{QR,bg} = df_Q \, df_R.    (E.10)

This results in the following mean sum of squares for Q × R:

MS_{QR,bg} = \frac{SS_{QR,bg}}{df_{QR,bg}}.    (E.11)

The sum of squares within groups (also called error) can now be obtained by subtracting all the sums of squares between groups from the total sum of squares:

SS_{wg} = SS_t - \sum_p SS_{p,bg} - \sum_q \sum_r SS_{qr,bg}.    (E.12)

Likewise, the degrees of freedom within groups can be calculated:

df_{wg} = df_t - \sum_p df_{p,bg} - \sum_q \sum_r df_{qr,bg},    (E.13)

where df_t = K - 1 is the total number of degrees of freedom. The mean sum of squares within groups is now given by:

MS_{wg} = \frac{SS_{wg}}{df_{wg}}.    (E.14)

Finally, for each (main and interaction) effect P an F-ratio can be calculated:

F_P = \frac{MS_{P,bg}}{MS_{wg}}.    (E.15)

From this F-ratio, the p-value can be obtained using:

p_{P,val} = 1 - \beta\!\left( \frac{df_{P,bg}\, F}{df_{P,bg}\, F + df_{wg}},\; \frac{df_{P,bg}}{2},\; \frac{df_{wg}}{2} \right),    (E.16)

where β(x, a, b) is the incomplete regularized beta function:

\beta(x, a, b) = \frac{\int_0^x t^{a-1} (1 - t)^{b-1}\, dt}{\int_0^1 t^{a-1} (1 - t)^{b-1}\, dt}.    (E.17)

E.1.4 Presentation of the ANOVA results

ANOVA results are usually presented in a table. Table E.1 shows an example of such a table for an ANOVA with two independent variables.

Table E.1: An example table of how ANOVA results are usually presented. This example is for an ANOVA with two independent variables (factors A and B).

Source   Sum. Sq.    d.f.        Mean Sq.    F      Prob > F
A        SS_{A,bg}   df_{A,bg}   MS_{A,bg}   F_A    p_{A,val}
B        SS_{B,bg}   df_{B,bg}   MS_{B,bg}   F_B    p_{B,val}
A × B    SS_{AB,bg}  df_{AB,bg}  MS_{AB,bg}  F_AB   p_{AB,val}
Error    SS_{wg}     df_{wg}     MS_{wg}
Total    SS_t        df_t

When a result out of such a table is discussed in the text, this is usually done as follows (for example): "An analysis of variance showed that the effect of A was significant, F(df_{A,bg}, df_{wg}) = F_A, p = p_{A,val}."
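As a sketch of the final step (Eqs. E.15-E.17), the p-value of one effect can be computed with the regularized incomplete beta function, for instance via scipy.special.betainc; the function name is mine.

```python
from scipy.special import betainc

def anova_p_value(ms_bg, ms_wg, df_bg, df_wg):
    """p-value of one (main or interaction) effect, following Eqs. E.15-E.16."""
    F = ms_bg / ms_wg                                    # Eq. E.15
    x = df_bg * F / (df_bg * F + df_wg)
    return 1.0 - betainc(df_bg / 2.0, df_wg / 2.0, x)    # Eqs. E.16 / E.17
```

This is numerically equivalent to evaluating the survival function of the F-distribution, e.g. scipy.stats.f.sf(F, df_bg, df_wg).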

E.2 Tukey HSD

A significant ANOVA result only indicates the existence of differences between the means, but it does not specify which of the means are different from each other. In order to find which of the individual means are statistically different, a post-hoc test is needed. Such a test is the Tukey HSD (Honestly Significant Difference) test.

To find out if two means X̄_1 and X̄_2, with X̄_1 > X̄_2, are significantly different, the following value is calculated:

q = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{MS_{wg}/N_X}},    (E.18)

with MS_{wg} the mean sum of squares within groups (Eq. E.14) and N_X the number of samples over which X̄ is averaged. Next, this q-value is compared with a critical value q_{crit} defined as:

q_{crit} = Q(df_{bg}, N_X, 1 - \tau),    (E.19)

where Q is Tukey's studentized range, df_{bg} the number of degrees of freedom between groups, N_X the number of samples in X̄ and τ the significance level (for example τ = 0.05). No analytical expression for Q exists, but lookup tables exist in the literature, as well as online calculators. If q > q_{crit}, then the two means differ significantly.

After obtaining a significant ANOVA result, all the means can be compared pairwise using Tukey HSD to find out which means are significantly different (this is called a multiple comparison).
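A sketch of the pairwise comparison of Eqs. E.18-E.19; since there is no closed-form expression for Q, the critical value q_crit is assumed to be taken from a studentized-range table (or from a statistics library) and passed in. All names are my own.

```python
import numpy as np
from itertools import combinations

def tukey_pairwise(group_means, n_per_group, ms_wg, q_crit):
    """Pairwise Tukey HSD comparisons following Eqs. E.18-E.19.

    q_crit is the critical studentized-range value for the chosen significance
    level, number of groups and within-group degrees of freedom."""
    se = np.sqrt(ms_wg / n_per_group)
    results = []
    for i, j in combinations(range(len(group_means)), 2):
        q = abs(group_means[i] - group_means[j]) / se     # Eq. E.18
        results.append((i, j, q, q > q_crit))             # significant if q > q_crit
    return results
```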


F The Gauss-Newton algorithm

In Chapter 6, a Gauss-Newton algorithm was used to fit sigmoid functions through the objective versus perceptual results. The Gauss-Newton algorithm can be used to solve non-linear least-squares problems [Fletcher, 1987].

The Gauss-Newton algorithm can be used to minimize the following sum of squares:

S(p) = \sum_{n=1}^{N} w_n\, r_n^2(p),    (F.1)

where p = (p_1, p_2, ..., p_M)^T is a set of M parameters to be optimized and r_n are the residuals:

r_n(p) = y_n - f(x_n, p),    (F.2)

with y_n the target values and f(x_n, p) the outcome of the model function, given the parameter values p. Through the weighting factors w_n, more weight can be given to certain data points. If no weighting is applied, w_n = 1 for all n. Generally, the inverse of the variance for each data point is used: w_n = \sigma_n^{-2}.

The algorithm starts with an initial guess p^{(0)} for the minimum. Next, the method proceeds by optimizing p iteratively. The parameters are updated at each iteration k, using:

p^{(k+1)} = p^{(k)} + \gamma\, \Delta p,    (F.3)

where Δp is calculated by solving:

\left(J_r^T W J_r\right) \Delta p = -J_r^T W r,    (F.4)

where W is the weighting matrix, containing the w_n values on its diagonal. J_r is the N × M Jacobian matrix of the residuals r with respect to p:

J_r = \begin{pmatrix} \partial r_1/\partial p_1 & \cdots & \partial r_1/\partial p_M \\ \vdots & \ddots & \vdots \\ \partial r_N/\partial p_1 & \cdots & \partial r_N/\partial p_M \end{pmatrix}.    (F.5)

From Eq. F.2, it can be shown that J_r = -J_f, with J_f the Jacobian of f with respect to p:

J_f = \begin{pmatrix} \partial f_1/\partial p_1 & \cdots & \partial f_1/\partial p_M \\ \vdots & \ddots & \vdots \\ \partial f_N/\partial p_1 & \cdots & \partial f_N/\partial p_M \end{pmatrix}.    (F.6)

Therefore, Δp can also be calculated by solving:

\left(J_f^T W J_f\right) \Delta p = J_f^T W r.    (F.7)

The factor γ in Eq. F.3 is the step length. The step length should be chosen such that the new parameters perform sufficiently better than the ones obtained in the previous iteration. The step length should not be too small because this will lead to slow convergence. On the other hand, when the step length is too high, there is the danger of missing the minimum. One approach to find a step size that leads to a sufficiently better solution is by using the Armijo condition [Nocedal and Wright, 1999]:

S\left(p^{(k+1)}\right) \leq S\left(p^{(k)}\right) + \mu \gamma \left(J_r^T W r\right)^T \Delta p,    (F.8)

with μ a constant in the range 0 < μ < 1. This constant is generally chosen to be small, for example μ = 0.1. Using J_r = -J_f, Eq. F.8 can also be written as:

S\left(p^{(k+1)}\right) \leq S\left(p^{(k)}\right) - \mu \gamma \left(J_f^T W r\right)^T \Delta p.    (F.9)

Usually, a back-tracking procedure is followed to find a value of γ that satisfies this condition. For example, as a starting point γ = 1 is used. If condition F.9 is not satisfied, γ is divided by two, etc. This procedure is repeated until a value for γ is found that satisfies Eq. F.9.

The iterative procedure of the Gauss-Newton algorithm can be stopped as soon as a certain condition is met. Such a condition could be a maximum number of iterations and/or when the maximum absolute value of Δp is lower than a certain threshold.
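The complete procedure of Eqs. F.3-F.9 fits in a short routine. The sketch below (my own illustration in Python) also shows a logistic sigmoid with an assumed two-parameter form, which is not necessarily the exact parameterisation used in Chapter 6.

```python
import numpy as np

def gauss_newton(f, jac_f, x, y, p0, w=None, mu=0.1, max_iter=50, tol=1e-8):
    """Weighted Gauss-Newton iteration with Armijo back-tracking (Eqs. F.3-F.9)."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    p = np.asarray(p0, dtype=float)
    w = np.ones(len(y)) if w is None else np.asarray(w, dtype=float)
    W = np.diag(w)
    S = lambda params: np.sum(w * (y - f(x, params)) ** 2)   # Eq. F.1
    for _ in range(max_iter):
        r = y - f(x, p)                                      # residuals, Eq. F.2
        Jf = jac_f(x, p)                                     # model Jacobian, Eq. F.6
        g = Jf.T @ W @ r
        dp = np.linalg.solve(Jf.T @ W @ Jf, g)               # Eq. F.7
        gamma, S0 = 1.0, S(p)
        while S(p + gamma * dp) > S0 - mu * gamma * g @ dp:  # Armijo test, Eq. F.9
            gamma *= 0.5
            if gamma < 1e-8:
                break
        p = p + gamma * dp
        if np.max(np.abs(gamma * dp)) < tol:                 # stopping criterion
            break
    return p

# Example model: a logistic sigmoid s(x) = 1 / (1 + exp(-(x - a) / b))
# (an assumed parameterisation, for illustration only).
def sigmoid(x, p):
    a, b = p
    return 1.0 / (1.0 + np.exp(-(x - a) / b))

def sigmoid_jacobian(x, p):
    a, b = p
    s = sigmoid(x, p)
    return np.column_stack((-s * (1 - s) / b,
                            -s * (1 - s) * (x - a) / b ** 2))
```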

224 Bibliography AES (27). AES-4id-21 (r27): Characterisation and measurement of surface scattering uniformity. Audio Engineering Society, New York. Allen, J. B. and Berkley, D. A. (1979). Image method for efficiently simulating small-room acoustics. Journal of the Acoustical Society of America, 65(4), pp Ando, Y. (1983). Calculation of subjective preference at each seat in a concert hall. Journal of the Acoustical Society of America, 74(3), pp ANSI (24). ANSI S : Specification for octave-band and fractionaloctave-band analog and digital filters. Standards Secretariat, Acoustical Society of America. Barron, M. (1988). Subjective study of British symphony concert halls. Acustica, 66(1), pp Barron, M. (25). Using the standard on objective measures for concert auditoria, ISO 3382, to give reliable results. Acoustical Science and Technology, 26(2), pp Barron, M. and Marshall, A. H. (1981). Spatial Impression due to early lateral reflections in concert halls: The derivation of a physical measure. Journal of Sound and Vibration, 77(2), pp Becker, J. (22). Spectral and temporal contribution of different signals to ASW analysed with binaural hearing models. In Proceedings of the Forum Acusticum 22, Sevilla, Spain. Beranek, L. L. (1962). Music, acoustics and architecture. John Wiley & Sons Inc, New York. Beranek, L. L. (1996). Concert and opera halls - How they sound. Acoustical Society of America, New York.

225 222 BIBLIOGRAPHY Beranek, L. L. (23). Subjective rank-orderings and acoustical measurements for fifty-eight concert halls. Acta Acustica united with Acustica, 89(3), pp Beranek, L. L. (28). Concert hall acoustics Journal of the Audio Engineering Society, 56(7/8), pp Berg, J. and Rumsey, F. (21). Verification and correlation of attributes used for describing the spatial quality of reproduced sound. In Proceedings of the 19th International AES Conference on Surround Sound, Schloss Elmau, Germany. Berkhout, A. J. (1988). A Holographic Approach to Acoustic Control. Journal of the Audio Engineering Society, 36(12), pp Berkhout, A. J., de Vries, D., Baan, J. and van den Oetelaar, B. W. (1999). A wave field extrapolation approach to acoustical modeling in enclosed spaces. Journal of the Acoustical Society of America, 15(3), pp Berkhout, A. J., De Vries, D. and Boone, M. M. (198). A new method to acquire impulse responses in concert halls. Journal of the Acoustical Society of America, 68(1), pp Bilsen, F. A. (1994). Binaural modelling of perceptual qualities in room acoustics. In Proceedings of the International Conference on Acoustic Quality of Concert Halls, Madrid, Spain, pages Blau, M. (22). Difference limens for measures of apparent source width. In Proceedings of Forum Acusticum, Sevilla, Spain. Blauert, J. (1996). Spatial hearing. The MIT Press, Cambridge, Massachusetts, revised edition. Blauert, J. and Guski, R. (29). Critique of pure psychoacoustics. In Proceedings of NAG/DAGA 29, Rotterdam, The Netherlands, pages Blauert, J. and Lindemann, W. (1986a). Auditory spaciousness: Some further psychoacoustic analyses. Journal of the Acoustical Society of America, 8(2), pp Blauert, J. and Lindemann, W. (1986b). Spatial mapping of intracranial auditory events for various degrees of interaural coherence. Journal of the Acoustical Society of America, 79(3), pp Boone, M. M., De Vries, D. and Berkhout, A. J. (1995). Sound control (parts 1 and 2). Delft University of Technology, Delft. Boone, M. M., Opdam, R. C. G. and Schlesinger, A. (21). Downstream speech enhancement in a low directivity binaural hearing aid. In Proceedings of the 2th International Congress on Acoustics (ICA 21), Sydney, Australia.

226 BIBLIOGRAPHY 223 Bradley, J. S. (199). Contemporary approaches to evaluating auditorium acoustics. In Proceedings of the 8th International AES Conference on The Sound of Audio, Washington DC, USA, pages Bradley, J. S. (1991). A comparison of three classical concert halls. Journal of the Acoustical Society of America, 89(3), pp Bradley, J. S. (25). Using ISO 3382 measures, and their extensions, to evaluate acoustical conditions in concert halls. Acoustical Science and Technology, 26(2), pp Bradley, J. S. (21). Review of the objective room acoustics measures and future needs. In Proceedings of the International Symposium on Room Acoustics (ISRA 21), Melbourne, Australia. Bradley, J. S. and Soulodre, G. A. (1995). Objective measures of listener envelopment. Journal of the Acoustical Society of America, 98(5), pp Breebaart, D. J. (21). Modeling binaural signal detection. PhD thesis, Eindhoven University of Technology, Eindhoven. Breebaart, D. J., van de Par, S. and Kohlrausch, A. (21a). Binaural processing model based on contralateral inhibition. I. Model structure. Journal of the Acoustical Society of America, 11(2), pp Breebaart, D. J., van de Par, S. and Kohlrausch, A. (21b). Binaural processing model based on contralateral inhibition. II. Dependence on spectral parameters. Journal of the Acoustical Society of America, 11(2), pp Breebaart, D. J., van de Par, S. and Kohlrausch, A. (21c). Binaural processing model based on contralateral inhibition. III. Dependence on temporal parameters. Journal of the Acoustical Society of America, 11(2), pp Bronkhorst, A. W. (2). The cocktail party phenomenon: A review of research on speech intelligibility in multiple-talker conditions. Acta Acustica united with Acustica, 86(1), pp Chalupper, J. and Fastl, H. (22). Dynamic Loudness Model (DLM) for normal and hearing-impaired listeners. Acta Acustica united with Acustica, 88(3), pp Chevret, P. and Parizet, E. (27). An efficient alternative to the paired comparison method for the subjective evaluation of a large set of sounds. In Proceedings of the 19th International Congress on Acoustics (ICA 27), Madrid, Spain. Conetta, R., Rumsey, F., Zielinski, S., Jackson, P., Dewhirst, M., George, S., Bech, S. and Meares, D. (28). QESTRAL (Part 2): Calibrating the QESTRAL spatial quality model using listening test data. In Proceedings of the 125th AES Convention, San Fransisco, USA. preprint no

227 224 BIBLIOGRAPHY Cox, T. J. and D Antonio, P. (24). Acoustic Absorbers and Diffusers - Theory, Design and Applications. Spon Press, London. Craggs, A. (1994). A finite element method for the free vibration of air in ducts and rooms with absorbing walls. Journal of Sound and Vibration, 173(16), pp Dau, T., Püschel, D. and Kohlrausch, A. (1996a). A quantitative model of the effective signal processing in the auditory system. I. Model structure. Journal of the Acoustical Society of America, 99(6), pp Dau, T., Püschel, D. and Kohlrausch, A. (1996b). A quantitative model of the effective signal processing in the auditory system. II. Simulations and measurements. Journal of the Acoustical Society of America, 99(6), pp De Vries, D. and Han, H. L. (1983). Loudness of music in concert halls. In Proceedings of the 11th International Congress on Acoustics (ICA 1983), Paris, France, pages De Vries, D., Hulsebos, E. M. and Baan, J. (21). Spatial fluctuations in measures for spaciousness. Journal of the Acoustical Society of America, 11(2), pp De Vries, D., van Dorp Schuitman, J. and van den Heuvel, A. (27). A new digital module for variable acoustics and wave field synthesis: Design and applications. In Proceedings of the 122nd AES Convention, Vienna, Austria. preprint no Dewhirst, M., Conetta, R., Rumsey, F., Jackson, P., Zielinski, S., Meares, D., Bech, S. and George, S. (28). QESTRAL (Part 4): Test signals, combining metrics and the prediction of overall spatial quality. In Proceedings of the 125th AES Convention, San Fransisco, USA. preprint no Durlach, N. I. (1963). Equalization and cancellation theory of binaural maskinglevel differences. Journal of the Acoustical Society of America, 35(8), pp Eargle, J. (26). The Handbook of Recording Engineering, chapter 4 - Microphones: The Basic Pickup Patterns, pages Kluwer Acadamic Publishers. EBU (1988). EBU document Tech : Sound quality assessment material. Recordings for subjective tests (Users handbook for the EBU demonstration CD, SQAM). Technical report, European Broadcasting Union. Farina, A. (21). Acoustic quality of theatres: correlations between experimental measures and subjective evaluations. Applied Acoustics, 62(8), pp Farina, A. (27). Advancements in impulse response measurements by sine sweeps. In Proceedings of the 122nd AES Convention, Vienna, Austria. preprint no

228 BIBLIOGRAPHY 225 Fastl, H., Völk, F. and Straubinger, M. (29). Standards for calculating loudness of stationary or time-varying sounds. In Proceedings of Inter-Noise 29, Ottowa, Canada. Fletcher, R. (1987). Practical methods of optimization. John Wiley & Sons, New York, 2nd edition. Furuya, H., Fujimoto, K., Wakuda, A. and Nakano, Y. (25). The influence of total and directional energy of late sound on listener envelopment. Acoustical Science and Technology, 26(2), pp Gade, A. C. (1989a). Investigations of musicians room acoustic conditions in concert halls, I: Methods and laboratory experiments. Acustica, 69(5), pp Gade, A. C. (1989b). Investigations of musicians room acoustic conditions in concert halls, II: Field experiments and synthesis of results. Acustica, 69(6), pp Glasberg, B. R. and Moore, B. C. J. (199). Derivation of filter shapes from notched-noise data. Hearing Research, 47(1-2), pp Glasberg, B. R. and Moore, B. C. J. (21). The loudness of sounds whose spectra differ at the two ears. Journal of the Acoustical Society of America, 127(4), pp Grantham, D. W. and Robinson, D. E. (1977). Role of dynamic cues in monaural and binaural signal detection. Journal of the Acoustical Society of America, 61(2), pp Grantham, D. W. and Wightman, F. L. (1978). Detectability of varying interaural temporal differences. Journal of the Acoustical Society of America, 63(2), pp Griesinger, D. (1992a). IALF - Binaural measures of spatial impression and running reverberance. In Proceedings of the 92nd AES Convention, Vienna, Austria. preprint no Griesinger, D. (1992b). Room impression, reverberance, and warmth in rooms and halls. In Proceedings of the 93th AES Convention, San Fransisco, USA. preprint no Griesinger, D. (1995). How loud is my reverberation?. In Proceedings of the 98th AES Convention, Paris, France. preprint no Griesinger, D. (1997). The psychoacoustics of apparent source width, spaciousness and envelopment in performance spaces. Acta Acustica united with Acustica, 83(4), pp

229 226 BIBLIOGRAPHY Griesinger, D. (1999). Objective measures of spaciousness and envelopment. In Proceedings of the 16th International AES Conference on Spatial Sound Reproduction, Rovaniemi, Finland. Guski, R. and Blauert, J. (29). Psychoacoustics without psychology?. In Proceedings of NAG/DAGA 29, Rotterdam, The Netherlands, pages Haas, H. (1951). Über den Einfluss eines Einfachechos auf die Hörsamkeit von Sprache. Acustica, 1, pp Heckbert, P. S. and Hanrahan, P. (1984). Beam tracing polygonal objects. In Proceedings of the 11th annual conference on Computer graphics and interactive techniques, New York, USA, pages Hess, W. (26). Time-variant binaural-activity characteristics as indicator of auditory spatial attributes. PhD thesis, Fakultät für Elektrotechnik und Informationstechnik der Ruhr-Universität Bochum, Bochum. Hess, W., Braasch, J. and Blauert, J. (23). Evaluierung von Räumen anhand binauraler Aktivitätsmuster. In Proceedings of DAGA 23, Aachen, Germany, pages Hidaka, T., Nishihara, N. and Beranek, L. L. (21). Relation of acoustical parameters with and without audiences in concert halls and a simple method for simulating the occupied state. Journal of the Acoustical Society of America, 19(3), pp Holland, J. (1975). Adaptation in natural and artificial systems. The University of Michigan Press, Ann Arbor. Huszty, C., Bukuli, N., Torma, A. and Augusztinovicz, F. (28). Effects of filtering of room impulse responses on room acoustics parameters by using different filter structures. In Proceedings of Acoustics 8, Paris, France, pages IEC (199). IEC 6959: Provisional head and torso simulator for acoustic measurements on air conducting hearing aids, First edition. International Electrotechnical Commission. IEC (23). IEC ed3.: Sound system equipment - Part 16: Objective rating of speech intelligibility by speech transmission index. International Electrotechnical Commission. ISO (1993). ISO :1993: Acoustics - Attenuation of sound during propagation outdoors - Part 1: Calculation of the absorption of sound by the atmosphere. International Organization for Standardization. ISO (24). ISO :24: Acoustics - Sound-scattering properties of surfaces - Part 1: Measurement of the random-incidence scattering coefficient in a reverberation room. International Organization for Standardization.

230 BIBLIOGRAPHY 227 ISO (28). ISO :28: Acoustics - Measurement of room acoustic parameters - Part 2: Reverberation time in ordinary rooms. International Organization for Standardization. ISO (29). ISO :29: Acoustics - Measurement of room acoustic parameters - Part 1: Performance spaces. International Organization for Standardization. ISO/IEC (1993). ISO/IEC :1993: Information technology - Coding of moving pictures and associated audio for digital storage media at up to about 1,5 Mbits/s - Part 3: Audio. International Organization for Standardization. ISO/IEC (29). ISO/IEC :29: Information technology - Coding of audiovisual objects - Part 3: Audio. International Organization for Standardization. ITU-R (23). Recommendation ITU-R BS : General methods for the subjective assessment of sound quality. ITU Radiocommunication Sector. Jackson, P., Dewhirst, M., Conetta, R., Zielinski, S., Rumsey, F., Meares, D., Bech, S. and George, S. (28). QESTRAL (Part 3): System and metrics for spatial quality prediction. In Proceedings of the 125th AES Convention, San Fransisco. preprint no Jeffress, L. A. (1948). A place theory of sound localization. Journal of Comparative and Physiological Psychology, 41(1), pp Jekosch, U. (24). Basic concepts and terms of quality, reconsidered in the context of product-sound quality. Acta Acustica united with Acustica, 9(6), pp Jepsen, M. L. (26). A model of the normal and impaired auditory system. Master s thesis, Technical University of Denmark, Kgs. Lyngby, Denmark. Jordan, V. L. (1981). A group of objective acoustical criteria for concert halls. Applied Acoustics, 14(4), pp Kahle, E. and Jullien, J.-P. (1995). Subjective listening tests in concert halls: Methodology and results. In Proceedings of the International Congress on Acoustics (ICA 95), Trondheim, Norway. Keet, W. d. V. (1968). The influence of early lateral reflections on spatial impression. In Proceedings of the 6th International Congress on Acoustics (ICA 68), Tokyo, Japan, pages E53 E56. Kosten, C. W. (196). The mean free path in room acoustics. Acustica, 1, pp

231 228 BIBLIOGRAPHY Krokstad, A., Strøm, S. and Sørsdal, S. (1968). Calculating the acoustical room response by the use of a ray tracing technique. Journal of Sound and Vibration, 8(1), pp Kürer, R. (1971). Einfaches Messverfahren zur Bestimmung der Schwerpunktzeit raumakustischer Impulsantworten (A simple measuring procedure for determining the center time of room acoustical impulse responses). In Proceedings of the 7th International Congress on Acoustics (ICA 71), Budapest, Hungary. Kuttruff, H. (2). Room acoustics. Spon Press, London, fourth edition. Lacatis, R., Giménez, A., Barba Sevillano, A., Cerdá, S., Romero, J. and Cibrián, R. (28). Historical and chronological evolution of the concert hall acoustics parameters. In Proceedings of Acoustics 8, Paris, France, pages Lee, D., Cabrera, D. and Martens, W. L. (21). Equal reverberance contours for synthetic room impulse responses listened to directly: Evaluation of reverberance in terms of loudness decay parameters. In Proceedings of the International Symposium on Room Acoustics (ISRA 21), Melbourne, Australia. Levitt, H. (1971). Transformed up-down methods in psychoacoustics. Journal of the Acoustical Society of America, 49(2B), pp Lindau, A., Hohn, T. and Weinzierl, S. (27). Binaural resynthesis for comparative studies of acoustical environments. In Proceedings of the 122nd AES Convention, Vienna, Austria. preprint no Lindemann, W. (1986a). Extension of a binaural cross-correlation model by contralateral inhibition. I. Simulation of lateralization for stationary signals. Journal of the Acoustical Society of America, 8(6), pp Lindemann, W. (1986b). Extension of a binaural cross-correlation model by contralateral inhibition. II. The law of the first wave front. Journal of the Acoustical Society of America, 8(6), pp Lokki, T. and Karjalainen, M. (2). An auditorily motivated analysis method for room impulse responses. In Proceedings of the COST G-6 Conference on Digital Audio Effects (DAFX-), Verona, Italy. Lokki, T. and Karjalainen, M. (22). Analysis of room responses, motivated by auditory perception. Journal of New Music Research, 31(2), pp Lokki, T., Vertanen, H., Kuusinen, A., Pätynen, J. and Tervo, S. (21). Auditorium acoustics assessment with sensory evaluation methods. In Proceedings of the International Symposium on Room Acoustics (ISRA 21), Melbourne, Australia. MacNair, W. A. (193). Optimum reverberation time for auditoriums. Journal of the Acoustical Society of America, 1(2A), pp


Summary

Auditory modelling for assessing room acoustics

The acoustics of a concert hall, or of any other room, are generally assessed by measuring room impulse responses for one or more source and receiver positions. From these responses, objective parameters can be determined that should be related to various perceptual attributes of room acoustics. A set of these parameters is collected in ISO standard 3382-1. However, this method of assessing room acoustical quality has some major shortcomings. First of all, it is known that the perception of the acoustics of a room depends on the type of source signal; this is not taken into account when only impulse responses are considered. Furthermore, because of the type of test signals used for such measurements, they are mostly carried out in empty rooms, while the acoustics can change drastically when a room is fully occupied. Finally, there is evidence in the literature of cases in which the parameters do not correlate well with perception. For example, some parameters have been found to fluctuate severely over small measurement intervals, whereas the perceptual attributes they are supposed to predict remain constant. Apparently, some important properties of the human auditory system are not taken into account sufficiently.

In this thesis, a new method is proposed. The method consists of processing arbitrary binaural audio recordings with a binaural, non-linear auditory model. These recordings (or simulations) should be made with a dummy head. The model simulates the most important stages of the auditory system, such as the response of the inner ear, basilar membrane and hair cells, neural adaptation and binaural interaction. Using a peak-detection algorithm, the output signals of the model are split into two separate streams: one related to the source (direct sound) and one related to the environment (reverberant sound).
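As an illustration of the stream splitting described above, the Python sketch below assigns the samples of a (hypothetical) auditory-model output envelope to a direct or a reverberant stream using a simple peak-driven gate. The function name split_streams, the relative threshold, the hold time and the forgetting factor are assumptions made for this illustration only; the sketch does not reproduce the detection algorithm developed in the thesis.

import numpy as np

def split_streams(envelope, fs, rel_threshold=0.5, hold_ms=50.0, forget=0.9995):
    """Assign each sample of a non-negative envelope to a 'direct' or a
    'reverberant' stream using a simple peak-driven gate (illustrative
    thresholds only)."""
    hold_samples = int(hold_ms * 1e-3 * fs)
    direct = np.zeros_like(envelope)
    reverberant = np.zeros_like(envelope)
    peak = 0.0   # running estimate of the most recent peak level
    hold = 0     # number of samples left in the 'direct' state
    for n, x in enumerate(envelope):
        peak *= forget                        # slowly forget old peaks
        if x > peak:                          # onset / new local maximum
            peak = x
            hold = hold_samples
        if hold > 0 and x > rel_threshold * peak:
            direct[n] = x                     # source-dominated portion
            hold -= 1
        else:
            reverberant[n] = x                # decaying tail: room-related portion
    return direct, reverberant

# Toy example: an exponentially decaying envelope with a small noise floor
fs = 8000
t = np.arange(0.0, 1.0, 1.0 / fs)
envelope = np.exp(-6.0 * t) + 0.02 * np.random.rand(t.size)
direct, reverberant = split_streams(envelope, fs)
print("direct energy:", np.sum(direct ** 2))
print("reverberant energy:", np.sum(reverberant ** 2))

In the thesis the splitting operates on the outputs of the auditory model itself, and the free parameters of the method were optimized against listening-test data; the toy envelope above merely shows the mechanism of labelling each sample as source-related or room-related.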

Together with the calculation of the amount of fluctuation in the Interaural Time Difference (ITD) over time, parameters can be determined that are related to the perceptual attributes reverberance, clarity, apparent source width and listener envelopment.

The new method has been validated through four listening tests, in which subjects rated the four perceptual attributes discussed above for various room/stimulus combinations. Two different source stimuli were used: male speech and cello music. Two of the listening tests included virtual rooms, simulated binaurally using a simulator for shoebox-shaped rooms that is also presented in this thesis. The two other tests included real rooms whose impulse responses were measured binaurally. Statistical analyses were performed to evaluate which factors have a significant effect on the results. In all tests, significant differences were detected between the rooms; the source signal also had a significant effect in some situations. Using the results of one of the four tests, the free parameters of the model, such as upper and lower frequency limits, were optimized with a genetic algorithm. Next, the method was validated by calculating the correlation coefficients between the new parameters and the average ratings. The results are very promising: in most cases, the new parameters correlate better with the perceptual data than the conventional parameters, and there were far fewer situations in which the new parameters showed insignificant correlation.

Besides the good results in terms of correlation with perception, the method has other advantages. Since it accepts arbitrary binaural audio recordings, a measurement can be performed in an occupied room, during a performance for example. This way, the effect of the presence of an audience on the acoustics is automatically taken into account. Furthermore, the parameters are content-specific, which means that the ways in which the spectral and temporal properties of the source signal influence the perception of the acoustics of a room are reflected in the resulting parameters.

Various practical aspects of the new method are also discussed, such as robustness to noise and the influence of the type of source signal. Three signal categories were tested: voice stimuli, instrument stimuli and ensemble stimuli. Even within one signal category, differences between the parameters were found for the different source signals. These differences could all be explained by the temporal and spectral properties of the signals and by how these properties affect the perception of the acoustics of a room. Finally, the effects of the dummy head position and orientation on the parameters were evaluated. The results do not show severe fluctuations as a function of offset or angle, although this cannot be judged quantitatively as long as the just noticeable differences (JNDs) of the new parameters are unknown.
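The ITD fluctuation mentioned above can be illustrated with a frame-wise interaural cross-correlation: for each short frame, the lag at which the correlation between the left- and right-ear signals is maximal is taken as the ITD, and the spread of the resulting ITD track serves as a simple fluctuation measure. The frame length, the lag range and the use of the standard deviation in the sketch below are assumptions for illustration; they do not correspond to the exact procedure or parameter definitions of the thesis.

import numpy as np

def itd_track(left, right, fs, frame_ms=20.0, max_itd_ms=1.0):
    """Estimate a frame-wise ITD track (in seconds) from binaural signals
    as the lag of the maximum interaural cross-correlation per frame.
    Positive values mean the right-ear signal lags the left-ear signal.
    Frame length and lag range are illustrative choices."""
    frame = int(frame_ms * 1e-3 * fs)
    max_lag = int(max_itd_ms * 1e-3 * fs)
    lags = np.arange(-(frame - 1), frame)        # lags of the full cross-correlation
    plausible = np.abs(lags) <= max_lag          # restrict to physically plausible ITDs
    itds = []
    for start in range(0, min(len(left), len(right)) - frame, frame):
        l = left[start:start + frame]
        r = right[start:start + frame]
        xcorr = np.correlate(r, l, mode="full")  # length 2 * frame - 1
        best_lag = lags[plausible][np.argmax(xcorr[plausible])]
        itds.append(best_lag / fs)
    return np.asarray(itds)

# Example: white noise with a constant 0.3 ms delay on the right-ear channel
fs = 44100
noise = np.random.randn(fs)                      # one second of noise
delay = int(0.3e-3 * fs)
left = noise
right = np.concatenate([np.zeros(delay), noise[:-delay]])
track = itd_track(left, right, fs)
print("mean ITD [ms]:", 1e3 * track.mean())
print("ITD fluctuation (standard deviation) [ms]:", 1e3 * track.std())

For this artificial example the ITD track is essentially constant, so its fluctuation is close to zero; in a reverberant binaural recording the interaural differences vary over time, and it is this variation that the spaciousness-related parameters of the new method build on.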

Samenvatting

Hoormodellen voor het beoordelen van ruimte-akoestiek

De akoestiek van een concertzaal, of van een willekeurige andere ruimte, wordt meestal beoordeeld door het meten van pulsresponsies op één of meerdere bron- en ontvangerposities. Uit deze responsies kunnen vervolgens objectieve maten worden berekend, die gerelateerd zijn aan diverse perceptuele aspecten van ruimte-akoestiek. Een groep van deze parameters is verzameld in ISO-norm 3382-1. Echter, deze methode van het beoordelen van ruimte-akoestische kwaliteit heeft enkele grote tekortkomingen. Om te beginnen is het bekend dat de perceptie van de akoestiek van een ruimte afhangt van het type bronsignaal. Hier wordt geen rekening mee gehouden als alleen pulsresponsies in beschouwing worden genomen. Verder worden metingen doorgaans verricht in lege ruimtes, vanwege het type testsignaal dat wordt gebruikt tijdens zulke metingen, terwijl de akoestiek drastisch kan veranderen wanneer een ruimte volledig is bezet met mensen. Tot slot is er bewijs in de literatuur van gevallen waarin de parameters niet goed correleren met de perceptie. Het blijkt bijvoorbeeld dat de parameters erg kunnen fluctueren binnen kleine meetintervallen, terwijl de perceptuele aspecten die voorspeld zouden moeten worden door deze parameters constant blijven. Kennelijk worden belangrijke aspecten van het menselijk hoorsysteem niet meegenomen.

In dit proefschrift wordt een nieuwe methode beschreven. De methode bestaat uit het verwerken van willekeurige binaurale audio-opnames met gebruik van een binauraal, niet-lineair hoormodel. Deze opnames (of simulaties) moeten worden gemaakt met een kunsthoofd. Het model simuleert de belangrijkste onderdelen van het hoorsysteem, zoals de responsie van het binnenoor, het basilair membraan en de haarcellen, neurale adaptatie en binaurale interactie. Met gebruik van een piekdetectie-algoritme worden de uitgangssignalen van het model gesplitst in twee signalen: één gerelateerd aan de bron (direct geluid) en één gerelateerd aan de ruimte (galmend geluid). In combinatie met de berekening van de hoeveelheid fluctuatie over tijd in de Interaural Time Difference (ITD) kunnen hieruit parameters worden bepaald die gerelateerd zijn aan de perceptuele aspecten reverberance (galm), clarity (duidelijkheid), apparent source width (schijnbare bronbreedte) en listener envelopment (omhulling van de luisteraar).

De nieuwe methode is gevalideerd door middel van vier luistertests. De proefpersonen moesten in deze testen waarderingen geven aan de vier perceptuele aspecten die hierboven zijn besproken, voor diverse ruimte/stimulus-combinaties. Twee verschillende stimuli zijn hierbij gebruikt: mannelijke spraak en cellomuziek. In twee luistertesten zijn virtuele ruimtes gebruikt, die binauraal zijn gesimuleerd door middel van een simulator voor ruimtes met de vorm van een schoenendoos, die ook in dit proefschrift wordt beschreven. In de twee andere testen zijn echte ruimtes gebruikt, waarvan de pulsresponsies binauraal zijn gemeten. Er zijn statistische analyses uitgevoerd op de resultaten, om te evalueren welke factoren een significant effect hebben op de resultaten. In alle testen werden significante verschillen tussen de ruimtes gevonden. Het bronsignaal had in enkele gevallen ook een significant effect. Met gebruik van de resultaten van één van de luistertests zijn de vrije parameters in het model, zoals boven- en ondergrenzen voor de frequentie, geoptimaliseerd door middel van een genetisch algoritme. Vervolgens is de methode gevalideerd door de correlatiecoëfficiënten tussen de nieuwe parameters en de gemiddelde waarderingen te berekenen. De resultaten zijn veelbelovend. In de meeste gevallen correleren de nieuwe parameters beter met de perceptuele data dan de conventionele parameters. Bovendien waren er veel minder gevallen waarin de nieuwe parameters resulteerden in een niet-significante correlatie, vergeleken met de conventionele parameters.

Naast de goede resultaten op het gebied van correlatie met de perceptie heeft de methode andere voordelen. Aangezien de methode willekeurige binaurale audio-opnames accepteert, kan een meting worden uitgevoerd in een bezette ruimte, bijvoorbeeld tijdens een uitvoering. Op deze manier wordt het effect van de aanwezigheid van een publiek op de akoestiek automatisch meegenomen. Bovendien zijn de parameters inhoud-specifiek, wat betekent dat de manieren waarop de temporele en spectrale eigenschappen van het bronsignaal de perceptie van de akoestiek van een ruimte beïnvloeden naar voren komen in de resulterende parameters.

Diverse praktische aspecten van de nieuwe methode komen aan bod, zoals de robuustheid tegen ruis en de invloed van het type bronsignaal. Drie verschillende signaalcategorieën zijn getest: spraak-, instrument- en ensemblesignalen. Het blijkt dat zelfs binnen één categorie verschillen tussen de parameters worden gevonden voor de verschillende bronsignalen. Deze verschillen konden allemaal worden verklaard door de temporele en spectrale eigenschappen van de signalen en hoe deze eigenschappen een effect hebben op de perceptie van de akoestiek van een ruimte. Tot slot zijn ook de effecten van de positie en oriëntatie van het kunsthoofd op de parameters bepaald. De resultaten laten geen enorme fluctuaties zien als een functie van offset of hoek, hoewel dit nog niet objectief kan worden gekwantificeerd zolang de just noticeable differences (JNDs, net hoorbare verschillen) van de nieuwe parameters onbekend zijn.


About the author

Jasper van Dorp Schuitman was born in Zwijndrecht, the Netherlands, on November 4th, 1980. He attended secondary school at the Walburg College in Zwijndrecht. After receiving his VWO diploma, he started studying Applied Physics that same year at the Faculty of Applied Sciences of Delft University of Technology, the Netherlands. From February to August 2006, he was an intern at Acoustic Control Systems (ACS) in Garderen, the Netherlands, where he contributed to the development of a new digital electro-acoustical system. He graduated in 2006 from TU Delft at the Laboratory of Acoustical Imaging and Sound Control; his MSc research focused on applying room compensation techniques in Wave Field Synthesis (WFS) systems. Following his graduation, his supervisor Diemer de Vries invited him to carry out PhD research on the assessment of room acoustical quality. He gladly accepted this invitation, and the thesis that lies before you is the result of that research project. Currently, he is employed at Philips Research in Eindhoven, the Netherlands, where he works as a research scientist in the field of audio and room acoustics.
