Audio. Wavelets. Neural Nets


Technische Universitaet Berlin
Institut fuer Telekommunikationssysteme
Fachgebiet Nachrichtenuebertragung

Audio Content Description with Wavelets and Neural Nets

Diploma Thesis, November 2003
Stephan Rein (94434)

Prof. Dr.-Ing. Thomas Sikora
Prof. Dr. Martin Reisslein (a)
Dr. Nicolas Moreau

(a) Prof. Dr. Martin Reisslein is with Dept. of Electrical Engineering, Arizona State University.

Abstract

We examine MPEG-7 audio tools and Fourier techniques for precision audio content description. We find that the MPEG-7 dyadic scaling procedure and the short-term Fourier transform are less suitable for the description of highly complex audio content when generalization properties are required. We develop a novel wavelet envelope descriptor with good generalization properties for audio content description and a methodology for a statistical, descriptive analysis of wavelet data for the derivation of elementary content description tools. We examine the usability of a combination of 39 different wavelets and three different types of neural nets for precision audio content description. We obtain promising results for a combination of specific wavelets with a probabilistic radial net.

The proposed methodology is designed for an identification service for classical movements (musical parts of a composition) to be realized by next generation internet search machines. The content description data can be computed efficiently and generalizes the audio content. Single audio compositions are identified even if they are very similar to each other and significantly different from the identification system's example sets. The radial net is trained with vectors from 96 example pieces and allows retrieval of 32 novel classical audio movements. The system already obtains a success rate of 78 % when trained with only three independent example sets. A training procedure that processes a large number of independent example sets is not necessary. In addition, a similarity vector containing labels of similar pieces is computed as a possible answer to a user query.

For our methodology we employ a novel wavelet dispersion measure that is based on the ranks of wavelet coefficients. This measure is able to efficiently describe and summarize highly complex wavelet patterns and is therefore an addition to current signal communication techniques.

CONTENTS

I     Introduction
      I-A    Related Work
II    The Audio Data Base
III   MPEG-7 Audio Content Descriptors
      III-A  Summary
IV    Survey on Wavelets
      IV-A   Wavelets for Audio Content Description
V     Content Description with Wavelets
      V-A    Gaussian Wavelet Envelope Descriptor
      V-B    Statistical Wavelet Analysis for Content Description
             V-B.1  Statistical Data Summarization Tools
             V-B.2  Scale Frequency Measure
             V-B.3  Percentile Correlations
      V-C    Summary
VI    A novel Wavelet Dispersion Measure
      VI-A   Wavelet Dispersion Classifier Matrix
      VI-B   Wavelet Dispersion Measure Dimension Reduction
      VI-C   Wavelet Dispersion Measure Performance Indicator
VII   Neural Nets for Audio Classification
      VII-A  Perceptron Neural Networks
      VII-B  Multilayer feedforward Backpropagation Neural Networks
      VII-C  Probabilistic Radial Basis Neural Network
VIII  Performance Analysis
      VIII-A Wavelet Performance
      VIII-B Similarity Matrix
      VIII-C Summary
IX    Conclusion

Appendix
A     German Summary (Deutsche Zusammenfassung)
B     Definition of Statistical Summarization Tools
C     Percentile Plots
D     MPEG-7 Audio
      D.1    Basic
      D.2    Audio Power
      D.3    Basic Spectral
      D.4    Spectral Basis
      D.5    Signal Parameters
      D.6    Timbre Descriptors
E     Audio Data Base
F     Matlab code for calculation of a Weighting Matrix
G     Matlab code for Key Results Wavelet Dispersion Measure
H     Matlab code for a Probabilistic Neural Net

References

I. INTRODUCTION

Due to the immense and growing amount of audiovisual data that is available world-wide, the development of a technique for content retrieval and classification has become a challenging task. The description of multimedia content is the key to the improvement and acceleration of various current technologies. It will also allow these technologies to be combined to realize completely novel applications, as shown in Figure 1.

[Figure 1: content description links techniques such as voice recognition, artificial intelligence, and multimedia content retrieval, and combines them into next generation applications.]

Fig. 1
CONTENT DESCRIPTION IS THE KEY TO THE IMPROVEMENT OF VARIOUS TECHNOLOGIES AND THUS CAN REALIZE COMPLETELY NOVEL APPLICATIONS, SUCH AS INTELLIGENT AND INTERACTIVE PERSONAL DIGITAL ASSISTANTS.

Next generation internet search machines will be able to understand and process multimedia content. More precisely, a user query can be a mixture of multimedia data including text, voice, picture and video content. The search machine will give a reasonable answer by providing content that is closely related to the query and relevant for the user. In this thesis we present a novel audio retrieval methodology that is readily applicable for next generation internet search machines, see Figure 2. The proposed technique provides good generalization abilities, as it allows for identification of audio data that is not part of the example set of the search system. The proposed methodology can be extended to the description of multidimensional content.

Audio content description for implementation in internet search machines has to comply with certain requirements. In the ideal case, the proposed methodology keeps the concept of current internet search algorithms, so that internet multimedia retrieval becomes available through a software update. The requirements for an applicable methodology are as follows. Due to the immense amount of world-wide audio data, the descriptors must have a very compact representation. The methodology must provide an efficient computation scheme for the construction of these descriptors. An efficient mapping procedure for the descriptors is necessary to allow a user-oriented search and retrieval service. The methodology must be readily applicable. Clearly, there must be no constraints that allow the descriptors to perform well only under certain circumstances. The descriptors shall work for the various kinds of audio data that are available world-wide.


[Figure 2: a theatre education class sends a small extract of 4 seconds of the unknown background music in a student's movie as an audio content query. An internet server with content descriptors identifies the composition (J.S. Bach, Sonata No. 1, Part IV, recorded 1957 by Y. Menuhin) and finds similar compositions.]

Fig. 2
A NOVEL WAVELET DISPERSION MEASURE ALLOWS A METHODOLOGY FOR EFFICIENT AUDIO CONTENT RETRIEVAL TO BE INTEGRATED IN NEXT GENERATION INTERNET SEARCH MACHINES.

This thesis is organized as follows. In Section I-A a survey on related work is given. There exists a large array of audio content description literature. However, to the best of our knowledge, a methodology for the identification of highly complex musical audio recordings that are not part of the search system's data base has not yet been proposed. In Section II we present the audio recordings we have employed for the derivation and evaluation of our methodology. Precision descriptors must be able to identify and categorize audio compositions within a musical genre. This is a complex challenge, as the audio data to be categorized into different classes is very similar.

In Section III we study MPEG-7 audio description tools that are mainly based on Fourier techniques. We calculate Fourier coefficients and discuss the standardized dyadic scaling procedure. We find that these tools are not designed for high-precision classification. In Section IV we compare Fourier analysis with wavelet analysis. Both techniques are based on the same concept. However, wavelets are designed to describe very irregular and nonstationary signals and are therefore well suited for audio content description.

In Section V, we examine wavelet techniques for audio content description. We find that in the wavelet domain, very specific patterns represent the audio content. The wavelet coefficients measure the similarity to mother wavelet functions. These similarities describe special audio content features that allow for the construction of descriptors that are able to generalize. Such a generalization is essential for the identification of audio compositions that are unknown to the classification system. The methodology proposed in this thesis is the result of a variety of analytic calculations and experiments. We develop a novel analytic wavelet envelope descriptor and a methodology for a statistical analysis of wavelet data. Although these tools are not included in our finally proposed methodology, Section V reports the corresponding findings, because they led to a novel statistical dispersion measure and might be useful for further investigations.

In Section VI we propose a novel wavelet dispersion measure. We find that this measure is suitable to efficiently describe the wavelet patterns discovered in Section V. We conduct some experiments which indicate that the wavelet dispersion data can be processed by a neural net, thus realizing a computationally effective mapping and classification technique. We expect this measure to be also useful for the improvement of other signal communication techniques, including speech recognition, as it extracts specific features of the time domain that allow for content identification and generalization. In Section VII, we give a short tutorial on three different types of neural nets, which are employed in our performance analysis. We explain why neural nets can be very useful for audio classification problems.

In Section VIII we examine the performance of our wavelet dispersion measure for different wavelet families and individual wavelets. We measure the success rate of the dispersion measure when combined with three different types of neural nets. The finally proposed methodology achieves a mean success rate of 78 % for 32 highly complex audio pieces from a recording that is not in the search system's data base. For each of the 32 very similar pieces, the identification system employs an example set of three pieces from different recordings. The identification success rate for audio files known to the system is approximately 100 %. In Section IX, we summarize our findings and outline further investigations.

TABLE I
SONATAS AND PARTITAS FOR THE SOLO VIOLIN, COMPOSED BY J.S. BACH AROUND 1720. THE SIX PIECES ARE PRESENTED IN PAIRS, ALTERNATELY SONATA-PARTITA. ESPECIALLY THE SONATAS HAVE A VERY SIMILAR STRUCTURE, THUS RESULTING IN A SPECIAL CHALLENGE FOR AUDIO CONTENT CLASSIFIERS. OUR PRECISION CLASSIFIERS ARE EVALUATED FOR DISTINCTION BETWEEN THESE 32 HIGHLY COMPLEX MOVEMENTS.

Sonata No. 1 G minor     Sonata No. 2 A minor     Sonata No. 3 C major
I    Adagio              I    Grave               I    Adagio
II   Fugue               II   Fugue               II   Fugue: Alla breve
III  Siciliano           III  Andante             III  Largo
IV   Presto              IV   Allegro             IV   Allegro assai

Partita No. 1 B minor    Partita No. 2 D minor    Partita No. 3 E major
I    Allemande           I    Allemande           I    Preludio
II   Double              II   Courante            II   Loure
III  Courante            III  Sarabande           III  Gavotte en Rondeau
IV   Double              IV   Gigue               IV   Menuet I
V    Sarabande           V    Chaconne            V    Menuet II
VI   Double                                       VI   Bourrée
VII  Bourrée                                      VII  Gigue
VIII Double

A. Related Work

There exists a large body of literature on audio content description, sound classification, and audio retrieval. This literature includes audio fingerprinting systems for the identification of audio songs known to the search system's data base, see for instance [1] [2] [3] [4] [5]. Our system differs from these works in that it identifies unknown complex audio with a high success rate. The existing body of literature also includes retrieval systems for the categorization of different sounds. Generally, such a system is trained with a number of example sounds for the classification of novel sound segments into content based classes, see for instance [6] [7] [8] [9] [10]. MPEG-7 tools and Mel-frequency cepstrum coefficients are combined with hidden Markov models in [11] to label sports audio data by one of 6 sound classes. In [12] [13], support vector machines and line methods are employed to classify audio sounds into 6 sound classes. In [14] wavelet data is employed to classify data files containing speech, music, and sounds.
There exist systems for artist detection [15] and music type detection [16] [17] [18]. Our system differs from these classification systems in that it identifies movements from different performances (differing among other things in background noise) of the same highly complex classical composition. Furthermore, it employs a novel wavelet summarization measure. To the best of our knowledge, this is the first work to propose a methodology for identifying highly complex musical audio recordings that are not part of the search system's data base.

II. THE AUDIO DATA BASE

There exists a huge, unmanageable body of music recordings, and a categorization in terms of user relevance and importance is not possible. There exist reference audio data bases for different sounds, including sounds of birds, telephone, or laughter. Other audio data bases are constructed from popular charts. In this thesis we employ a specific data base, as we want to classify audio pieces within one genre. We have chosen six pieces composed by Johann Sebastian Bach, the six Sonatas and Partitas for the Solo Violin, Bachwerkeverzeichnis (BWV) 1001-1006, as shown in Table I. Appendix E provides a detailed description of the employed recordings, including time duration and labels.

TABLE II
WE CONSIDER FOUR DIFFERENT PERFORMANCES OF THE SONATAS AND PARTITAS, THUS RESULTING IN 128 AUDIO FILES WITH A TOTAL LENGTH OF SEVERAL HOURS. THE RECORDINGS REPRESENT DIFFERENT LEVELS OF QUALITY. THE RECORDING TECHNIQUE OF 1934 DID NOT ALLOW FOR CORRECTION OF PERFORMANCE ERRORS.

player            year   studio               location    label
Yehudi Menuhin           Studio Albert        Paris       EMI Records
Yehudi Menuhin    1957   Abbey Road Studios   London      EMI Records
Jascha Heifetz    1952   RCA Studios          Hollywood   Bertelsmann Music Group
Nathan Milstein   1973   Conway Hall          London      Polydor International GmbH

We summarize the requirements that are fulfilled by this data base as follows:

1) The audio data base must be of consistent relevance. This requirement is not fulfilled by the frequently employed charts, because they are not stable. Clearly, classification techniques derived using a chart of 2003 might not work reasonably with a chart of 2002. The Sonatas and Partitas of Bach were composed around 1720. They are standard literature for the violin, which is the most frequently employed instrument in classical music. Bach's music is current today and will remain current for years to come.

2) For interpretable results, an available manuscript describing the musical compositions is useful. A basic study of the audio material allows the musical events to be located in time as well as in frequency. According to our findings, even when conducting statistical experiments, a relation to single events within the composition is comprehensible and can be exploited for the derivation of novel techniques. Furthermore, a manuscript ensures that there are comparable performances available. Our data base contains four performances recorded by three different players, as shown in Figure 3. Such performances allow for the construction of descriptors with good generalization properties.

3) The considered audio files shall represent different levels of quality. In terms of next generation internet search machines, audio information of various quality has to be processed. Our four chosen recordings cover a wide range of the audio qualities available today. Yehudi Menuhin made the first complete recording of the Sonatas and Partitas. This recording represents the recording studio technique of 1934.
Despite the audio quality constraints, this historical recording is special because technical performance errors could not be corrected. A studio performance had to be recorded without any breaks or rerecordings. We consider the recording of Nathan Milstein as up-to-date audio quality. Even though this performance was recorded in 1973 using analog techniques, there is no measurable difference to digital state-of-the-art recordings in terms of audio content information. The digital technique is of relevance for consumer-oriented lossless audio reproduction. For our experiments we use a digitally remastered studio copy that reproduces the original sound image of the recorded performance.

4) The considered compositions shall reveal polyphonic and non-separable phenomena. Bach's Sonatas and Partitas demand that the player plays on several strings concurrently. Clearly, although there is only one solo violin, a sound comparable to the performance of many violin players is present. This is a special challenge to the compactness of the descriptors. Although there are multiple voices, a dimension reduction, for example by a sub-space estimation technique, is not possible, because the different voices do not fulfill the statistical requirements of this technique. Furthermore, such a separation is difficult because the different voices do not occur in fixed frequency bands.

For the extraction of audio content description features, the recordings were down sampled to 8 kHz using the software cooledit 2.3.

III. MPEG-7 AUDIO CONTENT DESCRIPTORS

The MPEG-7 (Moving Pictures Experts Group) standard is an ISO/IEC standard and describes a variety of content description tools. The standard makes content retrieval applications possible and compatible in such a way that content queries of professional as well as ordinary users are answered efficiently. To obtain a broad generality, the standard does not standardize or evaluate content retrieval applications.
In this section we study the suitability of the standardized MPEG-7 audio classifiers for precision classification. MPEG-7 audio provides 1) a platform for the description data, 2) low-level tools, and 3) high-level tools.

Fig. 3
FOUR PERFORMANCES RECORDED BY THREE STARS OF THE LAST CENTURY: YEHUDI MENUHIN, JASCHA HEIFETZ, AND NATHAN MILSTEIN. THE RECORDING OF YEHUDI MENUHIN MADE AROUND 1935 WAS THE FIRST COMPLETE RECORDING OF BACH'S SONATAS AND PARTITAS FOR THE SOLO VIOLIN. MORE THAN TWENTY YEARS LATER, Y. MENUHIN RERECORDED THE PARTITAS AND SONATAS. OUR METHODOLOGY ALLOWS FOR IDENTIFICATION OF THE AUDIO COMPOSITION AND THE PART WITHIN THE AUDIO COMPOSITION. IF THE RECORD IS KNOWN TO THE SYSTEM, THE PLAYER AND THE DATE WHEN THE PERFORMANCE WAS RECORDED ARE GIVEN.

The platform for the description data is an interface allowing compatibility between the different applications that are built on MPEG-7 audio descriptors. This interface describes a set of standardized data containers, which store the data provided by the audio content descriptors. Precision classifier data could be stored in such containers, thus allowing interoperability between different applications. The high-level tools combine the techniques of the low-level tools to allow a variety of high-level applications. They are designed to allow audio signature description, musical instrument timbre description, melody description, general sound and indexing description, and spoken content description. For a reasonable usage of these description tools, the low-level descriptors must provide meaningful features of the audio signal. Therefore, a study of the low-level tools is essential for precision classification.

The low-level tools include techniques to describe time and frequency domain features of audio signals. We have thoroughly studied these tools and considered them for a possible inclusion in our methodology. In Appendix D, the formulas for the low-level MPEG-7 descriptors are detailed. Here we only briefly discuss the functionality of these descriptors and concentrate on the MPEG-7 elementary description tools. Table V shows MPEG-7 Basic and Basic Spectral descriptors.
The Basic descriptors are useful for displaying audio signals. The Basic Spectral descriptors provide elementary spectral features of audio signals. Table VI shows MPEG-7 dimension reduction tools. Generally, for audio content descriptors, a compact representation of the descriptive data is useful. This technique is able to separate different voices of musical instruments. Table VII summarizes the MPEG-7 timbre descriptors. These descriptors allow for a distinction between different tonal components. For example, different sounds of instruments can be classified. In addition, a simple silence detection tool is provided by MPEG-7.

Many of the descriptors detailed here have already been evaluated in the related literature. They allow for the classification of different elementary sounds. The MPEG-7 dimension reduction technique is of secondary importance for precision classifiers, which generally aim to identify audio pieces by analysis of only very small extracts or segments. The MPEG-7 silence detector is not useful for our consistently loud audio pieces. Among all these MPEG-7 audio descriptors we consider the Basic Spectral descriptors most relevant for precision content description. Especially the Audio Spectrum Envelope and the Audio Spectrum Flatness descriptors allow for the extraction of data to describe tonal components. The Audio Spectrum Envelope descriptor is of special relevance, as it provides the Fourier coefficients to be processed by almost all other MPEG-7 descriptors. Furthermore, for the Audio Spectrum Envelope a specific dyadic scaling procedure is specified, which is also used for the Audio Spectrum Flatness descriptor. Therefore, we now examine the Audio Spectrum Envelope descriptor.

a) Audio Spectrum Envelope: The Audio Spectrum Envelope descriptor employs a short time Fourier transform with overlapping Hamming windows. Let $lw$ denote the length of the analysis window in samples. The position of each window is described by a shift $h$, which is the number of samples by which the Hamming window is moved along the audio file to reach the next analysis window position. Let $s(n)$ denote the Hamming windowed audio signal and $N$ the fast Fourier transform size. To allow the use of fast Fourier techniques, $N$ is set to the next power of 2 above $lw$. As a consequence, the analysis window is enlarged by zero padding, which results in a larger number of Fourier coefficients. This process is only a pseudo-enlargement of the frequency resolution. The Fourier coefficients $X_w(k)$ are calculated as follows:

$$X_w(k) = \sum_{n=1}^{N} s(n)\, e^{-j 2\pi (k-1)(n-1)/N}, \qquad 1 \le k \le N. \tag{1}$$

Each coefficient belongs to one of $N$ frequencies. Only half of these frequencies are retained due to the symmetry of the Fourier transform. The frequency distance $DF$ between two adjacent frequencies is given as

$$DF = \frac{sr}{N}, \tag{2}$$

where $sr$ denotes the sampling rate. This is a standard procedure to obtain the Fourier coefficients for an analysis frame. The MPEG-7 standard now specifies a grouping of these coefficients to obtain a logarithmic frequency axis. This frequency axis is considered because of the logarithmic frequency properties of the human ear. To obtain such a frequency axis, logarithmic frequency bands are defined. The edge frequencies of these bands are defined as

$$f_{edge} = 2^{rm}\ \text{kHz}, \qquad m \in \mathbb{Z}, \tag{3}$$

where $r$ is the resolution in octaves and $m$ indexes the edge frequencies. For $r = 1/4$ there are 4 edge frequencies per octave. Table III shows the calculated edge frequencies for $m = -16, \ldots, 8$ and a resolution of $r = 1/4$. The 25 edge frequencies result in 24 bands.

TABLE III
EDGE FREQUENCIES IN [HZ] FOR LOGARITHMIC BANDS FOR m = -16, ..., 8 AND A RESOLUTION OF r = 1/4.
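As a minimal sketch of Equations (2) and (3), the following Python snippet computes the frequency distance DF and the logarithmic edge frequencies. The sampling rate of 8 kHz follows the text. The FFT size N = 512, the index range m = -16, ..., 8 with r = 1/4, and the function names are assumptions chosen for illustration, not values confirmed by the thesis code.

```python
def frequency_spacing(sr, n_fft):
    """Distance DF in Hz between adjacent Fourier frequencies, DF = sr / N."""
    return sr / n_fft


def edge_frequencies(r, m_min, m_max):
    """Band edges f_edge = 2**(r*m) kHz for m = m_min..m_max, returned in Hz."""
    return [1000.0 * 2.0 ** (r * m) for m in range(m_min, m_max + 1)]


# Assumed parameters: quarter-octave resolution, m = -16..8, sr = 8 kHz, N = 512.
edges = edge_frequencies(r=0.25, m_min=-16, m_max=8)
df = frequency_spacing(sr=8000.0, n_fft=512)

print(len(edges), "edges,", len(edges) - 1, "bands")
print("loedge = %.1f Hz, hiedge = %.1f Hz" % (edges[0], edges[-1]))
print("DF = %.3f Hz, width of first band = %.2f Hz" % (df, edges[1] - edges[0]))
```

Under these assumptions the lowest band is narrower than DF, which already hints at the sparsely filled lower bands discussed in this section.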
Each band is represented by a mean value calculated from the Fourier coefficients that refer to this band. The frequencies 62.5 Hz and 4000 Hz are denoted as loedge and hiedge. Two additional values have to be calculated for the out-of-band energy in the ranges 0, ..., loedge and hiedge, ..., sr/2, where sr/2 represents the Nyquist frequency. Importantly, for the calculation of a value that represents a logarithmic band, an assignment rule has to be followed: Fourier coefficients with frequencies closer than DF/2 to a band edge have to be shared between the two adjacent bands in such a way that each band retains a part of the coefficient. A linear weighting function estimates these parts. This procedure is explained by Figure 4. In fact, a logarithmic frequency band contains Fourier coefficients from loedge - DF/2 to hiedge + DF/2. However, they have to be partially weighted using the weighting function shown in Figure 4.

[Figure 4: value of the weighting function over the linear frequency axis, with the left band edge, the right band edge, the width of the logarithmic frequency band, and the distance DF/2 marked at the band edges.]

Fig. 4
THE FOURIER COEFFICIENTS HAVE TO BE WEIGHTED FOR CALCULATION OF A LOGARITHMIC BAND VALUE. THEREBY THE FOURIER COEFFICIENTS ARE SCALED FROM A LINEAR AXIS TO A LOGARITHMIC AXIS. THIS PROCEDURE IS SPECIFIED IN MPEG-7 DUE TO THE LOGARITHMIC SCALING PROPERTIES OF THE HUMAN EAR.

For a computationally effective realization of such a method, we propose the construction of a weighting matrix. The appropriate Matlab code is given in Appendix F. We first calculate the short term Fourier coefficients. The values of these coefficients are retained in a matrix $C$ with $N$ rows, the number of Fourier frequencies, and $F$ columns, the number of analysis frames. Each column of this matrix contains the Fourier coefficients of one analysis frame. With an appropriate weighting matrix $W$, the matrix $D$ containing $L$ logarithmic band values per column is given as

$$D = (W \cdot C) \oslash \bar{W}, \tag{4}$$

where $\bar{W}$ denotes a matrix containing the number of Fourier coefficients that are considered for each logarithmic band value. In Equation 4, $\cdot$ denotes a matrix product, whereas $\oslash$ denotes an element-by-element operation, here the division of each summed band value by the number of coefficients that contribute to it. The matrix $\bar{W}$ thus turns the summed Fourier coefficients into mean values. It is constructed from the sums of the rows of $W$: the resulting vector is repeated $F$ times to construct an $L$-by-$F$ matrix $\bar{W}$. A column vector of $D$ then contains the logarithmic band values that belong to an analysis frame. Each value of such a column is the result of a scalar product between a row of the weighting matrix and a column of the Fourier matrix $C$. Therefore each row of the weighting matrix has to select the appropriate values of a column of the Fourier matrix to construct a logarithmic band value. The weighting matrix must contain as many rows as there are logarithmic bands and as many columns as there are Fourier frequencies. Table IV shows an example of a weighting matrix. This matrix, and especially the matrix $\bar{W}$, allow for inspection of the number of Fourier coefficients that each logarithmic band contains.

TABLE IV
EXAMPLE OF A WEIGHTING MATRIX WITH THE FIRST 2 COLUMNS AND THE FIRST 8 ROWS. EACH ROW SELECTS FOURIER COEFFICIENTS FOR A LOGARITHMIC BAND VALUE. VALUES LARGER THAN 0 AND SMALLER THAN 1 INDICATE THAT FOURIER COEFFICIENTS ARE SHARED BETWEEN ADJACENT LOGARITHMIC BANDS. WE PROPOSE SUCH A MATRIX FOR EFFECTIVE CALCULATION OF THE LOGARITHMIC BAND VALUES.

Generally, this methodology to obtain a logarithmic frequency axis is very sensitive to the choice of the logarithmic edge frequencies.
Even when using the MPEG-7 default values for the lowest logarithmic edge frequency and the frequency resolution, it can happen that a logarithmic band remains empty, which results in unsuitable content description data. Furthermore, the number of coefficients per band increases exponentially. Therefore, the lower bands contain a significantly smaller number of coefficients than the higher bands. In our example, the first two bands contain less than one coefficient. We expect such a system of unequally filled bands to react sensitively to aliasing errors. Figure 5 shows the logarithmic band values of our example when using the MPEG-7 default values for the lowest band edge frequency (62.5 Hz) and the frequency resolution (4 frequencies per octave). A fine structure is only visible on half of the range of logarithmic bands. Clearly, such a scaling procedure may distort audio content information when processing highly complex audio data.
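The weighting matrix idea can be sketched in a few lines of Python. This is not the Matlab code of Appendix F. It assumes that each Fourier coefficient stands for a frequency interval of width DF centered on its frequency and that its weight for a band is the fraction of this interval that lies inside the band, which is one way to realize a linear sharing of coefficients near a band edge. The FFT size N = 512, the edge frequencies for m = -16, ..., 8 with r = 1/4, and all function names are likewise assumptions.

```python
def edge_frequencies(r, m_min, m_max):
    """Band edges f_edge = 2**(r*m) kHz for m = m_min..m_max, returned in Hz."""
    return [1000.0 * 2.0 ** (r * m) for m in range(m_min, m_max + 1)]


def weighting_matrix(edges, sr, n_fft):
    """Rows: logarithmic bands; columns: Fourier frequencies 0..sr/2.

    Assumption: coefficient k stands for the interval
    [k*DF - DF/2, k*DF + DF/2]; its weight for a band is the fraction of
    that interval inside the band. Coefficients closer than DF/2 to an
    edge are thereby split linearly between the two adjacent bands.
    """
    df = sr / n_fft
    n_freqs = n_fft // 2 + 1
    W = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        row = []
        for k in range(n_freqs):
            left, right = k * df - df / 2, k * df + df / 2
            overlap = min(hi, right) - max(lo, left)
            row.append(max(0.0, overlap) / df)
        W.append(row)
    return W


def band_values(W, C):
    """Mean band values: summed weighted coefficients divided by the row sums of W.

    C has one row per Fourier frequency and one column per analysis frame.
    """
    n_frames = len(C[0])
    D = []
    for row in W:
        count = sum(row)  # number of coefficients in this band
        sums = [sum(w * C[k][f] for k, w in enumerate(row)) for f in range(n_frames)]
        D.append([s / count if count > 0 else 0.0 for s in sums])
    return D


edges = edge_frequencies(0.25, -16, 8)
W = weighting_matrix(edges, sr=8000.0, n_fft=512)
counts = [sum(row) for row in W]
C = [[1.0, 2.0] for _ in range(257)]  # constant test spectrum, two frames
D = band_values(W, C)
print(["%.2f" % c for c in counts[:3]])
```

With these assumed values the first two row sums come out below one, which is in line with the observation that the lowest bands contain less than one coefficient, while the row sums of the higher bands grow with the band width.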

TABLE V
MPEG-7 BASIC AND BASIC SPECTRAL DESCRIPTORS PROVIDE A BASIC TIME DOMAIN ANALYSIS AND A BASIC FREQUENCY DOMAIN ANALYSIS.

Basic
  Audio Waveform             minimum and maximum amplitude value within an audio frame
  Audio Power                temporally smoothed instantaneous power
Basic Spectral
  Audio Spectrum Envelope    short time Fourier transform coefficients, search and comparison
  Audio Spectrum Centroid    center of gravity of log frequency power spectrum, shape of the power spectrum, indicates dominance of either high or low frequencies in the spectrum, measure of perceptual timbre
  Audio Spectrum Spread      second moment of Audio Spectrum Centroid, dispersion of power spectrum, sound distinction tone/noise
  Audio Spectrum Flatness    deviation from flat spectral shape for frequency bands, can indicate tonal components

TABLE VI
MPEG-7 SPECTRAL BASIS AND SIGNAL PARAMETERS DESCRIPTORS. THE SPECTRAL BASIS DESCRIPTORS USE A SINGULAR VALUE DECOMPOSITION TO RETAIN ONLY STATISTICALLY RELEVANT FEATURES. THE SIGNAL PARAMETERS DESCRIBE THE SIGNAL'S PERIODICITY.

Spectral Basis
  Audio Spectrum Basis          statistical basis functions to reduce the dimension of spectrum data
  Audio Spectrum Projection     uses Audio Spectrum Basis for low-dimension representation of the spectrum
Signal Parameters
  Audio Fundamental Frequency   describes signal's fundamental frequency
  Audio Harmonicity             spectrum's harmonicity, distinction between different sounds

TABLE VII
MPEG-7 TIMBRE DESCRIPTORS. THE TIMBRE DESCRIPTORS DESCRIBE MUSICAL AND PERCEPTUAL TIMBRE OR TONE QUALITY INDEPENDENT OF LOUDNESS AND PITCH.
Temporal Timbre                 temporal characteristics of segments, single value for tone quality
  Log Attack Time               signal's time to rise from silence to maximum amplitude
  Temporal Centroid             locate focus of signal's energy, distinction decaying/sustained tones
Spectral Timbre                 spectral features in linear frequency space, perception of musical timbre
  Harmonic Spectral Centroid    power-weighted average of the frequency of the bins in the linear power spectrum, sharpness of a sound
  Harmonic Spectral Deviation   amplitude-weighted mean of the spectrum's harmonic peaks, refers only to tone's harmonic parts
  Harmonic Spectral Spread      normalized amplitude-weighted standard deviation of the harmonic peaks
  Harmonic Spectral Variation   normalized correlation between the harmonic peak amplitudes of two adjacent frames


[Figure 5: left panel "Pa3ivMen36, shift=6, hamm=3*6", frequency [Hz] over time [seconds]; right panel "dyadic scaling, Pa3ivMen36", band index over time [seconds].]

Fig. 5
SHORT TIME FOURIER TRANSFORM (LEFT PLOT) AND THE CORRESPONDING DYADIC BAND VALUES (RIGHT PLOT). THE COLORMAPS SHOW DB-VALUES. THE FINE STRUCTURE OF THE FOURIER COEFFICIENTS IS NO LONGER VISIBLE, ESPECIALLY BECAUSE THE LOWER LOGARITHMIC BANDS ARE NOT SUFFICIENTLY FILLED. FOR A DYADIC SCALE, A WAVELET TRANSFORM IS MUCH MORE SUITABLE.

A. Summary

In this section we have studied MPEG-7 audio content descriptors. Among all these descriptors we consider the Audio Spectrum Envelope and the Audio Spectrum Flatness descriptor to be of possible relevance for precision classification, because they aim to describe tonal structures. For both descriptors a weighting method to obtain a logarithmic frequency axis is specified. We find that this procedure is very sensitive to the parametrization. In our example using the default values for the edge frequencies and the octave frequency resolution, the lower logarithmic bands are not reasonably filled, which results in a very coarse representation of the audio content. We expect such a representation to be unsuitable for a precise description of highly complex audio signals. With respect to scaling properties and a smaller set of parameters, wavelets are much more suitable, as demonstrated in Section IV. In the next section we give a survey on wavelets and discuss their possible relevance for precision descriptors.

IV. SURVEY ON WAVELETS

In this section we give a survey on wavelets for audio content description, see [19] [20]-[25] for a more general introduction to wavelets. As wavelets are closely related to Fourier analysis, we first have a look at Fourier techniques. We further try to answer general questions on content description with both techniques. The Fourier technique is extensively employed in MPEG-7. We explain why we prefer to employ wavelets for the content description of highly complex audio.
Fourier analysis allows every periodic function (here with period $2\pi$) to be represented by a sum of sine and cosine functions,

$$f(x) = a_0 + \sum_{k=1}^{\infty} \left[ a_k \cos(kx) + b_k \sin(kx) \right], \qquad (5)$$

where the Fourier coefficients are given by

$$a_0 = \frac{1}{2\pi} \int_0^{2\pi} f(x)\,dx, \quad a_k = \frac{1}{\pi} \int_0^{2\pi} f(x) \cos(kx)\,dx, \quad b_k = \frac{1}{\pi} \int_0^{2\pi} f(x) \sin(kx)\,dx. \qquad (6)$$

We now want to analyse the first three notes of movement V (Ciaccona) in Bach's Partita No. 2. The notation of these notes is illustrated in Figure 6a. An analytic representation s(t) of these notes can be given as

$$s(t) = \begin{cases}
A_1 \sin(w_1 t) + A_2 \sin(w_2 t) + A_3 \sin(w_3 t), & 0 \le t < t_1 \\
0, & t_1 \le t < t_2 \\
A_4 \sin(w_1 t) + A_5 \sin(w_2 t) + A_6 \sin(w_3 t), & t_2 \le t < t_3 \\
A_7 \sin(w_1 t) + A_8 \sin(w_4 t) + A_9 \sin(w_5 t) + A_{10} \sin(w_6 t), & t_3 \le t < t_4
\end{cases} \qquad (7)$$

With Table VIII, the frequencies $w_i$ shown in Table IX can be calculated. To obtain a temporal accordance with Figure 6a we choose t_1 = .475 seconds, t_2 = .5 seconds, t_3 = 2 seconds, and t_4 = 3 seconds. As Figure 6a does not describe the loudness of the single notes, we assign the same amplitude to every sinusoid. Figure 7a shows a Fourier analysis of this signal. The frequencies are clearly resolved; however, the single notes are not resolved in time.

For this reason the short term Fourier transform has been proposed. This technique performs a Fourier transform on small segments of the time signal only. To reduce discontinuities at the edges of these segments, a window function is generally used, which is slid over the entire time signal. Figure 8a shows such a short term Fourier transform with Hanning windows that overlap by half of their size. The frequencies are still reasonably resolved, but the time resolution is not satisfactory due to the large window size of 3 milliseconds. Therefore we reduce the window size to 8 milliseconds, as shown in Figure 8b. Now the time resolution of the individual events is very good, but the different frequencies are no longer resolved. With Fourier techniques the choice of the window size remains a compromise between a reasonable frequency resolution and a reasonable time resolution. Overall, this compromise between frequency and time should not exclude the powerful Fourier technique from consideration for precision descriptors. We could develop a parametrization technique that constructs two content description vectors to reasonably resolve either time or frequency.
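The window-size compromise described above can be illustrated with a minimal NumPy sketch. It is not the implementation used in this thesis: the sampling rate, the window lengths, the single 440 Hz test tone, and the helper names `stft_hann` and `bin_spacing_hz` are assumptions chosen for illustration only.

```python
import numpy as np

def stft_hann(signal, window_len):
    """Short term Fourier transform with half-overlapping Hann windows."""
    hop = window_len // 2                     # windows overlap by half their size
    window = np.hanning(window_len)           # tapers the edges of each segment
    n_frames = 1 + (len(signal) - window_len) // hop
    frames = np.stack([signal[i * hop:i * hop + window_len] * window
                       for i in range(n_frames)])
    return np.fft.rfft(frames, axis=1)        # one spectrum per time frame (row)

def bin_spacing_hz(sample_rate, window_len):
    """Distance in Hz between neighbouring frequency bins."""
    return sample_rate / window_len

sample_rate = 8000                            # assumed sampling rate in Hz
t = np.arange(sample_rate) / sample_rate      # one second of signal
tone = np.sin(2 * np.pi * 440.0 * t)          # hypothetical test tone at 440 Hz

long_spec = stft_hann(tone, 800)              # 100 ms windows: few frames, fine bins
short_spec = stft_hann(tone, 64)              # 8 ms windows: many frames, coarse bins
```

With these assumed values the long window yields 19 time frames with bins 10 Hz apart, whereas the short window yields 249 time frames with bins 125 Hz apart. Neighbouring tones that are closer together than the bin spacing can then no longer be separated.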
We therefore now analyse a real performance of Bach's Ciaccona using Fourier techniques. Figure 7b shows the measured frequencies as a result of a Fourier transform of Yehudi Menuhin's performance of 1934. The measured frequencies are less precisely resolved than in Figure 7a. In fact, a variety of frequencies are measured that are only partially shown in Figure 7b due to the restricted bandwidth of the plot. These additional frequencies are called overtones and harmonics. Harmonics are integer multiples of the fundamental frequency. Overtones are any resonant frequencies above the fundamental frequency. Overtones can be harmonics. These tones are responsible for the sound timbre and are very important for audio content description. As shown in Figure 9a, the short term Fourier transform has even more difficulties in resolving these frequencies. Figure 9b shows that the choice of a very small Hanning window does not allow for a precise time resolution either. These plots indicate that Fourier techniques are less suitable for precision content description of highly complex audio signals, for several reasons:

1) Each musical sound has a specific timbre that results in highly complex, less regular signals. The spectrum shows smooth structures without sharp bounds, which makes an appropriate parametrization towards either time or frequency resolution difficult.

2) Figure 9b indicates that the performance of Y. Menuhin, concerning tone pitch and temporal resolution, is less in accordance with the tones noted in Bach's manuscript, which are shown in Figure 6a. This is due to Y. Menuhin's individual interpretation of the Ciaccona, which is roughly detailed in Figure 6b. Such individual interpretations are general phenomena in musical performances and make a time- and frequency-oriented generalization extremely difficult.

3) Our analytic low-complexity signal assumed equal amplitudes for each sinusoid. In reality each tone has a different loudness. Thus a reasonable frequency estimation is more difficult.

The Fourier coefficients describe similarities of the audio signal to sinusoids, but these similarities do not describe the very specific features we are looking for. Therefore we now consider the technique of wavelets. Wavelets solve the time-frequency resolution problem of the Fourier technique.

A. Wavelets for Audio Content Description

A wavelet transform is closely related to a Fourier transform. A Fourier transform decomposes a signal into a sum of weighted sinusoids. The weights are called Fourier coefficients. A wavelet transform decomposes a signal into a weighted sum of wavelet functions. The weights are called wavelet coefficients. The Fourier coefficients are calculated by the operation of convolution. A convolution ($*$) can be interpreted as a correlation ($\star$):

$$s(t) * \psi(t) = s(t) \star \psi(-t) \qquad (8)$$

[Figure 6: music notation, panels a and b.]

Fig. 6 FIRST THREE NOTES OF BACH'S FAMOUS CIACCONA. THE MANUSCRIPT (FIG. 6A) ALLOWS FOR A VERIFICATION OF THE MEASURED FREQUENCIES. FOR PRACTICAL REASONS THESE NOTES ARE GENERALLY PLAYED SLIGHTLY DIFFERENTLY. FIG. 6B ROUGHLY DESCRIBES THE INTERPRETATION OF Y. MENUHIN, WHERE THE CHORDS ARE BROKEN. SUCH INDIVIDUAL INTERPRETATIONS, EACH RESULTING IN RECORDINGS THAT DIFFER IN TIME AND FREQUENCY, ARE GENERAL PHENOMENA IN MUSICAL PERFORMANCES AND MAKE A GENERALIZATION OF AUDIO CONTENT EXTREMELY DIFFICULT.

TABLE VIII INTERVALS AND FREQUENCY RATIOS EXCLUDING DIMINISHED AND AUGMENTED INTERVALS. RATIOS WITH SMALL NUMBERS PRODUCE A CONSONANT SOUND, WHEREAS SECONDS AND SEVENTHS PRODUCE A DISSONANT SOUND.

interval          half steps   frequency ratio
unison            0            1:1
minor second      1            16:15
major second      2            9:8
minor third       3            6:5
major third       4            5:4
perfect fourth    5            4:3
perfect fifth     7            3:2
minor sixth       8            8:5
major sixth       9            5:3
minor seventh     10           9:5
major seventh     11           15:8
perfect octave    12           2:1

TABLE IX FREQUENCIES IN [HZ] OF THE FIRST NOTES OF THE CIACCONA CALCULATED FROM FIG. 6.

time segment                   note   frequency
0 <= t < t_1, t_2 <= t < t_3   a      f_3 = 440 Hz
                               f      f_2 = 352 Hz
                               d      f_1 = ... Hz
t_3 <= t < t_4                 e      f_6 = 660 Hz
                               b      f_5 = ... Hz
                               g      f_4 = 391.1 Hz
                               d      f_1 = ... Hz

Here $\psi(-t)$ represents the flipped function $\psi(t)$. Thus the Fourier coefficients can be interpreted as measures of similarity to sinusoids. Similarly, the wavelet coefficients are measures of similarity to wavelet functions. A wavelet coefficient is calculated for a scale s and a position p. The scale s describes how the mother wavelet function is scaled: it can be either dilated or compressed. The position p describes a shift of the wavelet function. Thus, the wavelet coefficients are calculated as

$$C(s, p) = \int_{-\infty}^{\infty} s(t)\,\frac{1}{\sqrt{s}}\,\psi\!\left(\frac{t - p}{s}\right) dt. \qquad (9)$$

Whereas a short term Fourier transform refers to a time-frequency signal representation, the wavelet transform refers to a time-scale signal representation, as illustrated in Figure 10.
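The wavelet coefficient of equation (9) can be sketched numerically by replacing the integral with a sum over a sampled time axis. The following is a minimal illustration, not the procedure used in this thesis: it assumes the Mexican Hat wavelet as mother function, and the time grid, the test signal, and the names `mexican_hat` and `wavelet_coefficient` are hypothetical.

```python
import numpy as np

def mexican_hat(x):
    """Mexican Hat mother wavelet (assumed here as the example wavelet)."""
    c = 2.0 / (np.sqrt(3.0) * np.pi ** 0.25)
    return c * (1.0 - x ** 2) * np.exp(-x ** 2 / 2.0)

def wavelet_coefficient(signal, t, scale, position):
    """Approximate C(scale, position) by a Riemann sum over the grid t."""
    dt = t[1] - t[0]
    psi = mexican_hat((t - position) / scale) / np.sqrt(scale)
    return float(np.sum(signal * psi) * dt)

t = np.linspace(-20.0, 20.0, 4001)            # assumed time grid, step 0.01
# Hypothetical test signal: one wavelet-shaped event at position 2, scale 1.5.
signal = mexican_hat((t - 2.0) / 1.5)

positions = [0.0, 1.0, 2.0, 3.0, 4.0]
coeffs = [wavelet_coefficient(signal, t, 1.5, p) for p in positions]
best_position = positions[int(np.argmax(coeffs))]

# A constant signal is a polynomial of degree zero and should be suppressed.
flat_coeff = wavelet_coefficient(np.ones_like(t), t, 1.0, 0.0)
```

In this sketch the coefficient is largest where the shifted, scaled wavelet lines up with the event in the signal, which is the sense in which C(s, p) acts as a similarity measure.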
Thus the Fourier resolution problem, namely the choice of a window size that favours either time or frequency, does not exist for a wavelet transform. When performing a wavelet decomposition, the s-scaled mother wavelet function is slid along the entire signal s(t). For each shift p a wavelet coefficient is calculated. This procedure is repeated for each scale. The higher the scale, the more dilated the mother function is. Similarly, the lower the scale, the more compressed the mother function is. Therefore, a high scale refers to a low frequency, whereas a low scale refers to a high frequency. A wavelet transform that only uses scales and shifts that are powers of two is called a dyadic wavelet transform. An important property of wavelets is their vanishing moments:

$$\int_{\mathbb{R}} t^j \psi(t)\,dt = 0, \quad j = 0, \ldots, k. \qquad (10)$$

[Figure 7: panels a and b, real X(k) over frequency [Hz].]

Fig. 7 MEASURED FREQUENCIES WITH THE FOURIER TECHNIQUE. FIG. 7A EXACTLY SHOWS THE FREQUENCIES WE HAVE CALCULATED FROM BACH'S MANUSCRIPT. HOWEVER, IT IS BUILT FROM A MATHEMATICAL SIGNAL THAT CANNOT BE GENERATED BY ANY VIOLIN. FIG. 7B SHOWS THE FOURIER ANALYSIS OF THE PERFORMANCE OF Y. MENUHIN. A VARIETY OF OVERTONES ARE MEASURED THAT ARE RESPONSIBLE FOR THE SOUND TIMBRE.

This feature allows for suppression of the polynomials $s(t) = \sum_{j=0}^{k} a_j t^j$. All these polynomials have zero wavelet coefficients. The number of vanishing moments is called the wavelet's order number. Two examples of a wavelet function, illustrated in Figure 11, are the Mexican Hat and the Morlet wavelet, which are defined as

$$\psi_{mexh}(x) = \left( \frac{2}{\sqrt{3}}\,\pi^{-1/4} \right) \left( 1 - x^2 \right) e^{-x^2/2} \qquad (11)$$

and

$$\psi_{morl}(x) = C\, e^{-x^2/2} \cos(5x), \qquad (12)$$

where C is a normalization constant. These two wavelets are exceptions, because wavelets generally do not have an analytical expression. Instead, the wavelet's shape is given by its corresponding filter coefficients. The filter coefficients refer to decomposition filters that allow for a discrete wavelet transform, in which a signal is decomposed into details and approximations, as illustrated in Figure 12. The details refer to the signal's high frequency components, as they are calculated using a high-pass filter. They indicate similarities to the wavelet mother function ψ. For many wavelets there exists an additional function that is very similar to the wavelet mother function: the scaling function φ is related to the approximations, which refer to the signal's low frequency components. Thus, a wavelet decomposition can be performed using high- and low-pass filters that give wavelet approximation and detail coefficients. The shape of the wavelet function can be approximated by upsampling and convolving the high-pass filter, see Figure 13. Similarly, the shape of the scaling function can be approximated by upsampling and convolving the low-pass filter.
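The approximation of the wavelet and scaling function shapes by repeated upsampling and convolution can be sketched as follows. This is a minimal illustration under stated assumptions, not the routine used in this thesis: it uses the well-known four-tap Daubechies 2 low-pass filter, derives the high-pass filter with one common sign convention of the quadrature-mirror relation, and the helper name `cascade` and the choice of seven iterations are hypothetical.

```python
import numpy as np

SQRT3 = np.sqrt(3.0)
# Daubechies 2 low-pass filter (four taps, standard textbook values).
LOWPASS = np.array([1 + SQRT3, 3 + SQRT3, 3 - SQRT3, 1 - SQRT3]) / (4 * np.sqrt(2.0))
# High-pass filter via g[k] = (-1)^k h[N-1-k] (one common sign convention).
HIGHPASS = LOWPASS[::-1] * np.array([1.0, -1.0, 1.0, -1.0])

def cascade(start_filter, iterations):
    """Approximate a function shape by repeated upsampling and convolution.

    Start with the low-pass filter for the scaling function or with the
    high-pass filter for the wavelet; every further iteration upsamples by
    two and convolves with the low-pass filter.
    """
    shape = np.sqrt(2.0) * np.asarray(start_filter, dtype=float)
    for _ in range(iterations - 1):
        upsampled = np.zeros(2 * len(shape) - 1)
        upsampled[::2] = shape                # insert zeros between the samples
        shape = np.sqrt(2.0) * np.convolve(upsampled, LOWPASS)
    return shape

iterations = 7                                # assumed number of iterations
phi = cascade(LOWPASS, iterations)            # scaling function approximation
psi = cascade(HIGHPASS, iterations)           # wavelet function approximation
grid_step = 2.0 ** -iterations                # sample spacing of phi and psi

# First two discrete moments of the high-pass filter (both vanish for Daubechies 2).
moment0 = float(np.sum(HIGHPASS))
moment1 = float(np.dot(np.arange(4), HIGHPASS))
```

In this sketch the samples of `phi` integrate to one and those of `psi` to zero. The two vanishing discrete moments of the high-pass filter correspond to the suppression of constant and linear signal parts.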
It is possible to reconstruct the original signal by means of reconstruction filters. A set of low- and high-pass decomposition and reconstruction filters is called a system of quadrature-mirror filters. The filters have to be designed in such a way that aliasing effects are minimized. The main drawback of the Fourier technique is the fixed size of the analysis window. When frequencies are analysed with a large window, they cannot be sufficiently resolved in time. Using a small window results in a fine time resolution; however, low frequency components can no longer be measured. This drawback is solved by the wavelet analysis. The varying wavelet scale allows for analysis of low frequency components at a fine frequency resolution and for analysis of high frequencies with a fine time resolution. One reason why wavelets have not yet displaced the Fourier technique, especially in the area of content description, may be the less comprehensible interpretation of the different scales. As previously noted, the scales refer to high and low frequencies. However, we note that compared to a Fourier transform, a wavelet decomposition is

[Figure 8: panels a and b, frequency [Hz] over time [seconds].]

Fig. 8 SHORT TERM TIME LOCALIZED FREQUENCY TRANSFORM. THE LEFT PLOT REASONABLY RESOLVES THE FREQUENCIES. THE RIGHT PLOT ALLOWS FOR A CORRECT TIME RESOLUTION OF THE EVENTS. A SHORT TERM FOURIER TRANSFORM ALWAYS IS A COMPROMISE BETWEEN TIME AND FREQUENCY.

[Figure 9: panels a and b, frequency [Hz] over time [seconds].]

Fig. 9 SHORT TERM FOURIER TRANSFORM OF THE FIRST THREE NOTES OF BACH'S CIACCONA PERFORMED BY Y. MENUHIN. THE SAME PARAMETERS AS IN FIGURE 8 HAVE BEEN CHOSEN FOR A REASONABLE FREQUENCY RESOLUTION (LEFT PLOT) AND A REASONABLE TIME RESOLUTION (RIGHT PLOT). THE SHARP AND BOUNDED STRUCTURES OF FIGURE 8 ARE NO LONGER VISIBLE DUE TO THE SOUND TIMBRE. FURTHERMORE, THE PERFORMANCE OF MENUHIN AS DETAILED IN FIGURE 6B RESULTS IN A MORE COMPLEX REPRESENTATION OF THE AUDIO CONTENT IN THE FREQUENCY DOMAIN.

[Figure 10: wavelet function at a higher scale (low frequency) and at a lower scale (high frequency), convolved with the signal over time (position).]

Fig. 10 A WAVELET TRANSFORM IS A TIME-SCALE REPRESENTATION OF THE SIGNAL. THE DILATED OR COMPRESSED WAVELET MOTHER FUNCTION (DIFFERENT SCALES) IS SLID OVER THE ENTIRE SIGNAL (DIFFERENT POSITIONS). A WAVELET COEFFICIENT C(s, p) IS A MEASURE OF SIMILARITY BETWEEN THE SIGNAL AND THE WAVELET FUNCTION FOR A SCALE s AND A POSITION p.

[Figure 11: panels "Mexican Hat" and "Morlet".]

Fig. 11 MEXICAN HAT AND MORLET WAVELET FUNCTION. WAVELET FUNCTIONS GENERALLY DECREASE QUICKLY TOWARDS 0.

not a straightforward frequency estimation technique. In fact, the wavelet coefficients indicate similarities to wavelet functions, which do not have a frequency, because they are not periodic functions. The relation between scale and frequency exists because a pseudo-frequency can be assigned to a wavelet function. These pseudo-frequencies are estimated to describe the shape of a scaled wavelet for a restricted time as closely as possible. Furthermore, an analysis of concurrent frequencies with wavelets is difficult. Generally, for the analysis of frequencies of stationary signals the Fourier technique is preferable.

In this thesis we are looking for an extraction technique that allows for the representation of specific features of the audio signals. Our database contains highly complex, non-stationary and less regular audio signals. From an intuitive point of view it makes more sense to describe these signals using more complex and less regular wavelet functions than periodic sinusoids. Wavelets can reveal very small discontinuities that cannot be described by sinusoids. Figure 14 indicates that the wavelet coefficients in fact describe very specific details of the audio signal. Each of the single events is resolved by very sharp and bounded patterns. A compact description of these patterns would allow for a verification of the assumptions noted here. From now on we consider a non-dyadic wavelet transform for our methodology.
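The assignment of a pseudo-frequency to a scale can be sketched with a small calculation. The relation used below, center frequency divided by the product of scale and sampling period, is a common convention and an assumption here; the text above does not state which estimate is used. Taking 5/(2π) as the center frequency of the Morlet wavelet (read off the cos(5x) term), the sampling rate, and the function name `pseudo_frequency` are likewise assumptions for illustration.

```python
import math

def pseudo_frequency(scale, center_frequency, sampling_period):
    """Pseudo-frequency in Hz assigned to a wavelet at the given scale.

    Assumed convention: F = Fc / (scale * sampling_period), where Fc is the
    center frequency of the mother wavelet in cycles per unit of its argument.
    """
    return center_frequency / (scale * sampling_period)

# Assumed center frequency of the Morlet wavelet, from its cos(5x) term.
MORLET_CENTER = 5.0 / (2.0 * math.pi)

sampling_period = 1.0 / 8000.0                # assumed sampling rate of 8 kHz
low_scale = pseudo_frequency(16.0, MORLET_CENTER, sampling_period)
high_scale = pseudo_frequency(32.0, MORLET_CENTER, sampling_period)
```

Doubling the scale halves the pseudo-frequency, which matches the statement that a high scale refers to a low frequency.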
Recall that a dyadic wavelet transform employs powers of two for the shifts and scales. A dyadic wavelet transform results in a more space-saving representation of the content; however, the extracted features are less readable and of lower precision. For our derivation of a technique for precision classifiers we initially need all the details that can be resolved by the wavelet technique.

[Figure 12: decomposition tree. The signal s(n) passes through low-pass (φ) and high-pass (ψ) filters followed by downsampling, which gives approximations (A_1, coefficients cA_1) and details (D_1, coefficients cD_1); the approximations are decomposed again into A_2 (cA_2) and D_2 (cD_2).]

Fig. 12 WAVELET DECOMPOSITION TREE. A SIGNAL CAN BE DECOMPOSED INTO APPROXIMATIONS AND DETAILS: s = A_1 + D_1 = A_2 + D_2 + D_1.

[Figure 13: two panels showing the Daubechies 2 wavelet after different numbers of iterations.]

Fig. 13 SHAPE OF THE WIDELY USED DAUBECHIES 2 WAVELET. A WAVELET SHAPE CAN BE APPROXIMATED BY UPSAMPLING AND CONVOLUTION OF THE LOW-PASS RECONSTRUCTION FILTER COEFFICIENTS.

V. CONTENT DESCRIPTION WITH WAVELETS

As detailed in Chapter IV, we want to use wavelet coefficients for the description of audio content. The wavelet coefficients precisely describe the audio content; however, we only want to retain a very compact representation that allows for efficient search and retrieval. The feature extraction technique has to solve an extremely demanding problem. On the one hand, very precise content information has to be extracted, because we want to derive precision descriptors. These descriptors allow for classification of audio data even if very similar content is described. On the other hand, the extracted data should allow for a generalization. Clearly, the data must not describe the content too precisely, because then it will not be possible to identify a recording that is not part of the example set of the system. Figure 15 roughly describes the scenario we consider in this chapter. We consider movement iv of Sonata No. recorded by Y. Menuhin and N. Milstein. The recording of N. Milstein represents


Information Sciences 176 (26) 1629 1655 www.elsevier.com/locate/ins Identifying the classical music composition of an unknown performance with wavelet dispersion vector and neural nets q Stephan Rein a,1,