LEARNING THE SIMILARITY OF AUDIO MUSIC IN BAG-OF-FRAMES REPRESENTATION FROM TAGGED MUSIC DATA

12th International Society for Music Information Retrieval Conference (ISMIR 2011)

Ju-Chiang Wang 1,2, Hung-Shin Lee 1,2, Hsin-Min Wang 2 and Shyh-Kang Jeng 1
1 Department of Electrical Engineering, National Taiwan University, Taipei, Taiwan
2 Institute of Information Science, Academia Sinica, Taipei, Taiwan
{asriver, hslee, whm}@iis.sinica.edu.tw, sjeng@cc.ee.ntu.edu.tw

This work was supported in part by the Taiwan e-Learning and Digital Archives Program (TELDAP) sponsored by the National Science Council of Taiwan under Grant NSC 00-263-H-00-03.

ABSTRACT

Due to the cold-start problem, measuring the similarity between two pieces of audio music based on their low-level acoustic features is critical to many Music Information Retrieval (MIR) systems. In this paper, we apply the bag-of-frames (BOF) approach to represent the low-level acoustic features of a song and exploit music tags to help improve the performance of audio-based music similarity computation. We first introduce a Gaussian mixture model (GMM) as the encoding reference for BOF modeling, and then propose a novel learning algorithm that minimizes the similarity gap between low-level acoustic features and music tags with respect to the prior weights of the pre-trained GMM. The results of audio-based query-by-example MIR experiments on the MajorMiner and Magnatagatune datasets demonstrate the effectiveness of the proposed method, which has the potential to guide MIR systems that employ BOF modeling.

1. INTRODUCTION

Measuring the similarity between two pieces of music is a fundamental but difficult task in Music Information Retrieval (MIR) research [1], since music similarity is inherently a matter of human subjective judgment and can be biased among people with different musical tastes and prior knowledge. A piece of music carries a variety of musical contents, including the low-level audio signal; the metadata, such as the artist, album, song name, and release year; and a number of high-level perceptive descriptions, such as timbre, instrumentation, style/genre, mood, and social information (e.g., tags, blogs, and explicit or implicit user feedback). Among these, only the audio signal is always available, while the metadata and high-level perceptive descriptions are often unavailable or expensive to obtain. Owing to this cold-start problem, measuring the similarity between two pieces of audio music based on their low-level acoustic features is critical to many MIR systems [2, 3]. These systems are usually evaluated against objective criteria derived from the metadata and high-level perceptive descriptions, which in fact correspond to the subjective criteria that humans use to measure music similarity. The similarity gap between the acoustic features and human subjective perceptions inevitably degrades the performance of such MIR systems. The gap may come from insufficient song-level acoustic feature extraction or representation and from an ill-suited similarity metric.
Therefore, the goal of improving audio-based music similarity computation is to reduce the gap between audio features and human perceptions, which can be approached from a music feature representation perspective [3-8] or from a similarity learning perspective [1, 10]. Due to the performance glass ceiling that purely audio-based music similarity computation systems have faced, several high-level perceptive descriptions, which are considered to have a smaller gap between the similarity computed on them and the subjective similarity of humans, have been employed in previous work. For example, in [6, 7], an intermediate semantic space (e.g., genre or text caption) is used to bridge and reduce the similarity gap. In recent years, social information has become very popular and has turned into a major source of contextual knowledge for MIR systems. The social information generated by Internet users makes the wisdom of crowds available for investigating the general criteria of human subjective music similarity. In [1], music blogs are exploited to learn a music similarity metric over audio features. In [8], social tags are concatenated with audio features to represent music in a query-by-example MIR scenario. Furthermore, Kim et al. [9] use explicit and implicit user feedback, which can be implemented by collaborative filtering (CF, the user-artist matrix), to measure artist similarity. Surprisingly, their experimental results show that CF can be a very effective source for music similarity computation. Afterwards, the CF data is used in [10] to learn an audio-based similarity metric, and significant improvements in query-by-example MIR performance are achieved with three types of song-level representations, namely acoustic, auto-tag, and human-tag representations.

In the above-mentioned work, music tags are mostly treated as part of the music features used to represent a song [8-10]. In this paper, we instead adopt music tags to create a ground-truth semantic space for measuring human subjective similarity, for three reasons. First, music tags are human labels that reflect human musical perceptions. According to previous studies [9, 10], the similarity derived from tags is closely related to the subjective similarity used for evaluation, i.e., the similarity gap is relatively small. Second, music tags are free-text labels that cover all kinds of musical information, such as genre, mood, instrumentation, personal preference, and metadata, which have been used to objectively evaluate the effectiveness of audio-based similarity computation in previous work. Third, music tags are generally considered noisy, redundant, biased, and unstable when collected from a completely unconstrained tagging environment such as last.fm. Consequently, several web-based music tagging games have been created with the purpose of collecting reliable and useful tags, e.g., MajorMiner.org [11] and Tag A Tune [12]. In these tagging games, music clips are randomly assigned to taggers in order to reduce tagging bias. Carefully extracting tags with high term frequencies and merging equivalent tags can further reduce the noise. With a set of well-refined music tags, a semantic space that simulates human music similarity can be established.

In most audio-based MIR systems, the sequence of short-time frame-based or segment-based acoustic feature vectors of a song is converted into a fixed-dimensional vector so that the song-level semantic descriptions (or tags) can be incorporated into it. The bag-of-frames (BOF) or bag-of-segments approach is a popular and efficient way to represent the set of frame-based acoustic vectors of a song and has been widely used in MIR applications [8, 10, 14]. In the traditional BOF approach, a set of frame representatives (e.g., a codebook, denoted as an encoding reference hereafter) is selected or learned in an unsupervised manner, and a song is then represented by its histogram over the encoding reference. In the BOF representation vector, each dimension represents the effective quantity of its corresponding frame representative (e.g., codeword) within the song. Based on these effective quantities, we can estimate the audio-based similarity of two songs. Motivated by the metric learning for audio-based music similarity computation in [1, 10], one could learn a metric transformation over the BOF representation vector by minimizing the similarity gap between acoustic features and music tags. Since the BOF vector is generated by the encoding reference, the minimization of the similarity gap can instead be achieved by learning the encoding reference rather than learning a metric transformation on the native BOF space. This leads to a supervised method for learning the encoding reference from a tagged music dataset to improve the BOF representation. The learned encoding reference is expected to generalize the BOF modeling better than a stacked transformation over the native metric.

The remainder of this paper is organized as follows. Section 2 describes the audio feature extraction module and the song-level BOF representation. In Section 3, we introduce the method for learning the encoding reference from tagged music data. In Section 4, we evaluate the proposed method on the MajorMiner and Magnatagatune datasets in a query-by-example MIR scenario. Finally, we summarize our conclusions in Section 5.

2. BAG-OF-FRAMES REPRESENTATION FOR ACOUSTIC FEATURES
2.1 Frame-based Acoustic Feature Extraction

We use MIRToolbox 1.3 for acoustic feature extraction [14]. As shown in Table 1, we consider four types of features, namely dynamic, spectral, timbre, and tonal features. To ensure alignment and prevent mismatch of different features within a vector, all features are extracted with the same fixed-size short-time frame. Given a song, a sequence of 70-dimensional feature vectors is extracted with a 50 ms frame size and 0.5 hop shift. Then, we normalize the 70-dimensional frame-based feature vectors in each dimension to zero mean and unit standard deviation.

Types    | Feature Description                                                                                                   | Dim
dynamic  | rms                                                                                                                   | 1
spectral | centroid, spread, skewness, kurtosis, entropy, flatness, rolloff 85, rolloff 95, brightness, roughness, irregularity | 11
timbre   | zero crossing rate, spectral flux, MFCC, delta MFCC, delta-delta MFCC                                                | 41
tonal    | key clarity, key mode possibility, HCDF, chroma, chroma peak, chroma centroid                                        | 17

Table 1. The music features used in the 70-dimensional frame-based music feature vector.
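The extraction pipeline above is MIRToolbox 1.3 in MATLAB, and several of the features in Table 1 (key clarity, HCDF, roughness, etc.) have no one-line Python equivalent. Purely as an illustrative sketch of the same idea (fixed-size frames, a concatenated per-frame feature vector, per-dimension z-score normalization), a rough and partial analogue with librosa might look as follows; the sample rate, frame sizes, and reduced feature subset are assumptions for demonstration, not the paper's configuration.

```python
# Illustrative frame-feature extraction sketch (librosa stand-in for MIRToolbox 1.3).
import numpy as np
import librosa

def frame_features(path, sr=22050, frame_sec=0.050, hop_ratio=0.5):
    y, sr = librosa.load(path, sr=sr, mono=True)
    n_fft = int(frame_sec * sr)          # ~50 ms analysis frame
    hop = int(n_fft * hop_ratio)         # 0.5 hop shift (50% overlap)

    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13, n_fft=n_fft, hop_length=hop)
    feats = [
        librosa.feature.rms(y=y, frame_length=n_fft, hop_length=hop),            # dynamic
        librosa.feature.spectral_centroid(y=y, sr=sr, n_fft=n_fft, hop_length=hop),
        librosa.feature.spectral_rolloff(y=y, sr=sr, n_fft=n_fft, hop_length=hop,
                                         roll_percent=0.85),
        librosa.feature.spectral_flatness(y=y, n_fft=n_fft, hop_length=hop),
        librosa.feature.zero_crossing_rate(y, frame_length=n_fft, hop_length=hop),
        mfcc,
        librosa.feature.delta(mfcc),
        librosa.feature.delta(mfcc, order=2),
        librosa.feature.chroma_stft(y=y, sr=sr, n_fft=n_fft, hop_length=hop),     # tonal
    ]
    F = np.vstack(feats).T               # (num_frames, num_dims)
    # Per-dimension z-score normalization, as described in Section 2.1.
    return (F - F.mean(axis=0)) / (F.std(axis=0) + 1e-8)
```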

2.2 The Encoding Reference and BOF Representation

It has been argued that, in the BOF approach, the frames of a song should not all be treated equally, and that an isolated frame of low-level acoustic features is not representative of high-level perceptive descriptions [15]. Besides, the effectiveness of BOF modeling is strongly affected by the size of the encoding reference and encounters a glass ceiling when the size becomes too large [16]. Our goal of improving the encoding reference for BOF modeling is therefore twofold. First, we aim at choosing a type of frame representative that gives better generalization ability and a more reliable distance measure criterion. Second, each frame representative should not carry an equal information load during song-level encoding.

BOF modeling starts with generating the encoding reference from a set of available frames (denoted as F). The frames are usually selected randomly and uniformly from each song in a music dataset. We use a Gaussian mixture model (GMM) instead of a codebook derived by the K-means algorithm as the encoding reference [17]. In the GMM, each component Gaussian distribution, denoted as z_k, k = 1, ..., K, corresponds to a frame representative. The GMM is trained on F by the expectation-maximization (EM) algorithm and is expressed as follows:

p(v) = \sum_{k=1}^{K} \pi_k N(v | \mu_k, \Sigma_k),    (1)

where v is a frame-based feature vector, N(.) is the k-th component Gaussian distribution with mean vector \mu_k and covariance matrix \Sigma_k, and \pi_k is the prior weight of the k-th mixture component. Given v, the posterior probability of a mixture component is computed by:

p(z_k | v) = \frac{\pi_k N(v | \mu_k, \Sigma_k)}{\sum_{k'=1}^{K} \pi_{k'} N(v | \mu_{k'}, \Sigma_{k'})}.    (2)

Given a song s with L frames, its BOF posterior-probability representation (denoted as vector x) is computed by:

x_k \equiv p(z_k | s) = \frac{1}{L} \sum_{t=1}^{L} p(z_k | v_t),    (3)

where x_k is the k-th element of vector x. When a frame is encoded by the GMM, the posterior probability is based on the likelihood of each component Gaussian distribution. The posterior probability of each mixture component yields a soft-assignment encoding criterion, which enhances the modeling ability of the GMM-based encoding reference over the vector-quantization-based (VQ-based) one. Our contention is that the diversity of frame representatives in the encoding reference is proportional to the ability of the BOF modeling, i.e., the BOF modeling can capture more of the audio information of the song being encoded. However, like other encoding references (e.g., a set of randomly selected vectors or a trained codebook), the GMM is generated in an unsupervised manner. The factors that we can control include the number of components in the GMM, i.e., K, the types of acoustic features used in the frame-based vector, and the construction of F. Except for K, the other two factors are fixed at the beginning. As K increases, the frame representatives become more diverse, but some of them are in fact redundant. This motivates us to determine the importance of each frame representative in a discriminative way.

The EM training of the GMM provides a maximum-likelihood estimate of the data distribution over F, which is assumed to follow a mixture of Gaussian distributions. The prior \pi_k of the k-th component Gaussian represents the corresponding effective number of frames in the training set F. However, the construction of F implies that the estimated distribution of F carries no information about the song-level distribution of acoustic feature vectors. In other words, it may not reflect the importance of each mixture component when encoding a song. In fact, as will be discussed later in Section 4, our experimental results show that setting the trained priors to a uniform distribution improves the MIR performance.

In light of the observations described above and the beneficial characteristics of music tags, we incorporate the tagged music data as a supervision guide to determine the importance of each mixture component in the GMM.
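As a concrete sketch of Eqs. (1)-(3), the GMM-posterior BOF encoding can be written in a few lines of Python, with scikit-learn's GaussianMixture standing in for the paper's own EM-trained encoding reference; the library choice and function names are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of the GMM-based BOF encoding of Eqs. (1)-(3).
import numpy as np
from sklearn.mixture import GaussianMixture

def train_encoding_reference(F, K=256, seed=0):
    """Fit the K-component GMM of Eq. (1) on the pooled frame set F (frames x dims)."""
    gmm = GaussianMixture(n_components=K, covariance_type='diag', random_state=seed)
    return gmm.fit(F)

def bof_encode(gmm, frames, priors=None):
    """Return the song-level BOF vector x of Eq. (3).

    `frames` is the (L x dims) frame matrix of one song. If `priors` is given
    (e.g., a uniform vector or relearned weights), it replaces the EM-trained
    mixture weights before the component posteriors of Eq. (2) are computed.
    """
    if priors is not None:
        original = gmm.weights_
        gmm.weights_ = np.asarray(priors) / np.sum(priors)
    posteriors = gmm.predict_proba(frames)   # (L x K): Eq. (2) for every frame
    if priors is not None:
        gmm.weights_ = original               # restore the fitted weights
    return posteriors.mean(axis=0)            # x_k = (1/L) * sum_t p(z_k | v_t)
```

Stacking the resulting vectors column-wise over a corpus gives the K-by-N matrix X used in Section 3.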
3. LEARNING THE AUDIO-BASED SIMILARITY

In this work, learning the similarity of audio music from tagged music data is achieved by learning the encoding reference so as to minimize the similarity gap between low-level acoustic features and high-level music tags. To this end, we perform learning with respect to the parameters of the GMM trained from F. In this paper, we only consider relearning the prior probabilities, i.e., the pre-learned parameters \mu_k and \Sigma_k, k = 1, ..., K, are fixed. The proposed iterative learning algorithm has two steps, namely, encoding songs into BOF vectors and minimizing the similarity gap with respect to the prior probabilities of the GMM.

3.1 Preliminary

Suppose there is a tagged music corpus D with N songs. Given a song s_i in D, we have its BOF vector x_i \in R^K, which is encoded by the GMM to represent the acoustic features, and its tag vector y_i \in {0,1}^M, in which each tag is binary labeled (the multi-label case) from a pre-defined tag set with M tags. Two similarity matrices are defined: S_X is computed on the N BOF vectors, and S_Y is computed on the N tag vectors. We estimate the acoustic similarity between s_i and s_j in D by computing the inner product of x_i and x_j. Therefore, the acoustic similarity matrix S_X of D can be expressed as:

S_X = X^T X,    (4)

where X is a K-by-N matrix with x_i as its i-th column. The tag similarity matrix S_Y of D is expressed as:

S_Y = Y^T Y,    (5)

where Y is an M-by-N matrix with y_i / ||y_i|| as its i-th column. Since each song may have a different number of tags, to ensure that the tag-based similarity of a song with itself is always the largest, we compute the cosine similarity between y_i and y_j in Eq. (5) to estimate the tag-based similarity that simulates the human similarity between s_i and s_j.

Methods for audio-based similarity computation can be evaluated with a query-by-example MIR system, i.e., given a query song with the audio signal only, the system ranks all the songs in the database based on audio-based similarity computation alone. To evaluate the effectiveness of S_X, we perform leave-one-out MIR tests and evaluate the normalized discounted cumulative gain (NDCG) [18] with respect to the ground truth relevance derived from S_Y. That is, each song s_i in D is taken as a query song in turn, the output ranked list for s_i is generated by sorting the elements in the i-th row of S_X in descending order, and the corresponding ground truth relevance is the i-th row of S_Y.
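A small NumPy sketch of Eqs. (4)-(5) and of the leave-one-out ranking just described (array and function names are illustrative):

```python
# Sketch of the similarity matrices of Eqs. (4)-(5) and the leave-one-out ranking.
import numpy as np

def acoustic_similarity(X):
    """Eq. (4): S_X = X^T X, with X the K-by-N matrix of BOF column vectors."""
    return X.T @ X

def tag_similarity(Y):
    """Eq. (5): cosine similarity of the binary tag vectors (Y is M-by-N)."""
    Yn = Y / (np.linalg.norm(Y, axis=0, keepdims=True) + 1e-12)
    return Yn.T @ Yn

def ranked_list(S_X, i):
    """Leave-one-out ranking for query s_i: sort the i-th row of S_X descending."""
    return np.argsort(-S_X[i])
```

The i-th row of S_Y then supplies the graded ground-truth relevance R_i(.) used in the NDCG of Eq. (6) below.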

The NDCG_i@P, which represents the quality of the ranking of the top P retrieved songs for query s_i, is formulated as follows:

NDCG_i@P = \frac{1}{Q_P} \left( R_i(1) + \sum_{j=2}^{P} \frac{R_i(j)}{\log_2 j} \right),    (6)

where R_i(j) is the ground truth relevance (obtained from the i-th row of S_Y) of the j-th song on the ranked list, and Q_P is the normalization term representing the ideal ranking of the P songs [18]. Intuitively, if more songs with large ground truth relevance are ranked higher, a larger NDCG is obtained. The query-by-example MIR performance on D based on S_X with respect to S_Y is evaluated by

NDCG(D)@P = \frac{1}{N} \sum_{i=1}^{N} NDCG_i@P.    (7)

The larger the NDCG in Eq. (7), the more effective the audio similarity computation behind S_X.

3.2 Minimizing the Similarity Gap

We define a K-by-K symmetric transformation matrix W for the BOF vector space. The transformed BOF vector for s_i is expressed by W x_i, and the new acoustic similarity matrix S_T of D can be obtained by:

S_T = (W X)^T (W X) = X^T T X,    (8)

where T = W^T W. Therefore, minimizing the similarity gap between the transformed BOF vector space and the human tag vector space is equivalent to minimizing the distance or maximizing the correlation [19] between the two kernel matrices S_T and S_Y with respect to W. In this paper, motivated by the work in [20], we express the N songs in D as two random vectors, Z_x \in R^N for the transformed acoustic features and Z_y \in R^N for the tag labels, which follow two multivariate Gaussian distributions N_x and N_y, respectively. There exists a simple bijection between the two multivariate Gaussians. Without loss of generality, we assume N_x and N_y have an equal mean and are parameterized by (\mu, S_T) and (\mu, S_Y), respectively. Then, the closeness between N_x and N_y can be measured by the relative entropy KL(N_x || N_y) (i.e., the KL-divergence), which is equivalent to d(S_T || S_Y):

d(S_T \| S_Y) = \frac{1}{2} \left\{ \mathrm{tr}(S_T S_Y^{-1}) - \log |S_T S_Y^{-1}| - N \right\},    (9)

where tr(.) and |.| denote the trace and determinant of a matrix, respectively. The minimization of d(S_T || S_Y) can be solved by setting the derivative of d(S_T || S_Y) with respect to T to zero. The solution that minimizes d(S_T || S_Y) is:

T^* = \left( X S_Y^{-1} X^T \right)^{-1}.    (10)

Since W is symmetric, the optimal matrix W^* is derived by

W^* = (T^*)^{1/2}.    (11)

To prevent singularity, a small value 0.001 is added to each diagonal element of the matrices that are inverted in solving W^*. If we restrict W in Eq. (8) to be diagonal, i.e., we ignore the correlation among different dimensions of the BOF vector, and define the vector w \equiv diag(W), the optimal w^* is the diagonal of W^*:

w^* = diag(W^*),    (12)

where each element of w^* must be greater than zero. The derivations of Eqs. (10) and (12) are skipped due to space limitations.

In the testing phase, each song is first encoded into a BOF vector by the GMM using Eq. (3). Then, the audio-based similarity between any two songs s_i and s_j is computed as x_i^T T^* x_j, where T^* can be a full or a diagonal matrix according to the initial setting of W in Eq. (8). In the experiments, this method with a full transformation and with a diagonal transformation is denoted as Full-Trans and Diag-Trans, respectively, while the method without any transformation (i.e., the native GMM encoding) serves as the baseline.
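A compact NumPy sketch of the closed-form solution of Eqs. (10)-(12), under the regularization stated above; as before, the function and variable names are illustrative rather than the authors' code.

```python
# Sketch of the closed-form transform: T* = (X S_Y^{-1} X^T)^{-1}, W* = (T*)^{1/2},
# w* = diag(W*), with a small value added to the diagonals before inversion.
import numpy as np
from scipy.linalg import sqrtm

def full_transform(X, S_Y, eps=1e-3):
    """Return (T*, W*) for the full K-by-K transformation of Eqs. (10)-(11)."""
    N = S_Y.shape[0]
    A = X @ np.linalg.inv(S_Y + eps * np.eye(N)) @ X.T      # X S_Y^{-1} X^T  (K x K)
    T_star = np.linalg.inv(A + eps * np.eye(A.shape[0]))    # Eq. (10)
    W_star = np.real(sqrtm(T_star))                         # Eq. (11): matrix square root
    return T_star, W_star

def diagonal_transform(X, S_Y, eps=1e-3):
    """Eq. (12): keep only the diagonal of W*, ignoring cross-dimension correlation."""
    _, W_star = full_transform(X, S_Y, eps)
    return np.clip(np.diag(W_star), a_min=1e-12, a_max=None)  # keep elements > 0
```

As a quick sanity check on Eq. (10): for a square, invertible X, this T* gives X^T T^* X = S_Y exactly, i.e., the transformed acoustic kernel reproduces the tag kernel.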
3.3 Relearning the Priors of the GMM

Instead of learning a transformation matrix, we can also minimize the similarity gap by relearning the prior weights of the GMM. We propose a two-step iterative learning method, which iteratively updates the prior weights of the GMM until convergence. The NDCG in Eq. (7) is used as the criterion for checking the convergence of the learning procedure. Minimizing the similarity gap implies improving the NDCG, since the learned S_T tries to preserve the structure of S_Y, which serves as the ground truth relevance in computing the NDCG. If the NDCG is no longer improved, the learning algorithm stops. The learning method is summarized in Algorithm 1.

According to Algorithm 1, there are two steps in each iteration. Line 05 corresponds to the first step, which encodes all songs into their BOF vectors; lines 11 and 13 correspond to the second step, which minimizes the similarity gap with respect to the prior weights of the GMM. Since encoding all songs in D is a complicated procedure, directly optimizing with respect to the parameters of the GMM through Eqs. (1), (2) and (3) is infeasible. Therefore, we turn to an indirect solution that minimizes the similarity gap with respect to the priors of the GMM. We exploit the property of w^* to derive Eq. (13), which serves as an indirect optimizer for maximizing NDCG(D)@N by reweighting the prior weights of the GMM. Intuitively, the vector w^* derived in line 11 plays the role of selecting mixture components in the GMM.

In the testing phase, each song is encoded into a BOF vector by the GMM with the relearned prior weights using Eq. (3). Then, the audio-based similarity between any two songs s_i and s_j is computed as the inner product of x_i and x_j, without the need to apply any stacked transformation in the BOF space. In the experiments, we refer to the proposed method implemented in this way as the prior-relearning method.
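The following is a minimal NumPy sketch of this two-step loop, complementing the formal listing in Algorithm 1 below. It assumes an `encode_fn(priors)` that returns the K-by-N BOF matrix X encoded with the given priors (e.g., built on the `bof_encode` helper sketched earlier) and a `solve_w(X, S_Y)` that returns the positive weight vector w* of Eq. (12); these names, like the code itself, are illustrative rather than the authors' implementation.

```python
# Illustrative sketch of the prior-relearning loop (cf. Algorithm 1).
import numpy as np

def ndcg_at_p(S_X, S_Y, P):
    """Mean NDCG@P of Eqs. (6)-(7), using rows of S_Y as graded relevance."""
    N = S_X.shape[0]
    discount = np.concatenate(([1.0], 1.0 / np.log2(np.arange(2, P + 1))))
    total = 0.0
    for i in range(N):
        rank = np.argsort(-S_X[i])[:P]
        ideal = np.sort(S_Y[i])[::-1][:P]
        q_p = max(np.sum(ideal * discount), 1e-12)     # ideal-ranking normalizer Q_P
        total += np.sum(S_Y[i, rank] * discount) / q_p
    return total / N

def relearn_priors(encode_fn, solve_w, S_Y, K):
    priors = np.full(K, 1.0 / K)                       # line 01: uniform initialization
    best, prev = priors.copy(), 0.0
    while True:
        X = encode_fn(priors)                          # line 05: encode all songs (Eq. 3)
        S_X = X.T @ X                                  # line 06: Eq. (4)
        score = ndcg_at_p(S_X, S_Y, P=S_X.shape[0])    # line 08: NDCG(D)@N via Eq. (7)
        if (score - prev) / max(score, 1e-12) < 0:     # line 09: stop if NDCG drops
            return best                                # line 10: keep the previous priors
        best, prev = priors.copy(), score
        w = solve_w(X, S_Y)                            # line 11: w* of Eq. (12)
        priors = w * priors / np.sum(w * priors)       # line 13: reweight and renormalize
```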

Algorithm 1. The learning algorithm
Input: Initial GMM parameters {\mu_k, \Sigma_k}, k = 1, ..., K; a tagged music corpus D: a set of frames V_i for each s_i, i = 1, ..., N, and the tag similarity matrix S_Y from Eq. (5);
Output: Learned GMM priors {\hat{\pi}_k};
01: Initialize \pi_k^(0) to 1/K;
02: Iteration index t <- 0;
03: L(t) <- 0;
04: while t >= 0 do
05:   Encode V_i into x_i with Eq. (3) using {\pi_k^(t), \mu_k, \Sigma_k};
06:   Compute S_X with Eq. (4);
07:   t <- t + 1;
08:   L(t) <- NDCG(D)@N with Eq. (7) using S_X and S_Y;
09:   if (L(t) - L(t-1)) / L(t) < 0 then
10:     Return \hat{\pi}_k <- \pi_k^(t-1) and break;
11:   Compute w^* with Eq. (12) using S_X and S_Y;
12:   for k = 1, ..., K do
13:     \pi_k^(t) <- w_k \pi_k^(t-1) / \sum_{q=1}^{K} w_q \pi_q^(t-1),    (13)
        where w_k is the k-th element of w^*;
14:   end for
15: end while

4. EVALUATIONS

4.1 Datasets

We evaluate the proposed method on the MajorMiner and Magnatagatune datasets in a query-by-example MIR scenario. Both datasets were generated by social tagging games with a purpose (GWAP) [11, 12] to collect reliable and useful tag labels. The MajorMiner dataset has been a well-known benchmark in MIREX since 2008. The version used in this paper was crawled from the MajorMiner website in March 2011. It contains 2,472 10-second music clips and ,03 raw tags. After extracting the high-frequency tags and merging the redundant ones, 76 tags are left. The Magnatagatune dataset [12], which contains 25,860 30-second audio clips and 188 pre-processed tags, was downloaded from [21]. To construct F, we randomly select 25% and 2% of the frames from the two datasets, respectively. For MajorMiner, F contains 235,000 frames, while for Magnatagatune, F contains 535,800 frames. The F constructed in this way is blind to song-level information. To prevent bias in the tag-based similarity computation of S_Y, we ignore the clips labeled with fewer tags. For the MajorMiner dataset, 1,200 clips having at least 5 tags are left. For the Magnatagatune dataset, 3,764 clips having at least 7 tags are left.

4.2 Experimental Results and Discussions

In the experiments, we repeat three-fold cross-validation 10 times on the MajorMiner dataset, which is divided into three folds at random. In each run, two folds are used for training the transformation matrix of the Full-Trans and Diag-Trans methods, or for relearning the prior weights of the GMM for the proposed method, while the remaining fold, which serves as both the test queries and the target database to retrieve, is used for the leave-one-out audio-based MIR outside test. For the Magnatagatune dataset, all clips have been divided into 16 folds to prevent two or more clips originating from the same song from occurring in different folds. We merge the 16 folds into 4 folds and perform four-fold cross-validation. The NDCG@P of Eq. (7) is used as the evaluation metric in both the inside and outside tests.

First, we examine the learning process of the proposed prior-relearning method on the MajorMiner dataset. Figure 1 shows an example learning curve in terms of NDCG for one of the three-fold cross-validation runs. The curve is equivalent to the inside-test performance evaluated on the training data. We can see that the learning curve of the proposed method (K=256) increases monotonically until convergence, although the method can only improve the NDCG of the training data indirectly, as discussed in Section 3.3. It gains an absolute increase of 0.04 in NDCG@10 and 0.002 in NDCG@800. The NDCG of Diag-Trans can be considered an upper bound for the proposed method in the inside test, since Diag-Trans adopts a direct optimization strategy.

Figure 1. The learning curve in terms of (a) NDCG@10 and (b) NDCG@800 evaluated on the MajorMiner training data.

Next, we evaluate the GMM-based BOF baseline and the VQ-based method on the complete MajorMiner dataset. There is no need to divide the data into three folds, since no supervised learning is involved in these methods. From the MIR results shown in Table 2, we observe that replacing the priors of the GMM trained from F with a uniform distribution enhances the performance. We also observe that, even with a large K, the GMM-based BOF outperforms the VQ-based BOF modeling. The results demonstrate the better modeling ability of the GMM over the K-means-derived codebook.

Method                                   | NDCG@5 | NDCG@10 | NDCG@20 | NDCG@30
GMM BOF (K=2,048), w/o prior (uniform)   | 382    | 05      | 0.8753  | 0.8674
GMM BOF (K=2,048), w/ trained prior      | 322    | 0.8992  | 0.8743  | 0.8669
VQ-based (K=2,048), histogram            | 97     | 0.8930  | 0.872   | 0.8650

Table 2. The results of the GMM-based BOF and the VQ-based method on the complete MajorMiner dataset (1,200 clips).

Finally, we compare the proposed method with three baselines, i.e., the GMM-based BOF without transformation, Full-Trans, and Diag-Trans. The results of three-fold cross-validation on the MajorMiner dataset are shown in Figure 2, while the results of four-fold cross-validation on the Magnatagatune dataset are shown in Figure 3. From Figures 2 and 3, it is obvious that the proposed method outperforms all the other methods in most cases.

The conventional BOF approach does face a glass ceiling when K is too large, as evidenced by the observation that the performance of the GMM-based BOF baseline saturates at around K=1,024 for MajorMiner (10-second clips) and K=2,048 for Magnatagatune (30-second clips). The proposed prior-relearning method pushes the performance beyond the glass ceiling of the baseline with a smaller K, e.g., the proposed method with K=512 outperforms the baseline with K=2,048 on the MajorMiner dataset. Full-Trans outperforms the baseline and the proposed method only when K is small. However, Full-Trans tends to saturate early, since it has more parameters to train and thus requires more training data, compared with Diag-Trans and the proposed method. In Figure 1, the performance of Diag-Trans appears as an upper bound of the proposed method in the inside test; in the outside test, however, the proposed method outperforms Diag-Trans except when K is small. The experimental results in Figures 2 and 3 demonstrate the excellent generalization ability of the proposed method, which learns the similarity of audio music by relearning the priors of the GMM instead of a transformation in the BOF vector space.

Figure 2. The results in terms of NDCG@5 and NDCG@10 on the MajorMiner dataset with different K.

Figure 3. The results in terms of NDCG@5 and NDCG@10 on the Magnatagatune dataset with different K.

5. CONCLUSIONS

In this paper, we have addressed a novel research direction in which audio-based music similarity computation is learned by minimizing the similarity gap, or maximizing the NDCG measure, with respect to the parameters of the encoding reference in the BOF representation. We have implemented the idea by learning the prior weights of the GMM from tagged music data. The experimental results demonstrate the effectiveness of the proposed method, which has the potential to guide MIR systems that employ the BOF representation; for example, the learned encoding reference can be directly combined with the codeword Bernoulli average (CBA) method [13], a well-known automatic music tagging method.

6. REFERENCES

[1] M. Slaney, K. Weinberger, and W. White: Learning a metric for music similarity, ISMIR, 2008.
[2] J.-J. Aucouturier and F. Pachet: Music similarity measures: What's the use?, ISMIR, 2002.
[3] M. Mandel and D. Ellis: Song-level features and SVMs for music classification, ISMIR, 2005.
[4] E. Pampalk: Audio-based music similarity and retrieval: Combining a spectral similarity model with information extracted from fluctuation patterns, ISMIR, 2006.
[5] M. Hoffman, D. Blei, and P. Cook: Content-based musical similarity computation using the hierarchical Dirichlet process, ISMIR, 2008.
[6] K. West and P. Lamere: A model-based approach to constructing music similarity functions, EURASIP Journal on Advances in Signal Processing, vol. 2007, 2007.
[7] L. Barrington, A. Chan, D. Turnbull, and G. Lanckriet: Audio information retrieval using semantic similarity, ICASSP, 2007.
[8] M. Levy and M. Sandler: Music information retrieval using social tags and audio, IEEE TMM, 11(3), 383-395, 2009.
[9] J. H. Kim, B. Tomasik, and D. Turnbull: Using artist similarity to propagate semantic information, ISMIR, 2009.
[10] B. McFee, L. Barrington, and G. Lanckriet: Learning similarity from collaborative filters, ISMIR, 2010.
[11] M. Mandel and D. Ellis: A web-based game for collecting music metadata, Journal of New Music Research, 37(2), 151-165, 2008.
[12] E. Law and L. von Ahn: Input-agreement: A new mechanism for data collection using human computation games, ACM CHI, 2009.
[13] M. Hoffman, D. Blei, and P. Cook: Easy as CBA: A simple probabilistic model for tagging music, ISMIR, 2009.
[14] O. Lartillot and P. Toiviainen: A Matlab toolbox for musical feature extraction from audio, DAFx, 2007.
[15] G. Marques et al.: Additional evidence that common low-level features of individual audio frames are not representative of music genre, SMC Conference, 2010.
[16] J.-J. Aucouturier, B. Defreville, and F. Pachet: The bag-of-frames approach to audio pattern recognition: A sufficient model for urban soundscapes but not for polyphonic music, J. Acoust. Soc. Am., 122(2), 881-891, 2007.
[17] K. Yoshii, M. Goto, K. Komatani, T. Ogata, and H. G. Okuno: An efficient hybrid music recommender system using an incrementally trainable probabilistic generative model, IEEE TASLP, 16(2), 435-447, 2008.
[18] K. Järvelin and J. Kekäläinen: Cumulated gain-based evaluation of IR techniques, ACM Transactions on Information Systems, 20(4), 422-446, 2002.
[19] Y. Zhang and Z.-H. Zhou: Multi-label dimensionality reduction via dependence maximization, AAAI, 2008.
[20] J. V. Davis, B. Kulis, P. Jain, S. Sra, and I. S. Dhillon: Information-theoretic metric learning, ICML, 2007.
[21] http://tagatune.org/Magnatagatune.html