Research Seminar Télécom ParisTech


Slide 1/46: Title

Computational Methods of Information Geometry with Real-Time Applications in Audio Signal Processing

Arnaud Dessein, August 30th 2013

Jury: Gérard Assayag (advisor), Arshia Cont (supervisor), Francis Bach (reviewer), Frank Nielsen (reviewer), Roland Badeau (examiner), Silvère Bonnabel (examiner), Jean-Luc Zarader (examiner).

Slide 2/46: Introduction

From information geometry theory:
- Study of statistics with concepts from differential geometry (smooth manifolds) and information theory (statistical divergences).
- Parametric statistical models possess an intrinsic geometrical structure.

To computational information geometry:
- Broad community around the development and application of computational methods based on information geometry theory.
- Many techniques in machine learning and signal processing rely on statistical models or distance functions: principal component analysis, independent component analysis, centroid computation, k-means, expectation-maximization, nearest-neighbor search, range search, smallest enclosing balls, Voronoi diagrams.

Objectives of the thesis:
- Employ this framework for audio signal processing.
- Primary motivations from real-time machine listening.

Slide 3/46: Outline

I. Computational Methods of Information Geometry
1. Preliminaries on Information Geometry
2. Sequential Change Detection with Exponential Families
3. Non-Negative Matrix Factorization with Convex-Concave Divergences

II. Real-Time Applications in Audio Signal Processing
4. Real-Time Audio Segmentation
5. Real-Time Polyphonic Music Transcription

Slide 4/46: Outline, entering Part I, Chapter 1: Preliminaries on Information Geometry (separable divergences on the space of discrete positive measures; exponential families of probability distributions).

Slide 5/46: Basic notions and common information divergences

Divergences or distance functions generalize metric distances:
- Divergence: $D(y\|y') \geq 0$ and $D(y\|y) = 0$.
- Separable divergence: $D(y\|y') = \sum_{i=1}^m d(y_i\|y'_i)$.
- Convex-concave divergence: $d(y\|y') = \check{d}(y\|y') + \hat{d}(y\|y')$, with $\check{d}$ convex and $\hat{d}$ concave in the second argument.
- Squared Euclidean distance on $\mathbb{R}$: $d_E(y\|y') = (y - y')^2$.
- Kullback-Leibler divergence on $\mathbb{R}_+$: $d_{KL}(y\|y') = y \log(y/y') - y + y'$.
- Itakura-Saito divergence on $\mathbb{R}_+$: $d_{IS}(y\|y') = y/y' - \log(y/y') - 1$.

Separable divergences on $(\mathbb{R}_+)^n$ generated by a smooth convex function $\varphi$:
- Csiszár: $d^{(C)}_\varphi(y\|y') = y\,\varphi(y'/y)$, where $\varphi(1) = \varphi'(1) = 0$.
- $\alpha$-divergences: $d^{(a)}_\alpha(y\|y') = \frac{1}{\alpha(1-\alpha)}\left(\alpha y + (1-\alpha)y' - y^\alpha y'^{1-\alpha}\right)$.
  Special cases: Kullback-Leibler, dual Kullback-Leibler, Hellinger, Pearson's $\chi^2$, Neyman's $\chi^2$.
- Bregman: $d^{(B)}_\varphi(y\|y') = \varphi(y) - \varphi(y') - (y - y')\,\varphi'(y')$.
- $\beta$-divergences: $d^{(b)}_\beta(y\|y') = \frac{1}{\beta(\beta-1)}\left(y^\beta + (\beta-1)y'^\beta - \beta y y'^{\beta-1}\right)$.
  Special cases: squared Euclidean, Kullback-Leibler, Itakura-Saito.
- Skew Jeffreys-Bregman: $d^{(JB)}_{\varphi,\lambda}(y\|y') = \lambda\, d_\varphi(y\|y') + (1-\lambda)\, d_\varphi(y'\|y)$.
- Skew Jensen-Bregman: $d^{(JB')}_{\varphi,\lambda}(y\|y') = \lambda\, d_\varphi(y\|\lambda y + (1-\lambda)y') + (1-\lambda)\, d_\varphi(y'\|\lambda y + (1-\lambda)y')$.

Slide 6/46: Skew (α, β, λ)-divergences

Separable divergences on $(\mathbb{R}_+)^n$ (continued):
- $(\alpha,\beta)$-divergence: $d^{(ab)}_{\alpha,\beta}(y\|y') = \frac{1}{\alpha\beta(\alpha+\beta)}\left(\alpha y^{\alpha+\beta} + \beta y'^{\alpha+\beta} - (\alpha+\beta)\, y^\alpha y'^\beta\right)$.
  It reduces to an $\alpha$-divergence for $\alpha + \beta = 1$, and to a $(\beta+1)$-divergence for $\alpha = 1$.
- Skew $(\alpha,\beta,\lambda)$-divergence: $d^{(ab)}_{\alpha,\beta,\lambda}(y\|y') = \lambda\, d^{(ab)}_{\alpha,\beta}(y\|y') + (1-\lambda)\, d^{(ab)}_{\alpha,\beta}(y'\|y)$.
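For concreteness, here is a minimal NumPy sketch (ours, not from the talk) of the scalar Kullback-Leibler, Itakura-Saito, and (α, β)-divergences defined above, with a numerical check that the (α, β) family recovers the first two in the appropriate limits:

```python
import numpy as np

def kl_div(y, yp):
    # d_KL(y || y') = y log(y/y') - y + y' on the positive reals
    return y * np.log(y / yp) - y + yp

def is_div(y, yp):
    # d_IS(y || y') = y/y' - log(y/y') - 1 on the positive reals
    return y / yp - np.log(y / yp) - 1.0

def alpha_beta_div(y, yp, alpha, beta):
    # Scalar (alpha, beta)-divergence; assumes alpha, beta and alpha + beta
    # are nonzero (the degenerate cases are defined by continuity).
    c = 1.0 / (alpha * beta * (alpha + beta))
    return c * (alpha * y**(alpha + beta) + beta * yp**(alpha + beta)
                - (alpha + beta) * y**alpha * yp**beta)

y, yp = 2.0, 3.0
# alpha = 1, beta -> 0 tends to Kullback-Leibler; alpha = 1, beta -> -1 to Itakura-Saito.
print(np.isclose(alpha_beta_div(y, yp, 1.0, 1e-8), kl_div(y, yp), atol=1e-5))
print(np.isclose(alpha_beta_div(y, yp, 1.0, -1.0 + 1e-8), is_div(y, yp), atol=1e-4))
```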

Slide 7/46: Basic notions and properties

Exponential family: $p_\theta(x) = \exp(\theta^\top x - \psi(\theta))$.
- The sufficient observations $x$ belong to $\mathbb{R}^m$.
- The natural parameters $\theta$ belong to a convex set $\mathcal{N} \subseteq \mathbb{R}^m$.
- The log-normalizer $\psi$ is convex on $\mathcal{N}$ and smooth on $\operatorname{int}\mathcal{N}$.

Many common models: Bernoulli, Dirichlet, Gaussian, Laplace, Pareto, Poisson, Rayleigh, von Mises-Fisher, Weibull, Wishart, log-normal, exponential, beta, gamma, geometric, binomial, negative binomial, categorical, multinomial.

Legendre-Fenchel conjugate: $\phi(\eta) = \sup_{\theta \in \mathbb{R}^m} \theta^\top \eta - \psi(\theta)$.
- The expectation parameters $\eta$ belong to the convex set $\operatorname{int}\mathcal{K}$.
- We have duality between natural and expectation parameters through $\psi$ and $\phi$.

Maximum likelihood: $\hat{\eta}^{ml}(x_1, \dots, x_n) = \frac{1}{n}\sum_{j=1}^n x_j$.
- Simple arithmetic mean in expectation parameters.
- Natural parameters obtained by convex duality.
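To make the parameter duality concrete, here is a hedged sketch for the Poisson family (our choice of example): the log-normalizer is $\psi(\theta) = e^\theta$ with natural parameter $\theta = \log$ rate, the expectation parameter is $\eta = \psi'(\theta)$, its conjugate is $\phi(\eta) = \eta \log \eta - \eta$, and maximum likelihood is the sample mean in $\eta$-coordinates.

```python
import numpy as np

# Poisson as an exponential family: p_theta(x) = exp(theta * x - psi(theta)) / x!
# with natural parameter theta = log(rate) and log-normalizer psi(theta) = exp(theta).
psi = np.exp                                  # log-normalizer psi(theta)
grad_psi = np.exp                             # eta(theta) = psi'(theta)
phi = lambda eta: eta * np.log(eta) - eta     # Legendre-Fenchel conjugate of psi
grad_phi = np.log                             # theta(eta) = phi'(eta)

rng = np.random.default_rng(0)
x = rng.poisson(lam=4.0, size=10000)

eta_ml = x.mean()                  # maximum likelihood: arithmetic mean in eta
theta_ml = grad_phi(eta_ml)        # natural parameter recovered by convex duality
print(eta_ml, np.exp(theta_ml))    # both close to the true rate 4.0

# Duality sanity check: phi(eta) + psi(theta) = theta * eta at dual pairs.
theta = 1.3
eta = grad_psi(theta)
print(np.isclose(phi(eta) + psi(theta), theta * eta))
```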

Slide 8/46: Dually flat geometry

Canonical divergences:
- Kullback-Leibler divergence: $D_{KL}(P_\theta\|P_{\theta'}) = \int p_\theta \log(p_\theta/p_{\theta'})\,d\nu$.
- Bregman divergences: $B_\varphi(\xi\|\xi') = \varphi(\xi) - \varphi(\xi') - (\xi - \xi')^\top \nabla\varphi(\xi')$.
- Relation: $D_{KL}(P_\theta\|P_{\theta'}) = B_\psi(\theta'\|\theta) = B_\phi(\eta(\theta)\|\eta(\theta'))$.
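The relation between the Kullback-Leibler divergence and the two Bregman divergences can be checked numerically; a minimal sketch, again for the Poisson family (our example, not one singled out in the talk):

```python
import numpy as np

def bregman(f, grad_f, a, b):
    # B_f(a || b) = f(a) - f(b) - (a - b) f'(b)
    return f(a) - f(b) - (a - b) * grad_f(b)

psi, grad_psi = np.exp, np.exp                 # Poisson log-normalizer and gradient
phi = lambda e: e * np.log(e) - e              # its Legendre-Fenchel conjugate
grad_phi = np.log

lam, lam_p = 4.0, 2.5                          # two Poisson rates
kl = lam * np.log(lam / lam_p) - lam + lam_p   # closed-form D_KL(P_theta || P_theta')

theta, theta_p = np.log(lam), np.log(lam_p)    # natural parameters
eta, eta_p = lam, lam_p                        # expectation parameters

print(np.isclose(kl, bregman(psi, grad_psi, theta_p, theta)))   # B_psi(theta' || theta)
print(np.isclose(kl, bregman(phi, grad_phi, eta, eta_p)))       # B_phi(eta || eta')
```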

Slide 9/46: Outline, entering Part I, Chapter 2: Sequential Change Detection with Exponential Families (statistical framework; methods for exponential families).

Slide 10/46: Background

Principle:
- Decide whether the process presents some structural modifications along time.
- Find the time instants corresponding to the different change points.
- Characterize the properties within the respective segments.

Applications:
- Quality control in industrial production.
- Fault detection in technological processes.
- Automatic surveillance for intrusion and abnormal behavior in security monitoring.
- Signal processing in geophysics, econometrics, audio, medicine, image.

Approaches:
- Statistical modeling and monitoring of the distributions [Basseville & Nikiforov, 1993; Lai, 1995; Poor & Hadjiliadis, 2009; Chen & Gupta, 2012; Polunchenko & Tartakovsky, 2012].
- Machine learning techniques relying on distance functions [Harchaoui & Lévy-Leduc, 2008; Harchaoui & Lévy-Leduc, 2010; Vert & Bleakley, 2010; Desobry et al., 2005; Harchaoui et al., 2009].

Slide 11/46: Motivations and contributions

Issues of statistical online approaches:
- Either approximations of the exact statistics with unknown parameters, for tractability.
- Or restrictions on the data and scenarios [Siegmund & Venkatraman, 1995; Mei, 2006].
- With the exception of a full Bayesian framework for exponential families [Lai & Xing, 2010].

Goals in the non-Bayesian framework:
- Known or unknown parameters.
- Additive or non-additive changes.
- Arbitrary topology of the parameters and data.
- Exact inference for online schemes.

Contributions in this context:
- Study of the generalized likelihood ratios within the dually flat information geometry.
- Estimation with arbitrary estimators compared to maximum likelihood.
- Alternative expression of the statistics through convex duality.
- Attractive simplification for exact inference with maximum likelihood.

Slide 12/46: Multiple hypotheses

Problem formulation:
- $X_1, \dots, X_n$ are mutually independent from $\mathcal{P} = \{P_\xi\}_{\xi \in \Xi}$.
- Observe $\bar{x} = (x_1, \dots, x_n) \in \mathcal{X}^n$.
- Decide whether $X_1, \dots, X_n$ are i.i.d. or not.

Multiple hypotheses:
- $H_0$: $X_1, \dots, X_n \sim P_{\xi_0}$, with $\xi_0 \in \Xi_0$.
- $H_1^i$: $X_1, \dots, X_i \sim P_{\xi_0^i}$, with $\xi_0^i \in \Xi_0^i$, and $X_{i+1}, \dots, X_n \sim P_{\xi_1^i}$, with $\xi_1^i \in \Xi_1^i$, for $i \in \{1, \dots, n-1\}$.

Slide 13/46: Test statistics and decision rules

Likelihood ratio for known parameters:
$\Lambda^i(\bar{x}) = -2 \log \dfrac{\prod_{j=1}^n p_{\xi_{\mathrm{bef}}}(x_j)}{\prod_{j=1}^i p_{\xi_{\mathrm{bef}}}(x_j)\, \prod_{j=i+1}^n p_{\xi_{\mathrm{aft}}}(x_j)}$.
- Simplification as cumulative sum statistics: $\frac{1}{2}\Lambda^i(\bar{x}) = \sum_{j=i+1}^n \log \frac{p_{\xi_{\mathrm{aft}}}(x_j)}{p_{\xi_{\mathrm{bef}}}(x_j)}$.
- Efficient recursive implementation for online procedures.

Generalized likelihood ratio:
$\hat{\Lambda}^i(\bar{x}) = -2 \log \dfrac{\prod_{j=1}^n p_{\hat{\xi}_0(\bar{x})}(x_j)}{\prod_{j=1}^i p_{\hat{\xi}_0^i(\bar{x})}(x_j)\, \prod_{j=i+1}^n p_{\hat{\xi}_1^i(\bar{x})}(x_j)}$.
- Two cumulative sums: $\frac{1}{2}\hat{\Lambda}^i(\bar{x}) = \sum_{j=1}^i \log \frac{p_{\hat{\xi}_0^i(\bar{x})}(x_j)}{p_{\hat{\xi}_0(\bar{x})}(x_j)} + \sum_{j=i+1}^n \log \frac{p_{\hat{\xi}_1^i(\bar{x})}(x_j)}{p_{\hat{\xi}_0(\bar{x})}(x_j)}$.
- Computationally more demanding, so usually approximated for online procedures.

Non-Bayesian decision rule for a change: $\max_{1 \leq i \leq n-1} \hat{\Lambda}^i(\bar{x}) \gtrless_{H_0}^{H_1} \lambda$.
- Comparison of the maximum statistics to a threshold.
- Change point estimated as the first time point where the maximum is reached.
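The recursive implementation alluded to for known parameters is the classical cumulative-sum recursion; a minimal sketch for two known Gaussian regimes (the model and the threshold value are our assumptions):

```python
import numpy as np
from scipy.stats import norm

def cusum(xs, logpdf_before, logpdf_after, threshold):
    # Classical CUSUM recursion: g_n = max(0, g_{n-1} + log-likelihood ratio of x_n).
    # Returns the first alarm time, or None if no alarm is raised.
    g = 0.0
    for n, x in enumerate(xs):
        g = max(0.0, g + logpdf_after(x) - logpdf_before(x))
        if g > threshold:
            return n
    return None

rng = np.random.default_rng(1)
xs = np.concatenate([rng.normal(0.0, 1.0, 200), rng.normal(1.0, 1.0, 200)])
alarm = cusum(xs,
              logpdf_before=lambda x: norm.logpdf(x, 0.0, 1.0),
              logpdf_after=lambda x: norm.logpdf(x, 1.0, 1.0),
              threshold=8.0)
print(alarm)   # typically shortly after the true change at n = 200
```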

Slide 14/46: Generic scheme

Theorem (2.1). For an exponential family, the generalized likelihood ratio satisfies:
$\frac{1}{2}\hat{\Lambda}^i(\bar{x}) = i \left\{ D_{KL}\!\left(P_{\hat{\theta}_0^{i\,ml}(\bar{x})} \,\middle\|\, P_{\hat{\theta}_0(\bar{x})}\right) - D_{KL}\!\left(P_{\hat{\theta}_0^{i\,ml}(\bar{x})} \,\middle\|\, P_{\hat{\theta}_0^i(\bar{x})}\right) \right\} + (n-i) \left\{ D_{KL}\!\left(P_{\hat{\theta}_1^{i\,ml}(\bar{x})} \,\middle\|\, P_{\hat{\theta}_0(\bar{x})}\right) - D_{KL}\!\left(P_{\hat{\theta}_1^{i\,ml}(\bar{x})} \,\middle\|\, P_{\hat{\theta}_1^i(\bar{x})}\right) \right\}$.

Slide 15/46: Specific cases

Various scenarios:
- Known parameters before and after change.
- Known parameter before change, unknown parameter after change.
- Unknown parameters before and after change.

Example (2.4). Exact statistics and maximum likelihood:
$\frac{1}{2}\hat{\Lambda}^i(\bar{x}) = i\, D_{KL}\!\left(P_{\hat{\theta}_0^{i\,ml}(\bar{x})} \,\middle\|\, P_{\hat{\theta}_0^{ml}(\bar{x})}\right) + (n-i)\, D_{KL}\!\left(P_{\hat{\theta}_1^{i\,ml}(\bar{x})} \,\middle\|\, P_{\hat{\theta}_0^{ml}(\bar{x})}\right)$.

Example (2.3). Approximate statistics:
$\frac{1}{2}\hat{\Lambda}^i(\bar{x}) = (n-i)\, D_{KL}\!\left(P_{\hat{\theta}_1^{i\,ml}(\bar{x})} \,\middle\|\, P_{\hat{\theta}_0(\bar{x})}\right)$.

Slide 16/46: Revisiting through convex duality

Proposition (2.3). For an exponential family, the generalized likelihood ratio satisfies:
$\frac{1}{2}\hat{\Lambda}^i(\bar{x}) = i\,\phi(\hat{\eta}_0^i(\bar{x})) + (n-i)\,\phi(\hat{\eta}_1^i(\bar{x})) - n\,\phi(\hat{\eta}_0(\bar{x})) + \Delta^i_{ml}(\bar{x})$,
where the corrective term $\Delta^i_{ml}$ compared to maximum likelihood estimation equals:
$\Delta^i_{ml}(\bar{x}) = i\,(\hat{\eta}_0^{i\,ml}(\bar{x}) - \hat{\eta}_0^i(\bar{x}))^\top \nabla\phi(\hat{\eta}_0^i(\bar{x})) + (n-i)\,(\hat{\eta}_1^{i\,ml}(\bar{x}) - \hat{\eta}_1^i(\bar{x}))^\top \nabla\phi(\hat{\eta}_1^i(\bar{x})) - n\,(\hat{\eta}_0^{ml}(\bar{x}) - \hat{\eta}_0(\bar{x}))^\top \nabla\phi(\hat{\eta}_0(\bar{x}))$.

Example (2.5). The exact generalized likelihood ratio for unknown parameters and maximum likelihood estimation verifies:
$\frac{1}{2}\hat{\Lambda}^i(\bar{x}) = i\,\phi(\hat{\eta}_0^{i\,ml}(\bar{x})) + (n-i)\,\phi(\hat{\eta}_1^{i\,ml}(\bar{x})) - n\,\phi(\hat{\eta}_0^{ml}(\bar{x}))$.
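Example 2.5 makes the exact statistics cheap to scan: only running means of the sufficient statistics and evaluations of $\phi$ are needed. A minimal sketch for a Gaussian with unit variance, where $\phi(\eta) = \eta^2/2$ (our choice of family):

```python
import numpy as np

def glr_scan(xs, phi):
    # Exact GLR (1/2) Lambda^i = i phi(eta_0^i) + (n - i) phi(eta_1^i) - n phi(eta_0)
    # for all candidate change points i, from running means of the observations.
    xs = np.asarray(xs, dtype=float)
    n = len(xs)
    csum = np.cumsum(xs)
    i = np.arange(1, n)                          # candidate change points 1 .. n-1
    eta0 = csum[-1] / n                          # ML estimate over the whole window
    eta0i = csum[:-1] / i                        # ML estimate before the candidate change
    eta1i = (csum[-1] - csum[:-1]) / (n - i)     # ML estimate after it
    return i * phi(eta0i) + (n - i) * phi(eta1i) - n * phi(eta0)

rng = np.random.default_rng(2)
xs = np.concatenate([rng.normal(0.0, 1.0, 150), rng.normal(1.5, 1.0, 150)])
stats = glr_scan(xs, phi=lambda e: 0.5 * e**2)   # unit-variance Gaussian conjugate
i_hat = 1 + int(np.argmax(stats))                # first maximizer, as in the decision rule
print(i_hat)                                     # change point estimate near 150
```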

Slide 17/46: Summary and perspectives

Summary:
- Standard non-Bayesian approach to sequential change detection.
- Dually flat information geometry of exponential families.
- Generalized likelihood ratios with arbitrary estimators.
- Attractive scheme for exact inference when parameters are unknown.

Perspectives:
- Direct extensions: non-steep or curved exponential families; maximum a posteriori estimators.
- Asymptotic properties: distribution of the test statistics; optimality formulation and analysis.
- Statistical dependence: autoregressive models; non-linear systems and particle filtering.
- Alternative test statistics: reversing the problem and starting from geometric considerations; information divergences and more robust estimators.

Slide 18/46: Outline, entering Part II, Chapter 4: Real-Time Audio Segmentation (proposed approach; experimental results).

Slide 19/46: Background

Principle:
- Determine time boundaries that partition a sound into homogeneous and continuous temporal segments, such that adjacent segments exhibit inhomogeneities.
- Define a criterion to quantify the homogeneity of the segments.

Approaches:
- Supervised: high-level classes and automatic classification.
- Unsupervised: statistical and distance-based approaches, including musical onset detection [Bello et al., 2005; Dixon, 2006] and speaker segmentation [Kemp et al., 2000; Kotti et al., 2008].

Slide 20/46: Motivations and contributions

Issues of unsupervised approaches to audio segmentation:
- Often tailored to particular types of signals and homogeneity criteria.
- Specific distance functions or models.
- Some are offline; others approximate the exact statistics.

Goals towards a unifying framework for audio segmentation:
- Arbitrary types of signals and homogeneity criteria.
- Large choice of distance functions or models.
- Real-time constraints.
- Exact online inference.

Contributions in this context:
- Generic framework for real-time audio segmentation.
- Unification of several standard approaches.
- Online change detection with exponential families.
- Exact generalized likelihood ratios and maximum likelihood.

Slide 21/46: System architecture

Segmentation scheme (online): auditory scene, then short-time sound representation, then statistical modeling, then change detection.
1. Represent frames with a short-time sound description.
2. Model the observations with probability distributions.
3. Detect changes in the distribution parameters sequentially.

Short-time sound representation:
- Energy for information on loudness.
- Fourier transform for information on spectral content.
- Mel-frequency cepstral coefficients for information on timbre.
- Many other possibilities.

Statistical modeling:
- Exponential families and generalized likelihood ratios.
- Unknown parameters and maximum likelihood.

Slide 22/46: Change detection

Clarification of the relations between statistical approaches:
- Likelihood statistics: $-2\log(p(\bar{x}|H_0)/p(\bar{x}|H_1^i)) > \lambda$.
  - Exact GLR: $\hat{\eta}_0^{ml}(\bar{x}) = \frac{1}{n}\sum_{j=1}^n x_j$, $\hat{\eta}_0^{i\,ml}(\bar{x}) = \frac{1}{i}\sum_{j=1}^i x_j$, $\hat{\eta}_1^{i\,ml}(\bar{x}) = \frac{1}{n-i}\sum_{j=i+1}^n x_j$.
  - Approximate GLR on the whole window: $\hat{\eta}_0^{i\,ml}(\bar{x}) \approx \hat{\eta}_0^{ml}(\bar{x}) = \frac{1}{n}\sum_{j=1}^n x_j$.
  - Approximate GLR in a dead region: $\hat{\eta}_0^{i\,ml}(\bar{x}) \approx \hat{\eta}_0^{ml}(\bar{x}) \approx \frac{1}{n_0}\sum_{j=1}^{n_0} x_j$.
- Model selection: $-2\log(p(\bar{x}|H_0)/p(\bar{x}|H_1^i)) > \lambda$, with AIC: $\lambda = 2d$; BIC: $\lambda = d \log n$; penalized BIC: $\lambda = \gamma\, d \log n$.

Links with distance-based approaches:
$\frac{1}{2}\hat{\Lambda}^i(\bar{x}) = i\, D_{KL}\!\left(P_{\hat{\theta}_0^{i\,ml}(\bar{x})} \,\middle\|\, P_{\hat{\theta}_0^{ml}(\bar{x})}\right) + (n-i)\, D_{KL}\!\left(P_{\hat{\theta}_1^{i\,ml}(\bar{x})} \,\middle\|\, P_{\hat{\theta}_0^{ml}(\bar{x})}\right)$.
- Heuristics: threshold on the observations; distance between the observations at successive frames.
- Kernel methods: equivalence between one-class support vector machines for novelty detection and approximate GLR statistics [Canu & Smola, 2006].
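Putting the pieces together, here is a simplified growing-window segmenter with a penalized BIC-style threshold (the windowing heuristic and the threshold constant are our assumptions, and the unit-variance Gaussian model is again used for illustration):

```python
import numpy as np

def segment_online(frames, phi, d, gamma=2.0):
    # Sequential segmentation: grow a window, compare the exact GLR to the
    # penalized BIC-style threshold lambda = gamma * d * log(n), and restart
    # the window at the estimated change point after each detection.
    window, boundaries, t0 = [], [], 0
    for x in frames:
        window.append(x)
        n = len(window)
        if n < 3:
            continue
        xs = np.asarray(window, dtype=float)
        csum = np.cumsum(xs)
        i = np.arange(1, n)
        half_glr = (i * phi(csum[:-1] / i)
                    + (n - i) * phi((csum[-1] - csum[:-1]) / (n - i))
                    - n * phi(csum[-1] / n))
        if 2.0 * half_glr.max() > gamma * d * np.log(n):
            i_hat = 1 + int(np.argmax(half_glr))
            boundaries.append(t0 + i_hat)
            window, t0 = window[i_hat:], t0 + i_hat
    return boundaries

rng = np.random.default_rng(3)
frames = np.concatenate([rng.normal(m, 1.0, 120) for m in (0.0, 2.0, -1.0)])
print(segment_online(frames, phi=lambda e: 0.5 * e**2, d=1))   # roughly near 120 and 240
```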

Slide 23/46: Segmentation into silence and activity

Parameters:
- Short-time sound representation: energy in a Mel-frequency filter bank at Hz.
- Statistical model: Rayleigh distributions.
- Topology: 1 dimension, continuous non-negative values.

[Figure: original audio, short-time sound representation, and reference annotation alternating between silence and activity, over time in seconds.]

Slide 24/46: Segmentation into music and speech

Parameters:
- Short-time sound representation: Mel-frequency cepstral coefficients at Hz.
- Parametric statistical model: multivariate spherical normal distributions with fixed variance.
- Topology: 12 dimensions, continuous real values.

[Figure: original audio and reference annotation alternating between music and speech segments, over time in seconds.]

Slide 25/46: Segmentation into different speakers

Parameters:
- Short-time sound representation: Mel-frequency cepstral coefficients at Hz.
- Parametric statistical model: multivariate spherical normal distributions with fixed variance.
- Topology: 12 dimensions, continuous real values.

[Figure: original audio, estimated Mel-frequency cepstral coefficients, and reference annotation over five speakers, over time in seconds.]

Slide 26/46: Segmentation into polyphonic note slices

Parameters:
- Short-time sound representation: normalized magnitude spectrum at Hz.
- Parametric statistical model: categorical distributions.
- Topology: 257 dimensions, discrete frequency histograms.

[Figure: original audio and reference piano roll, pitch against time in seconds.]

Slide 27/46: Evaluation on musical onset detection

Parameters:
- Short-time sound representation: normalized magnitude spectrum at Hz.
- Parametric statistical model: categorical distributions.
- Topology: 513 dimensions, discrete frequency histograms.

Evaluation of generalized likelihood ratios (GLR) and spectral flux (SF):
- Difficult dataset [Leveau et al., 2004].
- Standard methodology.

[Table: threshold, precision, recall, and F-measure for GLR with the Kullback-Leibler divergence, and for SF with the Euclidean distance, the Kullback-Leibler divergence, and the half-wave rectified difference.]

Comparison to the state of the art:
- LFSF: offline spectral flux with a logarithmic frequency scale and filtering [Böck et al., 2012].
- TSPC: online spectral peak classification into transients and non-transients [Röbel, 2011].

[Table: precision, recall, and F-measure for GLR, LFSF, and TSPC.]
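The standard methodology scores detections against annotated onsets within a small tolerance window; a minimal greedy-matching sketch (the 50 ms tolerance is our assumption):

```python
def onset_prf(detected, reference, tolerance=0.05):
    # Precision, recall and F-measure with a +/- tolerance (in seconds),
    # matching each reference onset to at most one detection.
    detected, used, tp = sorted(detected), set(), 0
    for r in sorted(reference):
        for k, d in enumerate(detected):
            if k not in used and abs(d - r) <= tolerance:
                used.add(k)
                tp += 1
                break
    p = tp / len(detected) if detected else 0.0
    r = tp / len(reference) if reference else 0.0
    f = 2 * p * r / (p + r) if p + r > 0 else 0.0
    return p, r, f

print(onset_prf([0.10, 0.52, 1.01, 1.70], [0.11, 0.50, 1.00, 1.40]))
```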

Slide 28/46: Summary and perspectives

Summary:
- Real-time system for audio segmentation.
- Various types of signals and of homogeneity criteria.
- Sequential change detection with exponential families.
- Unification and generalization of several statistical and distance-based approaches.

Perspectives:
- Dependent observations: autoregressive models; non-linear systems and particle filtering.
- Improved robustness: post-processing by smoothing and adaptation; growing and sliding window heuristics.
- Consideration of prior information: maximum a posteriori; full Bayesian framework.
- Further applications: audio processing and music information retrieval; other domains in signal processing.

Slide 29/46: Outline, entering Part I, Chapter 3: Non-Negative Matrix Factorization with Convex-Concave Divergences (optimization framework; methods for convex-concave divergences).

Slide 30/46: Background

Principle:
- Decompose the observed non-negative data as a linear combination of basis elements.
- Constrain the dictionary and encoding to be non-negative.

Applications:
- Bioinformatics, spectroscopy, surveillance.
- Signal processing in text, image, audio.

Approaches:
- Euclidean cost function and alternating non-negative least squares [Paatero & Tapper, 1994; Paatero, 1997].
- Euclidean or Kullback-Leibler cost functions and multiplicative updates [Lee & Seung, 1999; Lee & Seung, 2001].
- Many extensions [Berry et al., 2007; Cichocki et al., 2009].
- Statistical inference [Févotte & Cemgil, 2009; Cemgil, 2009].

Slide 31/46: Motivations and contributions

Issues of approaches to optimizing factored models:
- Often particular cost functions or statistical models.
- Potential problems in convergence.
- With the exception of updates for classes of divergences [Cichocki et al., 2006; Cichocki et al., 2008; Dhillon & Sra, 2006; Kompass, 2007; Nakano et al., 2010; Févotte & Idier, 2011; Cichocki et al., 2011].

Goals for providing generic methods:
- Common information divergences.
- Multiplicative updates.
- Convergence guarantees.

Contributions in this context:
- Generic scheme for arbitrary convex-concave divergences.
- Majorization-minimization scheme with auxiliary functions.
- Known and novel updates for common information divergences.
- Multiplicative updates for α-divergences, β-divergences, and (α, β)-divergences.
- Extension to skew divergences.

Slide 32/46: Cost function minimization problem

Problem formulation:
- $y \in \mathbb{R}_+^m$ is an observation vector.
- $A \in \mathbb{R}_+^{m \times r}$ is a dictionary matrix.
- Find an encoding vector $x \in \mathbb{R}_+^r$ of $y$ into $A$ such that $y \approx Ax$.

Optimization formulation:
- Separable divergence: $D(y\|y') = \sum_{i=1}^m d(y_i\|y'_i)$.
- Cost function: $C(x) = D(y\|Ax)$.
- Problem: minimize $C(x)$ subject to $x \in \mathcal{X}$.

Slide 33/46: Variational bounding and auxiliary functions

Variational bounding:
- Iterative technique for minimization problems.
- Cost function replaced at each step with a surrogate majorizing function.

Auxiliary function: $G(\bar{x}|\bar{x}) = C(\bar{x})$ and $G(x|\bar{x}) \geq C(x)$.
- Iterative scheme: if $G(x|\bar{x}) \leq G(\bar{x}|\bar{x})$, then $C(x) \leq C(\bar{x})$.
- Majorization-minimization scheme: minimize $G(x|\bar{x})$ subject to $x \in \mathcal{X}$.

[Figure: the auxiliary function $G(\cdot|\bar{x})$ majorizes the cost $C(\cdot)$ and touches it at $\bar{x}$, where $C(\bar{x}) = G(\bar{x}|\bar{x})$.]
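The majorization-minimization loop itself is a few lines; below is a minimal sketch with a toy instance of our choosing (the $\ell_1$ location problem and its classical quadratic majorizer), not the NMF surrogate of the next slides, which instantiates the same pattern:

```python
import numpy as np

def majorize_minimize(cost, minimize_surrogate, x0, n_iter=50):
    # Variational bounding: each step minimizes an auxiliary function that
    # touches the cost at the current point and majorizes it everywhere,
    # so the recorded cost values decrease monotonically.
    x, history = x0, []
    for _ in range(n_iter):
        x = minimize_surrogate(x)
        history.append(cost(x))
    return x, history

# Toy instance: c(x) = sum_i |x - a_i|, majorized at x_bar by a quadratic
# with IRLS weights 1 / |x_bar - a_i| (regularized for stability).
a = np.array([1.0, 2.0, 2.5, 10.0])
cost = lambda x: np.abs(x - a).sum()
def irls_step(x_bar, eps=1e-9):
    w = 1.0 / (np.abs(x_bar - a) + eps)    # majorizer weights at x_bar
    return (w * a).sum() / w.sum()         # closed-form surrogate minimizer

x, hist = majorize_minimize(cost, irls_step, x0=0.0)
print(x, hist[:3], hist[-1])               # monotonically decreasing cost
```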

Slide 34/46: Generic updates

Proposition (3.3). For a convex-concave divergence, we have the following auxiliary function:
$G(x|\bar{x}) = \sum_{i=1}^m \left\{ \hat{d}\!\left(y_i \,\middle\|\, \sum_{l=1}^r a_{il}\bar{x}_l\right) + \sum_{k=1}^r \frac{a_{ik}\bar{x}_k}{\sum_{l=1}^r a_{il}\bar{x}_l}\, \check{d}\!\left(y_i \,\middle\|\, \frac{x_k}{\bar{x}_k} \sum_{l=1}^r a_{il}\bar{x}_l\right) + \sum_{k=1}^r a_{ik}(x_k - \bar{x}_k)\, \hat{d}'\!\left(y_i \,\middle\|\, \sum_{l=1}^r a_{il}\bar{x}_l\right) \right\}$.

Theorem (3.4). For a convex-concave divergence, the cost function decreases monotonically by iteratively solving:
$\sum_{i=1}^m a_{ik}\, \check{d}'\!\left(y_i \,\middle\|\, \frac{x_k}{\bar{x}_k} \sum_{l=1}^r a_{il}\bar{x}_l\right) = -\sum_{i=1}^m a_{ik}\, \hat{d}'\!\left(y_i \,\middle\|\, \sum_{l=1}^r a_{il}\bar{x}_l\right)$,
where the derivatives are taken in the second argument.

Slide 35/46: Specific cases

Various information divergences:
- Csiszár φ-divergences; α-divergences.
- Left-sided Bregman φ-divergences.
- Skew Jensen-Bregman (φ, λ)-divergences; skew Jeffreys (β, λ)-divergences; skew Jensen β-divergences.
- (α, β)-divergences; skew (α, β, λ)-divergences.

Example (3.8). Multiplicative updates for (α, β)-divergences:
$x_k = \bar{x}_k \left( \dfrac{\sum_{i=1}^m a_{ik}\, y_i^\alpha \left(\sum_{l=1}^r a_{il}\bar{x}_l\right)^{\beta-1}}{\sum_{i=1}^m a_{ik} \left(\sum_{l=1}^r a_{il}\bar{x}_l\right)^{\alpha+\beta-1}} \right)^{1/p_{\alpha,\beta}}$.
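For α = 1, Example 3.8 reduces to the familiar β-divergence multiplicative updates; here is a sketch for the encoding problem with a fixed dictionary, using the exponent γ(β) of [Févotte & Idier, 2011] that guarantees monotonic decrease (the initialization is our assumption):

```python
import numpy as np

def gamma_beta(beta):
    # Exponent ensuring monotonic decrease of the beta-divergence
    # under multiplicative updates [Fevotte & Idier, 2011].
    if beta < 1:
        return 1.0 / (2.0 - beta)
    if beta > 2:
        return 1.0 / (beta - 1.0)
    return 1.0

def nmf_encode(y, A, beta=1.0, n_iter=200, eps=1e-12):
    # Multiplicative updates for min_x D_beta(y || A x) with x >= 0,
    # the dictionary A being held fixed.
    m, r = A.shape
    x = np.full(r, y.mean() / r + eps)
    g = gamma_beta(beta)
    for _ in range(n_iter):
        ax = A @ x + eps
        num = A.T @ (y * ax**(beta - 2.0))
        den = A.T @ ax**(beta - 1.0) + eps
        x *= (num / den)**g
    return x

rng = np.random.default_rng(4)
A = rng.random((64, 8))
y = A @ rng.random(8)
x_hat = nmf_encode(y, A, beta=1.0, n_iter=500)   # beta = 1: Kullback-Leibler
print(np.abs(A @ x_hat - y).max())               # small reconstruction error
```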

Slide 36/46: Summary and perspectives

Summary:
- Non-negative matrix factorization with convex-concave divergences.
- Many common information divergences.
- Variational bounding with auxiliary functions.
- Symmetrized and skew divergences.

Perspectives:
- Enhanced factorization models: convex models, non-negative tensor models; convolutive models.
- Extended cost functions: penalty terms for regularization; $\ell_p$-norms for sparsity regularization.
- Other generic updates: majorization-equalization and tempered updates; further assumptions on convexity properties.
- Strong convergence properties: convergence of the cost function to a minimum; convergence of the updates.

Slide 37/46: Outline, entering Part II, Chapter 5: Real-Time Polyphonic Music Transcription (proposed approach; experimental results).

Slide 38/46: Background

Principle:
- Convert a raw music signal into a symbolic representation such as a score.
- From a low-level audio waveform to high-level symbolic information.

Approaches:
- Many methods for multiple pitch estimation [de Cheveigné, 2006; Klapuri & Davy, 2006].
- Non-negative matrix factorization [Smaragdis & Brown, 2003; Abdallah & Plumbley, 2004; Virtanen & Klapuri, 2006; Raczyński et al., 2007; Niedermayer, 2008; Marolt, 2009; Grindlay & Ellis, 2009; Vincent et al., 2010].
- Probabilistic latent component analysis [Smaragdis et al., 2008; Mysore & Smaragdis, 2009; Grindlay & Ellis, 2011; Hennequin et al., 2011; Fuentes et al., 2011; Benetos & Dixon, 2011].

Slide 39/46: Motivations and contributions

Issues of approaches based on non-negative matrix factorization:
- Inherently suitable for offline processing.
- Need for structuring the dictionary.
- With the exception of systems for audio decomposition [Sha & Saul, 2005; Cheng et al., 2008; Cont, 2006; Cont et al., 2007], which have specific cost functions, no suitable controls on the decomposition, and potential convergence issues.

Contributions in this context:
- Parametric family of (α, β)-divergences for decomposition.
- Insights into their relevance for audio analysis.
- Flexible control on the frequency compromise during the decomposition.
- Multiplicative updates tailored to real time.
- Monotonic decrease of the cost function.

Slide 40/46: System architecture

Transcription scheme:
1. Learn a dictionary of note templates offline.
2. Decompose the music signal online onto the dictionary.
3. Output the note activations along time.

Note template learning (offline): isolated note samples, then short-term sound representation, then non-negative matrix factorization, yielding note templates.
- Characteristic and discriminative templates.
- Short-term magnitude or power spectrum.

Music signal decomposition (online): auditory scene, then short-time sound representation, then non-negative decomposition, yielding note activations.

[Figure: learned note templates, frequency in Hz against notes A0 to A7.]
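A hedged sketch of the offline template learning step, as rank-1 NMF with the Kullback-Leibler divergence on the spectrogram of one isolated note (the rank-1 choice and the normalization are our assumptions):

```python
import numpy as np

def learn_template(S, n_iter=200, eps=1e-12):
    # Learn one spectral template from the magnitude spectrogram S
    # (frequencies x frames) of an isolated note, by rank-1 NMF with
    # Kullback-Leibler multiplicative updates.
    m, n = S.shape
    w = S.mean(axis=1) + eps          # template initialized at the mean spectrum
    h = np.full(n, 1.0)               # per-frame gains
    for _ in range(n_iter):
        wh = np.outer(w, h) + eps
        w *= (S / wh) @ h / h.sum()
        wh = np.outer(w, h) + eps
        h *= w @ (S / wh) / w.sum()
    return w / (w.sum() + eps)        # normalized note template

# Hypothetical usage: one learned template per note, stacked into a dictionary.
rng = np.random.default_rng(5)
S = np.abs(rng.normal(size=(257, 40)))        # stand-in for a real note spectrogram
A = np.column_stack([learn_template(S)])
print(A.shape)
```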

Slide 41/46: Non-negative decomposition

Scaling property of (α, β)-divergences: $d^{(ab)}_{\alpha,\beta}(\gamma y\|\gamma y') = \gamma^{\alpha+\beta}\, d^{(ab)}_{\alpha,\beta}(y\|y')$.
- Different emphasis is put on the coefficients depending on their magnitude.
- $\alpha + \beta > 0$: more emphasis is put on the higher-magnitude coefficients.
- $\alpha + \beta < 0$: converse effect.
- Compromise between fundamental frequencies, first partials, and higher partials.

Tailored updates in vector form: $x \leftarrow x \odot \left( \dfrac{A^\top \left( y^{\alpha} \odot (Ax)^{\beta-1} \right)}{A^\top (Ax)^{\alpha+\beta-1}} \right)^{1/p_{\alpha,\beta}}$, with powers, products, and quotients taken element-wise.
- 1 element-wise matrix multiplication and 1 transposed vector replication per time frame.
- 3 matrix-vector multiplications, 1 element-wise vector multiplication, 1 element-wise vector division, and 3 element-wise vector powers per iteration.
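A minimal sketch of the online decomposition loop in the α = 1 (β-divergence) case, warm-starting each frame from the previous activations; the per-frame iteration count and the warm start are our assumptions:

```python
import numpy as np

def decompose_stream(frames, A, beta=1.0, iters_per_frame=20, eps=1e-12):
    # Online non-negative decomposition: for each incoming magnitude
    # spectrum, run a few multiplicative updates against the fixed note
    # dictionary A, warm-starting from the previous activations.
    m, r = A.shape
    x = np.full(r, 1.0 / r)
    for y in frames:
        for _ in range(iters_per_frame):
            ax = A @ x + eps
            x *= (A.T @ (y * ax**(beta - 2.0))) / (A.T @ ax**(beta - 1.0) + eps)
        yield x.copy()                 # note activations for the current frame

rng = np.random.default_rng(6)
A = np.abs(rng.normal(size=(257, 12)))        # stand-in note templates
frames = np.abs(rng.normal(size=(5, 257)))    # stand-in magnitude spectra
activations = list(decompose_stream(frames, A))
print(len(activations), activations[-1].shape)
```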

Slide 42/46: Sample example of piano music

[Figure: note activations (A0 to A7) over time in seconds for the excerpt, across a grid of values of α (columns) and β (rows).]

Slide 43/46: Evaluation on multiple fundamental frequency estimation

Evaluation:
- Standard piano dataset [Emiya et al., 2010].
- Methodology from MIREX [Bay et al., 2009].

Results over the dataset for the different divergences:

[Table: threshold and F-measure per distance function, for (α, β) settings covering Itakura-Saito, a β-divergence, Neyman's χ², dual Kullback-Leibler, Hellinger, Kullback-Leibler (α + β = +1.0), Pearson's χ², and several α-divergences up to α = 5.]

Comparison of ABND to the state of the art:
- BHNMF: offline unsupervised non-negative matrix factorization with the β-divergences and a harmonic model exploiting spectral smoothness [Vincent et al., 2010].
- SACS: offline sinusoidal analysis with a candidate selection exploiting spectral features [Yeh et al., 2010].

[Table: precision, recall, F-measure, accuracy, and substitution, miss, false-alarm, and total error rates for ABND, BHNMF, and SACS.]


More information

Pattern Recognition and Machine Learning

Pattern Recognition and Machine Learning Christopher M. Bishop Pattern Recognition and Machine Learning ÖSpri inger Contents Preface Mathematical notation Contents vii xi xiii 1 Introduction 1 1.1 Example: Polynomial Curve Fitting 4 1.2 Probability

More information

THE task of identifying the environment in which a sound

THE task of identifying the environment in which a sound 1 Feature Learning with Matrix Factorization Applied to Acoustic Scene Classification Victor Bisot, Romain Serizel, Slim Essid, and Gaël Richard Abstract In this paper, we study the usefulness of various

More information

On Spectral Basis Selection for Single Channel Polyphonic Music Separation

On Spectral Basis Selection for Single Channel Polyphonic Music Separation On Spectral Basis Selection for Single Channel Polyphonic Music Separation Minje Kim and Seungjin Choi Department of Computer Science Pohang University of Science and Technology San 31 Hyoja-dong, Nam-gu

More information

Independent Component Analysis and Unsupervised Learning. Jen-Tzung Chien

Independent Component Analysis and Unsupervised Learning. Jen-Tzung Chien Independent Component Analysis and Unsupervised Learning Jen-Tzung Chien TABLE OF CONTENTS 1. Independent Component Analysis 2. Case Study I: Speech Recognition Independent voices Nonparametric likelihood

More information

CONVOLUTIVE NON-NEGATIVE MATRIX FACTORISATION WITH SPARSENESS CONSTRAINT

CONVOLUTIVE NON-NEGATIVE MATRIX FACTORISATION WITH SPARSENESS CONSTRAINT CONOLUTIE NON-NEGATIE MATRIX FACTORISATION WITH SPARSENESS CONSTRAINT Paul D. O Grady Barak A. Pearlmutter Hamilton Institute National University of Ireland, Maynooth Co. Kildare, Ireland. ABSTRACT Discovering

More information

Discovering Convolutive Speech Phones using Sparseness and Non-Negativity Constraints

Discovering Convolutive Speech Phones using Sparseness and Non-Negativity Constraints Discovering Convolutive Speech Phones using Sparseness and Non-Negativity Constraints Paul D. O Grady and Barak A. Pearlmutter Hamilton Institute, National University of Ireland Maynooth, Co. Kildare,

More information

topics about f-divergence

topics about f-divergence topics about f-divergence Presented by Liqun Chen Mar 16th, 2018 1 Outline 1 f-gan: Training Generative Neural Samplers using Variational Experiments 2 f-gans in an Information Geometric Nutshell Experiments

More information

Unsupervised Learning

Unsupervised Learning Unsupervised Learning Bayesian Model Comparison Zoubin Ghahramani zoubin@gatsby.ucl.ac.uk Gatsby Computational Neuroscience Unit, and MSc in Intelligent Systems, Dept Computer Science University College

More information

Nonnegative Matrix Factorization with Markov-Chained Bases for Modeling Time-Varying Patterns in Music Spectrograms

Nonnegative Matrix Factorization with Markov-Chained Bases for Modeling Time-Varying Patterns in Music Spectrograms Nonnegative Matrix Factorization with Markov-Chained Bases for Modeling Time-Varying Patterns in Music Spectrograms Masahiro Nakano 1, Jonathan Le Roux 2, Hirokazu Kameoka 2,YuKitano 1, Nobutaka Ono 1,

More information

ACCOUNTING FOR PHASE CANCELLATIONS IN NON-NEGATIVE MATRIX FACTORIZATION USING WEIGHTED DISTANCES. Sebastian Ewert Mark D. Plumbley Mark Sandler

ACCOUNTING FOR PHASE CANCELLATIONS IN NON-NEGATIVE MATRIX FACTORIZATION USING WEIGHTED DISTANCES. Sebastian Ewert Mark D. Plumbley Mark Sandler ACCOUNTING FOR PHASE CANCELLATIONS IN NON-NEGATIVE MATRIX FACTORIZATION USING WEIGHTED DISTANCES Sebastian Ewert Mark D. Plumbley Mark Sandler Queen Mary University of London, London, United Kingdom ABSTRACT

More information

Chapter 2 Exponential Families and Mixture Families of Probability Distributions

Chapter 2 Exponential Families and Mixture Families of Probability Distributions Chapter 2 Exponential Families and Mixture Families of Probability Distributions The present chapter studies the geometry of the exponential family of probability distributions. It is not only a typical

More information

Curve Fitting Re-visited, Bishop1.2.5

Curve Fitting Re-visited, Bishop1.2.5 Curve Fitting Re-visited, Bishop1.2.5 Maximum Likelihood Bishop 1.2.5 Model Likelihood differentiation p(t x, w, β) = Maximum Likelihood N N ( t n y(x n, w), β 1). (1.61) n=1 As we did in the case of the

More information

Fundamentals. CS 281A: Statistical Learning Theory. Yangqing Jia. August, Based on tutorial slides by Lester Mackey and Ariel Kleiner

Fundamentals. CS 281A: Statistical Learning Theory. Yangqing Jia. August, Based on tutorial slides by Lester Mackey and Ariel Kleiner Fundamentals CS 281A: Statistical Learning Theory Yangqing Jia Based on tutorial slides by Lester Mackey and Ariel Kleiner August, 2011 Outline 1 Probability 2 Statistics 3 Linear Algebra 4 Optimization

More information

A Variance Modeling Framework Based on Variational Autoencoders for Speech Enhancement

A Variance Modeling Framework Based on Variational Autoencoders for Speech Enhancement A Variance Modeling Framework Based on Variational Autoencoders for Speech Enhancement Simon Leglaive 1 Laurent Girin 1,2 Radu Horaud 1 1: Inria Grenoble Rhône-Alpes 2: Univ. Grenoble Alpes, Grenoble INP,

More information

STA 4273H: Statistical Machine Learning

STA 4273H: Statistical Machine Learning STA 4273H: Statistical Machine Learning Russ Salakhutdinov Department of Statistics! rsalakhu@utstat.toronto.edu! http://www.utstat.utoronto.ca/~rsalakhu/ Sidney Smith Hall, Room 6002 Lecture 3 Linear

More information

Single Channel Music Sound Separation Based on Spectrogram Decomposition and Note Classification

Single Channel Music Sound Separation Based on Spectrogram Decomposition and Note Classification Single Channel Music Sound Separation Based on Spectrogram Decomposition and Note Classification Hafiz Mustafa and Wenwu Wang Centre for Vision, Speech and Signal Processing (CVSSP) University of Surrey,

More information

PMR Learning as Inference

PMR Learning as Inference Outline PMR Learning as Inference Probabilistic Modelling and Reasoning Amos Storkey Modelling 2 The Exponential Family 3 Bayesian Sets School of Informatics, University of Edinburgh Amos Storkey PMR Learning

More information

Bayesian Methods for Machine Learning

Bayesian Methods for Machine Learning Bayesian Methods for Machine Learning CS 584: Big Data Analytics Material adapted from Radford Neal s tutorial (http://ftp.cs.utoronto.ca/pub/radford/bayes-tut.pdf), Zoubin Ghahramni (http://hunch.net/~coms-4771/zoubin_ghahramani_bayesian_learning.pdf),

More information

Oracle Analysis of Sparse Automatic Music Transcription

Oracle Analysis of Sparse Automatic Music Transcription Oracle Analysis of Sparse Automatic Music Transcription Ken O Hanlon, Hidehisa Nagano, and Mark D. Plumbley Queen Mary University of London NTT Communication Science Laboratories, NTT Corporation {keno,nagano,mark.plumbley}@eecs.qmul.ac.uk

More information

Uncertainty. Jayakrishnan Unnikrishnan. CSL June PhD Defense ECE Department

Uncertainty. Jayakrishnan Unnikrishnan. CSL June PhD Defense ECE Department Decision-Making under Statistical Uncertainty Jayakrishnan Unnikrishnan PhD Defense ECE Department University of Illinois at Urbana-Champaign CSL 141 12 June 2010 Statistical Decision-Making Relevant in

More information

Applications of Information Geometry to Hypothesis Testing and Signal Detection

Applications of Information Geometry to Hypothesis Testing and Signal Detection CMCAA 2016 Applications of Information Geometry to Hypothesis Testing and Signal Detection Yongqiang Cheng National University of Defense Technology July 2016 Outline 1. Principles of Information Geometry

More information

STA 4273H: Statistical Machine Learning

STA 4273H: Statistical Machine Learning STA 4273H: Statistical Machine Learning Russ Salakhutdinov Department of Statistics! rsalakhu@utstat.toronto.edu! http://www.utstat.utoronto.ca/~rsalakhu/ Sidney Smith Hall, Room 6002 Lecture 7 Approximate

More information

Kernel methods and the exponential family

Kernel methods and the exponential family Kernel methods and the exponential family Stephane Canu a Alex Smola b a 1-PSI-FRE CNRS 645, INSA de Rouen, France, St Etienne du Rouvray, France b Statistical Machine Learning Program, National ICT Australia

More information

Expectation Propagation Algorithm

Expectation Propagation Algorithm Expectation Propagation Algorithm 1 Shuang Wang School of Electrical and Computer Engineering University of Oklahoma, Tulsa, OK, 74135 Email: {shuangwang}@ou.edu This note contains three parts. First,

More information

Maximum Marginal Likelihood Estimation For Nonnegative Dictionary Learning

Maximum Marginal Likelihood Estimation For Nonnegative Dictionary Learning Maximum Marginal Likelihood Estimation For Nonnegative Dictionary Learning Onur Dikmen and Cédric Févotte CNRS LTCI; Télécom ParisTech dikmen@telecom-paristech.fr perso.telecom-paristech.fr/ dikmen 26.5.2011

More information

INFORMATION THEORY AND STATISTICS

INFORMATION THEORY AND STATISTICS INFORMATION THEORY AND STATISTICS Solomon Kullback DOVER PUBLICATIONS, INC. Mineola, New York Contents 1 DEFINITION OF INFORMATION 1 Introduction 1 2 Definition 3 3 Divergence 6 4 Examples 7 5 Problems...''.

More information

CS229 Project: Musical Alignment Discovery

CS229 Project: Musical Alignment Discovery S A A V S N N R R S CS229 Project: Musical Alignment iscovery Woodley Packard ecember 16, 2005 Introduction Logical representations of musical data are widely available in varying forms (for instance,

More information

Dimension Reduction (PCA, ICA, CCA, FLD,

Dimension Reduction (PCA, ICA, CCA, FLD, Dimension Reduction (PCA, ICA, CCA, FLD, Topic Models) Yi Zhang 10-701, Machine Learning, Spring 2011 April 6 th, 2011 Parts of the PCA slides are from previous 10-701 lectures 1 Outline Dimension reduction

More information

ESTIMATING TRAFFIC NOISE LEVELS USING ACOUSTIC MONITORING: A PRELIMINARY STUDY

ESTIMATING TRAFFIC NOISE LEVELS USING ACOUSTIC MONITORING: A PRELIMINARY STUDY ESTIMATING TRAFFIC NOISE LEVELS USING ACOUSTIC MONITORING: A PRELIMINARY STUDY Jean-Rémy Gloaguen, Arnaud Can Ifsttar - LAE Route de Bouaye - CS4 44344, Bouguenais, FR jean-remy.gloaguen@ifsttar.fr Mathieu

More information

EUSIPCO

EUSIPCO EUSIPCO 213 1569744273 GAMMA HIDDEN MARKOV MODEL AS A PROBABILISTIC NONNEGATIVE MATRIX FACTORIZATION Nasser Mohammadiha, W. Bastiaan Kleijn, Arne Leijon KTH Royal Institute of Technology, Department of

More information

Feature Learning with Matrix Factorization Applied to Acoustic Scene Classification

Feature Learning with Matrix Factorization Applied to Acoustic Scene Classification Feature Learning with Matrix Factorization Applied to Acoustic Scene Classification Victor Bisot, Romain Serizel, Slim Essid, Gaël Richard To cite this version: Victor Bisot, Romain Serizel, Slim Essid,

More information

arxiv: v1 [cs.lg] 1 May 2010

arxiv: v1 [cs.lg] 1 May 2010 A Geometric View of Conjugate Priors Arvind Agarwal and Hal Daumé III School of Computing, University of Utah, Salt Lake City, Utah, 84112 USA {arvind,hal}@cs.utah.edu arxiv:1005.0047v1 [cs.lg] 1 May 2010

More information

Machine Learning. CUNY Graduate Center, Spring Lectures 11-12: Unsupervised Learning 1. Professor Liang Huang.

Machine Learning. CUNY Graduate Center, Spring Lectures 11-12: Unsupervised Learning 1. Professor Liang Huang. Machine Learning CUNY Graduate Center, Spring 2013 Lectures 11-12: Unsupervised Learning 1 (Clustering: k-means, EM, mixture models) Professor Liang Huang huang@cs.qc.cuny.edu http://acl.cs.qc.edu/~lhuang/teaching/machine-learning

More information

STA414/2104 Statistical Methods for Machine Learning II

STA414/2104 Statistical Methods for Machine Learning II STA414/2104 Statistical Methods for Machine Learning II Murat A. Erdogdu & David Duvenaud Department of Computer Science Department of Statistical Sciences Lecture 3 Slide credits: Russ Salakhutdinov Announcements

More information

LONG-TERM REVERBERATION MODELING FOR UNDER-DETERMINED AUDIO SOURCE SEPARATION WITH APPLICATION TO VOCAL MELODY EXTRACTION.

LONG-TERM REVERBERATION MODELING FOR UNDER-DETERMINED AUDIO SOURCE SEPARATION WITH APPLICATION TO VOCAL MELODY EXTRACTION. LONG-TERM REVERBERATION MODELING FOR UNDER-DETERMINED AUDIO SOURCE SEPARATION WITH APPLICATION TO VOCAL MELODY EXTRACTION. Romain Hennequin Deezer R&D 10 rue d Athènes, 75009 Paris, France rhennequin@deezer.com

More information

744 IEEE TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING, VOL. 19, NO. 4, MAY 2011

744 IEEE TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING, VOL. 19, NO. 4, MAY 2011 744 IEEE TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING, VOL. 19, NO. 4, MAY 2011 NMF With Time Frequency Activations to Model Nonstationary Audio Events Romain Hennequin, Student Member, IEEE,

More information

A Geometric View of Conjugate Priors

A Geometric View of Conjugate Priors A Geometric View of Conjugate Priors Arvind Agarwal and Hal Daumé III School of Computing, University of Utah, Salt Lake City, Utah, 84112 USA {arvind,hal}@cs.utah.edu Abstract. In Bayesian machine learning,

More information

Inference. Data. Model. Variates

Inference. Data. Model. Variates Data Inference Variates Model ˆθ (,..., ) mˆθn(d) m θ2 M m θ1 (,, ) (,,, ) (,, ) α = :=: (, ) F( ) = = {(, ),, } F( ) X( ) = Γ( ) = Σ = ( ) = ( ) ( ) = { = } :=: (U, ) , = { = } = { = } x 2 e i, e j

More information

Learning Bayesian network : Given structure and completely observed data

Learning Bayesian network : Given structure and completely observed data Learning Bayesian network : Given structure and completely observed data Probabilistic Graphical Models Sharif University of Technology Spring 2017 Soleymani Learning problem Target: true distribution

More information

Stochastic Proximal Gradient Algorithm

Stochastic Proximal Gradient Algorithm Stochastic Institut Mines-Télécom / Telecom ParisTech / Laboratoire Traitement et Communication de l Information Joint work with: Y. Atchade, Ann Arbor, USA, G. Fort LTCI/Télécom Paristech and the kind

More information

Principles of Pattern Recognition. C. A. Murthy Machine Intelligence Unit Indian Statistical Institute Kolkata

Principles of Pattern Recognition. C. A. Murthy Machine Intelligence Unit Indian Statistical Institute Kolkata Principles of Pattern Recognition C. A. Murthy Machine Intelligence Unit Indian Statistical Institute Kolkata e-mail: murthy@isical.ac.in Pattern Recognition Measurement Space > Feature Space >Decision

More information

BILEVEL SPARSE MODELS FOR POLYPHONIC MUSIC TRANSCRIPTION

BILEVEL SPARSE MODELS FOR POLYPHONIC MUSIC TRANSCRIPTION BILEVEL SPARSE MODELS FOR POLYPHONIC MUSIC TRANSCRIPTION Tal Ben Yakar Tel Aviv University talby0@gmail.com Roee Litman Tel Aviv University roeelitman@gmail.com Alex Bronstein Tel Aviv University bron@eng.tau.ac.il

More information

On the Chi square and higher-order Chi distances for approximating f-divergences

On the Chi square and higher-order Chi distances for approximating f-divergences On the Chi square and higher-order Chi distances for approximating f-divergences Frank Nielsen, Senior Member, IEEE and Richard Nock, Nonmember Abstract We report closed-form formula for calculating the

More information

REVIEW OF SINGLE CHANNEL SOURCE SEPARATION TECHNIQUES

REVIEW OF SINGLE CHANNEL SOURCE SEPARATION TECHNIQUES REVIEW OF SINGLE CHANNEL SOURCE SEPARATION TECHNIQUES Kedar Patki University of Rochester Dept. of Electrical and Computer Engineering kedar.patki@rochester.edu ABSTRACT The paper reviews the problem of

More information

PART I INTRODUCTION The meaning of probability Basic definitions for frequentist statistics and Bayesian inference Bayesian inference Combinatorics

PART I INTRODUCTION The meaning of probability Basic definitions for frequentist statistics and Bayesian inference Bayesian inference Combinatorics Table of Preface page xi PART I INTRODUCTION 1 1 The meaning of probability 3 1.1 Classical definition of probability 3 1.2 Statistical definition of probability 9 1.3 Bayesian understanding of probability

More information

Irr. Statistical Methods in Experimental Physics. 2nd Edition. Frederick James. World Scientific. CERN, Switzerland

Irr. Statistical Methods in Experimental Physics. 2nd Edition. Frederick James. World Scientific. CERN, Switzerland Frederick James CERN, Switzerland Statistical Methods in Experimental Physics 2nd Edition r i Irr 1- r ri Ibn World Scientific NEW JERSEY LONDON SINGAPORE BEIJING SHANGHAI HONG KONG TAIPEI CHENNAI CONTENTS

More information

TIME-DEPENDENT PARAMETRIC AND HARMONIC TEMPLATES IN NON-NEGATIVE MATRIX FACTORIZATION

TIME-DEPENDENT PARAMETRIC AND HARMONIC TEMPLATES IN NON-NEGATIVE MATRIX FACTORIZATION TIME-DEPENDENT PARAMETRIC AND HARMONIC TEMPLATES IN NON-NEGATIVE MATRIX FACTORIZATION Romain Hennequin, Roland Badeau and Bertrand David, Institut Telecom; Telecom ParisTech; CNRS LTCI Paris France romainhennequin@telecom-paristechfr

More information

Probability and Statistics

Probability and Statistics Kristel Van Steen, PhD 2 Montefiore Institute - Systems and Modeling GIGA - Bioinformatics ULg kristel.vansteen@ulg.ac.be Chapter 3: Parametric families of univariate distributions CHAPTER 3: PARAMETRIC

More information

Ch 4. Linear Models for Classification

Ch 4. Linear Models for Classification Ch 4. Linear Models for Classification Pattern Recognition and Machine Learning, C. M. Bishop, 2006. Department of Computer Science and Engineering Pohang University of Science and echnology 77 Cheongam-ro,

More information

Information Geometry of Positive Measures and Positive-Definite Matrices: Decomposable Dually Flat Structure

Information Geometry of Positive Measures and Positive-Definite Matrices: Decomposable Dually Flat Structure Entropy 014, 16, 131-145; doi:10.3390/e1604131 OPEN ACCESS entropy ISSN 1099-4300 www.mdpi.com/journal/entropy Article Information Geometry of Positive Measures and Positive-Definite Matrices: Decomposable

More information

ECE521 week 3: 23/26 January 2017

ECE521 week 3: 23/26 January 2017 ECE521 week 3: 23/26 January 2017 Outline Probabilistic interpretation of linear regression - Maximum likelihood estimation (MLE) - Maximum a posteriori (MAP) estimation Bias-variance trade-off Linear

More information

MULTI-RESOLUTION SIGNAL DECOMPOSITION WITH TIME-DOMAIN SPECTROGRAM FACTORIZATION. Hirokazu Kameoka

MULTI-RESOLUTION SIGNAL DECOMPOSITION WITH TIME-DOMAIN SPECTROGRAM FACTORIZATION. Hirokazu Kameoka MULTI-RESOLUTION SIGNAL DECOMPOSITION WITH TIME-DOMAIN SPECTROGRAM FACTORIZATION Hiroazu Kameoa The University of Toyo / Nippon Telegraph and Telephone Corporation ABSTRACT This paper proposes a novel

More information

Foundations of Nonparametric Bayesian Methods

Foundations of Nonparametric Bayesian Methods 1 / 27 Foundations of Nonparametric Bayesian Methods Part II: Models on the Simplex Peter Orbanz http://mlg.eng.cam.ac.uk/porbanz/npb-tutorial.html 2 / 27 Tutorial Overview Part I: Basics Part II: Models

More information

Statistical and Inductive Inference by Minimum Message Length

Statistical and Inductive Inference by Minimum Message Length C.S. Wallace Statistical and Inductive Inference by Minimum Message Length With 22 Figures Springer Contents Preface 1. Inductive Inference 1 1.1 Introduction 1 1.2 Inductive Inference 5 1.3 The Demise

More information

Linear Classification

Linear Classification Linear Classification Lili MOU moull12@sei.pku.edu.cn http://sei.pku.edu.cn/ moull12 23 April 2015 Outline Introduction Discriminant Functions Probabilistic Generative Models Probabilistic Discriminative

More information

CS839: Probabilistic Graphical Models. Lecture 7: Learning Fully Observed BNs. Theo Rekatsinas

CS839: Probabilistic Graphical Models. Lecture 7: Learning Fully Observed BNs. Theo Rekatsinas CS839: Probabilistic Graphical Models Lecture 7: Learning Fully Observed BNs Theo Rekatsinas 1 Exponential family: a basic building block For a numeric random variable X p(x ) =h(x)exp T T (x) A( ) = 1

More information

Chapter 2. Binary and M-ary Hypothesis Testing 2.1 Introduction (Levy 2.1)

Chapter 2. Binary and M-ary Hypothesis Testing 2.1 Introduction (Levy 2.1) Chapter 2. Binary and M-ary Hypothesis Testing 2.1 Introduction (Levy 2.1) Detection problems can usually be casted as binary or M-ary hypothesis testing problems. Applications: This chapter: Simple hypothesis

More information

A CUSUM approach for online change-point detection on curve sequences

A CUSUM approach for online change-point detection on curve sequences ESANN 22 proceedings, European Symposium on Artificial Neural Networks, Computational Intelligence and Machine Learning. Bruges Belgium, 25-27 April 22, i6doc.com publ., ISBN 978-2-8749-49-. Available

More information

9/12/17. Types of learning. Modeling data. Supervised learning: Classification. Supervised learning: Regression. Unsupervised learning: Clustering

9/12/17. Types of learning. Modeling data. Supervised learning: Classification. Supervised learning: Regression. Unsupervised learning: Clustering Types of learning Modeling data Supervised: we know input and targets Goal is to learn a model that, given input data, accurately predicts target data Unsupervised: we know the input only and want to make

More information

Introduction to Information Geometry

Introduction to Information Geometry Introduction to Information Geometry based on the book Methods of Information Geometry written by Shun-Ichi Amari and Hiroshi Nagaoka Yunshu Liu 2012-02-17 Outline 1 Introduction to differential geometry

More information

OPTIMAL WEIGHT LEARNING FOR COUPLED TENSOR FACTORIZATION WITH MIXED DIVERGENCES

OPTIMAL WEIGHT LEARNING FOR COUPLED TENSOR FACTORIZATION WITH MIXED DIVERGENCES OPTIMAL WEIGHT LEARNING FOR COUPLED TENSOR FACTORIZATION WITH MIXED DIVERGENCES Umut Şimşekli, Beyza Ermiş, A. Taylan Cemgil Boğaziçi University Dept. of Computer Engineering 34342, Bebek, İstanbul, Turkey

More information

Lecture 3: Pattern Classification

Lecture 3: Pattern Classification EE E6820: Speech & Audio Processing & Recognition Lecture 3: Pattern Classification 1 2 3 4 5 The problem of classification Linear and nonlinear classifiers Probabilistic classification Gaussians, mixtures

More information

Need for Deep Networks Perceptron. Can only model linear functions. Kernel Machines. Non-linearity provided by kernels

Need for Deep Networks Perceptron. Can only model linear functions. Kernel Machines. Non-linearity provided by kernels Need for Deep Networks Perceptron Can only model linear functions Kernel Machines Non-linearity provided by kernels Need to design appropriate kernels (possibly selecting from a set, i.e. kernel learning)

More information

Introduction to Probabilistic Machine Learning

Introduction to Probabilistic Machine Learning Introduction to Probabilistic Machine Learning Piyush Rai Dept. of CSE, IIT Kanpur (Mini-course 1) Nov 03, 2015 Piyush Rai (IIT Kanpur) Introduction to Probabilistic Machine Learning 1 Machine Learning

More information

Lecture 3: Pattern Classification. Pattern classification

Lecture 3: Pattern Classification. Pattern classification EE E68: Speech & Audio Processing & Recognition Lecture 3: Pattern Classification 3 4 5 The problem of classification Linear and nonlinear classifiers Probabilistic classification Gaussians, mitures and

More information

Stat 5101 Lecture Notes

Stat 5101 Lecture Notes Stat 5101 Lecture Notes Charles J. Geyer Copyright 1998, 1999, 2000, 2001 by Charles J. Geyer May 7, 2001 ii Stat 5101 (Geyer) Course Notes Contents 1 Random Variables and Change of Variables 1 1.1 Random

More information

Statistics: Learning models from data

Statistics: Learning models from data DS-GA 1002 Lecture notes 5 October 19, 2015 Statistics: Learning models from data Learning models from data that are assumed to be generated probabilistically from a certain unknown distribution is a crucial

More information

t Tao Group Limited, Reading, RG6 IAZ, U.K. wenwu.wang(ieee.org t Samsung Electronics Research Institute, Staines, TW1 8 4QE, U.K.

t Tao Group Limited, Reading, RG6 IAZ, U.K.   wenwu.wang(ieee.org t Samsung Electronics Research Institute, Staines, TW1 8 4QE, U.K. NON-NEGATIVE MATRIX FACTORIZATION FOR NOTE ONSET DETECTION OF AUDIO SIGNALS Wenwu Wangt, Yuhui Luol, Jonathon A. Chambers', and Saeid Sanei t Tao Group Limited, Reading, RG6 IAZ, U.K. Email: wenwu.wang(ieee.org

More information

MULTIPITCH ESTIMATION AND INSTRUMENT RECOGNITION BY EXEMPLAR-BASED SPARSE REPRESENTATION. Ikuo Degawa, Kei Sato, Masaaki Ikehara

MULTIPITCH ESTIMATION AND INSTRUMENT RECOGNITION BY EXEMPLAR-BASED SPARSE REPRESENTATION. Ikuo Degawa, Kei Sato, Masaaki Ikehara MULTIPITCH ESTIMATION AND INSTRUMENT RECOGNITION BY EXEMPLAR-BASED SPARSE REPRESENTATION Ikuo Degawa, Kei Sato, Masaaki Ikehara EEE Dept. Keio University Yokohama, Kanagawa 223-8522 Japan E-mail:{degawa,

More information

10ème Congrès Français d Acoustique

10ème Congrès Français d Acoustique 1ème Congrès Français d Acoustique Lyon, 1-16 Avril 1 Spectral similarity measure invariant to pitch shifting and amplitude scaling Romain Hennequin 1, Roland Badeau 1, Bertrand David 1 1 Institut TELECOM,

More information