Semi-Supervised Learning of Speech Sounds

1 Aren Jansen, Partha Niyogi
Department of Computer Science
Interspeech 2007

2 Objectives
1. Present a manifold learning algorithm based on locality preserving projections for semi-supervised phone classification (LPPSSL)
2. Perform toy classification experiments designed to isolate the role of geometric structure in the speech domain
3. Demonstrate that exploiting both manifold and cluster structure is necessary for semi-supervised success

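3 The Speech Manifold
[Diagram: an articulatory trajectory $a(t) \in \mathcal{A}$ is mapped by $\varphi$ to $\varphi[a(t)] \in \mathcal{M}$; the diagram also labels $s(t)$]
$\mathcal{A}$ = space of vocal tract articulatory configurations
$\mathcal{M}$ = space of vocal tract transfer functions
Physics $\Rightarrow$ $\varphi : \mathcal{A} \to \mathcal{M}$ is a diffeomorphism
Low $\dim(\mathcal{A})$ $\Rightarrow$ $\mathcal{M}$ is a low-dimensional manifold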

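4 The Laplacian Operator, $\Delta_{\mathcal{M}}$
Second-order differential operator on the manifold $\mathcal{M}$
Normalized eigenfunctions $\{e_i\}$ form an orthogonal basis for $L^2(\mathcal{M})$ (i.e. $f = \sum_i a_i e_i$)
Define the smoothness functional: $S[f] = \int_{\mathcal{M}} \|\nabla_{\mathcal{M}} f\|^2 \, d\mu = \langle \Delta_{\mathcal{M}} f, f \rangle_{L^2(\mathcal{M})}$
$S[e_i] = \lambda_i$
Low $\lambda_i$ $\Rightarrow$ $e_i$ varies more smoothly with geodesic distance along the manifold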
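The identity $S[e_i] = \lambda_i$ can be checked in one line. The short derivation below assumes the eigenfunctions satisfy $\Delta_{\mathcal{M}} e_i = \lambda_i e_i$ and are orthonormal, $\langle e_i, e_j \rangle = \delta_{ij}$ (an assumption consistent with "normalized" and "orthogonal" on the slide). It also shows why a function built mostly from low-$\lambda_i$ components is smooth.

```latex
S[e_i] = \langle \Delta_{\mathcal{M}} e_i, e_i \rangle_{L^2(\mathcal{M})}
       = \lambda_i \langle e_i, e_i \rangle_{L^2(\mathcal{M})}
       = \lambda_i,
\qquad
S[f] = \Big\langle \Delta_{\mathcal{M}} \sum_i a_i e_i,\; \sum_j a_j e_j \Big\rangle_{L^2(\mathcal{M})}
     = \sum_i a_i^2 \lambda_i .
```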

5 The Graph Laplacian Operator, $L_G$
Given $x_1, x_2, \ldots, x_N \in \mathcal{M}$, construct a k-nearest neighbor adjacency graph $G$ with adjacency matrix $W$
$L_G = D - W$, where $D_{ii} = \sum_j W_{ij}$
Analogous to $\Delta_{\mathcal{M}}$, but restricted to functions on the graph
$S_G[\mathbf{f}] = \dfrac{\mathbf{f}^T L_G \mathbf{f}}{\mathbf{f}^T D \mathbf{f}}$, where $\mathbf{f} = [f(x_1), \ldots, f(x_N)]^T$
The function that minimizes $S_G$ is the minimum cut for $G$
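The construction on this slide can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' code: the 0/1 edge weights, the "either direction" symmetrization rule, and the function names `knn_adjacency`, `graph_laplacian`, and `smoothness` are all assumptions made for the example.

```python
import numpy as np

def knn_adjacency(X, k):
    """Symmetric 0/1 k-nearest-neighbor adjacency matrix W for the rows of X."""
    n = X.shape[0]
    # Pairwise squared Euclidean distances.
    sq = np.sum(X**2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    np.fill_diagonal(d2, np.inf)  # a point is not its own neighbor
    W = np.zeros((n, n))
    nearest = np.argsort(d2, axis=1)[:, :k]
    rows = np.repeat(np.arange(n), k)
    W[rows, nearest.ravel()] = 1.0
    # Assumed symmetrization: connect i and j if either is among the other's k nearest.
    return np.maximum(W, W.T)

def graph_laplacian(W):
    """Return (L_G, D) with D_ii = sum_j W_ij and L_G = D - W."""
    D = np.diag(W.sum(axis=1))
    return D - W, D

def smoothness(f, L, D):
    """Graph smoothness functional S_G[f] = f^T L_G f / f^T D f."""
    return float(f @ L @ f) / float(f @ D @ f)
```

With two well-separated clusters and a small k, the graph has no edges between clusters, so the ±1 cluster indicator attains $S_G = 0$; this is the sense in which low-$S_G$ functions correspond to cuts.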

6 Computing an Ordered Intrinsic Basis with LPP
Solve the optimization problem: $\mathbf{f}^* = \arg\min_{\mathbf{f}^T D \mathbf{f} = 1} \mathbf{f}^T L_G \mathbf{f}$ $\Rightarrow$ $L_G \mathbf{f}_k = \lambda_k D \mathbf{f}_k$
Extend the $k$th eigenfunction, $f_k$, out of sample ($f_k \in \mathcal{H}_K$): $f_k(v) = \sum_{i=1}^{N} \alpha_i^{(k)} K(x_i, v)$, where $\alpha^{(k)} = K^{+} \mathbf{f}_k$ and $K^{+}$ = pseudoinverse of the $N \times N$ Gram matrix with $K_{ij} = K(x_i, x_j)$
Sort eigenfunctions according to $m(f) = \dfrac{(f)^T L_G (f)}{(f)^T D (f)}$
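The two steps on this slide (generalized eigenproblem, then kernel out-of-sample extension) might be sketched as follows. Several choices here are assumptions for illustration only: the Gaussian kernel and its width `sigma`, the function names, a connected graph (so exactly one trivial constant eigenvector is skipped), and ordering by ascending eigenvalue as returned by `np.linalg.eigh` rather than by the slide's $m(f)$.

```python
import numpy as np

def intrinsic_basis(L, D, d):
    """Solve L f = lambda D f; return the d smallest non-trivial eigenpairs."""
    deg = np.diag(D)
    s = 1.0 / np.sqrt(deg)            # assumes every node has degree > 0
    # Symmetric form: D^{-1/2} L D^{-1/2} g = lambda g, with f = D^{-1/2} g.
    M = s[:, None] * L * s[None, :]
    lam, G = np.linalg.eigh(M)        # eigenvalues in ascending order
    F = s[:, None] * G                # columns satisfy f^T D f = 1
    # Skip the first (constant, lambda = 0) eigenvector of a connected graph.
    return lam[1:d + 1], F[:, 1:d + 1]

def rbf_gram(A, B, sigma):
    """Gaussian kernel K(a, b) = exp(-||a - b||^2 / (2 sigma^2)) (an assumed choice)."""
    d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2.0 * A @ B.T
    return np.exp(-np.maximum(d2, 0.0) / (2.0 * sigma**2))

def extend_out_of_sample(X, F, V, sigma):
    """Evaluate f_k(v) = sum_i alpha_i^(k) K(x_i, v) with alpha^(k) = K^+ f_k."""
    K = rbf_gram(X, X, sigma)
    alpha = np.linalg.pinv(K) @ F     # one column of coefficients per eigenfunction
    return rbf_gram(V, X, sigma) @ alpha
```

When the Gram matrix is well conditioned, evaluating the extension at the training points themselves reproduces the graph eigenvectors, which is a convenient sanity check before applying it to new points.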

7 LPPSSL: Incorporating Labelled Data
Given $d$ non-trivial basis functions, $\{f_1, \ldots, f_d\}$, cast the labelled examples $\{x_i\}_{i=1}^{l}$ into the intrinsic representation: $\tilde{x} = [f_1(x), \ldots, f_d(x)]$
Determine the map from the intrinsic representation to labels using any machine learning method
For a linear map (min-norm solution):
1. Let $F$ be the $(l \times d)$ matrix with $F_{ij} = f_j(x_i)$
2. Let $y_l \in \{-1, 1\}^l$ be the vector of training labels
3. Solve $y_l = F\beta$ (for $l \neq d$, use the Moore-Penrose inverse)
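The linear-map steps above amount to one pseudoinverse. A minimal sketch, assuming `F` already holds the intrinsic coordinates of the labelled examples (one row per example) and that test points are classified by the sign of the linear score; the function names and the sign rule are illustrative choices, not taken from the slides.

```python
import numpy as np

def fit_linear_map(F, y):
    """Solve y = F beta via the Moore-Penrose pseudoinverse."""
    return np.linalg.pinv(F) @ y

def predict(F_new, beta):
    """Classify by the sign of the linear score (ties map to +1)."""
    return np.where(F_new @ beta >= 0, 1, -1)

# Toy data: l = 6 labelled examples in a d = 2 intrinsic representation.
F = np.array([[1.0, 0.1], [0.9, -0.2], [1.1, 0.0],
              [-1.0, 0.2], [-0.8, -0.1], [-1.2, 0.1]])
y = np.array([1, 1, 1, -1, -1, -1])
beta = fit_linear_map(F, y)
labels = predict(F, beta)
```

For $l > d$ the pseudoinverse gives the least-squares fit; for $l < d$ it gives the minimum-norm solution among all exact fits.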

8 A Detailed Example: /a/-/æ/ Classification
50-dim DFT representation for each example
500 training examples from each class
Test procedure (repeat 400 times):
1. Randomly label l/2 of each class
2. Compute linear classifiers with l labelled and u = 1000 - l unlabelled examples
3. Test on additional 1000 examples
[Figure: [a]-[ae] classification test error histogram; series: 2 Component LPPSSL, Minimum Norm; x-axis: Test error (%)]
Optimal RLS Error = 9.6%
Median LPPSSL Error = 11.8%
Median Min-norm Error = 24.3%

9 A Detailed Example: /a/-/æ/ Classification
[Figure: [a]-[ae] median test error vs. number of labelled examples; series: 2 Component LPPSSL, Minimum Norm, Optimal; axes: Median Test Error vs. Number of Labelled Examples]
Define gap improvement: $G(l) = \dfrac{\text{Min-norm error} - \text{LPPSSL error}}{\text{Min-norm error} - \text{Optimal error}} \approx \text{const}$
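As a worked instance of the gap improvement, the snippet below plugs in the three /a/-/æ/ error rates quoted on slide 8 (optimal RLS 9.6%, median LPPSSL 11.8%, median min-norm 24.3%). The slides do not state which value of l those figures correspond to, so this is only an arithmetic illustration; the function name is hypothetical.

```python
def gap_improvement(e_mn, e_ssl, e_opt):
    """G = (E_mn - E_ssl) / (E_mn - E_opt)."""
    if e_mn == e_opt:
        raise ValueError("min-norm error equals optimal error; gap is undefined")
    return (e_mn - e_ssl) / (e_mn - e_opt)

g = gap_improvement(e_mn=24.3, e_ssl=11.8, e_opt=9.6)
print(f"{100 * g:.1f}% of the gap recovered")  # prints "85.0% of the gap recovered"
```

A value of 1 would mean the semi-supervised classifier matches the optimal error, and 0 would mean no improvement over the min-norm baseline.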

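10 Performance Across the Vowel Manifold
[Table columns: Pair, $E_{opt}$, $E_{mn}$, $E_{ssl}$, $G(6)$; the vowel pairs and values are not recoverable from the extracted text]
$E_{opt}$ = optimal RLS error
$E_{mn}$ = median min-norm error (l = 6)
$E_{ssl}$ = median 2-comp. LPPSSL error (l = 6)
$G(6) = \dfrac{E_{mn} - E_{ssl}}{E_{mn} - E_{opt}}$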

11 Performance Across the Vowel Manifold
Group A pairs are poorly separable/minimally clustered (e.g. close vs. near close)
Group B pairs are highly clustered with distinct articulator configurations (e.g. close vs. open, front vs. back)
Manifold structure admits significant gap improvements for Group A pairs
[Figure: LPPSSL Gap Improvement (l=6) vs. Optimal Error Rate; series: Group A, Group B; axes: Percentage of Gap Improvement (LPPSSL) vs. Optimal RLS Test Error (%)]

12 Isolating the Role of Cluster Structure
Semi-supervised EMGMM algorithm:
1. Train 2-mixture GMM with l = 10 labelled examples
2. Classify unlabelled examples and iterate
Optimal GMM and RLS error rates clearly correlated
EMGMM fails on Group A problems
[Figure: EMGMM Gap Improvement (l=10) vs. Optimal Error Rate; series: Group A, Group B; axes: Percentage of Gap Improvement (EMGMM) vs. Optimal RLS Test Error (%)]
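The slide gives the EMGMM procedure only in outline. One plausible reading is sketched below: a two-component Gaussian mixture is initialized from the labelled examples, then refined by EM over labelled plus unlabelled data, with the labelled responsibilities held fixed. The diagonal covariances, the variance floor, the fixed iteration count, and all names are assumptions for this sketch and may differ from what the authors used.

```python
import numpy as np

def log_gauss_diag(X, mean, var):
    """Log density of a diagonal-covariance Gaussian at each row of X."""
    return -0.5 * np.sum(np.log(2 * np.pi * var) + (X - mean) ** 2 / var, axis=1)

def em_gmm_ssl(X_lab, y_lab, X_unl, n_iter=20, var_floor=1e-3):
    """Two-component GMM: initialize from labels, then refine with EM on all data.

    y_lab holds labels in {-1, +1}; returns predicted labels for X_unl.
    """
    classes = np.array([-1, 1])
    X = np.vstack([X_lab, X_unl])
    # Responsibilities: labelled rows are fixed one-hot, unlabelled rows are updated.
    R_lab = (y_lab[:, None] == classes[None, :]).astype(float)
    R = np.vstack([R_lab, np.full((len(X_unl), 2), 0.5)])
    # Step 1 uses weights that ignore the unlabelled rows entirely.
    init_weights = np.vstack([R_lab, np.zeros((len(X_unl), 2))])
    for it in range(n_iter + 1):
        Wt = init_weights if it == 0 else R
        # M-step: mixing weights, means, and floored diagonal variances.
        Nk = Wt.sum(axis=0)
        pi = Nk / Nk.sum()
        means = (Wt.T @ X) / Nk[:, None]
        var = np.stack([(Wt[:, c] @ (X - means[c]) ** 2) / Nk[c] for c in range(2)])
        var = np.maximum(var, var_floor)
        # Step 2 (E-step): soft-classify the unlabelled examples, then iterate.
        logp = np.stack([np.log(pi[c]) + log_gauss_diag(X_unl, means[c], var[c])
                         for c in range(2)], axis=1)
        logp -= logp.max(axis=1, keepdims=True)
        P = np.exp(logp)
        R[len(X_lab):] = P / P.sum(axis=1, keepdims=True)
    return classes[np.argmax(R[len(X_lab):], axis=1)]
```

A procedure of this kind relies on each class forming its own cluster, which is consistent with the slide's observation that it fails on the poorly clustered Group A pairs.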

13 Broad Class Performance
[Table columns: Pair, $E_{opt}$, $E_{mn}$, $E_{ssl}$, $G(6)$; rows: Ap-V, St-F, St-Ap, N-Ap, St-V, Af-F, N-V, St-N, F-Ap, St-Af, F-V, F-N, Af-Ap, Af-V, Af-N; values are not recoverable from the extracted text]
Classes: Vowels (V), Approximants (Ap), Nasals (N), Fricatives (F), Affricates (Af), Stops (St)
500 train/test examples for each class
Individual phones represented according to their occurrence rate in TIMIT

14 Broad Class vs. Vowel Performance
Broad class clusters are more separated than vowel clusters
Min-norm on broad classes outperforms min-norm on vowels
LPPSSL performance is roughly the same on both
Accommodating both manifold and cluster structure provides invariance to cluster separation
[Figure: two panels, Min-norm Error (l=6) and LPPSSL Error (l=6), each plotted against Optimal RLS Error; series: Vowels, BC, Optimal]

15 Conclusions
Speech sounds have an approximate low-dimensional manifold structure
Presented the LPPSSL algorithm to leverage manifold structure for semi-supervised learning
Cluster structure alone is insufficient for the speech domain
Manifold structure can be beneficial even with minimal supervision
