Advances in kernel exponential families
1 Advances in kernel exponential families
Arthur Gretton, Gatsby Computational Neuroscience Unit, University College London. NIPS.
2 Outline
Motivating application: fast estimation of complex multivariate densities.
The infinite exponential family:
- Multivariate Gaussian → Gaussian process
- Finite mixture model → Dirichlet process mixture model
- Finite exponential family → ???
In this talk: guaranteed speed improvements by Nyström; conditional models.
3 Goal: learn high dimensional, complex densities
(figures: scatter plots of the Red Wine and Parkinsons datasets)
We want: efficient computation and representation; statistical guarantees.
4 The exponential family
The exponential family in $\mathbb{R}^d$:
$$p(x) = \exp\big( \langle \theta, T(x) \rangle - A(\theta) \big)\, q_0(x),$$
with natural parameter $\theta$, sufficient statistic $T(x)$, log normaliser $A(\theta)$, and base measure $q_0(x)$.
Examples:
- Gaussian density: $T(x) = [x,\; x^2]$
- Gamma density: $T(x) = [\ln x,\; x]$
Can we extend this to infinite dimensions?
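As a quick numerical sanity check (not part of the slides): the Gaussian density can be written in the natural form above with $T(x) = [x, x^2]$, $\theta = [\mu/\sigma^2,\, -1/(2\sigma^2)]$, $A(\theta) = \mu^2/(2\sigma^2) + \log(\sigma\sqrt{2\pi})$, and $q_0 \equiv 1$. The parameter values below are arbitrary illustrative choices.

```python
import numpy as np

def gaussian_pdf(x, mu, sigma):
    """Standard N(mu, sigma^2) density."""
    return np.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * np.sqrt(2 * np.pi))

def expfam_pdf(x, theta, A):
    """Exponential-family form exp(<theta, T(x)> - A(theta)) with T(x) = [x, x^2]
    and base measure q0(x) = 1."""
    T = np.stack([x, x ** 2])          # sufficient statistic, shape (2, n)
    return np.exp(theta @ T - A)

mu, sigma = 1.5, 0.7
theta = np.array([mu / sigma ** 2, -1.0 / (2 * sigma ** 2)])          # natural parameters
A = mu ** 2 / (2 * sigma ** 2) + np.log(sigma * np.sqrt(2 * np.pi))   # log normaliser

x = np.linspace(-3.0, 5.0, 9)
print(np.allclose(expfam_pdf(x, theta, A), gaussian_pdf(x, mu, sigma)))  # True
```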
5-6 Infinitely many features using kernels
Kernels: dot products of features.
Feature map $\varphi(x) \in \mathcal{H}$, $\varphi(x) = [\ldots\, \varphi_i(x)\, \ldots] \in \ell_2$.
Exponentiated quadratic kernel: $k(x, x') = \exp\!\big(-\|x - x'\|^2 / (2\sigma^2)\big)$.
For positive definite $k$,
$$k(x, x') = \langle \varphi(x), \varphi(x') \rangle_{\mathcal{H}} = \langle k(x, \cdot), k(x', \cdot) \rangle_{\mathcal{H}}.$$
Infinitely many features $\varphi(x)$, dot product in closed form!
Features: Gaussian Processes for Machine Learning, Rasmussen and Williams, Ch. 4.
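A minimal code sketch of the exponentiated quadratic kernel (the bandwidth $\sigma$ and the data are arbitrary choices): the Gram matrix it produces is symmetric positive semi-definite, as a valid kernel requires.

```python
import numpy as np

def eq_kernel(X, Y, sigma=1.0):
    """Exponentiated quadratic kernel k(x, x') = exp(-||x - x'||^2 / (2 sigma^2)),
    evaluated between all rows of X (n x d) and Y (m x d)."""
    sq_dists = (np.sum(X ** 2, axis=1)[:, None]
                + np.sum(Y ** 2, axis=1)[None, :]
                - 2.0 * X @ Y.T)
    return np.exp(-np.maximum(sq_dists, 0.0) / (2.0 * sigma ** 2))

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 3))
K = eq_kernel(X, X)
print(np.allclose(K, K.T))                      # symmetric
print(np.min(np.linalg.eigvalsh(K)) > -1e-10)   # positive semi-definite
print(np.allclose(np.diag(K), 1.0))             # k(x, x) = 1
```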
7 Functions of infinitely many features
Functions are linear combinations of features: $f(x) = \sum_\ell f_\ell\, \varphi_\ell(x)$.
8-10 How to represent functions?
Function with exponentiated quadratic kernel:
$$f(x) := \sum_{i=1}^m \alpha_i\, k(x_i, x) = \sum_{i=1}^m \alpha_i \langle \varphi(x_i), \varphi(x) \rangle_{\mathcal{H}} = \Big\langle \sum_{i=1}^m \alpha_i\, \varphi(x_i),\; \varphi(x) \Big\rangle_{\mathcal{H}} = \sum_{\ell=1}^{\infty} f_\ell\, \varphi_\ell(x),$$
where $f_\ell = \sum_{i=1}^m \alpha_i\, \varphi_\ell(x_i)$.
(figure: $f(x)$ plotted against $x$)
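The expansion $f(x) = \sum_i \alpha_i k(x_i, x)$ can be evaluated directly, even though the feature coefficients $f_\ell$ live in an infinite-dimensional space. A small sketch, with arbitrary illustrative centres and weights:

```python
import numpy as np

def eq_kernel(x, y, sigma=1.0):
    """1-D exponentiated quadratic kernel."""
    return np.exp(-(x - y) ** 2 / (2 * sigma ** 2))

centres = np.array([-2.0, 0.0, 1.5])   # the x_i
alpha = np.array([0.5, -1.0, 2.0])     # the alpha_i

def f(x):
    """f(x) = sum_i alpha_i k(x_i, x): a smooth RKHS function."""
    return np.sum(alpha * eq_kernel(centres, x))

print(f(0.0))
```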
11-12 The kernel exponential family
Kernel exponential families [Canu and Smola (2006), Fukumizu (2009)] and their GP counterparts [Adams, Murray, MacKay (2009), Rasmussen (2003)]:
$$\mathcal{P} = \big\{ p_f(x) = e^{\langle f, \varphi(x) \rangle_{\mathcal{H}} - A(f)}\, q_0(x) \;:\; x \in \Omega,\; f \in \mathcal{F} \big\},$$
where
$$\mathcal{F} = \Big\{ f \in \mathcal{H} \;:\; A(f) = \log \int_\Omega e^{f(x)}\, q_0(x)\, dx < \infty \Big\}.$$
Finite dimensional RKHS: one-to-one correspondence between the finite dimensional exponential family and the RKHS. Example: the Gaussian density arises from the kernel $k(x, y) = xy + x^2 y^2$, with $T(x) = [x,\; x^2] = \varphi(x)$.
13 Fitting an infinite dimensional exponential family
Given random samples $X_1, \ldots, X_n$ drawn i.i.d. from an unknown density $p_0 := p_{f_0} \in \mathcal{P}$, estimate $p_0$.
14 How not to do it: maximum likelihood
Maximum likelihood:
$$f_{ML} = \arg\max_{f \in \mathcal{F}} \sum_{i=1}^n \log p_f(X_i) = \arg\max_{f \in \mathcal{F}} \Big( \sum_{i=1}^n f(X_i) - n \log \int_\Omega e^{f(x)}\, q_0(x)\, dx \Big).$$
Solving the above yields that $f_{ML}$ satisfies
$$\frac{1}{n} \sum_{i=1}^n \varphi(X_i) = \int_\Omega \varphi(x)\, p_{f_{ML}}(x)\, dx,$$
where $p_{f_{ML}} = \mathrm{d}P_{ML}/\mathrm{d}x$. Ill posed for infinite dimensional $\varphi(x)$!
15 Score matching
Loss is the Fisher score:
$$D_F(p_0, p_f) := \frac{1}{2} \int_\Omega p_0(x)\, \|\nabla_x \log p_0(x) - \nabla_x \log p_f(x)\|^2\, dx.$$
16 Score matching (general version)
Assuming $p_f$ is differentiable (w.r.t. $x$) and $\int_\Omega p_0(x)\, \|\nabla_x \log p_f(x)\|^2\, dx < \infty$,
$$D_F(p_0, p_f) := \frac{1}{2} \int_\Omega p_0(x)\, \|\nabla_x \log p_0(x) - \nabla_x \log p_f(x)\|^2\, dx \overset{(a)}{=} \int_\Omega p_0(x) \sum_{i=1}^d \left[ \frac{1}{2} \left( \frac{\partial \log p_f(x)}{\partial x_i} \right)^{\!2} + \frac{\partial^2 \log p_f(x)}{\partial x_i^2} \right] dx + C,$$
where partial integration is used in $(a)$ under the condition that
$$p_0(x)\, \frac{\partial \log p_f(x)}{\partial x_i} \to 0 \;\text{ as }\; |x_i| \to \infty, \qquad i = 1, \ldots, d.$$
17-18 Empirical score matching
$p_n$ represents $n$ i.i.d. samples from $P_0$:
$$D_F(p_n, p_f) := \frac{1}{n} \sum_{a=1}^n \sum_{i=1}^d \left[ \frac{1}{2} \left( \frac{\partial \log p_f(X_a)}{\partial x_i} \right)^{\!2} + \frac{\partial^2 \log p_f(X_a)}{\partial x_i^2} \right] + C.$$
Since $D_F(p_n, p_f)$ is independent of $A(f)$,
$$f_n = \arg\min_{f \in \mathcal{F}} D_F(p_n, p_f)$$
should be easily computable, unlike the MLE. Add an extra term $\lambda \|f\|_{\mathcal{H}}^2$ to regularize.
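For intuition, here is a finite-dimensional sketch (the Gaussian family with $T(x) = [x, x^2]$, not the kernel estimator itself). Since $\partial_x \log p_\theta(x) = \theta^\top T'(x)$ and $\partial_x^2 \log p_\theta(x) = \theta^\top T''(x)$, the empirical objective is the quadratic $\frac{1}{2}\theta^\top A \theta + b^\top \theta$ with $A = \frac{1}{n}\sum_a T'(X_a) T'(X_a)^\top$ and $b = \frac{1}{n}\sum_a T''(X_a)$, minimised by $\theta = -A^{-1} b$, with no normaliser ever computed:

```python
import numpy as np

rng = np.random.default_rng(1)
mu_true, sigma_true = 2.0, 0.5
X = rng.normal(mu_true, sigma_true, size=20000)

# Sufficient statistic T(x) = [x, x^2]: T'(x) = [1, 2x], T''(x) = [0, 2].
Tp = np.stack([np.ones_like(X), 2 * X])                  # (2, n)
Tpp = np.stack([np.zeros_like(X), 2 * np.ones_like(X)])  # (2, n)

A = Tp @ Tp.T / X.size            # (1/n) sum_a T'(X_a) T'(X_a)^T
b = Tpp.mean(axis=1)              # (1/n) sum_a T''(X_a)
theta = -np.linalg.solve(A, b)    # minimiser of the empirical score-matching loss

# Recover the usual parameters: theta = [mu/sigma^2, -1/(2 sigma^2)]
sigma_hat = np.sqrt(-1.0 / (2 * theta[1]))
mu_hat = theta[0] * sigma_hat ** 2
print(mu_hat, sigma_hat)  # close to (2.0, 0.5)
```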
19-21 A kernel solution
For the infinite exponential family, $p_f(x) = e^{\langle f, \varphi(x) \rangle_{\mathcal{H}} - A(f)}\, q_0(x)$, so
$$\log p_f(x) = \langle f, \varphi(x) \rangle_{\mathcal{H}} - A(f) + \log q_0(x).$$
Kernel trick for derivatives:
$$\partial_i f(X) = \langle f, \partial_i \varphi(X) \rangle_{\mathcal{H}},$$
and the dot product between feature derivatives is
$$\langle \partial_i \varphi(X), \partial_j \varphi(X') \rangle_{\mathcal{H}} = \partial_i \partial_{d+j}\, k(X, X').$$
By the representer theorem, the solution takes the form
$$f_n = \hat{\xi} + \sum_{a=1}^n \sum_{i=1}^d \beta_{(a,i)}\, \partial_i k(X_a, \cdot),$$
where $\hat{\xi}$ is a fixed offset term.
22 An RKHS solution
The RKHS solution:
$$f_n = \hat{\xi} + \sum_{a=1}^n \sum_{i=1}^d \beta_{(a,i)}\, \partial_i k(X_a, \cdot).$$
Need to solve a linear system:
$$\beta_n = -\frac{1}{n\lambda}\, \big( \underbrace{G_{XX}}_{nd \times nd} + n\lambda I \big)^{-1} h_X.$$
Very costly in high dimensions!
(figure: $f(x)$ plotted against $x$)
23 The Nyström approximation
24 Nyström approach for efficient solution
Find the best estimator $f_{n,m}$ in
$$\mathcal{H}_Y := \mathrm{span}\, \{ \partial_i k(y_a, \cdot) \}_{a \in [m],\, i \in [d]},$$
where the $y_a \in \{X_i\}_{i=1}^n$ are chosen at random.
Nyström solution:
$$\beta_{n,\lambda,m} = -\frac{1}{n\lambda} \Big( \underbrace{\tfrac{1}{n}\, B_{XY}^\top B_{XY}}_{md \times md} + \lambda\, \underbrace{G_{YY}}_{md \times md} \Big)^{\dagger} h_Y, \qquad B_{XY} \in \mathbb{R}^{nd \times md}.$$
Solve in time $O(nm^2 d^3)$, evaluate in time $O(md)$. Still cubic in $d$, but similar results if we take a random dimension per datapoint.
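The subsampling idea can be sketched on a generic kernel ridge system (an illustration of the Nyström principle, not the exact score-matching system above): restrict the solution to expansions over $m \ll n$ randomly chosen points, so an $m \times m$ system replaces the $n \times n$ one. All data and hyperparameters below are arbitrary choices.

```python
import numpy as np

def eq_kernel(X, Y, sigma=1.0):
    sq = np.sum(X ** 2, 1)[:, None] + np.sum(Y ** 2, 1)[None, :] - 2 * X @ Y.T
    return np.exp(-np.maximum(sq, 0.0) / (2 * sigma ** 2))

rng = np.random.default_rng(0)
n, m, lam = 2000, 100, 1e-3
X = rng.uniform(-1, 1, size=(n, 1))
y = np.sin(3 * X[:, 0]) + 0.1 * rng.normal(size=n)

# Full kernel ridge solve: an n x n system, O(n^3).
K = eq_kernel(X, X, sigma=0.3)
alpha_full = np.linalg.solve(K + n * lam * np.eye(n), y)
f_full = K @ alpha_full

# Nystrom / subset-of-regressors: restrict to m inducing points, O(n m^2).
idx = rng.choice(n, m, replace=False)
Knm = eq_kernel(X, X[idx], sigma=0.3)
Kmm = eq_kernel(X[idx], X[idx], sigma=0.3)
beta = np.linalg.solve(Knm.T @ Knm + n * lam * Kmm + 1e-10 * np.eye(m),
                       Knm.T @ y)
f_nys = Knm @ beta

print(np.sqrt(np.mean((f_full - f_nys) ** 2)))  # small: the two fits agree
```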
25-26 Consistency: original solution
Define $C$ as the covariance between feature derivatives. Then from [Sriperumbudur et al., JMLR (2017)]:
Rates of convergence: suppose $f_0 \in \mathcal{R}(C^\beta)$ for some $\beta > 0$, and set
$$\lambda = n^{-\max\left( \frac{1}{3},\; \frac{1}{2(\beta+1)} \right)}.$$
Then, as $n \to \infty$,
$$D_F(p_0, p_{f_n}) = O_{p_0}\!\left( n^{-\min\left( \frac{2}{3},\; \frac{2\beta+1}{2(\beta+1)} \right)} \right).$$
Convergence also holds in other metrics: KL, Hellinger, and $L_r$ for $1 < r < \infty$.
27-28 Consistency: Nyström solution
Define $C$ as the covariance between feature derivatives. Suppose $f_0 \in \mathcal{R}(C^\beta)$ for some $\beta > 0$, take the number of subsampled points
$$m = \Omega\!\big( n^{\theta} \log n \big), \qquad \theta = \frac{1}{\min(2\beta, 1) + 2} \in \left[ \tfrac{1}{3}, \tfrac{1}{2} \right],$$
and set $\lambda = n^{-\max\left( \frac{1}{3},\, \frac{1}{2(\beta+1)} \right)}$. Then, as $n \to \infty$,
$$D_F(p_0, p_{f_{n,m}}) = O_{p_0}\!\left( n^{-\min\left( \frac{2}{3},\; \frac{2\beta+1}{2(\beta+1)} \right)} \right).$$
Convergence also holds in KL, Hellinger, and $L_r$ for $1 < r < \infty$, at the same rate, but it saturates sooner: in full KL, the original estimator saturates at $O_{p_0}(n^{-1/2})$, and the Nyström estimator sooner.
29 Experimental results: ring
(figures: sample from the ring density, and the fitted score)
30 Experimental results: comparison with autoencoder score
Comparison with regularized auto-encoders [Alain and Bengio (JMLR, 2014)], n = 500 training points.
(figure: score error against dimension d, for the full estimator, Nyström with m = 42 and m = 167, and DAE with m = 100 and m = …)
31 Experimental results: grid of Gaussians
(figures: sample from the grid-of-Gaussians density, and the fitted score)
32 Experimental results: comparison with autoencoder score
Comparison with regularized auto-encoders [Alain and Bengio (JMLR, 2014)], n = 500 training points.
(figure: score error against dimension d, for the full estimator, Nyström with m = 42 and m = 167, and DAE with m = 100 and m = …)
33 The kernel conditional exponential family
34 The kernel conditional exponential family
Can we take advantage of the graphical structure of $(X_1, \ldots, X_d)$? Start from a general factorization of $P$:
$$P(X_1, \ldots, X_d) = \prod_i P\big( X_i \mid \underbrace{X_{\pi(i)}}_{\text{parents of } X_i} \big),$$
and estimate each factor independently.
35 Kernel conditional exponential family
General definition, kernel conditional exponential family [Smola and Canu, 2006]:
$$p_f(y \mid x) = e^{\langle f, \psi(x, y) \rangle_{\mathcal{H}} - A(f, x)}\, q_0(y), \qquad A(f, x) = \log \int q_0(y)\, e^{\langle f, \psi(x, y) \rangle_{\mathcal{H}}}\, dy,$$
with joint feature map $\psi(x, y)$.
36-39 Kernel conditional exponential family
Our kernel conditional exponential family:
$$p_f(y \mid x) = e^{\langle f_x, \psi(y) \rangle_{\mathcal{G}} - A(f, x)}\, q_0(y), \qquad A(f, x) = \log \int q_0(y)\, e^{\langle f_x, \psi(y) \rangle_{\mathcal{G}}}\, dy,$$
linear in the sufficient statistic $\psi(y) \in \mathcal{G}$.
What does this RKHS look like? [Micchelli and Pontil (2005)]
$$\langle f_x, \psi(y) \rangle_{\mathcal{G}} = \langle \Gamma_x^* f, \psi(y) \rangle_{\mathcal{G}} = \langle f, \Gamma_x \psi(y) \rangle_{\mathcal{H}},$$
where $\Gamma_x : \mathcal{G} \to \mathcal{H}$ is a linear operator. The feature map is $\psi(x, y) := \Gamma_x \psi(y)$.
40-41 What is our loss function?
The obvious approach: minimise $D_F\big[ p_0(x)\, p_0(y \mid x) \,\|\, p_f(x)\, p_f(y \mid x) \big]$.
Problem: the expression still contains $\int p_0(y \mid x)\, dy$.
Our loss function:
$$\widetilde{D}_F(p_0, p_f) := \int D_F\big( p_0(y \mid x) \,\|\, p_f(y \mid x) \big)\, \mu(x)\, dx,$$
for some $\mu(x)$ whose support includes the support of $p(x)$.
42-43 Finite sample estimate of the conditional density
Use the simplest operator-valued RKHS, $\Gamma_x = k(x, \cdot)\, I_{\mathcal{G}}$:
$$\Gamma_x : \mathcal{G} \to \mathcal{H}, \qquad \psi(y) \mapsto \psi(y)\, k(x, \cdot).$$
Solution:
$$f_n(y \mid x) = \hat{\xi} + \sum_{b=1}^n \sum_{i=1}^d \beta_{(b,i)}\, k(X_b, x)\, \partial_i K(Y_b, y),$$
where
$$\beta = -(G + n\lambda I)^{-1} h, \qquad G_{(a,i),(b,j)} = k(X_a, X_b)\, \partial_i \partial_{j+d} K(Y_a, Y_b),$$
and $\langle \psi(y), \psi(y') \rangle_{\mathcal{G}} = K(y, y')$.
44-47 Expected conditional score: a failure case
Take $P(Y \mid X = -1)$ and $P(Y \mid X = 1)$ with disjoint supports, and let
$$P(Y) = \tfrac{1}{2}\big( P(Y \mid X = -1) + P(Y \mid X = 1) \big).$$
Then
$$\widetilde{D}_F\big( \underbrace{p(y \mid x)}_{\text{target}},\; \underbrace{p(y)}_{\text{model}} \big) = 0.$$
48 Expected conditional score: a failure case
Why does it fail? Recall
$$\widetilde{D}_F\big( p_0(y \mid x),\, p_f(y \mid x) \big) := \int \mu(x)\, D_F\big( p_0(y \mid x),\, p_f(y \mid x) \big)\, dx.$$
Note that
$$D_F\big( \underbrace{p(y \mid x = 1)}_{\text{target}},\; \underbrace{p(y)}_{\text{model}} \big) = \int p(y \mid 1)\, \|\nabla_y \log p(y \mid 1) - \nabla_y \log p(y)\|^2\, dy.$$
The model $p(y)$ puts mass where the target conditional $p(y \mid 1)$ has no support, and the weighted divergence cannot see it. Care is needed when this failure mode is approached!
49 Unconditional vs conditional model in practice
Red Wine: physiochemical measurements on wine samples.
Parkinsons: biomedical voice measurements from patients with early stage Parkinson's disease.

            Parkinsons   Red Wine
Dimension       …            …
Samples         …            …
50-51 Unconditional vs conditional model in practice
Red Wine: physiochemical measurements on wine samples.
Parkinsons: biomedical voice measurements from patients with early stage Parkinson's disease.
Comparison with:
- LSCDE model: with consistency guarantees [Sugiyama et al. (2010)]
- RNADE model: mixture models with deep features of parents, no guarantees [Uria et al. (2016)]
Negative log likelihoods (smaller is better, average over 5 test/train splits):

          Parkinsons       Red wine
KCEF       2.86 ± 0.77     11.8  ± 0.93
LSCDE     15.89 ± 1.48     14.43 ± 1.5
NADE       3.63 ± 0.0       9.98 ± 0.0
52 Results: unconditional model
(figures: Red Wine and Parkinsons scatter plots, data vs KEF samples)
53 Results: conditional model
(figures: Red Wine and Parkinsons scatter plots, data vs KCEF samples)
54 Co-authors
From Gatsby: Michael Arbel, Heiko Strathmann, Dougal Sutherland.
External collaborators: Kenji Fukumizu, Bharath Sriperumbudur.
Questions?
55-59 Score matching: 1-D proof
$$D_F(p_0, p_f) = \frac{1}{2} \int_a^b p_0(x) \left( \frac{d \log p_0(x)}{dx} - \frac{d \log p_f(x)}{dx} \right)^{\!2} dx$$
$$= \frac{1}{2} \int_a^b p_0(x) \left( \frac{d \log p_0(x)}{dx} \right)^{\!2} dx + \frac{1}{2} \int_a^b p_0(x) \left( \frac{d \log p_f(x)}{dx} \right)^{\!2} dx - \int_a^b p_0(x)\, \frac{d \log p_f(x)}{dx}\, \frac{d \log p_0(x)}{dx}\, dx.$$
Final term:
$$\int_a^b p_0(x)\, \frac{d \log p_f(x)}{dx}\, \frac{d \log p_0(x)}{dx}\, dx = \int_a^b p_0(x)\, \frac{d \log p_f(x)}{dx}\, \frac{1}{p_0(x)}\, \frac{d p_0(x)}{dx}\, dx = \left[ p_0(x)\, \frac{d \log p_f(x)}{dx} \right]_a^b - \int_a^b p_0(x)\, \frac{d^2 \log p_f(x)}{dx^2}\, dx.$$
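The integration-by-parts step can be checked numerically: when $p_0$ vanishes at the interval ends, $\int p_0\, (\log p_f)'\, (\log p_0)'\, dx = -\int p_0\, (\log p_f)''\, dx$. A quadrature sketch with $p_0 = N(0, 1)$ and an arbitrary smooth model, $\log p_f(x) = -x^4/4$ up to normalisation:

```python
import numpy as np

def integrate(fvals, x):
    """Trapezoidal quadrature on a grid."""
    return float(np.sum(0.5 * (fvals[:-1] + fvals[1:]) * np.diff(x)))

x = np.linspace(-8.0, 8.0, 200001)              # wide enough that p0 ~ 0 at the ends
p0 = np.exp(-x ** 2 / 2) / np.sqrt(2 * np.pi)   # p0 = N(0, 1)
dlog_p0 = -x                                    # (log p0)'

dlog_pf = -x ** 3        # (log p_f)' for log p_f(x) = -x^4/4 + const
d2log_pf = -3 * x ** 2   # (log p_f)''

lhs = integrate(p0 * dlog_pf * dlog_p0, x)
rhs = -integrate(p0 * d2log_pf, x)
print(lhs, rhs)  # both equal E[X^4] = 3 for X ~ N(0, 1)
```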
More informationADVANCED MACHINE LEARNING ADVANCED MACHINE LEARNING. Non-linear regression techniques Part - II
1 Non-linear regression techniques Part - II Regression Algorithms in this Course Support Vector Machine Relevance Vector Machine Support vector regression Boosting random projections Relevance vector
More informationCheng Soon Ong & Christian Walder. Canberra February June 2018
Cheng Soon Ong & Christian Walder Research Group and College of Engineering and Computer Science Canberra February June 2018 (Many figures from C. M. Bishop, "Pattern Recognition and ") 1of 89 Part II
More informationStatistical Convergence of Kernel CCA
Statistical Convergence of Kernel CCA Kenji Fukumizu Institute of Statistical Mathematics Tokyo 106-8569 Japan fukumizu@ism.ac.jp Francis R. Bach Centre de Morphologie Mathematique Ecole des Mines de Paris,
More informationThe Gaussian distribution
The Gaussian distribution Probability density function: A continuous probability density function, px), satisfies the following properties:. The probability that x is between two points a and b b P a
More informationComputer Vision Group Prof. Daniel Cremers. 4. Gaussian Processes - Regression
Group Prof. Daniel Cremers 4. Gaussian Processes - Regression Definition (Rep.) Definition: A Gaussian process is a collection of random variables, any finite number of which have a joint Gaussian distribution.
More informationBayesian Learning. HT2015: SC4 Statistical Data Mining and Machine Learning. Maximum Likelihood Principle. The Bayesian Learning Framework
HT5: SC4 Statistical Data Mining and Machine Learning Dino Sejdinovic Department of Statistics Oxford http://www.stats.ox.ac.uk/~sejdinov/sdmml.html Maximum Likelihood Principle A generative model for
More informationNaïve Bayes classification
Naïve Bayes classification 1 Probability theory Random variable: a variable whose possible values are numerical outcomes of a random phenomenon. Examples: A person s height, the outcome of a coin toss
More informationCheng Soon Ong & Christian Walder. Canberra February June 2018
Cheng Soon Ong & Christian Walder Research Group and College of Engineering and Computer Science Canberra February June 2018 Outlines Overview Introduction Linear Algebra Probability Linear Regression
More informationLinear Methods for Prediction
Chapter 5 Linear Methods for Prediction 5.1 Introduction We now revisit the classification problem and focus on linear methods. Since our prediction Ĝ(x) will always take values in the discrete set G we
More informationLecture 5: Linear models for classification. Logistic regression. Gradient Descent. Second-order methods.
Lecture 5: Linear models for classification. Logistic regression. Gradient Descent. Second-order methods. Linear models for classification Logistic regression Gradient descent and second-order methods
More informationApproximate Kernel PCA with Random Features
Approximate Kernel PCA with Random Features (Computational vs. Statistical Tradeoff) Bharath K. Sriperumbudur Department of Statistics, Pennsylvania State University Journées de Statistique Paris May 28,
More informationGaussian Mixture Models
Gaussian Mixture Models Pradeep Ravikumar Co-instructor: Manuela Veloso Machine Learning 10-701 Some slides courtesy of Eric Xing, Carlos Guestrin (One) bad case for K- means Clusters may overlap Some
More informationCIS 520: Machine Learning Oct 09, Kernel Methods
CIS 520: Machine Learning Oct 09, 207 Kernel Methods Lecturer: Shivani Agarwal Disclaimer: These notes are designed to be a supplement to the lecture They may or may not cover all the material discussed
More informationParameter estimation Conditional risk
Parameter estimation Conditional risk Formalizing the problem Specify random variables we care about e.g., Commute Time e.g., Heights of buildings in a city We might then pick a particular distribution
More informationMachine Learning. Lecture 4: Regularization and Bayesian Statistics. Feng Li. https://funglee.github.io
Machine Learning Lecture 4: Regularization and Bayesian Statistics Feng Li fli@sdu.edu.cn https://funglee.github.io School of Computer Science and Technology Shandong University Fall 207 Overfitting Problem
More informationRobust Support Vector Machines for Probability Distributions
Robust Support Vector Machines for Probability Distributions Andreas Christmann joint work with Ingo Steinwart (Los Alamos National Lab) ICORS 2008, Antalya, Turkey, September 8-12, 2008 Andreas Christmann,
More informationGaussian Process Regression Networks
Gaussian Process Regression Networks Andrew Gordon Wilson agw38@camacuk mlgengcamacuk/andrew University of Cambridge Joint work with David A Knowles and Zoubin Ghahramani June 27, 2012 ICML, Edinburgh
More informationMachine Learning (CS 567) Lecture 5
Machine Learning (CS 567) Lecture 5 Time: T-Th 5:00pm - 6:20pm Location: GFS 118 Instructor: Sofus A. Macskassy (macskass@usc.edu) Office: SAL 216 Office hours: by appointment Teaching assistant: Cheol
More informationUnsupervised Nonparametric Anomaly Detection: A Kernel Method
Fifty-second Annual Allerton Conference Allerton House, UIUC, Illinois, USA October - 3, 24 Unsupervised Nonparametric Anomaly Detection: A Kernel Method Shaofeng Zou Yingbin Liang H. Vincent Poor 2 Xinghua
More informationIntroduction to Gaussian Processes
Introduction to Gaussian Processes Iain Murray School of Informatics, University of Edinburgh The problem Learn scalar function of vector values f(x).5.5 f(x) y i.5.2.4.6.8 x f 5 5.5 x x 2.5 We have (possibly
More informationThe Poisson transform for unnormalised statistical models. Nicolas Chopin (ENSAE) joint work with Simon Barthelmé (CNRS, Gipsa-LAB)
The Poisson transform for unnormalised statistical models Nicolas Chopin (ENSAE) joint work with Simon Barthelmé (CNRS, Gipsa-LAB) Part I Unnormalised statistical models Unnormalised statistical models
More informationReminders. Thought questions should be submitted on eclass. Please list the section related to the thought question
Linear regression Reminders Thought questions should be submitted on eclass Please list the section related to the thought question If it is a more general, open-ended question not exactly related to a
More informationKernel Methods. Lecture 4: Maximum Mean Discrepancy Thanks to Karsten Borgwardt, Malte Rasch, Bernhard Schölkopf, Jiayuan Huang, Arthur Gretton
Kernel Methods Lecture 4: Maximum Mean Discrepancy Thanks to Karsten Borgwardt, Malte Rasch, Bernhard Schölkopf, Jiayuan Huang, Arthur Gretton Alexander J. Smola Statistical Machine Learning Program Canberra,
More informationClassification. The goal: map from input X to a label Y. Y has a discrete set of possible values. We focused on binary Y (values 0 or 1).
Regression and PCA Classification The goal: map from input X to a label Y. Y has a discrete set of possible values We focused on binary Y (values 0 or 1). But we also discussed larger number of classes
More informationdt 2 The Order of a differential equation is the order of the highest derivative that occurs in the equation. Example The differential equation
Lecture 18 : Direction Fields and Euler s Method A Differential Equation is an equation relating an unknown function and one or more of its derivatives. Examples Population growth : dp dp = kp, or = kp
More informationLess is More: Computational Regularization by Subsampling
Less is More: Computational Regularization by Subsampling Lorenzo Rosasco University of Genova - Istituto Italiano di Tecnologia Massachusetts Institute of Technology lcsl.mit.edu joint work with Alessandro
More informationNonparmeteric Bayes & Gaussian Processes. Baback Moghaddam Machine Learning Group
Nonparmeteric Bayes & Gaussian Processes Baback Moghaddam baback@jpl.nasa.gov Machine Learning Group Outline Bayesian Inference Hierarchical Models Model Selection Parametric vs. Nonparametric Gaussian
More information