Introduction to Machine Learning

1 Introduction to Machine Learning 12. Gaussian Processes Alex Smola Carnegie Mellon University

2 The Normal Distribution

3 The Normal Distribution

4 Gaussians in Space

5 Gaussians in Space samples in R 2

6 The Normal Distribution
Density for scalar variables: p(x) = \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{(x-\mu)^2}{2\sigma^2}}
Density in d dimensions: p(x) = (2\pi)^{-d/2} |\Sigma|^{-1/2} \exp\big(-\tfrac{1}{2}(x-\mu)^\top \Sigma^{-1}(x-\mu)\big)
Principal components: eigenvalue decomposition \Sigma = U^\top \Lambda U
Product representation: p(x) = (2\pi)^{-d/2} |\Lambda|^{-1/2} \exp\big(-\tfrac{1}{2}(U(x-\mu))^\top \Lambda^{-1} U(x-\mu)\big)

7 The Normal Distribution (principal components)
\Sigma = U^\top \Lambda U
p(x) = (2\pi)^{-d/2} \prod_{i=1}^d \lambda_{ii}^{-1/2} \exp\big(-\tfrac{1}{2}(U(x-\mu))^\top \Lambda^{-1} U(x-\mu)\big)

8 Why do we care?
Central limit theorem: in the limit, all averages behave like Gaussians.
Easy to estimate parameters (MLE): \mu = \frac{1}{m}\sum_{i=1}^m x_i and \Sigma = \frac{1}{m}\sum_{i=1}^m x_i x_i^\top - \mu\mu^\top
Distribution with largest uncertainty (entropy) for a given mean and covariance.
Works well even if the assumptions are wrong.

9 Why do we care?
Central limit theorem: in the limit, all averages behave like Gaussians.
Easy to estimate parameters (MLE), with X the d x m data matrix and m the sample size:
\mu = \frac{1}{m}\sum_{i=1}^m x_i and \Sigma = \frac{1}{m}\sum_{i=1}^m x_i x_i^\top - \mu\mu^\top
mu = (1/m) * sum(X, 2)
sigma = (1/m) * (X * X') - mu * mu'
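
The same estimates as a minimal NumPy sketch; the d x m column-vector layout of X and the toy mean and covariance are my assumptions, not fixed by the slide.

import numpy as np

rng = np.random.default_rng(0)
d, m = 2, 10000
X = rng.multivariate_normal([1.0, -1.0], [[2.0, 0.5], [0.5, 1.0]], size=m).T  # d x m data matrix

mu = X.mean(axis=1)                        # (1/m) * sum_i x_i
Sigma = (X @ X.T) / m - np.outer(mu, mu)   # (1/m) * sum_i x_i x_i^T - mu mu^T
print(mu, Sigma)                           # close to the generating parameters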

10 Sampling from a Gaussian
Case 1: we can draw from a standard normal (randn) and want x \sim N(\mu, \Sigma).
Recipe: x = \mu + Lz where z \sim N(0, 1) and \Sigma = LL^\top.
Proof: E[(x-\mu)(x-\mu)^\top] = E[Lzz^\top L^\top] = L\,E[zz^\top]\,L^\top = LL^\top = \Sigma.
Case 2: Box-Müller transform for uniform U[0,1] draws:
p(x) = \frac{1}{2\pi} e^{-\frac{1}{2}\|x\|^2} \Rightarrow p(\phi, r) = \frac{1}{2\pi}\, r\, e^{-\frac{1}{2}r^2}, \quad F(\phi, r) = \frac{\phi}{2\pi}\big[1 - e^{-\frac{1}{2}r^2}\big]
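
A minimal NumPy sketch of Case 1; the mean vector mu and covariance Sigma below are made-up example values.

import numpy as np

rng = np.random.default_rng(0)
mu = np.array([1.0, -2.0])
Sigma = np.array([[2.0, 0.8], [0.8, 1.0]])   # must be positive definite
L = np.linalg.cholesky(Sigma)                # Sigma = L L^T
z = rng.standard_normal((2, 10000))          # z ~ N(0, 1), one column per sample
x = mu[:, None] + L @ z                      # x ~ N(mu, Sigma)
print(np.cov(x))                             # empirical covariance, close to Sigma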

11 Sampling from a Gaussian (figure: angle \phi and radius r in the plane)
p(x) = \frac{1}{2\pi} e^{-\frac{1}{2}\|x\|^2} \Rightarrow p(\phi, r) = \frac{1}{2\pi}\, r\, e^{-\frac{1}{2}r^2}, \quad F(\phi, r) = \frac{\phi}{2\pi}\big[1 - e^{-\frac{1}{2}r^2}\big]

12 Sampling from a Gaussian
Cumulative distribution function: F(\phi, r) = \frac{\phi}{2\pi}\big[1 - e^{-\frac{1}{2}r^2}\big]
Draw the radial and angle components separately:
tmp1 = rand()
tmp2 = rand()
r = sqrt(-2*log(tmp1))
x1 = r*sin(2*pi*tmp2)
x2 = r*cos(2*pi*tmp2)

13 Sampling from a Gaussian
Cumulative distribution function: F(\phi, r) = \frac{\phi}{2\pi}\big[1 - e^{-\frac{1}{2}r^2}\big]
Draw the radial and angle components separately:
tmp1 = rand()
tmp2 = rand()
r = sqrt(-2*log(tmp1))
x1 = r*sin(2*pi*tmp2)
x2 = r*cos(2*pi*tmp2)
Why can we use tmp1 instead of 1-tmp1?
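
A NumPy sketch of the Box-Müller recipe above (the variable names and sample count are mine); the printed moments should be close to 0 and 1.

import numpy as np

rng = np.random.default_rng(0)
n = 100000
tmp1 = rng.random(n)                      # uniform on [0, 1)
tmp2 = rng.random(n)
r = np.sqrt(-2.0 * np.log(tmp1))          # radial component (ignoring the measure-zero case tmp1 == 0)
x1 = r * np.sin(2.0 * np.pi * tmp2)       # angle component, phi = 2*pi*tmp2
x2 = r * np.cos(2.0 * np.pi * tmp2)
print(x1.mean(), x1.std(), x2.mean(), x2.std())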

14 Example: correlating weight and height

15 Example: correlating weight and height assume Gaussian correlation

16 p(weight | height) = \frac{p(height, weight)}{p(height)} \propto p(height, weight)

17 p(x_2 \mid x_1) \propto \exp\Big(-\tfrac{1}{2}\begin{bmatrix} x_1 - \mu_1 \\ x_2 - \mu_2 \end{bmatrix}^\top \Sigma^{-1} \begin{bmatrix} x_1 - \mu_1 \\ x_2 - \mu_2 \end{bmatrix}\Big)
Keep only the terms of the exponent that are linear and quadratic in x_2.

18 The gory math
Correlated Observations: assume that the random variables t \in R^n, t' \in R^{n'} are jointly normal with mean (\mu, \mu') and covariance matrix K:
p(t, t') \propto \exp\Big(-\tfrac{1}{2}\begin{bmatrix} t - \mu \\ t' - \mu' \end{bmatrix}^\top \begin{bmatrix} K_{tt} & K_{tt'} \\ K_{tt'}^\top & K_{t't'} \end{bmatrix}^{-1} \begin{bmatrix} t - \mu \\ t' - \mu' \end{bmatrix}\Big)
Inference: given t, estimate t' via p(t' \mid t). Translation into machine learning language: we learn t' from t.
Practical Solution: since t' \mid t \sim N(\tilde{\mu}, \tilde{K}), we only need to collect all terms in p(t, t') depending on t'; by matrix inversion,
\tilde{K} = K_{t't'} - K_{tt'}^\top K_{tt}^{-1} K_{tt'} \quad\text{and}\quad \tilde{\mu} = \mu' + K_{tt'}^\top K_{tt}^{-1} (t - \mu)
(the factor K_{tt'}^\top K_{tt}^{-1} is independent of t')
Handbook of Matrices, Lütkepohl 1997 (big timesaver).
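
A small NumPy sketch of these conditioning formulas; the block names and the toy 2+1 dimensional numbers are mine, chosen only to exercise the equations.

import numpy as np

def condition_gaussian(mu_t, mu_t1, K_tt, K_tt1, K_t1t1, t_obs):
    # mean and covariance of t' | t = t_obs for a jointly Gaussian (t, t')
    solve = np.linalg.solve                        # apply K_tt^{-1} via a linear solve
    mean = mu_t1 + K_tt1.T @ solve(K_tt, t_obs - mu_t)
    cov = K_t1t1 - K_tt1.T @ solve(K_tt, K_tt1)
    return mean, cov

mu_t, mu_t1 = np.zeros(2), np.zeros(1)
K_tt = np.array([[1.0, 0.5], [0.5, 1.0]])
K_tt1 = np.array([[0.3], [0.7]])
K_t1t1 = np.array([[1.0]])
print(condition_gaussian(mu_t, mu_t1, K_tt, K_tt1, K_t1t1, np.array([0.2, -0.1])))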

19 Mini Summary
Normal distribution: p(x) = (2\pi)^{-d/2} |\Sigma|^{-1/2} \exp\big(-\tfrac{1}{2}(x-\mu)^\top \Sigma^{-1}(x-\mu)\big)
Sampling from x \sim N(\mu, \Sigma): use x = \mu + Lz where z \sim N(0, 1) and \Sigma = LL^\top
Estimating mean and variance: \mu = \frac{1}{m}\sum_{i=1}^m x_i and \Sigma = \frac{1}{m}\sum_{i=1}^m x_i x_i^\top - \mu\mu^\top
The conditional distribution is Gaussian, too:
p(x_2 \mid x_1) \propto \exp\Big(-\tfrac{1}{2}\begin{bmatrix} x_1 - \mu_1 \\ x_2 - \mu_2 \end{bmatrix}^\top \Sigma^{-1} \begin{bmatrix} x_1 - \mu_1 \\ x_2 - \mu_2 \end{bmatrix}\Big)

20 Gaussian Processes

21 Gaussian Process
Key Idea: instead of a fixed set of random variables t, t' we assume a stochastic process t : X \to R, e.g. X = R^n. Previously we had X = {age, height, weight, ...}.
Definition of a Gaussian Process: a stochastic process t : X \to R where all (t(x_1), ..., t(x_m)) are normally distributed.
Parameters of a GP: mean \mu(x) := E[t(x)] and covariance function k(x, x') := Cov(t(x), t(x')).
Simplifying Assumption: we assume knowledge of k(x, x') and set \mu = 0.

22 Gaussian Process
Sampling from a Gaussian process: pick points x where we want to sample and compute the covariance matrix for these points X. We can only obtain values at those points! In general the entire function f(x) is NOT available.

23 Gaussian Process
Sampling from a Gaussian process: pick points x where we want to sample and compute the covariance matrix for these points X. We can only obtain values at those points! In general the entire function f(x) is NOT available; it only looks smooth (evaluated at many points).

24 Gaussian Process
Sampling from a Gaussian process: pick points x where we want to sample and compute the covariance matrix for these points X. We can only obtain values at those points! In general the entire function f(x) is NOT available.
p(t \mid X) = (2\pi)^{-m/2} |K|^{-1/2} \exp\big(-\tfrac{1}{2}(t-\mu)^\top K^{-1}(t-\mu)\big) \quad\text{where } K_{ij} = k(x_i, x_j) \text{ and } \mu_i = \mu(x_i)
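
A NumPy sketch of drawing one such vector from a zero-mean GP prior; the squared-exponential kernel, lengthscale, and jitter term are my illustrative choices, not prescribed by the slides.

import numpy as np

def rbf_kernel(a, b, lengthscale=1.0):
    # squared-exponential covariance k(x, x') = exp(-|x - x'|^2 / (2 l^2))
    d = a[:, None] - b[None, :]
    return np.exp(-0.5 * (d / lengthscale) ** 2)

rng = np.random.default_rng(0)
x = np.linspace(0.0, 5.0, 200)                  # points where we want to sample
K = rbf_kernel(x, x) + 1e-6 * np.eye(len(x))    # covariance matrix plus jitter for stability
L = np.linalg.cholesky(K)
t = L @ rng.standard_normal(len(x))             # one draw t ~ N(0, K); looks smooth when plotted
print(t[:5])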

25 Kernels... Covariance Function Function of two arguments Leads to matrix with nonnegative eigenvalues Describes correlation between pairs of observations Kernel Function of two arguments Leads to matrix with nonnegative eigenvalues Similarity measure between pairs of observations Lucky Guess We suspect that kernels and covariance functions are the same... yes!

26 Mini Summary
Gaussian process: think distribution over function values (not functions).
Defined by mean and covariance function:
p(t \mid X) = (2\pi)^{-m/2} |K|^{-1/2} \exp\big(-\tfrac{1}{2}(t-\mu)^\top K^{-1}(t-\mu)\big)
Generates vectors of arbitrary dimensionality (via X).
Covariance function via kernels.

27 Gaussian Process Regression

28 Gaussian Processes for Inference
X = {height, weight}
p(weight | height) = \frac{p(height, weight)}{p(height)} \propto p(height, weight)

29 Joint Gaussian Model
Random variables (t, t') are drawn from a GP:
p(t, t') \propto \exp\Big(-\tfrac{1}{2}\begin{bmatrix} t - \mu \\ t' - \mu' \end{bmatrix}^\top \begin{bmatrix} K_{tt} & K_{tt'} \\ K_{tt'}^\top & K_{t't'} \end{bmatrix}^{-1} \begin{bmatrix} t - \mu \\ t' - \mu' \end{bmatrix}\Big)
Observe the subset t, predict t' using
\tilde{K} = K_{t't'} - K_{tt'}^\top K_{tt}^{-1} K_{tt'} \quad\text{and}\quad \tilde{\mu} = \mu' + K_{tt'}^\top K_{tt}^{-1} (t - \mu)
Linear expansion (precompute things).
The predictive uncertainty is data independent, which is good for experimental design.

30 Linear Gaussian Process Regression
Linear kernel: k(x, x') = \langle x, x' \rangle, so the kernel matrix is X^\top X.
Mean and covariance:
\tilde{K} = X'^\top X' - X'^\top X (X^\top X)^{-1} X^\top X' = X'^\top (1 - P_X) X'
\tilde{\mu} = X'^\top X (X^\top X)^{-1} t
Problem: \tilde{\mu} is a linear function of X'. The covariance matrix X^\top X has at most rank n, so after n observations (x \in R^n) the variance vanishes. This is not realistic: flat pancake or cigar distribution.
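
A quick NumPy check of the rank argument (the dimensions are arbitrary): with a linear kernel, the m x m Gram matrix of m points in R^n has rank at most n, so the prior collapses once n directions have been observed.

import numpy as np

rng = np.random.default_rng(0)
n, m = 3, 10                       # n-dimensional inputs, m > n observations
X = rng.standard_normal((n, m))    # columns are data points
K = X.T @ X                        # linear-kernel Gram matrix, m x m
print(np.linalg.matrix_rank(K))    # prints 3: rank at most n despite K being 10 x 10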

31 Degenerate Covariance

32 Degenerate Covariance (figure: latent values t over inputs x)

33 Degenerate Covariance: fatten up the covariance (figure: t and noisy observations y over x)

34 Degenerate Covariance: fatten up the covariance

35 Degenerate Covariance: fatten up the covariance, t \sim N(\mu, K) and y_i \sim N(t_i, \sigma^2)

36 Additive Noise
Indirect Model: instead of observing t(x) we observe y = t(x) + \xi, where \xi is a nuisance term. This yields
p(y \mid X) = \int \prod_{i=1}^m p(y_i \mid t_i)\, p(t \mid X)\, dt
where we can now find a maximum a posteriori solution for t by maximizing the integrand (we will use this later).
Additive Normal Noise: if \xi \sim N(0, \sigma^2) then y is the sum of two Gaussian random variables; means and variances add up, so y \sim N(\mu, K + \sigma^2 \mathbf{1}).

37 Data

38 Predictive mean: k(x, X)^\top \big(K(X, X) + \sigma^2 \mathbf{1}\big)^{-1} y

39 Predictive variance: k(x, x) + \sigma^2 - k(x, X)^\top \big(K(X, X) + \sigma^2 \mathbf{1}\big)^{-1} k(x, X)

40 Putting it all together

41 Putting it all together

42 Ugly details
Covariance matrices with additive noise: K = K_{\text{kernel}} + \sigma^2 \mathbf{1}
Predictive mean and variance (noise free):
\tilde{K} = K_{t't'} - K_{tt'}^\top K_{tt}^{-1} K_{tt'} \quad\text{and}\quad \tilde{\mu} = K_{tt'}^\top K_{tt}^{-1} t
With noise:
\tilde{K} = K_{t't'} - K_{tt'}^\top \big(K_{tt} + \sigma^2 \mathbf{1}\big)^{-1} K_{tt'} \quad\text{and}\quad \tilde{\mu} = \mu' + K_{tt'}^\top \big(K_{tt} + \sigma^2 \mathbf{1}\big)^{-1} (y - \mu)

43 Pseudocode
\tilde{K} = K_{t't'} - K_{tt'}^\top \big(K_{tt} + \sigma^2 \mathbf{1}\big)^{-1} K_{tt'} \quad\text{and}\quad \tilde{\mu} = \mu' + K_{tt'}^\top \big(K_{tt} + \sigma^2 \mathbf{1}\big)^{-1} (y - \mu)
ktrtr = k(xtrain, xtrain) + sigma2 * eye(mtr)
ktetr = k(xtest, xtrain)
ktete = k(xtest, xtest)
alpha = ktrtr \ ytr        % better if you use a Cholesky factorization
yte = ktetr * alpha
sigmate = ktete + sigma2 * eye(mte) - ktetr * (ktrtr \ ktetr')
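
The same computation as a NumPy sketch using a Cholesky factorization, as the pseudocode suggests; the kernel, noise level, and toy data below are illustrative assumptions.

import numpy as np

def rbf(a, b, ell=1.0):
    return np.exp(-0.5 * ((a[:, None] - b[None, :]) / ell) ** 2)

rng = np.random.default_rng(0)
xtrain = np.sort(rng.uniform(0.0, 5.0, 20))
ytrain = np.sin(xtrain) + 0.1 * rng.standard_normal(20)
xtest = np.linspace(0.0, 5.0, 100)
sigma2 = 0.01                                        # assumed noise variance

Ktrtr = rbf(xtrain, xtrain) + sigma2 * np.eye(20)    # K_tt + sigma^2 * 1
Ktetr = rbf(xtest, xtrain)
Ktete = rbf(xtest, xtest)

L = np.linalg.cholesky(Ktrtr)                        # Cholesky factor of the noisy train covariance
alpha = np.linalg.solve(L.T, np.linalg.solve(L, ytrain))
yte = Ktetr @ alpha                                  # predictive mean
V = np.linalg.solve(L, Ktetr.T)
sigmate = Ktete + sigma2 * np.eye(100) - V.T @ V     # predictive covariance
print(yte[:3], np.diag(sigmate)[:3])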

44 The connection between SVM and GP
Gaussian process on parameters: t \sim N(\mu, K) where K_{ij} = k(x_i, x_j).
Linear model in feature space: t(x) = \langle \phi(x), w \rangle + \mu(x) where w \sim N(0, \mathbf{1}).
The covariance between t(x) and t(x') is then given by
E_w\big[\langle \phi(x), w \rangle \langle w, \phi(x') \rangle\big] = \langle \phi(x), \phi(x') \rangle = k(x, x')
A linear model in feature space induces a Gaussian process.
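
A tiny NumPy check of this covariance identity; the explicit feature map phi and the two test inputs are arbitrary choices of mine.

import numpy as np

rng = np.random.default_rng(0)
phi = lambda x: np.array([1.0, x, x ** 2])        # some explicit feature map
x, xp = 0.7, -1.3
W = rng.standard_normal((200000, 3))              # many draws of w ~ N(0, 1)
mc = np.mean((W @ phi(x)) * (W @ phi(xp)))        # Monte Carlo estimate of E_w[<phi(x), w><w, phi(x')>]
print(mc, phi(x) @ phi(xp))                       # close to the exact kernel value <phi(x), phi(x')>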

45 Mini Summary
Latent variables t are drawn from a Gaussian process.
Observations y are t corrupted with noise, so y is also drawn from a Gaussian process with \mu \to \mu and K \to K + \sigma^2 \mathbf{1}.
Estimate y' given y, x, x' (matrix inversion):
\tilde{K} = K_{t't'} - K_{tt'}^\top \big(K_{tt} + \sigma^2 \mathbf{1}\big)^{-1} K_{tt'} \quad\text{and}\quad \tilde{\mu} = \mu' + K_{tt'}^\top \big(K_{tt} + \sigma^2 \mathbf{1}\big)^{-1} (y - \mu)
The SVM kernel is the GP kernel.

46 Gaussian Process Classification

47 Gaussian Process Classification
Regression: data y is scalar; the connection to t is by additive noise:
t \sim N(\mu, K) \quad\text{and}\quad y_i \sim N(t_i, \sigma^2), \text{ i.e. } p(y_i \mid t_i) \propto e^{-\frac{(y_i - t_i)^2}{2\sigma^2}}
(Binary) classification: data y \in \{-1, 1\}; the connection to t is by a logistic model:
t \sim N(\mu, K) \quad\text{and}\quad p(y_i \mid t_i) = \frac{1}{1 + e^{-y_i t_i}}

48 Logistic function: p(y_i \mid t_i) = \frac{1}{1 + e^{-y_i t_i}}

49 Gaussian Process Classification
Regression: t \sim N(\mu, K) and y_i \sim N(t_i, \sigma^2), hence y \sim N(\mu, K + \sigma^2 \mathbf{1}). We can integrate out the latent variable t.
Classification: t \sim N(\mu, K) and y_i \sim \text{Logistic}(t_i). A closed form solution is not possible (we cannot solve the integral in t).
p(t \mid y, x) \propto p(t \mid x) \prod_{i=1}^m p(y_i \mid t_i) \propto \exp\big(-\tfrac{1}{2} t^\top K^{-1} t\big) \prod_{i=1}^m \frac{1}{1 + e^{-y_i t_i}}

50 Gaussian Process Classification
What we should do: integrate out t, t':
p(y' \mid y, x, x') = \int d(t, t')\, p(y' \mid t')\, p(y \mid t)\, p(t, t' \mid x, x')
But this is very, very expensive (e.g. MCMC).
Maximum a posteriori approximation:
Find \hat{t} := \operatorname{argmax}_t\, p(y \mid t)\, p(t \mid x)
Ignore correlation in the test data (horrible): find \hat{t}'(x') := \operatorname{argmax}_{t'}\, p(\hat{t}, t' \mid x, x')
Estimate y' \mid y, x, x' \sim \text{Logistic}(\hat{t}'(x'))

51 Maximum a Posteriori Approximation
Step 1: maximize p(t \mid y, x), i.e.
\operatorname{minimize}_t\; \tfrac{1}{2} t^\top K^{-1} t + \sum_{i=1}^m \log\big(1 + e^{-y_i t_i}\big)
Step 2: find t' for the MAP estimate of t (precompute):
t' = K_{tt'}^\top K_{tt}^{-1} t
Step 3: estimate p(y' \mid t'):
p(y' \mid t') = \frac{1}{1 + e^{-y' t'}}
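
A minimal NumPy sketch of Step 1 via Newton iterations, one common way to solve this MAP problem; the kernel matrix K, the toy labels, and the iteration count are my assumptions for illustration.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def map_latent(K, y, iters=20):
    # Newton steps for argmin_t 0.5 t^T K^{-1} t + sum_i log(1 + exp(-y_i t_i))
    m = len(y)
    t = np.zeros(m)
    for _ in range(iters):
        g = y * (1.0 - sigmoid(y * t))          # gradient of log p(y | t)
        W = sigmoid(t) * (1.0 - sigmoid(t))     # diagonal of the negative Hessian of log p(y | t)
        t = np.linalg.solve(np.eye(m) + K * W, K @ (W * t + g))   # (K^{-1} + W)^{-1} (W t + g)
    return t

rng = np.random.default_rng(0)
x = np.linspace(-2.0, 2.0, 30)
y = np.sign(x + 0.3 * rng.standard_normal(30))          # toy labels in {-1, +1}
K = np.exp(-0.5 * (x[:, None] - x[None, :]) ** 2)       # an illustrative RBF kernel matrix
t_hat = map_latent(K, y)
print(np.mean(np.sign(t_hat) == y))                     # training accuracy of the MAP latent values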

52 Clean Data

53 Noisy Data

54 Connection to SVMs revisited
SVM objective:
\operatorname{minimize}_\alpha\; \tfrac{1}{2} \alpha^\top K \alpha + \sum_{i=1}^m \max\big(0, 1 - y_i [K\alpha]_i\big)
Logistic regression objective (MAP estimation):
\operatorname{minimize}_t\; \tfrac{1}{2} t^\top K^{-1} t + \sum_{i=1}^m \log\big(1 + e^{-y_i t_i}\big)
Reparametrize \alpha = K^{-1} t:
\operatorname{minimize}_\alpha\; \tfrac{1}{2} \alpha^\top K \alpha + \sum_{i=1}^m \log\big(1 + \exp(-y_i [K\alpha]_i)\big)
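
A quick NumPy check (the random kernel matrix and labels are mine) that the reparametrization alpha = K^{-1} t leaves the logistic MAP objective unchanged.

import numpy as np

rng = np.random.default_rng(0)
m = 5
A = rng.standard_normal((m, m))
K = A @ A.T + 1e-3 * np.eye(m)          # a random positive definite kernel matrix
y = np.sign(rng.standard_normal(m))
t = rng.standard_normal(m)
alpha = np.linalg.solve(K, t)           # alpha = K^{-1} t, so K alpha = t

obj_t = 0.5 * t @ np.linalg.solve(K, t) + np.sum(np.log1p(np.exp(-y * t)))
obj_a = 0.5 * alpha @ K @ alpha + np.sum(np.log1p(np.exp(-y * (K @ alpha))))
print(obj_t, obj_a)                     # identical up to floating point error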

55 More loss functions
Logistic: \log\big(1 + e^{-f(x)}\big)
Huberized loss:
0 \quad\text{if } f(x) > 1
\tfrac{1}{2}(1 - f(x))^2 \quad\text{if } f(x) \in [0, 1]
\tfrac{1}{2} - f(x) \quad\text{if } f(x) < 0
(asymptotically linear)
Soft margin: \max(0, 1 - f(x))
(asymptotically 0)
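
A short NumPy sketch evaluating the three losses on a grid of margin values f(x); the function names and grid are mine, and plotting is left out.

import numpy as np

def logistic_loss(f):
    return np.log1p(np.exp(-f))

def huberized_loss(f):
    return np.where(f > 1.0, 0.0,
           np.where(f < 0.0, 0.5 - f, 0.5 * (1.0 - f) ** 2))

def soft_margin_loss(f):
    return np.maximum(0.0, 1.0 - f)

f = np.linspace(-3.0, 3.0, 7)
print(np.column_stack([f, logistic_loss(f), huberized_loss(f), soft_margin_loss(f)]))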

56 Mini Summary
Latent variables drawn from a Gaussian process.
Observations drawn from a logistic model.
It is impossible to integrate out the latent variables.
Maximum a posteriori inference (with many hacks to make it scale).
The optimization problem is similar to the SVM (different loss and parametrization \alpha = K^{-1} t).
Advanced topic: adjusting K via a prior on k.
