Announcements

Proposals graded.
Bayesian Methods
Machine Learning CSE546, Kevin Jamieson
University of Washington
November 1, 2018
MLE Recap - coin flips

Data: sequence $D = (HHTHT\ldots)$, $k$ heads out of $n$ flips.
Hypothesis: $P(\text{Heads}) = \theta$, $P(\text{Tails}) = 1 - \theta$, so

  $P(D \mid \theta) = \theta^k (1-\theta)^{n-k}$

Maximum likelihood estimation (MLE): choose the $\theta$ that maximizes the probability of the observed data:

  $\hat{\theta}_{\text{MLE}} = \arg\max_\theta P(D \mid \theta) = \arg\max_\theta \log P(D \mid \theta) = \dfrac{k}{n}$
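A quick numerical sanity check of this closed form (a minimal sketch; the grid search is purely illustrative):

import numpy as np

k, n = 3, 5  # e.g., 3 heads in D = (H, H, T, H, T)
thetas = np.linspace(1e-6, 1 - 1e-6, 10_000)
log_lik = k * np.log(thetas) + (n - k) * np.log(1 - thetas)
print(thetas[np.argmax(log_lik)])  # ~0.6 = k/n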
What about a prior?

Billionaire: Wait, I know that the coin is close to 50-50. What can you do for me now?
You say: I can learn it the Bayesian way.
Bayesian Learning

Use Bayes' rule:

  $P(\theta \mid D) = \dfrac{P(D \mid \theta)\, P(\theta)}{P(D)}$

Or equivalently:

  $P(\theta \mid D) \propto P(D \mid \theta)\, P(\theta)$
Bayesian Learning for Coins

The likelihood function is simply binomial:

  $P(D \mid \theta) = \theta^k (1-\theta)^{n-k}$

What about the prior?
- Represents expert knowledge
- Conjugate priors: closed-form representation of the posterior

For the binomial likelihood, the conjugate prior is the Beta distribution.
Beta prior distribution $P(\theta)$

  $P(\theta) \propto \theta^{\beta_H - 1} (1-\theta)^{\beta_T - 1} \sim \text{Beta}(\beta_H, \beta_T)$
  Mean: $E[\theta] = \dfrac{\beta_H}{\beta_H + \beta_T}$
  Mode: $\dfrac{\beta_H - 1}{\beta_H + \beta_T - 2}$

[Figure: densities of Beta(2,3) and Beta(20,30).]

Likelihood function: $P(D \mid \theta) = \theta^k (1-\theta)^{n-k}$
Posterior: $P(\theta \mid D) \propto P(D \mid \theta)\, P(\theta)$
Posterior distribution

Prior: $\text{Beta}(\beta_H, \beta_T)$. Data: $k$ heads and $(n-k)$ tails.
Posterior distribution:

  $P(\theta \mid D) = \text{Beta}(k + \beta_H,\ (n-k) + \beta_T)$

[Figure: prior $P(\theta)$ vs. posterior $P(\theta \mid D)$ for $k = 23$, $n = 25$, with $\beta_H = \beta_T = 1$, $\beta_H = \beta_T = 10$, and $\beta_H = \beta_T = 50$.]
Using the Bayesian posterior

Posterior distribution:

  $P(\theta \mid D) = \text{Beta}(k + \beta_H,\ (n-k) + \beta_T)$

Bayesian inference:
- Estimate the mean: $E[\theta] = \int_0^1 \theta\, P(\theta \mid D)\, d\theta$
- Estimate an arbitrary function $f$: $E[f(\theta)] = \int_0^1 f(\theta)\, P(\theta \mid D)\, d\theta$
- For arbitrary $f$ the integral is often hard to compute.
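When the integral has no closed form, one common workaround is Monte Carlo: sample from the Beta posterior and average $f$ over the samples. A minimal sketch (the choice $f(\theta) = \theta/(1-\theta)$, the posterior odds, is just an illustrative example):

import numpy as np

k, n = 23, 25
beta_H, beta_T = 10, 10
rng = np.random.default_rng(0)
samples = rng.beta(k + beta_H, (n - k) + beta_T, size=100_000)

print(samples.mean())        # posterior mean E[theta]
f = lambda t: t / (1 - t)    # an arbitrary f of interest
print(f(samples).mean())     # Monte Carlo estimate of E[f(theta)]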
MAP: maximum a posteriori approximation

  $P(\theta \mid D) = \text{Beta}(k + \beta_H,\ (n-k) + \beta_T)$

As more data is observed, the Beta posterior becomes more concentrated.
MAP: use the most likely parameter:

  $\hat{\theta}_{\text{MAP}} = \arg\max_\theta P(\theta \mid D)$
MAP for the Beta distribution

  $P(\theta \mid D) \propto \theta^{k + \beta_H - 1} (1-\theta)^{n - k + \beta_T - 1}$

MAP: use the most likely parameter:

  $\hat{\theta}_{\text{MAP}} = \dfrac{k + \beta_H - 1}{n + \beta_H + \beta_T - 2}$

The Beta prior is equivalent to extra coin flips.
As $n \to \infty$, the prior is forgotten.
But for small sample sizes, the prior is important!
Bayesian vs. Frequentist

Data: $D$
Frequentists treat the unknown $\theta$ as fixed and the data $D$ as random: the estimator is a function of the data, $\hat{\theta} = t(D)$.
Bayesians treat the data $D$ as fixed and the unknown $\theta$ as random, with posterior $P(\theta \mid D)$.
Recap of Bayesian learning

Bayesians are optimists:
- If we model it correctly, we quantify uncertainty exactly.
- Answers all questions simultaneously with the posterior probability.
- Assumes one can accurately model:
  - the observations and their link to the unknown parameter $\theta$: $p(x \mid \theta)$
  - the distribution/structure of the unknown $\theta$: $p(\theta)$

Frequentists are pessimists:
- All models are wrong; prove to me your estimate is good.
- Answers each question with a separately analyzed estimator.
- Makes very few assumptions, e.g. $E[X^2] < \infty$, and constructs an estimator (e.g., median of means of disjoint subsets of the data).
Nearest Neighbor
Machine Learning CSE546, Kevin Jamieson
University of Washington
November 1, 2018
Some data, Bayes classifier

Training data: points with true label +1 or -1.
Optimal Bayes classifier: predict +1 where $P(Y = 1 \mid X = x) > \frac{1}{2}$; the decision boundary is the set $\{x : P(Y = 1 \mid X = x) = \frac{1}{2}\}$.

[Figure stolen from Hastie et al.: two-class data with predicted +1 and -1 regions under the Bayes classifier.]
Linear decision boundary

Training data: points with true label +1 or -1.
Learned: linear decision boundary $x^T w + b = 0$.

[Figure stolen from Hastie et al.: predicted +1 and -1 regions under the linear classifier.]
15-nearest-neighbor boundary

Training data: points with true label +1 or -1.
Learned: 15-nearest-neighbor decision boundary (majority vote).

[Figure: predicted +1 and -1 regions under 15-NN.]
1-nearest-neighbor boundary

Training data: points with true label +1 or -1.
Learned: 1-nearest-neighbor decision boundary (majority vote).

[Figure: predicted +1 and -1 regions under 1-NN.]
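For concreteness, a minimal k-NN classifier of the kind that produces these boundaries (a sketch in plain numpy; the toy data is made up):

import numpy as np

def knn_predict(X_train, y_train, X_test, k=15):
    # Euclidean distances from every test point to every training point
    d2 = ((X_test[:, None, :] - X_train[None, :, :]) ** 2).sum(-1)
    nn = np.argsort(d2, axis=1)[:, :k]    # indices of the k nearest neighbors
    votes = y_train[nn]                   # their labels, in {-1, +1}
    return np.sign(votes.sum(axis=1))     # majority vote (k odd avoids ties)

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = np.sign(X[:, 0] + 0.5 * rng.normal(size=200))  # noisy linear labels
print(knn_predict(X, y, X[:5], k=15))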
k-nearest-neighbor error

Bias-variance tradeoff:
- As $k \to \infty$? Bias grows (the prediction tends toward the globally most common label); variance shrinks.
- As $k \to 1$? Bias is low; variance is high.

[Figure: test error vs. $k$, with the Bayes error marked as the best possible.]
Notable distance metrics (and their level sets)

- $L_2$ norm
- $L_1$ norm (taxi-cab)
- $L_\infty$ (max) norm
- Mahalanobis distance

[Figure: level sets of each metric.]
1 nearest neighbor

One can draw the nearest-neighbor regions in input space.

  $\text{Dist}(x_i, x_j) = (x_{i,1} - x_{j,1})^2 + (x_{i,2} - x_{j,2})^2$
  $\text{Dist}(x_i, x_j) = (x_{i,1} - x_{j,1})^2 + (3x_{i,2} - 3x_{j,2})^2$

The relative scalings in the distance metric affect the region shapes.
1-nearest-neighbor guarantee

$\{(x_i, y_i)\}_{i=1}^n$, $x_i \in \mathbb{R}^d$, $y_i \in \{1, \ldots, k\}$

As $n \to \infty$, assume the $x_i$'s become dense in $\mathbb{R}^d$ and $P(Y = j \mid X = x)$ is smooth, so that as $x_a \to x_b$ we have $P(Y_a = j) \to P(Y_b = j)$ for all $j$.

If $p_\ell = P(Y_a = \ell) = P(Y_b = \ell)$ and $\ell^* = \arg\max_{\ell = 1, \ldots, k} p_\ell$, then

  Bayes error $= 1 - p_{\ell^*}$

  1-nearest-neighbor error $= P(Y_a \neq Y_b) = \sum_{\ell=1}^k P(Y_a = \ell, Y_b \neq \ell) = \sum_{\ell=1}^k p_\ell (1 - p_\ell) \leq 2(1 - p_{\ell^*}) - \dfrac{k}{k-1}(1 - p_{\ell^*})^2$

As $n \to \infty$, the 1-NN rule's error is at most twice the Bayes error! [Cover, Hart, 1967]
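A quick numerical illustration of the asymptotic claim, in the limit where the query point and its nearest neighbor share the same conditional label distribution p (the p below is an arbitrary made-up example):

import numpy as np

p = np.array([0.6, 0.3, 0.1])       # P(Y = ell | X = x) at some x
bayes_err = 1 - p.max()             # 1 - p_{ell*}
nn_err = np.sum(p * (1 - p))        # sum_ell p_ell (1 - p_ell)
k = len(p)
bound = 2 * bayes_err - k / (k - 1) * bayes_err**2
print(bayes_err, nn_err, bound)     # nn_err <= bound <= 2 * bayes_err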
Curse of dimensionality, Ex. 1

$X$ is uniformly distributed over $[0,1]^p$. What is $P(X \in [0,r]^p)$?
Answer: $r^p$, the volume of a sub-cube of side length $r$; for any $r < 1$ this vanishes exponentially fast as $p$ grows.
Curse of dimensionality, Ex. 2

$\{X_i\}_{i=1}^n$ are uniformly distributed over $[-0.5, 0.5]^p$. What is the median distance from a point at the origin to its nearest neighbor?
The median radius $r$ solves $P(\text{all } i: \|X_i\| > r) = \frac{1}{2}$; since each point independently misses a ball of radius $r$ with probability close to 1 in high dimensions, this median distance grows with $p$ and no neighbor is truly "near."
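A Monte Carlo sketch of this median 1-NN distance as dimension grows (the values of n and p are arbitrary choices):

import numpy as np

rng = np.random.default_rng(0)
n, trials = 500, 200
for p in [1, 2, 10, 100]:
    dists = []
    for _ in range(trials):
        X = rng.uniform(-0.5, 0.5, size=(n, p))
        dists.append(np.linalg.norm(X, axis=1).min())  # distance origin -> 1NN
    print(p, np.median(dists))  # grows steadily with dimension p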
Nearest-neighbor regression

$\{(x_i, y_i)\}_{i=1}^n$, $N_k(x_0) = $ the $k$ nearest neighbors of $x_0$

  $\hat{f}(x_0) = \dfrac{1}{k} \sum_{x_i \in N_k(x_0)} y_i$

Why are far-away neighbors weighted the same as close neighbors? Kernel smoothing, with a kernel $K(x, y)$:

  $\hat{f}(x_0) = \dfrac{\sum_{i=1}^n K(x_0, x_i)\, y_i}{\sum_{i=1}^n K(x_0, x_i)}$

Why just average them? Local linear regression:

  $(b(x_0), w(x_0)) = \arg\min_{b, w} \sum_{i=1}^n K(x_0, x_i)\, (y_i - (b + w^T x_i))^2$
  $\hat{f}(x_0) = b(x_0) + w(x_0)^T x_0$
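A minimal sketch of the two smoothers above on 1-D toy data (the Gaussian kernel and bandwidth are assumptions for illustration):

import numpy as np

rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0, 1, 100))
y = np.sin(2 * np.pi * x) + 0.1 * rng.normal(size=100)

def K(x0, xi, h=0.05):
    return np.exp(-(x0 - xi) ** 2 / (2 * h ** 2))

def kernel_smooth(x0):
    w = K(x0, x)
    return w @ y / w.sum()                      # Nadaraya-Watson weighted average

def local_linear(x0):
    w = K(x0, x)
    A = np.stack([np.ones_like(x), x], axis=1)  # columns: intercept, slope
    # weighted least squares via sqrt-weighted rows
    b, s = np.linalg.lstsq(A * np.sqrt(w)[:, None],
                           y * np.sqrt(w), rcond=None)[0]
    return b + s * x0                           # evaluate the local line at x0

print(kernel_smooth(0.3), local_linear(0.3), np.sin(2 * np.pi * 0.3))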
Nearest-neighbor overview

- Very simple to explain and implement.
- No training! But finding nearest neighbors in a large dataset at test time can be computationally demanding (k-d trees help).
- You can use other forms of distance (not just Euclidean).
- Smoothing with kernels and local linear regression can improve performance (at the cost of higher variance).
- With a lot of data, local methods have strong, simple theoretical guarantees. Without a lot of data, neighborhoods aren't local and the methods suffer.
Kernels
Machine Learning CSE546, Kevin Jamieson
University of Washington
October 26, 2018
Machine learning problems

Have a bunch of iid data of the form $\{(x_i, y_i)\}_{i=1}^n$, $x_i \in \mathbb{R}^d$, $y_i \in \mathbb{R}$.
Learning a model's parameters means minimizing $\sum_{i=1}^n \ell_i(w)$, where each $\ell_i(w)$ is convex:

- Hinge loss: $\ell_i(w) = \max\{0,\ 1 - y_i x_i^T w\}$
- Logistic loss: $\ell_i(w) = \log(1 + \exp(-y_i x_i^T w))$
- Squared error loss: $\ell_i(w) = (y_i - x_i^T w)^2$

All in terms of inner products! Even nearest neighbor can use inner products, as sketched below.
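The nearest-neighbor remark follows from expanding the squared distance into inner products, $\|x - y\|^2 = \langle x, x\rangle - 2\langle x, y\rangle + \langle y, y\rangle$; a quick sketch:

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))
q = rng.normal(size=3)

G = X @ X.T                               # Gram matrix of inner products
d2 = np.diag(G) - 2 * (X @ q) + q @ q     # ||x_i - q||^2 from inner products only
print(np.argmin(d2))                      # nearest neighbor of q
print(np.argmin(((X - q) ** 2).sum(1)))   # matches the direct computation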
What if the data is not linearly separable?

Use features of features of features of features: $\phi(x): \mathbb{R}^d \to \mathbb{R}^p$.
The feature space can get really large really quickly!
Dot product of polynomials of degree exactly d

$d = 1$: $\phi(u) = \begin{bmatrix} u_1 \\ u_2 \end{bmatrix}$, so $\langle \phi(u), \phi(v) \rangle = u_1 v_1 + u_2 v_2 = \langle u, v \rangle$

$d = 2$: $\phi(u) = \begin{bmatrix} u_1^2 \\ u_2^2 \\ u_1 u_2 \\ u_2 u_1 \end{bmatrix}$, so $\langle \phi(u), \phi(v) \rangle = u_1^2 v_1^2 + u_2^2 v_2^2 + 2 u_1 u_2 v_1 v_2 = \langle u, v \rangle^2$

General $d$: $\langle \phi(u), \phi(v) \rangle = \langle u, v \rangle^d$, and the dimension of $\phi(u)$ is roughly $p^d$ if $u \in \mathbb{R}^p$.
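A numeric check of the $d = 2$ case (a sketch on random vectors):

import numpy as np

rng = np.random.default_rng(0)
u, v = rng.normal(size=2), rng.normal(size=2)

phi = lambda x: np.array([x[0]**2, x[1]**2, x[0]*x[1], x[1]*x[0]])
print(phi(u) @ phi(v))   # explicit feature-map inner product
print((u @ v) ** 2)      # kernel evaluation <u, v>^2 -- identical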
Kernel trick

  $\hat{w} = \arg\min_w \sum_{i=1}^n (y_i - x_i^T w)^2 + \lambda \|w\|^2$

There exists an $\alpha \in \mathbb{R}^n$ such that $\hat{w} = \sum_{i=1}^n \alpha_i x_i$. Why? (At the optimum, the gradient condition forces $\hat{w}$ into the span of the $x_i$'s.) Substituting:

  $\hat{\alpha} = \arg\min_\alpha \sum_{i=1}^n \Big(y_i - \sum_{j=1}^n \alpha_j \langle x_j, x_i \rangle\Big)^2 + \lambda \sum_{i=1}^n \sum_{j=1}^n \alpha_i \alpha_j \langle x_i, x_j \rangle$
  $\phantom{\hat{\alpha}} = \arg\min_\alpha \sum_{i=1}^n \Big(y_i - \sum_{j=1}^n \alpha_j K(x_i, x_j)\Big)^2 + \lambda \sum_{i=1}^n \sum_{j=1}^n \alpha_i \alpha_j K(x_i, x_j)$
  $\phantom{\hat{\alpha}} = \arg\min_\alpha \|y - K\alpha\|_2^2 + \lambda\, \alpha^T K \alpha$,  where $K(x_i, x_j) = \langle \phi(x_i), \phi(x_j) \rangle$.
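Setting the gradient of the last objective to zero gives $2K(K\alpha - y) + 2\lambda K\alpha = 0$, which is satisfied by $\hat{\alpha} = (K + \lambda I)^{-1} y$. A minimal kernel ridge regression sketch (the RBF kernel, bandwidth, and toy data are illustrative choices):

import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(50, 1))
y = np.sin(3 * X[:, 0]) + 0.1 * rng.normal(size=50)

def rbf(A, B, sigma=0.3):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma**2))

lam = 1e-3
K = rbf(X, X)
alpha = np.linalg.solve(K + lam * np.eye(len(X)), y)  # (K + lam*I)^{-1} y

X_test = np.array([[0.5]])
print(rbf(X_test, X) @ alpha, np.sin(1.5))  # f_hat(x) = sum_i alpha_i K(x_i, x) vs. truth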
Why regularization?

Typically $K \succ 0$. What if $\lambda = 0$?

  $\hat{\alpha} = \arg\min_\alpha \|y - K\alpha\|_2^2 + \lambda\, \alpha^T K \alpha$

With $\lambda = 0$, $\hat{\alpha} = K^{-1} y$ interpolates the labels exactly: unregularized kernel least squares can (over)fit any data!
Common kernels

- Polynomials of degree exactly $d$: $K(u, v) = (u^T v)^d$
- Polynomials of degree up to $d$: $K(u, v) = (1 + u^T v)^d$
- Gaussian (squared exponential) kernel: $K(u, v) = \exp\left(-\dfrac{\|u - v\|^2}{2\sigma^2}\right)$
- Sigmoid: $K(u, v) = \tanh(\gamma\, u^T v + r)$
Mercer's theorem

When do we have a valid kernel $K(x, x')$?
Definition 1: when it is an inner product, $K(x, x') = \langle \phi(x), \phi(x') \rangle$.
Mercer's theorem: $K(x, x')$ is a valid kernel if and only if $K$ is positive semi-definite, in the following sense:

  $\int \int h(x)\, K(x, x')\, h(x')\, dx\, dx' \geq 0$ for all $h: \mathbb{R}^d \to \mathbb{R}$ with $\int h(x)^2\, dx < \infty$.
RBF kernel

  $K(u, v) = \exp\left(-\dfrac{\|u - v\|^2}{2\sigma^2}\right)$

Note that this is like weighting bumps on each point, as in kernel smoothing, but now we learn the weights.
RBF kernel

  $K(u, v) = \exp\left(-\dfrac{\|u - v\|^2}{2\sigma^2}\right)$,   $\hat{f}(x) = \sum_{i=1}^n \hat{\alpha}_i K(x_i, x)$

The bandwidth $\sigma$ has an enormous effect on the fit:

[Figure: fits for $\sigma \in \{10^{-2}, 10^{-1}, 10^0\}$ with $\lambda = 10^{-4}$, and for other $(\sigma, \lambda)$ combinations, ranging from jagged overfitting at small $\sigma$ to oversmoothing at large $\sigma$.]
RBF kernel and random features

Two identities: $2\cos(\alpha)\cos(\beta) = \cos(\alpha + \beta) + \cos(\alpha - \beta)$, and $e^{jz} = \cos(z) + j\sin(z)$.

Recall HW1, where we used the feature map

  $\phi(x) = \begin{bmatrix} \sqrt{2}\cos(w_1^T x + b_1) \\ \vdots \\ \sqrt{2}\cos(w_p^T x + b_p) \end{bmatrix}$,  $w_k \sim \mathcal{N}(0, \sigma^2 I)$, $b_k \sim \text{uniform}(0, 2\pi)$.

Then

  $E\left[\frac{1}{p}\phi(x)^T \phi(y)\right] = \frac{1}{p}\sum_{k=1}^p E[2\cos(w_k^T x + b_k)\cos(w_k^T y + b_k)]$
  $\phantom{E} = E_{w,b}[2\cos(w^T x + b)\cos(w^T y + b)]$
  $\phantom{E} = E_{w,b}[\cos(w^T(x + y) + 2b)] + E_w[\cos(w^T(x - y))]$
  $\phantom{E} = e^{-\sigma^2 \|x - y\|^2 / 2}$

[Rahimi, Recht, NIPS 2007; NIPS Test of Time Award, 2018]
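A sketch verifying the random-feature approximation against the exact Gaussian kernel ($p$, $\sigma$, and the test points are arbitrary choices):

import numpy as np

rng = np.random.default_rng(0)
d, p, sigma = 3, 20_000, 0.8
x, y = rng.normal(size=d), rng.normal(size=d)

W = rng.normal(scale=sigma, size=(p, d))   # w_k ~ N(0, sigma^2 I)
b = rng.uniform(0, 2 * np.pi, size=p)      # b_k ~ uniform(0, 2*pi)
phi = lambda z: np.sqrt(2) * np.cos(W @ z + b)

print(phi(x) @ phi(y) / p)                            # random-feature estimate
print(np.exp(-sigma**2 * np.sum((x - y) ** 2) / 2))   # exact kernel value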
RBF classification

  $\hat{w} = \arg\min_{w,b} \sum_{i=1}^n \max\{0,\ 1 - y_i(b + x_i^T w)\} + \lambda \|w\|_2^2$

Kernelized:

  $\min_{\alpha, b} \sum_{i=1}^n \max\Big\{0,\ 1 - y_i\Big(b + \sum_{j=1}^n \alpha_j \langle x_i, x_j \rangle\Big)\Big\} + \lambda \sum_{i,j=1}^n \alpha_i \alpha_j \langle x_i, x_j \rangle$
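This is the (kernelized) SVM objective; as a usage sketch, scikit-learn's SVC solves it with an RBF kernel (the data and hyperparameters here are arbitrary):

import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = np.sign(X[:, 0] ** 2 + X[:, 1] ** 2 - 1)  # labels: inside vs. outside a circle

clf = SVC(kernel="rbf", gamma=1.0, C=10.0)    # C plays the role of 1/lambda
clf.fit(X, y)
print(clf.score(X, y))                        # training accuracy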
Wait, infinite dimensions?

Isn't everything separable there? How are we not overfitting?
Regularization! The effective capacity (fat-shattering dimension) scales like $(R/\text{margin})^2$, not with the ambient dimension.
String kernels

Example from Efron and Hastie, 2016. Amino acid sequences of different lengths: $x_1$, $x_2$.
Features: counts of all subsequences of length 3 (out of 20 possible amino acids), giving $20^3 = 8000$ features.