Announcements

Proposals graded.
Bayesian Methods
Machine Learning CSE546, Kevin Jamieson
University of Washington
November 1, 2018
MLE Recap - coin flips

Data: sequence $D = (HHTHT\ldots)$, $k$ heads out of $n$ flips.
Hypothesis: $P(\text{Heads}) = \theta$, $P(\text{Tails}) = 1 - \theta$, so

  $P(D \mid \theta) = \theta^k (1-\theta)^{n-k}$

Maximum likelihood estimation (MLE): choose the $\theta$ that maximizes the probability of the observed data:

  $\hat{\theta}_{\text{MLE}} = \arg\max_\theta P(D \mid \theta) = \arg\max_\theta \log P(D \mid \theta) = \dfrac{k}{n}$
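A quick numerical sanity check of this closed form (a minimal sketch; the grid search is purely illustrative):

import numpy as np

k, n = 3, 5  # e.g., 3 heads in D = (H, H, T, H, T)
thetas = np.linspace(1e-6, 1 - 1e-6, 10_000)
log_lik = k * np.log(thetas) + (n - k) * np.log(1 - thetas)
print(thetas[np.argmax(log_lik)])  # ~0.6 = k/n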
What about a prior?

Billionaire: Wait, I know that the coin is close to 50-50. What can you do for me now?
You say: I can learn it the Bayesian way.
Bayesian Learning

Use Bayes' rule:

  $P(\theta \mid D) = \dfrac{P(D \mid \theta)\, P(\theta)}{P(D)}$

Or equivalently:

  $P(\theta \mid D) \propto P(D \mid \theta)\, P(\theta)$
Bayesian Learning for Coins

The likelihood function is simply binomial:

  $P(D \mid \theta) = \theta^k (1-\theta)^{n-k}$

What about the prior?
- Represents expert knowledge
- Conjugate priors: closed-form representation of the posterior

For the binomial likelihood, the conjugate prior is the Beta distribution.
Beta prior distribution $P(\theta)$

  $P(\theta) \propto \theta^{\beta_H - 1} (1-\theta)^{\beta_T - 1} \sim \text{Beta}(\beta_H, \beta_T)$
  Mean: $E[\theta] = \dfrac{\beta_H}{\beta_H + \beta_T}$
  Mode: $\dfrac{\beta_H - 1}{\beta_H + \beta_T - 2}$

[Figure: densities of Beta(2,3) and Beta(20,30).]

Likelihood function: $P(D \mid \theta) = \theta^k (1-\theta)^{n-k}$
Posterior: $P(\theta \mid D) \propto P(D \mid \theta)\, P(\theta)$
Posterior distribution

Prior: $\text{Beta}(\beta_H, \beta_T)$. Data: $k$ heads and $(n-k)$ tails.
Posterior distribution:

  $P(\theta \mid D) = \text{Beta}(k + \beta_H,\ (n-k) + \beta_T)$

[Figure: prior $P(\theta)$ vs. posterior $P(\theta \mid D)$ for $k = 23$, $n = 25$, with $\beta_H = \beta_T = 1$, $\beta_H = \beta_T = 10$, and $\beta_H = \beta_T = 50$.]
Using the Bayesian posterior

Posterior distribution:

  $P(\theta \mid D) = \text{Beta}(k + \beta_H,\ (n-k) + \beta_T)$

Bayesian inference:
- Estimate the mean: $E[\theta] = \int_0^1 \theta\, P(\theta \mid D)\, d\theta$
- Estimate an arbitrary function $f$: $E[f(\theta)] = \int_0^1 f(\theta)\, P(\theta \mid D)\, d\theta$
- For arbitrary $f$ the integral is often hard to compute.
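When the integral has no closed form, one common workaround is Monte Carlo: sample from the Beta posterior and average $f$ over the samples. A minimal sketch (the choice $f(\theta) = \theta/(1-\theta)$, the posterior odds, is just an illustrative example):

import numpy as np

k, n = 23, 25
beta_H, beta_T = 10, 10
rng = np.random.default_rng(0)
samples = rng.beta(k + beta_H, (n - k) + beta_T, size=100_000)

print(samples.mean())        # posterior mean E[theta]
f = lambda t: t / (1 - t)    # an arbitrary f of interest
print(f(samples).mean())     # Monte Carlo estimate of E[f(theta)]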
MAP: maximum a posteriori approximation

  $P(\theta \mid D) = \text{Beta}(k + \beta_H,\ (n-k) + \beta_T)$

As more data is observed, the Beta posterior becomes more concentrated.
MAP: use the most likely parameter:

  $\hat{\theta}_{\text{MAP}} = \arg\max_\theta P(\theta \mid D)$
MAP for the Beta distribution

  $P(\theta \mid D) \propto \theta^{k + \beta_H - 1} (1-\theta)^{n - k + \beta_T - 1}$

MAP: use the most likely parameter:

  $\hat{\theta}_{\text{MAP}} = \dfrac{k + \beta_H - 1}{n + \beta_H + \beta_T - 2}$

The Beta prior is equivalent to extra coin flips.
As $n \to \infty$, the prior is forgotten.
But for small sample sizes, the prior is important!
Bayesian vs. Frequentist

Data: $D$
Frequentists treat the unknown $\theta$ as fixed and the data $D$ as random: the estimator is a function of the data, $\hat{\theta} = t(D)$.
Bayesians treat the data $D$ as fixed and the unknown $\theta$ as random, with posterior $P(\theta \mid D)$.
Recap of Bayesian learning

Bayesians are optimists:
- If we model it correctly, we quantify uncertainty exactly.
- Answers all questions simultaneously with the posterior probability.
- Assumes one can accurately model:
  - the observations and their link to the unknown parameter $\theta$: $p(x \mid \theta)$
  - the distribution/structure of the unknown $\theta$: $p(\theta)$

Frequentists are pessimists:
- All models are wrong; prove to me your estimate is good.
- Answers each question with a separately analyzed estimator.
- Makes very few assumptions, e.g. $E[X^2] < \infty$, and constructs an estimator (e.g., median of means of disjoint subsets of the data).
Nearest Neighbor
Machine Learning CSE546, Kevin Jamieson
University of Washington
November 1, 2018
Some data, Bayes classifier

Training data: points with true label +1 or -1.
Optimal Bayes classifier: predict +1 where $P(Y = 1 \mid X = x) > \frac{1}{2}$; the decision boundary is the set $\{x : P(Y = 1 \mid X = x) = \frac{1}{2}\}$.

[Figure stolen from Hastie et al.: two-class data with predicted +1 and -1 regions under the Bayes classifier.]
Linear decision boundary

Training data: points with true label +1 or -1.
Learned: linear decision boundary $x^T w + b = 0$.

[Figure stolen from Hastie et al.: predicted +1 and -1 regions under the linear classifier.]
15-nearest-neighbor boundary

Training data: points with true label +1 or -1.
Learned: 15-nearest-neighbor decision boundary (majority vote).

[Figure: predicted +1 and -1 regions under 15-NN.]
1-nearest-neighbor boundary

Training data: points with true label +1 or -1.
Learned: 1-nearest-neighbor decision boundary (majority vote).

[Figure: predicted +1 and -1 regions under 1-NN.]
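For concreteness, a minimal k-NN classifier of the kind that produces these boundaries (a sketch in plain numpy; the toy data is made up):

import numpy as np

def knn_predict(X_train, y_train, X_test, k=15):
    # Euclidean distances from every test point to every training point
    d2 = ((X_test[:, None, :] - X_train[None, :, :]) ** 2).sum(-1)
    nn = np.argsort(d2, axis=1)[:, :k]    # indices of the k nearest neighbors
    votes = y_train[nn]                   # their labels, in {-1, +1}
    return np.sign(votes.sum(axis=1))     # majority vote (k odd avoids ties)

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = np.sign(X[:, 0] + 0.5 * rng.normal(size=200))  # noisy linear labels
print(knn_predict(X, y, X[:5], k=15))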
k-nearest-neighbor error

Bias-variance tradeoff:
- As $k \to \infty$? Bias grows (the prediction tends toward the globally most common label); variance shrinks.
- As $k \to 1$? Bias is low; variance is high.

[Figure: test error vs. $k$, with the Bayes error marked as the best possible.]
Notable distance metrics (and their level sets)

- $L_2$ norm
- $L_1$ norm (taxi-cab)
- $L_\infty$ (max) norm
- Mahalanobis distance

[Figure: level sets of each metric.]
1 nearest neighbor

One can draw the nearest-neighbor regions in input space.

  $\text{Dist}(x_i, x_j) = (x_{i,1} - x_{j,1})^2 + (x_{i,2} - x_{j,2})^2$
  $\text{Dist}(x_i, x_j) = (x_{i,1} - x_{j,1})^2 + (3x_{i,2} - 3x_{j,2})^2$

The relative scalings in the distance metric affect the region shapes.
1-nearest-neighbor guarantee

$\{(x_i, y_i)\}_{i=1}^n$, $x_i \in \mathbb{R}^d$, $y_i \in \{1, \ldots, k\}$

As $n \to \infty$, assume the $x_i$'s become dense in $\mathbb{R}^d$ and $P(Y = j \mid X = x)$ is smooth, so that as $x_a \to x_b$ we have $P(Y_a = j) \to P(Y_b = j)$ for all $j$.

If $p_\ell = P(Y_a = \ell) = P(Y_b = \ell)$ and $\ell^* = \arg\max_{\ell = 1, \ldots, k} p_\ell$, then

  Bayes error $= 1 - p_{\ell^*}$

  1-nearest-neighbor error $= P(Y_a \neq Y_b) = \sum_{\ell=1}^k P(Y_a = \ell, Y_b \neq \ell) = \sum_{\ell=1}^k p_\ell (1 - p_\ell) \leq 2(1 - p_{\ell^*}) - \dfrac{k}{k-1}(1 - p_{\ell^*})^2$

As $n \to \infty$, the 1-NN rule's error is at most twice the Bayes error! [Cover, Hart, 1967]
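A quick numerical illustration of the asymptotic claim, in the limit where the query point and its nearest neighbor share the same conditional label distribution p (the p below is an arbitrary made-up example):

import numpy as np

p = np.array([0.6, 0.3, 0.1])       # P(Y = ell | X = x) at some x
bayes_err = 1 - p.max()             # 1 - p_{ell*}
nn_err = np.sum(p * (1 - p))        # sum_ell p_ell (1 - p_ell)
k = len(p)
bound = 2 * bayes_err - k / (k - 1) * bayes_err**2
print(bayes_err, nn_err, bound)     # nn_err <= bound <= 2 * bayes_err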
Curse of dimensionality, Ex. 1

$X$ is uniformly distributed over $[0,1]^p$. What is $P(X \in [0,r]^p)$?
Answer: $r^p$, the volume of a sub-cube of side length $r$; for any $r < 1$ this vanishes exponentially fast as $p$ grows.
Curse of dimensionality, Ex. 2

$\{X_i\}_{i=1}^n$ are uniformly distributed over $[-0.5, 0.5]^p$. What is the median distance from a point at the origin to its nearest neighbor?
The median radius $r$ solves $P(\text{all } i: \|X_i\| > r) = \frac{1}{2}$; since each point independently misses a ball of radius $r$ with probability close to 1 in high dimensions, this median distance grows with $p$ and no neighbor is truly "near."
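A Monte Carlo sketch of this median 1-NN distance as dimension grows (the values of n and p are arbitrary choices):

import numpy as np

rng = np.random.default_rng(0)
n, trials = 500, 200
for p in [1, 2, 10, 100]:
    dists = []
    for _ in range(trials):
        X = rng.uniform(-0.5, 0.5, size=(n, p))
        dists.append(np.linalg.norm(X, axis=1).min())  # distance origin -> 1NN
    print(p, np.median(dists))  # grows steadily with dimension p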
Nearest-neighbor regression

$\{(x_i, y_i)\}_{i=1}^n$, $N_k(x_0) = $ the $k$ nearest neighbors of $x_0$

  $\hat{f}(x_0) = \dfrac{1}{k} \sum_{x_i \in N_k(x_0)} y_i$

Why are far-away neighbors weighted the same as close neighbors? Kernel smoothing, with a kernel $K(x, y)$:

  $\hat{f}(x_0) = \dfrac{\sum_{i=1}^n K(x_0, x_i)\, y_i}{\sum_{i=1}^n K(x_0, x_i)}$

Why just average them? Local linear regression:

  $(b(x_0), w(x_0)) = \arg\min_{b, w} \sum_{i=1}^n K(x_0, x_i)\, (y_i - (b + w^T x_i))^2$
  $\hat{f}(x_0) = b(x_0) + w(x_0)^T x_0$
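A minimal sketch of the two smoothers above on 1-D toy data (the Gaussian kernel and bandwidth are assumptions for illustration):

import numpy as np

rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0, 1, 100))
y = np.sin(2 * np.pi * x) + 0.1 * rng.normal(size=100)

def K(x0, xi, h=0.05):
    return np.exp(-(x0 - xi) ** 2 / (2 * h ** 2))

def kernel_smooth(x0):
    w = K(x0, x)
    return w @ y / w.sum()                      # Nadaraya-Watson weighted average

def local_linear(x0):
    w = K(x0, x)
    A = np.stack([np.ones_like(x), x], axis=1)  # columns: intercept, slope
    # weighted least squares via sqrt-weighted rows
    b, s = np.linalg.lstsq(A * np.sqrt(w)[:, None],
                           y * np.sqrt(w), rcond=None)[0]
    return b + s * x0                           # evaluate the local line at x0

print(kernel_smooth(0.3), local_linear(0.3), np.sin(2 * np.pi * 0.3))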
Nearest-neighbor overview

- Very simple to explain and implement.
- No training! But finding nearest neighbors in a large dataset at test time can be computationally demanding (k-d trees help).
- You can use other forms of distance (not just Euclidean).
- Smoothing with kernels and local linear regression can improve performance (at the cost of higher variance).
- With a lot of data, local methods have strong, simple theoretical guarantees. Without a lot of data, neighborhoods aren't local and the methods suffer.
Kernels
Machine Learning CSE546, Kevin Jamieson
University of Washington
October 26, 2018
Machine learning problems

Have a bunch of iid data of the form $\{(x_i, y_i)\}_{i=1}^n$, $x_i \in \mathbb{R}^d$, $y_i \in \mathbb{R}$.
Learning a model's parameters means minimizing $\sum_{i=1}^n \ell_i(w)$, where each $\ell_i(w)$ is convex:

- Hinge loss: $\ell_i(w) = \max\{0,\ 1 - y_i x_i^T w\}$
- Logistic loss: $\ell_i(w) = \log(1 + \exp(-y_i x_i^T w))$
- Squared error loss: $\ell_i(w) = (y_i - x_i^T w)^2$

All in terms of inner products! Even nearest neighbor can use inner products, as sketched below.
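The nearest-neighbor remark follows from expanding the squared distance into inner products, $\|x - y\|^2 = \langle x, x\rangle - 2\langle x, y\rangle + \langle y, y\rangle$; a quick sketch:

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))
q = rng.normal(size=3)

G = X @ X.T                               # Gram matrix of inner products
d2 = np.diag(G) - 2 * (X @ q) + q @ q     # ||x_i - q||^2 from inner products only
print(np.argmin(d2))                      # nearest neighbor of q
print(np.argmin(((X - q) ** 2).sum(1)))   # matches the direct computation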
What if the data is not linearly separable?

Use features of features of features of features: $\phi(x): \mathbb{R}^d \to \mathbb{R}^p$.
The feature space can get really large really quickly!
Dot product of polynomials of degree exactly d

$d = 1$: $\phi(u) = \begin{bmatrix} u_1 \\ u_2 \end{bmatrix}$, so $\langle \phi(u), \phi(v) \rangle = u_1 v_1 + u_2 v_2 = \langle u, v \rangle$

$d = 2$: $\phi(u) = \begin{bmatrix} u_1^2 \\ u_2^2 \\ u_1 u_2 \\ u_2 u_1 \end{bmatrix}$, so $\langle \phi(u), \phi(v) \rangle = u_1^2 v_1^2 + u_2^2 v_2^2 + 2 u_1 u_2 v_1 v_2 = \langle u, v \rangle^2$

General $d$: $\langle \phi(u), \phi(v) \rangle = \langle u, v \rangle^d$, and the dimension of $\phi(u)$ is roughly $p^d$ if $u \in \mathbb{R}^p$.
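A numeric check of the $d = 2$ case (a sketch on random vectors):

import numpy as np

rng = np.random.default_rng(0)
u, v = rng.normal(size=2), rng.normal(size=2)

phi = lambda x: np.array([x[0]**2, x[1]**2, x[0]*x[1], x[1]*x[0]])
print(phi(u) @ phi(v))   # explicit feature-map inner product
print((u @ v) ** 2)      # kernel evaluation <u, v>^2 -- identical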
Kernel trick

  $\hat{w} = \arg\min_w \sum_{i=1}^n (y_i - x_i^T w)^2 + \lambda \|w\|^2$

There exists an $\alpha \in \mathbb{R}^n$ such that $\hat{w} = \sum_{i=1}^n \alpha_i x_i$. Why? (At the optimum, the gradient condition forces $\hat{w}$ into the span of the $x_i$'s.) Substituting:

  $\hat{\alpha} = \arg\min_\alpha \sum_{i=1}^n \Big(y_i - \sum_{j=1}^n \alpha_j \langle x_j, x_i \rangle\Big)^2 + \lambda \sum_{i=1}^n \sum_{j=1}^n \alpha_i \alpha_j \langle x_i, x_j \rangle$
  $\phantom{\hat{\alpha}} = \arg\min_\alpha \sum_{i=1}^n \Big(y_i - \sum_{j=1}^n \alpha_j K(x_i, x_j)\Big)^2 + \lambda \sum_{i=1}^n \sum_{j=1}^n \alpha_i \alpha_j K(x_i, x_j)$
  $\phantom{\hat{\alpha}} = \arg\min_\alpha \|y - K\alpha\|_2^2 + \lambda\, \alpha^T K \alpha$,  where $K(x_i, x_j) = \langle \phi(x_i), \phi(x_j) \rangle$.
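Setting the gradient of the last objective to zero gives $2K(K\alpha - y) + 2\lambda K\alpha = 0$, which is satisfied by $\hat{\alpha} = (K + \lambda I)^{-1} y$. A minimal kernel ridge regression sketch (the RBF kernel, bandwidth, and toy data are illustrative choices):

import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(50, 1))
y = np.sin(3 * X[:, 0]) + 0.1 * rng.normal(size=50)

def rbf(A, B, sigma=0.3):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma**2))

lam = 1e-3
K = rbf(X, X)
alpha = np.linalg.solve(K + lam * np.eye(len(X)), y)  # (K + lam*I)^{-1} y

X_test = np.array([[0.5]])
print(rbf(X_test, X) @ alpha, np.sin(1.5))  # f_hat(x) = sum_i alpha_i K(x_i, x) vs. truth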
Why regularization?

Typically $K \succ 0$. What if $\lambda = 0$?

  $\hat{\alpha} = \arg\min_\alpha \|y - K\alpha\|_2^2 + \lambda\, \alpha^T K \alpha$

With $\lambda = 0$, $\hat{\alpha} = K^{-1} y$ interpolates the labels exactly: unregularized kernel least squares can (over)fit any data!
Common kernels

- Polynomials of degree exactly $d$: $K(u, v) = (u^T v)^d$
- Polynomials of degree up to $d$: $K(u, v) = (1 + u^T v)^d$
- Gaussian (squared exponential) kernel: $K(u, v) = \exp\left(-\dfrac{\|u - v\|^2}{2\sigma^2}\right)$
- Sigmoid: $K(u, v) = \tanh(\gamma\, u^T v + r)$
Mercer's theorem

When do we have a valid kernel $K(x, x')$?
Definition 1: when it is an inner product, $K(x, x') = \langle \phi(x), \phi(x') \rangle$.
Mercer's theorem: $K(x, x')$ is a valid kernel if and only if $K$ is positive semi-definite, in the following sense:

  $\int \int h(x)\, K(x, x')\, h(x')\, dx\, dx' \geq 0$ for all $h: \mathbb{R}^d \to \mathbb{R}$ with $\int h(x)^2\, dx < \infty$.
RBF kernel

  $K(u, v) = \exp\left(-\dfrac{\|u - v\|^2}{2\sigma^2}\right)$

Note that this is like weighting bumps on each point, as in kernel smoothing, but now we learn the weights.
RBF kernel

  $K(u, v) = \exp\left(-\dfrac{\|u - v\|^2}{2\sigma^2}\right)$,   $\hat{f}(x) = \sum_{i=1}^n \hat{\alpha}_i K(x_i, x)$

The bandwidth $\sigma$ has an enormous effect on the fit:

[Figure: fits for $\sigma \in \{10^{-2}, 10^{-1}, 10^0\}$ with $\lambda = 10^{-4}$, and for other $(\sigma, \lambda)$ combinations, ranging from jagged overfitting at small $\sigma$ to oversmoothing at large $\sigma$.]
RBF kernel and random features

Two identities: $2\cos(\alpha)\cos(\beta) = \cos(\alpha + \beta) + \cos(\alpha - \beta)$, and $e^{jz} = \cos(z) + j\sin(z)$.

Recall HW1, where we used the feature map

  $\phi(x) = \begin{bmatrix} \sqrt{2}\cos(w_1^T x + b_1) \\ \vdots \\ \sqrt{2}\cos(w_p^T x + b_p) \end{bmatrix}$,  $w_k \sim \mathcal{N}(0, \sigma^2 I)$, $b_k \sim \text{uniform}(0, 2\pi)$.

Then

  $E\left[\frac{1}{p}\phi(x)^T \phi(y)\right] = \frac{1}{p}\sum_{k=1}^p E[2\cos(w_k^T x + b_k)\cos(w_k^T y + b_k)]$
  $\phantom{E} = E_{w,b}[2\cos(w^T x + b)\cos(w^T y + b)]$
  $\phantom{E} = E_{w,b}[\cos(w^T(x + y) + 2b)] + E_w[\cos(w^T(x - y))]$
  $\phantom{E} = e^{-\sigma^2 \|x - y\|^2 / 2}$

[Rahimi, Recht, NIPS 2007; NIPS Test of Time Award, 2018]
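A sketch verifying the random-feature approximation against the exact Gaussian kernel ($p$, $\sigma$, and the test points are arbitrary choices):

import numpy as np

rng = np.random.default_rng(0)
d, p, sigma = 3, 20_000, 0.8
x, y = rng.normal(size=d), rng.normal(size=d)

W = rng.normal(scale=sigma, size=(p, d))   # w_k ~ N(0, sigma^2 I)
b = rng.uniform(0, 2 * np.pi, size=p)      # b_k ~ uniform(0, 2*pi)
phi = lambda z: np.sqrt(2) * np.cos(W @ z + b)

print(phi(x) @ phi(y) / p)                            # random-feature estimate
print(np.exp(-sigma**2 * np.sum((x - y) ** 2) / 2))   # exact kernel value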
RBF classification

  $\hat{w} = \arg\min_{w,b} \sum_{i=1}^n \max\{0,\ 1 - y_i(b + x_i^T w)\} + \lambda \|w\|_2^2$

Kernelized:

  $\min_{\alpha, b} \sum_{i=1}^n \max\Big\{0,\ 1 - y_i\Big(b + \sum_{j=1}^n \alpha_j \langle x_i, x_j \rangle\Big)\Big\} + \lambda \sum_{i,j=1}^n \alpha_i \alpha_j \langle x_i, x_j \rangle$
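This is the (kernelized) SVM objective; as a usage sketch, scikit-learn's SVC solves it with an RBF kernel (the data and hyperparameters here are arbitrary):

import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = np.sign(X[:, 0] ** 2 + X[:, 1] ** 2 - 1)  # labels: inside vs. outside a circle

clf = SVC(kernel="rbf", gamma=1.0, C=10.0)    # C plays the role of 1/lambda
clf.fit(X, y)
print(clf.score(X, y))                        # training accuracy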
Wait, infinite dimensions?

Isn't everything separable there? How are we not overfitting?
Regularization! The effective capacity (fat-shattering dimension) scales like $(R/\text{margin})^2$, not with the ambient dimension.
String kernels

Example from Efron and Hastie, 2016. Amino acid sequences of different lengths: $x_1$, $x_2$.
Features: counts of all subsequences of length 3 (out of 20 possible amino acids), giving $20^3 = 8000$ features.