Announcements
You should turn in a PDF and Python file(s). The figure for problem 9 should be included in the PDF. Please do not zip these files when submitting (unless there are more than 5 files).
Bayesian Methods. Machine Learning CSE546, Kevin Jamieson, University of Washington. September 28, 2017.
MLE Recap - coin flips
Data: sequence D = (HHTHT...), k heads out of n flips
Hypothesis: P(Heads) = θ, P(Tails) = 1 - θ, so $P(D \mid \theta) = \theta^k (1-\theta)^{n-k}$
Maximum likelihood estimation (MLE): choose the θ that maximizes the probability of the observed data:
$\hat{\theta}_{MLE} = \arg\max_\theta P(D \mid \theta) = \arg\max_\theta \log P(D \mid \theta)$
$\hat{\theta}_{MLE} = k/n$
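A minimal numerical sketch of this slide (not part of the original deck; the true bias and sample size are chosen purely for illustration): the closed-form MLE $k/n$ agrees with a brute-force maximization of the log-likelihood.

```python
import numpy as np

rng = np.random.default_rng(0)
theta_true = 0.7                       # hypothetical true coin bias (illustrative)
flips = rng.random(1000) < theta_true  # True = heads
k, n = flips.sum(), flips.size

theta_mle = k / n                      # closed-form MLE from the slide

# Sanity check: grid search over the log-likelihood k*log(t) + (n-k)*log(1-t)
grid = np.linspace(1e-6, 1 - 1e-6, 10_001)
log_lik = k * np.log(grid) + (n - k) * np.log(1 - grid)
print(theta_mle, grid[np.argmax(log_lik)])   # the two estimates coincide
```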
MLE Recap - Gaussians
MLE: $\log P(D \mid \mu, \sigma) = -n \log(\sigma \sqrt{2\pi}) - \sum_{i=1}^n \frac{(x_i - \mu)^2}{2\sigma^2}$
$\hat{\mu}_{MLE} = \frac{1}{n} \sum_{i=1}^n x_i$, $\hat{\sigma}^2_{MLE} = \frac{1}{n} \sum_{i=1}^n (x_i - \hat{\mu}_{MLE})^2$
The MLE for the variance of a Gaussian is biased: $E[\hat{\sigma}^2_{MLE}] \neq \sigma^2$
Unbiased variance estimator: $\hat{\sigma}^2_{unbiased} = \frac{1}{n-1} \sum_{i=1}^n (x_i - \hat{\mu}_{MLE})^2$
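A small simulation (my own illustration, not from the slides) makes the bias visible: averaged over many repetitions, the $1/n$ estimator undershoots the true variance while the $1/(n-1)$ estimator does not.

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma2, n, trials = 0.0, 4.0, 5, 100_000   # illustrative values

x = rng.normal(mu, np.sqrt(sigma2), size=(trials, n))
mu_hat = x.mean(axis=1, keepdims=True)

var_mle = ((x - mu_hat) ** 2).mean(axis=1)               # divides by n
var_unbiased = ((x - mu_hat) ** 2).sum(axis=1) / (n - 1)  # divides by n - 1

print(var_mle.mean())        # ~ sigma2 * (n-1)/n = 3.2, i.e. biased low
print(var_unbiased.mean())   # ~ 4.0
```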
MLE Recap
Learning is:
- Collect some data, e.g., coin flips
- Choose a hypothesis class or model, e.g., binomial
- Choose a loss function, e.g., data likelihood
- Choose an optimization procedure, e.g., set derivative to zero to obtain the MLE
- Justify the accuracy of the estimate, e.g., Hoeffding's inequality
What about a prior?
Billionaire: Wait, I know that the coin is close to 50-50. What can you do for me now?
You say: I can learn it the Bayesian way.
Bayesian vs. Frequentist
Data: D. Estimator: $\hat{\theta} = t(D)$. Loss: $\ell(t(D), \theta)$.
Frequentists treat the unknown θ as fixed and the data D as random.
Bayesians treat the data D as fixed and the unknown θ as random.
Bayesian Learning
Use Bayes' rule: $P(\theta \mid D) = \frac{P(D \mid \theta)\, P(\theta)}{P(D)}$
Or equivalently: $P(\theta \mid D) \propto P(D \mid \theta)\, P(\theta)$
Bayesian Learning for Coins
The likelihood function is simply binomial: $P(D \mid \theta) = \theta^{\alpha_H} (1 - \theta)^{\alpha_T}$ for $\alpha_H$ heads and $\alpha_T$ tails
What about the prior?
- Represents expert knowledge
- Conjugate priors give a closed-form representation of the posterior
For the binomial likelihood, the conjugate prior is the Beta distribution.
Beta prior distribution
$P(\theta) = \frac{\theta^{\beta_H - 1} (1 - \theta)^{\beta_T - 1}}{B(\beta_H, \beta_T)} \sim \text{Beta}(\beta_H, \beta_T)$
Mean: $\frac{\beta_H}{\beta_H + \beta_T}$, Mode: $\frac{\beta_H - 1}{\beta_H + \beta_T - 2}$
[Figure: densities of Beta(2,3) and Beta(20,30)]
Likelihood function: $P(D \mid \theta) = \theta^{\alpha_H} (1 - \theta)^{\alpha_T}$
Posterior: $P(\theta \mid D) \propto P(D \mid \theta)\, P(\theta)$
Posterior distribution
Prior: $\text{Beta}(\beta_H, \beta_T)$. Data: $\alpha_H$ heads and $\alpha_T$ tails.
Posterior distribution: $P(\theta \mid D) \sim \text{Beta}(\beta_H + \alpha_H, \beta_T + \alpha_T)$
[Figure: densities of Beta(2,3) and Beta(20,30)]
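A sketch of the conjugate update in code (the pseudo-counts and observed counts below are illustrative values, not from the slides): the posterior is just another Beta with the counts added to the prior parameters.

```python
from scipy import stats

beta_H, beta_T = 2, 3          # prior pseudo-counts (illustrative)
alpha_H, alpha_T = 20, 27      # observed heads and tails (illustrative)

# Conjugacy: posterior is Beta(beta_H + alpha_H, beta_T + alpha_T)
posterior = stats.beta(beta_H + alpha_H, beta_T + alpha_T)
print(posterior.mean())          # posterior mean of theta
print(posterior.interval(0.95))  # central 95% credible interval for theta
```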
Using the Bayesian posterior
Posterior distribution: $P(\theta \mid D) \sim \text{Beta}(\beta_H + \alpha_H, \beta_T + \alpha_T)$
Bayesian inference: no longer a single parameter; quantities of interest are averaged over the posterior, e.g. $E[f(\theta) \mid D] = \int f(\theta)\, P(\theta \mid D)\, d\theta$
The integral is often hard to compute.
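To make the integral concrete, here is a sketch (my own illustration, with made-up posterior parameters) of one case where it is easy: for a Beta(a, b) posterior, the probability that the next flip is heads is the posterior mean $a/(a+b)$, and numerical quadrature recovers the same value.

```python
import numpy as np
from scipy import stats, integrate

a, b = 22, 30                     # illustrative posterior parameters
post = stats.beta(a, b)

pred_closed_form = a / (a + b)    # E[theta | D] for a Beta(a, b) posterior
pred_numeric, _ = integrate.quad(lambda t: t * post.pdf(t), 0.0, 1.0)
print(pred_closed_form, pred_numeric)   # the two agree
```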
MAP: Maximum a posteriori approximation
As more data is observed, the Beta posterior becomes more concentrated.
MAP: use the most likely parameter: $\hat{\theta}_{MAP} = \arg\max_\theta P(\theta \mid D)$
MAP for Beta distribution
MAP: use the most likely parameter:
$\hat{\theta}_{MAP} = \arg\max_\theta P(\theta \mid D) = \frac{\alpha_H + \beta_H - 1}{\alpha_H + \alpha_T + \beta_H + \beta_T - 2}$
The Beta prior is equivalent to extra coin flips.
As $N \to \infty$, the prior is forgotten.
But for small sample sizes, the prior is important!
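A short sketch of the formula above, with the prior acting as pseudo-flips (all numbers are illustrative, not from the slides): with only a few observations the MAP estimate stays near the prior, while the MLE follows the data alone.

```python
alpha_H, alpha_T = 3, 1    # observed heads / tails (deliberately small sample)
beta_H, beta_T = 20, 20    # strong prior near 0.5

theta_mle = alpha_H / (alpha_H + alpha_T)
theta_map = (alpha_H + beta_H - 1) / (alpha_H + alpha_T + beta_H + beta_T - 2)

print(theta_mle)   # 0.75: driven entirely by 4 flips
print(theta_map)   # ~0.52: the prior dominates at this sample size
```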
Recap for Bayesian learning
Learning is:
- Collect some data, e.g., coin flips
- Choose a hypothesis class or model, e.g., binomial, plus a prior based on expert knowledge
- Choose a loss function, e.g., parameter posterior likelihood
- Choose an optimization procedure, e.g., set derivative to zero to obtain the MAP estimate
- Justify the accuracy of the estimate, e.g., if the model is correct, you are doing the best possible
Recap for Bayesian learning
Bayesians are optimists: if we model it correctly, we output the most likely answer.
This assumes one can accurately model the observations and their link to the unknown parameter θ, i.e. $p(x \mid \theta)$, and the distribution/structure of the unknown θ, i.e. $p(\theta)$.
Frequentists are pessimists: all models are wrong; prove to me your estimate is good.
This makes very few assumptions, e.g. $E[X^2] < \infty$, constructs an estimator (e.g., median of means of disjoint subsets of the data), and proves a guarantee $E[(\hat{\theta} - \theta)^2] \leq \epsilon$ under any hypothetical true θ.
Linear Regression. Machine Learning CSE546, Kevin Jamieson, University of Washington. Oct 3, 2017.
The regression problem
Given past sales data on zillow.com, predict: y = house sale price from x = {# sq. ft., zip code, date of sale, etc.}
Training data: $\{(x_i, y_i)\}_{i=1}^n$, $x_i \in \mathbb{R}^d$, $y_i \in \mathbb{R}$
[Figure: scatter plot of sale price vs. # square feet]
The regression problem
Given past sales data on zillow.com, predict: y = house sale price from x = {# sq. ft., zip code, date of sale, etc.}
Training data: $\{(x_i, y_i)\}_{i=1}^n$, $x_i \in \mathbb{R}^d$, $y_i \in \mathbb{R}$
Hypothesis: linear, $y_i \approx x_i^T w$
Loss: least squares, $\min_w \sum_{i=1}^n (y_i - x_i^T w)^2$
[Figure: sale price vs. # square feet with the best linear fit]
The regression problem in matrix notation
$y = \begin{bmatrix} y_1 \\ \vdots \\ y_n \end{bmatrix}$, $X = \begin{bmatrix} x_1^T \\ \vdots \\ x_n^T \end{bmatrix}$
$\hat{w}_{LS} = \arg\min_w \sum_{i=1}^n (y_i - x_i^T w)^2 = \arg\min_w (y - Xw)^T (y - Xw)$
The regression problem in matrix notation
$\hat{w}_{LS} = \arg\min_w \|y - Xw\|_2^2 = \arg\min_w (y - Xw)^T (y - Xw)$
Setting the gradient to zero, $\nabla_w \|y - Xw\|_2^2 = -2 X^T (y - Xw) = 0$, gives the normal equations $X^T X w = X^T y$.
The regression problem in matrix notation
$\hat{w}_{LS} = \arg\min_w \|y - Xw\|_2^2 = (X^T X)^{-1} X^T y$
What about an offset?
$\hat{w}_{LS}, \hat{b}_{LS} = \arg\min_{w,b} \sum_{i=1}^n \big(y_i - (x_i^T w + b)\big)^2 = \arg\min_{w,b} \|y - (Xw + \mathbf{1}b)\|_2^2$
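A minimal sketch of the closed-form solution, handling the offset by appending a column of ones to X (the data and variable names are my own illustration, not from the slides).

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 3
X = rng.normal(size=(n, d))
w_true, b_true = np.array([1.0, -2.0, 0.5]), 4.0
y = X @ w_true + b_true + 0.1 * rng.normal(size=n)

X_aug = np.hstack([X, np.ones((n, 1))])              # last column models the offset b
sol = np.linalg.solve(X_aug.T @ X_aug, X_aug.T @ y)  # (X^T X)^{-1} X^T y, solved stably
w_ls, b_ls = sol[:-1], sol[-1]
print(w_ls, b_ls)   # close to w_true and b_true
```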
Dealing with an offset
$\hat{w}_{LS}, \hat{b}_{LS} = \arg\min_{w,b} \|y - (Xw + \mathbf{1}b)\|_2^2$
Setting derivatives to zero:
$X^T X \hat{w}_{LS} + \hat{b}_{LS} X^T \mathbf{1} = X^T y$
$\mathbf{1}^T X \hat{w}_{LS} + \hat{b}_{LS}\, \mathbf{1}^T \mathbf{1} = \mathbf{1}^T y$
If $X^T \mathbf{1} = 0$ (i.e., if each feature is mean-zero), then $\hat{w}_{LS} = (X^T X)^{-1} X^T y$ and $\hat{b}_{LS} = \frac{1}{n} \sum_{i=1}^n y_i$
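An equivalent sketch using the centering identity above (again my own illustrative data): after subtracting the column means, $X^T \mathbf{1} = 0$ and the two equations decouple.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3)) + 5.0   # deliberately not mean-zero
y = X @ np.array([1.0, -2.0, 0.5]) + 4.0 + 0.1 * rng.normal(size=200)

x_bar, y_bar = X.mean(axis=0), y.mean()
Xc, yc = X - x_bar, y - y_bar          # now Xc.T @ 1 = 0

w_ls = np.linalg.solve(Xc.T @ Xc, Xc.T @ yc)
b_ls = y_bar - x_bar @ w_ls            # reduces to y_bar when X is already mean-zero
print(w_ls, b_ls)
```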
The regression problem in matrix notation
$\hat{w}_{LS} = \arg\min_w \|y - Xw\|_2^2 = (X^T X)^{-1} X^T y$
But why least squares?
Consider $y_i = x_i^T w + \epsilon_i$ where $\epsilon_i \overset{i.i.d.}{\sim} \mathcal{N}(0, \sigma^2)$
$P(y \mid x, w, \sigma) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(y - x^T w)^2}{2\sigma^2}}$
Maximizing log-likelihood
Maximize: $\log P(D \mid w, \sigma) = \log\left[\left(\frac{1}{\sigma\sqrt{2\pi}}\right)^{n} \prod_{i=1}^n e^{-\frac{(y_i - x_i^T w)^2}{2\sigma^2}}\right]$
MLE is LS under the linear model
$\hat{w}_{LS} = \arg\min_w \sum_{i=1}^n (y_i - x_i^T w)^2$
$\hat{w}_{MLE} = \arg\max_w P(D \mid w, \sigma)$ if $y_i = x_i^T w + \epsilon_i$ and $\epsilon_i \overset{i.i.d.}{\sim} \mathcal{N}(0, \sigma^2)$
$\hat{w}_{LS} = \hat{w}_{MLE} = (X^T X)^{-1} X^T y$
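A sketch checking the equivalence numerically (my own illustration): minimizing the Gaussian negative log-likelihood in w, with σ held fixed, returns the same w as the least-squares closed form.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + 0.3 * rng.normal(size=100)

def neg_log_lik(w, sigma=0.3):
    # Up to constants in sigma, the negative log-likelihood is the squared error.
    r = y - X @ w
    return 0.5 * np.sum(r ** 2) / sigma ** 2

w_mle = minimize(neg_log_lik, np.zeros(3)).x
w_ls = np.linalg.solve(X.T @ X, X.T @ y)
print(np.allclose(w_mle, w_ls, atol=1e-4))   # True
```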
The regression problem
Given past sales data on zillow.com, predict: y = house sale price from x = {# sq. ft., zip code, date of sale, etc.}
Training data: $\{(x_i, y_i)\}_{i=1}^n$, $x_i \in \mathbb{R}^d$, $y_i \in \mathbb{R}$
Hypothesis: linear, $y_i \approx x_i^T w$
Loss: least squares, $\min_w \sum_{i=1}^n (y_i - x_i^T w)^2$
[Figure: sale price vs. date of sale with the best linear fit]
The regression problem
Training data: $\{(x_i, y_i)\}_{i=1}^n$, $x_i \in \mathbb{R}^d$, $y_i \in \mathbb{R}$
Hypothesis: linear, $y_i \approx x_i^T w$
Loss: least squares, $\min_w \sum_{i=1}^n (y_i - x_i^T w)^2$
Transformed data: $h : \mathbb{R}^d \to \mathbb{R}^p$ maps the original features to a rich, possibly high-dimensional space
$h(x) = \begin{bmatrix} h_1(x) \\ h_2(x) \\ \vdots \\ h_p(x) \end{bmatrix}$, e.g. in d = 1: $h(x) = \begin{bmatrix} x \\ x^2 \\ \vdots \\ x^p \end{bmatrix}$
For d > 1, generate $\{u_j\}_{j=1}^p \subset \mathbb{R}^d$ and use, e.g., $h_j(x) = \frac{1}{1 + \exp(u_j^T x)}$, $h_j(x) = (u_j^T x)^2$, or $h_j(x) = \cos(u_j^T x)$
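A sketch of the feature maps listed above: a polynomial map for d = 1 and random-direction features for d > 1 (the $u_j$ below are drawn at random purely for illustration, and all names are my own).

```python
import numpy as np

def h_poly(x, p):
    """d = 1: h(x) = [x, x^2, ..., x^p]."""
    return np.column_stack([x ** j for j in range(1, p + 1)])

def h_random(X, p, kind="cos", seed=0):
    """d > 1: h_j(x) built from random directions u_j in R^d."""
    rng = np.random.default_rng(seed)
    U = rng.normal(size=(p, X.shape[1]))   # rows are the u_j
    Z = X @ U.T                            # n x p matrix of u_j^T x_i
    if kind == "cos":
        return np.cos(Z)
    if kind == "sigmoid":
        return 1.0 / (1.0 + np.exp(Z))     # matches the slide's 1 / (1 + exp(u^T x))
    return Z ** 2                          # squared features (u_j^T x)^2

x1d = np.linspace(-1, 1, 5)
print(h_poly(x1d, 3).shape)                                        # (5, 3)
print(h_random(np.random.default_rng(1).normal(size=(5, 4)), p=10).shape)  # (5, 10)
```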
The regression problem
Training data: $\{(x_i, y_i)\}_{i=1}^n$, $x_i \in \mathbb{R}^d$, $y_i \in \mathbb{R}$
Transformed data: $h(x) = \begin{bmatrix} h_1(x) \\ h_2(x) \\ \vdots \\ h_p(x) \end{bmatrix}$
Hypothesis: linear, $y_i \approx h(x_i)^T w$ with $w \in \mathbb{R}^p$ (replacing $y_i \approx x_i^T w$)
Loss: least squares, $\min_w \sum_{i=1}^n (y_i - h(x_i)^T w)^2$
The regression problem
Training data: $\{(x_i, y_i)\}_{i=1}^n$, $x_i \in \mathbb{R}^d$, $y_i \in \mathbb{R}$
Transformed data: $h(x) = \begin{bmatrix} h_1(x) \\ h_2(x) \\ \vdots \\ h_p(x) \end{bmatrix}$
Hypothesis: linear, $y_i \approx h(x_i)^T w$, $w \in \mathbb{R}^p$
Loss: least squares, $\min_w \sum_{i=1}^n (y_i - h(x_i)^T w)^2$
[Figures: sale price vs. date of sale with the best linear fit, a small-p fit, and a large-p fit]
What's going on here?
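A small sketch (my own illustration, with synthetic data) of what the small-p and large-p figures suggest: with polynomial features, small p underfits while very large p chases the noise, which typically shows up as a much larger error on held-out points.

```python
import numpy as np

rng = np.random.default_rng(0)
x_train = np.sort(rng.uniform(-1, 1, 30))
x_test = np.sort(rng.uniform(-1, 1, 200))
f = lambda x: np.sin(3 * x)                         # hypothetical "true" signal
y_train = f(x_train) + 0.2 * rng.normal(size=x_train.size)
y_test = f(x_test) + 0.2 * rng.normal(size=x_test.size)

def fit_and_eval(p):
    H = lambda x: np.vander(x, p + 1, increasing=True)   # [1, x, ..., x^p]
    w, *_ = np.linalg.lstsq(H(x_train), y_train, rcond=None)
    train_err = np.mean((H(x_train) @ w - y_train) ** 2)
    test_err = np.mean((H(x_test) @ w - y_test) ** 2)
    return train_err, test_err

for p in (1, 5, 25):
    print(p, fit_and_eval(p))   # train error falls with p; test error eventually grows
```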