Bayesian Methods
Machine Learning - CSE546, Carlos Guestrin
University of Washington, September 30, 2013

What about the prior?
- Billionaire says: "Wait, I know that the thumbtack is close to 50-50. What can you do for me now?"
- You say: "I can learn it the Bayesian way."
- Rather than estimating a single θ, we obtain a distribution over possible values of θ
Bayesian Learning
- Use Bayes' rule:
  P(θ | D) = P(D | θ) P(θ) / P(D)
- Or equivalently:
  P(θ | D) ∝ P(D | θ) P(θ)

Bayesian Learning for Thumbtack
- Likelihood function is simply Binomial:
  P(D | θ) = θ^{α_H} (1 − θ)^{α_T}, where α_H is the number of heads and α_T the number of tails
- What about the prior?
  - Represents expert knowledge
  - Should give a simple posterior form
- Conjugate priors:
  - Closed-form representation of the posterior
  - For the Binomial, the conjugate prior is the Beta distribution
Beta prior distribution P(θ)
- P(θ) = θ^{β_H−1} (1 − θ)^{β_T−1} / B(β_H, β_T) ~ Beta(β_H, β_T)
  - Mean: β_H / (β_H + β_T)
  - Mode: (β_H − 1) / (β_H + β_T − 2)
  - Examples: Beta(2,3), Beta(20,30)
- Likelihood function: P(D | θ) = θ^{α_H} (1 − θ)^{α_T}
- Posterior: P(θ | D) ∝ θ^{β_H+α_H−1} (1 − θ)^{β_T+α_T−1}

Posterior distribution
- Prior: Beta(β_H, β_T)
- Data: α_H heads and α_T tails
- Posterior distribution: P(θ | D) ~ Beta(β_H + α_H, β_T + α_T)
  - e.g., a Beta(2,3) prior updated with 18 heads and 27 tails gives a Beta(20,30) posterior
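The conjugate update above can be sketched in a few lines of Python; the function and variable names here are mine, not the lecture's:

```python
def beta_posterior(prior_h, prior_t, heads, tails):
    """Conjugate update: a Beta(prior_h, prior_t) prior with Binomial
    data (heads, tails) gives a Beta(prior_h + heads, prior_t + tails)
    posterior."""
    return prior_h + heads, prior_t + tails

def beta_mean(a, b):
    return a / (a + b)

def beta_mode(a, b):
    # Mode is only defined for a, b > 1
    return (a - 1) / (a + b - 2)

# A Beta(2, 3) prior updated with 18 heads and 27 tails -> Beta(20, 30)
a, b = beta_posterior(2, 3, 18, 27)
print(a, b, beta_mean(a, b))  # 20 30 0.4
```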
Using the Bayesian posterior
- Posterior distribution: P(θ | D) ~ Beta(β_H + α_H, β_T + α_T)
- Bayesian inference:
  - No longer a single parameter: P(x | D) = ∫ P(x | θ) P(θ | D) dθ
  - The integral is often hard to compute

MAP: Maximum a posteriori approximation
- As more data is observed, the Beta posterior becomes more certain (more peaked)
- MAP: use the most likely parameter:
  θ̂ = argmax_θ P(θ | D), and approximate P(x | D) ≈ P(x | θ̂)
MAP for Beta distribution
- MAP: use the most likely parameter:
  θ̂ = argmax_θ P(θ | D) = (β_H + α_H − 1) / (β_H + β_T + α_H + α_T − 2)
- The Beta prior is equivalent to extra thumbtack flips
- As N → ∞, the prior is forgotten
- But for small sample sizes, the prior is important!

Linear Regression
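A small sketch contrasting the MLE and MAP estimates for the thumbtack; the Beta(50, 50) prior and the flip counts are made-up numbers for illustration:

```python
def theta_mle(heads, tails):
    """Maximum likelihood estimate: fraction of heads."""
    return heads / (heads + tails)

def theta_map(heads, tails, prior_h, prior_t):
    """Mode of the Beta(prior_h + heads, prior_t + tails) posterior."""
    return (prior_h + heads - 1) / (prior_h + prior_t + heads + tails - 2)

# Small sample: the prior pulls the estimate toward 0.5
print(theta_mle(3, 1))                 # 0.75
print(theta_map(3, 1, 50, 50))         # ~0.51: prior dominates
# Large sample: the prior is forgotten
print(theta_map(3000, 1000, 50, 50))   # ~0.74: close to the MLE
```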
Prediction of continuous variables
- Billionaire says: "Wait, that's not what I meant!"
- You say: "Chill out, dude."
- He says: "I want to predict a continuous variable from continuous inputs: I want to predict salaries from GPA."
- You say: "I can regress that."

The regression problem
- Instances: <x_j, t_j>
- Learn: a mapping from x to t(x)
- Hypothesis space:
  - Given basis functions {h_1, …, h_k}
  - Find coefficients w = {w_1, …, w_k} such that t(x) ≈ Σ_i w_i h_i(x)
- Why is this called linear regression? Because the model is linear in the parameters w
- Precisely, minimize the residual squared error:
  w* = argmin_w Σ_j (t(x_j) − Σ_i w_i h_i(x_j))²
The regression problem in matrix notation
- Stack the data: t ≈ H w, where
  - H is the N × K matrix with entries H_{jk} = h_k(x_j) (N data points, K basis functions)
  - w is the K × 1 vector of weights
  - t is the N × 1 vector of observations
- Minimize the residual: w* = argmin_w (t − Hw)ᵀ(t − Hw)

Minimizing the Residual
- Set the gradient to zero: ∇_w (t − Hw)ᵀ(t − Hw) = −2Hᵀ(t − Hw) = 0
Regression solution = simple matrix operations
- w* = (HᵀH)⁻¹ Hᵀ t
  - HᵀH is a k × k matrix for k basis functions
  - Hᵀt is a k × 1 vector

But, why?
- Billionaire (again) says: "Why sum squared error???"
- You say: "Gaussians, Dr. Gateson, Gaussians…"
- Model: prediction is a linear function plus Gaussian noise:
  t(x) = Σ_i w_i h_i(x) + ε_x, with ε_x ~ N(0, σ²)
- Learn w using MLE
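A minimal numpy sketch of this closed-form solution; the toy data, the basis choice (a constant and the identity), and the true weights are my own:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=100)
t = 2.0 * x - 1.0 + 0.1 * rng.standard_normal(100)  # truth: w = (-1, 2)

# Basis functions h_1(x) = 1, h_2(x) = x, so H is N x K
H = np.column_stack([np.ones_like(x), x])

# Normal equations w = (H^T H)^{-1} H^T t; solve the k x k system
# rather than inverting explicitly, for numerical stability
w = np.linalg.solve(H.T @ H, H.T @ t)
print(w)  # close to [-1, 2]
```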
Maximizing log-likelihood
- Maximize:
  ln P(D | w, σ) = ln Π_j [1/(σ√(2π))] exp(−(t_j − Σ_i w_i h_i(x_j))² / (2σ²))
- Least-squares linear regression is MLE for Gaussians!!!

Announcements
- Go to recitation!! Tuesday, 5:30pm in LOW 101
- First homework will go out today
  - Due on October 14
  - Start early!!
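Spelling out the step the slide asserts, under the Gaussian noise model from the previous slide:

```latex
\ln P(D \mid \mathbf{w}, \sigma)
  = \ln \prod_{j=1}^{N} \frac{1}{\sigma\sqrt{2\pi}}
      \exp\!\left( -\frac{\bigl(t_j - \sum_i w_i h_i(\mathbf{x}_j)\bigr)^2}{2\sigma^2} \right)
  = -N \ln\!\bigl(\sigma\sqrt{2\pi}\bigr)
    - \frac{1}{2\sigma^2} \sum_{j=1}^{N} \Bigl(t_j - \sum_i w_i h_i(\mathbf{x}_j)\Bigr)^{2}.
```

The first term does not depend on w, so maximizing the log-likelihood over w is exactly minimizing the residual sum of squares.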
Bias-Variance Tradeoff

Bias-Variance tradeoff: Intuition
- Model too simple → does not fit the data well
  - A biased solution
- Model too complex → small changes to the data change the solution a lot
  - A high-variance solution
(Squared) Bias of learner
- Given a dataset D with N samples, learn the function h_D(x)
- If you sample a different dataset D' with N samples, you will learn a different h_D'(x)
- Expected hypothesis: E_D[h_D(x)]
- Bias: difference between what you expect to learn and the truth
  - Measures how well you expect to represent the true solution
  - Decreases with a more complex model
  - Bias² at one point x: (t(x) − E_D[h_D(x)])²
  - Average bias²: E_x[(t(x) − E_D[h_D(x)])²]

Variance of learner
- Given a dataset D with N samples, learn the function h_D(x)
- If you sample a different dataset D' with N samples, you will learn a different h_D'(x)
- Variance: difference between what you expect to learn and what you learn from a particular dataset
  - Measures how sensitive the learner is to the specific dataset
  - Decreases with a simpler model
  - Variance at one point x: E_D[(h_D(x) − E_D[h_D(x)])²]
  - Average variance: E_x[E_D[(h_D(x) − E_D[h_D(x)])²]]
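These two quantities can be estimated by simulation over many resampled datasets; the true function sin(x), the noise level, the sample sizes, and the polynomial degrees are all my choices for the toy example:

```python
import numpy as np

rng = np.random.default_rng(1)
x_grid = np.linspace(0, np.pi, 50)   # points at which we evaluate bias/variance
truth = np.sin(x_grid)               # the "true" t(x) for this toy problem

def fit_one_dataset(degree, n=20):
    """Sample one dataset D of size n and return h_D evaluated on x_grid."""
    x = rng.uniform(0, np.pi, n)
    t = np.sin(x) + 0.3 * rng.standard_normal(n)
    return np.polyval(np.polyfit(x, t, degree), x_grid)

for degree in (1, 3):
    fits = np.array([fit_one_dataset(degree) for _ in range(300)])
    expected_h = fits.mean(axis=0)                  # E_D[h_D(x)]
    avg_bias2 = np.mean((truth - expected_h) ** 2)  # average bias^2
    avg_var = np.mean(fits.var(axis=0))             # average variance
    print(degree, round(avg_bias2, 4), round(avg_var, 4))
```

As the slides predict, the simpler degree-1 model shows more bias, while the degree-3 model shows more variance.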
Bias-Variance Tradeoff
- Choice of hypothesis class introduces learning bias
  - More complex class → less bias
  - More complex class → more variance

Bias-Variance Decomposition of Error
- Define the expected hypothesis: h̄_N(x) = E_D[h_D(x)]
- Expected mean squared error:
  MSE = E_D[ E_x[ (t(x) − h_D(x))² ] ]
- To simplify the derivation, drop x
- Expand the square around h̄_N
Moral of the Story: the Bias-Variance Tradeoff is Key in ML
- Error can be decomposed:
  MSE = E_D[ E_x[ (t(x) − h_D(x))² ] ]
      = E_x[ (t(x) − h̄_N(x))² ] + E_D[ E_x[ (h̄_N(x) − h_D(x))² ] ]
  (bias² + variance, with h̄_N(x) = E_D[h_D(x)])
- Choice of hypothesis class introduces learning bias
  - More complex class → less bias
  - More complex class → more variance

What you need to know
- Regression
  - Basis function = features
  - Optimizing sum squared error
  - Relationship between regression and Gaussians
- Bias-variance trade-off
- Play with the applet
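The expansion the slides refer to, with x dropped and h̄_N = E_D[h_D]; the cross term vanishes because E_D[h̄_N − h_D] = 0:

```latex
\begin{aligned}
E_D\!\left[(t - h_D)^2\right]
  &= E_D\!\left[\bigl((t - \bar h_N) + (\bar h_N - h_D)\bigr)^2\right] \\
  &= (t - \bar h_N)^2
     + 2\,(t - \bar h_N)\,\underbrace{E_D[\bar h_N - h_D]}_{=0}
     + E_D\!\left[(\bar h_N - h_D)^2\right] \\
  &= \underbrace{(t - \bar h_N)^2}_{\text{bias}^2}
     + \underbrace{E_D\!\left[(\bar h_N - h_D)^2\right]}_{\text{variance}}.
\end{aligned}
```

Taking E_x of both sides gives the decomposition on the slide.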
Overfitting

Bias-Variance Tradeoff
- Choice of hypothesis class introduces learning bias
  - More complex class → less bias
  - More complex class → more variance
Training set error
- Given a dataset (training data)
- Choose a loss function, e.g., squared error (L₂) for regression
- Training set error: for a particular set of parameters w, the loss function on the training data:
  error_train(w) = (1/N_train) Σ_{j=1}^{N_train} (t(x_j) − Σ_i w_i h_i(x_j))²

Training set error as a function of model complexity
Prediction error
- Training set error can be a poor measure of the quality of the solution
- Prediction error: we really care about the error over all possible input points, not just the training data:
  error_true(w) = E_x[(t(x) − Σ_i w_i h_i(x))²] = ∫ (t(x) − Σ_i w_i h_i(x))² p(x) dx

Prediction error as a function of model complexity
Computing prediction error
- Computing the prediction error requires a hard integral, and we may not know t(x) for every x
- Monte Carlo integration (sampling approximation):
  - Sample a set of i.i.d. points {x_1, …, x_M} from p(x)
  - Approximate the integral with the sample average:
    error_true(w) ≈ (1/M) Σ_{j=1}^{M} (t(x_j) − Σ_i w_i h_i(x_j))²

Why doesn't training set error approximate prediction error?
- Sampling approximation of prediction error:
  error_true(w) ≈ (1/M) Σ_j (t(x_j) − Σ_i w_i h_i(x_j))², with x_j sampled from p(x)
- Training error:
  error_train(w) = (1/N_train) Σ_j (t(x_j) − Σ_i w_i h_i(x_j))², over the training points
- Very similar equations!!! So why is the training set a bad measure of prediction error???
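A sketch of the Monte Carlo approximation for a fixed predictor; the true function, the predictor, and p(x) = Uniform(0, π) are all assumptions of the toy example (the predictor is fixed in advance, not fit to these samples):

```python
import numpy as np

rng = np.random.default_rng(2)

def true_t(x):
    return np.sin(x)   # assume t(x) is known for the toy example

def predictor(x):
    return 0.8 * x     # some fixed w, chosen in advance

# Monte Carlo integration: sample x_1..x_M i.i.d. from p(x) = Uniform(0, pi)
M = 100_000
xs = rng.uniform(0, np.pi, M)
mc_error = np.mean((true_t(xs) - predictor(xs)) ** 2)
print(mc_error)  # sample average approximating the prediction-error integral
```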
Why doesn't training set error approximate prediction error?
- Because you cheated!!!
  - The training error is a good estimate of prediction error for a single, fixed w
  - But you optimized w with respect to the training error, finding a w that is good for this particular set of samples
  - So the training error is an (optimistically) biased estimate of the prediction error

Test set error
- Given a dataset, randomly split it into two parts:
  - Training data: {x_1, …, x_{Ntrain}}
  - Test data: {x_1, …, x_{Ntest}}
- Use the training data to optimize the parameters w
- Test set error: for the final output ŵ, evaluate the error using the test data:
  error_test(ŵ) = (1/N_test) Σ_{j=1}^{N_test} (t(x_j) − Σ_i ŵ_i h_i(x_j))²
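A sketch of the split-and-evaluate recipe; the dataset, the degree-12 polynomial model, and the 20/40 split are all my choices, picked so the optimism of the training error is visible:

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.uniform(-1, 1, 60)
t = np.sin(3 * x) + 0.2 * rng.standard_normal(60)

# Randomly split into training and test parts
idx = rng.permutation(60)
train, test = idx[:20], idx[20:]

# Optimize w on the training data only (deliberately overfit: degree 12)
coefs = np.polyfit(x[train], t[train], 12)

err_train = np.mean((t[train] - np.polyval(coefs, x[train])) ** 2)
err_test = np.mean((t[test] - np.polyval(coefs, x[test])) ** 2)
print(err_train, err_test)  # training error is optimistically low
```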
Test set error as a function of model complexity

Overfitting
- Overfitting: a learning algorithm overfits the training data if it outputs a solution w when there exists another solution w' such that:
  error_train(w) < error_train(w')  and  error_true(w') < error_true(w)
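A sketch of this definition in action: sweeping the polynomial degree typically produces a w (high degree) and a w' (low degree) with exactly this relationship. The quadratic truth and the dense grid standing in for the true error are my choices:

```python
import numpy as np

rng = np.random.default_rng(4)
x_train = rng.uniform(-1, 1, 15)
t_train = x_train ** 2 + 0.1 * rng.standard_normal(15)

x_dense = np.linspace(-1, 1, 500)   # dense grid standing in for error_true
t_dense = x_dense ** 2

for degree in (1, 2, 10):
    c = np.polyfit(x_train, t_train, degree)
    err_train = np.mean((t_train - np.polyval(c, x_train)) ** 2)
    err_true = np.mean((t_dense - np.polyval(c, x_dense)) ** 2)
    print(degree, round(err_train, 5), round(err_true, 5))
# Training error keeps falling with degree, while true error
# typically rises again once the model starts fitting the noise
```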
How many points do I use for training/testing?
- Very hard question to answer!
  - Too few training points: the learned w is bad
  - Too few test points: you never know whether you reached a good solution
- Bounds, such as Hoeffding's inequality, can help:
  P(|error_true(w) − error_test(w)| ≥ ε) ≤ 2 e^{−2 N_test ε²}  (for a loss bounded in [0,1])
- More on this later this quarter, but it is still hard to answer
- Typically:
  - If you have a reasonable amount of data, pick a test set large enough for a reasonable estimate of the error, and use the rest for learning
  - If you have little data, you need to pull out the big guns, e.g., bootstrapping

Error estimators
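Solving the bound for the test-set size: requiring 2·exp(−2Nε²) ≤ δ gives N ≥ ln(2/δ)/(2ε²). A tiny helper, assuming (as Hoeffding requires) that the loss is bounded in [0, 1]:

```python
import math

def hoeffding_test_size(eps, delta):
    """Smallest N_test such that 2 * exp(-2 * N * eps**2) <= delta,
    i.e. |error_true - error_test| <= eps holds with probability
    at least 1 - delta, for a loss bounded in [0, 1]."""
    return math.ceil(math.log(2 / delta) / (2 * eps ** 2))

# Estimate the true error to within 0.05, with 95% confidence:
print(hoeffding_test_size(0.05, 0.05))  # 738
```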
Error as a function of the number of training examples, for a fixed model complexity
[plot: with little data the error estimates are unreliable; as the amount of data grows toward infinite data, training and test error converge]

Error estimators: Be careful!!!
- The test set error is only unbiased if you never, never, ever, ever do any learning on the test data
- For example, if you use the test set to select the degree of the polynomial, it is no longer unbiased!!!
  - (We will address this problem later in the quarter)
What you need to know
- True error, training error, test error
  - Never learn on the test data
  - Never learn on the test data
  - Never learn on the test data
  - Never learn on the test data
  - Never learn on the test data
- Overfitting