Linear Regression
Machine Learning CSE546, Carlos Guestrin, University of Washington
September 30, 2014

What about continuous variables?
- Billionaire says: If I am measuring a continuous variable, what can you do for me?
- You say: Let me tell you about Gaussians...

Some properties of Gaussians
- Affine transformation (multiplying by a scalar and adding a constant):
  if X ~ N(µ, σ²) and Y = aX + b, then Y ~ N(aµ + b, a²σ²)
- Sum of (independent) Gaussians:
  if X ~ N(µ_X, σ_X²) and Y ~ N(µ_Y, σ_Y²), then Z = X + Y gives Z ~ N(µ_X + µ_Y, σ_X² + σ_Y²)

Learning a Gaussian
- Collect a bunch of data: hopefully i.i.d. samples, e.g., exam scores
- Learn parameters: mean and variance
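
A minimal sketch of "learning a Gaussian" (assuming NumPy; the exam-score parameters are made up for illustration). It also sanity-checks the affine-transformation property above:

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 75.0, 10.0                 # hypothetical "exam score" parameters
x = rng.normal(mu, sigma, size=1000)   # i.i.d. samples from N(mu, sigma^2)

mu_hat = x.mean()    # MLE for the mean: the sample average
var_hat = x.var()    # MLE for the variance (NumPy's default divides by N)
print(mu_hat, var_hat)

# Affine-transformation property: Y = aX + b  =>  Y ~ N(a*mu + b, a^2 * sigma^2)
a, b = 2.0, 5.0
y = a * x + b
print(y.mean(), y.var())   # approx. a*mu + b = 155 and a^2 * sigma^2 = 400
```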

MLE for Gaussian
- Probability of i.i.d. samples D = {x_1, ..., x_N}:
  P(D | µ, σ) = (1 / (σ√(2π)))^N ∏_i exp(-(x_i - µ)² / 2σ²)
- Log-likelihood of the data:
  ln P(D | µ, σ) = -N ln(σ√(2π)) - Σ_i (x_i - µ)² / 2σ²

Your second learning algorithm: MLE for the mean of a Gaussian
- What's the MLE for the mean? Set the derivative of the log-likelihood to zero:
  µ_MLE = (1/N) Σ_i x_i

MLE for variance
- Again, set the derivative to zero:
  σ²_MLE = (1/N) Σ_i (x_i - µ_MLE)²

Learning Gaussian parameters
- MLE: µ_MLE = (1/N) Σ_i x_i and σ²_MLE = (1/N) Σ_i (x_i - µ_MLE)²
- BTW: the MLE for the variance of a Gaussian is biased (the expected result of estimation is not the true parameter!)
- Unbiased variance estimator: σ²_unbiased = (1/(N-1)) Σ_i (x_i - µ_MLE)²
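
A quick empirical check of that bias, assuming NumPy (the particular σ² and N are arbitrary): average both estimators over many small datasets and compare against the true variance.

```python
import numpy as np

rng = np.random.default_rng(1)
sigma2, N, trials = 4.0, 5, 100_000

mle, unbiased = [], []
for _ in range(trials):
    x = rng.normal(0.0, np.sqrt(sigma2), size=N)
    mle.append(x.var(ddof=0))       # divides by N:   biased low
    unbiased.append(x.var(ddof=1))  # divides by N-1: unbiased

print(np.mean(mle))        # approx. (N-1)/N * sigma^2 = 3.2
print(np.mean(unbiased))   # approx. sigma^2 = 4.0
```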

Prediction of continuous variables
- Billionaire sayz: Wait, that's not what I meant!
- You sayz: Chill out, dude.
- He sayz: I want to predict a continuous variable for continuous inputs: I want to predict salaries from GPA.
- You sayz: I can regress that...

The regression problem
- Instances: <x_j, t_j>
- Learn: a mapping from x to t(x)
- Hypothesis space:
  - Given basis functions {h_1, ..., h_k}
  - Find coefficients w = {w_1, ..., w_k} so that t(x) ≈ Σ_i w_i h_i(x)
- Why is this called linear regression??? Because the model is linear in the parameters.
- Precisely: minimize the residual squared error:
  w* = argmin_w Σ_j (t(x_j) - Σ_i w_i h_i(x_j))²

The regression problem in matrix notation
- Stack the N data points and k basis functions into matrix form:
  H is the N × k matrix with entries H_ji = h_i(x_j), w is the k × 1 weight vector, and t is the N × 1 vector of observations
- Residual squared error: (Hw - t)^T (Hw - t)

Minimizing the residual
- Set the gradient with respect to w to zero:
  ∇_w (Hw - t)^T (Hw - t) = 2 H^T (Hw - t) = 0

Regression solution = simple matrix operations
- w* = (H^T H)^{-1} H^T t
  where H^T H is a k × k matrix for k basis functions, and H^T t is a k × 1 vector

But, why?
- Billionaire (again) says: Why sum squared error???
- You say: Gaussians, Dr. Gateson, Gaussians...
- Model: prediction is a linear function plus Gaussian noise:
  t(x) = Σ_i w_i h_i(x) + ε,  with ε ~ N(0, σ²)
- Learn w using MLE
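
A minimal sketch of the closed-form solution, assuming NumPy; here the basis functions h_i are taken to be polynomial powers of x, and the sin target with Gaussian noise is an invented toy problem:

```python
import numpy as np

rng = np.random.default_rng(2)
N, k = 50, 4
x = rng.uniform(-1, 1, size=N)
t = np.sin(np.pi * x) + rng.normal(0, 0.1, size=N)  # targets with Gaussian noise

H = np.vander(x, k, increasing=True)    # N x k matrix: H[j, i] = h_i(x_j) = x_j**i
w = np.linalg.solve(H.T @ H, H.T @ t)   # solve the k x k normal equations
print(w)

# In practice np.linalg.lstsq(H, t, rcond=None) is preferred: it avoids
# explicitly forming H^T H, which can be poorly conditioned.
```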

Maximizing the log-likelihood
- Maximize:
  ln P(D | w, σ) = -N ln(σ√(2π)) - Σ_j (t(x_j) - Σ_i w_i h_i(x_j))² / 2σ²
- Maximizing over w is exactly minimizing the residual squared error:
  least-squares linear regression is MLE for Gaussians!!!

Announcements
- Go to recitation!! :) Wednesday, 5pm in EEB 045
- First homework will go out today
  - Due on October 14
  - Start early!!

Bias-Variance Tradeoff
Machine Learning CSE546, Carlos Guestrin, University of Washington
September 30, 2014

Bias-variance tradeoff: intuition
- Model too simple ⇒ does not fit the data well: a biased solution
- Model too complex ⇒ small changes to the data make the solution change a lot: a high-variance solution

(Squared) bias of learner
- Given a dataset D with N samples, learn function h_D(x)
- If you sample a different dataset D' with N samples, you will learn a different h_D'(x)
- Expected hypothesis: E_D[h_D(x)]
- Bias: the difference between what you expect to learn and the truth
  - Measures how well you expect to represent the true solution
  - Decreases with a more complex model
  - Bias² at one point x: (t(x) - E_D[h_D(x)])²
  - Average bias²: E_x[(t(x) - E_D[h_D(x)])²]

Variance of learner
- Given a dataset D with N samples, learn function h_D(x)
- If you sample a different dataset D' with N samples, you will learn a different h_D'(x)
- Variance: the difference between what you expect to learn and what you learn from a particular dataset
  - Measures how sensitive the learner is to the specific dataset
  - Decreases with a simpler model
  - Variance at one point x: E_D[(h_D(x) - E_D[h_D(x)])²]
  - Average variance: E_x[E_D[(h_D(x) - E_D[h_D(x)])²]]

Bias-variance tradeoff
- Choice of hypothesis class introduces learning bias
  - More complex class ⇒ less bias
  - More complex class ⇒ more variance

Bias-variance decomposition of error
- Write h̄_N(x) = E_D[h_D(x)]
- Expected mean squared error:
  MSE = E_D[ E_x[ (t(x) - h_D(x))² ] ]
- To simplify the derivation, drop x and expand the square; the cross term vanishes

Moral of the story: the bias-variance tradeoff is key in ML
- Error can be decomposed:
  MSE = E_D[ E_x[ (t(x) - h_D(x))² ] ]
      = E_x[ (t(x) - h̄_N(x))² ] + E_D[ E_x[ (h̄_N(x) - h_D(x))² ] ]
      = (average bias²) + (average variance)
- Choice of hypothesis class introduces learning bias
  - More complex class ⇒ less bias
  - More complex class ⇒ more variance

What you need to know
- Regression
  - Basis functions = features
  - Optimizing sum squared error
  - Relationship between regression and Gaussians
- Bias-variance tradeoff
- Play with the applet
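
The decomposition can be checked empirically. Below is a sketch (assuming NumPy; the sin "truth", noise level, and polynomial degree are all invented for illustration) that resamples many datasets D, refits on each, and estimates the average bias² and variance:

```python
import numpy as np

rng = np.random.default_rng(3)
t_true = lambda x: np.sin(np.pi * x)    # the "truth" t(x)
x_grid = np.linspace(-1, 1, 200)        # points at which we evaluate h_D(x)
degree, N, n_datasets = 3, 30, 500

preds = []
for _ in range(n_datasets):
    x = rng.uniform(-1, 1, size=N)
    t = t_true(x) + rng.normal(0, 0.1, size=N)
    w = np.polyfit(x, t, degree)        # fit h_D on this dataset D
    preds.append(np.polyval(w, x_grid))
preds = np.array(preds)                 # shape: (n_datasets, len(x_grid))

h_bar = preds.mean(axis=0)                         # h_bar_N(x) = E_D[h_D(x)]
bias2 = ((t_true(x_grid) - h_bar) ** 2).mean()     # average bias^2 over x
variance = ((preds - h_bar) ** 2).mean()           # average variance over x
print(bias2, variance)
```

Raising `degree` should shrink `bias2` and grow `variance`, which is exactly the tradeoff above.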

Overfitting
Machine Learning CSE546, Carlos Guestrin, University of Washington
September 30, 2014

Bias-variance tradeoff
- Choice of hypothesis class introduces learning bias
  - More complex class ⇒ less bias
  - More complex class ⇒ more variance

Training set error
- Given a dataset (training data)
- Choose a loss function, e.g., squared error (L2) for regression
- Training set error: for a particular set of parameters w, the loss function on the training data:
  error_train(w) = (1/N_train) Σ_j (t(x_j) - Σ_i w_i h_i(x_j))²

[Plot: training set error as a function of model complexity; it decreases as the model gets more complex]

Prediction error
- Training set error can be a poor measure of the quality of the solution
- Prediction error: we really care about the error over all possible input points, not just the training data:
  error_true(w) = E_x[ (t(x) - Σ_i w_i h_i(x))² ] = ∫ (t(x) - Σ_i w_i h_i(x))² p(x) dx

[Plot: prediction error as a function of model complexity; it falls at first, then rises again as the model overfits]

Computing prediction error
- Computing the prediction error directly is hard:
  - Hard integral
  - May not know t(x) for every x
- Monte Carlo integration (sampling approximation):
  - Sample a set of i.i.d. points {x_1, ..., x_M} from p(x)
  - Approximate the integral with the sample average:
    error_true(w) ≈ (1/M) Σ_j (t(x_j) - Σ_i w_i h_i(x_j))²

Why doesn't training set error approximate prediction error?
- Sampling approximation of prediction error: (1/M) Σ_j (t(x_j) - Σ_i w_i h_i(x_j))², with x_j drawn from p(x)
- Training error: (1/N_train) Σ_j (t(x_j) - Σ_i w_i h_i(x_j))², over the training points
- Very similar equations!!! Why is the training set a bad measure of prediction error???
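
Before answering, here is a minimal sketch of that Monte Carlo approximation (assuming NumPy; the uniform p(x), the sin truth, and the cubic fit are made-up stand-ins for "a learned w"):

```python
import numpy as np

rng = np.random.default_rng(5)
t_true = lambda x: np.sin(np.pi * x)

# Fit a cubic on a small noisy training set (stands in for any learned w).
x_fit = rng.uniform(-1, 1, size=30)
w = np.polyfit(x_fit, t_true(x_fit) + rng.normal(0, 0.1, size=30), deg=3)

# Monte Carlo: sample x_1..x_M i.i.d. from p(x), average the squared error.
M = 100_000
x_mc = rng.uniform(-1, 1, size=M)       # here p(x) = Uniform(-1, 1)
err_true = np.mean((t_true(x_mc) - np.polyval(w, x_mc)) ** 2)
print(err_true)
```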

Why doesn't training set error approximate prediction error?
- Because you cheated!!!
  - Training error is a good estimate of the error for a single fixed w...
  - ...but you optimized w with respect to the training error, and found a w that is good for this particular set of samples
  - Training error is an (optimistically) biased estimate of the prediction error

Test set error
- Given a dataset, randomly split it into two parts:
  - Training data {x_1, ..., x_N_train}
  - Test data {x_1, ..., x_N_test}
- Use the training data to optimize the parameters w
- Test set error: for the final output ŵ, evaluate the error using the test points:
  error_test(ŵ) = (1/N_test) Σ_j (t(x_j) - Σ_i ŵ_i h_i(x_j))²

[Plot: test set error as a function of model complexity; like the prediction error, it falls and then rises again]

Overfitting
- Overfitting: a learning algorithm overfits the training data if it outputs a solution w when there exists another solution w' such that:
  error_train(w) < error_train(w')  and  error_true(w') < error_true(w)
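
A toy sketch of the overfitting picture (assuming NumPy; the sin target, noise level, and split sizes are invented): training error keeps falling with the polynomial degree while test error eventually turns back up.

```python
import numpy as np

rng = np.random.default_rng(4)
x = rng.uniform(-1, 1, size=60)
t = np.sin(np.pi * x) + rng.normal(0, 0.2, size=60)

train, test = slice(0, 40), slice(40, 60)   # data is random, so a simple split is fine
for degree in [1, 3, 5, 9]:
    w = np.polyfit(x[train], t[train], degree)
    err_train = np.mean((t[train] - np.polyval(w, x[train])) ** 2)
    err_test = np.mean((t[test] - np.polyval(w, x[test])) ** 2)
    print(degree, err_train, err_test)
```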

How many points do I use for training/testing?
- Very hard question to answer!
  - Too few training points: the learned w is bad
  - Too few test points: you never know whether you reached a good solution
- Bounds such as Hoeffding's inequality can help: for a loss in [0, 1] and N_test i.i.d. test points,
  P(|error_true(ŵ) - error_test(ŵ)| ≥ ε) ≤ 2 exp(-2 N_test ε²)
- More on this later this quarter, but it is still hard to answer
- Typically:
  - If you have a reasonable amount of data, pick a test set large enough for a reasonable estimate of the error, and use the rest for learning
  - If you have little data, you need to pull out the big guns, e.g., bootstrapping

Error estimators
[Slide recapping the three estimators: true error, training error, test error]
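
The Hoeffding bound above is easy to evaluate; a tiny sketch in plain Python (the example N_test and ε are arbitrary):

```python
import math

def hoeffding_bound(n_test: int, eps: float) -> float:
    """Upper bound on P(|true error - test error| >= eps) for a loss in [0, 1]."""
    return 2 * math.exp(-2 * n_test * eps ** 2)

# With 1000 test points, the chance the test error is off by more than 0.05:
print(hoeffding_bound(1000, 0.05))   # approx. 0.0135
```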

Error as a function of the number of training examples, for a fixed model complexity
[Plot: error vs. amount of training data, from "little data" to "infinite data"]

Error estimators: be careful!!!
- The test set error is only unbiased if you never, never, ever, ever do any, any, any learning on the test data
- For example, if you use the test set to select the degree of the polynomial, it is no longer unbiased!!!
  (We will address this problem later in the quarter)

What you need to know
- True error, training error, test error
  - Never learn on the test data
  - Never learn on the test data
  - Never learn on the test data
  - Never learn on the test data
  - Never learn on the test data
- Overfitting

Bayesian Methods
Machine Learning CSE546, Carlos Guestrin, University of Washington
September 30, 2014

What about a prior?
- Billionaire says: Wait, I know that the thumbtack is close to 50-50. What can you do for me now?
- You say: I can learn it the Bayesian way...
- Rather than estimating a single θ, we obtain a distribution over possible values of θ

Bayesian learning
- Use Bayes' rule:
  P(θ | D) = P(D | θ) P(θ) / P(D)
- Or equivalently:
  P(θ | D) ∝ P(D | θ) P(θ)

Bayesian learning for the thumbtack
- The likelihood function is simply binomial:
  P(D | θ) = θ^α_H (1 - θ)^α_T
- What about the prior?
  - Represents expert knowledge
  - Simple posterior form
- Conjugate priors:
  - Closed-form representation of the posterior
  - For the binomial, the conjugate prior is the Beta distribution
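
A minimal sketch of the conjugate Beta-Binomial update (assuming SciPy is available; the prior counts and data counts are invented): the posterior is just another Beta with the counts added in.

```python
from scipy.stats import beta

beta_H, beta_T = 2, 3        # prior "hallucinated" flips: Beta(2, 3)
alpha_H, alpha_T = 20, 30    # observed data: 20 heads, 30 tails

# Conjugacy: posterior = Beta(beta_H + alpha_H, beta_T + alpha_T)
posterior = beta(beta_H + alpha_H, beta_T + alpha_T)
print(posterior.mean())            # posterior mean of theta
print(posterior.interval(0.95))    # central 95% credible interval
```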

Beta prior distribution P(θ)
- P(θ) = θ^(β_H - 1) (1 - θ)^(β_T - 1) / B(β_H, β_T)  ~  Beta(β_H, β_T)
- Mean: β_H / (β_H + β_T);  Mode: (β_H - 1) / (β_H + β_T - 2)
  [Plots: Beta(2,3) and Beta(20,30)]
- Likelihood function: P(D | θ) = θ^α_H (1 - θ)^α_T
- Posterior: P(θ | D) ∝ θ^(β_H + α_H - 1) (1 - θ)^(β_T + α_T - 1)

Posterior distribution
- Prior: Beta(β_H, β_T)
- Data: α_H heads and α_T tails
- Posterior distribution:
  P(θ | D) = Beta(β_H + α_H, β_T + α_T)
  [Plots: Beta(2,3) and Beta(20,30)]

Using the Bayesian posterior
- Posterior distribution: P(θ | D) = Beta(β_H + α_H, β_T + α_T)
- Bayesian inference:
  - No longer a single parameter; predictions average over the posterior:
    E[f(θ)] = ∫ f(θ) P(θ | D) dθ
  - The integral is often hard to compute

MAP: maximum a posteriori approximation
- As more data is observed, the Beta posterior becomes more certain (more peaked)
- MAP: use the most likely parameter:
  θ̂_MAP = argmax_θ P(θ | D)

MAP for the Beta distribution
- MAP: use the most likely parameter (the mode of the Beta posterior):
  θ̂_MAP = (β_H + α_H - 1) / (β_H + α_H + β_T + α_T - 2)
- A Beta prior is equivalent to extra thumbtack flips
- As N → ∞, the prior is forgotten
- But for small sample sizes, the prior is important!
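
A plain-Python sketch of "the prior is forgotten" (the 70%-heads data and the Beta(20,30) prior are invented): the MAP estimate starts near the prior and converges to the MLE α_H/N as N grows.

```python
beta_H, beta_T = 20, 30          # a fairly strong Beta prior
true_theta = 0.7

for N in [10, 100, 1000, 100_000]:
    alpha_H = round(true_theta * N)   # idealized data: exactly 70% heads
    alpha_T = N - alpha_H
    mle = alpha_H / N
    map_est = (beta_H + alpha_H - 1) / (beta_H + alpha_H + beta_T + alpha_T - 2)
    print(N, mle, map_est)            # MAP -> MLE as N -> infinity
```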