Machine Learning CSE546, Carlos Guestrin, University of Washington, September 30, 2013


Bayesian Methods

Machine Learning CSE546
Carlos Guestrin
University of Washington
September 30, 2013

What about a prior?

- Billionaire says: "Wait, I know that the thumbtack is close to 50-50. What can you do for me now?"
- You say: "I can learn it the Bayesian way."
- Rather than estimating a single θ, we obtain a distribution over possible values of θ.

Bayesian Learning

- Use Bayes' rule:
  P(θ | D) = P(D | θ) P(θ) / P(D)
- Or equivalently:
  P(θ | D) ∝ P(D | θ) P(θ)

Bayesian Learning for Thumbtack

- The likelihood function is simply binomial:
  P(D | θ) = θ^{α_H} (1 − θ)^{α_T}, for α_H heads and α_T tails
- What about the prior?
  - Represents expert knowledge
  - Simple posterior form
- Conjugate priors:
  - Closed-form representation of the posterior
  - For the binomial, the conjugate prior is the Beta distribution

Beta prior distribution

- P(θ) = Beta(β_H, β_T) ∝ θ^{β_H − 1} (1 − θ)^{β_T − 1}
  - Mean: β_H / (β_H + β_T)
  - Mode: (β_H − 1) / (β_H + β_T − 2)
  - (Figure: density plots of Beta(2,3) and Beta(20,30))
- Likelihood function: P(D | θ) = θ^{α_H} (1 − θ)^{α_T}
- Posterior: P(θ | D) ∝ θ^{β_H + α_H − 1} (1 − θ)^{β_T + α_T − 1}

Posterior distribution

- Prior: Beta(β_H, β_T)
- Data: α_H heads and α_T tails
- Posterior distribution: P(θ | D) = Beta(β_H + α_H, β_T + α_T)
  - (Figure: density plots of Beta(2,3) and Beta(20,30))
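As a concrete illustration of the conjugate update above, here is a minimal sketch in Python; the prior pseudo-counts and the observed counts are made-up example values, not numbers from the slides:

```python
from scipy.stats import beta

# Hypothetical prior pseudo-counts: expert believes theta is near 0.5
beta_H, beta_T = 20, 20

# Observed data: alpha_H heads, alpha_T tails (example values)
alpha_H, alpha_T = 9, 3

# Conjugacy: Beta prior + binomial likelihood => Beta posterior
posterior = beta(beta_H + alpha_H, beta_T + alpha_T)

print("Posterior mean:", posterior.mean())
print("95% credible interval:", posterior.interval(0.95))
```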

Using the Bayesian posterior

- Posterior distribution: P(θ | D) = Beta(β_H + α_H, β_T + α_T)
- Bayesian inference:
  - No longer a single parameter; we integrate over the posterior:
    E[f(θ)] = ∫ f(θ) P(θ | D) dθ
  - The integral is often hard to compute

MAP: Maximum a posteriori approximation

- As more data is observed, the Beta posterior becomes more certain
- MAP: use the most likely parameter:
  θ_MAP = argmax_θ P(θ | D)

MAP for Beta distribution

- MAP: use the most likely parameter:
  θ_MAP = argmax_θ Beta(β_H + α_H, β_T + α_T) = (β_H + α_H − 1) / (β_H + α_H + β_T + α_T − 2)
- The Beta prior is equivalent to extra thumbtack flips
- As N → ∞, the prior is forgotten
- But for small sample sizes, the prior is important!
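To see the "prior is forgotten" behavior numerically, the following sketch compares the MLE with the MAP estimate above as flips accumulate; the true bias and the prior pseudo-counts are illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)
true_theta = 0.5
beta_H, beta_T = 20, 20          # hypothetical prior: strong belief in 50-50

flips = rng.random(100_000) < true_theta
for n in [10, 100, 1000, 100_000]:
    heads = int(flips[:n].sum())
    tails = n - heads
    mle = heads / n
    # MAP of Beta(beta_H + heads, beta_T + tails): mode of the posterior
    map_est = (beta_H + heads - 1) / (beta_H + beta_T + heads + tails - 2)
    print(f"n={n:>6}  MLE={mle:.4f}  MAP={map_est:.4f}")
```

With little data the MAP estimate stays near the prior's 50-50 belief; with enough flips the two estimates agree.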
Linear Regression

Machine Learning CSE546
Carlos Guestrin
University of Washington
September 30, 2013
Prediction of continuous variables

- Billionaire sayz: "Wait, that's not what I meant!"
- You sayz: "Chill out, dude."
- He sayz: "I want to predict a continuous variable from continuous inputs: I want to predict salaries from GPA."
- You sayz: "I can regress that."

The regression problem

- Instances: <x_j, t_j>
- Learn: a mapping from x to t(x)
- Hypothesis space:
  - Given basis functions {h_1, ..., h_k}
  - Find coefficients w = {w_1, ..., w_k}
  - t(x) ≈ f(x) = Σ_i w_i h_i(x)
- Why is this called linear regression?
  - The model is linear in the parameters w
- Precisely: minimize the residual squared error:
  w* = argmin_w Σ_j (t(x_j) − Σ_i w_i h_i(x_j))²

The regression problem in matrix notation

w* = argmin_w (Hw − t)^T (Hw − t)

where H is the N × K matrix of basis functions evaluated at the N data points (row j is [h_1(x_j), ..., h_K(x_j)]), w is the K × 1 vector of weights, and t is the N × 1 vector of observations.

Minimizing the residual

Set the gradient of the residual squared error to zero:
∇_w (Hw − t)^T (Hw − t) = 2 H^T (Hw − t) = 0

Regression solution = simple matrix operations

w* = (H^T H)^{−1} H^T t

where H^T H is the k × k matrix for k basis functions and H^T t is a k × 1 vector.

But, why?

- Billionaire (again) says: "Why sum squared error???"
- You say: "Gaussians, Dr. Gateson, Gaussians..."
- Model: prediction is a linear function plus Gaussian noise:
  t(x) = Σ_i w_i h_i(x) + ε_x, with ε_x ~ N(0, σ²)
- Learn w using MLE
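A minimal sketch of the closed-form solution; the data is synthetic and the polynomial basis is just one example choice of h_i:

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic data: a true function plus Gaussian noise
x = rng.uniform(0, 1, size=50)
t = np.sin(2 * np.pi * x) + rng.normal(0, 0.2, size=50)

# Design matrix H: basis functions h_i(x) = x**i (polynomial basis, degree 3)
K = 4
H = np.vander(x, K, increasing=True)      # columns: 1, x, x^2, x^3

# Normal equations: w* = (H^T H)^{-1} H^T t
# (solve is preferred over an explicit inverse for numerical stability)
w = np.linalg.solve(H.T @ H, H.T @ t)

predictions = H @ w
print("weights:", w)
print("residual squared error:", np.sum((t - predictions) ** 2))
```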

Maximizing the log-likelihood

Maximize:
ln P(D | w, σ) = Σ_j ln [ (1/(σ√(2π))) exp(−(t(x_j) − Σ_i w_i h_i(x_j))² / (2σ²)) ]

Dropping the terms that do not depend on w, maximizing the log-likelihood is the same as minimizing Σ_j (t(x_j) − Σ_i w_i h_i(x_j))².

Least-squares linear regression is MLE for Gaussians!!!

Announcements

- Go to recitation!! Tuesday, 5:30pm in LOW 101
- The first homework goes out today
  - Due on October 14
  - Start early!!

Bias-Variance Tradeoff

Machine Learning CSE546
Carlos Guestrin
University of Washington
September 30, 2013

Bias-variance tradeoff: intuition

- Model too simple → does not fit the data well
  - A biased solution
- Model too complex → small changes to the data change the solution a lot
  - A high-variance solution

(Squared) bias of the learner

- Given a dataset D with N samples, learn the function h_D(x)
- If you sample a different dataset D' with N samples, you will learn a different h_{D'}(x)
- Expected hypothesis: E_D[h_D(x)]
- Bias: the difference between what you expect to learn and the truth
  - Measures how well you expect to represent the true solution
  - Decreases with a more complex model
  - Bias² at one point x: (t(x) − E_D[h_D(x)])²
  - Average bias²: E_x[(t(x) − E_D[h_D(x)])²]

Variance of the learner

- Given a dataset D with N samples, learn the function h_D(x)
- If you sample a different dataset D' with N samples, you will learn a different h_{D'}(x)
- Variance: the difference between what you expect to learn and what you learn from a particular dataset
  - Measures how sensitive the learner is to the specific dataset
  - Decreases with a simpler model
  - Variance at one point x: E_D[(h_D(x) − E_D[h_D(x)])²]
  - Average variance: E_x[E_D[(h_D(x) − E_D[h_D(x)])²]]
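These expectations over datasets can be estimated by simulation. Below is a sketch under assumptions of my own choosing (an assumed true function, a polynomial learner of fixed degree, synthetic noise): it repeatedly samples datasets, fits h_D, and averages bias² and variance over a grid of test points:

```python
import numpy as np

rng = np.random.default_rng(2)
true_f = lambda x: np.sin(2 * np.pi * x)          # assumed truth t(x)
x_grid = np.linspace(0, 1, 200)                   # grid approximating E_x[...]
N, degree, n_datasets = 25, 3, 500

preds = np.empty((n_datasets, x_grid.size))
for d in range(n_datasets):
    x = rng.uniform(0, 1, N)
    t = true_f(x) + rng.normal(0, 0.3, N)         # one dataset D
    w = np.polyfit(x, t, degree)                  # learn h_D
    preds[d] = np.polyval(w, x_grid)              # h_D(x) on the grid

expected_h = preds.mean(axis=0)                   # E_D[h_D(x)]
avg_bias2 = np.mean((true_f(x_grid) - expected_h) ** 2)
avg_variance = np.mean(preds.var(axis=0))
print(f"average bias^2 = {avg_bias2:.4f}, average variance = {avg_variance:.4f}")
```

Re-running with a lower or higher degree shows the tradeoff described next: simpler models have more bias and less variance, and vice versa.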

Bias-Variance Tradeoff

- The choice of hypothesis class introduces a learning bias
  - More complex class → less bias
  - More complex class → more variance

Bias-Variance Decomposition of Error

Let h_N(x) = E_D[h_D(x)].

- Expected mean squared error:
  MSE = E_D[E_x[(t(x) − h_D(x))²]]
- To simplify the derivation, drop x and expand the square.
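Written out, the expansion goes as follows (this is the standard derivation; the key step is that the cross term vanishes because E_D[h_N − h_D] = 0 and t − h_N is constant with respect to D):

```latex
\begin{align*}
E_D\big[(t - h_D)^2\big]
  &= E_D\big[(t - h_N + h_N - h_D)^2\big] \\
  &= (t - h_N)^2
     + 2\,(t - h_N)\,\underbrace{E_D[h_N - h_D]}_{=\,0}
     + E_D\big[(h_N - h_D)^2\big] \\
  &= \underbrace{(t - h_N)^2}_{\text{bias}^2}
     + \underbrace{E_D\big[(h_N - h_D)^2\big]}_{\text{variance}}
\end{align*}
```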

Moral of the Story: the Bias-Variance Tradeoff Is Key in ML

- The error can be decomposed:
  MSE = E_D[E_x[(t(x) − h_D(x))²]]
      = E_x[(t(x) − h_N(x))²] + E_D[E_x[(h_N(x) − h_D(x))²]]
- The choice of hypothesis class introduces a learning bias
  - More complex class → less bias
  - More complex class → more variance

What you need to know

- Regression
  - Basis function = features
  - Optimizing the sum squared error
  - Relationship between regression and Gaussians
- Bias-variance tradeoff
- Play with the applet

Overfitting

Machine Learning CSE546
Carlos Guestrin
University of Washington
September 30, 2013

Bias-Variance Tradeoff

- The choice of hypothesis class introduces a learning bias
  - More complex class → less bias
  - More complex class → more variance

Training set error

- Given a dataset (the training data)
- Choose a loss function, e.g., squared error (L2) for regression
- Training set error: for a particular set of parameters w, the loss function on the training data:
  error_train(w) = (1/N_train) Σ_{j=1}^{N_train} (t(x_j) − Σ_i w_i h_i(x_j))²

Training set error as a function of model complexity

(Figure: training error decreases monotonically as model complexity increases.)
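A small sketch of the monotone effect the figure shows: training error of polynomial fits of increasing degree on one synthetic dataset (the data and the basis are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.uniform(0, 1, 30)
t = np.sin(2 * np.pi * x) + rng.normal(0, 0.3, 30)

for degree in [0, 1, 3, 5, 9]:
    w = np.polyfit(x, t, degree)                       # fit on the training data
    train_err = np.mean((t - np.polyval(w, x)) ** 2)   # training set error
    print(f"degree {degree}: training error = {train_err:.4f}")
```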

Prediction error

- Training set error can be a poor measure of the quality of the solution
- Prediction error: we really care about the error over all possible input points, not just the training data:
  error_true(w) = E_x[(t(x) − Σ_i w_i h_i(x))²] = ∫ (t(x) − Σ_i w_i h_i(x))² p(x) dx

Prediction error as a function of model complexity

(Figure: prediction error first decreases, then rises again as model complexity grows.)

Computing prediction error

- Computing the prediction error requires a hard integral, and we may not know t(x) for every x
- Monte Carlo integration (sampling approximation):
  - Sample a set of i.i.d. points {x_1, ..., x_M} from p(x)
  - Approximate the integral with the sample average:
    error_true(w) ≈ (1/M) Σ_{j=1}^{M} (t(x_j) − Σ_i w_i h_i(x_j))²
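A sketch of the Monte Carlo approximation, under the assumption that we can sample fresh x from p(x) and query the noisy t(x); that is true in this synthetic setup, rarely in practice:

```python
import numpy as np

rng = np.random.default_rng(4)
true_f = lambda x: np.sin(2 * np.pi * x)

# Train on a small dataset
x_train = rng.uniform(0, 1, 30)
t_train = true_f(x_train) + rng.normal(0, 0.3, 30)
w = np.polyfit(x_train, t_train, 5)

# Monte Carlo estimate of prediction error: fresh i.i.d. samples from p(x)
M = 100_000
x_new = rng.uniform(0, 1, M)                       # samples from p(x)
t_new = true_f(x_new) + rng.normal(0, 0.3, M)      # noisy targets
pred_error = np.mean((t_new - np.polyval(w, x_new)) ** 2)

train_error = np.mean((t_train - np.polyval(w, x_train)) ** 2)
print(f"training error = {train_error:.4f}, Monte Carlo prediction error = {pred_error:.4f}")
```

The gap between the two printed numbers is exactly what the next slide is about.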
Why doesn't training set error approximate prediction error?

- Sampling approximation of the prediction error:
  error_true(w) ≈ (1/M) Σ_j (t(x_j) − Σ_i w_i h_i(x_j))², with the x_j sampled from p(x)
- Training error:
  error_train(w) = (1/N_train) Σ_j (t(x_j) − Σ_i w_i h_i(x_j))²
- Very similar equations!!! So why is the training set a bad measure of prediction error???
Why doesn't training set error approximate prediction error?

Because you cheated!!!

- The training error is a good estimate of the error for a single, fixed w...
- ...but you optimized w with respect to the training error, and found a w that is good for this particular set of samples
- Training error is an (optimistically) biased estimate of the prediction error

Test set error

- Given a dataset, randomly split it into two parts:
  - Training data {x_1, ..., x_{N_train}}
  - Test data {x_1, ..., x_{N_test}}
- Use the training data to optimize the parameters w
- Test set error: for the final output ŵ, evaluate the error using:
  error_test(ŵ) = (1/N_test) Σ_{j=1}^{N_test} (t(x_j) − Σ_i ŵ_i h_i(x_j))²
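A sketch of the split-and-evaluate recipe, sweeping model complexity so the U-shaped test error curve from the next slide shows up (synthetic data, illustrative split and degrees):

```python
import numpy as np

rng = np.random.default_rng(5)
true_f = lambda x: np.sin(2 * np.pi * x)

# One dataset, randomly split into train and test
x = rng.uniform(0, 1, 60)
t = true_f(x) + rng.normal(0, 0.3, 60)
idx = rng.permutation(60)
train, test = idx[:40], idx[40:]

for degree in [0, 1, 3, 5, 9]:
    w = np.polyfit(x[train], t[train], degree)     # learn only on the training data
    train_err = np.mean((t[train] - np.polyval(w, x[train])) ** 2)
    test_err = np.mean((t[test] - np.polyval(w, x[test])) ** 2)
    print(f"degree {degree}: train = {train_err:.4f}, test = {test_err:.4f}")
```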

Test set error as a function of model complexity

(Figure: training error keeps dropping with complexity, while test error first drops and then rises.)

Overfitting

- Overfitting: a learning algorithm overfits the training data if it outputs a solution w when there exists another solution w' such that:
  error_train(w) < error_train(w') and error_true(w') < error_true(w)

How many points do I use for training/testing?

- A very hard question to answer!
  - Too few training points and the learned w is bad
  - Too few test points and you never know whether you reached a good solution
- Bounds, such as Hoeffding's inequality, can help:
  P(|error_true(ŵ) − error_test(ŵ)| ≥ ε) ≤ 2e^{−2 N_test ε²}
- More on this later this quarter, but it is still hard to answer
- Typically:
  - If you have a reasonable amount of data, pick a test set large enough for a reasonable estimate of the error, and use the rest for learning
  - If you have little data, then you need to pull out the big guns, e.g., bootstrapping
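Inverting the Hoeffding bound gives a quick rule of thumb for the test set size: to have |error_true − error_test| < ε with probability at least 1 − δ, you need N_test ≥ ln(2/δ) / (2ε²). A small sketch (the ε and δ values are arbitrary examples, and the bound assumes a loss bounded in [0, 1], such as 0-1 loss):

```python
import math

def test_set_size(epsilon: float, delta: float) -> int:
    """Smallest N_test such that 2*exp(-2*N*eps^2) <= delta (Hoeffding).

    Assumes the per-example loss is bounded in [0, 1].
    """
    return math.ceil(math.log(2 / delta) / (2 * epsilon ** 2))

# Example: estimate the error to within 0.05, with 95% confidence
print(test_set_size(epsilon=0.05, delta=0.05))   # -> 738
```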
Error estimators

(Figure: comparing the error estimators discussed above: true error, training error, and test error.)
Error as a function of the number of training examples, for a fixed model complexity

(Figure: error curves from the little-data regime to the infinite-data limit.)

Error estimators

Be careful!!!

- The test set error is only an unbiased estimate if you never, never, ever, ever do any learning on the test data
- For example, if you use the test set to select the degree of the polynomial, it is no longer unbiased!!!
- (We will address this problem later in the quarter)

What you need to know

- True error, training error, test error
  - Never learn on the test data
  - Never learn on the test data
  - Never learn on the test data
  - Never learn on the test data
  - Never learn on the test data
- Overfitting