Lecture 4: Types of errors. Bayesian regression models. Logistic regression

Topics: a Bayesian interpretation of regularization; Bayesian vs. maximum likelihood fitting more generally.

COMP-652 and ECSE-608, Lecture 4, January 2016

The anatomy of the error of an estimator

Suppose we have examples $\langle x, y \rangle$ where $y = f(x) + \epsilon$ and $\epsilon$ is Gaussian noise with zero mean and standard deviation $\sigma$.

We fit a linear hypothesis $h(x) = w^T x$, so as to minimize the sum-squared error over the training data:
$$\sum_{i=1}^{m} (y_i - h(x_i))^2$$

Because of the hypothesis class that we chose (hypotheses linear in the parameters), for some target functions f we will have a systematic prediction error.

Even if f were truly from the hypothesis class we picked, then depending on the data set we have, the parameters w that we find may be different; this variability due to the specific data set on hand is a different source of error.

Bias-variance analysis

Given a new data point x, what is the expected prediction error?

Assume that the data points are drawn independently and identically distributed (i.i.d.) from a unique underlying probability distribution $P(\langle x, y \rangle) = P(x)P(y \mid x)$.

The goal of the analysis is to compute, for an arbitrary given point x,
$$E_P\left[(y - h(x))^2 \mid x\right]$$
where y is the target value associated with x in a data set, and the expectation is over all training sets of a given size, drawn according to P.

For a given hypothesis class, we can also compute the true error, which is the expected error over the input distribution:
$$\sum_x E_P\left[(y - h(x))^2 \mid x\right] P(x)$$
(if x is continuous, the sum becomes an integral, with appropriate conditions).

We will decompose the expectation $E_P\left[(y - h(x))^2 \mid x\right]$ into three components.

Recall: Statistics

Let X be a random variable with possible values $x_i$, $i = 1, \dots, n$, and with probability distribution $P(X)$.

The expected value or mean of X is:
$$E[X] = \sum_{i=1}^{n} x_i P(x_i)$$

If X is continuous, roughly speaking, the sum is replaced by an integral and the distribution by a density function.

The variance of X is:
$$Var[X] = E[(X - E[X])^2] = E[X^2] - (E[X])^2$$

The variance lemma

$$\begin{aligned}
Var[X] &= E[(X - E[X])^2] = \sum_{i=1}^{n} (x_i - E[X])^2 P(x_i) \\
&= \sum_{i=1}^{n} \left(x_i^2 - 2 x_i E[X] + (E[X])^2\right) P(x_i) \\
&= \sum_{i=1}^{n} x_i^2 P(x_i) - 2 E[X] \sum_{i=1}^{n} x_i P(x_i) + (E[X])^2 \sum_{i=1}^{n} P(x_i) \\
&= E[X^2] - 2 E[X] E[X] + (E[X])^2 \\
&= E[X^2] - (E[X])^2
\end{aligned}$$

We will use the form: $E[X^2] = (E[X])^2 + Var[X]$.
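
As a quick numerical sanity check (not part of the original slides), the lemma can be verified on an arbitrary discrete distribution; the values and probabilities below are made up for illustration.

```python
import numpy as np

# A small made-up discrete distribution over values x_i with probabilities p_i
x = np.array([1.0, 2.0, 5.0, 7.0])
p = np.array([0.1, 0.4, 0.3, 0.2])   # sums to 1

EX = np.sum(x * p)                    # E[X]
EX2 = np.sum(x**2 * p)                # E[X^2]
var_def = np.sum((x - EX)**2 * p)     # Var[X] from the definition
var_lemma = EX2 - EX**2               # Var[X] from the lemma

print(var_def, var_lemma)             # the two values agree
```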

Simple algebra: Bias-variance decomposition

$$\begin{aligned}
E_P\left[(y - h(x))^2 \mid x\right] &= E_P\left[(h(x))^2 - 2 y h(x) + y^2 \mid x\right] \\
&= E_P\left[(h(x))^2 \mid x\right] + E_P\left[y^2 \mid x\right] - 2 E_P[y \mid x]\, E_P[h(x) \mid x]
\end{aligned}$$

Let $\bar{h}(x) = E_P[h(x) \mid x]$ denote the mean prediction of the hypothesis at x, when h is trained with data drawn from P.

For the first term, using the variance lemma, we have:
$$E_P[(h(x))^2 \mid x] = E_P[(h(x) - \bar{h}(x))^2 \mid x] + (\bar{h}(x))^2$$

Note that $E_P[y \mid x] = E_P[f(x) + \epsilon \mid x] = f(x)$ (because of linearity of expectation and the assumption $\epsilon \sim \mathcal{N}(0, \sigma)$).

For the second term, using the variance lemma, we have:
$$E[y^2 \mid x] = E[(y - f(x))^2 \mid x] + (f(x))^2$$

Bias-variance decomposition (2)

Putting everything together, we have:
$$\begin{aligned}
E_P\left[(y - h(x))^2 \mid x\right] &= E_P[(h(x) - \bar{h}(x))^2 \mid x] + (\bar{h}(x))^2 - 2 f(x)\bar{h}(x) + E_P[(y - f(x))^2 \mid x] + (f(x))^2 \\
&= E_P[(h(x) - \bar{h}(x))^2 \mid x] + (f(x) - \bar{h}(x))^2 + E[(y - f(x))^2 \mid x]
\end{aligned}$$

The first term, $E_P[(h(x) - \bar{h}(x))^2 \mid x]$, is the variance of the hypothesis h at x, when trained with finite data sets sampled randomly from P.

The second term, $(f(x) - \bar{h}(x))^2$, is the squared bias (or systematic error), which is associated with the class of hypotheses we are considering.

The last term, $E[(y - f(x))^2 \mid x]$, is the noise, which is due to the problem at hand and cannot be avoided.
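
A small simulation (not from the slides) can make the decomposition concrete: repeatedly sample training sets from a fixed distribution P, fit a linear hypothesis to each, and estimate the variance, squared bias, and noise at a query point. The target function, noise level, and sample sizes below are made-up choices.

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x):                       # made-up target function
    return np.sin(3 * x)

sigma = 0.3                     # noise standard deviation
m, n_datasets = 20, 2000        # training-set size, number of training sets
x_query = 0.5                   # point at which we analyze the error

preds = []
for _ in range(n_datasets):
    x = rng.uniform(-1, 1, m)
    y = f(x) + rng.normal(0, sigma, m)
    X = np.column_stack([np.ones(m), x])          # hypothesis linear in the parameters
    w = np.linalg.lstsq(X, y, rcond=None)[0]      # least-squares fit
    preds.append(w[0] + w[1] * x_query)

preds = np.array(preds)
variance = preds.var()                            # E[(h(x) - hbar(x))^2]
bias_sq = (f(x_query) - preds.mean()) ** 2        # (f(x) - hbar(x))^2
noise = sigma ** 2                                # E[(y - f(x))^2]

# Direct Monte Carlo estimate of the expected squared error at x_query
y_query = f(x_query) + rng.normal(0, sigma, n_datasets)
expected_err = np.mean((y_query - preds) ** 2)

print(variance + bias_sq + noise, expected_err)   # approximately equal
```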

Error decomposition

[Figure: $(\text{bias})^2$, variance, $(\text{bias})^2$ + variance, and test error plotted against the regularization parameter $\ln \lambda$.]

The bias-variance sum approximates the test error well over a set of points.

The x-axis measures the hypothesis complexity (decreasing left-to-right).

Simple hypotheses usually have high bias (bias will be high at many points, so it will likely be high for many possible input distributions).

Complex hypotheses have high variance: the hypothesis is very dependent on the data set on which it was trained.

Bias-variance trade-off

Typically, bias comes from not having good hypotheses in the considered class.

Variance results from the hypothesis class containing too many hypotheses.

MLE estimation is typically unbiased, but has high variance.

Bayesian estimation is biased, but typically has lower variance.

Hence, we are faced with a trade-off: choose a more expressive class of hypotheses, which will generate higher variance, or a less expressive class, which will generate higher bias.

Making the trade-off has to depend on the amount of data available to fit the parameters (more data usually mitigates the variance problem).

More on overfitting

Overfitting depends on the amount of data, relative to the complexity of the hypothesis.

With more data, we can explore more complex hypothesis spaces and still find a good solution.

[Figure: two panels of t against x, showing fits of the same model to a smaller and a larger data set.]

Bayesian view of regularization

Start with a prior distribution over hypotheses.

As data comes in, compute a posterior distribution.

We often work with conjugate priors, which means that when combining the prior with the likelihood of the data, one obtains the posterior in the same form as the prior.

Regularization can be obtained from particular types of priors (usually, priors that put more probability on simple hypotheses).

E.g., $L_2$ regularization can be obtained using a circular Gaussian prior for the weights, and the posterior will also be Gaussian.

E.g., $L_1$ regularization uses a double-exponential (Laplace) prior (see Tibshirani, 1996).
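
As an illustrative sketch (not from the slides) of the Gaussian-prior / $L_2$ connection: with a zero-mean isotropic Gaussian prior of variance $\tau^2$ on the weights and Gaussian noise of variance $\sigma^2$, the posterior mean of Bayesian linear regression coincides with the ridge-regression solution for $\lambda = \sigma^2/\tau^2$. The data and hyperparameters below are made up.

```python
import numpy as np

rng = np.random.default_rng(1)

# Made-up linear data: y = 2x - 1 + noise
m = 30
x = rng.uniform(-1, 1, m)
y = 2 * x - 1 + rng.normal(0, 0.2, m)
Phi = np.column_stack([np.ones(m), x])     # design matrix with a bias feature

sigma2 = 0.2 ** 2                          # noise variance (assumed known)
tau2 = 1.0                                 # prior variance: w ~ N(0, tau2 * I)
lam = sigma2 / tau2                        # equivalent L2 regularization coefficient

# Ridge regression (L2-regularized least squares)
w_ridge = np.linalg.solve(Phi.T @ Phi + lam * np.eye(2), Phi.T @ y)

# Mean of the Gaussian posterior over weights (Bayesian linear regression)
w_map = np.linalg.solve(Phi.T @ Phi / sigma2 + np.eye(2) / tau2, Phi.T @ y / sigma2)

print(w_ridge, w_map)                      # the two coincide (up to floating-point error)
```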

Bayesian view of regularization

The prior is a round Gaussian.

The posterior will be skewed by the data.

What does the Bayesian view give us?

[Figure: panels plotting t against x. Circles are data points, green is the true function, and the red lines on the right are drawn from the posterior distribution.]

What does the Bayesian view give us?

[Figure: further panels of t against x showing functions drawn from the posterior.]

Functions drawn from the posterior can be very different.

Uncertainty decreases where there are data points.

What does the Bayesian view give us?

Uncertainty estimates, i.e., how sure we are of the value of the function.

These can be used to guide active learning: ask about inputs for which the uncertainty in the value of the function is very high.

In the limit, Bayesian and maximum likelihood learning converge to the same answer.

In the short term, one needs a good prior to get good estimates of the parameters.

Sometimes the prior is overwhelmed by the data likelihood too early.

Using the Bayesian approach does NOT eliminate the need to do cross-validation in general.

More on this later...
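
A minimal sketch (not from the slides) of where these uncertainty estimates come from in Bayesian linear regression: the Gaussian posterior over the weights yields a predictive variance at each input, which is small near the training data and large away from it. The data, basis functions, and hyperparameters below are made up.

```python
import numpy as np

rng = np.random.default_rng(2)

# Made-up training data concentrated on part of the input range
x_train = rng.uniform(-1, 0, 15)
y_train = np.sin(2 * x_train) + rng.normal(0, 0.1, 15)

def features(x):
    # Simple polynomial basis; any linear-in-parameters features would do
    return np.column_stack([np.ones_like(x), x, x**2, x**3])

sigma2, tau2 = 0.1 ** 2, 1.0                              # noise variance, prior variance
Phi = features(x_train)
A = Phi.T @ Phi / sigma2 + np.eye(Phi.shape[1]) / tau2    # posterior precision
S = np.linalg.inv(A)                                      # posterior covariance
mu = S @ Phi.T @ y_train / sigma2                         # posterior mean

# Predictive variance at query points: low near the data, high far from it
x_query = np.array([-0.5, 0.9])
phi_q = features(x_query)
pred_mean = phi_q @ mu
pred_var = sigma2 + np.sum(phi_q @ S * phi_q, axis=1)

print(pred_mean, pred_var)    # variance at x=0.9 (no data nearby) exceeds that at x=-0.5
```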

Logistic regression

Suppose we represent the hypothesis itself as a logistic function of a linear combination of inputs:
$$h(x) = \frac{1}{1 + \exp(-w^T x)}$$
This is also known as a sigmoid neuron.

Suppose we interpret h(x) as $P(y = 1 \mid x)$.

Then the log-odds ratio,
$$\ln\left(\frac{P(y = 1 \mid x)}{P(y = 0 \mid x)}\right) = w^T x,$$
which is linear (nice!).

The optimum weights will maximize the conditional likelihood of the outputs, given the inputs.
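
A quick numerical check (not from the slides) that the log-odds of the logistic hypothesis is exactly the linear function $w^T x$; the weights and input below are made up.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

w = np.array([0.5, -2.0, 1.5])        # made-up weights
x = np.array([1.0, 0.3, -0.7])        # made-up input (first component acts as a bias)

h = sigmoid(w @ x)                    # interpreted as P(y=1 | x)
log_odds = np.log(h / (1 - h))        # log-odds ratio

print(log_odds, w @ x)                # identical: the log-odds is linear in x
```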

The cross-entropy error function

Suppose we interpret the output of the hypothesis, $h(x_i)$, as the probability that $y_i = 1$.

Then the log-likelihood of a hypothesis h is:
$$\log L(h) = \sum_{i=1}^{m} \log P(y_i \mid x_i, h) = \sum_{i=1}^{m} \begin{cases} \log h(x_i) & \text{if } y_i = 1 \\ \log(1 - h(x_i)) & \text{if } y_i = 0 \end{cases} = \sum_{i=1}^{m} y_i \log h(x_i) + (1 - y_i)\log(1 - h(x_i))$$

The cross-entropy error function is the opposite quantity:
$$J_D(w) = -\left(\sum_{i=1}^{m} y_i \log h(x_i) + (1 - y_i)\log(1 - h(x_i))\right)$$
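
A short sketch (not from the slides) of computing the cross-entropy error $J_D(w)$ on a tiny made-up data set; the data and weights are arbitrary.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cross_entropy(w, X, y):
    """Negative conditional log-likelihood J_D(w) for logistic regression."""
    h = sigmoid(X @ w)                                   # predicted P(y=1 | x_i)
    return -np.sum(y * np.log(h) + (1 - y) * np.log(1 - h))

# Tiny made-up data set: 3 examples, 2 features (first column is a bias term)
X = np.array([[1.0, 0.2], [1.0, -1.0], [1.0, 2.3]])
y = np.array([1.0, 0.0, 1.0])
w = np.zeros(2)

print(cross_entropy(w, X, y))   # equals 3*log(2) when w = 0, since h = 0.5 everywhere
```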

Cross-entropy error surface for the logistic function

$$J_D(w) = -\left(\sum_{i=1}^{m} y_i \log \sigma(w^T x_i) + (1 - y_i)\log(1 - \sigma(w^T x_i))\right)$$

Nice error surface, unique minimum, but we cannot solve for the minimum in closed form.

Gradient descent

The gradient of J at a point w can be thought of as a vector indicating which way is uphill.

[Figure: a sum-squared-error (SSQ) surface plotted over the two weights, with the gradient pointing uphill.]

If this is an error function, we want to move downhill on it, i.e., in the direction opposite to the gradient.

Example gradient descent traces

For more general hypothesis classes, there may be many local optima.

In this case, the final solution may depend on the initial parameters.

Gradient descent algorithm

The basic algorithm assumes that $\nabla J$ is easily computed.

We want to produce a sequence of vectors $w^1, w^2, w^3, \dots$ with the goal that:
$$J(w^1) > J(w^2) > J(w^3) > \dots, \qquad \lim_{i \to \infty} w^i = w \text{ and } w \text{ is locally optimal.}$$

The algorithm: given $w^0$, do for $i = 0, 1, 2, \dots$
$$w^{i+1} = w^i - \alpha_i \nabla J(w^i),$$
where $\alpha_i > 0$ is the step size or learning rate for iteration i.
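
A minimal sketch (not from the slides) of this update rule on a made-up quadratic error function, where the gradient is available in closed form.

```python
import numpy as np

def gradient_descent(grad_J, w0, alpha=0.1, n_iters=100):
    """Basic gradient descent: w <- w - alpha * grad_J(w)."""
    w = np.asarray(w0, dtype=float)
    for _ in range(n_iters):
        w = w - alpha * grad_J(w)
    return w

# Made-up quadratic error J(w) = ||w - [3, -2]||^2, whose gradient is 2(w - [3, -2])
target = np.array([3.0, -2.0])
grad_J = lambda w: 2 * (w - target)

print(gradient_descent(grad_J, w0=np.zeros(2)))   # converges toward [3, -2]
```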

Maximization procedure: Gradient ascent

First we compute the gradient of $\log L(w)$ with respect to w:
$$\begin{aligned}
\nabla \log L(w) &= \sum_i \frac{y_i}{h_w(x_i)} h_w(x_i)(1 - h_w(x_i)) x_i - \frac{1 - y_i}{1 - h_w(x_i)} h_w(x_i)(1 - h_w(x_i)) x_i \\
&= \sum_i x_i \left(y_i - y_i h_w(x_i) - h_w(x_i) + y_i h_w(x_i)\right) \\
&= \sum_i (y_i - h_w(x_i)) x_i
\end{aligned}$$

The update rule (because we maximize) is:
$$w \leftarrow w + \alpha \nabla \log L(w) = w + \alpha \sum_{i=1}^{m} (y_i - h_w(x_i)) x_i = w + \alpha X^T (y - \hat{y})$$
where $\alpha \in (0, 1)$ is a step-size or learning rate parameter.
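
A minimal implementation sketch (not from the slides) of this gradient-ascent rule, $w \leftarrow w + \alpha X^T(y - \hat{y})$, on made-up data; the step size and iteration count are arbitrary choices.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_regression_ga(X, y, alpha=0.05, n_iters=1000):
    """Batch gradient ascent on the conditional log-likelihood:
    w <- w + alpha * X^T (y - y_hat)."""
    w = np.zeros(X.shape[1])
    for _ in range(n_iters):
        y_hat = sigmoid(X @ w)
        w = w + alpha * X.T @ (y - y_hat)
    return w

# Made-up binary data with a bias column
rng = np.random.default_rng(3)
x = rng.uniform(-2, 2, 50)
X = np.column_stack([np.ones_like(x), x])
y = (x + rng.normal(0, 0.3, 50) > 0).astype(float)

w = logistic_regression_ga(X, y)
print(w)                                    # slope weight comes out positive, as expected
print(sigmoid(X @ w).round(2)[:5], y[:5])   # predicted probabilities vs. labels
```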

This is called logistic regression.

If one uses features of the input, we have the same update, with the design matrix now containing the feature vectors:
$$w \leftarrow w + \alpha X^T (y - \hat{y})$$

Another algorithm for optimization

Recall Newton's method for finding the zero of a function $g : \mathbb{R} \to \mathbb{R}$.

At point $w^i$, approximate the function by a straight line (its tangent).

Solve the linear equation for where the tangent equals 0, and move the parameter to this point:
$$w^{i+1} = w^i - \frac{g(w^i)}{g'(w^i)}$$
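
A small sketch (not from the slides) of this root-finding iteration, applied to a made-up function g whose zero is known.

```python
def newton_root(g, g_prime, w0, n_iters=10):
    """Newton's method for finding a zero of g: w <- w - g(w) / g'(w)."""
    w = w0
    for _ in range(n_iters):
        w = w - g(w) / g_prime(w)
    return w

# Made-up example: g(w) = w^2 - 2, whose positive zero is sqrt(2)
g = lambda w: w**2 - 2
g_prime = lambda w: 2 * w

print(newton_root(g, g_prime, w0=1.0))   # ~1.41421356...
```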

Application to machine learning

Suppose for simplicity that the error function J has only one parameter.

We want to optimize J, so we can apply Newton's method to find the zeros of $J' = \frac{d}{dw} J$.

We obtain the iteration:
$$w^{i+1} = w^i - \frac{J'(w^i)}{J''(w^i)}$$

Note that there is no step size parameter!

This is a second-order method, because it requires computing the second derivative.

But if our error function is quadratic, this will find the global optimum in one step!

Second-order methods: Multivariate setting

If we have an error function J that depends on many variables, we can compute the Hessian matrix, which contains the second-order derivatives of J:
$$H_{ij} = \frac{\partial^2 J}{\partial w_i \partial w_j}$$

The inverse of the Hessian gives the optimal learning rates.

The weights are updated as:
$$w \leftarrow w - H^{-1} \nabla_w J$$

For logistic regression, this is also called the Newton-Raphson method, or Fisher scoring.

Which method is better?

Newton's method usually requires significantly fewer iterations than gradient descent.

Computing the Hessian requires a batch of data, so there is no natural on-line algorithm.

Inverting the Hessian explicitly is expensive, but almost never necessary.

Computing the product of a Hessian with a vector can be done in linear time (Schraudolph, 1994).

Newton-Raphson for logistic regression

This leads to a nice algorithm called iteratively reweighted least squares (IRLS).

The Hessian has the form:
$$H = \Phi^T R \Phi$$
where R is the diagonal matrix with entries $h(x_i)(1 - h(x_i))$ (you can check that this is the form of the second derivative).

The weight update becomes:
$$w \leftarrow (\Phi^T R \Phi)^{-1} \Phi^T R \left(\Phi w - R^{-1}(\hat{y} - y)\right)$$
where $\hat{y}$ is the vector of predictions $h(x_i)$.
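
A minimal sketch (not from the slides) of the IRLS update above, on made-up, non-separable data (Newton updates can diverge on perfectly separable data); the features and iteration count are illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def irls(Phi, y, n_iters=10):
    """Newton-Raphson / IRLS for logistic regression."""
    w = np.zeros(Phi.shape[1])
    for _ in range(n_iters):
        y_hat = sigmoid(Phi @ w)                  # current predictions
        r = y_hat * (1 - y_hat)                   # diagonal entries of R
        R = np.diag(r)
        z = Phi @ w - (y_hat - y) / r             # working targets: Phi w - R^{-1}(y_hat - y)
        w = np.linalg.solve(Phi.T @ R @ Phi, Phi.T @ R @ z)
    return w

# Made-up binary classification data with a bias feature
rng = np.random.default_rng(4)
x = rng.uniform(-2, 2, 100)
Phi = np.column_stack([np.ones_like(x), x])
y = (0.5 * x - 0.2 + rng.normal(0, 0.5, 100) > 0).astype(float)

print(irls(Phi, y))   # weights converge in a handful of iterations
```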

Regularization for logistic regression

One can do regularization for logistic regression just like in the case of linear regression.

Recall that regularization makes a statement about the weights, so it does not affect the error function.

E.g., $L_2$ regularization will have the optimization criterion:
$$J(w) = J_D(w) + \frac{\lambda}{2} w^T w$$
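
As a small sketch (not from the slides), the $L_2$ penalty only adds $\lambda w$ to the gradient of $J_D$, so the earlier gradient-based update needs a one-line change; the data and $\lambda$ values below are made up.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def regularized_logistic_gd(X, y, lam=0.1, alpha=0.05, n_iters=1000):
    """Gradient descent on J(w) = J_D(w) + (lambda/2) w^T w."""
    w = np.zeros(X.shape[1])
    for _ in range(n_iters):
        grad = X.T @ (sigmoid(X @ w) - y) + lam * w   # gradient of J_D plus penalty term
        w = w - alpha * grad
    return w

# The same kind of made-up data as in the gradient-ascent sketch above
rng = np.random.default_rng(5)
x = rng.uniform(-2, 2, 50)
X = np.column_stack([np.ones_like(x), x])
y = (x + rng.normal(0, 0.3, 50) > 0).astype(float)

print(regularized_logistic_gd(X, y, lam=0.0))   # unregularized weights (larger)
print(regularized_logistic_gd(X, y, lam=1.0))   # regularized weights are pulled toward 0
```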