Linear Models in Machine Learning

CS540 Intro to AI
Lecturer: Xiaojin Zhu, jerryzhu@cs.wisc.edu

We briefly go over two linear models frequently used in machine learning: linear regression for, well, regression, and logistic regression for classification.

1 Linear regression

1.1 The 1-D case without regularization (Ordinary Least Squares)

x: input variable (also called the independent, predictor, or explanatory variable)
y: output variable (also called the dependent or response variable)

A scatter plot shows (x_i, y_i) for i = 1, ..., n as dots.

Model assumption: Y = β_0 + β_1 x + ε, where ε ~ N(0, σ²). On a test point x*,

  E(Y | x*) = E(β_0 + β_1 x* + ε) = β_0 + β_1 x* + E(ε) = β_0 + β_1 x*,
  σ²(Y | x*) = σ².

Given data, the maximum likelihood estimate equals the least squares estimate

  (β̂_0, β̂_1) = argmin_{β_0, β_1} Σ_{i=1}^n (β_0 + β_1 x_i − y_i)².

This is a convex objective. Setting the gradient to zero gives

  β̂_1 = Σ_i (x_i − x̄)(y_i − ȳ) / Σ_i (x_i − x̄)²,    β̂_0 = ȳ − β̂_1 x̄,

where x̄ = (Σ_i x_i)/n and ȳ = (Σ_i y_i)/n. This is known as ordinary least squares (OLS).

Let ŷ_i = β̂_0 + β̂_1 x_i. The residual is y_i − ŷ_i. The residual sum of squares is Σ_i (y_i − ŷ_i)², and an unbiased estimate of σ² is

  σ̂² = Σ_i (y_i − ŷ_i)² / (n − 2).

The denominator is n − 2 because β̂_0, β̂_1 used up two degrees of freedom. Note the best constant fit would have been f(x) = ȳ, whose residual sum of squares is Σ_i (y_i − ȳ)². The coefficient of determination r² measures how much better a non-constant line fit is:

  r² = 1 − Σ_i (y_i − ŷ_i)² / Σ_i (y_i − ȳ)².

r² ∈ [0, 1], with 1 being desirable.

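To make the formulas concrete, here is a minimal numpy sketch of the 1-D OLS fit on synthetic data; the simulated example (true β_0 = 2, β_1 = 3, σ = 1) and the variable names are illustrative, not part of the notes.

# A minimal sketch of the 1-D OLS formulas above, assuming numpy;
# the synthetic data (true beta0 = 2, beta1 = 3, sigma = 1) is illustrative.
import numpy as np

rng = np.random.default_rng(0)
n = 50
x = rng.uniform(0, 10, size=n)
y = 2.0 + 3.0 * x + rng.normal(0.0, 1.0, size=n)

xbar, ybar = x.mean(), y.mean()
beta1_hat = np.sum((x - xbar) * (y - ybar)) / np.sum((x - xbar) ** 2)
beta0_hat = ybar - beta1_hat * xbar

yhat = beta0_hat + beta1_hat * x
rss = np.sum((y - yhat) ** 2)                 # residual sum of squares
sigma2_hat = rss / (n - 2)                    # unbiased estimate of sigma^2
r2 = 1.0 - rss / np.sum((y - ybar) ** 2)      # coefficient of determination
print(beta0_hat, beta1_hat, sigma2_hat, r2)
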
1.2 Ridge regression

Input variables x = (x_0 = 1, x_1, ..., x_p)ᵀ ∈ R^{p+1}, output variable y ∈ R. The model is

  y = xᵀβ + ε,    ε ~ N(0, σ²).

It is linear in β; the x_i themselves can be nonlinear functions of the input. Examples:

  pth-degree polynomial regression: y = β_0 + β_1 x + β_2 x² + ... + β_p x^p + ε.
  second-order model: y = β_0 + β_1 x_1 + β_2 x_2 + β_3 x_1² + β_4 x_1 x_2 + β_5 x_2² + ε.

In general y = β_0 + Σ_j β_j φ_j(x) + ε, where the φ_j(·)'s are arbitrary feature functions of x. An unbiased estimate of σ² is

  σ̂² = Σ_{i=1}^n (y_i − ŷ_i)² / (n − (p + 1)).

It will be convenient to arrange the inputs in a matrix X of size n × (p + 1), where each row is a feature vector, and a label vector y ∈ Rⁿ. The linear regression model is then y = Xβ + ε, where ε is a vector, too.

Without regularization, the ordinary least squares (OLS) method finds the parameter by solving

  min_β ‖y − Xβ‖².

Rewrite the objective as

  (y − Xβ)ᵀ(y − Xβ) = yᵀy − 2βᵀXᵀy + βᵀXᵀXβ.

Take the gradient with respect to β and set it to the zero vector:

  −2Xᵀy + 2XᵀXβ = 0.

Assuming XᵀX is invertible, we obtain the solution

  β = (XᵀX)⁻¹Xᵀy.

OLS is certainly not going to work, however, in the high-dimensional setting when p > n, since XᵀX is rank deficient and not invertible. More frequently used in practice for its stability, ridge regression solves a regularized optimization problem instead:

  min_β ‖y − Xβ‖² + λ‖β‖².    (1)

Rewrite the objective as

  yᵀy − 2βᵀXᵀy + βᵀXᵀXβ + λβᵀβ.

Take the gradient with respect to β and set it to the zero vector:

  −2Xᵀy + 2XᵀXβ + 2λβ = 0
  −2Xᵀy + 2(XᵀX + λI)β = 0.

The solution is

  β = (XᵀX + λI)⁻¹Xᵀy.

Note (XᵀX + λI) is always invertible. With an appropriate λ, ridge regression can have smaller mean squared error than OLS.

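The OLS and ridge closed forms translate directly into code. Below is a minimal numpy sketch, assuming a design matrix X of size n × (p + 1) whose first column is all ones; it solves the linear systems rather than forming explicit inverses, which is the numerically preferable route.

# A minimal sketch of the matrix-form OLS and ridge solutions, assuming numpy.
# X is an n x (p+1) design matrix whose first column is all ones; solving a
# linear system replaces the explicit matrix inverse.
import numpy as np

def ols(X, y):
    # beta = (X^T X)^{-1} X^T y  (requires X^T X to be invertible)
    return np.linalg.solve(X.T @ X, X.T @ y)

def ridge(X, y, lam):
    # beta = (X^T X + lambda I)^{-1} X^T y  (always well defined for lambda > 0)
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

rng = np.random.default_rng(0)
n, p = 30, 5
X = np.hstack([np.ones((n, 1)), rng.normal(size=(n, p))])
beta_true = rng.normal(size=p + 1)
y = X @ beta_true + 0.1 * rng.normal(size=n)
print(ols(X, y))
print(ridge(X, y, lam=1.0))

When XᵀX is ill-conditioned, np.linalg.lstsq is a more robust way to compute the OLS fit than solving the normal equations directly.
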
2 Logistic Regression

Consider binary classification with y ∈ {−1, 1}, where each example is represented by a feature vector x. The intuition is to first map x to a real number, such that a large positive number means that x is likely to be positive (y = 1), and a negative number means x is likely to be negative (y = −1). This is done just like in linear regression:

  θᵀx.    (2)

The next step is to squash the range down to [0, 1] so one can interpret it as the label probability. This is done via the logistic function:

  p(y = 1 | x) = σ(θᵀx) = 1 / (1 + exp(−θᵀx)).    (3)

For binary classification with −1, 1 class encoding, one can unify the definitions of p(y = 1 | x) and p(y = −1 | x) with

  p(y | x) = 1 / (1 + exp(−y θᵀx)).    (4)

Logistic regression can be easily generalized to multiple classes. Let there be K classes. Each class k has its own parameter θ_k. (Strictly speaking, one needs only K − 1 parameter vectors.) The probability is defined via the softmax function

  p(y = k | x) = exp(θ_kᵀ x) / Σ_{i=1}^K exp(θ_iᵀ x).    (5)

If a predicted class label (instead of a probability) is desired, simply predict the class with the largest probability p(y = k | x). Logistic regression produces linear decision boundaries between classes. We will focus on binary classification in the rest of this note.

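Here is a minimal numpy sketch of the probabilities in (3)-(5). The names theta (binary case) and Theta (a matrix whose k-th row is θ_k) are illustrative, and subtracting the maximum before exponentiating in the softmax is a standard numerical-stability device rather than something stated in the notes.

# A minimal sketch of the probabilities in (3)-(5), assuming numpy.
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def p_binary(theta, x, y):
    # p(y | x) = 1 / (1 + exp(-y theta^T x)) for y in {-1, +1}, as in (4)
    return sigmoid(y * (theta @ x))

def p_softmax(Theta, x):
    # p(y = k | x) = exp(theta_k^T x) / sum_i exp(theta_i^T x), as in (5)
    a = Theta @ x
    a = a - a.max()                       # subtract the max for numerical stability
    e = np.exp(a)
    return e / e.sum()

theta = np.array([0.5, -1.0, 2.0])
x = np.array([1.0, 0.3, -0.2])            # the leading 1 plays the role of x_0
print(p_binary(theta, x, +1) + p_binary(theta, x, -1))   # the two probabilities sum to 1
Theta = np.array([[0.5, -1.0, 2.0],
                  [0.0, 1.0, -1.0],
                  [1.0, 0.0, 0.0]])
probs = p_softmax(Theta, x)
print(probs, probs.argmax())              # predict the class with the largest probability
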
2.1 Training

Training (i.e., finding the parameter θ) can be done by maximizing the conditional log likelihood of the training data {(x_i, y_i)}_{i=1}^n:

  max_θ Σ_{i=1}^n log p(y_i | x_i, θ).    (6)

However, when the training data is linearly separable, two bad things happen: 1. ‖θ‖ goes to infinity; 2. there are an infinite number of MLEs. To see this, note that any step function (a sigmoid with ‖θ‖ = ∞) that lies in the gap between the two classes is an MLE. One way to avoid this is to incorporate a prior on θ in the form of a zero-mean Gaussian with covariance λ⁻¹I,

  θ ~ N(0, λ⁻¹I),    (7)

and seek the MAP estimate. This is essentially smoothing, since large θ values will be penalized more. That is, we seek the θ that maximizes

  log p(θ) + Σ_{i=1}^n log p(y_i | x_i, θ)    (8)
  = −(λ/2)‖θ‖² + Σ_{i=1}^n log p(y_i | x_i, θ) + constant    (9)
  = −(λ/2)‖θ‖² − Σ_{i=1}^n log(1 + exp(−y_i θᵀx_i)) + constant.    (10)

Equivalently, one minimizes the l2-regularized negative log likelihood loss

  min_θ (λ/2)‖θ‖² + Σ_{i=1}^n log(1 + exp(−y_i θᵀx_i)).    (11)

This is a convex function, so there is a unique global minimizer θ. Unfortunately there is no closed-form solution. One typically solves the optimization problem with Newton-Raphson iterations, also known as iteratively reweighted least squares for logistic regression.

3 Stochastic Gradient Descent

In anticipation of more complex non-convex learners, we present a simple training algorithm that works for both linear regression (1) and logistic regression (11). Observe that both models can be written as

  min_θ Σ_{i=1}^n l(x_i, y_i, θ) + (λ/2)‖θ‖²,    (12)

where l(x_i, y_i, θ) is the loss function. The gradient descent algorithm starts at an arbitrary θ_0 and repeats the following update until convergence:

  θ_{t+1} = θ_t − η ∇( Σ_{i=1}^n l(x_i, y_i, θ_t) + (λ/2)‖θ_t‖² )    (13)
          = θ_t − η Σ_{i=1}^n ∇l(x_i, y_i, θ_t) − ηλθ_t,    (14)

where η is a small positive number known as the step size. Each iteration of gradient descent needs to go over all n training points to compute the gradient, which may be too expensive. Note that we can write the update equivalently as

  θ_{t+1} = θ_t − nη E[∇l(x, y, θ_t)] − ηλθ_t,    (15)

where the expectation is with respect to the empirical distribution over the training set. Stochastic Gradient Descent (SGD) instead performs the following: at each iteration, draw (x̃, ỹ) from the empirical distribution, i.e. select a training point uniformly at random, and perform the update

  θ_{t+1} = θ_t − nη ∇l(x̃, ỹ, θ_t) − ηλθ_t.    (16)

The update, in expectation, agrees with (15). Note the above may differ from other texts by the coefficient n. This is an artifact of us not dividing the sum of losses by n, and amounts to a rescaling of λ.

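Below is a minimal numpy sketch of SGD with the update (16) applied to the regularized logistic loss (11); the synthetic data, the step size η, and the iteration count T are illustrative choices, and the gradient helper anticipates the formula given in the next section.

# A minimal sketch of SGD for the l2-regularized logistic loss (11),
# following the update (16); hyperparameters and data are illustrative.
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def grad_logistic(x, y, theta):
    # gradient of log(1 + exp(-y theta^T x)) with respect to theta
    return sigmoid(-y * (theta @ x)) * (-y) * x

def sgd_logistic(X, y, lam=0.1, eta=1e-3, T=20000, seed=0):
    rng = np.random.default_rng(seed)
    n, d = X.shape
    theta = np.zeros(d)
    for _ in range(T):
        i = rng.integers(n)    # draw a training point uniformly at random
        theta = theta - n * eta * grad_logistic(X[i], y[i], theta) - eta * lam * theta
    return theta

rng = np.random.default_rng(1)
n, d = 200, 3
X = np.hstack([np.ones((n, 1)), rng.normal(size=(n, d - 1))])
theta_true = np.array([0.5, 2.0, -1.0])
y = np.where(rng.uniform(size=n) < sigmoid(X @ theta_true), 1, -1)
print(sgd_logistic(X, y))
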
For ridge regression,

  l(x, y, θ) = (θᵀx − y)².

Therefore,

  ∇l(x, y, θ) = 2(θᵀx − y)x.

For logistic regression,

  l(x, y, θ) = log(1 + exp(−y θᵀx)).

Therefore,

  ∇l(x, y, θ) = 1/(1 + exp(−y θᵀx)) · exp(−y θᵀx) · (−yx).

SGD can be further stabilized by:

1. Instead of taking the last iteration's parameter vector θ_T, take the average of all parameters over time:

  θ̄ = (1/T) Σ_{t=1}^T θ_t.

You will see that, for reasonably large T, the average parameter achieves an MSE objective that is very close to the minimum. It will still fluctuate a little, as one should expect, but overall it is a lot more stable.

2. Alternatively (or in conjunction), you may try to diminish the step size η as you go. One common scheme is to let the step size in iteration t be η_t = η_0 / t, where η_0 is the original step size. As t = 1, 2, ... increases, η decreases. This will also stabilize SGD.

Both methods are motivated by theoretical analysis of SGD.
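
Both stabilization tricks drop straight into the SGD loop. A minimal sketch, assuming the grad_logistic helper and data from the previous sketch; the running average implements item 1 and the η_t = η_0 / t schedule implements item 2, with illustrative constants.

# A minimal sketch of averaged SGD with a diminishing step size, assuming the
# grad_logistic helper and (X, y) data from the previous sketch.
import numpy as np

def sgd_stabilized(X, y, grad, lam=0.1, eta0=1e-2, T=20000, seed=0):
    rng = np.random.default_rng(seed)
    n, d = X.shape
    theta = np.zeros(d)
    theta_sum = np.zeros(d)
    for t in range(1, T + 1):
        eta = eta0 / t                    # diminishing step size (item 2)
        i = rng.integers(n)
        theta = theta - n * eta * grad(X[i], y[i], theta) - eta * lam * theta
        theta_sum += theta
    return theta_sum / T                  # average of all iterates (item 1)

# usage with the previous sketch: theta_bar = sgd_stabilized(X, y, grad_logistic)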