Linear and logistic regression


Guillaume Obozinski
Ecole des Ponts - ParisTech
Master MVA

Outline
1. Linear regression
2. Logistic regression
3. Fisher discriminant analysis

Linear regression

Design matrix

Consider a finite collection of vectors $x_i \in \mathbb{R}^d$ for $i = 1, \dots, n$, and form the design matrix

$$X = \begin{pmatrix} x_1^\top \\ \vdots \\ x_n^\top \end{pmatrix}.$$

We will assume that the vectors are centered, i.e. that $\sum_{i=1}^n x_i = 0$, and normalized, i.e. that $\frac{1}{n} \sum_{i=1}^n x_{ij}^2 = 1$ for each coordinate $j$. If the $x_i$ are not centered, the design matrix of centered data can be constructed with the rows $x_i - \bar{x}$, where $\bar{x} = \frac{1}{n} \sum_{i=1}^n x_i$. Normalization usually consists in dividing each column by its empirical standard deviation.
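As an illustration (not from the slides), here is a minimal NumPy sketch of this preprocessing; it assumes the biased $1/n$ convention for the empirical variance, consistent with the normalization condition above.

```python
import numpy as np

def standardize(X):
    """Center each column, then divide it by its empirical standard deviation,
    so that sum_i x_i = 0 and (1/n) sum_i x_ij^2 = 1 for every coordinate j."""
    Xc = X - X.mean(axis=0)                # rows become x_i - x_bar
    std = np.sqrt((Xc ** 2).mean(axis=0))  # biased (1/n) empirical std per column
    return Xc / std

# Columns of Xs sum to ~0 and have mean square ~1.
rng = np.random.default_rng(0)
Xs = standardize(rng.normal(loc=2.0, scale=5.0, size=(100, 3)))
```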

Generative models vs conditional models

X is the input variable, Y is the output variable.

- A generative model is a model of the joint distribution p(x, y).
- A conditional model is a model of the conditional distribution p(y | x).

Conditional models (CM) vs generative models:
- CM make fewer assumptions about the data distribution,
- CM require fewer parameters,
- CM are typically computationally harder to learn,
- CM typically cannot handle missing data or latent variables.

Probabilistic version of linear regression

Model the conditional distribution of Y given X by

$$Y \mid X \sim \mathcal{N}(w^\top X + b, \sigma^2),$$

or equivalently $Y = w^\top X + b + \epsilon$ with $\epsilon \sim \mathcal{N}(0, \sigma^2)$. The offset can be ignored up to a reparameterization, replacing $x$ by $\begin{pmatrix} x \\ 1 \end{pmatrix}$:

$$Y = w^\top \begin{pmatrix} x \\ 1 \end{pmatrix} + \epsilon.$$

Likelihood for one pair:

$$p(y_i \mid x_i) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left( -\frac{1}{2} \frac{(y_i - w^\top x_i)^2}{\sigma^2} \right)$$

Negative log-likelihood:

$$\ell(w, \sigma^2) = -\sum_{i=1}^n \log p(y_i \mid x_i) = \frac{n}{2} \log(2\pi\sigma^2) + \frac{1}{2\sigma^2} \sum_{i=1}^n (y_i - w^\top x_i)^2.$$

Probabilistic version of linear regression

$$\min_{\sigma^2, w}\; \frac{n}{2} \log(2\pi\sigma^2) + \frac{1}{2\sigma^2} \sum_{i=1}^n (y_i - w^\top x_i)^2$$

The minimization problem in w,

$$\min_w\; \frac{1}{2\sigma^2} \, \| y - Xw \|_2^2,$$

is the usual linear regression, with $y = (y_1, \dots, y_n)^\top$ and $X$ the design matrix with rows equal to $x_i^\top$. Optimizing over $\sigma^2$, we find

$$\hat{\sigma}^2_{\mathrm{MLE}} = \frac{1}{n} \sum_{i=1}^n \big( y_i - \hat{w}_{\mathrm{MLE}}^\top x_i \big)^2.$$

Solving linear regression

To solve $\min_{w \in \mathbb{R}^p} R_n(f_w)$, we note that

$$R_n(f_w) = \frac{1}{2n} \big( w^\top X^\top X w - 2\, w^\top X^\top y + \| y \|^2 \big)$$

is a differentiable convex function whose minima are therefore characterized by the normal equations

$$X^\top X w - X^\top y = 0.$$

If $X^\top X$ is invertible, then $\hat{w}_{\mathrm{MLE}}$ is given by

$$\hat{w}_{\mathrm{MLE}} = (X^\top X)^{-1} X^\top y.$$

Problem: $X^\top X$ is never invertible for $p > n$, in which case the solution is not unique (and any solution is overfit).
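A short NumPy sketch of this closed form (illustrative, not part of the slides), which also returns the variance estimate $\hat{\sigma}^2_{\mathrm{MLE}}$ from the previous slide; it assumes $X^\top X$ is invertible, i.e. $p \le n$ with full column rank.

```python
import numpy as np

def fit_linear_regression(X, y):
    """Solve the normal equations X^T X w = X^T y (assumes X^T X invertible)."""
    w_mle = np.linalg.solve(X.T @ X, X.T @ y)
    sigma2_mle = np.mean((y - X @ w_mle) ** 2)  # (1/n) sum_i (y_i - w^T x_i)^2
    return w_mle, sigma2_mle
```

When $p > n$, np.linalg.lstsq still returns a (minimum-norm) least-squares solution, but, as noted above, the minimizer is then not unique and overfits.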

Logistic regression

Logistic regression

Classification setting: $\mathcal{X} = \mathbb{R}^p$, $\mathcal{Y} = \{0, 1\}$.

Key assumption:

$$\log \frac{P(Y=1 \mid X=x)}{P(Y=0 \mid X=x)} = w^\top x$$

This implies that $P(Y=1 \mid X=x) = \sigma(w^\top x)$ for

$$\sigma : z \mapsto \frac{1}{1 + e^{-z}},$$

the logistic function.

[Figure: graph of the logistic function, increasing from 0 to 1 for $z \in [-10, 10]$.]

The logistic function is part of the family of sigmoid functions; it is often called "the" sigmoid function.

Properties:
- $\forall z \in \mathbb{R},\ \sigma(-z) = 1 - \sigma(z)$
- $\forall z \in \mathbb{R},\ \sigma'(z) = \sigma(z)\,(1 - \sigma(z)) = \sigma(z)\,\sigma(-z)$
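A small sketch (illustrative, not from the slides) of a numerically stable implementation of $\sigma$, with finite-difference checks of the two properties above:

```python
import numpy as np

def sigmoid(z):
    """Logistic function sigma(z) = 1 / (1 + exp(-z)), written to avoid overflow."""
    z = np.asarray(z, dtype=float)
    out = np.empty_like(z)
    pos = z >= 0
    out[pos] = 1.0 / (1.0 + np.exp(-z[pos]))
    ez = np.exp(z[~pos])        # exponentials of negative values only
    out[~pos] = ez / (1.0 + ez)
    return out

z = np.linspace(-8, 8, 33)
assert np.allclose(sigmoid(-z), 1 - sigmoid(z))                     # sigma(-z) = 1 - sigma(z)
num_deriv = (sigmoid(z + 1e-6) - sigmoid(z - 1e-6)) / 2e-6
assert np.allclose(num_deriv, sigmoid(z) * sigmoid(-z), atol=1e-6)  # sigma' = sigma(z) sigma(-z)
```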

Likelihood for logistic regression

Let $\eta := \sigma(w^\top x + b)$. W.l.o.g. we assume $b = 0$. By assumption, $Y \mid X = x \sim \mathrm{Ber}(\eta)$.

Likelihood:

$$p(Y = y \mid X = x) = \eta^y (1 - \eta)^{1-y} = \sigma(w^\top x)^y\, \sigma(-w^\top x)^{1-y}.$$

Log-likelihood:

$$\begin{aligned}
\ell(w) &= y \log \sigma(w^\top x) + (1 - y) \log \sigma(-w^\top x) \\
&= y \log \eta + (1 - y) \log(1 - \eta) \\
&= y \log \frac{\eta}{1 - \eta} + \log(1 - \eta) \\
&= y\, w^\top x + \log \sigma(-w^\top x)
\end{aligned}$$

Maximizing the log-likelihood

Log-likelihood of a sample: given an i.i.d. training set $\mathcal{D} = \{(x_1, y_1), \dots, (x_n, y_n)\}$,

$$\ell(w) = \sum_{i=1}^n y_i\, w^\top x_i + \log \sigma(-w^\top x_i).$$

The log-likelihood is differentiable and concave; its global maxima are its stationary points.

Gradient of $\ell$:

$$\nabla \ell(w) = \sum_{i=1}^n y_i x_i - \frac{\sigma(-w^\top x_i)\, \sigma(w^\top x_i)}{\sigma(-w^\top x_i)}\, x_i = \sum_{i=1}^n (y_i - \eta_i)\, x_i \quad \text{with } \eta_i = \sigma(w^\top x_i).$$

Thus, $\nabla \ell(w) = 0 \iff \sum_{i=1}^n x_i \big( y_i - \sigma(w^\top x_i) \big) = 0$. No closed-form solution!
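In code, the sample log-likelihood and its gradient take a compact matrix form (a sketch, assuming labels in {0, 1}; scipy.special.expit is the logistic function):

```python
import numpy as np
from scipy.special import expit  # the logistic function sigma

def log_likelihood(w, X, y):
    """l(w) = sum_i y_i w^T x_i + log sigma(-w^T x_i), labels y_i in {0, 1}."""
    z = X @ w
    return np.sum(y * z - np.logaddexp(0.0, z))  # log sigma(-z) = -log(1 + e^z)

def gradient(w, X, y):
    """grad l(w) = X^T (y - eta) with eta_i = sigma(w^T x_i)."""
    return X.T @ (y - expit(X @ w))
```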

Second order Taylor expansion

We need an iterative method to solve $\sum_{i=1}^n x_i \big( y_i - \sigma(w^\top x_i) \big) = 0$:
- gradient descent (aka steepest descent), or
- Newton's method.

Hessian of $\ell$:

$$H\ell(w) = \sum_{i=1}^n x_i \big( 0 - \sigma(w^\top x_i)\, \sigma(-w^\top x_i) \big) x_i^\top = -\sum_{i=1}^n \eta_i (1 - \eta_i)\, x_i x_i^\top = -X^\top \mathrm{Diag}\big( \eta_i (1 - \eta_i) \big) X,$$

where $X$ is the design matrix. Note that $-H\ell$ is p.s.d., hence $\ell$ is concave.
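The Hessian in matrix form, as a short sketch following the formula above (one can check concavity numerically: all eigenvalues of the returned matrix are $\le 0$):

```python
import numpy as np
from scipy.special import expit

def hessian(w, X):
    """Hl(w) = -X^T Diag(eta_i (1 - eta_i)) X; negative semi-definite."""
    eta = expit(X @ w)
    d = eta * (1.0 - eta)           # diagonal of D_eta
    return -(X * d[:, None]).T @ X
```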

Newton's method

Use the Taylor expansion

$$\ell(w) \approx \ell(w^t) + (w - w^t)^\top \nabla \ell(w^t) + \frac{1}{2} (w - w^t)^\top H\ell(w^t)\, (w - w^t)$$

and maximize w.r.t. $w$. Setting $h = w - w^t$, we get

$$\max_h\; h^\top \nabla \ell(w) + \frac{1}{2}\, h^\top H\ell(w)\, h.$$

I.e., for logistic regression, writing $D_\eta = \mathrm{Diag}\big( (\eta_i (1 - \eta_i))_i \big)$,

$$\max_h\; h^\top X^\top (y - \eta) - \frac{1}{2}\, h^\top X^\top D_\eta X h.$$

Modified normal equations:

$$X^\top D_\eta X\, h = X^\top \tilde{y} \quad \text{with } \tilde{y} = y - \eta.$$

Iterative Reweighted Least Squares (IRLS)

Assuming $X^\top D_\eta X$ is invertible, the algorithm takes the form

$$w^{(t+1)} \leftarrow w^{(t)} + \big( X^\top D_{\eta^{(t)}} X \big)^{-1} X^\top \big( y - \eta^{(t)} \big).$$

This is called iterative reweighted least squares because each step is equivalent to solving the reweighted least squares problem

$$\min_h\; \frac{1}{2} \sum_{i=1}^n \frac{1}{\tau_i^2} \big( x_i^\top h - \check{y}_i \big)^2 \quad \text{with } \tau_i^2 = \frac{1}{\eta_i^{(t)} \big( 1 - \eta_i^{(t)} \big)} \text{ and } \check{y}_i = \tau_i^2 \big( y_i - \eta_i^{(t)} \big).$$
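Putting the pieces together, a sketch of the IRLS update in NumPy (illustrative; it assumes labels in {0, 1} and that $X^\top D_{\eta^{(t)}} X$ stays invertible). No step size is needed, although in practice a damped (line-search) variant is more robust.

```python
import numpy as np
from scipy.special import expit

def irls(X, y, n_iter=25, tol=1e-10):
    """Logistic regression by IRLS: w <- w + (X^T D_eta X)^{-1} X^T (y - eta)."""
    w = np.zeros(X.shape[1])
    for _ in range(n_iter):
        eta = expit(X @ w)
        d = eta * (1.0 - eta)                        # weights eta_i (1 - eta_i)
        h = np.linalg.solve((X * d[:, None]).T @ X,  # X^T D_eta X
                            X.T @ (y - eta))
        w += h
        if h @ h < tol:                              # Newton step is tiny: stop
            break
    return w
```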

Alternate formulation of logistic regression

If $y \in \{-1, 1\}$, then

$$P(Y = y \mid X = x) = \sigma(y\, w^\top x)$$

Log-likelihood:

$$\ell(w) = \log \sigma(y\, w^\top x) = -\log\big( 1 + \exp(-y\, w^\top x) \big)$$

Log-likelihood for a training set:

$$\ell(w) = -\sum_{i=1}^n \log\big( 1 + \exp(-y_i\, w^\top x_i) \big)$$
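With this convention, the negative log-likelihood is the familiar logistic loss; a one-function sketch (np.logaddexp keeps $\log(1 + e^t)$ numerically stable):

```python
import numpy as np

def logistic_loss(w, X, y):
    """-l(w) = sum_i log(1 + exp(-y_i w^T x_i)), labels y_i in {-1, +1}."""
    return np.sum(np.logaddexp(0.0, -y * (X @ w)))
```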

Fisher discriminant analysis

Generative classification

$X \in \mathbb{R}^p$ and $Y \in \{0, 1\}$. Instead of modeling $p(y \mid x)$ directly, model $p(y)$ and $p(x \mid y)$, and deduce $p(y \mid x)$ using Bayes' rule. In classification,

$$P(Y = 1 \mid X = x) = \frac{P(X = x \mid Y = 1)\, P(Y = 1)}{P(X = x \mid Y = 1)\, P(Y = 1) + P(X = x \mid Y = 0)\, P(Y = 0)}.$$

For example, one can assume

$$P(Y = 1) = \pi, \qquad X \mid Y = 1 \sim \mathcal{N}(\mu_1, \Sigma_1), \qquad X \mid Y = 0 \sim \mathcal{N}(\mu_0, \Sigma_0).$$
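Under the Gaussian assumption, the posterior is a direct application of Bayes' rule; a sketch using scipy.stats (the parameters are assumed to be already estimated):

```python
import numpy as np
from scipy.stats import multivariate_normal

def posterior_class1(x, pi, mu1, Sigma1, mu0, Sigma0):
    """P(Y=1 | X=x) by Bayes' rule with Gaussian class conditionals."""
    p1 = pi * multivariate_normal.pdf(x, mean=mu1, cov=Sigma1)
    p0 = (1.0 - pi) * multivariate_normal.pdf(x, mean=mu0, cov=Sigma0)
    return p1 / (p1 + p0)
```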

Fisher's discriminant, aka Linear Discriminant Analysis (LDA)

This is the previous model with the constraint $\Sigma_1 = \Sigma_0 = \Sigma$. Given a training set, the model parameters can be estimated using the maximum likelihood principle, which leads to estimates $(\hat{\pi}, \hat{\mu}_1, \hat{\mu}_0, \hat{\Sigma})$.
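These maximum likelihood estimates have closed forms (class frequency, class means, pooled covariance); a sketch, assuming labels in {0, 1} and the biased $1/n$ convention for $\hat{\Sigma}$:

```python
import numpy as np

def fit_lda(X, y):
    """MLE for the LDA model with shared covariance Sigma."""
    pi = y.mean()                                 # hat(pi) = fraction of class 1
    mu1 = X[y == 1].mean(axis=0)
    mu0 = X[y == 0].mean(axis=0)
    Xc = X - np.where(y[:, None] == 1, mu1, mu0)  # center each point by its class mean
    Sigma = Xc.T @ Xc / len(y)                    # pooled (1/n) covariance
    return pi, mu1, mu0, Sigma
```

With these estimates and the shared covariance, the log-odds $\log \frac{P(Y=1 \mid x)}{P(Y=0 \mid x)}$ is linear in $x$, hence the name linear discriminant analysis.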