CSC 411: Lecture 04: Logistic Regression

CSC 411: Lecture 04: Logistic Regression
Raquel Urtasun & Rich Zemel
University of Toronto
Sep 23, 2015

Today

Key concepts: logistic regression, regularization, cross-validation.

Logistic Regression

An alternative: replace the sign(·) with the sigmoid or logistic function.
We assume a particular functional form: a sigmoid applied to a linear function of the data,
y(x) = \sigma(w^\top x + w_0),
where the sigmoid is defined as
\sigma(z) = \frac{1}{1 + \exp(-z)}.
[Figure: the sigmoid curve, rising smoothly from 0 towards 1, crossing 0.5 at z = 0.]
The output is a smooth function of the inputs and the weights.

Logistic Regression

We assume a particular functional form: a sigmoid applied to a linear function of the data,
y(x) = \sigma(w^\top x + w_0),
where the sigmoid is defined as
\sigma(z) = \frac{1}{1 + \exp(-z)}.
One parameter per data dimension (feature).
Features can be discrete or continuous.
Output of the model: a value y \in [0, 1].
This allows for gradient-based learning of the parameters: a smoothed version of sign(·).
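As a concrete illustration, here is a minimal NumPy sketch of this functional form (not from the original slides; the toy inputs, weights, and function names are made up): an affine map followed by the sigmoid.

    import numpy as np

    def sigmoid(z):
        # sigma(z) = 1 / (1 + exp(-z)); clipping keeps exp() from overflowing
        return 1.0 / (1.0 + np.exp(-np.clip(z, -500, 500)))

    def predict(X, w, w0):
        # y(x) = sigma(w^T x + w0), applied row-wise; every output lies in [0, 1]
        return sigmoid(X @ w + w0)

    # Toy usage: 3 examples with 2 features each (made-up numbers)
    X = np.array([[0.5, 1.2], [-1.0, 0.3], [2.0, -0.7]])
    w = np.array([0.8, -0.4])
    print(predict(X, w, w0=0.1))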

Shape of the Logistic Function

Let's look at how modifying w changes the function shape.
1D example: y = \sigma(w_1 x + w_0).
Demo.
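The live demo is not reproducible from the transcript; the following small sketch (my own, with arbitrary parameter settings) evaluates y = σ(w_1 x + w_0) on a grid to show that a larger |w_1| gives a steeper transition and w_0 shifts the curve.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    x = np.linspace(-5, 5, 11)
    for w1, w0 in [(1.0, 0.0), (5.0, 0.0), (1.0, 2.0)]:
        # (1, 0): baseline; (5, 0): much steeper step; (1, 2): crossing point shifted to x = -2
        print(f"w1={w1}, w0={w0}:", np.round(sigmoid(w1 * x + w0), 2))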

Probabilistic Interpretation

If we have a value between 0 and 1, let's use it to model the posterior
p(C = 0 \mid x) = \sigma(w^\top x + w_0), \quad \text{with } \sigma(z) = \frac{1}{1 + \exp(-z)}.
Substituting, we have
p(C = 0 \mid x) = \frac{1}{1 + \exp(-w^\top x - w_0)}.
Suppose we have two classes; how can we compute p(C = 1 \mid x)?
Use the marginalization property of probability: p(C = 1 \mid x) + p(C = 0 \mid x) = 1.
Thus (MATLAB demo)
p(C = 1 \mid x) = 1 - \frac{1}{1 + \exp(-w^\top x - w_0)} = \frac{\exp(-w^\top x - w_0)}{1 + \exp(-w^\top x - w_0)}.
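A quick numerical check (my own sketch, standing in for the lecture's MATLAB demo) that the two posteriors above are complementary, i.e. p(C = 1 | x) = 1 - σ(z) = exp(-z) / (1 + exp(-z)) for z = w^T x + w_0:

    import numpy as np

    z = np.array([-3.0, -0.5, 0.0, 0.5, 3.0])    # z = w^T x + w0 for a few hypothetical inputs
    p_c0 = 1.0 / (1.0 + np.exp(-z))              # p(C = 0 | x) = sigma(z)
    p_c1 = np.exp(-z) / (1.0 + np.exp(-z))       # p(C = 1 | x) from the slide
    assert np.allclose(p_c0 + p_c1, 1.0)         # marginalization: the two posteriors sum to 1
    print(np.round(p_c1, 4))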

Conditional Likelihood

Assume t \in \{0, 1\}. We can write the probability distribution of our training targets, p(t^{(1)}, \dots, t^{(N)} \mid x^{(1)}, \dots, x^{(N)}).
Assuming that the training examples are sampled IID (independent and identically distributed):
p(t^{(1)}, \dots, t^{(N)} \mid x^{(1)}, \dots, x^{(N)}) = \prod_{i=1}^{N} p(t^{(i)} \mid x^{(i)})
We can write each probability as
p(t^{(i)} \mid x^{(i)}) = p(C = 1 \mid x^{(i)})^{t^{(i)}} \, p(C = 0 \mid x^{(i)})^{1 - t^{(i)}} = \left(1 - p(C = 0 \mid x^{(i)})\right)^{t^{(i)}} p(C = 0 \mid x^{(i)})^{1 - t^{(i)}}
We might want to learn the model by maximizing the conditional likelihood
\max_{w} \prod_{i=1}^{N} p(t^{(i)} \mid x^{(i)})
Convert this into a minimization so that we can write the loss function.

Loss Function

p(t^{(1)}, \dots, t^{(N)} \mid x^{(1)}, \dots, x^{(N)}) = \prod_{i=1}^{N} p(t^{(i)} \mid x^{(i)}) = \prod_{i=1}^{N} \left(1 - p(C = 0 \mid x^{(i)})\right)^{t^{(i)}} p(C = 0 \mid x^{(i)})^{1 - t^{(i)}}
It's convenient to take the logarithm and convert the maximization into a minimization by changing the sign:
\ell_{\log}(w) = -\sum_{i=1}^{N} t^{(i)} \log\left(1 - p(C = 0 \mid x^{(i)}, w)\right) - \sum_{i=1}^{N} (1 - t^{(i)}) \log p(C = 0 \mid x^{(i)}, w)
Why is this equivalent to maximizing the conditional likelihood?
Is there a closed-form solution?
It's a convex function of w. Can we get the global optimum?
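A minimal NumPy sketch of the negative log-likelihood above (not from the lecture; the toy data and names are mine), using the lecture's convention that p(C = 0 | x, w) = σ(w^T x + w_0):

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def neg_log_likelihood(w, w0, X, t):
        # l_log(w) = -sum_i t log(1 - p(C=0|x)) - sum_i (1 - t) log p(C=0|x)
        p_c0 = sigmoid(X @ w + w0)
        eps = 1e-12                              # guard against log(0)
        return -np.sum(t * np.log(1.0 - p_c0 + eps) + (1.0 - t) * np.log(p_c0 + eps))

    # Toy data: 4 examples, 2 features, binary targets
    X = np.array([[0.2, 1.0], [1.5, -0.3], [-0.7, 0.8], [2.0, 2.0]])
    t = np.array([1.0, 0.0, 1.0, 0.0])
    print(neg_log_likelihood(np.array([0.5, -0.5]), 0.0, X, t))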

Gradient Descent

\min_{w} \ell(w) = \min_{w} \left\{ -\sum_{i=1}^{N} t^{(i)} \log\left(1 - p(C = 0 \mid x^{(i)}, w)\right) - \sum_{i=1}^{N} (1 - t^{(i)}) \log p(C = 0 \mid x^{(i)}, w) \right\}
Gradient descent: iterate, and at each iteration compute the steepest direction towards the optimum and move in that direction with step size λ:
w^{(t+1)} \leftarrow w^{(t)} - \lambda \frac{\partial \ell(w)}{\partial w}
But where is w?
p(C = 0 \mid x) = \frac{1}{1 + \exp(-w^\top x - w_0)}, \qquad p(C = 1 \mid x) = \frac{\exp(-w^\top x - w_0)}{1 + \exp(-w^\top x - w_0)}
You can write this in vector form:
\nabla \ell(w) = \left[ \frac{\partial \ell(w)}{\partial w_0}, \dots, \frac{\partial \ell(w)}{\partial w_k} \right]^\top, \qquad \Delta w = -\lambda \nabla \ell(w)
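A generic gradient-descent loop matching the update rule above, sketched in Python; loss_grad, the step size lam (λ), and the toy quadratic are illustrative placeholders, since the actual gradient of ℓ(w) is derived on the next slides.

    import numpy as np

    def gradient_descent(loss_grad, w_init, lam=0.1, num_iters=100):
        # w^(t+1) <- w^(t) - lambda * d l(w) / d w
        w = np.array(w_init, dtype=float)
        for _ in range(num_iters):
            w = w - lam * loss_grad(w)
        return w

    # Usage on a toy convex loss l(w) = ||w - 3||^2, whose gradient is 2(w - 3)
    print(gradient_descent(lambda w: 2.0 * (w - 3.0), w_init=[0.0, 0.0]))   # approaches [3, 3]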

Let's look at the updates

The log loss (negative log-likelihood) is
\ell_{\text{log loss}}(w) = -\sum_{i=1}^{N} t^{(i)} \log p(C = 1 \mid x^{(i)}, w) - \sum_{i=1}^{N} (1 - t^{(i)}) \log p(C = 0 \mid x^{(i)}, w)
where the probabilities are
p(C = 0 \mid x, w) = \frac{1}{1 + \exp(-z)}, \qquad p(C = 1 \mid x, w) = \frac{\exp(-z)}{1 + \exp(-z)}, \qquad z = w^\top x + w_0
We can simplify:
\ell(w) = \sum_i t^{(i)} \log(1 + \exp(-z^{(i)})) + \sum_i t^{(i)} z^{(i)} + \sum_i (1 - t^{(i)}) \log(1 + \exp(-z^{(i)})) = \sum_i \log(1 + \exp(-z^{(i)})) + \sum_i t^{(i)} z^{(i)}
Now it's easy to take derivatives.
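As a sanity check (my own sketch, with randomly generated z and t), the simplified form and the original cross-entropy agree numerically:

    import numpy as np

    rng = np.random.default_rng(0)
    z = rng.normal(size=10)                      # z^(i) = w^T x^(i) + w0 for made-up inputs
    t = rng.integers(0, 2, size=10).astype(float)

    p_c0 = 1.0 / (1.0 + np.exp(-z))              # p(C = 0 | x, w)
    p_c1 = np.exp(-z) / (1.0 + np.exp(-z))       # p(C = 1 | x, w)

    original   = -np.sum(t * np.log(p_c1) + (1 - t) * np.log(p_c0))
    simplified = np.sum(np.log(1.0 + np.exp(-z))) + np.sum(t * z)
    assert np.isclose(original, simplified)      # the two forms of l(w) agree
    print(original, simplified)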

Updates

\ell(w) = \sum_i t^{(i)} z^{(i)} + \sum_i \log(1 + \exp(-z^{(i)}))
Now it's easy to take derivatives. Remember z = w^\top x + w_0:
\frac{\partial \ell}{\partial w_j} = \sum_i t^{(i)} x_j^{(i)} - \sum_i \frac{x_j^{(i)} \exp(-z^{(i)})}{1 + \exp(-z^{(i)})}
What's x_j^{(i)}?
And simplifying:
\frac{\partial \ell}{\partial w_j} = \sum_i x_j^{(i)} \left( t^{(i)} - p(C = 1 \mid x^{(i)}) \right)
Don't get confused with the indices: j is for the weight that we are updating and i is for the training example.
Logistic regression has a linear decision boundary.
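A vectorized sketch of this gradient (mine, not the lecture's; the toy data are made up), together with a finite-difference check of the first coordinate. Note the lecture's convention p(C = 1 | x) = σ(-z):

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def grad(w, w0, X, t):
        # dl/dw_j = sum_i x_j^(i) * (t^(i) - p(C = 1 | x^(i)))   with p(C=1|x) = sigma(-z)
        z = X @ w + w0
        err = t - sigmoid(-z)
        return X.T @ err, np.sum(err)            # gradients w.r.t. w and w.r.t. the bias w0

    def loss(w, w0, X, t):
        # l(w) = sum_i log(1 + exp(-z^(i))) + sum_i t^(i) z^(i)
        z = X @ w + w0
        return np.sum(np.log(1.0 + np.exp(-z))) + np.sum(t * z)

    # Finite-difference check of dl/dw_1 on a toy problem (made-up numbers)
    X = np.array([[0.2, 1.0], [1.5, -0.3], [-0.7, 0.8]])
    t = np.array([1.0, 0.0, 1.0])
    w, w0, eps = np.array([0.3, -0.2]), 0.1, 1e-6
    g_w, _ = grad(w, w0, X, t)
    step = np.array([eps, 0.0])
    numeric = (loss(w + step, w0, X, t) - loss(w - step, w0, X, t)) / (2 * eps)
    print(g_w[0], numeric)                       # analytic and numeric values should agree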

Logistic regression vs least squares

[Figure: decision boundaries found by logistic regression and by least-squares regression on the same data.]
If the right answer is 1 and the model says 1.5, least squares loses, so it changes the boundary to avoid being "too correct" (it tilts away from the outliers).

Regularization

We can also look at
p(w \mid \{t\}, \{x\}) \propto p(\{t\} \mid \{x\}, w) \, p(w)
with \{t\} = (t^{(1)}, \dots, t^{(N)}) and \{x\} = (x^{(1)}, \dots, x^{(N)}).
We can define priors on the parameters w.
This is a form of regularization. It helps avoid large weights and overfitting:
\max_{w} \log \left[ p(w) \prod_i p(t^{(i)} \mid x^{(i)}, w) \right]
What's p(w)?

Regularized Logistic Regression

For example, define the prior as a normal distribution with zero mean and identity covariance: p(w) = \mathcal{N}(0, \alpha I).
This prior pushes the parameters towards zero.
Including this prior, the new gradient update is
w^{(t+1)} \leftarrow w^{(t)} - \lambda \frac{\partial \ell(w)}{\partial w} - \lambda \alpha w^{(t)}
where t here refers to the iteration of gradient descent.
How do we decide the best value of α?
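A one-step sketch of this weight-decay update (mine; the step size, α, and gradient values are arbitrary). In practice the bias w_0 is often left out of the penalty; this sketch shrinks everything it is given.

    import numpy as np

    def regularized_step(w, grad_w, lam=0.1, alpha=0.01):
        # w^(t+1) <- w^(t) - lambda * dl/dw - lambda * alpha * w^(t)
        # The extra -lambda * alpha * w term comes from the Gaussian prior and shrinks the weights.
        return w - lam * grad_w - lam * alpha * w

    # One update from an arbitrary point with a made-up gradient value
    print(regularized_step(np.array([2.0, -1.5]), np.array([0.4, -0.2])))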

Use of Validation Set

We can divide the set of training examples into two disjoint sets: training and validation.
Use the first set (i.e., training) to estimate the weights w for different values of α.
Use the second set (i.e., validation) to estimate the best α, by evaluating how well the classifier does on this second set.
This tests how well you generalize to unseen data.
The parameter α controls the importance of the regularization, and it's a hyper-parameter.
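A sketch of this procedure (mine, not the lecture's; the data, the α grid, and the compact fit routine, which reuses the regularized gradient update from the earlier sketches, are all assumptions):

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def fit(X, t, alpha, lam=0.01, iters=2000):
        # Regularized gradient descent, lecture convention: p(C=1|x) = sigma(-(w^T x + w0))
        w, w0 = np.zeros(X.shape[1]), 0.0
        for _ in range(iters):
            err = t - sigmoid(-(X @ w + w0))           # t^(i) - p(C = 1 | x^(i))
            w = w - lam * (X.T @ err) - lam * alpha * w
            w0 = w0 - lam * np.sum(err)
        return w, w0

    def accuracy(X, t, w, w0):
        pred = (sigmoid(-(X @ w + w0)) > 0.5).astype(float)   # predict class 1 when p(C=1|x) > 0.5
        return np.mean(pred == t)

    # Made-up data: the class is (noisily) determined by the first feature
    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 2))
    t = (X[:, 0] + 0.3 * rng.normal(size=200) > 0).astype(float)

    X_tr, t_tr = X[:150], t[:150]        # training split: estimate w for each alpha
    X_val, t_val = X[150:], t[150:]      # validation split: pick the best alpha

    scores = {a: accuracy(X_val, t_val, *fit(X_tr, t_tr, a)) for a in [0.0, 0.01, 0.1, 1.0, 10.0]}
    print(scores, "best alpha:", max(scores, key=scores.get))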

Cross-Validation

Leave-p-out cross-validation: we use p observations as the validation set and the remaining observations as the training set. This is repeated for all ways to cut the original training set, so it requires \binom{n}{p} rounds for a set of n examples.
Leave-1-out cross-validation: when p = 1, this combinatorial cost disappears (only n rounds are needed).
k-fold cross-validation: the training set is randomly partitioned into k equal-size subsamples. Of the k subsamples, a single subsample is retained as the validation data for testing the model, and the remaining k - 1 subsamples are used as training data. The cross-validation process is then repeated k times (the folds). The k results from the folds can then be averaged (or otherwise combined) to produce a single estimate.
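A sketch of the k-fold mechanics (mine; the fit_and_score stand-in below is a trivial threshold rule on made-up data, whereas in practice one would plug in the regularized logistic regression fit from the previous sketch):

    import numpy as np

    def kfold_indices(n, k, seed=0):
        # Randomly partition {0, ..., n-1} into k (nearly) equal-size folds.
        idx = np.random.default_rng(seed).permutation(n)
        return np.array_split(idx, k)

    def kfold_score(X, t, k, fit_and_score):
        # For each fold: hold it out as validation, train on the remaining k-1 folds,
        # then average the k validation scores into a single estimate.
        folds = kfold_indices(len(t), k)
        scores = []
        for i in range(k):
            val_idx = folds[i]
            tr_idx = np.concatenate([folds[j] for j in range(k) if j != i])
            scores.append(fit_and_score(X[tr_idx], t[tr_idx], X[val_idx], t[val_idx]))
        return np.mean(scores)

    # Stand-in "classifier": predict class 1 when the first feature exceeds the training mean.
    def fit_and_score(X_tr, t_tr, X_val, t_val):
        threshold = X_tr[:, 0].mean()
        return np.mean((X_val[:, 0] > threshold).astype(float) == t_val)

    rng = np.random.default_rng(1)
    X = rng.normal(size=(100, 2))
    t = (X[:, 0] > 0).astype(float)
    print(kfold_score(X, t, k=5, fit_and_score=fit_and_score))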