INF 5860 Machine learning for image classification. Lecture 3 : Image classification and regression part II Anne Solberg January 31, 2018


Today's topics: Multiclass logistic regression and softmax. Regularization. Image classification using a linear classifier. Link to probabilistic classifiers and SVM.

Relevant additional video links: https://www.youtube.com/playlist?list=PL3FW7Lu3i5JvHM8ljYj-zLfQRF3EO8sYv (Lectures 2 and 3). Remark: they do not cover regression.

From last week: Introduction to logistic regression. We show how a regression problem can be transformed into a binary 2-class classification problem using a nonlinear loss function, and then generalize to multiple classes.

From last week: What if we fitted a function f(x) that is close to either 0 or 1? The hypothesis h_θ(x) is now a non-linear function of x. Classification: y=0 or y=1. Threshold h_θ(x): if h_θ(x) > 0.5, set y=1, otherwise set y=0. Desirable to have h_θ(x) = 1 for Cod and h_θ(x) = 0 for Herring (figure: cod/herring classification as a function of fish length).

Logistic regression model. We want 0 ≤ h_θ(x) ≤ 1 (binary problem). Let h_θ(x) = g(θ^T x), where g(z) = 1/(1 + e^{-z}) is called the sigmoid function.
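
A minimal numpy sketch of the sigmoid (not from the slides; the function name is my own):

```python
import numpy as np

def sigmoid(z):
    """g(z) = 1 / (1 + exp(-z)): squashes any real z into the interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

print(sigmoid(np.array([-10.0, 0.0, 10.0])))  # close to 0, exactly 0.5, close to 1
```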

Decisions for logistic regression. Decide y=1 if h_θ(x) > 0.5, and y=0 otherwise. h_θ(x) = g(θ^T x) with g(z) = 1/(1 + e^{-z}), so h_θ(x) = 1/(1 + e^{-θ^T x}). g(z) > 0.5 if z > 0, i.e. wx + b > 0; g(z) < 0.5 if z < 0, i.e. wx + b < 0. Here the compact notation θ means the vector of parameters [w, b].

Loss function for logistic regression. We have two classes, 1 and 0. Let us use a probabilistic model. Let the parameters be θ = [w_1, ..., w_nk, b] if we have nk features. P(y=1 | x, θ) = h_θ(x) and P(y=0 | x, θ) = 1 - h_θ(x). This can be written more compactly as p(y | x, θ) = h_θ(x)^y (1 - h_θ(x))^(1-y).

Loss function for logistic regression. The likelihood of the parameter values is L(θ) = ∏_{i=1}^{m} p(y_i | x_i, θ). It is easier to maximize the log-likelihood l(θ) = Σ_{i=1}^{m} [y_i log h_θ(x_i) + (1 - y_i) log(1 - h_θ(x_i))]. We will maximize this with gradient steps in the positive gradient direction (gradient ascent), since we are maximizing, not minimizing.

Computing the gradient of the likelihood function: ∂l(θ)/∂θ_j = Σ_{i=1}^{m} (y_i - h_θ(x_i)) x_ij. Here, we used the fact that g'(z) = g(z)(1 - g(z)).

Gradient descent of J(θ) = -l(θ). To find θ: find the θ that minimizes J(θ) = -(1/m) Σ_{i=1}^{m} [y_i log h_θ(X(i,:)) + (1 - y_i) log(1 - h_θ(X(i,:)))] using gradient descent. Repeat: θ_j := θ_j - α (1/m) Σ_{i=1}^{m} (h_θ(X(i,:)) - y_i) X(i,j). This algorithm looks similar to linear regression, but now h_θ(x) = 1/(1 + e^{-θ^T x}).
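
As a rough illustration of this update rule, a minimal numpy sketch of binary logistic regression trained with gradient descent (function name and hyperparameter values are my own, not from the slides):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logreg_gradient_descent(X, y, alpha=0.1, n_iters=1000):
    """Minimize J(theta) = -(1/m) sum_i [y_i log h + (1 - y_i) log(1 - h)].
    X: (m, n+1) matrix with a column of ones for the bias; y: (m,) labels in {0, 1}."""
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(n_iters):
        h = sigmoid(X @ theta)          # h_theta(X(i,:)) for all m samples
        grad = X.T @ (h - y) / m        # (1/m) sum_i (h - y_i) X(i, j)
        theta -= alpha * grad           # gradient descent step
    return theta
```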

Overfitting and regularization. For any classifier, there is a risk of overfitting to the training data. Overfitting: high accuracy on the training data, lower accuracy on validation data. This risk is higher the more parameters the classifier can use.

Example: polynomial regression. If a linear model h_θ(x) = θ_0 + θ_1 x is not sufficient, we can extend it to allow higher-order terms or cross-terms between the variables by changing our hypothesis, e.g. h_θ(x) = θ_0 + θ_1 x + θ_2 x^2 + θ_3 x^3 + ...
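
A small numpy sketch of the idea (my own illustration): a linear model applied to an expanded feature vector [1, x, x^2, ...] gives a polynomial hypothesis.

```python
import numpy as np

def poly_features(x, degree):
    """Expand a 1-D feature x into columns [1, x, x^2, ..., x^degree]."""
    return np.vstack([x ** d for d in range(degree + 1)]).T

x = np.linspace(0.0, 1.0, 5)
Phi = poly_features(x, degree=3)          # shape (5, 4)
theta = np.array([0.5, 1.0, -2.0, 3.0])   # theta_0 ... theta_3
h = Phi @ theta                           # polynomial hypothesis h_theta(x) per sample
```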

The danger of overfitting. A higher-order model can easily overfit the training data. For the higher-order terms: the higher the value of the coefficients, the more the curve can fluctuate. This is not valid for the first two coefficients. Restricting only the value of higher-order terms is difficult in general (e.g. for neural nets), but we can restrict the magnitude of all the coefficients except θ_0.

Overfitting for classification. Overfitting must be avoided for classification also; this is partly why we start with simple linear models.

Regularization - intuition. Consider h_θ(x) = θ_0 + θ_1 x + θ_2 x^2 + θ_3 x^3 + θ_4 x^4. Suppose we add a penalty to restrict θ_3 and θ_4: J(θ) = (1/2m) Σ_{i=1}^{m} (h_θ(X(i,:)) - y_i)^2 + 100 θ_3^2 + 100 θ_4^2. To minimize J, θ_3 and θ_4 must be small.

Regularized cost function. Simplify the hypothesis by having small values for θ_1, ..., θ_n: J(θ) = (1/2m) [Σ_{i=1}^{m} (h_θ(X(i,:)) - y_i)^2 + λ Σ_{j=1}^{n} θ_j^2]. λ is the regularization parameter. This is L2-regularization; later we will see Dropout and max norm. Remark: we do not regularize the offset b (also called θ_0).

What if λ is very large? Will we get overfit or underfit?

Gradient descent with regularization: linear regression. To find θ: find the θ that minimizes J(θ) using gradient descent. Note: NO penalty on θ_0. Repeat: θ_0 := θ_0 - α (1/m) Σ_{i=1}^{m} (h_θ(X(i,:)) - y_i) X(i,0); θ_j := θ_j - α [(1/m) Σ_{i=1}^{m} (h_θ(X(i,:)) - y_i) X(i,j) + (λ/m) θ_j] for j = 1, ..., n.

Regularized logistic regression: gradient descent. Repeat: θ_0 := θ_0 - α (1/m) Σ_{i=1}^{m} (h_θ(X(i,:)) - y_i) X(i,0); θ_j := θ_j - α [(1/m) Σ_{i=1}^{m} (h_θ(X(i,:)) - y_i) X(i,j) + (λ/m) θ_j] for j = 1, ..., n, where now h_θ(X(i,:)) = 1/(1 + e^{-θ^T X(i,:)}).
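
A minimal sketch of this regularized update in numpy, assuming column 0 of X is the all-ones bias column so that theta[0] plays the role of b and is not penalized (names and values are my own):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logreg_gd_l2(X, y, alpha=0.1, lam=1.0, n_iters=1000):
    """Gradient descent for L2-regularized logistic regression."""
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(n_iters):
        h = sigmoid(X @ theta)
        grad = X.T @ (h - y) / m
        grad[1:] += (lam / m) * theta[1:]   # add (lambda/m) theta_j, but NOT for theta_0
        theta -= alpha * grad
    return theta
```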

Introducing: classifying CIFAR images. CIFAR-10 images: 32x32x3 pixels. Stack one image into a vector x of length 32·32·3 = 3072. Classification will be to find a mapping f(W, x, b) from image space to a set of C classes. For CIFAR-10, x = [pixel 1, ..., pixel 3072]^T, W is a matrix whose rows hold the weights for pixels 1 to 3072 for class 1 up to class 10, and b = [b_1, ..., b_10]^T holds one bias per class.

Small example with 2 classes: a small gray-level image is stacked into a vector of pixel values, and W contains one weight w(c, i) per pixel i for each class c, plus a bias b_c per class. Multiplying W with the pixel vector and adding b gives one score per class (score for class 1 and score for class 2).

If a color image, append the r, g, b bands into one long vector. Note: no spatial information concerning pixel neighbors is used here; convolutional nets use spatial information. All images are standardized to the same size! For CIFAR-10 it is 32x32. If a classifier is trained on CIFAR and we have a new image to classify, resize it to 32x32.

W for multiclass image classification. W is a C×(n+1) matrix (C classes, n pixels in the image, plus 1 for b). We train one linear model per class, so each class has a different W(c,:)-vector. W(c,:) is a vector of length n+1: [weight for pixel 1 for class c, ..., weight for pixel 3072 for class c, b_c]. Let the score for class c be f(W, x) = W(c,:) x (b is included in W, and x is augmented with a 1).
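
A small numpy sketch of computing class scores with the bias trick for CIFAR-10-sized inputs (random weights and a random image, just to show the shapes; not from the slides):

```python
import numpy as np

C, n = 10, 32 * 32 * 3                 # 10 classes, 3072 pixel values per image
W = 0.001 * np.random.randn(C, n + 1)  # row c: weights for class c, last entry is b_c
x = np.random.rand(n)                  # one image stacked into a vector
x_aug = np.append(x, 1.0)              # bias trick: append a 1 so b is included in W

scores = W @ x_aug                     # f(W, x) = W(c,:) x, one score per class
predicted_class = np.argmax(scores)
```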

From 2 to C classes, alternative 1. One-vs-all classification: train a logistic classifier h_θ,c(x) for each class c to predict the probability that y=c. Classify a new sample x by picking the class c that maximizes h_θ,c(x).
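
A minimal sketch of the one-vs-all prediction step, assuming the C binary classifiers have already been trained (names are my own):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def one_vs_all_predict(Theta, X):
    """Theta: (C, d) matrix; row c holds the parameters of the binary logistic
    classifier for class c (trained with y=1 for class c and y=0 otherwise).
    Returns, for each row of X (n, d), the class c that maximizes h_theta,c(x)."""
    probs = sigmoid(X @ Theta.T)       # (n, C) probabilities h_theta,c(x)
    return np.argmax(probs, axis=1)
```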

From 2 to multiple classes: Softmax. The common generalization to multiple classes is the softmax classifier. We want to predict the class label y_i ∈ {1, ..., C} for sample X(i,:). y_i can take one of C discrete values, so it follows a multinomial probability distribution. This is derived from the assumption that the probability/score of class y_i = k is p(y_i = k | x_i, θ) = e^{θ_k^T x_i} / Σ_{l=1}^{C} e^{θ_l^T x_i}.

Softmax prediction/classification. Assign each sample to the class that maximizes the score p(y_i = k | x_i, θ) = e^{θ_k^T x_i} / Σ_{l=1}^{C} e^{θ_l^T x_i}.
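
A minimal numpy sketch of softmax prediction; subtracting the maximum score before exponentiating is a standard trick that leaves the probabilities unchanged but avoids overflow (the example scores are made up):

```python
import numpy as np

def softmax(scores):
    """Turn class scores theta_k^T x into probabilities p(y = k | x)."""
    shifted = scores - np.max(scores)   # numerical stability; does not change the result
    e = np.exp(shifted)
    return e / np.sum(e)

scores = np.array([2.0, 1.0, -1.0])     # theta_k^T x for C = 3 classes
p = softmax(scores)
print(p, p.argmax())                    # probabilities sum to 1; predict the argmax class
```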

Cross-entropy. From information theory, the cross-entropy between a true distribution p and an estimated distribution q is H(p, q) = -Σ_x p(x) log q(x). Softmax minimizes the cross-entropy between the estimated class probabilities and the true distribution (the distribution where all the mass is on the correct class).
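
A tiny worked example (made-up numbers): with a one-hot true distribution, the cross-entropy reduces to minus the log of the probability assigned to the correct class.

```python
import numpy as np

p = np.array([0.0, 1.0, 0.0])   # true distribution: all mass on the correct class
q = np.array([0.2, 0.7, 0.1])   # estimated class probabilities, e.g. from softmax

H = -np.sum(p * np.log(q))      # H(p, q) = -sum_x p(x) log q(x)
print(H, -np.log(q[1]))         # both ~0.357
```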

Softmax. From a training data set with m samples, we formulate the log-likelihood function that the model fits the data: l(θ) = Σ_{i=1}^{m} log p(y_i | X(i,:), θ). We can now find the θ that maximizes the likelihood using e.g. gradient ascent on the log-likelihood function, or we can minimize -l(θ) using gradient descent. More details on deriving softmax next week (Ole-Johan).

Cross-entropy loss function for softmax. The loss function for softmax, including regularization, is J(W) = -(1/n) Σ_{i=1}^{n} Σ_{c=1}^{C} I(y_i = c) log( e^{W(c,:) x_i} / Σ_{l=1}^{C} e^{W(l,:) x_i} ) + λ Σ_{c,j} W(c,j)^2, where x_i = X(i,:)^T contains the pixel values for image i, W(c,:) is the row of W for class c, and I(y_i = c) is the indicator function that is 1 if y_i = c and zero otherwise. See http://ufldl.stanford.edu/wiki/index.php/Softmax_Regression
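
A minimal vectorized numpy sketch of this loss (bias folded into W, labels coded 0...C-1; the function name and conventions are my own):

```python
import numpy as np

def softmax_loss(W, X, y, lam):
    """Regularized cross-entropy loss J(W) for a softmax classifier.
    W: (C, d) weights, X: (n, d) samples (bias trick applied), y: (n,) labels in {0,...,C-1}."""
    n = X.shape[0]
    scores = X @ W.T                               # (n, C) class scores W(c,:) x_i
    scores -= scores.max(axis=1, keepdims=True)    # numerical stability
    exp_scores = np.exp(scores)
    probs = exp_scores / exp_scores.sum(axis=1, keepdims=True)
    data_loss = -np.mean(np.log(probs[np.arange(n), y]))   # -(1/n) sum_i log p(y_i | x_i)
    reg_loss = lam * np.sum(W * W)                          # lambda * sum of squared weights
    return data_loss + reg_loss
```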

Softmax prediction example (worked example on the slide).

Gradients of the cross-entropy loss, including regularization. For the row of W corresponding to class c: ∇_{W(c,:)} J(W) = -(1/n) Σ_{i=1}^{n} x_i^T (I(y_i = c) - p(y_i = c | x_i, W)) + 2λ W(c,:).
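
A matching numpy sketch of the gradient (same conventions as the loss sketch above):

```python
import numpy as np

def softmax_grad(W, X, y, lam):
    """Gradient dJ/dW of the regularized softmax loss; same shape as W.
    Row c equals -(1/n) sum_i x_i (I(y_i = c) - p(y_i = c | x_i)) + 2 lam W(c,:)."""
    n = X.shape[0]
    scores = X @ W.T
    scores -= scores.max(axis=1, keepdims=True)   # stability; does not change the probabilities
    probs = np.exp(scores)
    probs /= probs.sum(axis=1, keepdims=True)     # p(y = c | x_i) for all i and c
    dscores = probs
    dscores[np.arange(n), y] -= 1.0               # p(y_i = c | x_i) - I(y_i = c)
    return dscores.T @ X / n + 2.0 * lam * W
```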

For those who want calculus: for computing the derivative of the softmax function, see all the details at https://eli.thegreenplace.net/2016/the-softmax-function-and-its-derivative/

Link to Gaussian classifiers. In INF 4300, we used a traditional Gaussian classifier. This type of model is called a generative model, where a specific distribution is assumed.

FROM INF 4300: Discriminant functions for the Gaussian density. When finding the class with the highest probability, these functions are equivalent: g_i(x) = P(ω_i | x), g_i(x) = p(x | ω_i) P(ω_i), and g_i(x) = ln p(x | ω_i) + ln P(ω_i). With a multivariate Gaussian we get g_i(x) = -(1/2)(x - μ_i)^T Σ_i^{-1} (x - μ_i) - (d/2) ln 2π - (1/2) ln |Σ_i| + ln P(ω_i). If we assume all classes have an equal diagonal covariance matrix σ^2 I, the discriminant function is a linear function of x: g_i(x) = (1/σ^2) μ_i^T x - (1/(2σ^2)) μ_i^T μ_i + ln P(ω_i).

Gaussian classifier vs. logistic regression. The Gaussian classifier with diagonal covariance and the logistic regression/softmax classifier will result in different linear decision boundaries. If the Gaussian assumption is correct, we expect the Gaussian classifier to have the lowest error rate. Logistic regression might be better if the data is not entirely Gaussian. NOTE: Softmax reduces to logistic regression if we have 2 classes.

Support Vector Machine classifiers. Another popular classifier is the Support Vector Machine (SVM), which can also be formulated in terms of loss functions. The following foils are for completeness; only a basic understanding of the SVM as a maximum-margin classifier is expected in this course.

Hyperplanes and margins (background for the SVM). The margin is 2/||w||. Require that all samples are correctly classified: w^T x_i + w_0 ≥ 1 for y_i = 1, and w^T x_i + w_0 ≤ -1 for y_i = -1. Goal: find the w and w_0 that maximize the margin.

Support Vector Machine loss. An SVM loss function can be formulated by requiring as large a margin as possible. This is generalized to multiple classes so that the SVM wants the correct class to have a score higher than the scores of the incorrect classes by some margin Δ. If s_j is the score for class j, the loss function for sample i is L_i = Σ_{j ≠ y_i} max(0, s_j - s_{y_i} + Δ). This is called the hinge loss.
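
A small numpy sketch of the multiclass hinge loss for a single sample (margin Δ = 1 and made-up scores):

```python
import numpy as np

def hinge_loss(scores, y, delta=1.0):
    """L_i = sum over j != y_i of max(0, s_j - s_{y_i} + delta)."""
    margins = np.maximum(0.0, scores - scores[y] + delta)
    margins[y] = 0.0                      # do not count the correct class itself
    return margins.sum()

scores = np.array([3.2, 5.1, -1.7])       # s_j for C = 3 classes
print(hinge_loss(scores, y=0))            # max(0, 5.1-3.2+1) + max(0, -1.7-3.2+1) = 2.9
```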

SVM and gradient descent. We can also solve the SVM using gradient descent; we will not cover this, but see http://www.robots.ox.ac.uk/~az/lectures/ml/lect.pdf


Next week: Feed-forward nets and learning by backpropagation. Reading material: http://cs231n.github.io/neural-networks-1/, http://cs231n.github.io/neural-networks-2/, http://cs231n.github.io/optimization-2/, and Deep Learning Chapter 6.