Statistical Learning Theory and the C-Loss cost function

Statistical Learning Theory and the C-Loss cost function. Jose Principe, Ph.D., Distinguished Professor, ECE, BME, Computational NeuroEngineering Laboratory, principe@cnel.ufl.edu

Statistical Learning Theory In the methodology of science there are two primary ways to create undisputed principles (knowledge): Deduction starts with a hypothesis that must be scientifically validated to arrive at a general principle, which can then be applied to many different specific cases. Induction starts from specific cases to reach universal principles, which is much harder than deduction. Learning from samples uses an inductive principle and so must be checked for generalization.

Statistical Learning Theory Statistical Learning Theory uses mathematics to study induction. The theory has lately received a lot of attention and major advances have been achieved. The learning setting first needs to be properly defined. Here we will only treat the case of classification.

Empirical Risk Minimization (ERM) principle Let us consider a learning machine that maps an input x to an output f(x) and compares it with the desired response d. Here x, d are real random variables with joint distribution P(x, d), and f(x) is a function of some parameters w, i.e. f(x, w).

Empirical Risk Minimization (ERM) principle How can we find the best possible learning machine that generalizes to unseen data from the same distribution? Define the Risk functional as $R(w) = \int L(f(x,w), d)\, dP(x,d) = E_{x,d}[L(f(x,w), d)]$, where $L(\cdot)$ is called the Loss function, and minimize it w.r.t. w, achieving the best possible loss. But we cannot do this integration because the joint distribution is normally not known in functional form.

Empirical Risk Minimization (ERM) principle The only hope is to substitute the expected value by the empirical mean to yield $R_E(w) = \frac{1}{N}\sum_{i=1}^{N} L(f(x_i, w), d_i)$. Glivenko and Cantelli proved that the empirical risk converges to the true Risk functional, and Kolmogorov proved that the convergence rate is exponential. So there is hope to achieve inductive machines. What should the best loss function be for classification?
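
As a quick illustration (not from the lecture), the empirical risk is just a sample average of the chosen loss. The sketch below assumes a hypothetical tanh machine, a squared loss, and made-up data, only to make the formula concrete:

```python
import numpy as np

def empirical_risk(loss, f, w, x, d):
    """R_E(w) = (1/N) * sum_i loss(f(x_i, w), d_i)."""
    outputs = np.array([f(xi, w) for xi in x])
    return np.mean(loss(outputs, d))

# Hypothetical machine and loss, chosen for illustration.
f = lambda xi, w: np.tanh(np.dot(w, xi))         # output in [-1, 1]
squared_loss = lambda out, d: (d - out) ** 2

rng = np.random.default_rng(0)
x = rng.normal(size=(100, 2))                    # 100 samples, 2 features
d = np.sign(x[:, 0] + x[:, 1])                   # labels in {-1, +1}
print(empirical_risk(squared_loss, f, np.array([0.5, 0.5]), x, d))
```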

Empirical Risk Minimization (ERM) principle What should the best loss function be for classification? We know from Bayes theory that the classification error is the integral over the tails of the likelihoods, but this is very difficult to compute in practice. In confusion tables, what we do is count errors, so this seems to be a good approach. Therefore the ideal Loss is the 0/1 loss, $l_{0/1}(f(x,w), d) = 1$ if $d\, f(x,w) \le 0$ and $0$ otherwise, which makes the Risk $R(w) = P(D \ne \mathrm{sign}(f(X,w))) = E[\, l_{0/1}(f(X,w), D)\,]$.
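
A minimal sketch of the 0/1 loss and its empirical risk, which is exactly the error counting done in a confusion table; the labels and outputs below are toy values, not from the lecture:

```python
import numpy as np

def zero_one_loss(f_out, d):
    """l_{0/1}(f, d): 1 when sign(f) disagrees with the label d in {-1, +1}."""
    return (d * f_out <= 0).astype(float)

d     = np.array([ 1.0, -1.0,  1.0, 1.0, -1.0])
f_out = np.array([ 0.8, -0.3, -0.1, 0.4,  0.2])
print(zero_one_loss(f_out, d).mean())   # empirical error rate: 0.4
```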

Empirical Risk Minimization (ERM) principle Again, the problem is that the $l_{0/1}$ loss is very difficult to work with in practice. The most widely used family of losses is the polynomial family, which takes the form $R(w) = E[(D - f(X,w))^p]$. Let us define the error as $e = d - f(x,w)$. If $d \in \{-1, 1\}$ and the learning machine has an output in $[-1, 1]$, the error will be in $[-2, 2]$. Errors with $|e| > 1$ correspond to wrong class assignments. Sometimes we define the margin as $\alpha = d\, f(x,w)$. The margin is therefore in $[-1, 1]$, and for $\alpha > 0$ we have correct class assignments.
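
For reference, the error and margin defined above are related by a short derivation that uses only $d^2 = 1$ (not spelled out in the lecture, but used again when the C-loss is written in both forms):

```latex
% Relation between error e and margin \alpha for labels d \in \{-1, +1\}
\[
\begin{aligned}
  e &= d - f(x,w), \qquad \alpha = d\, f(x,w) = d\,(d - e) = 1 - d\,e,\\
  (1-\alpha)^2 &= (d\,e)^2 = e^2 \qquad \text{since } d^2 = 1 .
\end{aligned}
\]
```

So a loss written in terms of the margin, such as the C-loss later, can equivalently be written in terms of the error.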

Empirical Risk Minimization (ERM) principle In the space of the margin, the $l_{0/1}$ loss and the $l_2$ (square) loss look as in the figure. The hinge loss is an $l_1$ norm of the error. Notice that the square loss is convex, the hinge is a limiting case (still convex but not smooth), and $l_{0/1}$ is definitely non-convex.

Empirical Risk Minimization (ERM) principle It turns out that the quadratic loss is easy to work with for the minimization (we can use gradient descent). The hinge loss requires quadratic programming in the minimization, but with the current availability of fast computers and optimization software it is becoming practical. The $l_{0/1}$ loss is still impractical to work with. The downside of the quadratic loss (our well-known MSE) is that machines trained with it are unable to control generalization, so they do not lead to useful inductive machines. The user must find additional ways to guarantee generalization (as we have seen: early stopping, weight decay).
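
As a sketch of the "easy to work with" claim, gradient descent on the quadratic empirical risk for a linear machine takes only a few lines; the data, learning rate, and epoch count below are arbitrary choices, not from the lecture:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(200, 2))
d = np.sign(x @ np.array([1.0, -1.0]))      # labels in {-1, +1}

w, lr = np.zeros(2), 0.1
for _ in range(100):
    e = d - x @ w                           # errors d - f(x, w) with f(x, w) = w . x
    grad = -2.0 * x.T @ e / len(d)          # gradient of the mean squared error
    w -= lr * grad
print(w, np.mean(e ** 2))
```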

Correntropy: A new generalized similarity measure Define the correntropy of two random variables X, Y as $v(X,Y) = E_{XY}[\kappa_\sigma(X - Y)]$, by analogy with the correlation function, where $\kappa_\sigma$ is the Gaussian kernel. The name correntropy comes from the fact that the average over the dimensions of the r.v. is the information potential (the argument of Renyi's entropy). We can readily estimate correntropy with the empirical mean: $\hat{v}(x,y) = \frac{1}{N}\sum_{i=1}^{N} \kappa_\sigma(x(i) - y(i))$.
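
A minimal sketch of the empirical correntropy estimator, assuming the normalized Gaussian kernel; the kernel size and the data are arbitrary, and the outlier is included only to contrast correntropy with the MSE discussed next:

```python
import numpy as np

def correntropy(x, y, sigma=1.0):
    """v_hat(x, y) = (1/N) * sum_i kappa_sigma(x(i) - y(i))."""
    e = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    kappa = np.exp(-e ** 2 / (2 * sigma ** 2)) / (sigma * np.sqrt(2 * np.pi))
    return kappa.mean()

x = np.array([1.0, 2.0, 3.0, 10.0])   # last pair is a gross outlier
y = np.array([1.1, 1.9, 3.2,  0.0])
print(correntropy(x, y))               # barely affected by the outlier
print(np.mean((x - y) ** 2))           # MSE, dominated by the outlier
```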

Correntropy: A new generalized similarity measure Some properties of correntropy: It has a maximum at the origin ($1/(\sigma\sqrt{2\pi})$). It is a symmetric positive function. Its mean value is the argument of the log of quadratic Renyi's entropy of X - Y (hence its name). Correntropy is sensitive to second and higher-order moments of the data (correlation only measures second-order statistics): $v(X,Y) = \frac{1}{\sqrt{2\pi}\,\sigma}\sum_{n=0}^{\infty} \frac{(-1)^n}{2^n n!\, \sigma^{2n}}\, E[(X - Y)^{2n}]$. Correntropy estimates the probability density of the event X = Y.

Correntropy: A new generalized similarity measure Correntropy as a cost function versus MSE: $MSE(X,Y) = E[(X-Y)^2] = \iint (x-y)^2 f_{XY}(x,y)\, dx\, dy = \int e^2 f_E(e)\, de$, while $v(X,Y) = E[\kappa_\sigma(X-Y)] = \iint \kappa_\sigma(x-y) f_{XY}(x,y)\, dx\, dy = \int \kappa_\sigma(e) f_E(e)\, de$, where $e = x - y$ and $f_E$ is the density of the error.

Correntropy: A new generalized similarity measure Correntropy induces a metric in the sample space (the Correntropy Induced Metric, CIM), defined by $CIM(X,Y) = (v(0,0) - v(x,y))^{1/2}$. Correntropy uses different $L_p$ norms depending on the actual sample distances: close to the origin it behaves like an $L_2$ norm, while far away it saturates. This can be very useful for outlier control and also to improve generalization.
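
A sketch of the CIM taken directly from the definition above, with an arbitrary kernel size; the prints contrast a small, a moderate, and a very large error:

```python
import numpy as np

def cim(x, y, sigma=1.0):
    """CIM(X, Y) = (v(0, 0) - v_hat(x, y)) ** 0.5 with a normalized Gaussian kernel."""
    e = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    kappa = np.exp(-e ** 2 / (2 * sigma ** 2)) / (sigma * np.sqrt(2 * np.pi))
    v00 = 1.0 / (sigma * np.sqrt(2 * np.pi))   # kernel value at zero distance
    return np.sqrt(v00 - kappa.mean())

# Near the origin the CIM grows with the error like an L2 distance;
# far away it saturates, which limits the influence of outliers.
print(cim([0.0], [0.1]), cim([0.0], [1.0]), cim([0.0], [10.0]))
```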

The Correntropy Loss (C-loss) Function We define the C-loss function as $l_C(d, f(x,w)) = \beta\,[1 - \kappa_\sigma(d - f(x,w))]$, or in terms of the classification margin $\alpha$, $l_C(\alpha) = \beta\,[1 - \kappa_\sigma(1 - \alpha)]$, where $\beta$ is a positive scaling constant that guarantees $l_C(\alpha = 0) = 1$. The expected risk of the C-loss function is $R_C(w) = \beta\,(1 - E[\kappa_\sigma(d - f(x,w))]) = \beta\,(1 - v(D, f(X,w)))$. Clearly, minimizing the C-Risk is equivalent to maximizing the similarity, in the correntropy sense, between the true label and the machine output.
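
A minimal sketch of the C-loss as defined above. It assumes the unnormalized Gaussian kernel, so that the loss is zero at zero error, and picks $\beta$ so that the loss equals 1 at margin $\alpha = 0$; the labels and outputs are toy values:

```python
import numpy as np

def c_loss(d, f_out, sigma=0.5):
    """l_C(d, f) = beta * [1 - exp(-(d - f)^2 / (2 sigma^2))],
    with beta = 1 / (1 - exp(-1 / (2 sigma^2))) so that l_C = 1 at zero margin."""
    e = np.asarray(d, dtype=float) - np.asarray(f_out, dtype=float)
    beta = 1.0 / (1.0 - np.exp(-1.0 / (2 * sigma ** 2)))
    return beta * (1.0 - np.exp(-e ** 2 / (2 * sigma ** 2)))

d     = np.array([1.0,  1.0, -1.0])
f_out = np.array([0.9, -0.5, -0.8])    # confident correct, wrong, correct
print(c_loss(d, f_out))
```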

The Correntropy Loss (C-loss) Function The C-loss for several values of the kernel size $\sigma$. The C-loss is non-convex, but it approximates the $l_{0/1}$ loss better and it is Fisher consistent.

The Correntropy Loss (C-loss) Function Training with the C-loss. We can use backpropagation with a minor modification: the injected error is now the partial derivative of the C-Risk w.r.t. the error, $\frac{\partial R_C(e)}{\partial e}$, or per sample $\frac{\partial l_C(e_n)}{\partial e_n} = \frac{\beta}{\sigma^2}\, e_n \exp\!\left(-\frac{e_n^2}{2\sigma^2}\right)$ with $\beta = \left[1 - \exp\!\left(-\frac{1}{2\sigma^2}\right)\right]^{-1}$. All the rest is the same!
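
A sketch of that injected error next to the MSE one (which is simply 2e per sample), to show how large errors are attenuated; the kernel size and sample errors are illustrative:

```python
import numpy as np

def c_loss_injected_error(e, sigma=0.5):
    """d l_C / d e = (beta / sigma^2) * e * exp(-e^2 / (2 sigma^2))."""
    beta = 1.0 / (1.0 - np.exp(-1.0 / (2 * sigma ** 2)))
    return (beta / sigma ** 2) * e * np.exp(-e ** 2 / (2 * sigma ** 2))

for e in (0.1, 0.5, 1.0, 2.0):
    print(e, c_loss_injected_error(e), 2 * e)   # 2*e is the MSE injected error
```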

The Correntropy Loss (C-loss) Function Automatic selection of the kernel size. An unexpected advantage of the C-loss is that it allows for an automatic selection of the kernel size. We select $\sigma = 0.5$ to give maximal importance to the correctly classified samples.

The Correntropy Loss (C-loss) Function How to train with the C-loss. The only disadvantage of the C-loss is that the performance surface is non-convex and full of local minima. I suggest first training with MSE for 10-20 epochs, and then switching to the C-loss. Alternatively, one can use the composite cost function $R(w) = (1 - \lambda_N)\, R_2(w) + \lambda_N\, R_C(w)$, where $N$ is the number of training iterations and the schedule $\lambda_N$ is set by the user.
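
A sketch of the composite cost evaluated on a batch of errors; the linear ramp used for $\lambda_N$ is a hypothetical choice, since the lecture leaves the schedule to the user:

```python
import numpy as np

def composite_risk(e, n_iter, lam_step=0.05, sigma=0.5):
    """R(w) = (1 - lambda_N) * R_2(w) + lambda_N * R_C(w)."""
    lam = min(1.0, lam_step * n_iter)               # ramps from MSE toward the C-loss
    beta = 1.0 / (1.0 - np.exp(-1.0 / (2 * sigma ** 2)))
    r2 = np.mean(e ** 2)                            # MSE term
    rc = np.mean(beta * (1.0 - np.exp(-e ** 2 / (2 * sigma ** 2))))  # C-loss term
    return (1.0 - lam) * r2 + lam * rc

e = np.array([0.05, -0.2, 1.5])                     # errors d - f(x, w) on a batch
for n in (1, 10, 20):
    print(n, composite_risk(e, n))
```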

The Correntropy Loss (C-loss) Function Synthetic example: two Gaussian classes. Notice how smooth the separation surface is.

The Correntropy Loss (C-loss) Function Synthetic example: a more difficult case. Notice how smooth the separation surface is.

The Correntropy Loss (C-loss) Function Wisconsin Breast Cancer Data Set. The C-loss does NOT overtrain, so it generalizes much better than MSE.

The Correntropy Loss (C-loss) Function Pima Indians Data Set. The C-loss does NOT overtrain, so it generalizes much better than MSE.

The Correntropy Loss (C-loss) Function But the point at which training switches from MSE to the C-loss affects performance.

Conclusions The C-loss has many advantages for classification: It leads to better generalization, as samples near the boundary have less impact on training (the major cause of overtraining with the MSE). It is easy to implement: one can simply switch to it after training with MSE. Its computational complexity is the same as MSE with backpropagation. The open question is how to search the performance surface: the point of switching between MSE and the C-loss affects the final classification accuracy.