STATISTICAL BEHAVIOR AND CONSISTENCY OF CLASSIFICATION METHODS BASED ON CONVEX RISK MINIMIZATION

1 STATISTICAL BEHAVIOR AND CONSISTENCY OF CLASSIFICATION METHODS BASED ON CONVEX RISK MINIMIZATION Tong Zhang The Annals of Statistics, 2004

2 Outline: Motivation; Approximation error under convex risk minimization; Examples of approximation error analysis; Universal approximation and consistency.

3 Motivation Machine learning: predict $y$ for a given $x$. Relationship: $y \approx f(x)$, where $f$ is called the classification function. Criterion: minimize a problem-dependent loss $\ell(f(x), y)$. Usual assumption: the data $(X, Y)$ are drawn i.i.d. from a common but unknown distribution $D(X, Y)$. Explicit formula for the expected loss: $L(f(\cdot)) = E_{X,Y}\,\ell(f(X), Y)$.

4 Motivation This paper deals with the binary problem, namely $y \in \{\pm 1\}$. Prediction rule: $y = 1$ if $f(x) \ge 0$ and $y = -1$ if $f(x) < 0$. The classification error of $f$ is given by: $I(f(x), y) = 1$ if $f(x)y < 0$, $I(f(x), y) = 1$ if $f(x) = 0$ and $y = -1$, and $I(f(x), y) = 0$ otherwise. The empirical error is given by: $\frac{1}{n}\sum_{i=1}^{n} I(f(X_i), Y_i)$. (1)

5 Motivation However, the minimization of (1) can be NP-hard because the objective is not convex. We need a surrogate loss $\phi(\cdot, \cdot)$ which makes the computation easier. When $\phi(f, y) = \phi(yf)$ the method is called a large margin classifier. For example, AdaBoost employs the exponential loss $\phi(yf) = \exp(-yf)$ and the SVM employs the hinge loss $\phi(yf) = [1 - yf]_+$. Now instead of (1) we minimize the empirical risk: $\frac{1}{n}\sum_{i=1}^{n} \phi(f(X_i) Y_i)$. (2)

6 Motivation The minimization of (1) can be regarded as an approximation to the true classification error: $L(f(\cdot)) = E_{X,Y}\, I(f(X), Y)$. (3) And the minimization of (2) can be regarded as an approximation to the true risk: $Q(f(\cdot)) = E_{X,Y}\, \phi(f(X) Y)$. (4) This paper studies the impact of using $\phi$.

7 Motivation The paper considers the following five loss functions: Least Squares: $\phi(v) = (1 - v)^2$. Modified Least Squares: $\phi(v) = \max(1 - v, 0)^2$. SVM: $\phi(v) = [1 - v]_+$. Exponential: $\phi(v) = \exp(-v)$. Logistic Regression: $\phi(v) = \ln(1 + \exp(-v))$.
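
Note: to make these surrogates concrete, the following minimal Python sketch implements the five loss functions as functions of the margin $v = yf(x)$, together with the empirical risk (2); the function names are illustrative, not from the paper.

import numpy as np

def least_squares(v):
    # phi(v) = (1 - v)^2
    return (1.0 - v) ** 2

def modified_least_squares(v):
    # phi(v) = max(1 - v, 0)^2
    return np.maximum(1.0 - v, 0.0) ** 2

def hinge(v):
    # SVM loss: phi(v) = [1 - v]_+
    return np.maximum(1.0 - v, 0.0)

def exponential(v):
    # AdaBoost loss: phi(v) = exp(-v)
    return np.exp(-v)

def logistic(v):
    # phi(v) = ln(1 + exp(-v)), written in a numerically stable form
    return np.logaddexp(0.0, -v)

def empirical_risk(phi, f_values, y):
    # Empirical phi-risk (2): average surrogate loss over the sample.
    return np.mean(phi(f_values * y))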

8 Approximation error under convex risk minimization In this section the relationship between $L(f(\cdot))$ and $Q(f(\cdot))$ is studied. Rewrite (4) as an expectation of the conditional risk: $Q(f(\cdot)) = E_X[\eta(X)\phi(f(X)) + (1 - \eta(X))\phi(-f(X))]$, (5) where $\eta(x)$ is the conditional probability $P(Y = 1 \mid X = x)$.

9 Approximation error under convex risk minimization $L(f(\cdot))$ can be written as: $L(f(\cdot)) = E_{f(X) \ge 0}\,(1 - \eta(X)) + E_{f(X) < 0}\,\eta(X)$. (6) The following notation is very useful in this section: $Q(\eta, f) = \eta\phi(f) + (1 - \eta)\phi(-f)$. (7) Define $f_\phi^*(\eta): [0,1] \to \bar{R}$ (where $\bar{R}$ is the extended real line) as $f_\phi^*(\eta) = \arg\min_{f \in \bar{R}} Q(\eta, f)$, and $Q^*(\eta) = \inf_{f \in \bar{R}} Q(\eta, f) = Q(\eta, f_\phi^*(\eta))$.

10 Approximation error under convex risk minimization Define the excess risk as: $\Delta Q(\eta, f) = Q(\eta, f) - Q(\eta, f_\phi^*(\eta)) = Q(\eta, f) - Q^*(\eta)$, and $\Delta Q(f(\cdot)) = Q(f(\cdot)) - Q(f_\phi^*(\eta(\cdot))) = E_X\, \Delta Q(\eta(X), f(X))$.

11 Approximation error under convex risk minimization The above formulations are easy to calculate:
Least Squares: $f_\phi^*(\eta) = 2\eta - 1$; $Q^*(\eta) = 4\eta(1 - \eta)$.
Modified Least Squares: $f_\phi^*(\eta) = 2\eta - 1$; $Q^*(\eta) = 4\eta(1 - \eta)$.
SVM: $f_\phi^*(\eta) = \mathrm{sign}(2\eta - 1)$; $Q^*(\eta) = 1 - |2\eta - 1|$.
Exponential: $f_\phi^*(\eta) = \frac{1}{2}\ln\frac{\eta}{1 - \eta}$; $Q^*(\eta) = 2\sqrt{\eta(1 - \eta)}$.
Logistic Regression: $f_\phi^*(\eta) = \ln\frac{\eta}{1 - \eta}$; $Q^*(\eta) = -\eta\ln\eta - (1 - \eta)\ln(1 - \eta)$.
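
Note: as a sanity check on this table, $f_\phi^*(\eta)$ and $Q^*(\eta)$ can be recovered numerically by minimizing $Q(\eta, f) = \eta\phi(f) + (1 - \eta)\phi(-f)$ over $f$. A rough sketch, assuming SciPy is available (the bounded search interval is an assumption, needed because the true minimizer is infinite at $\eta \in \{0, 1\}$):

import numpy as np
from scipy.optimize import minimize_scalar

def conditional_risk(phi, eta, f):
    # Q(eta, f) = eta * phi(f) + (1 - eta) * phi(-f)
    return eta * phi(f) + (1.0 - eta) * phi(-f)

def minimize_conditional_risk(phi, eta, bound=20.0):
    # Numerically approximate f*_phi(eta) and Q*(eta) over [-bound, bound].
    res = minimize_scalar(lambda f: conditional_risk(phi, eta, f),
                          bounds=(-bound, bound), method="bounded")
    return res.x, res.fun

eta = 0.8
# Least squares: expect f* = 2*eta - 1 = 0.6 and Q* = 4*eta*(1 - eta) = 0.64.
print(minimize_conditional_risk(lambda v: (1.0 - v) ** 2, eta))
# Exponential: expect f* = 0.5*ln(eta/(1 - eta)) ~ 0.693 and Q* = 2*sqrt(eta*(1 - eta)) = 0.8.
print(minimize_conditional_risk(lambda v: np.exp(-v), eta))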

12 Approximation error under convex risk minimization Theorem. Assume $f_\phi^*(\eta) > 0$ when $\eta > 0.5$. Assume there exist $c > 0$ and $s \ge 1$ such that for all $\eta \in [0,1]$, $|0.5 - \eta|^s \le c^s\, \Delta Q(\eta, 0)$. Then for any measurable function $f(x)$, $L(f(\cdot)) - L^* \le 2c\, \Delta Q(f(\cdot))^{1/s}$, where $L^*$ is the optimal Bayes error, $L^* = L(2\eta(\cdot) - 1)$.
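
Note: plugging in the constants derived in the examples that follow gives concrete instances of this bound (a worked instantiation, not an additional result from the paper):
\[
\begin{aligned}
\text{SVM } (c = 0.5,\ s = 1):&\quad L(f(\cdot)) - L^* \le \Delta Q(f(\cdot)),\\
\text{Least squares } (c = 0.5,\ s = 2):&\quad L(f(\cdot)) - L^* \le \sqrt{\Delta Q(f(\cdot))},\\
\text{Logistic regression } (c = 2^{-0.5},\ s = 2):&\quad L(f(\cdot)) - L^* \le \sqrt{2\,\Delta Q(f(\cdot))}.
\end{aligned}
\]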

13 Approximation error under convex risk minimization Proof omitted. A corollary is also omitted in this presentation. The above theorem shows that if there is a functional relationship between $|0.5 - \eta|$ and $\Delta Q(\eta, 0)$, we can bound the excess classification error by a transformation of the excess risk. More specifically, if $\Delta Q(f(\cdot)) \to 0$, then $L(f(\cdot)) - L^* \to 0$. The quantity $\Delta Q(\eta, f)$ is the key in the proof of the theorem. A question arises: how do we compute it?

14 Approximation error under convex risk minimization Introduction to the Bregman divergence. For a convex function $\phi$, its Bregman divergence is defined as: $d_\phi(f_1, f_2) = \phi(f_2) - \phi(f_1) - \phi'(f_1)(f_2 - f_1)$. Here the prime denotes a subgradient. For a concave function $g$, the Bregman divergence is defined as $d_g(\eta_1, \eta_2) = d_{-g}(\eta_1, \eta_2)$. The Bregman divergence is always non-negative. A plot on the next slide illustrates the idea.
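
Note: a minimal numerical sketch of this definition for a differentiable convex $\phi$ (the derivative is approximated by a finite difference, which is only an approximation, not a true subgradient):

import numpy as np

def bregman_divergence(phi, f1, f2, h=1e-6):
    # d_phi(f1, f2) = phi(f2) - phi(f1) - phi'(f1) * (f2 - f1),
    # with phi'(f1) approximated by a central finite difference.
    dphi = (phi(f1 + h) - phi(f1 - h)) / (2.0 * h)
    return phi(f2) - phi(f1) - dphi * (f2 - f1)

# Example with the exponential loss: the divergence is non-negative and
# vanishes when the two arguments coincide.
phi = lambda v: np.exp(-v)
print(bregman_divergence(phi, 0.5, 2.0))   # positive
print(bregman_divergence(phi, 0.5, 0.5))   # ~ 0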

15 Approximation error under convex risk minimization Figure: Bregman divergence (the red line segment).

16 Approximation error under convex risk minimization Lemma. $Q^*(\eta)$ is a concave function of $\eta$. Theorem. If $\phi$ is differentiable, then the Bregman divergence is uniquely defined. Furthermore, $\Delta Q(\eta, p) = \eta\, d_\phi(f_\phi^*(\eta), p) + (1 - \eta)\, d_\phi(-f_\phi^*(\eta), -p)$. If $f_\phi^*$ is differentiable then $Q^*$ is also differentiable. Assume $p = f_\phi^*(\tilde\eta)$; then $\Delta Q(\eta, p) = d_{Q^*}(\tilde\eta, \eta)$.

17 Approximation error under convex risk minimization If $f_\phi^*$ is invertible, then the inverse $f_\phi^{*\,-1}(f(x))$ can serve as a conditional probability estimate.

18 Examples of approximation error analysis Least Squares. The Bregman divergence is $d_\phi(p_1, p_2) = (p_2 - p_1)^2$. $\Delta Q(\eta, p) = (2\eta - 1 - p)^2$. $|\eta - 0.5|^2 = 0.5^2\, \Delta Q(\eta, 0)$. Thus we can choose $c = 0.5$ and $s = 2$.

19 Examples of approximation error analysis Modified Least Squares. The Bregman divergence is $d_\phi(p_1, p_2) = (p_2 - p_1)^2 - \max(0, p_2 - 1)^2$. $\Delta Q(\eta, p) = (2\eta - 1 - p)^2 - \eta\max(0, p - 1)^2 - (1 - \eta)\min(0, p + 1)^2$. $|\eta - 0.5|^2 \le 0.5^2\, \Delta Q(\eta, 0)$. Thus we can choose $c = 0.5$ and $s = 2$.

20 Examples of approximation error analysis SVM. The Bregman divergence is harder to compute here than calculating $\Delta Q(\eta, p)$ directly. $\Delta Q(\eta, p) = \eta\max(0, 1 - p) + (1 - \eta)\max(0, 1 + p) - 1 + |2\eta - 1|$. $|\eta - 0.5| \le 0.5\, \Delta Q(\eta, 0)$. Thus we can choose $c = 0.5$ and $s = 1$.

21 Examples of approximation error analysis Exponential loss. The relevant Bregman divergence is that of the concave function $Q^*(\eta) = 2\sqrt{\eta(1 - \eta)}$. $\Delta Q(\eta, p) = (\tilde\eta - \eta)(e^p - e^{-p}) + 2\sqrt{\tilde\eta(1 - \tilde\eta)} - 2\sqrt{\eta(1 - \eta)}$, where $\tilde\eta = 1/(1 + e^{-2p})$. $|\eta - 0.5|^2 \le (2^{-0.5})^2\, \Delta Q(\eta, 0)$. Thus we can choose $c = 2^{-0.5}$ and $s = 2$.

22 Examples of approximation error analysis Logistic regression. The relevant Bregman divergence is that of the concave function $Q^*(\eta) = -\eta\ln\eta - (1 - \eta)\ln(1 - \eta)$. $\Delta Q(\eta, p) = \mathrm{KL}\big(\eta \,\big\|\, \frac{1}{1 + e^{-p}}\big)$, where KL is the KL-divergence. $|\eta - 0.5|^2 \le (2^{-0.5})^2\, \Delta Q(\eta, 0)$. Thus we can choose $c = 2^{-0.5}$ and $s = 2$.
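
Note: the claimed $(c, s)$ pairs can be verified numerically using $\Delta Q(\eta, 0) = \phi(0) - Q^*(\eta)$ together with the closed forms of $Q^*(\eta)$ from slide 11 (modified least squares coincides with least squares at $p = 0$ and is omitted); a small Python check:

import numpy as np

eta = np.linspace(1e-6, 1.0 - 1e-6, 9999)

# Delta Q(eta, 0) = phi(0) - Q*(eta), since Q(eta, 0) = phi(0) for every eta.
checks = {
    "least squares": (1.0 - 4.0 * eta * (1.0 - eta),              0.5,       2),
    "SVM":           (np.abs(2.0 * eta - 1.0),                    0.5,       1),
    "exponential":   (1.0 - 2.0 * np.sqrt(eta * (1.0 - eta)),     2 ** -0.5, 2),
    "logistic":      (np.log(2.0) + eta * np.log(eta)
                      + (1.0 - eta) * np.log(1.0 - eta),          2 ** -0.5, 2),
}

for name, (dq0, c, s) in checks.items():
    # Verify |0.5 - eta|^s <= c^s * Delta Q(eta, 0) on the grid.
    holds = np.all(np.abs(0.5 - eta) ** s <= c ** s * dq0 + 1e-12)
    print(name, bool(holds))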

23 Examples of approximation error analysis There is much further discussion in this section of the paper, in terms of comparing the conditional probability estimates produced by different loss functions and in terms of finding the best $c$ and $s$. Please refer to the paper; this is a relatively easy part to read.

24 Universal approximation and consistency Now we consider a function class $C$ to which the function $f$ belongs. If $\inf_{f \in C} \Delta Q(f(\cdot))$ is small, then any $f(x) \in C$ that (approximately) minimizes (4) achieves a classification error close to the optimal Bayes error. We call a function class $C$ universal with respect to a convex loss function $\phi$ if every measurable conditional probability function $\eta(x)$ has a distance of zero to $C$, i.e. $\inf_{f \in C} \Delta Q(f(\cdot)) = 0$.

25 Universal approximation and consistency Next we introduce a universal approximation theorem. First the following definitions are needed. Let $U \subset R^d$. Denote by $C(U)$ the Banach space of continuous functions $U \to R$ under the uniform-norm topology. Call a probability measure $\mu$ on $R^d$ regular if it is defined on the Borel sets of $R^d$. Say that a convex function $\phi$ has property A if: (i) $\phi$ is continuous and $Q^*$ is continuous; (ii) $\phi(p) < \phi(-p)$ for all $p > 0$; (iii) $f_\phi^*(\eta) \in (-\infty, +\infty)$ and is piecewise continuous on $(0, 1)$.

26 Universal approximation and consistency Lemma. Assume $0 \le \delta < 0.5$. Let $\eta \in [0,1]$ and $\eta_\delta = \min(\max(\eta, \delta), 1 - \delta)$. If $\phi$ has property A, then $Q(\eta, f_\phi^*(\eta_\delta)) \le Q^*(\eta_\delta)$. The plot of $\eta_\delta$ is shown on the next slide. In the proof of the next theorem, the lemma provides a way to avoid the difficult region where $f_\phi^*$ tends to infinity.

27 Universal approximation and consistency Figure: The function $\eta_\delta$ (plotted against $\eta$, with thresholds at $\delta$ and $1 - \delta$).

28 Universal approximation and consistency Theorem. Let $\phi$ be a convex function with property A. Consider a function class $C \subset C(U)$ defined on a Borel set $U \subset R^d$. If $C$ is dense in $C(U)$, then for any regular probability measure $\mu$ of $x \in R^d$ such that $\mu(U) = 1$, and any conditional probability $P(Y = 1 \mid X = x) = \eta(x)$, $\inf_{f \in C} \Delta Q(f(\cdot)) = 0$.

29 Universal approximation and consistency Consider the function class $R^d \to R$ consisting of linear combinations of functions of the form $h(\omega^T x + b)$, where $\omega \in R^d$, $b \in R$ and $h$ is a fixed function: $C_h = \{\sum_{i=1}^{k} \alpha_i h(\omega_i^T x + b_i) : \alpha_i \in R, \omega_i \in R^d, b_i \in R, k \in N\}$. In the neural network literature this function class is well studied and has been proved to be universal as long as the function $h$ is sigmoidal. The next theorem is more general.

30 Universal approximation and consistency Theorem. If $h$ is a non-polynomial continuous function, then $C_h$ is dense in $C(U)$ for all compact subsets $U$ of $R^d$. Introduction to the RKHS. Consider kernel functions of the form $K_h([x_1, b_1], [x_2, b_2]) = h(x_1^T x_2 + b_1 b_2)$, where $h$ can be expressed as a Taylor expansion with non-negative coefficients. $K_h$ is a positive definite kernel. Denote by $H_h$ the corresponding RKHS induced by $K_h$. Also denote $\bar f(x) = f([x, 1])$.
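
Note: a small sketch of such a kernel, taking $h(t) = \exp(t)$ purely as an illustrative choice of entire function with nonnegative Taylor coefficients:

import numpy as np

def K_h(x1, b1, x2, b2, h=np.exp):
    # K_h([x1, b1], [x2, b2]) = h(x1^T x2 + b1 * b2)
    return h(np.dot(x1, x2) + b1 * b2)

# Gram matrix on a toy sample, appending b = 1 to every point,
# matching the convention f_bar(x) = f([x, 1]) above.
X = np.random.default_rng(0).normal(size=(5, 3))
G = np.array([[K_h(xi, 1.0, xj, 1.0) for xj in X] for xi in X])
# A positive definite kernel yields a (numerically) positive semidefinite Gram matrix.
print(np.all(np.linalg.eigvalsh(G) > -1e-8))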

31 Universal approximation and consistency Consider the following estimation problem: $\hat f_n = \arg\inf_{f \in H_h} \big[\frac{1}{n}\sum_{i=1}^{n} \phi(Y_i \bar f(X_i)) + \frac{\lambda_n}{2}\|f\|^2\big]$. We have the following theorem, which says we can show consistency by estimating leave-one-out bounds. Theorem. Let $\hat f_n^{[k]}$ be the solution of the above formulation with the $k$-th datum removed from the training set; then $\|\hat f_n(\cdot) - \hat f_n^{[k]}(\cdot)\| \le \frac{2}{\lambda_n n}\, |\phi'(\hat f_n(X_k) Y_k)|\, h(X_k^T X_k + 1)^{1/2}$, where $\phi'$ denotes a subgradient of $\phi$.
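
Note: a rough sketch of the estimator $\hat f_n$ for the logistic loss, using the representer theorem to write $f(\cdot) = \sum_j \alpha_j K_h([X_j, 1], \cdot)$ and plain gradient descent on $\alpha$; the choices $h = \exp$, the data scaling, the step-size rule and the iteration count are illustrative assumptions, not prescriptions from the paper.

import numpy as np

def objective(alpha, G, y, lam):
    # (1/n) sum_i log(1 + exp(-y_i f(X_i))) + (lam/2) ||f||^2,
    # where f(X_i) = (G alpha)_i and ||f||^2 = alpha^T G alpha is the RKHS norm.
    margins = y * (G @ alpha)
    return np.mean(np.logaddexp(0.0, -margins)) + 0.5 * lam * alpha @ G @ alpha

def fit_regularized_kernel_logistic(X, y, lam, h=np.exp, steps=2000):
    # Gram matrix G_ij = h(X_i^T X_j + 1) = K_h([X_i, 1], [X_j, 1]).
    G = h(X @ X.T + 1.0)
    n = len(y)
    # Conservative step size from a Lipschitz bound on the objective's gradient.
    gnorm = np.linalg.norm(G, 2)
    lr = 1.0 / (gnorm * (0.25 * gnorm / n + lam))
    alpha = np.zeros(n)
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(y * (G @ alpha)))   # equals -phi'(y_i f(X_i)) for the logistic loss
        grad = -(G @ (y * p)) / n + lam * (G @ alpha)
        alpha -= lr * grad
    return alpha, G

# Toy usage: the regularized empirical risk decreases from its value at alpha = 0.
rng = np.random.default_rng(0)
X = 0.5 * rng.normal(size=(40, 2))
y = np.sign(X[:, 0] + 1e-3)
alpha, G = fit_regularized_kernel_logistic(X, y, lam=0.1)
print(objective(np.zeros(40), G, y, 0.1), "->", objective(alpha, G, y, 0.1))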

32 Universal approximation and consistency For simplicity, from now on we assume that $P(h(X^T X + 1) \le M^2) = 1$ for a constant $M$. Theorem. Under the assumptions of the previous theorem, and assuming that $\phi$ is non-negative and $P(h(X^T X + 1) \le M^2) = 1$, for all $k$ the expected leave-one-out risk can be bounded as $E\, Q(\hat f_n^{[k]}(\cdot)) \le \inf_{f \in H_h} \big[Q(f(\cdot)) + \frac{\lambda_n}{2}\|f\|^2\big] + \frac{2 M^2 {\phi_\infty'}^2}{\lambda_n n}$, where the expectation is with respect to the training samples $(X_1, Y_1), \dots, (X_n, Y_n)$ and $\phi_\infty' = \sup\{|\phi'(z)| : |z| \le \sqrt{2\phi(0)/\lambda_n}\, M\}$.

33 Universal approximation and consistency Here is a list of $\phi_\infty'$ bounds for the loss functions considered in this paper. Least Squares: $\phi_\infty' \le \sqrt{8/\lambda_n}\, M + 2$. Modified Least Squares: $\phi_\infty' \le \sqrt{8/\lambda_n}\, M + 2$. SVM: $\phi_\infty' \le 1$. Exponential: $\phi_\infty' \le \exp(\sqrt{2/\lambda_n}\, M)$. Logistic Regression: $\phi_\infty' \le 1$. We are now in a position to introduce the last theorem of the paper.

34 Universal approximation and consistency Theorem. Let $h$ be an entire function with nonnegative Taylor coefficients. Assume we choose $\lambda_n$ such that, for least squares, modified least squares, SVM or logistic regression, $\lambda_n \to 0$ and $\lambda_n n \to \infty$; or, for the exponential loss, $\lambda_n \to 0$ and $\lambda_n \log^2 n \to \infty$. Then for any distribution $D$ with a regular input probability measure which is bounded almost everywhere in $R^d$, we have $\lim_{n \to \infty} E\, Q(\hat f_n(\cdot)) = \inf_{f \in H_h} Q(f(\cdot))$. Moreover, if $h$ is not a polynomial, then $\lim_{n \to \infty} E\, \Delta Q(\hat f_n(\cdot)) = 0$.

35 End of the presentation Thank you. Questions?
