STATISTICAL BEHAVIOR AND CONSISTENCY OF CLASSIFICATION METHODS BASED ON CONVEX RISK MINIMIZATION
1 STATISTICAL BEHAVIOR AND CONSISTENCY OF CLASSIFICATION METHODS BASED ON CONVEX RISK MINIMIZATION Tong Zhang The Annals of Statistics, 2004
2 Outline: Motivation; Approximation error under convex risk minimization; Examples of approximation error analysis; Universal approximation and consistency.
3 Motivation Machine learning: predict $y$ for a given $x$. Relationship: a functional $y \approx f(x)$, where $f$ is called the classification function. Criterion: minimize a problem-dependent loss $\ell(f(x), y)$. Usual assumption: the data $(X, Y)$ are drawn i.i.d. from a common but unknown distribution $D(X, Y)$. Explicit formula for the expected loss: $L(f(\cdot)) = E_{X,Y}\, \ell(f(X), Y)$.
4 Motivation This paper deals with the binary problem, namely $y \in \{\pm 1\}$. Prediction rule: $y = 1$ if $f(x) \ge 0$ and $y = -1$ if $f(x) < 0$. The classification error of $f$ is given by $I(f(x), y) = 1$ if $f(x) y < 0$, or if $f(x) = 0$ and $y = -1$; and $I(f(x), y) = 0$ otherwise. The empirical error is given by $\frac{1}{n}\sum_{i=1}^{n} I(f(X_i), Y_i)$. (1)
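As a concrete illustration (my own sketch, not code from the paper), the 0-1 error and its empirical version (1) can be written as follows; the sign convention at $f(x) = 0$ follows the slide above.

```python
import numpy as np

def zero_one_loss(fx, y):
    """0-1 classification error I(f(x), y): predict +1 when f(x) >= 0, else -1."""
    pred = np.where(np.asarray(fx) >= 0, 1, -1)
    return (pred != np.asarray(y)).astype(float)

def empirical_error(f, X, Y):
    """Empirical classification error (1): the average 0-1 loss over the sample."""
    fx = np.array([f(x) for x in X])
    return zero_one_loss(fx, Y).mean()

# hypothetical toy usage with a linear scoring function
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
Y = np.where(X[:, 0] + 0.5 * X[:, 1] > 0, 1, -1)
print(empirical_error(lambda x: x[0] + 0.5 * x[1], X, Y))  # 0.0 on this separable toy data
```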
5 Motivation However, the minimization of (1) can be NP-hard because the objective is not convex. We therefore use a surrogate loss $\phi(\cdot,\cdot)$ which makes the computation easier. When $\phi(f, y) = \phi(yf)$, the method is called a large-margin classifier. For example, AdaBoost employs the exponential loss $\phi(v) = \exp(-v)$ and the SVM employs the hinge loss $\phi(v) = [1 - v]_+$. Now instead of (1) we minimize the empirical risk $\frac{1}{n}\sum_{i=1}^{n} \phi(f(X_i) Y_i)$. (2)
6 Motivation The minimization of (1) can be regarded as an approximation to minimizing the true classification error $L(f(\cdot)) = E_{X,Y}\, I(f(X), Y)$, (3) and the minimization of (2) can be regarded as an approximation to minimizing the true risk $Q(f(\cdot)) = E_{X,Y}\, \phi(f(X) Y)$. (4) In this paper the author studies the impact of using $\phi$.
7 Motivation In this paper the author is interested in the following five loss functions: Least Squares: $\phi(v) = (1 - v)^2$. Modified Least Squares: $\phi(v) = \max(1 - v, 0)^2$. SVM: $\phi(v) = [1 - v]_+$. Exponential: $\phi(v) = \exp(-v)$. Logistic Regression: $\phi(v) = \ln(1 + \exp(-v))$.
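To make the surrogate losses concrete, here is a small Python sketch (my own illustration, not code from the paper) of the five losses as functions of the margin $v = y f(x)$, together with the empirical risk (2).

```python
import numpy as np

# The five surrogate losses, written as functions of the margin v = y * f(x).
losses = {
    "least_squares":          lambda v: (1.0 - v) ** 2,
    "modified_least_squares": lambda v: np.maximum(1.0 - v, 0.0) ** 2,
    "svm_hinge":              lambda v: np.maximum(1.0 - v, 0.0),
    "exponential":            lambda v: np.exp(-v),
    "logistic":               lambda v: np.log1p(np.exp(-v)),
}

def empirical_risk(phi, f, X, Y):
    """Empirical phi-risk (2): the average surrogate loss over the sample."""
    margins = np.array([y * f(x) for x, y in zip(X, Y)])
    return phi(margins).mean()
```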
8 Approximation error under convex risk minimization In this section the relationship between $L(f(\cdot))$ and $Q(f(\cdot))$ is studied. Rewrite (4) using conditional expectation: $Q(f(\cdot)) = E_X[\eta(X)\phi(f(X)) + (1 - \eta(X))\phi(-f(X))]$, (5) where $\eta(x)$ is the conditional probability $P(Y = 1 \mid X = x)$.
9 Approximation error under convex risk minimization $L(f(\cdot))$ can be written as $L(f(\cdot)) = E_X[(1 - \eta(X))\,1\{f(X) \ge 0\} + \eta(X)\,1\{f(X) < 0\}]$. (6) The following notation is very useful in this section: $Q(\eta, f) = \eta\phi(f) + (1 - \eta)\phi(-f)$. (7) Define $f_\phi^*(\eta) : [0,1] \to \bar{\mathbb{R}}$ (where $\bar{\mathbb{R}}$ is the extended real line) as $f_\phi^*(\eta) = \arg\min_{f \in \bar{\mathbb{R}}} Q(\eta, f)$, and $Q^*(\eta) = \inf_{f \in \bar{\mathbb{R}}} Q(\eta, f) = Q(\eta, f_\phi^*(\eta))$.
10 Approximation error under convex risk minimization Define the excess risk as $\Delta Q(\eta, f) = Q(\eta, f) - Q(\eta, f_\phi^*(\eta)) = Q(\eta, f) - Q^*(\eta)$, and $\Delta Q(f(\cdot)) = Q(f(\cdot)) - Q(f_\phi^*(\eta(\cdot))) = E_X\, \Delta Q(\eta(X), f(X))$.
11 Approximation error under convex risk minimization The above formulations are easy to calculate: Least Squares: $f_\phi^*(\eta) = 2\eta - 1$; $Q^*(\eta) = 4\eta(1 - \eta)$. Modified Least Squares: $f_\phi^*(\eta) = 2\eta - 1$; $Q^*(\eta) = 4\eta(1 - \eta)$. SVM: $f_\phi^*(\eta) = \mathrm{sign}(2\eta - 1)$; $Q^*(\eta) = 1 - |2\eta - 1|$. Exponential: $f_\phi^*(\eta) = \frac{1}{2}\ln\frac{\eta}{1 - \eta}$; $Q^*(\eta) = 2\sqrt{\eta(1 - \eta)}$. Logistic Regression: $f_\phi^*(\eta) = \ln\frac{\eta}{1 - \eta}$; $Q^*(\eta) = -\eta\ln\eta - (1 - \eta)\ln(1 - \eta)$.
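These closed forms can be checked numerically. Below is a small sketch of mine (a crude grid search stands in for the exact minimization over the real line) that minimizes $Q(\eta, f) = \eta\phi(f) + (1-\eta)\phi(-f)$ and compares against the formulas above for the logistic loss.

```python
import numpy as np

def Q(eta, f, phi):
    """Conditional phi-risk Q(eta, f) = eta*phi(f) + (1-eta)*phi(-f), as in (7)."""
    return eta * phi(f) + (1.0 - eta) * phi(-f)

phi_logistic = lambda v: np.log1p(np.exp(-v))
eta = 0.8
grid = np.linspace(-10, 10, 200001)          # crude stand-in for the argmin over the real line
vals = Q(eta, grid, phi_logistic)

# closed forms from the slide for logistic regression
f_star_exact = np.log(eta / (1 - eta))
Q_star_exact = -eta * np.log(eta) - (1 - eta) * np.log(1 - eta)
print(grid[np.argmin(vals)], f_star_exact)   # both approximately 1.386
print(vals.min(), Q_star_exact)              # both approximately 0.5004
```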
12 Approximation error under convex risk minimization Theorem. Assume $f_\phi^*(\eta) > 0$ when $\eta > 0.5$. Assume there exist $c > 0$ and $s \ge 1$ such that for all $\eta \in [0,1]$, $|0.5 - \eta|^s \le c^s\, \Delta Q(\eta, 0)$. Then for any measurable function $f(x)$, $L(f(\cdot)) - L^* \le 2c\, \Delta Q(f(\cdot))^{1/s}$, where $L^*$ is the optimal Bayes error, $L^* = L(2\eta(\cdot) - 1)$.
13 Approximation error under convex risk minimization Proof omitted. A corollary is omitted in this presentation. The above theorem shows that if there is a functional relationship between $|0.5 - \eta|$ and $\Delta Q(\eta, 0)$, we can bound the classification error by a transformation of the excess risk. More specifically, if $\Delta Q(f(\cdot)) \to 0$, then $L(f(\cdot)) \to L^*$. The quantity $\Delta Q(\eta, f)$ is the key in the proof of the theorem. A question arises: how do we compute it?
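As a worked instance (a small addition of mine, plugging in the $c$ and $s$ values derived for specific losses later in the presentation), the theorem specializes as follows.

```latex
% Hinge loss (SVM): c = 1/2, s = 1:
L(f(\cdot)) - L^{*} \;\le\; 2\cdot\tfrac{1}{2}\,\Delta Q(f(\cdot)) \;=\; \Delta Q(f(\cdot)).
% Exponential or logistic loss: c = 2^{-1/2}, s = 2:
L(f(\cdot)) - L^{*} \;\le\; 2\cdot 2^{-1/2}\,\Delta Q(f(\cdot))^{1/2} \;=\; \sqrt{2\,\Delta Q(f(\cdot))}.
```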
14 Approximation error under convex risk minimization Introduction to the Bregman divergence. For a convex function $\phi$, its Bregman divergence is defined as $d_\phi(f_1, f_2) = \phi(f_2) - \phi(f_1) - \phi'(f_1)(f_2 - f_1)$. Here the prime denotes a subgradient. For a concave function $g$, the Bregman divergence is defined as $d_g(\eta_1, \eta_2) = d_{-g}(\eta_1, \eta_2)$. The Bregman divergence is always non-negative. A plot in the next slide illustrates the idea.
15 Approximation error under convex risk minimization Figure: Bregman divergence (the red line segment).
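A minimal numerical sketch of the Bregman divergence (my own illustration): for a differentiable convex $\phi$ it is the gap between $\phi(f_2)$ and the tangent line at $f_1$, and it is non-negative.

```python
import numpy as np

def bregman(phi, dphi, f1, f2):
    """Bregman divergence d_phi(f1, f2) = phi(f2) - phi(f1) - phi'(f1)*(f2 - f1)."""
    return phi(f2) - phi(f1) - dphi(f1) * (f2 - f1)

# exponential loss phi(v) = exp(-v), with derivative phi'(v) = -exp(-v)
phi  = lambda v: np.exp(-v)
dphi = lambda v: -np.exp(-v)
print(bregman(phi, dphi, 0.5, 2.0))   # > 0, as convexity guarantees
print(bregman(phi, dphi, 1.0, 1.0))   # 0.0 when f1 == f2
```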
16 Approximation error under convex risk minimization Lemma. $Q^*(\eta)$ is a concave function of $\eta$. Theorem. If $\phi$ is differentiable, then the Bregman divergence is uniquely defined. Furthermore, $\Delta Q(\eta, p) = \eta\, d_\phi(f_\phi^*(\eta), p) + (1 - \eta)\, d_\phi(-f_\phi^*(\eta), -p)$. If $f_\phi^*$ is differentiable then $Q^*$ is also differentiable. Assume $p = f_\phi^*(\tilde\eta)$; then $\Delta Q(\eta, p) = d_{Q^*}(\tilde\eta, \eta)$.
17 Approximation error under convex risk minimization If $f_\phi^*$ is invertible, then the inverse $(f_\phi^*)^{-1}(f(x))$ can serve as a conditional probability estimate.
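For example, using the closed-form minimizers listed earlier, a score $f(x)$ can be mapped back to a probability estimate by inverting $f_\phi^*$; a small sketch of mine for the logistic and exponential losses:

```python
import numpy as np

# Inverse links (f_phi^*)^{-1}: map a score f(x) back to an estimate of eta(x) = P(Y=1|X=x).
def prob_from_score_logistic(fx):
    """Logistic loss: f* = ln(eta/(1-eta)), so eta = 1/(1+exp(-f))."""
    return 1.0 / (1.0 + np.exp(-fx))

def prob_from_score_exponential(fx):
    """Exponential loss: f* = 0.5*ln(eta/(1-eta)), so eta = 1/(1+exp(-2f))."""
    return 1.0 / (1.0 + np.exp(-2.0 * fx))

print(prob_from_score_logistic(np.log(4)))            # 0.8, recovering eta from f* = ln(0.8/0.2)
print(prob_from_score_exponential(0.5 * np.log(4)))   # 0.8 as well
```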
18 Examples of approximation error analysis Least Squares. The Bregman divergence is $d_\phi(p_1, p_2) = (p_2 - p_1)^2$. $\Delta Q(\eta, p) = (2\eta - 1 - p)^2$. $|\eta - 0.5|^2 = 0.5^2\, \Delta Q(\eta, 0)$. Thus we can choose $c = 0.5$ and $s = 2$.
19 Examples of approximation error analysis Modified Least Squares. The Bregman divergence is $d_\phi(p_1, p_2) = (p_2 - p_1)^2 - \max(0, p_2 - 1)^2$. $\Delta Q(\eta, p) = (2\eta - 1 - p)^2 - \eta\max(0, p - 1)^2 - (1 - \eta)\min(0, p + 1)^2$. $|\eta - 0.5|^2 \le 0.5^2\, \Delta Q(\eta, 0)$. Thus we can choose $c = 0.5$ and $s = 2$.
20 Examples of approximation error analysis SVM. The Bregman divergence is harder to compute than calculating $\Delta Q(\eta, p)$ directly. $\Delta Q(\eta, p) = \eta\max(0, 1 - p) + (1 - \eta)\max(0, 1 + p) - 1 + |2\eta - 1|$. $|\eta - 0.5| \le 0.5\, \Delta Q(\eta, 0)$. Thus we can choose $c = 0.5$ and $s = 1$.
21 Examples of approximation error analysis Exponential loss. The Bregman divergence is taken with respect to $Q^*(\eta) = 2\sqrt{\eta(1 - \eta)}$. $\Delta Q(\eta, p) = (\tilde\eta - \eta)(e^p - e^{-p}) + 2\sqrt{\tilde\eta(1 - \tilde\eta)} - 2\sqrt{\eta(1 - \eta)}$, where $\tilde\eta = 1/(1 + e^{-2p})$. $|\eta - 0.5|^2 \le (2^{-0.5})^2\, \Delta Q(\eta, 0)$. Thus we can choose $c = 2^{-0.5}$ and $s = 2$.
22 Examples of approximation error analysis Logistic regression. The Bregman divergence is taken with respect to $Q^*(\eta) = -\eta\ln\eta - (1 - \eta)\ln(1 - \eta)$. $\Delta Q(\eta, p) = \mathrm{KL}\big(\eta \,\|\, \frac{1}{1 + e^{-p}}\big)$, where KL is the KL divergence. $|\eta - 0.5|^2 \le (2^{-0.5})^2\, \Delta Q(\eta, 0)$. Thus we can choose $c = 2^{-0.5}$ and $s = 2$.
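The inequalities $|\eta - 0.5|^s \le c^s\, \Delta Q(\eta, 0)$ behind these choices of $c$ and $s$ can be checked numerically; a small sketch of mine, using $\Delta Q(\eta, 0) = Q(\eta, 0) - Q^*(\eta)$ with $Q(\eta, 0) = \phi(0)$:

```python
import numpy as np

eta = np.linspace(1e-6, 1 - 1e-6, 10001)

def check(name, Q0, Qstar, c, s):
    """Verify |eta - 0.5|**s <= c**s * DeltaQ(eta, 0) on a grid of eta,
    where DeltaQ(eta, 0) = Q(eta, 0) - Q*(eta) and Q(eta, 0) = phi(0)."""
    lhs = np.abs(eta - 0.5) ** s
    rhs = c ** s * (Q0 - Qstar)
    print(name, bool(np.all(lhs <= rhs + 1e-12)))

check("svm_hinge",     1.0,       1 - np.abs(2 * eta - 1),                          0.5,       1)
check("exponential",   1.0,       2 * np.sqrt(eta * (1 - eta)),                     2 ** -0.5, 2)
check("logistic",      np.log(2), -eta * np.log(eta) - (1 - eta) * np.log(1 - eta), 2 ** -0.5, 2)
check("least_squares", 1.0,       4 * eta * (1 - eta),                              0.5,       2)
```

All four checks print True on this grid.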
23 Examples of approximation error analysis There is much further discussion in this section, in terms of probability estimation, comparisons between the different loss functions, and finding the best $c$ and $s$. Please refer to the paper; this is a relatively easy part to read.
24 Universal approximation and consistency Now we consider a function class $C$ to which the function $f$ belongs. If $\inf_{f \in C} \Delta Q(f(\cdot))$ is small, then any $f(x) \in C$ that (approximately) minimizes (4) achieves a classification error close to the optimal Bayes error. We call a function class $C$ universal with respect to a convex loss function $\phi$ if any measurable conditional density function $\eta(x)$ has a distance of zero to $C$.
25 Universal approximation and consistency Next we introduce a universal approximation theorem. First the following definitions are needed. Let $U \subset \mathbb{R}^d$. Denote by $C(U)$ the Banach space of continuous functions $U \to \mathbb{R}$ under the uniform-norm topology. Call a probability measure $\mu$ on $\mathbb{R}^d$ regular if it is defined on the Borel sets of $\mathbb{R}^d$. Say that a convex function $\phi$ has property A if: $\phi$ is continuous and $Q^*$ is continuous; $\phi(p) < \phi(-p)$ for all $p > 0$; $f_\phi^*(\eta) \in (-\infty, +\infty)$ and is piecewise continuous in $(0,1)$.
26 Universal approximation and consistency Lemma. Assume $0 \le \delta < 0.5$. Let $\eta \in [0,1]$ and $\eta_\delta = \min(\max(\eta, \delta), 1 - \delta)$. If $\phi$ has property A, then $Q(\eta, f_\phi^*(\eta_\delta)) \le Q^*(\eta_\delta)$. The plot of $\eta_\delta$ is shown in the next slide. In the proof of the next theorem, the lemma provides a way of avoiding the difficult region where $f_\phi^*$ tends to infinity.
27 Universal approximation and consistency Figure: the clipping function $\eta_\delta$ plotted against $\eta$.
28 Universal approximation and consistency Theorem. Let $\phi$ be a convex function with property A. Consider a function class $C \subset C(U)$ defined on a Borel set $U \subset \mathbb{R}^d$. If $C$ is dense in $C(U)$, then for any regular probability measure $\mu$ of $x \in \mathbb{R}^d$ such that $\mu(U) = 1$, and any conditional probability $P(Y = 1 \mid X = x) = \eta(x)$, $\inf_{f \in C} \Delta Q(f(\cdot)) = 0$.
29 Universal approximation and consistency Consider the function classes $\mathbb{R}^d \to \mathbb{R}$ consisting of linear combinations of functions of the form $h(\omega^T x + b)$, where $\omega \in \mathbb{R}^d$, $b \in \mathbb{R}$, and $h$ is a fixed function: $C_h = \{\sum_{i=1}^{k} \alpha_i h(\omega_i^T x + b_i) : \alpha_i \in \mathbb{R}, \omega_i \in \mathbb{R}^d, b_i \in \mathbb{R}, k \in \mathbb{N}\}$. In the neural-network literature this function class is well studied and proved to be universal, as long as the function $h$ is sigmoidal. The next theorem is more general.
30 Universal approximation and consistency Theorem. If $h$ is a non-polynomial continuous function, then $C_h$ is dense in $C(U)$ for all compact subsets $U$ of $\mathbb{R}^d$. Introduction to the RKHS: consider kernel functions of the form $K_h([x_1, b_1], [x_2, b_2]) = h(x_1^T x_2 + b_1 b_2)$, where $h$ can be expressed as a Taylor expansion with non-negative coefficients. $K_h$ is a positive definite kernel. Denote by $H_h$ the corresponding RKHS induced by $K_h$. Also write $f(x) = f([x, 1])$.
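A small sketch of mine for one admissible choice of $h$: taking $h = \exp$, which is entire with nonnegative Taylor coefficients, gives $K_h([x_1, b_1], [x_2, b_2]) = \exp(x_1^T x_2 + b_1 b_2)$, and its Gram matrix is positive semi-definite as claimed.

```python
import numpy as np

def K_h(x1, b1, x2, b2, h=np.exp):
    """Kernel K_h([x1, b1], [x2, b2]) = h(x1^T x2 + b1*b2); h = exp is one admissible choice."""
    return h(np.dot(x1, x2) + b1 * b2)

# Gram matrix with the bias coordinate fixed to 1, matching the convention f(x) = f([x, 1]).
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))
G = np.array([[K_h(xi, 1.0, xj, 1.0) for xj in X] for xi in X])
print(np.all(np.linalg.eigvalsh(G) > -1e-8))  # True: the Gram matrix is positive semi-definite
```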
31 Universal approximation and consistency Consider the following estimation problem: $\hat f_n = \arg\inf_{f \in H_h} \big[\frac{1}{n}\sum_{i=1}^{n} \phi(Y_i f(X_i)) + \frac{\lambda_n}{2}\|f\|^2\big]$. We have the following theorem, which says we can show consistency by estimating leave-one-out bounds. Theorem. Let $\hat f_n^{[k]}$ be the solution of the above formulation with the $k$-th datum removed from the training set. Then $\|\hat f_n(\cdot) - \hat f_n^{[k]}(\cdot)\| \le \frac{2}{\lambda_n n}\, |\phi'(\hat f_n(X_k) Y_k)|\, h(X_k^T X_k + 1)^{1/2}$, where $\phi'$ denotes a subgradient of $\phi$.
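As an illustration of this estimation problem (a sketch of mine, not the paper's algorithm), the code below solves the regularized empirical $\phi$-risk for the logistic loss using the standard finite kernel expansion $f(\cdot) = \sum_j \alpha_j K_h([x_j, 1], \cdot)$ and a hypothetical toy dataset.

```python
import numpy as np
from scipy.optimize import minimize

def fit_kernel_logistic(X, Y, lam, h=np.exp):
    """Minimize (1/n) sum_i phi(Y_i f(X_i)) + (lam/2) ||f||^2 over f in H_h,
    with phi the logistic loss and f(.) = sum_j alpha_j K_h([x_j,1], .)."""
    n = len(Y)
    G = h(X @ X.T + 1.0)                                 # Gram matrix K_h([x_i,1],[x_j,1])

    def objective(alpha):
        m = Y * (G @ alpha)                              # margins Y_i f(X_i)
        risk = np.mean(np.logaddexp(0.0, -m))            # logistic loss ln(1 + exp(-m))
        return risk + 0.5 * lam * alpha @ G @ alpha      # ||f||^2 = alpha^T G alpha

    def gradient(alpha):
        m = Y * (G @ alpha)
        sigma = 1.0 / (1.0 + np.exp(m))                  # equals -phi'(m) for the logistic loss
        return -(G @ (Y * sigma)) / n + lam * (G @ alpha)

    res = minimize(objective, np.zeros(n), jac=gradient, method="L-BFGS-B")
    return res.x, G

# hypothetical toy data
rng = np.random.default_rng(1)
X = 0.5 * rng.normal(size=(60, 2))
Y = np.where(X[:, 0] + X[:, 1] > 0, 1.0, -1.0)
alpha, G = fit_kernel_logistic(X, Y, lam=0.1)
print(np.mean(np.sign(G @ alpha) != Y))                  # training error of the fitted f
```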
32 Universal approximation and consistency For simplicity, from now on we assume that $P(h(X^T X + 1) \le M^2) = 1$ for a constant $M$. Theorem. Under the assumption of the previous theorem, and assuming that $\phi$ is non-negative and $P(h(X^T X + 1) \le M^2) = 1$, for all $k$ the expected leave-one-out risk can be bounded as $E\, Q(\hat f_n^{[k]}(\cdot)) \le \inf_{f \in H_h}\big[Q(f(\cdot)) + \frac{\lambda_n}{2}\|f\|^2\big] + \frac{2 M^2 {\phi'_*}^2}{\lambda_n n}$, where the expectation is with respect to the training samples $(X_1, Y_1), \ldots, (X_n, Y_n)$ and $\phi'_* = \sup\{|\phi'(z)| : |z| \le \sqrt{2\phi(0)/\lambda_n}\, M\}$.
33 Universal approximation and consistency Here is a list of bounds on $\phi'_*$ for the loss functions considered in this paper. Least Squares: $\phi'_* \le \sqrt{8/\lambda_n}\, M + 2$. Modified Least Squares: $\phi'_* \le \sqrt{8/\lambda_n}\, M + 2$. SVM: $\phi'_* \le 1$. Exponential: $\phi'_* \le \exp(\sqrt{2/\lambda_n}\, M)$. Logistic Regression: $\phi'_* \le 1$. We are now in a position to introduce the last theorem of the paper.
34 Universal approximation and consistency Theorem. Let $h$ be an entire function with nonnegative Taylor coefficients. Assume we choose $\lambda_n$ such that, for least squares, modified least squares, SVM, or logistic regression, $\lambda_n \to 0$ and $\lambda_n n \to \infty$; or, for the exponential loss, $\lambda_n \to 0$ and $\lambda_n \log^2 n \to \infty$. Then for any distribution $D$ with a regular input probability measure that is bounded almost everywhere in $\mathbb{R}^d$, we have $\lim_{n\to\infty} E\, Q(\hat f_n(\cdot)) = \inf_{f \in H_h} Q(f(\cdot))$. Moreover, if $h$ is not a polynomial, then $\lim_{n\to\infty} E\, \Delta Q(\hat f_n(\cdot)) = 0$.
35 End of the presentation Thank you. Questions?