AdaBoost and other Large Margin Classifiers: Convexity in Classification
1 AdaBoost and other Large Margin Classifiers: Convexity in Classification Peter Bartlett Division of Computer Science and Department of Statistics UC Berkeley Joint work with Mikhail Traskin. slides at bartlett 1
2 The Pattern Classification Problem
i.i.d. (X, Y), (X_1, Y_1), ..., (X_n, Y_n) from X × {±1}.
Use the data (X_1, Y_1), ..., (X_n, Y_n) to choose f_n : X → R with small risk,
R(f_n) = Pr(sign(f_n(X)) ≠ Y) = E ℓ(Y, f_n(X)).
Natural approach: minimize the empirical risk,
R̂(f) = Ê ℓ(Y, f(X)) = (1/n) Σ_{i=1}^n ℓ(Y_i, f(X_i)).
Often intractable... Replace the 0-1 loss ℓ with a convex surrogate, φ. 2
3 Large Margin Algorithms
Consider the margins, Y f(X).
Define a margin cost function φ : R → R_+.
Define the φ-risk of f : X → R as R_φ(f) = E φ(Y f(X)).
Choose f ∈ F to minimize the φ-risk.
(e.g., use the data (X_1, Y_1), ..., (X_n, Y_n) to minimize the empirical φ-risk,
R̂_φ(f) = Ê φ(Y f(X)) = (1/n) Σ_{i=1}^n φ(Y_i f(X_i)),
or a regularized version.) 3
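Purely as an illustration (added here, not on the slides), the empirical φ-risk for the exponential surrogate can be computed in a couple of lines; the function name and the toy data below are hypothetical.

```python
import numpy as np

def empirical_phi_risk(f_values, y, phi=lambda a: np.exp(-a)):
    """Empirical phi-risk: (1/n) * sum_i phi(y_i * f(x_i))."""
    return np.mean(phi(y * f_values))

# Hypothetical toy sample: f(x_i) for three points, two classified correctly.
y = np.array([+1.0, -1.0, +1.0])
f_values = np.array([0.5, -1.2, -0.3])
print(empirical_phi_risk(f_values, y))     # empirical exponential-loss risk
print(np.mean(np.sign(f_values) != y))     # empirical 0-1 risk, for comparison
```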
4 Large Margin Algorithms
AdaBoost: F = span(G) for a VC-class G, φ(α) = exp(−α).
Minimizes R̂_φ(f) using greedy basis selection and line search:
f_{t+1} = f_t + α_{t+1} g_{t+1}, where R̂_φ(f_t + α_{t+1} g_{t+1}) = min_{α∈R, g∈G} R̂_φ(f_t + α g).
Effective in applications: real-time face detection, spoken dialogue systems, ... 4
5 Large Margin Algorithms
Many other variants:
Support vector machines with 1-norm soft margin: F = ball in a reproducing kernel Hilbert space H, φ(α) = max(0, 1 − α); the algorithm minimizes R̂_φ(f) + λ‖f‖²_H.
Neural net classifiers: φ(α) = max(0, (0.8 − α)²).
L2Boost, LS-SVMs: φ(α) = (1 − α)².
Logistic regression: φ(α) = log(1 + exp(−2α)). 5
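Purely as an illustration (added here, not on the slides), the main surrogates above can be written down and compared with the 0-1 loss in a few lines of Python:

```python
import numpy as np

# Margin cost functions phi(alpha), with alpha = y * f(x).
exponential = lambda a: np.exp(-a)                      # AdaBoost
hinge       = lambda a: np.maximum(0.0, 1.0 - a)        # SVM, 1-norm soft margin
quadratic   = lambda a: (1.0 - a) ** 2                  # L2Boost, LS-SVM
logistic    = lambda a: np.log(1.0 + np.exp(-2.0 * a))  # logistic regression

# Evaluate each surrogate on a few margins, next to the 0-1 loss it replaces.
alphas = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print("0-1      ", (alphas <= 0).astype(float))
for name, phi in [("exp", exponential), ("hinge", hinge),
                  ("quad", quadratic), ("logistic", logistic)]:
    print(f"{name:9s}", np.round(phi(alphas), 3))
```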
6 Large Margin Algorithms
[Figure: the margin cost functions φ(α) (exponential, hinge, logistic, truncated quadratic) plotted as functions of the margin α.] 6
7 Statistical Consequences of Using a Convex Cost Is AdaBoost universally consistent? Other φ? (Lugosi and Vayatis, 2004), (Mannor, Meir and Zhang, 2002): regularized boosting. (Jiang, 2004): process consistency of AdaBoost, for certain probability distributions. (Zhang, 2004), (Steinwart, 2003): SVM. 7
8 Statistical Consequences of Using a Convex Cost How is risk related to φ-risk? (Lugosi and Vayatis, 2004), (Steinwart, 2003): asymptotic. (Zhang, 2004): comparison theorem. 8
9 Overview Relating excess risk to excess φ-risk. ψ-transform: best possible bound. conditions on φ. (with Mike Jordan and Jon McAuliffe) Universal consistency of AdaBoost. 9
10 Definitions and Facts
η(x) = Pr(Y = 1 | X = x)    (conditional probability)
R(f) = Pr(sign(f(X)) ≠ Y)    (risk)
R_φ(f) = E φ(Y f(X))    (φ-risk)
R* = inf_f R(f),  R*_φ = inf_f R_φ(f).
η defines an optimal classifier: R* = R(sign(η(X) − 1/2)). 10
11 Definitions and Facts
η(x) = Pr(Y = 1 | X = x), R(f) = Pr(sign(f(X)) ≠ Y), R_φ(f) = E φ(Y f(X)),
R* = inf_f R(f), R*_φ = inf_f R_φ(f), and R* = R(sign(η(X) − 1/2)).
Notice: R_φ(f) = E(E[φ(Y f(X)) | X]), and the conditional φ-risk is
E[φ(Y f(X)) | X = x] = η(x) φ(f(x)) + (1 − η(x)) φ(−f(x)). 11
12 Definitions
Conditional φ-risk: E[φ(Y f(X)) | X = x] = η(x) φ(f(x)) + (1 − η(x)) φ(−f(x)).
Optimal conditional φ-risk, for η ∈ [0, 1]: H(η) = inf_{α∈R} (η φ(α) + (1 − η) φ(−α)).
R*_φ = E H(η(X)). 12
13 Optimal Conditional φ-risk: Example
[Figure: for an example φ, the curves φ(α) and φ(−α), the conditional φ-risks C_0.3(α) and C_0.7(α), the minimizer α*(η), the optimal conditional φ-risk H(η), and the ψ-transform ψ(θ).]
14 Definitions
Optimal conditional φ-risk, for η ∈ [0, 1]: H(η) = inf_{α∈R} (η φ(α) + (1 − η) φ(−α)).
Optimal conditional φ-risk with incorrect sign: H⁻(η) = inf_{α: α(2η−1) ≤ 0} (η φ(α) + (1 − η) φ(−α)).
Note: H⁻(η) ≥ H(η) and H⁻(1/2) = H(1/2). 14
15 Definitions
H(η) = inf_{α∈R} (η φ(α) + (1 − η) φ(−α)),
H⁻(η) = inf_{α: α(2η−1) ≤ 0} (η φ(α) + (1 − η) φ(−α)).
Definition: for η ≠ 1/2, φ is classification-calibrated if H⁻(η) > H(η), i.e., pointwise optimization of the conditional φ-risk leads to the correct sign. (c.f. Lin (2001)) 15
16 The ψ-transform
Definition: Given convex φ, define ψ : [0, 1] → [0, ∞) by
ψ(θ) = φ(0) − H((1 + θ)/2).
(The definition is a little more involved for non-convex φ.)
[Figure: α*(η), H(η), and ψ(θ) for an example φ.]
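To make the ψ-transform concrete, here is a small numerical sketch (added for illustration). It computes H and ψ by direct minimization for the exponential loss φ(α) = e^{−α} and checks them against the standard closed forms H(η) = 2√(η(1 − η)) and ψ(θ) = 1 − √(1 − θ²).

```python
import numpy as np
from scipy.optimize import minimize_scalar

phi = lambda a: np.exp(-a)   # exponential loss

def H(eta):
    # Optimal conditional phi-risk: inf_alpha  eta*phi(alpha) + (1-eta)*phi(-alpha).
    obj = lambda a: eta * phi(a) + (1 - eta) * phi(-a)
    return minimize_scalar(obj, bounds=(-20, 20), method="bounded").fun

def psi(theta):
    # psi-transform for convex phi: phi(0) - H((1 + theta) / 2).
    return phi(0.0) - H((1 + theta) / 2)

for theta in [0.0, 0.25, 0.5, 0.9]:
    closed_form = 1 - np.sqrt(1 - theta ** 2)   # known psi for the exponential loss
    assert abs(psi(theta) - closed_form) < 1e-6
    print(theta, psi(theta))
```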
17 The Relationship between Excess Risk and Excess φ-risk
Theorem:
1. For any P and f, ψ(R(f) − R*) ≤ R_φ(f) − R*_φ.
2. For |X| ≥ 2, ε > 0 and θ ∈ [0, 1], there is a P and an f with R(f) − R* = θ and ψ(θ) ≤ R_φ(f) − R*_φ ≤ ψ(θ) + ε.
3. The following conditions are equivalent:
(a) φ is classification-calibrated.
(b) ψ(θ_i) → 0 iff θ_i → 0.
(c) R_φ(f_i) → R*_φ implies R(f_i) → R*. 17
18 Classification-calibrated φ
If φ is classification-calibrated, then ψ(θ_i) → 0 iff θ_i → 0. Since the function ψ is always convex with ψ(0) = 0, in that case it is strictly increasing and so has an inverse. Thus, we can write
R(f) − R* ≤ ψ⁻¹(R_φ(f) − R*_φ). 18
19 Excess Risk Bounds: Proof Idea
Facts:
H(η) and H⁻(η) are symmetric about η = 1/2.
H(1/2) = H⁻(1/2), hence ψ(0) = 0.
ψ(θ) is convex.
ψ(θ) = H⁻((1 + θ)/2) − H((1 + θ)/2). 19
20 Excess Risk Bounds: Proof Idea
The excess risk of f : X → R is
R(f) − R* = E(1[sign(f(X)) ≠ sign(η(X) − 1/2)] |2η(X) − 1|).
Thus,
ψ(R(f) − R*)
  ≤ E(1[sign(f(X)) ≠ sign(η(X) − 1/2)] ψ(|2η(X) − 1|))    (ψ convex, ψ(0) = 0)
  = E(1[sign(f(X)) ≠ sign(η(X) − 1/2)] (H⁻(η(X)) − H(η(X))))    (definition of ψ)
  ≤ E(φ(Y f(X)) − H(η(X)))    (H minimizes the conditional φ-risk)
  = R_φ(f) − R*_φ.    (definition of R*_φ) 20
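As a sanity check on the comparison theorem (an added illustration, not on the slides), the following sketch evaluates both sides of ψ(R(f) − R*) ≤ R_φ(f) − R*_φ for the exponential loss on a hypothetical three-point distribution.

```python
import numpy as np

# Hypothetical discrete distribution: three x-values with marginal p(x) and eta(x) = P(Y=1 | x).
p = np.array([0.5, 0.3, 0.2])
eta = np.array([0.9, 0.4, 0.55])
f = np.array([1.0, 0.2, -0.7])               # an arbitrary scoring function f(x)

phi = lambda a: np.exp(-a)
psi = lambda t: 1 - np.sqrt(1 - t ** 2)      # psi-transform of the exponential loss

R_phi      = np.sum(p * (eta * phi(f) + (1 - eta) * phi(-f)))   # E phi(Y f(X))
R_phi_star = np.sum(p * 2 * np.sqrt(eta * (1 - eta)))           # E H(eta(X)), H(eta) = 2 sqrt(eta(1-eta))
R      = np.sum(p * np.where(f > 0, 1 - eta, eta))              # Pr(sign(f(X)) != Y)
R_star = np.sum(p * np.minimum(eta, 1 - eta))                   # Bayes risk

assert psi(R - R_star) <= R_phi - R_phi_star + 1e-12
print(psi(R - R_star), R_phi - R_phi_star)
```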
24 Classification-calibrated φ
Theorem: If φ is convex, then φ is classification-calibrated ⟺ φ is differentiable at 0 and φ′(0) < 0.
Theorem: If φ is classification-calibrated, then there is a γ > 0 such that for all α ∈ R, γφ(α) ≥ 1[α ≤ 0]. 21
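Both statements are easy to check numerically for the surrogates listed earlier (an added illustration; finite differences stand in for φ′(0), and γ = 1/φ(0) happens to work for these four losses).

```python
import numpy as np

losses = {
    "exponential": lambda a: np.exp(-a),
    "hinge":       lambda a: np.maximum(0.0, 1.0 - a),
    "quadratic":   lambda a: (1.0 - a) ** 2,
    "logistic":    lambda a: np.log(1.0 + np.exp(-2.0 * a)),
}

alphas = np.linspace(-5.0, 5.0, 2001)
zero_one = (alphas <= 0).astype(float)
h = 1e-6

for name, phi in losses.items():
    # Calibration criterion for convex phi: phi'(0) < 0 (central difference).
    slope_at_zero = (phi(h) - phi(-h)) / (2 * h)
    assert slope_at_zero < 0
    # A gamma > 0 with gamma * phi(alpha) >= 1[alpha <= 0]; gamma = 1/phi(0) suffices here.
    gamma = 1.0 / phi(0.0)
    assert np.all(gamma * phi(alphas) >= zero_one)
    print(name, slope_at_zero, gamma)
```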
25 Overview Relating excess risk to excess φ-risk. Universal consistency of AdaBoost. The approximation/estimation decomposition. AdaBoost: Previous results. Universal consistency. (with Mikhail Traskin) 22
26 Universal Consistency
Assume: i.i.d. data (X, Y), (X_1, Y_1), ..., (X_n, Y_n) from X × Y (with Y = {±1}).
Consider a method f_n = A((X_1, Y_1), ..., (X_n, Y_n)), e.g., f_n = AdaBoost((X_1, Y_1), ..., (X_n, Y_n), t_n).
Definition: We say that the method is universally consistent if, for all distributions P, R(f_n) → R* a.s., where R is the risk and R* is the Bayes risk:
R(f) = Pr(Y ≠ sign(f(X))), R* = inf_f R(f). 23
27 The Approximation/Estimation Decomposition
Consider an algorithm that chooses f_n = arg min_{f∈F} R̂_φ(f) + λ_n Ω(f).
We can decompose the excess risk estimate as
ψ(R(f_n) − R*) ≤ R_φ(f_n) − R*_φ = [R_φ(f_n) − inf_{f∈F_n} R_φ(f)]  (estimation error)  +  [inf_{f∈F_n} R_φ(f) − R*_φ]  (approximation error). 24
28 The Approximation/Estimation Decomposition
Consider an algorithm that chooses f_n = arg min_{f∈F_n} R̂_φ(f).
We can decompose the excess risk estimate as
ψ(R(f_n) − R*) ≤ R_φ(f_n) − R*_φ = [R_φ(f_n) − inf_{f∈F_n} R_φ(f)]  (estimation error)  +  [inf_{f∈F_n} R_φ(f) − R*_φ]  (approximation error). 24
29 The Approximation/Estimation Decomposition
ψ(R(f_n) − R*) ≤ R_φ(f_n) − R*_φ = [R_φ(f_n) − inf_{f∈F_n} R_φ(f)]  (estimation error)  +  [inf_{f∈F_n} R_φ(f) − R*_φ]  (approximation error).
Approximation and estimation errors are in terms of R_φ, not R. Like a regression problem.
With a rich class and appropriate regularization, R_φ(f_n) → R*_φ (e.g., F_n gets large slowly, or λ_n → 0 slowly).
Universal consistency (R(f_n) → R*) iff φ is classification-calibrated. 25
30 Overview Relating excess risk to excess φ-risk. Universal consistency of AdaBoost. The approximation/estimation decomposition. AdaBoost: Previous results. Universal consistency. 26
31 AdaBoost
Sample: S_n = ((x_1, y_1), ..., (x_n, y_n)) ∈ (X × {±1})^n. Number of iterations: T.
function AdaBoost(S_n, T)
  f_0 := 0
  for t from 1, ..., T
    (α_t, h_t) := arg min_{α∈R, h∈F} (1/n) Σ_{i=1}^n exp(−y_i (f_{t−1}(x_i) + α h(x_i)))
    f_t := f_{t−1} + α_t h_t
  return f_T 27
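Below is a minimal runnable Python sketch of this loop (added for illustration; the stump learner and the data are hypothetical). It uses axis-aligned decision stumps as the base class F, and the closed-form line search α_t = ½ log((1 − err_t)/err_t), which is the exact minimizer of the exponential criterion when the base classifiers are ±1-valued.

```python
import numpy as np

def best_stump(X, y, w):
    """Weighted-error-minimizing axis-aligned stump h(x) = s * sign(x[j] - theta)."""
    best = (np.inf, None)
    for j in range(X.shape[1]):
        for theta in np.unique(X[:, j]):
            for s in (+1.0, -1.0):
                pred = s * np.where(X[:, j] > theta, 1.0, -1.0)
                err = np.sum(w * (pred != y))
                if err < best[0]:
                    best = (err, (j, theta, s))
    return best

def stump_predict(params, X):
    j, theta, s = params
    return s * np.where(X[:, j] > theta, 1.0, -1.0)

def adaboost(X, y, T):
    """Greedy coordinate descent on the empirical exponential risk."""
    n = X.shape[0]
    w = np.full(n, 1.0 / n)                    # weights proportional to exp(-y_i f_{t-1}(x_i))
    ensemble = []
    for _ in range(T):
        err, params = best_stump(X, y, w)
        err = np.clip(err, 1e-12, 1 - 1e-12)   # avoid an infinite step
        alpha = 0.5 * np.log((1 - err) / err)  # exact line search for {-1,+1}-valued h
        pred = stump_predict(params, X)
        w *= np.exp(-alpha * y * pred)
        w /= w.sum()
        ensemble.append((alpha, params))
    return lambda Xnew: sum(a * stump_predict(p, Xnew) for a, p in ensemble)

# Hypothetical data: two Gaussian blobs in the plane.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1.0, 1.0, (100, 2)), rng.normal(1.0, 1.0, (100, 2))])
y = np.concatenate([-np.ones(100), np.ones(100)])
f_T = adaboost(X, y, T=50)
print("training error:", np.mean(np.sign(f_T(X)) != y))
```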
32 Previous results: Regularized versions
AdaBoost greedily minimizes R̂_φ(f) = (1/n) Σ_{i=1}^n exp(−Y_i f(X_i)) over f ∈ span(F).
(Notice that, for many interesting basis classes F, the infimum is zero.)
Instead of AdaBoost, consider a regularized version of its criterion. 28
33 Previous results: Regularized versions
1. Minimize R̂_φ(f) over f ∈ γ_n co(F), the scaled convex hull of F.
2. Minimize R̂_φ(f) + λ_n ‖f‖_1 over f ∈ span(F), where ‖f‖_1 = inf{γ : f ∈ γ co(F)}.
For suitable choices of the parameters (γ_n and λ_n), these algorithms are universally consistent. (Lugosi and Vayatis, 2004), (Zhang, 2004) 29
34 Previous results: Bounded step size
function AdaBoostWithBoundedStepSize(S_n, T)
  f_0 := 0
  for t from 1, ..., T
    (α_t, h_t) := arg min_{α∈R, h∈F} (1/n) Σ_{i=1}^n exp(−y_i (f_{t−1}(x_i) + α h(x_i)))
    f_t := f_{t−1} + min{α_t, ε} h_t
  return f_T
For suitable choices of the parameters (T = T_n and ε = ε_n), this algorithm is universally consistent. (Zhang and Yu, 2005), (Bickel, Ritov, Zakai, 2006) 30
35 Previous results about AdaBoost
AdaBoost greedily minimizes R̂_φ(f) = (1/n) Σ_{i=1}^n exp(−Y_i f(X_i)) over f ∈ span(F).
Consider AdaBoost with early stopping: f_n is the function returned by AdaBoost after t_n steps.
How should we choose t_n? Note: the infimum is often zero, so we don't want t_n too large. 31
36 Previous result about AdaBoost: Process consistency
Theorem [Jiang, 2004]: For a (suitable) basis class defined on R^d, and for all probability distributions P satisfying certain smoothness assumptions, there is a sequence t_n such that f_n = AdaBoost(S_n, t_n) satisfies R(f_n) → R* a.s.
Conditions on the distribution P are unnatural and cannot be checked.
How should the stopping time t_n grow with sample size n? Does it need to depend on the distribution P? Rates? 32
37 Overview Relating excess risk to excess φ-risk. Universal consistency of AdaBoost. The approximation/estimation decomposition. AdaBoost: Previous results. Universal consistency. 33
38 The key theorem
Assume d_VC(F) < ∞. (Otherwise AdaBoost must stop, and fail, after one step.)
Assume lim_{λ→∞} inf{R_φ(f) : f ∈ λ co(F)} = R*_φ, where R_φ(f) = E exp(−Y f(X)) and R*_φ = inf_f R_φ(f). That is, the approximation error is zero.
For example: F is the class of linear threshold functions, or of binary trees with axis-orthogonal decisions in R^d and at least d + 1 leaves. 34
39 The key theorem
Theorem: If
d_VC(F) < ∞,
R*_φ = lim_{λ→∞} inf{R_φ(f) : f ∈ λ co(F)},
t_n → ∞, and t_n = O(n^{1−α}) for some α > 0,
then AdaBoost is universally consistent. 35
40 The key theorem: Idea of proof
We show R_φ(f_{t_n}) → R*_φ, which implies R(f_{t_n}) → R*, since the loss function α ↦ exp(−α) is classification-calibrated.
Step 1. Notice that we can clip f_{t_n}: if we define π_λ(f) as x ↦ max{−λ, min{λ, f(x)}}, then R_φ(π_λ(f_{t_n})) → R*_φ implies R(π_λ(f_{t_n})) → R*, and R(π_λ(f_{t_n})) = R(f_{t_n}), since clipping does not change the sign.
We will need to relax the clipping (λ_n → ∞).
[Figure: the clipping function π_1(x).] 36
41 The key theorem: Idea of proof
Step 2. Use VC theory (for clipped combinations of t functions from F) to show that, with high probability,
R_φ(π_λ(f_t)) ≤ R̂_φ(π_λ(f_t)) + c(λ) √(d_VC(F) t log t / n),
where R̂_φ is the empirical version of R_φ: R̂_φ(f) = E_n exp(−Y f(X)). 37
42 The key theorem: Idea of proof
Step 3. The clipping only hurts for small values of the exponential criterion:
R̂_φ(π_λ(f_t)) ≤ R̂_φ(f_t) + e^{−λ}.
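A small numeric check of Steps 1 and 3 (an added illustration with hypothetical scores and labels): clipping never changes the sign, so the 0-1 risk is unaffected, and a clipped term exp(−y_i π_λ(f)(x_i)) can exceed the unclipped one only when y_i f(x_i) > λ, and then by at most e^{−λ}.

```python
import numpy as np

rng = np.random.default_rng(1)
f_values = rng.normal(0.0, 3.0, 1000)        # hypothetical scores f(x_i)
y = rng.choice([-1.0, 1.0], 1000)            # hypothetical labels

def emp_exp_risk(scores, labels):
    return np.mean(np.exp(-labels * scores))

for lam in (0.5, 1.0, 5.0):
    clipped = np.clip(f_values, -lam, lam)   # pi_lambda(f), evaluated on the sample
    # Step 1: clipping preserves the sign, hence the empirical 0-1 risk.
    assert np.mean(np.sign(clipped) != y) == np.mean(np.sign(f_values) != y)
    # Step 3: the empirical exponential criterion increases by at most exp(-lambda).
    assert emp_exp_risk(clipped, y) <= emp_exp_risk(f_values, y) + np.exp(-lam)
```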
43 The key theorem: Idea of proof
Step 4. Apply the numerical convergence result of (Bickel et al, 2006): for any comparison function f̄ ∈ F_λ = λ co(F),
R̂_φ(f_t) ≤ R̂_φ(f̄) + ε(λ, t).
Here, we exploit an attractive property of the exponential loss function and the fact that the classifiers are binary-valued: the second derivative of R̂_φ in a basis direction is large whenever R̂_φ is large. This keeps the steps taken by AdaBoost from being too large. 39
44 The key theorem: Idea of proof
Step 5. Relate R̂_φ(f̄) to R_φ(f̄). Choosing λ_n suitably slowly, we can choose f̄_n so that R_φ(f̄_n) → R*_φ (by assumption), and then for t_n = O(n^{1−α}) we have the result. 40
45–50 The key theorem: Idea of proof
[Figure, built up over six slides: a schematic of the chain of approximations in the proof, relating the empirical criterion R̂_n and the true φ-risk R, the iterate f_{t_n}, its clipped version π_{λ_n}(f_{t_n}), the comparison function f̄_n, and the optimal φ-risk R*.] 41
51 Open Problems
Other loss functions? e.g., LogitBoost uses α ↦ log(1 + exp(−2α)) in place of α ↦ exp(−α). (The difficulty is the behaviour of the second derivative of R̂_φ in the direction of a basis function. For the numerical convergence results, we want it large whenever R̂_φ is large.)
Real-valued basis functions? (The same issue arises.)
Rates? The bottleneck is the rate of decrease of R̂_φ(f_t). The numerical convergence result only ensures that it decreases to R̂_φ(f̄) at rate log^{−1/2} t. This seems pessimistic. 42
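To see the curvature issue concretely (an added numerical illustration with hypothetical margins): for the exponential loss and a ±1-valued basis direction h, the second derivative of the empirical criterion along h equals the criterion itself, so it is automatically large whenever R̂_φ is large; for the logistic loss the corresponding second derivative is bounded by 1 no matter how large the criterion is.

```python
import numpy as np

rng = np.random.default_rng(3)
margins = rng.normal(-1.0, 1.0, 500)    # hypothetical margins y_i * f(x_i), mostly negative

# Exponential loss: phi = phi'', so along any {-1,+1}-valued basis direction the
# curvature of the empirical criterion at alpha = 0 equals the criterion itself.
exp_criterion = np.mean(np.exp(-margins))
exp_curvature = np.mean(np.exp(-margins))
assert np.isclose(exp_criterion, exp_curvature)

# Logistic loss: phi(m) = log(1 + exp(-2m)), phi''(m) = 4 exp(2m) / (1 + exp(2m))^2 <= 1.
log_criterion = np.mean(np.log1p(np.exp(-2.0 * margins)))
log_curvature = np.mean(4.0 * np.exp(2.0 * margins) / (1.0 + np.exp(2.0 * margins)) ** 2)
print(exp_criterion, exp_curvature)     # equal, and large when the criterion is large
print(log_criterion, log_curvature)     # curvature stays below 1
```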
52 Overview Relating excess risk to excess φ-risk. Universal consistency of AdaBoost. slides at bartlett 43