AdaBoost and other Large Margin Classifiers: Convexity in Classification

1 AdaBoost and other Large Margin Classifiers: Convexity in Classification Peter Bartlett Division of Computer Science and Department of Statistics UC Berkeley Joint work with Mikhail Traskin. slides at bartlett 1

2 The Pattern Classification Problem. i.i.d. (X, Y), (X_1, Y_1), ..., (X_n, Y_n) from X × {±1}. Use the data (X_1, Y_1), ..., (X_n, Y_n) to choose f_n : X → ℝ with small risk,
R(f_n) = Pr(sign(f_n(X)) ≠ Y) = E l(Y, f_n(X)).
Natural approach: minimize the empirical risk,
R̂(f) = Ê l(Y, f(X)) = (1/n) Σ_{i=1}^n l(Y_i, f(X_i)).
Often intractable... Replace the 0-1 loss l with a convex surrogate φ.

3 Large Margin Algorithms. Consider the margins, Y f(X). Define a margin cost function φ : ℝ → ℝ_+. Define the φ-risk of f : X → ℝ as R_φ(f) = E φ(Y f(X)). Choose f ∈ F to minimize the φ-risk. (E.g., use the data (X_1, Y_1), ..., (X_n, Y_n) to minimize the empirical φ-risk,
R̂_φ(f) = Ê φ(Y f(X)) = (1/n) Σ_{i=1}^n φ(Y_i f(X_i)),
or a regularized version.)
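A minimal numerical sketch of the two empirical objectives (the data, function names, and the choice of exponential surrogate are illustrative assumptions, not from the slides): the 0-1 empirical risk counts sign disagreements, while the empirical φ-risk averages φ(Y_i f(X_i)) for a convex margin cost φ.

```python
import numpy as np

def empirical_risk_01(scores, labels):
    """Empirical 0-1 risk: fraction of examples with sign(f(x_i)) != y_i."""
    return float(np.mean(np.sign(scores) != labels))

def empirical_phi_risk(scores, labels, phi=lambda a: np.exp(-a)):
    """Empirical phi-risk: average of phi(y_i * f(x_i)) for a margin cost phi."""
    margins = labels * scores
    return float(np.mean(phi(margins)))

# Toy data: labels in {-1, +1}, real-valued scores f(x_i).
rng = np.random.default_rng(0)
labels = rng.choice([-1.0, 1.0], size=100)
scores = labels + rng.normal(scale=1.0, size=100)   # noisy but informative scores

print(empirical_risk_01(scores, labels))            # 0-1 risk: hard to optimize directly
print(empirical_phi_risk(scores, labels))           # convex surrogate (exponential loss)
```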

4 Large Margin Algorithms. AdaBoost: F = span(G) for a VC-class G, φ(α) = exp(−α). Minimizes R̂_φ(f) using greedy basis selection and line search:
f_{t+1} = f_t + α_{t+1} g_{t+1}, where R̂_φ(f_t + α_{t+1} g_{t+1}) = min_{α ∈ ℝ, g ∈ G} R̂_φ(f_t + α g).
Effective in applications: real-time face detection, spoken dialogue systems, ...

5 Large Margin Algorithms. Many other variants:
Support vector machines with 1-norm soft margin: F = ball in a reproducing kernel Hilbert space H, φ(α) = max(0, 1 − α); the algorithm minimizes R̂_φ(f) + λ‖f‖²_H.
Neural net classifiers: φ(α) = (max(0, 0.8 − α))².
L2Boost, LS-SVMs: φ(α) = (1 − α)².
Logistic regression: φ(α) = log(1 + exp(−2α)).

6 Large Margin Algorithms. [Figure: the margin cost functions φ(α) plotted against α: exponential, hinge, logistic, and truncated quadratic.]
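The cost functions in the figure can be written out directly; the short sketch below is illustrative (function names are mine, and the truncated quadratic is taken with constant 1, whereas the neural-net variant above uses 0.8).

```python
import numpy as np

# Margin cost functions phi(alpha), where alpha = y * f(x).
def exponential(alpha):          # AdaBoost
    return np.exp(-alpha)

def hinge(alpha):                # SVM, 1-norm soft margin
    return np.maximum(0.0, 1.0 - alpha)

def logistic(alpha):             # logistic regression, as parameterized on the previous slide
    return np.log1p(np.exp(-2.0 * alpha))

def truncated_quadratic(alpha):  # squared-hinge-type loss
    return np.maximum(0.0, 1.0 - alpha) ** 2

alphas = np.linspace(-2, 2, 5)
for phi in (exponential, hinge, logistic, truncated_quadratic):
    print(phi.__name__, np.round(phi(alphas), 3))
```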

7 Statistical Consequences of Using a Convex Cost Is AdaBoost universally consistent? Other φ? (Lugosi and Vayatis, 2004), (Mannor, Meir and Zhang, 2002): regularized boosting. (Jiang, 2004): process consistency of AdaBoost, for certain probability distributions. (Zhang, 2004), (Steinwart, 2003): SVM. 7

8 Statistical Consequences of Using a Convex Cost How is risk related to φ-risk? (Lugosi and Vayatis, 2004), (Steinwart, 2003): asymptotic. (Zhang, 2004): comparison theorem. 8

9 Overview Relating excess risk to excess φ-risk. ψ-transform: best possible bound. conditions on φ. (with Mike Jordan and Jon McAuliffe) Universal consistency of AdaBoost. 9

10 Definitions and Facts.
R(f) = Pr(sign(f(X)) ≠ Y)   (risk)
R_φ(f) = E φ(Y f(X))   (φ-risk)
η(x) = Pr(Y = 1 | X = x)   (conditional probability)
R* = inf_f R(f),  R*_φ = inf_f R_φ(f).
η defines an optimal classifier: R* = R(x ↦ sign(η(x) − 1/2)).
Notice: R_φ(f) = E(E[φ(Y f(X)) | X]), and the conditional φ-risk is
E[φ(Y f(X)) | X = x] = η(x) φ(f(x)) + (1 − η(x)) φ(−f(x)).

12 Definitions. Conditional φ-risk:
E[φ(Y f(X)) | X = x] = η(x) φ(f(x)) + (1 − η(x)) φ(−f(x)).
Optimal conditional φ-risk for η ∈ [0, 1]:
H(η) = inf_{α ∈ ℝ} (η φ(α) + (1 − η) φ(−α)).
R*_φ = E H(η(X)).

13 Optimal Conditional φ-risk: Example. [Figure: φ(α) and φ(−α); the conditional φ-risks C_0.3(α) and C_0.7(α); the minimizer α*(η); H(η); and ψ(θ).]
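As a concrete check of these definitions (a sketch, not from the slides): for the exponential loss φ(α) = exp(−α) the inner infimum can be solved in closed form, with α*(η) = (1/2) log(η/(1 − η)) and H(η) = 2√(η(1 − η)); the code compares this with a direct numerical minimization of the conditional φ-risk.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def conditional_phi_risk(alpha, eta, phi=lambda a: np.exp(-a)):
    """C_eta(alpha) = eta * phi(alpha) + (1 - eta) * phi(-alpha)."""
    return eta * phi(alpha) + (1.0 - eta) * phi(-alpha)

def H_numeric(eta, phi=lambda a: np.exp(-a)):
    """Optimal conditional phi-risk H(eta) and its minimizer, via 1-d minimization."""
    res = minimize_scalar(conditional_phi_risk, args=(eta, phi),
                          bounds=(-20, 20), method="bounded")
    return res.fun, res.x

for eta in (0.3, 0.5, 0.7):
    H_num, alpha_star = H_numeric(eta)
    H_closed = 2.0 * np.sqrt(eta * (1.0 - eta))        # exponential-loss closed form
    alpha_closed = 0.5 * np.log(eta / (1.0 - eta))
    print(eta, round(H_num, 4), round(H_closed, 4),
          round(alpha_star, 4), round(alpha_closed, 4))
```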

14 Definitions. Optimal conditional φ-risk for η ∈ [0, 1]:
H(η) = inf_{α ∈ ℝ} (η φ(α) + (1 − η) φ(−α)).
Optimal conditional φ-risk with incorrect sign:
H⁻(η) = inf_{α : α(2η − 1) ≤ 0} (η φ(α) + (1 − η) φ(−α)).
Note: H⁻(η) ≥ H(η) and H⁻(1/2) = H(1/2).

15 Definitions.
H(η) = inf_{α ∈ ℝ} (η φ(α) + (1 − η) φ(−α)),
H⁻(η) = inf_{α : α(2η − 1) ≤ 0} (η φ(α) + (1 − η) φ(−α)).
Definition: for η ≠ 1/2, φ is classification-calibrated if H⁻(η) > H(η), i.e., pointwise optimization of the conditional φ-risk leads to the correct sign. (c.f. Lin (2001))

16 The ψ-transform. Definition: Given convex φ, define ψ : [0, 1] → [0, ∞) by
ψ(θ) = φ(0) − H((1 + θ)/2).
(The definition is a little more involved for non-convex φ.) [Figure: α*(η), H(η), and ψ(θ).]
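A quick numerical sketch of the ψ-transform (illustrative; the closed forms quoted in the comments are the standard ones for these two losses): for the exponential loss ψ(θ) = 1 − √(1 − θ²), and for the hinge loss ψ(θ) = θ. The code evaluates ψ(θ) = φ(0) − H((1 + θ)/2) by direct minimization and compares.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def H(eta, phi):
    """Optimal conditional phi-risk H(eta) via 1-d numerical minimization."""
    obj = lambda a: eta * phi(a) + (1.0 - eta) * phi(-a)
    return minimize_scalar(obj, bounds=(-50, 50), method="bounded").fun

def psi(theta, phi):
    """psi-transform for convex phi: psi(theta) = phi(0) - H((1 + theta)/2)."""
    return phi(0.0) - H((1.0 + theta) / 2.0, phi)

exp_loss = lambda a: np.exp(-a)
hinge = lambda a: max(0.0, 1.0 - a)

for theta in (0.1, 0.5, 0.9):
    print(theta,
          round(psi(theta, exp_loss), 4), round(1 - np.sqrt(1 - theta**2), 4),  # exponential
          round(psi(theta, hinge), 4), theta)                                   # hinge: psi(theta) = theta
```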

17 The Relationship between Excess Risk and Excess φ-risk.
Theorem:
1. For any P and f, ψ(R(f) − R*) ≤ R_φ(f) − R*_φ.
2. For |X| ≥ 2, ε > 0 and θ ∈ [0, 1], there is a P and an f with R(f) − R* = θ and ψ(θ) ≤ R_φ(f) − R*_φ ≤ ψ(θ) + ε.
3. The following conditions are equivalent:
(a) φ is classification-calibrated.
(b) ψ(θ_i) → 0 iff θ_i → 0.
(c) R_φ(f_i) → R*_φ implies R(f_i) → R*.

18 Classification-calibrated φ. If φ is classification-calibrated, then ψ(θ_i) → 0 iff θ_i → 0. Since the function ψ is always convex, in that case it is strictly increasing and so has an inverse. Thus, we can write
R(f) − R* ≤ ψ⁻¹(R_φ(f) − R*_φ).
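As a quick illustration of this inversion (a sketch, using the exponential-loss closed form ψ(θ) = 1 − √(1 − θ²) from the earlier snippet): ψ⁻¹(u) = √(u(2 − u)), so a guarantee on the excess φ-risk translates directly into a bound on the excess classification risk.

```python
import numpy as np

def psi_inverse_exponential(u):
    """Inverse of psi(theta) = 1 - sqrt(1 - theta^2), the psi-transform of the exponential loss."""
    u = np.asarray(u, dtype=float)
    return np.sqrt(u * (2.0 - u))

# Excess phi-risk values and the implied bounds: R(f) - R* <= psi^{-1}(R_phi(f) - R*_phi).
for excess_phi_risk in (0.2, 0.05, 0.01):
    print(excess_phi_risk, "->", round(float(psi_inverse_exponential(excess_phi_risk)), 4))
```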

19 Excess Risk Bounds: Proof Idea. Facts:
H(η) and H⁻(η) are symmetric about η = 1/2.
H(1/2) = H⁻(1/2), hence ψ(0) = 0.
ψ(θ) is convex.
ψ(θ) = H⁻((1 + θ)/2) − H((1 + θ)/2).

20 Excess Risk Bounds: Proof Idea. The excess risk of f : X → ℝ is
R(f) − R* = E(1[sign(f(X)) ≠ sign(η(X) − 1/2)] |2η(X) − 1|).
Thus,
ψ(R(f) − R*) ≤ E(1[sign(f(X)) ≠ sign(η(X) − 1/2)] ψ(|2η(X) − 1|))   (ψ convex, ψ(0) = 0)
= E(1[sign(f(X)) ≠ sign(η(X) − 1/2)] (H⁻(η(X)) − H(η(X))))   (definition of ψ)
≤ E(φ(Y f(X)) − H(η(X)))   (H⁻ and H minimize the conditional φ-risk, over the wrong sign and over all of ℝ)
= R_φ(f) − R*_φ.   (definition of R*_φ)

24 Classification-calibrated φ.
Theorem: If φ is convex, then φ is classification-calibrated if and only if φ is differentiable at 0 and φ′(0) < 0.
Theorem: If φ is classification-calibrated, then there is a γ > 0 such that for all α ∈ ℝ, γ φ(α) ≥ 1[α ≤ 0].
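The first theorem is easy to check numerically for convex losses; the sketch below (loss names and the flat "uncalibrated" example are illustrative) uses a centered finite difference as a stand-in for φ′(0), which is fine for losses that are smooth at 0.

```python
import numpy as np

# Calibration condition for convex phi: differentiable at 0 with phi'(0) < 0.
losses = {
    "exponential":         lambda a: np.exp(-a),
    "hinge":               lambda a: np.maximum(0.0, 1.0 - a),
    "logistic":            lambda a: np.log1p(np.exp(-2.0 * a)),
    "squared":             lambda a: (1.0 - a) ** 2,
    "flat (uncalibrated)": lambda a: 0.0 * a + 1.0,   # constant loss: phi'(0) = 0
}

h = 1e-6
for name, phi in losses.items():
    deriv_at_0 = (phi(h) - phi(-h)) / (2.0 * h)       # finite-difference phi'(0)
    verdict = "classification-calibrated" if deriv_at_0 < 0 else "not calibrated by this test"
    print(f"{name:>20}: phi'(0) ~ {deriv_at_0:+.3f} -> {verdict}")
```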

25 Overview Relating excess risk to excess φ-risk. Universal consistency of AdaBoost. The approximation/estimation decomposition. AdaBoost: Previous results. Universal consistency. (with Mikhail Traskin) 22

26 Universal Consistency. Assume: i.i.d. data, (X, Y), (X_1, Y_1), ..., (X_n, Y_n), from X × Y (with Y = {±1}). Consider a method f_n = A((X_1, Y_1), ..., (X_n, Y_n)), e.g., f_n = AdaBoost((X_1, Y_1), ..., (X_n, Y_n), t_n).
Definition: We say that the method is universally consistent if, for all distributions P, R(f_n) → R* almost surely, where R is the risk and R* is the Bayes risk:
R(f) = Pr(Y ≠ sign(f(X))),  R* = inf_f R(f).

27 The Approximation/Estimation Decomposition. Consider an algorithm that chooses either
f_n = argmin_{f ∈ F} [R̂_φ(f) + λ_n Ω(f)]   or   f_n = argmin_{f ∈ F_n} R̂_φ(f).
We can decompose the excess risk estimate as
ψ(R(f_n) − R*) ≤ R_φ(f_n) − R*_φ = [R_φ(f_n) − inf_{f ∈ F_n} R_φ(f)]   (estimation error)
+ [inf_{f ∈ F_n} R_φ(f) − R*_φ]   (approximation error).

29 The Approximation/Estimation Decomposition.
ψ(R(f_n) − R*) ≤ R_φ(f_n) − R*_φ = [R_φ(f_n) − inf_{f ∈ F_n} R_φ(f)]   (estimation error) + [inf_{f ∈ F_n} R_φ(f) − R*_φ]   (approximation error).
Approximation and estimation errors are in terms of R_φ, not R: like a regression problem.
With a rich class and appropriate regularization, R_φ(f_n) → R*_φ (e.g., F_n gets large slowly, or λ_n → 0 slowly).
Universal consistency (R(f_n) → R*) iff φ is classification-calibrated.
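To make the decomposition concrete, here is a small self-contained sketch (the two-point distribution, the exponential surrogate, and the finite class F_n are all illustrative assumptions, not from the slides): it computes population φ-risks exactly, picks the empirical φ-risk minimizer over F_n from a sample, and reports the two error terms.

```python
import itertools
import numpy as np

# Toy decomposition of excess phi-risk (exponential loss) on a two-point input space.
rng = np.random.default_rng(0)
xs, px = np.array([0, 1]), np.array([0.5, 0.5])      # X uniform on {0, 1}
eta = np.array([0.9, 0.2])                           # eta(x) = P(Y = 1 | X = x)
phi = lambda a: np.exp(-a)

def pop_phi_risk(f):                                 # exact population phi-risk of f = (f(0), f(1))
    return float(np.sum(px * (eta * phi(f) + (1 - eta) * phi(-f))))

R_phi_star = float(np.sum(px * 2 * np.sqrt(eta * (1 - eta))))   # E H(eta(X)) for the exponential loss

# Finite class F_n: functions taking values on a coarse grid at each of the two points.
grid = np.linspace(-1, 1, 5)
F_n = [np.array(v) for v in itertools.product(grid, repeat=2)]

# Draw a sample and pick the empirical phi-risk minimizer over F_n.
n = 50
X = rng.choice(xs, size=n, p=px)
Y = np.where(rng.random(n) < eta[X], 1.0, -1.0)
emp_phi_risk = lambda f: float(np.mean(phi(Y * f[X])))
f_n = min(F_n, key=emp_phi_risk)

best_in_class = min(pop_phi_risk(f) for f in F_n)
print("estimation error   :", pop_phi_risk(f_n) - best_in_class)
print("approximation error:", best_in_class - R_phi_star)
```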

30 Overview Relating excess risk to excess φ-risk. Universal consistency of AdaBoost. The approximation/estimation decomposition. AdaBoost: Previous results. Universal consistency. 26

31 AdaBoost. Sample S_n = ((x_1, y_1), ..., (x_n, y_n)) ∈ (X × {±1})^n; number of iterations T.
function AdaBoost(S_n, T)
  f_0 := 0
  for t from 1, ..., T:
    (α_t, h_t) := argmin_{α ∈ ℝ, h ∈ F} (1/n) Σ_{i=1}^n exp(−y_i (f_{t−1}(x_i) + α h(x_i)))
    f_t := f_{t−1} + α_t h_t
  return f_T
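A compact Python sketch of this loop (illustrative: the decision-stump base class, the synthetic data, and all names are my assumptions). For ±1-valued base classifiers, the joint minimization over (α, h) is equivalent to reweighting the sample by exp(−y_i f_{t−1}(x_i)), choosing the stump with smallest weighted error ε_t over a polarity-closed stump class, and taking the exact line-search step α_t = ½ log((1 − ε_t)/ε_t).

```python
import numpy as np

def stump_predict(X, feature, threshold, polarity):
    """Decision stump h(x) in {-1, +1}: sign of polarity * (x[feature] - threshold)."""
    return np.where(polarity * (X[:, feature] - threshold) > 0, 1.0, -1.0)

def best_stump(X, y, w):
    """Base learner: stump minimizing the weighted 0-1 error under weights w."""
    best = (None, np.inf)
    for feature in range(X.shape[1]):
        for threshold in np.unique(X[:, feature]):
            for polarity in (1.0, -1.0):
                pred = stump_predict(X, feature, threshold, polarity)
                err = np.sum(w * (pred != y))
                if err < best[1]:
                    best = ((feature, threshold, polarity), err)
    return best

def adaboost(X, y, T):
    """Greedy minimization of the exponential criterion with exact line search."""
    n = len(y)
    f = np.zeros(n)                        # current scores f_t(x_i) on the sample
    ensemble = []                          # list of (alpha_t, stump parameters)
    for _ in range(T):
        w = np.exp(-y * f)
        w /= w.sum()                       # weights proportional to exp(-y_i f_{t-1}(x_i))
        params, eps = best_stump(X, y, w)
        eps = np.clip(eps, 1e-12, 1 - 1e-12)
        alpha = 0.5 * np.log((1 - eps) / eps)
        f += alpha * stump_predict(X, *params)
        ensemble.append((alpha, params))
    return ensemble

def predict(ensemble, X):
    scores = sum(alpha * stump_predict(X, *params) for alpha, params in ensemble)
    return np.sign(scores)

# Toy run on synthetic data.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = np.where(X[:, 0] + X[:, 1] > 0, 1.0, -1.0)
ensemble = adaboost(X, y, T=20)
print("training error:", np.mean(predict(ensemble, X) != y))
```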

32 Previous results: Regularized versions. AdaBoost greedily minimizes
R̂_φ(f) = (1/n) Σ_{i=1}^n exp(−Y_i f(X_i))
over f ∈ span(F). (Notice that, for many interesting basis classes F, the infimum is zero.) Instead of AdaBoost, consider a regularized version of its criterion.

33 Previous results: Regularized versions.
1. Minimize R̂_φ(f) over f ∈ γ_n co(F), the scaled convex hull of F.
2. Minimize R̂_φ(f) + λ_n ‖f‖_1 over f ∈ span(F), where ‖f‖_1 = inf{γ : f ∈ γ co(F)}.
For suitable choices of the parameters (γ_n and λ_n), these algorithms are universally consistent. (Lugosi and Vayatis, 2004), (Zhang, 2004)

34 Previous results: Bounded step size.
function AdaBoostWithBoundedStepSize(S_n, T)
  f_0 := 0
  for t from 1, ..., T:
    (α_t, h_t) := argmin_{α ∈ ℝ, h ∈ F} (1/n) Σ_{i=1}^n exp(−y_i (f_{t−1}(x_i) + α h(x_i)))
    f_t := f_{t−1} + min{α_t, ε} h_t
  return f_T
For suitable choices of the parameters (T = T_n and ε = ε_n), this algorithm is universally consistent. (Zhang and Yu, 2005), (Bickel, Ritov, Zakai, 2006)

35 Previous results about AdaBoost. AdaBoost greedily minimizes
R̂_φ(f) = (1/n) Σ_{i=1}^n exp(−Y_i f(X_i))
over f ∈ span(F). Consider AdaBoost with early stopping: f_n is the function returned by AdaBoost after t_n steps. How should we choose t_n? Note: the infimum is often zero, so we don't want t_n too large.

36 Previous result about AdaBoost: Process consistency.
Theorem [Jiang, 2004]: For a (suitable) basis class defined on ℝ^d, and for all probability distributions P satisfying certain smoothness assumptions, there is a sequence t_n such that f_n = AdaBoost(S_n, t_n) satisfies R(f_n) → R* almost surely.
The conditions on the distribution P are unnatural and cannot be checked. How should the stopping time t_n grow with the sample size n? Does it need to depend on the distribution P? Rates?

37 Overview Relating excess risk to excess φ-risk. Universal consistency of AdaBoost. The approximation/estimation decomposition. AdaBoost: Previous results. Universal consistency. 33

38 The key theorem. Assume d_VC(F) < ∞; otherwise AdaBoost must stop and fail after one step. Assume
lim_{λ→∞} inf{R_φ(f) : f ∈ λ co(F)} = R*_φ,
where R_φ(f) = E exp(−Y f(X)) and R*_φ = inf_f R_φ(f). That is, the approximation error is zero. For example, F could be the class of linear threshold functions, or binary trees with axis-orthogonal decisions in ℝ^d and at least d + 1 leaves.

39 The key theorem.
Theorem: If d_VC(F) < ∞, R*_φ = lim_{λ→∞} inf{R_φ(f) : f ∈ λ co(F)}, t_n → ∞, and t_n = O(n^{1−α}) for some α > 0, then AdaBoost is universally consistent.

40 The key theorem: Idea of proof. We show R_φ(f_{t_n}) → R*_φ, which implies R(f_{t_n}) → R*, since the loss function α ↦ exp(−α) is classification-calibrated.
Step 1. Notice that we can clip f_{t_n}: if we define π_λ(f) as x ↦ max{−λ, min{λ, f(x)}}, then
R_φ(π_λ(f_{t_n})) → R*_φ ⟹ R(π_λ(f_{t_n})) → R*, and R(π_λ(f_{t_n})) = R(f_{t_n}) since clipping preserves the sign.
We will need to relax the clipping (λ_n → ∞). [Figure: the clipped function π_1(x).]
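The clipping operator itself is just a coordinate-wise truncation; a minimal sketch (the score values and names are illustrative):

```python
import numpy as np

def clip_scores(f_values, lam):
    """pi_lambda(f): truncate the boosted scores to [-lambda, lambda]; the sign is unchanged."""
    return np.clip(f_values, -lam, lam)

scores = np.array([-7.3, -0.4, 0.0, 2.1, 12.5])     # e.g., f_{t_n}(x_i) on a sample
clipped = clip_scores(scores, lam=1.0)
print(clipped)                                       # [-1.  -0.4  0.   1.   1. ]
assert np.all(np.sign(clipped) == np.sign(scores))   # so R(pi_lambda(f)) = R(f)
```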

41 The key theorem: Idea of proof. Step 2. Use VC theory (for clipped combinations of t functions from F) to show that, with high probability,
R_φ(π_λ(f_t)) ≤ R̂_φ(π_λ(f_t)) + c(λ) √(d_VC(F) t log t / n),
where R̂_φ is the empirical version of R_φ: R̂_φ(f) = E_n exp(−Y f(X)).

42 The key theorem: Idea of proof. Step 3. The clipping only hurts for small values of the exponential criterion:
R̂_φ(π_λ(f_t)) ≤ R̂_φ(f_t) + e^{−λ}.

43 The key theorem: Idea of proof. Step 4. Apply the numerical convergence result of (Bickel et al., 2006): for any comparison function f̄ ∈ λ co(F),
R̂_φ(f_t) ≤ R̂_φ(f̄) + ε(λ, t).
Here, we exploit an attractive property of the exponential loss function and the fact that the classifiers are binary-valued: the second derivative of R̂_φ in a basis direction is large whenever R̂_φ is large. This keeps the steps taken by AdaBoost from being too large.

44 The key theorem: Idea of proof. Step 5. Relate R̂_φ(f̄) to R_φ(f̄). Choosing λ_n suitably slowly, we can choose f̄_n so that R_φ(f̄_n) → R*_φ (by assumption), and then for t_n = O(n^{1−α}) we have the result.

45–50 The key theorem: Idea of proof. [Figure, built up over several slides: a schematic relating R*_φ, the empirical and population φ-risks of f_{t_n}, its clipped version π_{λ_n}(f_{t_n}), and the comparison functions f̄_n.]

51 Open Problems.
Other loss functions? E.g., LogitBoost uses α ↦ log(1 + exp(−2α)) in place of exp(−α). (The difficulty is the behaviour of the second derivative of R̂_φ in the direction of a basis function. For the numerical convergence results, we want it large whenever R̂_φ is large.)
Real-valued basis functions? (The same issue arises.)
Rates? The bottleneck is the rate of decrease of R̂_φ(f_t). The numerical convergence result only ensures that it decreases to R̂_φ(f̄) at rate 1/√(log t). This seems pessimistic.

52 Overview Relating excess risk to excess φ-risk. Universal consistency of AdaBoost. slides at bartlett 43
