STATISTICAL BEHAVIOR AND CONSISTENCY OF CLASSIFICATION METHODS BASED ON CONVEX RISK MINIMIZATION

1 STATISTICAL BEHAVIOR AND CONSISTENCY OF CLASSIFICATION METHODS BASED ON CONVEX RISK MINIMIZATION Tong Zhang The Annals of Statistics, 2004

2 Outline: Motivation; Approximation error under convex risk minimization; Examples of approximation error analysis; Universal approximation and consistency.

3 Motivation Machine learning: predict $y$ for a given $x$. Relationship: $y \approx f(x)$, where $f$ is called the classification function. Criterion: minimize a problem-dependent loss $\ell(f(x), y)$. Usual assumption: the data $(X, Y)$ are drawn i.i.d. from a common but unknown distribution $D(X, Y)$. Explicit formula for the expected loss: $L(f(\cdot)) = E_{X,Y}\,\ell(f(X), Y)$.

4 Motivation This paper deals with the binary problem, namely $y \in \{\pm 1\}$. Prediction rule: $y = 1$ if $f(x) \ge 0$ and $y = -1$ if $f(x) < 0$. The classification error of $f$ is given by: $I(f(x), y) = 1$ if $f(x)y < 0$, $I(f(x), y) = 1$ if $f(x) = 0$ and $y = -1$, and $I(f(x), y) = 0$ otherwise. The empirical error is given by: $\frac{1}{n}\sum_{i=1}^{n} I(f(X_i), Y_i)$. (1)

5 Motivation However, the minimization of (1) can be NP-hard because the objective is not convex. We need a surrogate loss $\phi(\cdot, \cdot)$ which makes the computation easier. When $\phi(f, y) = \phi(yf)$ the method is called a large margin classifier. For example, AdaBoost employs the exponential loss $\phi(yf) = \exp(-yf)$ and the SVM employs the hinge loss $\phi(yf) = [1 - yf]_+$. Now instead of (1) we minimize the empirical risk: $\frac{1}{n}\sum_{i=1}^{n} \phi(f(X_i) Y_i)$. (2)

6 Motivation The minimization of (1) can be regarded as an approximation to the true classification error: $L(f(\cdot)) = E_{X,Y}\, I(f(X), Y)$. (3) And the minimization of (2) can be regarded as an approximation to the true risk: $Q(f(\cdot)) = E_{X,Y}\, \phi(f(X) Y)$. (4) This paper studies the impact of using $\phi$.

7 Motivation The paper considers the following five loss functions: Least Squares: $\phi(v) = (1 - v)^2$. Modified Least Squares: $\phi(v) = \max(1 - v, 0)^2$. SVM: $\phi(v) = [1 - v]_+$. Exponential: $\phi(v) = \exp(-v)$. Logistic Regression: $\phi(v) = \ln(1 + \exp(-v))$.
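
Note: to make these surrogates concrete, the following minimal Python sketch implements the five loss functions as functions of the margin $v = yf(x)$, together with the empirical risk (2); the function names are illustrative, not from the paper.

import numpy as np

def least_squares(v):
    # phi(v) = (1 - v)^2
    return (1.0 - v) ** 2

def modified_least_squares(v):
    # phi(v) = max(1 - v, 0)^2
    return np.maximum(1.0 - v, 0.0) ** 2

def hinge(v):
    # SVM loss: phi(v) = [1 - v]_+
    return np.maximum(1.0 - v, 0.0)

def exponential(v):
    # AdaBoost loss: phi(v) = exp(-v)
    return np.exp(-v)

def logistic(v):
    # phi(v) = ln(1 + exp(-v)), written in a numerically stable form
    return np.logaddexp(0.0, -v)

def empirical_risk(phi, f_values, y):
    # Empirical phi-risk (2): average surrogate loss over the sample.
    return np.mean(phi(f_values * y))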

8 Approximation error under convex risk minimization In this section the relationship between $L(f(\cdot))$ and $Q(f(\cdot))$ is studied. Rewrite (4) as an expectation of the conditional risk: $Q(f(\cdot)) = E_X[\eta(X)\phi(f(X)) + (1 - \eta(X))\phi(-f(X))]$, (5) where $\eta(x)$ is the conditional probability $P(Y = 1 \mid X = x)$.

9 Approximation error under convex risk minimization $L(f(\cdot))$ can be written as: $L(f(\cdot)) = E_{f(X) \ge 0}\,(1 - \eta(X)) + E_{f(X) < 0}\,\eta(X)$. (6) The following notation is very useful in this section: $Q(\eta, f) = \eta\phi(f) + (1 - \eta)\phi(-f)$. (7) Define $f_\phi^*(\eta): [0,1] \to \bar{R}$ (where $\bar{R}$ is the extended real line) as $f_\phi^*(\eta) = \arg\min_{f \in \bar{R}} Q(\eta, f)$, and $Q^*(\eta) = \inf_{f \in \bar{R}} Q(\eta, f) = Q(\eta, f_\phi^*(\eta))$.

10 Approximation error under convex risk minimization Define the excess risk as: $\Delta Q(\eta, f) = Q(\eta, f) - Q(\eta, f_\phi^*(\eta)) = Q(\eta, f) - Q^*(\eta)$, and $\Delta Q(f(\cdot)) = Q(f(\cdot)) - Q(f_\phi^*(\eta(\cdot))) = E_X\, \Delta Q(\eta(X), f(X))$.

11 Approximation error under convex risk minimization The above formulations are easy to calculate:
Least Squares: $f_\phi^*(\eta) = 2\eta - 1$; $Q^*(\eta) = 4\eta(1 - \eta)$.
Modified Least Squares: $f_\phi^*(\eta) = 2\eta - 1$; $Q^*(\eta) = 4\eta(1 - \eta)$.
SVM: $f_\phi^*(\eta) = \mathrm{sign}(2\eta - 1)$; $Q^*(\eta) = 1 - |2\eta - 1|$.
Exponential: $f_\phi^*(\eta) = \frac{1}{2}\ln\frac{\eta}{1 - \eta}$; $Q^*(\eta) = 2\sqrt{\eta(1 - \eta)}$.
Logistic Regression: $f_\phi^*(\eta) = \ln\frac{\eta}{1 - \eta}$; $Q^*(\eta) = -\eta\ln\eta - (1 - \eta)\ln(1 - \eta)$.
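
Note: as a sanity check on this table, $f_\phi^*(\eta)$ and $Q^*(\eta)$ can be recovered numerically by minimizing $Q(\eta, f) = \eta\phi(f) + (1 - \eta)\phi(-f)$ over $f$. A rough sketch, assuming SciPy is available (the bounded search interval is an assumption, needed because the true minimizer is infinite at $\eta \in \{0, 1\}$):

import numpy as np
from scipy.optimize import minimize_scalar

def conditional_risk(phi, eta, f):
    # Q(eta, f) = eta * phi(f) + (1 - eta) * phi(-f)
    return eta * phi(f) + (1.0 - eta) * phi(-f)

def minimize_conditional_risk(phi, eta, bound=20.0):
    # Numerically approximate f*_phi(eta) and Q*(eta) over [-bound, bound].
    res = minimize_scalar(lambda f: conditional_risk(phi, eta, f),
                          bounds=(-bound, bound), method="bounded")
    return res.x, res.fun

eta = 0.8
# Least squares: expect f* = 2*eta - 1 = 0.6 and Q* = 4*eta*(1 - eta) = 0.64.
print(minimize_conditional_risk(lambda v: (1.0 - v) ** 2, eta))
# Exponential: expect f* = 0.5*ln(eta/(1 - eta)) ~ 0.693 and Q* = 2*sqrt(eta*(1 - eta)) = 0.8.
print(minimize_conditional_risk(lambda v: np.exp(-v), eta))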

12 Approximation error under convex risk minimization Theorem. Assume $f_\phi^*(\eta) > 0$ when $\eta > 0.5$. Assume there exist $c > 0$ and $s \ge 1$ such that for all $\eta \in [0,1]$, $|0.5 - \eta|^s \le c^s\, \Delta Q(\eta, 0)$. Then for any measurable function $f(x)$, $L(f(\cdot)) - L^* \le 2c\, \Delta Q(f(\cdot))^{1/s}$, where $L^*$ is the optimal Bayes error, $L^* = L(2\eta(\cdot) - 1)$.
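
Note: plugging in the constants derived in the examples that follow gives concrete instances of this bound (a worked instantiation, not an additional result from the paper):
\[
\begin{aligned}
\text{SVM } (c = 0.5,\ s = 1):&\quad L(f(\cdot)) - L^* \le \Delta Q(f(\cdot)),\\
\text{Least squares } (c = 0.5,\ s = 2):&\quad L(f(\cdot)) - L^* \le \sqrt{\Delta Q(f(\cdot))},\\
\text{Logistic regression } (c = 2^{-0.5},\ s = 2):&\quad L(f(\cdot)) - L^* \le \sqrt{2\,\Delta Q(f(\cdot))}.
\end{aligned}
\]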

13 Approximation error under convex risk minimization Proof omitted. A corollary is also omitted in this presentation. The above theorem shows that if there is a functional relationship between $|0.5 - \eta|$ and $\Delta Q(\eta, 0)$, we can bound the excess classification error by a transformation of the excess risk. More specifically, if $\Delta Q(f(\cdot)) \to 0$, then $L(f(\cdot)) - L^* \to 0$. The quantity $\Delta Q(\eta, f)$ is the key in the proof of the theorem. A question arises: how do we compute it?

14 Approximation error under convex risk minimization Introduction to the Bregman divergence. For a convex function $\phi$, its Bregman divergence is defined as: $d_\phi(f_1, f_2) = \phi(f_2) - \phi(f_1) - \phi'(f_1)(f_2 - f_1)$. Here the prime denotes a subgradient. For a concave function $g$, the Bregman divergence is defined as $d_g(\eta_1, \eta_2) = d_{-g}(\eta_1, \eta_2)$. The Bregman divergence is always non-negative. A plot on the next slide illustrates the idea.
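
Note: a minimal numerical sketch of this definition for a differentiable convex $\phi$ (the derivative is approximated by a finite difference, which is only an approximation, not a true subgradient):

import numpy as np

def bregman_divergence(phi, f1, f2, h=1e-6):
    # d_phi(f1, f2) = phi(f2) - phi(f1) - phi'(f1) * (f2 - f1),
    # with phi'(f1) approximated by a central finite difference.
    dphi = (phi(f1 + h) - phi(f1 - h)) / (2.0 * h)
    return phi(f2) - phi(f1) - dphi * (f2 - f1)

# Example with the exponential loss: the divergence is non-negative and
# vanishes when the two arguments coincide.
phi = lambda v: np.exp(-v)
print(bregman_divergence(phi, 0.5, 2.0))   # positive
print(bregman_divergence(phi, 0.5, 0.5))   # ~ 0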

15 Approximation error under convex risk minimization Figure: Bregman divergence (the red line segment).

16 Approximation error under convex risk minimization Lemma. $Q^*(\eta)$ is a concave function of $\eta$. Theorem. If $\phi$ is differentiable, then the Bregman divergence is uniquely defined. Furthermore, $\Delta Q(\eta, p) = \eta\, d_\phi(f_\phi^*(\eta), p) + (1 - \eta)\, d_\phi(-f_\phi^*(\eta), -p)$. If $f_\phi^*$ is differentiable then $Q^*$ is also differentiable. Assume $p = f_\phi^*(\tilde\eta)$; then $\Delta Q(\eta, p) = d_{Q^*}(\tilde\eta, \eta)$.

17 Approximation error under convex risk minimization If $f_\phi^*$ is invertible, then the inverse $f_\phi^{*\,-1}(f(x))$ can serve as a conditional probability estimate.

18 Examples of approximation error analysis Least Squares. The Bregman divergence is $d_\phi(p_1, p_2) = (p_2 - p_1)^2$. $\Delta Q(\eta, p) = (2\eta - 1 - p)^2$. $|\eta - 0.5|^2 = 0.5^2\, \Delta Q(\eta, 0)$. Thus we can choose $c = 0.5$ and $s = 2$.

19 Examples of approximation error analysis Modified Least Squares. The Bregman divergence is $d_\phi(p_1, p_2) = (p_2 - p_1)^2 - \max(0, p_2 - 1)^2$. $\Delta Q(\eta, p) = (2\eta - 1 - p)^2 - \eta\max(0, p - 1)^2 - (1 - \eta)\min(0, p + 1)^2$. $|\eta - 0.5|^2 \le 0.5^2\, \Delta Q(\eta, 0)$. Thus we can choose $c = 0.5$ and $s = 2$.

20 Examples of approximation error analysis SVM. The Bregman divergence is harder to compute here than calculating $\Delta Q(\eta, p)$ directly. $\Delta Q(\eta, p) = \eta\max(0, 1 - p) + (1 - \eta)\max(0, 1 + p) - 1 + |2\eta - 1|$. $|\eta - 0.5| \le 0.5\, \Delta Q(\eta, 0)$. Thus we can choose $c = 0.5$ and $s = 1$.

21 Examples of approximation error analysis Exponential loss. The relevant Bregman divergence is that of the concave function $Q^*(\eta) = 2\sqrt{\eta(1 - \eta)}$. $\Delta Q(\eta, p) = (\tilde\eta - \eta)(e^p - e^{-p}) + 2\sqrt{\tilde\eta(1 - \tilde\eta)} - 2\sqrt{\eta(1 - \eta)}$, where $\tilde\eta = 1/(1 + e^{-2p})$. $|\eta - 0.5|^2 \le (2^{-0.5})^2\, \Delta Q(\eta, 0)$. Thus we can choose $c = 2^{-0.5}$ and $s = 2$.

22 Examples of approximation error analysis Logistic regression. The relevant Bregman divergence is that of the concave function $Q^*(\eta) = -\eta\ln\eta - (1 - \eta)\ln(1 - \eta)$. $\Delta Q(\eta, p) = \mathrm{KL}\big(\eta \,\big\|\, \frac{1}{1 + e^{-p}}\big)$, where KL is the KL-divergence. $|\eta - 0.5|^2 \le (2^{-0.5})^2\, \Delta Q(\eta, 0)$. Thus we can choose $c = 2^{-0.5}$ and $s = 2$.
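
Note: the claimed $(c, s)$ pairs can be verified numerically using $\Delta Q(\eta, 0) = \phi(0) - Q^*(\eta)$ together with the closed forms of $Q^*(\eta)$ from slide 11 (modified least squares coincides with least squares at $p = 0$ and is omitted); a small Python check:

import numpy as np

eta = np.linspace(1e-6, 1.0 - 1e-6, 9999)

# Delta Q(eta, 0) = phi(0) - Q*(eta), since Q(eta, 0) = phi(0) for every eta.
checks = {
    "least squares": (1.0 - 4.0 * eta * (1.0 - eta),              0.5,       2),
    "SVM":           (np.abs(2.0 * eta - 1.0),                    0.5,       1),
    "exponential":   (1.0 - 2.0 * np.sqrt(eta * (1.0 - eta)),     2 ** -0.5, 2),
    "logistic":      (np.log(2.0) + eta * np.log(eta)
                      + (1.0 - eta) * np.log(1.0 - eta),          2 ** -0.5, 2),
}

for name, (dq0, c, s) in checks.items():
    # Verify |0.5 - eta|^s <= c^s * Delta Q(eta, 0) on the grid.
    holds = np.all(np.abs(0.5 - eta) ** s <= c ** s * dq0 + 1e-12)
    print(name, bool(holds))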

23 Examples of approximation error analysis There is much further discussion in this section of the paper, in terms of comparing the conditional probability estimates produced by different loss functions and in terms of finding the best $c$ and $s$. Please refer to the paper; this is a relatively easy part to read.

24 Universal approximation and consistency Now we consider a function class $C$ to which the function $f$ belongs. If $\inf_{f \in C} \Delta Q(f(\cdot))$ is small, then any $f(x) \in C$ that (approximately) minimizes (4) achieves a classification error close to the optimal Bayes error. We call a function class $C$ universal with respect to a convex loss function $\phi$ if every measurable conditional probability function $\eta(x)$ has a distance of zero to $C$, i.e. $\inf_{f \in C} \Delta Q(f(\cdot)) = 0$.

25 Universal approximation and consistency Next we introduce a universal approximation theorem. First the following definitions are needed. Let $U \subset R^d$. Denote by $C(U)$ the Banach space of continuous functions $U \to R$ under the uniform-norm topology. Call a probability measure $\mu$ on $R^d$ regular if it is defined on the Borel sets of $R^d$. Say that a convex function $\phi$ has property A if: (i) $\phi$ is continuous and $Q^*$ is continuous; (ii) $\phi(p) < \phi(-p)$ for all $p > 0$; (iii) $f_\phi^*(\eta) \in (-\infty, +\infty)$ and is piecewise continuous on $(0, 1)$.

26 Universal approximation and consistency Lemma. Assume $0 \le \delta < 0.5$. Let $\eta \in [0,1]$ and $\eta_\delta = \min(\max(\eta, \delta), 1 - \delta)$. If $\phi$ has property A, then $Q(\eta, f_\phi^*(\eta_\delta)) \le Q^*(\eta_\delta)$. The plot of $\eta_\delta$ is shown on the next slide. In the proof of the next theorem, the lemma provides a way to avoid the difficult region where $f_\phi^*$ tends to infinity.

27 Universal approximation and consistency Figure: The function $\eta_\delta$ (plotted against $\eta$, with thresholds at $\delta$ and $1 - \delta$).

28 Universal approximation and consistency Theorem. Let $\phi$ be a convex function with property A. Consider a function class $C \subset C(U)$ defined on a Borel set $U \subset R^d$. If $C$ is dense in $C(U)$, then for any regular probability measure $\mu$ of $x \in R^d$ such that $\mu(U) = 1$, and any conditional probability $P(Y = 1 \mid X = x) = \eta(x)$, $\inf_{f \in C} \Delta Q(f(\cdot)) = 0$.

29 Universal approximation and consistency Consider the function class $R^d \to R$ consisting of linear combinations of functions of the form $h(\omega^T x + b)$, where $\omega \in R^d$, $b \in R$ and $h$ is a fixed function: $C_h = \{\sum_{i=1}^{k} \alpha_i h(\omega_i^T x + b_i) : \alpha_i \in R, \omega_i \in R^d, b_i \in R, k \in N\}$. In the neural network literature this function class is well studied and has been proved to be universal as long as the function $h$ is sigmoidal. The next theorem is more general.

30 Universal approximation and consistency Theorem. If $h$ is a non-polynomial continuous function, then $C_h$ is dense in $C(U)$ for all compact subsets $U$ of $R^d$. Introduction to the RKHS. Consider kernel functions of the form $K_h([x_1, b_1], [x_2, b_2]) = h(x_1^T x_2 + b_1 b_2)$, where $h$ can be expressed as a Taylor expansion with non-negative coefficients. $K_h$ is a positive definite kernel. Denote by $H_h$ the corresponding RKHS induced by $K_h$. Also denote $\bar f(x) = f([x, 1])$.
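
Note: a small sketch of such a kernel, taking $h(t) = \exp(t)$ purely as an illustrative choice of entire function with nonnegative Taylor coefficients:

import numpy as np

def K_h(x1, b1, x2, b2, h=np.exp):
    # K_h([x1, b1], [x2, b2]) = h(x1^T x2 + b1 * b2)
    return h(np.dot(x1, x2) + b1 * b2)

# Gram matrix on a toy sample, appending b = 1 to every point,
# matching the convention f_bar(x) = f([x, 1]) above.
X = np.random.default_rng(0).normal(size=(5, 3))
G = np.array([[K_h(xi, 1.0, xj, 1.0) for xj in X] for xi in X])
# A positive definite kernel yields a (numerically) positive semidefinite Gram matrix.
print(np.all(np.linalg.eigvalsh(G) > -1e-8))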

31 Universal approximation and consistency Consider the following estimation problem: $\hat f_n = \arg\inf_{f \in H_h} \big[\frac{1}{n}\sum_{i=1}^{n} \phi(Y_i \bar f(X_i)) + \frac{\lambda_n}{2}\|f\|^2\big]$. We have the following theorem, which says we can show consistency by estimating leave-one-out bounds. Theorem. Let $\hat f_n^{[k]}$ be the solution of the above formulation with the $k$-th datum removed from the training set; then $\|\hat f_n(\cdot) - \hat f_n^{[k]}(\cdot)\| \le \frac{2}{\lambda_n n}\, |\phi'(\hat f_n(X_k) Y_k)|\, h(X_k^T X_k + 1)^{1/2}$, where $\phi'$ denotes a subgradient of $\phi$.
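
Note: a rough sketch of the estimator $\hat f_n$ for the logistic loss, using the representer theorem to write $f(\cdot) = \sum_j \alpha_j K_h([X_j, 1], \cdot)$ and plain gradient descent on $\alpha$; the choices $h = \exp$, the data scaling, the step-size rule and the iteration count are illustrative assumptions, not prescriptions from the paper.

import numpy as np

def objective(alpha, G, y, lam):
    # (1/n) sum_i log(1 + exp(-y_i f(X_i))) + (lam/2) ||f||^2,
    # where f(X_i) = (G alpha)_i and ||f||^2 = alpha^T G alpha is the RKHS norm.
    margins = y * (G @ alpha)
    return np.mean(np.logaddexp(0.0, -margins)) + 0.5 * lam * alpha @ G @ alpha

def fit_regularized_kernel_logistic(X, y, lam, h=np.exp, steps=2000):
    # Gram matrix G_ij = h(X_i^T X_j + 1) = K_h([X_i, 1], [X_j, 1]).
    G = h(X @ X.T + 1.0)
    n = len(y)
    # Conservative step size from a Lipschitz bound on the objective's gradient.
    gnorm = np.linalg.norm(G, 2)
    lr = 1.0 / (gnorm * (0.25 * gnorm / n + lam))
    alpha = np.zeros(n)
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(y * (G @ alpha)))   # equals -phi'(y_i f(X_i)) for the logistic loss
        grad = -(G @ (y * p)) / n + lam * (G @ alpha)
        alpha -= lr * grad
    return alpha, G

# Toy usage: the regularized empirical risk decreases from its value at alpha = 0.
rng = np.random.default_rng(0)
X = 0.5 * rng.normal(size=(40, 2))
y = np.sign(X[:, 0] + 1e-3)
alpha, G = fit_regularized_kernel_logistic(X, y, lam=0.1)
print(objective(np.zeros(40), G, y, 0.1), "->", objective(alpha, G, y, 0.1))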

32 Universal approximation and consistency For simplicity, from now on we assume that $P(h(X^T X + 1) \le M^2) = 1$ for a constant $M$. Theorem. Under the assumptions of the previous theorem, and assuming that $\phi$ is non-negative and $P(h(X^T X + 1) \le M^2) = 1$, for all $k$ the expected leave-one-out risk can be bounded as $E\, Q(\hat f_n^{[k]}(\cdot)) \le \inf_{f \in H_h} \big[Q(f(\cdot)) + \frac{\lambda_n}{2}\|f\|^2\big] + \frac{2 M^2 {\phi_\infty'}^2}{\lambda_n n}$, where the expectation is with respect to the training samples $(X_1, Y_1), \dots, (X_n, Y_n)$ and $\phi_\infty' = \sup\{|\phi'(z)| : |z| \le \sqrt{2\phi(0)/\lambda_n}\, M\}$.

33 Universal approximation and consistency Here is a list of $\phi_\infty'$ bounds for the loss functions considered in this paper. Least Squares: $\phi_\infty' \le \sqrt{8/\lambda_n}\, M + 2$. Modified Least Squares: $\phi_\infty' \le \sqrt{8/\lambda_n}\, M + 2$. SVM: $\phi_\infty' \le 1$. Exponential: $\phi_\infty' \le \exp(\sqrt{2/\lambda_n}\, M)$. Logistic Regression: $\phi_\infty' \le 1$. We are now in a position to introduce the last theorem of the paper.

34 Universal approximation and consistency Theorem. Let $h$ be an entire function with nonnegative Taylor coefficients. Assume we choose $\lambda_n$ such that, for least squares, modified least squares, SVM or logistic regression, $\lambda_n \to 0$ and $\lambda_n n \to \infty$; or, for the exponential loss, $\lambda_n \to 0$ and $\lambda_n \log^2 n \to \infty$. Then for any distribution $D$ with a regular input probability measure which is bounded almost everywhere in $R^d$, we have $\lim_{n \to \infty} E\, Q(\hat f_n(\cdot)) = \inf_{f \in H_h} Q(f(\cdot))$. Moreover, if $h$ is not a polynomial, then $\lim_{n \to \infty} E\, \Delta Q(\hat f_n(\cdot)) = 0$.

35 End of the presentation Thank you. Questions?
