Weighted Principal Support Vector Machines for Sufficient Dimension Reduction in Binary Classification
1 Weighted Principal Support Vector Machines for Sufficient Dimension Reduction in Binary Classification. A joint work with Seung Jun Shin, Hao Helen Zhang, and Yufeng Liu.
2 Outline
3 Outline
4 Sufficient Dimension Reduction. For a given pair (Y, X) ∈ R × R^p, sufficient dimension reduction (SDR) seeks a matrix B = (b_1, ..., b_d) ∈ R^{p×d} which satisfies

    Y ⊥ X | B'X.   (1)
5 Central Subspace. A dimension reduction subspace (DRS) is a subspace span(B) ⊆ R^p for which (1) holds. The central subspace S_{Y|X} is the intersection of all DRSs; it has the minimum dimension among all DRSs and exists uniquely under very mild conditions (Cook, 1998, Prop. 6.4). We assume S_{Y|X} = span(B). The dimension d of S_{Y|X} is called the structural dimension.
6 Estimation of S_{Y|X}. Seminal paper of the early 1990s: K.-C. Li (1991), "Sliced Inverse Regression for Dimension Reduction" (with discussion), JASA, 86. Many other methods followed: Sliced Average Variance Estimation (SAVE, 1991), Principal Hessian Directions (pHd, 1992), Contour Regression (2005), Fourier-Transformation-Based Estimation (2005), Directional Regression (2007), Cumulative Sliced Regression (CUME, 2008), and many others.
7 Sliced Inverse Regression. Foundation of SIR: under the linearity condition,

    E(Z | Y) ∈ S_{Y|Z} = Σ^{1/2} S_{Y|X}, where Z = Σ^{-1/2}{X - E(X)}.

Slice the response into H non-overlapping intervals I_1, ..., I_H and compute

    m̂_h := E_n(Z | Y ∈ I_h) = (1/n_h) Σ_{Y_i ∈ I_h} z_i,  h = 1, ..., H.

B is estimated by premultiplying the first d leading eigenvectors of Σ_{h=1}^H m̂_h m̂_h' by Σ̂^{-1/2}.
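The recipe above can be written out directly (a minimal NumPy sketch, not the authors' code; the slice boundaries and the weighting of the slice means are simplified choices):

```python
import numpy as np

def sir(X, y, H=10, d=2):
    """Sliced Inverse Regression: estimate d basis directions of S_{Y|X}.

    Standardize X, slice y into H intervals, average the standardized
    predictors within each slice, and eigen-decompose the weighted sum
    of outer products of the slice means.
    """
    n, p = X.shape
    Xc = X - X.mean(axis=0)
    Sigma = np.cov(X, rowvar=False)
    vals, vecs = np.linalg.eigh(Sigma)             # symmetric Sigma^{-1/2}
    Sigma_inv_half = vecs @ np.diag(vals ** -0.5) @ vecs.T
    Z = Xc @ Sigma_inv_half                        # standardized predictors
    edges = np.quantile(y, np.linspace(0, 1, H + 1))
    M = np.zeros((p, p))
    for h in range(H):
        upper = y <= edges[h + 1] if h == H - 1 else y < edges[h + 1]
        mask = (y >= edges[h]) & upper
        if mask.any():
            m_h = Z[mask].mean(axis=0)             # slice mean of Z
            M += (mask.sum() / n) * np.outer(m_h, m_h)
    w, v = np.linalg.eigh(M)                       # leading eigenvectors of M
    return Sigma_inv_half @ v[:, ::-1][:, :d]      # map back to the X scale
```

For a response such as Y = X_1 + noise, the leading estimated direction aligns closely with e_1.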
8 SIR with Binary Response. If Y ∈ {-1, 1} is binary, there is only one possible way to slice: I_1 = {i : y_i = 1} and I_2 = {i : y_i = -1}. The associated slice means z̄_1 and z̄_2 are linearly dependent since z̄_n = 0, so SIR can estimate at most ONE direction.
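The rank deficiency is easy to verify numerically (a toy check, not from the slides): with centered standardized predictors, n_1 z̄_1 + n_2 z̄_2 = 0, so the SIR candidate matrix is exactly rank one.

```python
import numpy as np

# Two slice means of centered predictors are collinear when y is binary,
# so the SIR candidate matrix sum_h (n_h/n) m_h m_h' has rank 1.
rng = np.random.default_rng(0)
Z = rng.normal(size=(200, 4))
Z = Z - Z.mean(axis=0)                 # centered: columns sum to zero
y = rng.choice([-1, 1], size=200)
m1 = Z[y == 1].mean(axis=0)
m2 = Z[y == -1].mean(axis=0)
M = (y == 1).mean() * np.outer(m1, m1) + (y == -1).mean() * np.outer(m2, m2)
rank = np.linalg.matrix_rank(M, tol=1e-10)
```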
9 Illustration: Wisconsin Diagnostic Breast Cancer Data. Figure: (a) SIR, Y vs. b̂_1'X; (b) SAVE, b̂_1'X vs. b̂_2'X.
10 Outline
11 Principal Support Vector Machine. For (Y, X) ∈ R × R^p, the PSVM (Li et al., 2011, AOS) solves the SVM-like problem

    (a_{0,c}, b_{0,c}) = argmin_{a,b} b'Σb + λ E[1 - Ỹ_c(a + b'(X - E X))]_+,

where b'Σb = Var(b'X) and the hinge acts on f(X) = a + b'(X - E X):
- Ỹ_c = 1{Y ≥ c} - 1{Y < c} for a given constant c,
- Σ = cov(X),
- [u]_+ = max(0, u).

Foundation of the PSVM: under the linearity condition, b_{0,c} ∈ S_{Y|X} for any given c.
12 PSVM: Sample Estimation. Given data (X_i, Y_i), i = 1, ..., n:
1. For a given grid min Y_i < c_1 < ... < c_H < max Y_i, solve a sequence of PSVMs for different values of c_h:

    (â_{n,h}, b̂_{n,h}) = argmin_{a,b} b'Σ̂_n b + (λ/n) Σ_{i=1}^n [1 - Ỹ_{i,c_h}(a + b'(X_i - X̄_n))]_+.

2. The first k leading eigenvectors of M̂_n^L = Σ_{h=1}^H b̂_{n,h} b̂_{n,h}' estimate a basis set of S_{Y|X}.
13 PSVM: Remarks.
Pros: outperforms SIR; extends to the kernel PSVM to handle nonlinear SDR.
Cons: estimates only one direction if Y is binary.
14 Weighted PSVM. Toward SDR with binary Y, the WPSVM minimizes

    Λ_π(θ) = β'Σβ + λ E[π(Y){1 - Y(α + β'(X - E X))}_+],

where
- θ = (α, β')' ∈ R × R^p,
- π(y) = 1 - π if y = 1 and π otherwise, for a given π ∈ (0, 1),
- Y itself is binary (no need for Ỹ_c).

Let θ_{0,π} = (α_{0,π}, β_{0,π}')' = argmin_θ Λ_π(θ).

Foundation of the weighted PSVM: under the linearity condition, β_{0,π} ∈ S_{Y|X} for any given π ∈ (0, 1).
15 WPSVM: Sample Estimation. Given (X_i, Y_i) ∈ R^p × {1, -1}, i = 1, ..., n:
1. For a given grid 0 < π_1 < ... < π_H < 1, solve a sequence of WPSVMs,

    Λ̂_{n,π_h}(θ) = β'Σ̂_n β + (λ/n) Σ_{i=1}^n π_h(Y_i)[1 - Y_i(α + β'(X_i - X̄_n))]_+,

and let θ̂_{n,h} = (α̂_{n,h}, β̂_{n,h}')' = argmin_θ Λ̂_{n,π_h}(θ).
2. The first k leading eigenvectors of the WPSVM candidate matrix M̂_n^{WL} = Σ_{h=1}^H β̂_{n,h} β̂_{n,h}' estimate a basis set of S_{Y|X}.
16 Computation. Let η = Σ̂_n^{1/2} β and U_i = Σ̂_n^{-1/2}(X_i - X̄_n). The WPSVM objective Λ̂_{n,π_h}(θ) becomes

    η'η + (λ/n) Σ_{i=1}^n π_h(Y_i)[1 - Y_i(α + η'U_i)]_+,

which is exactly the linear weighted SVM (WSVM) objective for the data (U_i, Y_i). Hence it suffices to solve the WSVM H times, once for each weight π_h, h = 1, ..., H.
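This reduction can be sketched end to end (plain subgradient descent stands in for a proper WSVM solver; `lam`, the step sizes, and the iteration count are illustrative choices, not the authors'):

```python
import numpy as np

def wpsvm_directions(X, y, pis, lam=1.0, k=1, n_iter=2000, lr=0.05):
    """Linear WPSVM directions for binary y in {-1, +1}.

    For each pi: whiten X to U_i = Sigma^{-1/2}(X_i - Xbar), minimize the
    WSVM objective eta'eta + (lam/n) * sum_i pi(y_i) * hinge_i over
    (alpha, eta) by subgradient descent, map eta back to
    beta = Sigma^{-1/2} eta, and finally eigen-decompose
    sum_h beta_h beta_h'.
    """
    n, p = X.shape
    Xc = X - X.mean(axis=0)
    Sigma = np.cov(X, rowvar=False)
    vals, vecs = np.linalg.eigh(Sigma)
    Sigma_inv_half = vecs @ np.diag(vals ** -0.5) @ vecs.T
    U = Xc @ Sigma_inv_half                      # whitened predictors
    M = np.zeros((p, p))
    for pi in pis:
        w = np.where(y == 1, 1.0 - pi, pi)       # observation weights pi(y)
        alpha, eta = 0.0, np.zeros(p)
        for t in range(n_iter):
            margin = y * (alpha + U @ eta)
            act = w * y * (margin < 1)           # hinge subgradient mask
            g_eta = 2 * eta - (lam / n) * (U.T @ act)
            g_alpha = -(lam / n) * act.sum()
            step = lr / np.sqrt(t + 1.0)         # decaying step size
            eta -= step * g_eta
            alpha -= step * g_alpha
        beta = Sigma_inv_half @ eta              # back to the X scale
        M += np.outer(beta, beta)
    _, V = np.linalg.eigh(M)
    return V[:, ::-1][:, :k]                     # leading eigenvectors
```

For Y = sign(X_1 + noise), the recovered leading direction aligns with e_1 across the π grid.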
17 π-path. Wang et al. (2008, Biometrika) show that the WSVM solution moves piecewise-linearly as a function of π. Shin et al. (2013, JCGS) implemented the π-path algorithm in R while developing a two-dimensional solution surface for weighted SVMs. Figure: solution paths of the linear WPSVM, β̂_n as a function of π.
18 Asymptotic Results (1). The standard approach based on the M-estimation scheme, similar to results for the linear SVM:
- Koo et al. (2008, JMLR)
- Jiang et al. (2008, JMLR)

Consistency of θ̂_n: if Σ is positive definite, then θ̂_n → θ_0 in probability.
19 Asymptotic Results (2). Asymptotic normality of θ̂_n (a Bahadur representation): under regularity conditions ensuring the existence of the gradient vector D_θ and Hessian matrix H_θ of Λ_π(θ),

    √n (θ̂_n - θ_0) = n^{-1/2} H_{θ_0}^{-1} Σ_{i=1}^n D_{θ_0}(Z_i) + o_p(1),

where

    D_θ(Z) = (0, (2Σβ)')' - λ π(Y) X̃ Y 1{θ'X̃ Y < 1},

    H_θ = 2 diag(0, Σ) + λ Σ_{y=-1,1} P(Y = y) π(y) f_{β'X|Y}(y - α | y) E(X̃X̃' | θ'X̃ = y),

with X̃ = (1, X')'.
20 Asymptotic Results (3). For a given grid π_1 < ... < π_H, define the population WPSVM candidate matrix

    M_0^{WL} = Σ_{h=1}^H β_{0,h} β_{0,h}'.

Asymptotic normality of M̂_n^{WL}: suppose rank(M_0^{WL}) = k. Under the regularity conditions,

    √n {vec(M̂_n^{WL}) - vec(M_0^{WL})} → N(0, Σ_M),

where Σ_M is given explicitly. Asymptotic normality of the eigenvectors of M̂_n^{WL} then follows from the normality of M̂_n^{WL} (Bura & Pfeiffer, 2008).
21 Structural Dimension Selection. We estimate k as

    k̂ = argmax_{k ∈ {1,...,p}} [ Σ_{j=1}^k υ_j - ρ k (log n / n) υ_1 ],

where υ_1 ≥ ... ≥ υ_p are the eigenvalues of M̂_n. Then lim_{n→∞} P(k̂ = k) = 1.
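The criterion transcribes directly into code (a sketch; `rho` is the tuning constant whose selection is described at the end of the slides):

```python
import numpy as np

def select_k(M_hat, n, rho=0.5):
    """BIC-type structural dimension estimate from the candidate matrix.

    Maximizes sum_{j<=k} v_j - rho * k * (log n / n) * v_1 over k,
    where v_1 >= ... >= v_p are the eigenvalues of M_hat.
    """
    v = np.sort(np.linalg.eigvalsh(M_hat))[::-1]   # descending eigenvalues
    crit = [v[:k].sum() - rho * k * (np.log(n) / n) * v[0]
            for k in range(1, len(v) + 1)]
    return int(np.argmax(crit)) + 1
```

With two dominant eigenvalues the penalty term stops the sum after k = 2, mirroring the consistency statement above.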
22 Outline
23 Nonlinear SDR. Nonlinear SDR assumes Y ⊥ X | φ(X), where φ : R^p → R^k is an arbitrary function of X living in H, a Hilbert space of functions of X. SDR is achieved by estimating φ.
24 Kernel WPSVM: Objective Function. The kernel WPSVM objective function is

    Λ_π(α, ψ) = var(ψ(X)) + λ E{π(Y)[1 - Y(α + ψ(X) - E ψ(X))]_+},

and the kernel WPSVM solves

    (α_{0,π}, ψ_{0,π}) = argmin_{α ∈ R, ψ ∈ H} Λ_π(α, ψ).
25 Kernel WPSVM: Foundation. For a given π, ψ_{0,π} has a version that is σ{φ(X)}-measurable; roughly speaking, ψ_{0,π} is a function of φ. This is the nonlinear generalization of the linear SDR statement: β_{0,π} ∈ S_{Y|X} = span(B) means that β_{0,π}'X is a linear function of B'X.
26 Kernel WPSVM: Sample Estimation. Work in a reproducing kernel Hilbert space. Using the linear operator Σ defined by ⟨ψ_1, Σψ_2⟩_H = cov{ψ_1(X), ψ_2(X)},

    Λ_π(α, ψ) = ⟨ψ, Σψ⟩_H + λ E{π(Y)[1 - Y(α + ψ(X) - E ψ(X))]_+}.

Li et al. (2011) proposed using as a basis set the first d leading eigenfunctions of the sample operator Σ_n : H → H defined by ⟨ψ_1, Σ_n ψ_2⟩_H = cov_n(ψ_1(X), ψ_2(X)). By Proposition 2 in Li et al. (2011), the eigenfunctions ω_j(X), j = 1, ..., d, can be readily obtained from the eigen-decomposition of (I_n - J_n) K_n (I_n - J_n). We chose d = n/4.
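The eigen-decomposition step can be sketched as follows (any Gram matrix K_n works; the RBF kernel in the usage below is an arbitrary choice, not prescribed by the slides):

```python
import numpy as np

def centered_kernel_basis(K, d):
    """First d eigenfunctions of Sigma_n, evaluated at the sample points.

    Following Li et al. (2011, Prop. 2), these are the leading
    eigenvectors of the doubly centered Gram matrix
    (I_n - J_n) K (I_n - J_n), with J_n = (1/n) 1 1'.
    """
    n = K.shape[0]
    C = np.eye(n) - np.full((n, n), 1.0 / n)   # centering matrix I_n - J_n
    Kc = C @ K @ C
    vals, vecs = np.linalg.eigh(Kc)
    order = np.argsort(vals)[::-1][:d]         # d largest eigenvalues
    return vecs[:, order], vals[order]
```

The returned eigenvectors are orthonormal and ordered by decreasing eigenvalue, so they serve directly as the columns of the basis matrix used on the next slide.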
27 Kernel WPSVM: Sample Estimation. The sample version of Λ_π(α, ψ) is

    Λ̂_{n,π}(α, γ) = γ'Ω'Ωγ + (λ/n) Σ_{i=1}^n π(Y_i)[1 - Y_i(α + γ'Ω_i)]_+,

where ω_1, ..., ω_d are the first d leading eigenfunctions of the operator Σ_n, Ω is the n × d matrix with (i, j)th entry ω*_j(X_i), ω*_j(X) = ω_j(X) - (1/n) Σ_{i=1}^n ω_j(X_i), and Ω_i denotes the ith row of Ω.
28 Kernel WPSVM: Dual Problem. Dual formulation:

    ν̂ = argmax_{ν_1,...,ν_n} Σ_{i=1}^n ν_i - (1/4) Σ_{i=1}^n Σ_{j=1}^n ν_i ν_j Y_i Y_j P_Ω^{(i,j)}

subject to
(i) 0 ≤ ν_i ≤ λ π(Y_i), i = 1, ..., n,
(ii) Σ_{i=1}^n ν_i Y_i = 0,

where P_Ω^{(i,j)} is the (i, j)th element of P_Ω = Ω(Ω'Ω)^{-1}Ω'. The kernel WPSVM solution is given by

    γ̂_n = (1/2) Σ_{i=1}^n ν̂_i Y_i (Ω'Ω)^{-1} Ω_i.
29 Kernel WPSVM: Algorithm.
1. For a given grid π_1 < ... < π_H, compute a sequence of kernel WPSVM solutions (α̂_{n,h}, γ̂_{n,h}) = argmin_{α,γ} Λ̂_{n,π_h}(α, γ).
2. The corresponding candidate matrix is

    Σ_{h=1}^H γ̂_{n,h} γ̂_{n,h}'.   (2)

3. Let V̂_n = (v̂_1, ..., v̂_k) denote the first k leading eigenvectors of (2); then φ̂(x) = V̂_n'(ω*_1(x), ..., ω*_d(x))'.
30 Outline
31 Simulation - Set Up. X_i = (X_{i1}, ..., X_{ip})' ~ N_p(0, I), i = 1, ..., n, where (n, p) = (500, 10). We consider 5 models:
- Model I: Y = sign{X_1/[0.5(X_2 + 1)^2] + 0.2ε}.
- Model II: Y = sign{(X_1 + 0.5)(X_2 + 0.5)^2 + 0.2ε}.
- Model III: Y = sign{sin(X_1)/e^{X_2} + 0.2ε}.
- Model IV: Y = sign{X_1(X_1 + X_2 + 1) + 0.2ε}.
- Model V: Y = sign{(X_1^2 + X_2^2)^{1/2} log[(X_1^2 + X_2^2)^{1/2}] + 0.2ε}.

B = (e_1, e_2), where e_i'X = X_i, i = 1, 2 (so k = 2). Performance is measured by ||P_B̂ - P_B||_F, where P_A = A(A'A)^{-1}A' and ||·||_F denotes the Frobenius norm.
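The performance measure is straightforward to code (a small sketch):

```python
import numpy as np

def proj_dist(B_hat, B):
    """||P_Bhat - P_B||_F with P_A = A (A'A)^{-1} A'.

    Both arguments are p x d matrices whose columns span the compared
    subspaces; the distance is 0 iff the column spaces coincide and is
    invariant to the choice of basis within each subspace.
    """
    def proj(A):
        return A @ np.linalg.inv(A.T @ A) @ A.T
    return float(np.linalg.norm(proj(B_hat) - proj(B), ord="fro"))
```

For two orthogonal one-dimensional subspaces the distance is √2, the maximum for d = 1.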
32 True Classification Function. Figure: surface plots of Models IV and V.
33 Results - Linear WPSVM. Table: averaged F-distance measures (with standard deviations in parentheses) over 100 independent repetitions for SAVE, pHd, Fourier, IHT, and the linear WPSVM under Models I-V.
34 Results - Structural Dimension. Table: empirical probabilities (in percentage) of correctly estimating the true k over 100 independent repetitions, comparing SAVE (permutation test; Cook and Yin, 2001) and the WPSVM for p = 10 and p = 20.
35 Results - Kernel WPSVM. Figure: nonlinear SDR results for a random data set from Model V: (a) original predictors (x_1 vs. x_2), (b) SAVE (b̂_1'X vs. b̂_2'X), (c) linear WPSVM (b̂_1'X vs. b̂_2'X).
36 Results - Kernel WPSVM. Figure: kernel WPSVM results for a random data set from Model V: (a) φ̂_1(X) vs. φ̂_2(X), (b) φ̂_1(X) vs. the true (X_1^2 + X_2^2)^{1/2}.
37 Results - Kernel WPSVM. Two-sample Hotelling's T^2 statistic:

    T_n^2 = (X̄_1 - X̄_2)' {Σ̂_n (1/n_1 + 1/n_2)}^{-1} (X̄_1 - X̄_2).

Table: averaged T_n^2 computed from the first two estimated sufficient predictors over 100 independent repetitions (standard deviations in parentheses) for Models IV and V, comparing SAVE, pHd, FCN, IHT, LWPSVM, and KWPSVM.
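A pooled-covariance version of the statistic can be sketched as follows (the slides' exact scaling was garbled in transcription, so the standard two-sample form is used here):

```python
import numpy as np

def hotelling_T2(X1, X2):
    """Two-sample Hotelling's T^2 with the pooled covariance estimate."""
    n1, n2 = len(X1), len(X2)
    diff = X1.mean(axis=0) - X2.mean(axis=0)
    # Pooled covariance of the two groups
    S = ((n1 - 1) * np.cov(X1, rowvar=False)
         + (n2 - 1) * np.cov(X2, rowvar=False)) / (n1 + n2 - 2)
    # T^2 = diff' {S (1/n1 + 1/n2)}^{-1} diff
    return float(diff @ np.linalg.solve(S * (1.0 / n1 + 1.0 / n2), diff))
```

Larger values indicate better class separation in the estimated sufficient predictors, which is how the table compares the methods.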
38 WDBC Data - SDR Results. Figure: (a) k-selection, G_n(M) vs. structure dimension; (b) linear WPSVM, scatter of b̂_1'X, b̂_2'X, b̂_3'X; (c) kernel WPSVM, φ̂_1(X) vs. φ̂_2(X).
39 WDBC Data - Classification Accuracy. 3-NN test error rate for the raw data: 7.7% (1.2). Table: averaged test error rates (in percentage, with standard deviations in parentheses) of the kNN classifier (κ = 3) over 100 random partitions of the WDBC data with respect to the first k sufficient predictors (k = 1, ..., 5) estimated by SAVE, pHd, FCN, IHT, LWPSVM, and KWPSVM.
40 Outline
41 Conclusion. Most existing SDR methods suffer when Y is binary. The proposed WPSVM preserves all the merits of the PSVM and performs very well in binary classification. Computational efficiency can be improved further by employing the π-path algorithm.
42 Thank you!!!
43 Selected References
- Li, K.-C. (1991), "Sliced Inverse Regression for Dimension Reduction" (with discussion), Journal of the American Statistical Association, 86.
- Li, B., Artemiou, A., and Li, L. (2011), "Principal Support Vector Machines for Linear and Nonlinear Sufficient Dimension Reduction," Annals of Statistics, 39.
- Wang, J., Shen, X., and Liu, Y. (2008), "Probability estimation for large-margin classifiers," Biometrika, 95.
- Shin, S. J., Wu, Y., and Zhang, H. H. (2013), "Two-dimensional solution surface of the weighted support vector machines," Journal of Computational and Graphical Statistics, to appear.
- Koo, J.-Y., Lee, Y., Kim, Y., and Park, C. (2008), "A Bahadur Representation of the Linear Support Vector Machine," Journal of Machine Learning Research, 9.
- Jiang, B., Zhang, X., and Cai, T. (2008), "Estimating the Confidence Interval for Prediction Errors of Support Vector Machine Classifiers," Journal of Machine Learning Research, 9.
- Bura, E. and Pfeiffer, C. (2008), "On the distribution of the left singular vectors of a random matrix and its applications," Statistics and Probability Letters, 78.
44 Linearity Condition. For any a ∈ R^p, E(a'X | B'X) is a linear function of B'X; equivalently,

    E(X | B'X) = P_Σ(B) X = B(B'ΣB)^{-1}B'Σ X.

- A common and essential assumption in SDR.
- Hard to check directly since B is unknown.
- Holds if X is elliptically symmetric (e.g., multivariate normal).
- Holds approximately as p grows with d fixed (Hall and Li, 1993).
- The assumption concerns only the marginal distribution of X.
45 ρ Selection
1. Randomly split the data into training and test sets.
2. Apply the WPSVM to the training set and compute its candidate matrix M̃_n^{tr}.
3. For a given ρ:
   3.a Compute k̂^{tr} = argmax_{k ∈ {1,...,p}} G_n(k; ρ, M̃_n^{tr}).
   3.b Transform the training predictors, X̃_j^{tr} = (Ṽ_n^{tr})'X_j^{tr}, where Ṽ_n^{tr} = (ṽ_1^{tr}, ..., ṽ_{k̂^{tr}}^{tr}) are the first k̂^{tr} leading eigenvectors of M̃_n^{tr}.
   3.c For each π_h, h = 1, ..., H, apply the WSVM to {(X̃_j^{tr}, Y_j^{tr}) : j = 1, ..., n_tr} to predict Y_j^{ts}.
   3.d Denoting the predicted labels Ŷ_j^{ts}, compute the total cost on the test set:

       TC(ρ) = Σ_{h=1}^H Σ_{j=1}^{n_ts} π_h(Y_j^{ts}) 1(Ŷ_j^{ts} ≠ Y_j^{ts}).

4. Repeat 3.a-3.d over a grid of ρ and select the ρ minimizing TC(ρ).
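Step 3.d can be sketched with a small helper (hypothetical, not the authors' code; the WSVM fits that produce the predictions are omitted):

```python
import numpy as np

def total_cost(y_true, preds_by_pi, pis):
    """TC(rho): pi-weighted misclassification cost summed over the pi grid.

    preds_by_pi[h] holds the test-set labels predicted by the WSVM
    fitted with weight pi_h; pi(y) = 1 - pi for y = +1 and pi for y = -1.
    """
    tc = 0.0
    for pi, y_hat in zip(pis, preds_by_pi):
        w = np.where(y_true == 1, 1.0 - pi, pi)    # pi(y) for each test case
        tc += float(np.sum(w * (y_hat != y_true)))
    return tc
```

The ρ value minimizing this cost over the grid is the one selected in step 4.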
Machine Learning Classification, Discriminative learning Structured output, structured input, discriminative function, joint input-output features, Likelihood Maximization, Logistic regression, binary
More informationIntroduction to Machine Learning
1, DATA11002 Introduction to Machine Learning Lecturer: Antti Ukkonen TAs: Saska Dönges and Janne Leppä-aho Department of Computer Science University of Helsinki (based in part on material by Patrik Hoyer,
More informationc 4, < y 2, 1 0, otherwise,
Fundamentals of Big Data Analytics Univ.-Prof. Dr. rer. nat. Rudolf Mathar Problem. Probability theory: The outcome of an experiment is described by three events A, B and C. The probabilities Pr(A) =,
More informationUniversität Potsdam Institut für Informatik Lehrstuhl Maschinelles Lernen. Bayesian Learning. Tobias Scheffer, Niels Landwehr
Universität Potsdam Institut für Informatik Lehrstuhl Maschinelles Lernen Bayesian Learning Tobias Scheffer, Niels Landwehr Remember: Normal Distribution Distribution over x. Density function with parameters
More informationNotes on the framework of Ando and Zhang (2005) 1 Beyond learning good functions: learning good spaces
Notes on the framework of Ando and Zhang (2005 Karl Stratos 1 Beyond learning good functions: learning good spaces 1.1 A single binary classification problem Let X denote the problem domain. Suppose we
More informationConvex Optimization and Modeling
Convex Optimization and Modeling Introduction and a quick repetition of analysis/linear algebra First lecture, 12.04.2010 Jun.-Prof. Matthias Hein Organization of the lecture Advanced course, 2+2 hours,
More informationKernels A Machine Learning Overview
Kernels A Machine Learning Overview S.V.N. Vishy Vishwanathan vishy@axiom.anu.edu.au National ICT of Australia and Australian National University Thanks to Alex Smola, Stéphane Canu, Mike Jordan and Peter
More informationChapter 3: Maximum Likelihood Theory
Chapter 3: Maximum Likelihood Theory Florian Pelgrin HEC September-December, 2010 Florian Pelgrin (HEC) Maximum Likelihood Theory September-December, 2010 1 / 40 1 Introduction Example 2 Maximum likelihood
More informationDimension reduction techniques for classification Milan, May 2003
Dimension reduction techniques for classification Milan, May 2003 Anestis Antoniadis Laboratoire IMAG-LMC University Joseph Fourier Grenoble, France Dimension reduction techniquesfor classification p.1/53
More informationThis paper is not to be removed from the Examination Halls
~~ST104B ZA d0 This paper is not to be removed from the Examination Halls UNIVERSITY OF LONDON ST104B ZB BSc degrees and Diplomas for Graduates in Economics, Management, Finance and the Social Sciences,
More informationCIS 520: Machine Learning Oct 09, Kernel Methods
CIS 520: Machine Learning Oct 09, 207 Kernel Methods Lecturer: Shivani Agarwal Disclaimer: These notes are designed to be a supplement to the lecture They may or may not cover all the material discussed
More information10-725/36-725: Convex Optimization Prerequisite Topics
10-725/36-725: Convex Optimization Prerequisite Topics February 3, 2015 This is meant to be a brief, informal refresher of some topics that will form building blocks in this course. The content of the
More informationA Bias Correction for the Minimum Error Rate in Cross-validation
A Bias Correction for the Minimum Error Rate in Cross-validation Ryan J. Tibshirani Robert Tibshirani Abstract Tuning parameters in supervised learning problems are often estimated by cross-validation.
More informationCSE 417T: Introduction to Machine Learning. Final Review. Henry Chai 12/4/18
CSE 417T: Introduction to Machine Learning Final Review Henry Chai 12/4/18 Overfitting Overfitting is fitting the training data more than is warranted Fitting noise rather than signal 2 Estimating! "#$
More informationMachine Learning Practice Page 2 of 2 10/28/13
Machine Learning 10-701 Practice Page 2 of 2 10/28/13 1. True or False Please give an explanation for your answer, this is worth 1 pt/question. (a) (2 points) No classifier can do better than a naive Bayes
More informationRandom Forests. These notes rely heavily on Biau and Scornet (2016) as well as the other references at the end of the notes.
Random Forests One of the best known classifiers is the random forest. It is very simple and effective but there is still a large gap between theory and practice. Basically, a random forest is an average
More informationClassification. The goal: map from input X to a label Y. Y has a discrete set of possible values. We focused on binary Y (values 0 or 1).
Regression and PCA Classification The goal: map from input X to a label Y. Y has a discrete set of possible values We focused on binary Y (values 0 or 1). But we also discussed larger number of classes
More informationLINEAR CLASSIFIERS. J. Elder CSE 4404/5327 Introduction to Machine Learning and Pattern Recognition
LINEAR CLASSIFIERS Classification: Problem Statement 2 In regression, we are modeling the relationship between a continuous input variable x and a continuous target variable t. In classification, the input
More informationKernel-based Feature Extraction under Maximum Margin Criterion
Kernel-based Feature Extraction under Maximum Margin Criterion Jiangping Wang, Jieyan Fan, Huanghuang Li, and Dapeng Wu 1 Department of Electrical and Computer Engineering, University of Florida, Gainesville,
More informationCross-Validation with Confidence
Cross-Validation with Confidence Jing Lei Department of Statistics, Carnegie Mellon University UMN Statistics Seminar, Mar 30, 2017 Overview Parameter est. Model selection Point est. MLE, M-est.,... Cross-validation
More informationSufficient Dimension Reduction for Longitudinally Measured Predictors
Sufficient Dimension Reduction for Longitudinally Measured Predictors Ruth Pfeiffer National Cancer Institute, NIH, HHS joint work with Efstathia Bura and Wei Wang TU Wien and GWU University JSM Vancouver
More informationStatistical Machine Learning
Statistical Machine Learning Christoph Lampert Spring Semester 2015/2016 // Lecture 12 1 / 36 Unsupervised Learning Dimensionality Reduction 2 / 36 Dimensionality Reduction Given: data X = {x 1,..., x
More informationML (cont.): SUPPORT VECTOR MACHINES
ML (cont.): SUPPORT VECTOR MACHINES CS540 Bryan R Gibson University of Wisconsin-Madison Slides adapted from those used by Prof. Jerry Zhu, CS540-1 1 / 40 Support Vector Machines (SVMs) The No-Math Version
More informationSupport Vector Machine
Support Vector Machine Fabrice Rossi SAMM Université Paris 1 Panthéon Sorbonne 2018 Outline Linear Support Vector Machine Kernelized SVM Kernels 2 From ERM to RLM Empirical Risk Minimization in the binary
More information