Weighted Principal Support Vector Machines for Sufficient Dimension Reduction in Binary Classification


A joint work with Seung Jun Shin, Hao Helen Zhang, and Yufeng Liu

Outline: 1. Introduction; 2. Weighted Principal SVM for Linear SDR; 3. Kernel WPSVM for Nonlinear SDR; 4. Numerical Studies; 5. Concluding Remarks

Sufficient Dimension Reduction. For a given pair $(Y, X) \in \mathbb{R} \times \mathbb{R}^p$, sufficient dimension reduction (SDR) seeks a matrix $B = (b_1, \ldots, b_d) \in \mathbb{R}^{p \times d}$ which satisfies $Y \perp X \mid B^\top X$. (1)

Central Subspace. A dimension reduction subspace (DRS) is defined as $\mathrm{span}(B) \subseteq \mathbb{R}^p$. The central subspace $S_{Y|X}$ is the intersection of all DRSs. $S_{Y|X}$ has the minimum dimension among all DRSs and exists uniquely under very mild conditions (Cook, 1998, Prop. 6.4). We assume $S_{Y|X} = \mathrm{span}(B)$. The dimension of $S_{Y|X}$, denoted $d$, is called the structure dimension.

Estimation of $S_{Y|X}$. The seminal paper appeared in the early 1990s: K.-C. Li (1991), Sliced Inverse Regression for Dimension Reduction (with discussion), JASA, 86, 316-327. Many other methods followed: Sliced Average Variance Estimation (SAVE, 1991), Principal Hessian Directions (pHd, 1992), Contour Regression (2005), Fourier-Transformation-Based Estimation (2005), Directional Regression (2007), Cumulative Sliced Regression (CUME, 2008), and many others.

Sliced Inverse Regression. Foundation of SIR: under the linearity condition, $E(Z \mid Y) \in S_{Y|Z} = \Sigma^{1/2} S_{Y|X}$, where $Z = \Sigma^{-1/2}\{X - E(X)\}$. Slice the response into $H$ non-overlapping intervals $I_1, \ldots, I_H$ and compute $\hat m_h := E_n(Z \mid Y \in I_h) = \frac{1}{n_h} \sum_{Y_i \in I_h} z_i$, $h = 1, \ldots, H$. $B$ is estimated by premultiplying the first $d$ leading eigenvectors of $\sum_{h=1}^H \hat m_h \hat m_h^\top$ by $\hat\Sigma^{-1/2}$.
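To make the slicing steps concrete, here is a minimal numerical sketch of SIR, assuming a predictor matrix X and a continuous response y; the function name, the number of slices, and the quantile-based slicing are illustrative choices, not prescribed by the slides.

```python
# Minimal SIR sketch: standardize X, average Z within response slices,
# eigen-decompose the candidate matrix, and map back by Sigma^{-1/2}.
import numpy as np
from scipy.linalg import fractional_matrix_power

def sir_directions(X, y, d, H=10):
    n, p = X.shape
    Xc = X - X.mean(axis=0)
    Sigma_inv_sqrt = fractional_matrix_power(np.cov(Xc, rowvar=False), -0.5).real
    Z = Xc @ Sigma_inv_sqrt                      # standardized predictors
    edges = np.quantile(y, np.linspace(0, 1, H + 1))
    M = np.zeros((p, p))
    for h in range(H):
        upper = y <= edges[h + 1] if h == H - 1 else y < edges[h + 1]
        mask = (y >= edges[h]) & upper           # slice I_h
        if mask.any():
            m_h = Z[mask].mean(axis=0)           # slice mean of Z
            M += (mask.sum() / n) * np.outer(m_h, m_h)
    evals, evecs = np.linalg.eigh(M)             # eigenvalues in ascending order
    return Sigma_inv_sqrt @ evecs[:, ::-1][:, :d]
```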

SIR with a Binary Response. If $Y \in \{-1, 1\}$ is binary, there is only one possible way to slice: $I_1 = \{i : y_i = 1\}$ and $I_2 = \{i : y_i = -1\}$. The associated slice means $\bar z_1$ and $\bar z_2$ are linearly dependent since $\bar z_n = 0$, so SIR can estimate at most ONE direction.

Illustration on the Wisconsin Diagnostic Breast Cancer data. Figure: (a) SIR, $Y$ vs. $\hat b_1^\top X$; (b) SAVE, $\hat b_1^\top X$ vs. $\hat b_2^\top X$.

Outline: 1. Introduction; 2. Weighted Principal SVM for Linear SDR; 3. Kernel WPSVM for Nonlinear SDR; 4. Numerical Studies; 5. Concluding Remarks

Principal Support Vector Machine. For $(Y, X) \in \mathbb{R} \times \mathbb{R}^p$, the PSVM (Li et al., 2011, AOS) solves the following SVM-like problem:
$(a_{0,c}, b_{0,c}) = \arg\min_{a,b} \; \underbrace{b^\top \Sigma b}_{\mathrm{Var}(b^\top X)} + \lambda\, E\big[\,1 - \tilde Y_c \underbrace{\{a + b^\top (X - EX)\}}_{f(X)}\,\big]_+,$
where $\tilde Y_c = 1\{Y \ge c\} - 1\{Y < c\}$ for a given constant $c$, $\Sigma = \mathrm{cov}(X)$, and $[u]_+ = \max(0, u)$.
Foundation of the PSVM: under the linearity condition, $b_{0,c} \in S_{Y|X}$ for any given $c$.

PSVM: Sample Estimation. Given data $(X_i, Y_i)$, $i = 1, \ldots, n$:
1. For a given grid $\min Y_i < c_1 < \cdots < c_H < \max Y_i$, solve a sequence of PSVMs for different values of $c_h$:
$(\hat a_{n,h}, \hat b_{n,h}) = \arg\min_{a,b} \; b^\top \hat\Sigma_n b + \frac{\lambda}{n} \sum_{i=1}^n \big[\,1 - \tilde Y_{i,c_h}\{a + b^\top (X_i - \bar X_n)\}\,\big]_+.$
2. The first $k$ leading eigenvectors of $\hat M_n^{L} = \sum_{h=1}^H \hat b_{n,h} \hat b_{n,h}^\top$ estimate a basis of $S_{Y|X}$.
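A minimal sketch of this sample estimator for a continuous response, assuming scikit-learn's linear-kernel SVC (which does not penalize the intercept) as the SVM solver; the quantile grid of cut-offs and the tuning constant lam are illustrative choices.

```python
# PSVM sketch: dichotomize Y at a grid of cut-offs, fit a linear SVM on the
# standardized predictors for each cut-off, and aggregate the directions.
import numpy as np
from scipy.linalg import fractional_matrix_power
from sklearn.svm import SVC

def psvm_directions(X, y, k, H=10, lam=1.0):
    n, p = X.shape
    Xc = X - X.mean(axis=0)
    Sigma_inv_sqrt = fractional_matrix_power(np.cov(Xc, rowvar=False), -0.5).real
    U = Xc @ Sigma_inv_sqrt                           # standardized predictors
    M = np.zeros((p, p))
    for c in np.quantile(y, np.linspace(0.1, 0.9, H)):
        y_tilde = np.where(y >= c, 1, -1)             # pseudo-labels tilde{Y}_c
        svm = SVC(kernel="linear", C=lam / (2 * n))   # hinge loss, slope penalized only
        svm.fit(U, y_tilde)
        b = Sigma_inv_sqrt @ svm.coef_.ravel()        # back to the original X scale
        M += np.outer(b, b)
    evals, evecs = np.linalg.eigh(M)
    return evecs[:, ::-1][:, :k]                      # leading eigenvectors of M
```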

PSVM: Remarks
Pros: Outperforms SIR. Can be extended to the kernel PSVM to handle nonlinear SDR.
Cons: Estimates only one direction if Y is binary.

Weighted Principal Support Vector Machines. Toward SDR with a binary $Y$, the WPSVM minimizes
$\Lambda_\pi(\theta) = \beta^\top \Sigma \beta + \lambda\, E\big\{\pi(Y)\,\big[\,1 - Y\{\alpha + \beta^\top (X - EX)\}\,\big]_+\big\},$
where $\theta = (\alpha, \beta^\top)^\top \in \mathbb{R} \times \mathbb{R}^p$; $\pi(Y) = 1 - \pi$ if $Y = 1$ and $\pi$ otherwise, for a given $\pi \in (0, 1)$; and $Y$ itself is binary (no need for $\tilde Y_c$). Let $\theta_{0,\pi} = (\alpha_{0,\pi}, \beta_{0,\pi}^\top)^\top = \arg\min_\theta \Lambda_\pi(\theta)$.
Foundation of the Weighted PSVM: under the linearity condition, $\beta_{0,\pi} \in S_{Y|X}$ for any given $\pi \in (0, 1)$.

Sample Estimation. Given $(X_i, Y_i) \in \mathbb{R}^p \times \{1, -1\}$, $i = 1, \ldots, n$:
1. For a given grid of $\pi$, $0 < \pi_1 < \cdots < \pi_H < 1$, solve a sequence of WPSVMs,
$\hat\Lambda_{n,\pi_h}(\theta) = \beta^\top \hat\Sigma_n \beta + \frac{\lambda}{n} \sum_{i=1}^n \pi_h(Y_i)\,\big[\,1 - Y_i\{\alpha + \beta^\top (X_i - \bar X_n)\}\,\big]_+,$
and let $\hat\theta_{n,h} = (\hat\alpha_{n,h}, \hat\beta_{n,h}^\top)^\top = \arg\min_\theta \hat\Lambda_{n,\pi_h}(\theta)$.
2. The first $k$ leading eigenvectors of the WPSVM candidate matrix $\hat M_n^{WL} = \sum_{h=1}^H \hat\beta_{n,h} \hat\beta_{n,h}^\top$ estimate a basis of $S_{Y|X}$.

Computation. Let $\eta = \hat\Sigma_n^{1/2} \beta$ and $U_i = \hat\Sigma_n^{-1/2}(X_i - \bar X_n)$. The WPSVM objective function $\hat\Lambda_{n,\pi_h}(\theta)$ becomes
$\eta^\top \eta + \frac{\lambda}{n} \sum_{i=1}^n \pi_h(Y_i)\,\big[\,1 - Y_i(\alpha + \eta^\top U_i)\,\big]_+.$
This is equivalent to solving the linear WSVM with respect to $(U_i, Y_i)$. Solve the WSVM $H$ times, once for each weight $\pi_h$, $h = 1, \ldots, H$.
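A minimal computational sketch of this reduction, assuming binary labels in {-1, +1}; scikit-learn's linear-kernel SVC (which supports per-observation weights via sample_weight and leaves the intercept unpenalized) stands in for the weighted-SVM solver, and the weight grid and tuning constant lam are illustrative choices rather than the authors' settings.

```python
# Linear WPSVM sketch: solve a weighted linear SVM on the standardized
# predictors for each weight pi_h, map the slopes back to the X scale,
# and eigen-decompose the accumulated candidate matrix.
import numpy as np
from scipy.linalg import fractional_matrix_power
from sklearn.svm import SVC

def wpsvm_directions(X, y, k, H=10, lam=1.0):
    n, p = X.shape
    Xc = X - X.mean(axis=0)
    Sigma_inv_sqrt = fractional_matrix_power(np.cov(Xc, rowvar=False), -0.5).real
    U = Xc @ Sigma_inv_sqrt                        # U_i = Sigma^{-1/2}(X_i - Xbar)
    M = np.zeros((p, p))                           # WPSVM candidate matrix
    for pi in np.linspace(0.1, 0.9, H):            # grid pi_1 < ... < pi_H
        w = np.where(y == 1, 1 - pi, pi)           # pi(Y): 1 - pi for Y = +1, pi otherwise
        svm = SVC(kernel="linear", C=lam / (2 * n))
        svm.fit(U, y, sample_weight=w)             # weighted hinge loss
        beta = Sigma_inv_sqrt @ svm.coef_.ravel()  # beta = Sigma^{-1/2} eta
        M += np.outer(beta, beta)
    evals, evecs = np.linalg.eigh(M)
    return evecs[:, ::-1][:, :k]                   # basis estimate of S_{Y|X}
```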

The π-path. Wang et al. (2008, Biometrika) show that the WSVM solutions move piecewise-linearly as a function of π. Shin et al. (2013, JCGS) developed a two-dimensional solution surface for weighted SVMs and implemented the π-path algorithm in R. Figure: solution paths of the linear WPSVM, $\hat\beta_n$ versus π.

Asymptotic Results (1). We take a standard approach based on the M-estimation scheme, similar to results for the linear SVM (Koo et al., 2008, JMLR; Jiang et al., 2008, JMLR).
Consistency of $\hat\theta_n$: suppose $\Sigma$ is positive definite; then $\hat\theta_n \to \theta_0$ in probability.

Asymptotic Results (2). Asymptotic normality of $\hat\theta_n$ (a Bahadur representation): under regularity conditions that ensure the existence of both the gradient vector $D_\theta$ and the Hessian matrix $H_\theta$ of $\Lambda_\pi(\theta)$,
$\sqrt{n}\,(\hat\theta_n - \theta_0) = -n^{-1/2}\, H_{\theta_0}^{-1} \sum_{i=1}^n D_{\theta_0}(Z_i) + o_p(1),$
where
$D_\theta(Z) = \big(0,\, 2(\Sigma\beta)^\top\big)^\top - \lambda\,\pi(Y)\, \tilde X Y\, 1\{\theta^\top \tilde X Y < 1\}$
and
$H_\theta = 2\,\mathrm{diag}(0, \Sigma) + \lambda \sum_{y=-1,1} P(Y = y)\,\pi(y)\, f_{\beta^\top X \mid Y}(y - \alpha \mid y)\, E(\tilde X \tilde X^\top \mid \theta^\top \tilde X = y),$
with $\tilde X = (1, X^\top)^\top$.

Asymptotic Results (3). For a given grid $\pi_1 < \cdots < \pi_H$, define the population WPSVM kernel matrix $M_0^{WL} = \sum_{h=1}^H \beta_{0,h} \beta_{0,h}^\top$.
Asymptotic normality of $\hat M_n^{WL}$: suppose $\mathrm{rank}(M_0^{WL}) = k$. Under the regularity conditions,
$\sqrt{n}\,\big\{\mathrm{vec}(\hat M_n^{WL}) - \mathrm{vec}(M_0^{WL})\big\} \to N(0, \Sigma_M),$
where $\Sigma_M$ is explicitly provided. Asymptotic normality of the eigenvectors of $\hat M_n^{WL}$ then follows from the asymptotic normality of $\hat M_n^{WL}$ (Bura and Pfeiffer, 2008).

Structure Dimension k Selection. We estimate k as
$\hat k = \arg\max_{k \in \{1, \ldots, p\}} \sum_{j=1}^{k} \upsilon_j - \rho\, \upsilon_1\, \frac{k \log n}{\sqrt{n}},$
where $\upsilon_1 \ge \cdots \ge \upsilon_p$ are the eigenvalues of $\hat M_n$. Then $\lim_{n \to \infty} P(\hat k = k) = 1$.
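As a small illustration, the criterion can be evaluated directly from the eigenvalues of the candidate matrix; this sketch assumes the penalty scaling reconstructed above and a user-supplied constant rho.

```python
# Structure-dimension selection: maximize the cumulative eigenvalue sum minus
# the penalty rho * v_1 * k * log(n) / sqrt(n) over k = 1, ..., p.
import numpy as np

def select_k(M_hat, n, rho):
    evals = np.sort(np.linalg.eigvalsh(M_hat))[::-1]   # v_1 >= ... >= v_p
    ks = np.arange(1, len(evals) + 1)
    crit = np.cumsum(evals) - rho * evals[0] * ks * np.log(n) / np.sqrt(n)
    return int(ks[np.argmax(crit)])
```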

Outline: 1. Introduction; 2. Weighted Principal SVM for Linear SDR; 3. Kernel WPSVM for Nonlinear SDR; 4. Numerical Studies; 5. Concluding Remarks

Nonlinear SDR. Nonlinear SDR assumes $Y \perp X \mid \phi(X)$, where $\phi : \mathbb{R}^p \to \mathbb{R}^k$ is an arbitrary function of $X$ living in $\mathcal{H}$, a Hilbert space of functions of $X$. SDR is achieved by estimating $\phi$.

Kernel WPSVM: Objective Function. The kernel WPSVM objective function is
$\Lambda_\pi(\alpha, \psi) = \mathrm{var}\{\psi(X)\} + \lambda\, E\big\{\pi(Y)\,\big[\,1 - Y\{\alpha + \psi(X) - E\psi(X)\}\,\big]_+\big\},$
and the kernel WPSVM solves $(\alpha_{0,\pi}, \psi_{0,\pi}) = \arg\min_{\alpha \in \mathbb{R},\, \psi \in \mathcal{H}} \Lambda_\pi(\alpha, \psi)$.

Kernel WPSVM: Foundation. For a given π, $\psi_{0,\pi}$ has a version that is $\sigma\{\phi(X)\}$-measurable; roughly speaking, $\psi_{0,\pi}$ is a function of $\phi$. This is the nonlinear generalization of the linear SDR result: $\beta_{0,\pi} \in S_{Y|X} = \mathrm{span}(B)$ means $\beta_{0,\pi}^\top X$ is a linear function of $B^\top X$.

Kernel WPSVM: Sample Estimation. Use a reproducing kernel Hilbert space. Defining a linear operator $\Sigma$ by $\langle \psi_1, \Sigma \psi_2 \rangle_{\mathcal{H}} = \mathrm{cov}\{\psi_1(X), \psi_2(X)\}$, we can write
$\Lambda_\pi(\alpha, \psi) = \langle \psi, \Sigma \psi \rangle_{\mathcal{H}} + \lambda\, E\big\{\pi(Y)\,\big[\,1 - Y\{\alpha + \psi(X) - E\psi(X)\}\,\big]_+\big\}.$
Li et al. (2011) proposed to use the first $d$ leading eigenfunctions of the sample operator $\Sigma_n : \mathcal{H} \to \mathcal{H}$, defined by $\langle \psi_1, \Sigma_n \psi_2 \rangle_{\mathcal{H}} = \mathrm{cov}_n\{\psi_1(X), \psi_2(X)\}$, as a basis set. By Proposition 2 in Li et al. (2011), $\omega_j(X)$, $j = 1, \ldots, d$, can be readily obtained from the eigendecomposition of $(I_n - J_n) K_n (I_n - J_n)$. We choose $d \approx n/4$.

Kernel WPSVM: Sample Estimation (continued). The sample version of $\Lambda_\pi(\alpha, \psi)$ is
$\hat\Lambda_{n,\pi}(\alpha, \gamma) = \gamma^\top \Omega^\top \Omega\, \gamma + \frac{\lambda}{n} \sum_{i=1}^n \pi(Y_i)\,\big[\,1 - Y_i\{\alpha + \gamma^\top \Omega_i\}\,\big]_+,$
where $\omega_1, \ldots, \omega_d$ are the first $d$ leading eigenfunctions of the operator $\Sigma_n$,
$\Omega = \begin{pmatrix} \bar\omega_1(X_1) & \cdots & \bar\omega_d(X_1) \\ \vdots & & \vdots \\ \bar\omega_1(X_n) & \cdots & \bar\omega_d(X_n) \end{pmatrix},$
$\Omega_i$ denotes the $i$th row of $\Omega$ (as a column vector), and $\bar\omega_j(X) = \omega_j(X) - n^{-1} \sum_{i=1}^n \omega_j(X_i)$.

Kernel WPSVM: Dual Problem. The dual formulation is
$\hat\nu = \arg\max_{\nu_1, \ldots, \nu_n} \sum_{i=1}^n \nu_i - \frac{1}{4} \sum_{i=1}^n \sum_{j=1}^n \nu_i \nu_j Y_i Y_j P_\Omega^{(i,j)}$
subject to (i) $0 \le \nu_i \le \lambda\, \pi(Y_i)$, $i = 1, \ldots, n$, and (ii) $\sum_{i=1}^n \nu_i Y_i = 0$, where $P_\Omega^{(i,j)}$ is the $(i,j)$th element of $P_\Omega = \Omega(\Omega^\top \Omega)^{-1} \Omega^\top$. The kernel WPSVM solution is then given by
$\hat\gamma_n = \frac{\lambda}{2} \sum_{i=1}^n \hat\nu_i Y_i\, (\Omega^\top \Omega)^{-1} \Omega_i.$

Kernel WPSVM: Algorithm.
1. For a given grid $\pi_1 < \cdots < \pi_H$, compute a sequence of kernel WPSVM solutions $(\hat\alpha_{n,h}, \hat\gamma_{n,h}) = \arg\min_{\alpha, \gamma} \hat\Lambda_{n,\pi_h}(\alpha, \gamma)$.
2. The corresponding kernel matrix is $\sum_{h=1}^H \hat\gamma_{n,h} \hat\gamma_{n,h}^\top$. (2)
3. Let $\hat V_n = (\hat v_1, \ldots, \hat v_k)$ denote the first $k$ leading eigenvectors of (2); then $\hat\phi(x) = \hat V_n^\top \big(\bar\omega_1(x), \ldots, \bar\omega_d(x)\big)^\top$.
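A rough end-to-end sketch of these steps, assuming a Gaussian kernel for the RKHS basis; the eigenvector scores of the doubly centered kernel matrix stand in for the basis functions $\bar\omega_j$, scikit-learn's weighted linear SVC stands in for the kernel WPSVM solver, and the bandwidth, weight grid, and tuning constant are illustrative.

```python
# Kernel WPSVM sketch: build d ~ n/4 basis scores from the doubly centered
# Gaussian kernel matrix, solve a weighted SVM on those scores for each pi_h,
# and project onto the k leading eigenvectors of the accumulated matrix.
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel
from sklearn.svm import SVC

def kernel_wpsvm_predictors(X, y, k, H=10, lam=1.0, bandwidth=None):
    n = X.shape[0]
    K = rbf_kernel(X, X, gamma=bandwidth)            # Gram matrix K_n
    C = np.eye(n) - np.ones((n, n)) / n              # centering matrix I_n - J_n
    evals, evecs = np.linalg.eigh(C @ K @ C)         # (I - J) K (I - J)
    d = max(1, n // 4)                               # d ~ n/4 leading eigenfunctions
    Omega = evecs[:, ::-1][:, :d]                    # basis scores evaluated at X_i
    Omega -= Omega.mean(axis=0)                      # center, as in bar{omega}_j
    M = np.zeros((d, d))
    for pi in np.linspace(0.1, 0.9, H):
        w = np.where(y == 1, 1 - pi, pi)             # pi(Y) weights
        svm = SVC(kernel="linear", C=lam / (2 * n))
        svm.fit(Omega, y, sample_weight=w)
        gam = svm.coef_.ravel()                      # hat{gamma}_{n,h}
        M += np.outer(gam, gam)
    vals, vecs = np.linalg.eigh(M)
    V = vecs[:, ::-1][:, :k]                         # leading eigenvectors of (2)
    return Omega @ V                                 # hat{phi}(X_i), i = 1, ..., n
```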

Outline: 1. Introduction; 2. Weighted Principal SVM for Linear SDR; 3. Kernel WPSVM for Nonlinear SDR; 4. Numerical Studies; 5. Concluding Remarks

Simulation - Set Up. $X_i = (X_{i1}, \ldots, X_{ip})^\top \sim N_p(0, I)$, $i = 1, \ldots, n$, where $(n, p) = (500, 10)$. We consider five models:
Model I: $Y = \mathrm{sign}\{X_1 / [0.5 + (X_2 + 1)^2] + 0.2\epsilon\}$.
Model II: $Y = \mathrm{sign}\{(X_1 + 0.5)(X_2 - 0.5)^2 + 0.2\epsilon\}$.
Model III: $Y = \mathrm{sign}\{\sin(X_1)/e^{X_2} + 0.2\epsilon\}$.
Model IV: $Y = \mathrm{sign}\{X_1(X_1 + X_2 + 1) + 0.2\epsilon\}$.
Model V: $Y = \mathrm{sign}\{(X_1^2 + X_2^2)^{1/2} \log(X_1^2 + X_2^2)^{1/2} + 0.2\epsilon\}$.
$B = (e_1, e_2)$ such that $e_i^\top X = X_i$, $i = 1, 2$ ($k = 2$). Performance is measured by $\|P_{\hat B} - P_B\|_F$, where $P_A = A(A^\top A)^{-1} A^\top$ and $\|\cdot\|_F$ denotes the Frobenius norm.
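For completeness, the performance measure above is straightforward to compute; a small sketch (the helper name is illustrative):

```python
# Frobenius distance between projection matrices onto the estimated and true
# central subspaces: || P_{B_hat} - P_B ||_F with P_A = A (A'A)^{-1} A'.
import numpy as np

def projection_distance(B_hat, B):
    proj = lambda A: A @ np.linalg.inv(A.T @ A) @ A.T
    return np.linalg.norm(proj(B_hat) - proj(B), ord="fro")
```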

True Classification Function. Figure: Surface plots of Models IV and V ((c) Model IV, (d) Model V).

Results - Linear WPSVM

Table: Averaged F-distance measures over 100 independent repetitions, with associated standard deviations in parentheses.

Model   SAVE           pHd            Fourier        IHT            LWPSVM
I       1.285 (.161)   1.542 (.193)   1.289 (.156)   1.316 (.254)   0.695 (.171)
II      1.265 (.187)   1.383 (.186)   1.205 (.214)   1.140 (.199)   0.896 (.198)
III     1.255 (.186)   1.491 (.198)   1.282 (.163)   1.295 (.232)   0.688 (.180)
IV      0.771 (.272)   0.680 (.194)   0.469 (.103)   0.474 (.105)   0.482 (.101)
V       0.273 (.052)   0.283 (.053)   0.492 (.241)   1.424 (.011)   1.530 (.171)

Results - Structure Dimensionality

                        p = 10            p = 20
Model   k   n       SAVE   WPSVM      SAVE   WPSVM
f1      1   500       91      84        82      86
            1000      92      95        93      92
f1      2   500        7      66         6      40
            1000      15      98        16      74
f2      1   500       80      95        41      71
            1000      93      93        88      85
f2      2   500       17      42        15      19
            1000      13      72        17      54
f3      1   500       87      89        86      91
            1000      91      98        90      81
f3      2   500       15      56         7      44
            1000      16      86        11      74

Table: Empirical probabilities (in percentage) of correctly estimating the true k, based on 100 independent repetitions. SAVE uses the permutation test (Cook and Yin, 2001).

Results - Kernel WPSVM. Figure: Nonlinear SDR results for a random data set from Model V: (a) original predictors ($X_1$ vs. $X_2$), (b) SAVE ($\hat b_1^\top X$ vs. $\hat b_2^\top X$), (c) linear WPSVM ($\hat b_1^\top X$ vs. $\hat b_2^\top X$).

Results - Kernel WPSVM. Figure: Kernel WPSVM results for a random data set from Model V: (a) $\hat\phi_1(X)$ vs. $\hat\phi_2(X)$; (b) $\hat\phi_1(X)$ vs. the true sufficient predictor $(X_1^2 + X_2^2)^{1/2}$.

Results - Kernel WPSVM. Two-sample Hotelling's $T^2$ test statistic:
$T_n^2 = (\bar X_{+} - \bar X_{-})^\top \Big\{\hat\Sigma\,\Big(\tfrac{1}{n_{+}} + \tfrac{1}{n_{-}}\Big)\Big\}^{-1} (\bar X_{+} - \bar X_{-}),$
where $\bar X_{+}$, $\bar X_{-}$ are the class-wise sample means and $n_{+}$, $n_{-}$ the class sizes.

Table: Averaged $T_n^2$ computed from the first two estimated sufficient predictors over 100 independent repetitions (standard deviations in parentheses).

Model   SAVE          pHd           FCN            IHT           LWPSVM         KWPSVM
IV      76.0 (30.7)   74.0 (20.6)   104.0 (25.7)   97.2 (24.5)   103.8 (25.7)   581.7 (71.9)
V        1.2 (1.2)     1.1 (1.1)      4.0 (4.2)     8.7 (4.5)      8.8 (4.6)    626.0 (78.1)
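As a reference point, the statistic can be computed from the two estimated sufficient predictors as follows; this sketch assumes a pooled covariance estimate and labels in {-1, +1}.

```python
# Two-sample Hotelling's T^2 on the estimated sufficient predictors S (n x 2),
# using a pooled covariance estimate; y is assumed to take values in {-1, +1}.
import numpy as np

def hotelling_t2(S, y):
    S1, S2 = S[y == 1], S[y == -1]
    n1, n2 = len(S1), len(S2)
    diff = S1.mean(axis=0) - S2.mean(axis=0)
    pooled = ((n1 - 1) * np.cov(S1, rowvar=False) +
              (n2 - 1) * np.cov(S2, rowvar=False)) / (n1 + n2 - 2)
    return float(diff @ np.linalg.solve(pooled * (1 / n1 + 1 / n2), diff))
```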

WDBC data - SDR results. Figure: (a) structure dimension selection, $G_n(M)$ versus the structure dimension; (b) linear WPSVM sufficient predictors ($\hat b_1^\top X$, $\hat b_2^\top X$, $\hat b_3^\top X$); (c) kernel WPSVM sufficient predictors ($\hat\phi_1(X)$ vs. $\hat\phi_2(X)$).

WDBC data - Classification Accuracy. The 3-NN test error rate for the raw data is 7.7% (1.2).

k   SAVE          pHd           FCN           IHT           LWPSVM       KWPSVM
1   19.3 (2.2)    39.7 (4.5)    12.7 (2.3)    23.0 (3.1)    5.3 (1.1)    8.7 (2.0)
2   13.5 (1.8)    34.5 (4.8)     9.2 (2.0)    13.0 (2.8)    5.2 (1.1)    8.5 (2.0)
3   12.0 (1.7)    30.3 (5.0)     8.0 (1.8)     6.9 (1.5)    5.4 (1.2)    8.0 (2.0)
4   11.9 (1.6)    27.0 (4.6)     7.5 (1.9)     5.7 (1.4)    5.4 (1.3)    7.8 (1.9)
5   12.1 (1.8)    25.2 (4.2)     7.2 (1.8)     5.8 (1.3)    5.5 (1.5)    8.0 (1.9)

Table: Averaged test error rates (in percentage) of the kNN classifier (κ = 3) over 100 random partitions of the WDBC data, using the first k sufficient predictors (k = 1, ..., 5) estimated by different SDR methods. Standard deviations are given in parentheses.

Outline: 1. Introduction; 2. Weighted Principal SVM for Linear SDR; 3. Kernel WPSVM for Nonlinear SDR; 4. Numerical Studies; 5. Concluding Remarks

Concluding Remarks. Most existing SDR methods suffer when Y is binary. The proposed WPSVM preserves all the merits of the PSVM and performs very well in binary classification. Its computational efficiency can be improved by employing the π-path algorithm.

Thank you!!!

Selected References
Li, K.-C. (1991), Sliced inverse regression for dimension reduction (with discussion), Journal of the American Statistical Association, 86, 316-342.
Li, B., Artemiou, A., and Li, L. (2011), Principal support vector machines for linear and nonlinear sufficient dimension reduction, Annals of Statistics, 39, 3182-3210.
Wang, J., Shen, X., and Liu, Y. (2008), Probability estimation for large-margin classifiers, Biometrika, 95, 149-167.
Shin, S. J., Wu, Y., and Zhang, H. H. (2013), Two-dimensional solution surface of the weighted support vector machines, Journal of Computational and Graphical Statistics, to appear.
Koo, J.-Y., Lee, Y., Kim, Y., and Park, C. (2008), A Bahadur representation of the linear support vector machine, Journal of Machine Learning Research, 9, 1343-1368.
Jiang, B., Zhang, X., and Cai, T. (2008), Estimating the confidence interval for prediction errors of support vector machine classifiers, Journal of Machine Learning Research, 9, 521-540.
Bura, E. and Pfeiffer, C. (2008), On the distribution of the left singular vectors of a random matrix and its applications, Statistics and Probability Letters, 78, 2275-2280.

Linearity Condition. For any $a \in \mathbb{R}^p$, $E(a^\top X \mid B^\top X)$ is a linear function of $B^\top X$; equivalently, $E(X \mid B^\top X) = P_{\Sigma}(B)\, X = B(B^\top \Sigma B)^{-1} B^\top \Sigma X$.
This is a common and essential assumption in SDR. It is hard to check since B is unknown. It holds if X is elliptically symmetric (e.g., X is multivariate normal), and approximately holds as p gets large for fixed d (Hall and Li, 1993). The assumption concerns only the marginal distribution of X.

ρ Selection
1. Randomly split the data into training and test sets.
2. Apply the WPSVM to the training set and compute its candidate matrix $\tilde M_n^{tr}$.
3. For a given ρ:
   3.a Compute $\hat k^{tr} = \arg\max_{k \in \{1, \ldots, p\}} G_n(k; \rho, \tilde M_n^{tr})$.
   3.b Transform the training predictors, $\tilde X_j^{tr} = (\tilde V_n^{tr})^\top X_j^{tr}$, where $\tilde V_n^{tr} = (\tilde v_1^{tr}, \ldots, \tilde v_{\hat k^{tr}}^{tr})$ are the first $\hat k^{tr}$ leading eigenvectors of $\tilde M_n^{tr}$.
   3.c For each $\pi_h$, $h = 1, \ldots, H$, apply the WSVM to $\{(\tilde X_j^{tr}, Y_j^{tr}) : j = 1, \ldots, n_{tr}\}$ and predict the test labels $Y_j^{ts}$.
   3.d Denoting the predicted labels by $\hat Y_j^{ts}$, compute the total cost on the test set (see the sketch after this list),
   $TC(\rho) = \sum_{h=1}^H \sum_{j=1}^{n_{ts}} \pi_h(Y_j^{ts})\, 1(\hat Y_j^{ts} \neq Y_j^{ts}).$
4. Repeat 3.a-3.d over a grid of ρ and select the ρ that minimizes $TC(\rho)$.
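A minimal sketch of the weighted total cost in step 3.d, assuming the predicted test labels for each weight $\pi_h$ have already been obtained; the function and argument names are illustrative.

```python
# Weighted total cost TC(rho): each misclassified test point contributes its
# class weight pi_h(Y), summed over the grid of weights pi_1, ..., pi_H.
import numpy as np

def total_cost(y_test, y_pred_by_pi, pi_grid):
    tc = 0.0
    for pi, y_pred in zip(pi_grid, y_pred_by_pi):
        w = np.where(y_test == 1, 1 - pi, pi)     # pi_h(Y): 1 - pi for Y = +1
        tc += np.sum(w * (np.asarray(y_pred) != y_test))
    return tc
```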