Directly and Efficiently Optimizing Prediction Error and AUC of Linear Classifiers

1 Directly and Efficiently Optimizing Prediction Error and AUC of Linear Classifiers Hiva Ghanbari Joint work with Prof. Katya Scheinberg Industrial and Systems Engineering Department US & Mexico Workshop on Optimization and its Applications Huatulco, Mexico January 2018 Hiva Ghanbari (Lehigh University) 1 / 34

2 Outline Introduction Directly Optimizing Prediction Error Directly Optimizing AUC Numerical Analysis Summary Hiva Ghanbari (Lehigh University) 2 / 34

3 Outline Introduction Directly Optimizing Prediction Error Directly Optimizing AUC Numerical Analysis Summary Hiva Ghanbari (Lehigh University) 3 / 34

4 Supervised Learning Problem Given a finite sample data set $S$ of $n$ (input, label) pairs, e.g., $S := \{(x_i, y_i) : i = 1, \ldots, n\}$, where $x_i \in \mathbb{R}^d$ and $y_i \in \{+1, -1\}$. We are interested in the binary classification problem in supervised learning: the output is discrete-valued, $+1$ or $-1$. We are interested in a linear classifier (predictor) $f(x; w) = w^T x$, so that $f : \mathcal{X} \to \mathcal{Y}$, where $\mathcal{X}$ denotes the space of input values and $\mathcal{Y}$ the space of output values. Hiva Ghanbari (Lehigh University) 4 / 34

5 Supervised Learning Problem (continued) How good is this classifier? Two measures: Prediction Error and Area Under the ROC Curve (AUC). (The setup is the same as on the previous slide.) Hiva Ghanbari (Lehigh University) 4 / 34

6 Outline Introduction Directly Optimizing Prediction Error Directly Optimizing AUC Numerical Analysis Summary Hiva Ghanbari (Lehigh University) 5 / 34

7 Expected Risk (Prediction Error) In $S$, each $(x_i, y_i)$ is an i.i.d. observation of the random variables $(X, Y)$. $(X, Y)$ has an unknown joint probability distribution $P_{X,Y}(x, y)$ over $\mathcal{X}$ and $\mathcal{Y}$. The expected risk associated with a linear classifier $f(x; w) = w^T x$ for the zero-one loss function is defined as $R_{0\text{-}1}(f) = \mathbb{E}_{X,Y}[\ell_{0\text{-}1}(f(X; w), Y)] = \int_{\mathcal{X}} \int_{\mathcal{Y}} P_{X,Y}(x, y)\, \ell_{0\text{-}1}(f(x; w), y)\, dy\, dx$, where $\ell_{0\text{-}1}(f(x; w), y) = 1$ if $y f(x; w) < 0$ and $0$ if $y f(x; w) \geq 0$. Hiva Ghanbari (Lehigh University) 6 / 34

8 Empirical Risk Minimization The joint probability distribution $P_{X,Y}(x, y)$ is unknown, so the empirical risk of the linear classifier $f(x; w)$ for the zero-one loss function over the finite training set $S$ is of interest, i.e., $R_{0\text{-}1}(f; S) = \frac{1}{n} \sum_{i=1}^{n} \ell_{0\text{-}1}(f(x_i; w), y_i)$. Hiva Ghanbari (Lehigh University) 7 / 34

9 Empirical Risk Minimization (continued) Utilizing the logistic regression loss function instead of the 0-1 loss function results in $R_{\log}(f; S) = \frac{1}{n} \sum_{i=1}^{n} \log\left(1 + \exp(-y_i f(x_i; w))\right)$. In practice one solves the regularized problem $\min_{w \in \mathbb{R}^d} F_{\log}(w) = \frac{1}{n} \sum_{i=1}^{n} \log\left(1 + \exp(-y_i f(x_i; w))\right) + \lambda \|w\|^2$. Hiva Ghanbari (Lehigh University) 7 / 34
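As a reference point for the experiments later in the talk, here is a minimal sketch (not from the talk; the names X, y, lam are illustrative) of evaluating the regularized logistic objective $F_{\log}(w)$ and its gradient with NumPy.

```python
import numpy as np

def f_log(w, X, y, lam):
    """Regularized logistic objective: mean log(1 + exp(-y_i w^T x_i)) + lam * ||w||^2."""
    margins = y * (X @ w)                        # y_i * w^T x_i for every sample
    return np.mean(np.logaddexp(0.0, -margins)) + lam * np.dot(w, w)

def grad_f_log(w, X, y, lam):
    """Gradient of the objective above with respect to w."""
    margins = y * (X @ w)
    sigma = 1.0 / (1.0 + np.exp(margins))        # equals -d/dm log(1 + exp(-m))
    return -(X.T @ (sigma * y)) / X.shape[0] + 2.0 * lam * w
```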

10 Alternative Interpretation of the Prediction Error We can interpret the prediction error as a probability value: $F_{\text{error}}(w) = R_{0\text{-}1}(f) = \mathbb{E}_{X,Y}[\ell_{0\text{-}1}(f(X; w), Y)] = P(Y w^T X < 0)$. Hiva Ghanbari (Lehigh University) 8 / 34

11 Alternative Interpretation of the Prediction Error (continued) If the true values of the prior probabilities $P(Y = +1)$ and $P(Y = -1)$ are known or obtainable from a trivial calculation, then we have Lemma 1: the expected risk can be written in terms of probability values, so that $F_{\text{error}}(w) = P(Y w^T X < 0) = P(Z^+ \leq 0)\, P(Y = +1) + \left(1 - P(Z^- \leq 0)\right) P(Y = -1)$, where $Z^+ = w^T X^+$ and $Z^- = w^T X^-$, for $X^+$ and $X^-$ random variables from the positive and negative classes, respectively. Hiva Ghanbari (Lehigh University) 8 / 34

12 Data with Any Arbitrary Distribution Suppose $(X_1, \ldots, X_n)$ is a multivariate random variable. For a given mapping function $g(\cdot)$ we are interested in the c.d.f. of $Z = g(X_1, \ldots, X_n)$. If we consider the region of the space $\mathcal{X}_1 \times \cdots \times \mathcal{X}_n$ on which $g(x_1, \ldots, x_n) \leq z$, then we have $F_Z(z) = P(Z \leq z) = P(g(X) \leq z) = \int_{\{x_1 \in \mathcal{X}_1, \ldots, x_n \in \mathcal{X}_n :\, g(x_1, \ldots, x_n) \leq z\}} f_{X_1, \ldots, X_n}(x_1, \ldots, x_n)\, dx_1 \cdots dx_n$. Hiva Ghanbari (Lehigh University) 9 / 34

13 Data with Normal Distribution Assume $X^+ \sim \mathcal{N}(\mu^+, \Sigma^+)$ and $X^- \sim \mathcal{N}(\mu^-, \Sigma^-)$. Hiva Ghanbari (Lehigh University) 10 / 34

14 Data with Normal Distribution Why Normal? The family of multivariate normal distributions is closed under linear transformations. Theorem 2 (Tong (1990)): If $X \sim \mathcal{N}(\mu, \Sigma)$ and $Z = CX + b$, where $C$ is any given $m \times n$ real matrix and $b$ is any $m \times 1$ real vector, then $Z \sim \mathcal{N}(C\mu + b,\, C\Sigma C^T)$. Moreover, the normal distribution has a smooth c.d.f. Hiva Ghanbari (Lehigh University) 10 / 34

15 Prediction Error as a Smooth Function Theorem 3: Suppose that $X^+ \sim \mathcal{N}(\mu^+, \Sigma^+)$ and $X^- \sim \mathcal{N}(\mu^-, \Sigma^-)$. Then $F_{\text{error}}(w) = P(Y = +1)\left(1 - \phi(\mu_{Z^+}/\sigma_{Z^+})\right) + P(Y = -1)\,\phi(\mu_{Z^-}/\sigma_{Z^-})$, where $\mu_{Z^+} = w^T \mu^+$, $\sigma_{Z^+} = \sqrt{w^T \Sigma^+ w}$, $\mu_{Z^-} = w^T \mu^-$, $\sigma_{Z^-} = \sqrt{w^T \Sigma^- w}$, and $\phi$ is the c.d.f. of the standard normal distribution, i.e., $\phi(x) = \int_{-\infty}^{x} \frac{1}{\sqrt{2\pi}} \exp\left(-\tfrac{1}{2} t^2\right) dt$ for $x \in \mathbb{R}$. Hiva Ghanbari (Lehigh University) 11 / 34

16 Prediction Error as a Smooth Function (continued) Under the assumptions of Theorem 3, the prediction error is a smooth function of $w$, so we can compute the gradient and apply gradient-based optimization. Hiva Ghanbari (Lehigh University) 11 / 34
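To illustrate how Theorem 3 turns the prediction error into a smooth objective, the snippet below (an illustrative sketch, not the talk's Algorithm 1) evaluates $F_{\text{error}}(w)$ from class priors, means, and covariances, with the moments assumed to be estimated from the training sample.

```python
import numpy as np
from scipy.stats import norm

def f_error(w, mu_pos, cov_pos, mu_neg, cov_neg, p_pos, p_neg):
    """Smooth prediction-error objective of Theorem 3 under the normality assumption."""
    mu_zp, sd_zp = w @ mu_pos, np.sqrt(w @ cov_pos @ w)   # moments of Z+ = w^T X+
    mu_zn, sd_zn = w @ mu_neg, np.sqrt(w @ cov_neg @ w)   # moments of Z- = w^T X-
    return p_pos * (1.0 - norm.cdf(mu_zp / sd_zp)) + p_neg * norm.cdf(mu_zn / sd_zn)

# The moments can be estimated from the training data, e.g.
#   mu_pos, cov_pos = X_pos.mean(axis=0), np.cov(X_pos, rowvar=False)
```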

17 Outline Introduction Directly Optimizing Prediction Error Directly Optimizing AUC Numerical Analysis Summary Hiva Ghanbari (Lehigh University) 12 / 34

18 Learning From Imbalanced Data Sets Many real-world machine learning problems deal with imbalanced training data. [Figure: (a) Balanced data set, (b) Imbalanced data set] Hiva Ghanbari (Lehigh University) 13 / 34

19 Receiver Operating Characteristic (ROC) Curve Sort the outputs by descending value of $f(x; w) = w^T x$. Confusion matrix: Actual Positive is a True Positive (TP) if predicted positive and a False Negative (FN) if predicted negative; Actual Negative is a False Positive (FP) if predicted positive and a True Negative (TN) if predicted negative. Various thresholds result in different True Positive Rate $= \frac{TP}{TP + FN}$ and False Positive Rate $= \frac{FP}{FP + TN}$. The ROC curve presents the tradeoff between the TPR and the FPR over all possible thresholds. Hiva Ghanbari (Lehigh University) 14 / 34
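For example, the (TPR, FPR) point for a single threshold can be computed as in this small sketch (illustrative names; labels are assumed to be in {+1, -1} and scores = X @ w).

```python
import numpy as np

def tpr_fpr(scores, y, threshold):
    """TPR and FPR when predicting +1 whenever the score exceeds the threshold."""
    pred_pos = scores > threshold
    tp = np.sum(pred_pos & (y == +1))
    fn = np.sum(~pred_pos & (y == +1))
    fp = np.sum(pred_pos & (y == -1))
    tn = np.sum(~pred_pos & (y == -1))
    return tp / (tp + fn), fp / (fp + tn)

# Sweeping the threshold over all observed scores traces out the ROC curve.
```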

20 Area Under ROC Curve (AUC) How can we compare ROC curves? Hiva Ghanbari (Lehigh University) 15 / 34

21 Area Under ROC Curve (AUC) How can we compare ROC curves? Higher AUC = better classifier. Hiva Ghanbari (Lehigh University) 15 / 34

22 An Unbiased Estimate of the AUC Value An unbiased estimate of the AUC value of a linear classifier can be obtained via the Wilcoxon-Mann-Whitney (WMW) statistic (Mann and Whitney (1947)): $\text{AUC}(f; S^+, S^-) = \frac{\sum_{i=1}^{n^+} \sum_{j=1}^{n^-} \mathbb{1}\left[f(x_i^+; w) > f(x_j^-; w)\right]}{n^+ n^-}$, where $\mathbb{1}\left[f(x_i^+; w) > f(x_j^-; w)\right] = 1$ if $f(x_i^+; w) > f(x_j^-; w)$ and $0$ otherwise, and $S = S^+ \cup S^-$. Hiva Ghanbari (Lehigh University) 16 / 34
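A direct implementation of the WMW estimate costs $O(n^+ \cdot n^-)$ comparisons; a minimal NumPy sketch (illustrative names) is shown below.

```python
import numpy as np

def auc_wmw(w, X_pos, X_neg):
    """Empirical AUC: fraction of (positive, negative) pairs ranked correctly by w^T x."""
    s_pos = X_pos @ w
    s_neg = X_neg @ w
    # Compare every positive score with every negative score.
    return np.mean(s_pos[:, None] > s_neg[None, :])
```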

23 AUC Approximation via Surrogate Losses The indicator function $\mathbb{1}[\cdot]$ can be approximated with: a sigmoid surrogate function (Yan et al. (2003)); a pairwise exponential loss or pairwise logistic loss (Rudin and Schapire (2009)); or a pairwise hinge loss (Steck (2007)), $F_{\text{hinge}}(w) = \frac{\sum_{i=1}^{n^+} \sum_{j=1}^{n^-} \max\left\{0,\, 1 - \left(f(x_i^+; w) - f(x_j^-; w)\right)\right\}}{n^+ n^-}$. Hiva Ghanbari (Lehigh University) 17 / 34
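The pairwise hinge surrogate above can be evaluated in the same pairwise fashion; a sketch with illustrative names:

```python
import numpy as np

def f_hinge(w, X_pos, X_neg):
    """Pairwise hinge surrogate: average over pairs of max(0, 1 - (w^T x+ - w^T x-))."""
    diffs = (X_pos @ w)[:, None] - (X_neg @ w)[None, :]   # f(x+_i; w) - f(x-_j; w)
    return np.mean(np.maximum(0.0, 1.0 - diffs))
```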

24 Measuring AUC Statistically Let $\mathcal{X}^+$ and $\mathcal{X}^-$ denote the spaces of the positive and negative input values. Then $x_i^+$ is an i.i.d. observation of the random variable $X^+$ and $x_j^-$ is an i.i.d. observation of the random variable $X^-$. If the joint probability distribution $P_{X^+, X^-}(x^+, x^-)$ is known, the actual AUC value associated with a linear classifier $f(x; w) = w^T x$ is defined as $\text{AUC}(f) = \mathbb{E}_{X^+, X^-}\left[\mathbb{1}\left[f(X^+; w) > f(X^-; w)\right]\right] = \int_{\mathcal{X}^+} \int_{\mathcal{X}^-} P_{X^+, X^-}(x^+, x^-)\, \mathbb{1}\left[f(x^+; w) > f(x^-; w)\right] dx^-\, dx^+$. Hiva Ghanbari (Lehigh University) 18 / 34

25 Alternative Interpretation of AUC Lemma 4: We can interpret the AUC value as a probability value: $F_{\text{AUC}}(w) = 1 - \text{AUC}(f) = 1 - \mathbb{E}_{X^+, X^-}\left[\mathbb{1}\left[f(X^+; w) > f(X^-; w)\right]\right] = 1 - P(Z < 0)$, where $Z = w^T(X^- - X^+)$, for $X^+$ and $X^-$ random variables from the positive and negative classes, respectively. Hiva Ghanbari (Lehigh University) 19 / 34

26 AUC as a Smooth Function Theorem 5: If the two random variables $X^+$ and $X^-$ have a joint multivariate normal distribution, such that $\begin{pmatrix} X^+ \\ X^- \end{pmatrix} \sim \mathcal{N}(\mu, \Sigma)$, where $\mu = \begin{pmatrix} \mu^+ \\ \mu^- \end{pmatrix}$ and $\Sigma = \begin{pmatrix} \Sigma^{++} & \Sigma^{+-} \\ \Sigma^{-+} & \Sigma^{--} \end{pmatrix}$, then the AUC function can be written as $F_{\text{AUC}}(w) = 1 - \phi\!\left(\frac{-\mu_Z}{\sigma_Z}\right)$, where $\mu_Z = w^T\left(\mu^- - \mu^+\right)$, $\sigma_Z = \sqrt{w^T\left(\Sigma^{--} + \Sigma^{++} - \Sigma^{-+} - \Sigma^{+-}\right) w}$, and $\phi$ is the c.d.f. of the standard normal distribution. Hiva Ghanbari (Lehigh University) 20 / 34

27 AUC as a Smooth Function (continued) Under the assumptions of Theorem 5, the AUC is a smooth function of $w$, so we can compute the gradient. Hiva Ghanbari (Lehigh University) 20 / 34
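In code, Theorem 5 reduces the $1 - \text{AUC}$ objective to a single normal c.d.f. evaluation. The sketch below follows the sign convention used above ($Z = w^T(X^- - X^+)$) and assumes the block means and covariances are estimated from data; cov_pn denotes the cross-covariance of $X^+$ and $X^-$ (zero if the two classes are sampled independently).

```python
import numpy as np
from scipy.stats import norm

def f_auc(w, mu_pos, mu_neg, cov_pp, cov_nn, cov_pn):
    """Smooth 1 - AUC objective of Theorem 5 under the joint-normality assumption."""
    mu_z = w @ (mu_neg - mu_pos)                           # mean of Z = w^T (X- - X+)
    sigma_z = np.sqrt(w @ (cov_nn + cov_pp - cov_pn - cov_pn.T) @ w)
    return 1.0 - norm.cdf(-mu_z / sigma_z)
```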

28 Outline Introduction Directly Optimizing Prediction Error Directly Optimizing AUC Numerical Analysis Summary Hiva Ghanbari (Lehigh University) 21 / 34

29 Computational Settings We used gradient descent with a backtracking line search as the optimization method. The algorithm is implemented in Python, and computations were performed on a computational cluster. We used both artificial data sets and real data sets, with five-fold cross-validation in a train-test framework. Hiva Ghanbari (Lehigh University) 22 / 34
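A generic sketch of the optimization loop described on this slide: gradient descent with an Armijo backtracking line search (illustrative, not necessarily identical to the talk's Algorithm 1).

```python
import numpy as np

def gradient_descent(f, grad, w0, max_iter=500, tol=1e-6, beta=0.5, c=1e-4):
    """Minimize f with gradient descent, choosing each step size by backtracking."""
    w = w0.copy()
    for _ in range(max_iter):
        g = grad(w)
        if np.linalg.norm(g) < tol:
            break
        step, fw = 1.0, f(w)
        # Shrink the step until the Armijo sufficient-decrease condition holds.
        while f(w - step * g) > fw - c * step * np.dot(g, g):
            step *= beta
        w = w - step * g
    return w

# Example use with the sketches above:
#   w_star = gradient_descent(lambda w: f_log(w, X, y, lam),
#                             lambda w: grad_f_log(w, X, y, lam), np.zeros(d))
```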

30 Directly Optimizing Prediction Error and AUC: Artificial Data Sets Information Table 1: Artificial data sets statistics. d: number of features, n: number of data points, P+, P-: prior probabilities, out%: percentage of outlier data points. The nine artificial data sets are generated randomly from normal distributions. [Table values not preserved in this transcription.] In these data sets, we have swapped a specific percentage of the positive training data points with negative ones and vice versa. The numerical results summarized in Table 2 are the results of 20 randomly generated data sets. At each run we used 80 percent of the data points as the training data and the rest of the data points as the test data. The reported average accuracy and standard deviation are based on the test data not seen during the training process. The results presented in Table 2 are based on minimizing $F_{\text{error}}(w)$ versus $F_{\log}(w)$. Hiva Ghanbari (Lehigh University) 23 / 34
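A sketch of the kind of data-generation protocol described above: two Gaussian classes, a chosen prior, and a fraction of swapped labels acting as outliers (the parameters are illustrative; the actual sizes and fractions of Table 1 are not reproduced).

```python
import numpy as np

def make_gaussian_data(n, d, p_pos, flip_frac, seed=0):
    """Two Gaussian classes in R^d with a fraction of labels swapped as outliers."""
    rng = np.random.default_rng(seed)
    n_pos = int(round(p_pos * n))
    X = np.vstack([rng.normal(+1.0, 1.0, size=(n_pos, d)),
                   rng.normal(-1.0, 1.0, size=(n - n_pos, d))])
    y = np.concatenate([np.ones(n_pos), -np.ones(n - n_pos)])
    swap = rng.choice(n, size=int(flip_frac * n), replace=False)
    y[swap] = -y[swap]                       # swap a percentage of the labels
    return X, y
```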

31 Optimizing $F_{\text{error}}(w)$ vs. $F_{\log}(w)$ on Artificial Data Table 2: $F_{\text{error}}(w)$ vs. $F_{\log}(w)$ minimization via Algorithm 1 on artificial data sets. Columns: $F_{\text{error}}(w)$ minimization with exact moments, $F_{\text{error}}(w)$ minimization with approximate moments, and $F_{\log}(w)$ minimization, each reporting Accuracy ± std and Time (s) for the nine artificial data sets. [Numeric values not preserved in this transcription.] Hiva Ghanbari (Lehigh University) 24 / 34

32 Real Data Sets Information These data sets can be downloaded from the LIBSVM website (cjlin/libsvmtools/datasets/binary.html) and the UCI machine learning repository. Table 3: Real data sets statistics. d: number of features, n: number of data points, P+, P-: prior probabilities, AC: attribute characteristics. Data sets (with attribute characteristics): fourclass ([-1, 1], real), svmguide1 ([-1, 1], real), diabetes ([-1, 1], real), shuttle ([-1, 1], real), vowel ([-6, 6], int), magic04 ([-1, 1], real), poker ([1, 13], int), letter ([0, 15], int), segment ([-1, 1], real), svmguide3 ([-1, 1], real), ijcnn1 ([-1, 1], real), german ([-1, 1], real), landsat satellite ([27, 157], int), sonar ([-1, 1], real), a9a (binary), w8a (binary), mnist ([0, 1], real), colon-cancer ([-1, 1], real), gisette ([-1, 1], real), rcv1 ([-1, 1], real). [Values of d, n, P+, P- not preserved in this transcription.] Table 4 summarizes the computational results comparing the performance of the linear classifiers obtained through minimizing $F_{\text{error}}(w)$ versus $F_{\log}(w)$ via Algorithm 1. Hiva Ghanbari (Lehigh University) 25 / 34

33 Optimizing $F_{\text{error}}(w)$ vs. $F_{\log}(w)$ on Real Data Table 4: $F_{\text{error}}(w)$ vs. $F_{\log}(w)$ minimization via Algorithm 1 on real data sets. Columns: $F_{\text{error}}(w)$ minimization and $F_{\log}(w)$ minimization, each reporting Accuracy ± std and Time (s) for the 20 real data sets of Table 3. [Numeric values not preserved in this transcription.] Hiva Ghanbari (Lehigh University) 26 / 34

34 Optimizing $F_{\text{error}}(w)$ vs. $F_{\log}(w)$ on Real Data [Figure; panels: Minimizing $F_{\log}$, Minimizing $F_{\text{error}}$] Hiva Ghanbari (Lehigh University) 27 / 34

35 Optimizing $F_{\text{AUC}}(w)$ vs. $F_{\text{hinge}}(w)$ on Artificial Data Table 5: $F_{\text{AUC}}(w)$ vs. $F_{\text{hinge}}(w)$ minimization via Algorithm 1 on artificial data sets. Columns: $F_{\text{AUC}}(w)$ minimization with exact moments, $F_{\text{AUC}}(w)$ minimization with approximate moments, and $F_{\text{hinge}}(w)$ minimization, each reporting AUC ± std and Time (s) for the nine artificial data sets. [Numeric values not preserved in this transcription.] Hiva Ghanbari (Lehigh University) 28 / 34

36 Optimizing $F_{\text{AUC}}(w)$ vs. $F_{\text{hinge}}(w)$ on Real Data Table 6: $F_{\text{AUC}}(w)$ vs. $F_{\text{hinge}}(w)$ minimization via Algorithm 1 on real data sets. Columns: $F_{\text{AUC}}(w)$ minimization and $F_{\text{hinge}}(w)$ minimization, each reporting AUC ± std and Time (s) for the 20 real data sets of Table 3. [Numeric values not preserved in this transcription.] Table 6 summarizes the results on real data sets, in a similar manner to Table 4, with the difference that the AUC values of the linear classifiers obtained through minimizing $F_{\text{AUC}}(w)$ versus $F_{\text{hinge}}(w)$ are of interest. The average AUC values obtained through minimizing $F_{\text{AUC}}(w)$ and $F_{\text{hinge}}(w)$ are comparable on 14 of the 20 data sets, although in terms of solution time, minimizing $F_{\text{AUC}}(w)$ consistently outperforms minimizing $F_{\text{hinge}}(w)$. Moreover, on 4 data sets, minimizing $F_{\text{AUC}}(w)$ performs better than minimizing $F_{\text{hinge}}(w)$. Hiva Ghanbari (Lehigh University) 29 / 34

37 Optimizing $F_{\text{AUC}}(w)$ vs. $F_{\text{hinge}}(w)$ on Real Data [Figure; panels: Minimizing $F_{\text{hinge}}$, Minimizing $F_{\text{AUC}}$] Hiva Ghanbari (Lehigh University) 30 / 34

38 Outline Introduction Directly Optimizing Prediction Error Directly Optimizing AUC Numerical Analysis Summary Hiva Ghanbari (Lehigh University) 31 / 34

39 Summary We proposed conditions under which the expected prediction error and the AUC are smooth functions, so that any gradient-based optimization method can be applied to optimize these functions directly. The newly proposed approaches work efficiently without perturbing the unknown distribution of the real data sets. Studying data distributions may lead to new efficient approaches. Hiva Ghanbari (Lehigh University) 32 / 34

40 References H. B. Mann and D. R. Whitney. On a test of whether one of two random variables is stochastically larger than the other. Annals of Mathematical Statistics, 18:50-60, 1947. Y. L. Tong. The Multivariate Normal Distribution. Springer Series in Statistics, 1990. G. Casella and R. L. Berger. Statistical Inference. Pacific Grove, CA: Duxbury, 2nd edition, 2002. Hiva Ghanbari (Lehigh University) 33 / 34

41 Thanks for your attention!
