ABC-LogitBoost for Multi-Class Classification


1. ABC-LogitBoost for Multi-Class Classification
Ping Li, Department of Statistical Science, Cornell University
(ABC-Boost lecture, BTRY 6520, Fall)

2. What is Classification? An Example: USPS Handwritten Zipcode Recognition
[Figure: handwritten digit samples from three different writers (Person 1, Person 2, Person 3).]
The task: teach the machine to automatically recognize the 10 digits.

3. Multi-Class Classification
Given a training data set $\{y_i, X_i\}_{i=1}^N$, $X_i \in \mathbb{R}^p$, $y_i \in \{0, 1, 2, \ldots, K-1\}$, the task is to learn a function to predict the class label $y_i$ from $X_i$.
K = 2: binary classification. K > 2: multi-class classification.
Many important practical problems can be cast as (multi-class) classification. For example: Li, Burges, and Wu, NIPS 2007, Learning to Rank Using Multiple Classification and Gradient Boosting.

4. Logistic Regression for Classification
First learn the class probabilities
$\hat{p}_k = \Pr\{y = k \mid X\}$, $k = 0, 1, \ldots, K-1$, with $\sum_{k=0}^{K-1} \hat{p}_k = 1$ (only $K-1$ degrees of freedom).
Then assign the class label according to $\hat{y}_X = \operatorname{argmax}_k \hat{p}_k$.

5. Multinomial Logit Probability Model
$p_k = \frac{e^{F_k}}{\sum_{s=0}^{K-1} e^{F_s}}$,
where $F_k = F_k(\mathbf{x})$ is the function to be learned from the data.
Classical logistic regression: $F(\mathbf{x}) = \beta^T \mathbf{x}$. The task is to learn the coefficients $\beta$.
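To make the model concrete, here is a minimal NumPy sketch of the multinomial logit map from the scores $F_k$ to the probabilities $p_k$; the function name class_probabilities and the max-subtraction trick are our own illustration, not from the lecture.

import numpy as np

def class_probabilities(F):
    """Multinomial logit: p_k = exp(F_k) / sum_s exp(F_s).
    F: array of shape (N, K) holding F_k(x_i) for each sample i and class k.
    Subtracting the row-wise max keeps exp() from overflowing; it does not
    change p_k because the model is invariant to adding a constant to all F_k."""
    F = F - F.max(axis=1, keepdims=True)
    e = np.exp(F)
    return e / e.sum(axis=1, keepdims=True)

# Example: 2 samples, K = 3 classes
F = np.array([[0.5, -1.0, 0.2],
              [2.0,  0.0, 0.0]])
p = class_probabilities(F)
print(p)                    # each row sums to 1
print(p.argmax(axis=1))     # predicted labels, as on slide 4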

6. Flexible additive modeling:
$F(\mathbf{x}) = F^{(M)}(\mathbf{x}) = \sum_{m=1}^{M} \rho_m h(\mathbf{x}; \mathbf{a}_m)$,
where $h(\mathbf{x}; \mathbf{a})$ is a pre-specified function (e.g., trees). The task is to learn the parameters $\rho_m$ and $\mathbf{a}_m$.
Both LogitBoost (Friedman et al., 2000) and MART (Multiple Additive Regression Trees, Friedman 2001) adopted this model.

7. Learning Logistic Regression by Maximum Likelihood
Seek $F_{i,k}$ to maximize the multinomial likelihood. Suppose $y_i = k$; then
$\text{Lik} \propto p_{i,0}^{0} \cdots p_{i,k}^{1} \cdots p_{i,K-1}^{0} = p_{i,k}$,
or equivalently, maximize the log-likelihood $\log \text{Lik} \propto \log p_{i,k}$,
or equivalently, minimize the negative log-likelihood loss $L_i = -\log p_{i,k}$ (for $y_i = k$).

8. The Negative Log-Likelihood Loss
$L = \sum_{i=1}^{N} L_i = \sum_{i=1}^{N} \left\{ -\sum_{k=0}^{K-1} r_{i,k} \log p_{i,k} \right\}$,
where $r_{i,k} = 1$ if $y_i = k$ and $r_{i,k} = 0$ otherwise.

9. Two Basic Optimization Methods for Maximum Likelihood
1. Newton's method: uses the first and second derivatives of the loss function. This is the method in LogitBoost.
2. Gradient descent: only uses the first-order derivative of the loss function.
MART used a creative combination of gradient descent and Newton's method.

10. Derivatives Used in LogitBoost and MART
The loss function: $L = \sum_{i=1}^{N} L_i = \sum_{i=1}^{N} \left\{ -\sum_{k=0}^{K-1} r_{i,k} \log p_{i,k} \right\}$.
The first derivative: $\frac{\partial L_i}{\partial F_{i,k}} = -(r_{i,k} - p_{i,k})$.
The second derivative: $\frac{\partial^2 L_i}{\partial F_{i,k}^2} = p_{i,k}(1 - p_{i,k})$.
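A small illustrative helper (ours, not the authors' code) that evaluates the negative log-likelihood together with the diagonal first and second derivatives above:

import numpy as np

def nll_and_diagonal_derivatives(F, y):
    """Negative multinomial log-likelihood and the diagonal derivatives
    used by LogitBoost / MART:
        dL_i/dF_{i,k}    = -(r_{i,k} - p_{i,k})
        d2L_i/dF_{i,k}^2 =  p_{i,k} (1 - p_{i,k})
    F: (N, K) scores; y: (N,) integer labels in {0, ..., K-1}."""
    N, K = F.shape
    p = np.exp(F - F.max(axis=1, keepdims=True))
    p /= p.sum(axis=1, keepdims=True)
    r = np.zeros((N, K))
    r[np.arange(N), y] = 1.0            # r_{i,k} = 1 iff y_i = k
    loss = -np.sum(r * np.log(p))       # L = -sum_i log p_{i, y_i}
    grad = -(r - p)                     # first derivative
    hess = p * (1.0 - p)                # second derivative (diagonal)
    return loss, grad, hess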

11. The Original LogitBoost Algorithm
1: $F_{i,k} = 0$, $p_{i,k} = \frac{1}{K}$, $k = 0$ to $K-1$, $i = 1$ to $N$
2: For $m = 1$ to $M$ Do
3:   For $k = 0$ to $K-1$ Do
4:     $w_{i,k} = p_{i,k}(1 - p_{i,k})$, $z_{i,k} = \frac{r_{i,k} - p_{i,k}}{p_{i,k}(1 - p_{i,k})}$
5:     Fit the function $f_{i,k}$ by a weighted least-squares of $z_{i,k}$ to $\mathbf{x}_i$ with weights $w_{i,k}$
6:     $F_{i,k} = F_{i,k} + \nu \frac{K-1}{K}\left( f_{i,k} - \frac{1}{K}\sum_{k=0}^{K-1} f_{i,k} \right)$
7:   End
8:   $p_{i,k} = \exp(F_{i,k}) / \sum_{s=0}^{K-1} \exp(F_{i,s})$, $k = 0$ to $K-1$, $i = 1$ to $N$
9: End

12. The Original MART Algorithm
1: $F_{i,k} = 0$, $p_{i,k} = \frac{1}{K}$, $k = 0$ to $K-1$, $i = 1$ to $N$
2: For $m = 1$ to $M$ Do
3:   For $k = 0$ to $K-1$ Do
4:     $\{R_{j,k,m}\}_{j=1}^{J}$ = $J$-terminal-node regression tree from $\{r_{i,k} - p_{i,k},\ \mathbf{x}_i\}_{i=1}^{N}$
5:     $\beta_{j,k,m} = \frac{K-1}{K} \frac{\sum_{\mathbf{x}_i \in R_{j,k,m}} r_{i,k} - p_{i,k}}{\sum_{\mathbf{x}_i \in R_{j,k,m}} (1 - p_{i,k}) p_{i,k}}$
6:     $F_{i,k} = F_{i,k} + \nu \sum_{j=1}^{J} \beta_{j,k,m} 1_{\mathbf{x}_i \in R_{j,k,m}}$
7:   End
8:   $p_{i,k} = \exp(F_{i,k}) / \sum_{s=0}^{K-1} \exp(F_{i,s})$, $k = 0$ to $K-1$, $i = 1$ to $N$
9: End

13. Comparing LogitBoost with MART
LogitBoost used first and second derivatives to construct the trees. MART only used the first-order information to construct the trees. Both used second-order information to update the values of the terminal nodes.
LogitBoost was believed to have numerical instability problems.

14. The Numerical Issue in LogitBoost
4:     $w_{i,k} = p_{i,k}(1 - p_{i,k})$, $z_{i,k} = \frac{r_{i,k} - p_{i,k}}{p_{i,k}(1 - p_{i,k})}$
5:     Fit the function $f_{i,k}$ by a weighted least-squares of $z_{i,k}$ to $\mathbf{x}_i$ with weights $w_{i,k}$
6:     $F_{i,k} = F_{i,k} + \nu \frac{K-1}{K}\left( f_{i,k} - \frac{1}{K}\sum_{k=0}^{K-1} f_{i,k} \right)$
The instability issue: when $p_{i,k}$ is close to 0 or 1, the response $z_{i,k} = \frac{r_{i,k} - p_{i,k}}{p_{i,k}(1 - p_{i,k})}$ may approach infinity.

15. The Numerical Issue in LogitBoost
(Friedman et al., 2000) used regression trees and suggested some crucial implementation protections:
In Line 4, compute $z_{i,k}$ by $\frac{1}{p_{i,k}}$ (if $r_{i,k} = 1$) or $\frac{-1}{1 - p_{i,k}}$ (if $r_{i,k} = 0$).
Bound $|z_{i,k}|$ by $z_{\max} \in [2, 4]$.
Robust LogitBoost avoids this pointwise thresholding and is essentially free of numerical problems. It turns out the numerical issue does not really exist, especially when the trees are not too large.

16. Tree-Splitting Using the Second-Order Information
Feature values: $x_i$, $i = 1$ to $N$; assume $x_1 \le x_2 \le \ldots \le x_N$.
Weight values: $w_i$, $i = 1$ to $N$. Response values: $z_i$, $i = 1$ to $N$.
We seek the index $s$, $1 \le s < N$, to maximize the gain of weighted SE:
$\mathrm{Gain}(s) = SE_T - (SE_L + SE_R) = \sum_{i=1}^{N} (z_i - \bar{z})^2 w_i - \left[ \sum_{i=1}^{s} (z_i - \bar{z}_L)^2 w_i + \sum_{i=s+1}^{N} (z_i - \bar{z}_R)^2 w_i \right]$,
where $\bar{z} = \frac{\sum_{i=1}^{N} z_i w_i}{\sum_{i=1}^{N} w_i}$, $\bar{z}_L = \frac{\sum_{i=1}^{s} z_i w_i}{\sum_{i=1}^{s} w_i}$, $\bar{z}_R = \frac{\sum_{i=s+1}^{N} z_i w_i}{\sum_{i=s+1}^{N} w_i}$.

17. After simplification, we obtain
$\mathrm{Gain}(s) = \frac{\left[\sum_{i=1}^{s} z_i w_i\right]^2}{\sum_{i=1}^{s} w_i} + \frac{\left[\sum_{i=s+1}^{N} z_i w_i\right]^2}{\sum_{i=s+1}^{N} w_i} - \frac{\left[\sum_{i=1}^{N} z_i w_i\right]^2}{\sum_{i=1}^{N} w_i}$
$= \frac{\left[\sum_{i=1}^{s} (r_{i,k} - p_{i,k})\right]^2}{\sum_{i=1}^{s} p_{i,k}(1 - p_{i,k})} + \frac{\left[\sum_{i=s+1}^{N} (r_{i,k} - p_{i,k})\right]^2}{\sum_{i=s+1}^{N} p_{i,k}(1 - p_{i,k})} - \frac{\left[\sum_{i=1}^{N} (r_{i,k} - p_{i,k})\right]^2}{\sum_{i=1}^{N} p_{i,k}(1 - p_{i,k})}$,
recalling $w_i = p_{i,k}(1 - p_{i,k})$ and $z_i = \frac{r_{i,k} - p_{i,k}}{p_{i,k}(1 - p_{i,k})}$.
This procedure is numerically stable.
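The simplified form lends itself to a prefix-sum implementation. The sketch below is our own (assuming strictly positive weights and samples already sorted by the feature being split on); it evaluates the second-order gain at every split point:

import numpy as np

def best_split_gain(g, w):
    """Second-order split gain of (Robust) LogitBoost (slide 17):
        Gain(s) = [sum_{i<=s} g_i]^2 / sum_{i<=s} w_i
                + [sum_{i>s}  g_i]^2 / sum_{i>s}  w_i
                - [sum_i      g_i]^2 / sum_i      w_i,
    with g_i = r_{i,k} - p_{i,k} and w_i = p_{i,k}(1 - p_{i,k}); there is no
    per-sample division by p(1-p), which is why it is numerically stable.
    g, w: 1-D arrays sorted by the feature value.
    Returns (best_gain, s): the split puts samples 0..s in the left node."""
    cum_g, cum_w = np.cumsum(g), np.cumsum(w)
    tot_g, tot_w = cum_g[-1], cum_w[-1]
    left = cum_g[:-1] ** 2 / cum_w[:-1]
    right = (tot_g - cum_g[:-1]) ** 2 / (tot_w - cum_w[:-1])
    gain = left + right - tot_g ** 2 / tot_w
    s = int(np.argmax(gain))
    return float(gain[s]), s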

18. MART only used the first-order information to construct the trees:
$\mathrm{MARTGain}(s) = \frac{1}{s}\left[\sum_{i=1}^{s} (r_{i,k} - p_{i,k})\right]^2 + \frac{1}{N-s}\left[\sum_{i=s+1}^{N} (r_{i,k} - p_{i,k})\right]^2 - \frac{1}{N}\left[\sum_{i=1}^{N} (r_{i,k} - p_{i,k})\right]^2$,
which can also be derived by letting the weights $w_{i,k} = 1$ and the response $z_{i,k} = r_{i,k} - p_{i,k}$.
LogitBoost used more information and could be more accurate on many datasets.

19. Update Terminal Node Values
In LogitBoost: $\frac{\sum_{\text{node}} z_{i,k} w_{i,k}}{\sum_{\text{node}} w_{i,k}} = \frac{\sum_{\text{node}} (r_{i,k} - p_{i,k})}{\sum_{\text{node}} p_{i,k}(1 - p_{i,k})}$,
which is the same as in MART.

20. Robust LogitBoost
1: $F_{i,k} = 0$, $p_{i,k} = \frac{1}{K}$, $k = 0$ to $K-1$, $i = 1$ to $N$
2: For $m = 1$ to $M$ Do
3:   For $k = 0$ to $K-1$ Do
4:     $\{R_{j,k,m}\}_{j=1}^{J}$ = $J$-terminal-node regression tree from $\{r_{i,k} - p_{i,k},\ \mathbf{x}_i\}_{i=1}^{N}$, with weights $p_{i,k}(1 - p_{i,k})$
5:     $\beta_{j,k,m} = \frac{K-1}{K} \frac{\sum_{\mathbf{x}_i \in R_{j,k,m}} r_{i,k} - p_{i,k}}{\sum_{\mathbf{x}_i \in R_{j,k,m}} (1 - p_{i,k}) p_{i,k}}$
6:     $F_{i,k} = F_{i,k} + \nu \sum_{j=1}^{J} \beta_{j,k,m} 1_{\mathbf{x}_i \in R_{j,k,m}}$
7:   End
8:   $p_{i,k} = \exp(F_{i,k}) / \sum_{s=0}^{K-1} \exp(F_{i,s})$, $k = 0$ to $K-1$, $i = 1$ to $N$
9: End
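For intuition, here is a compact sketch of the Robust LogitBoost loop using scikit-learn regression trees as the weighted least-squares fitter. It is an illustration under our own simplifications (weights are clipped away from zero, and each leaf's value is the weighted mean produced by the tree, which equals the ratio in Line 5 up to the $(K-1)/K$ factor applied in the update); it is not the authors' implementation.

import numpy as np
from sklearn.tree import DecisionTreeRegressor

def robust_logitboost(X, y, K, M=100, J=20, nu=0.1):
    """Minimal sketch of the Robust LogitBoost loop. Each class-k tree is a
    weighted least-squares fit of z = (r - p) / (p(1-p)) with weights
    w = p(1-p); such a tree splits on the second-order gain of slide 17 and
    its leaf values equal sum(r - p) / sum(p(1-p)), matching Lines 4-6."""
    N = X.shape[0]
    F = np.zeros((N, K))
    r = np.zeros((N, K))
    r[np.arange(N), y] = 1.0
    trees = []
    for m in range(M):
        # probabilities from the current scores (Line 8 of the previous round)
        p = np.exp(F - F.max(axis=1, keepdims=True))
        p /= p.sum(axis=1, keepdims=True)
        round_trees = []
        for k in range(K):
            w = np.clip(p[:, k] * (1.0 - p[:, k]), 1e-12, None)
            z = (r[:, k] - p[:, k]) / w
            tree = DecisionTreeRegressor(max_leaf_nodes=J)
            tree.fit(X, z, sample_weight=w)
            F[:, k] += nu * (K - 1.0) / K * tree.predict(X)
            round_trees.append(tree)
        trees.append(round_trees)
    return trees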

21. Experiments on Binary Classification
(Multi-class classification is even more interesting!)
Data:
IJCNN1: this dataset was used in a competition; LibSVM was the winner.
Forest100k and Forest521k: the two largest datasets from Bordes et al., JMLR 2005, Fast Kernel Classifiers with Online and Active Learning.

22. IJCNN1 Test Errors
[Figure: test misclassification error vs. boosting iterations ($J = 20$), comparing MART, LibSVM, and Robust LogitBoost.]

23. Forest100k Test Errors
[Figure: test misclassification error vs. boosting iterations ($J = 20$, $\nu = 0.1$), comparing MART, SVM, and Robust LogitBoost.]

24. Forest521k Test Errors
[Figure: test misclassification error vs. boosting iterations ($J = 20$), comparing SVM, MART, and Robust LogitBoost.]

25. ABC-Boost for Multi-Class Classification
ABC = Adaptive Base Class
ABC-MART = ABC-Boost + MART
ABC-LogitBoost = ABC-Boost + (Robust) LogitBoost
The key to the success of ABC-Boost is the use of better derivatives.

26. Review: Components of Logistic Regression
The multinomial probability model: $p_k = \frac{e^{F_k}}{\sum_{s=0}^{K-1} e^{F_s}}$, $\sum_{k=0}^{K-1} p_k = 1$, where $F_k = F_k(\mathbf{x})$ is the function to be learned from the data.
The sum-to-zero constraint $\sum_{k=0}^{K-1} F_k(\mathbf{x}) = 0$ is commonly used to obtain a unique solution (only $K-1$ degrees of freedom).

27. Why the sum-to-zero constraint?
$\frac{e^{F_{i,k} + C}}{\sum_{s=0}^{K-1} e^{F_{i,s} + C}} = \frac{e^{C} e^{F_{i,k}}}{e^{C} \sum_{s=0}^{K-1} e^{F_{i,s}}} = \frac{e^{F_{i,k}}}{\sum_{s=0}^{K-1} e^{F_{i,s}}} = p_{i,k}$.
For identifiability, one should impose a constraint. One popular choice is to assume $\sum_{k=0}^{K-1} F_{i,k} = \text{const}$, equivalent to $\sum_{k=0}^{K-1} F_{i,k} = 0$.
This is the assumption used in many papers, including LogitBoost and MART.
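A tiny numerical check of this shift invariance (illustrative only; the constant C is arbitrary):

import numpy as np

def softmax(F):
    e = np.exp(F - F.max())
    return e / e.sum()

F = np.array([1.0, -0.5, 2.0])
C = 7.3                                          # any constant shift
print(np.allclose(softmax(F), softmax(F + C)))   # True: p_k is unchanged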

28. The negative log-likelihood loss
$L = \sum_{i=1}^{N} L_i = \sum_{i=1}^{N} \left\{ -\sum_{k=0}^{K-1} r_{i,k} \log p_{i,k} \right\}$,
where $r_{i,k} = 1$ if $y_i = k$ and 0 otherwise, so that $\sum_{k=0}^{K-1} r_{i,k} = 1$.

29. Derivatives used in LogitBoost and MART:
$\frac{\partial L_i}{\partial F_{i,k}} = -(r_{i,k} - p_{i,k})$, $\frac{\partial^2 L_i}{\partial F_{i,k}^2} = p_{i,k}(1 - p_{i,k})$,
which could be derived without imposing any constraints on $F_k$.

30. Singularity of the Hessian without the Sum-to-Zero Constraint
Without the sum-to-zero constraint $\sum_{k=0}^{K-1} F_{i,k} = 0$:
$\frac{\partial L_i}{\partial F_{i,k}} = -(r_{i,k} - p_{i,k})$, $\frac{\partial^2 L_i}{\partial F_{i,k}^2} = p_{i,k}(1 - p_{i,k})$.
For example, when $K = 3$,
$\begin{vmatrix} \frac{\partial^2 L_i}{\partial F_{i,0}^2} & \frac{\partial^2 L_i}{\partial F_{i,0}\partial F_{i,1}} & \frac{\partial^2 L_i}{\partial F_{i,0}\partial F_{i,2}} \\ \frac{\partial^2 L_i}{\partial F_{i,1}\partial F_{i,0}} & \frac{\partial^2 L_i}{\partial F_{i,1}^2} & \frac{\partial^2 L_i}{\partial F_{i,1}\partial F_{i,2}} \\ \frac{\partial^2 L_i}{\partial F_{i,2}\partial F_{i,0}} & \frac{\partial^2 L_i}{\partial F_{i,2}\partial F_{i,1}} & \frac{\partial^2 L_i}{\partial F_{i,2}^2} \end{vmatrix} = \begin{vmatrix} p_0(1-p_0) & -p_0 p_1 & -p_0 p_2 \\ -p_1 p_0 & p_1(1-p_1) & -p_1 p_2 \\ -p_2 p_0 & -p_2 p_1 & p_2(1-p_2) \end{vmatrix} = 0$.
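This singularity is easy to verify numerically: the unconstrained Hessian of $L_i$ is $\mathrm{diag}(p) - p p^T$, and the all-ones vector lies in its null space (a small check of our own):

import numpy as np

p = np.array([0.2, 0.3, 0.5])        # any probability vector, K = 3
H = np.diag(p) - np.outer(p, p)      # H_kk = p_k(1-p_k), H_kj = -p_k p_j
print(np.linalg.det(H))              # ~0: the unconstrained Hessian is singular
print(H @ np.ones(3))                # the all-ones vector is in the null space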

31. Diagonal Approximation
(Friedman et al. 2000 and Friedman 2001) used a diagonal approximation of the Hessian; for $K = 3$:
$\frac{K-1}{K}\,\mathrm{diag}\!\left(\frac{\partial^2 L_i}{\partial F_{i,0}^2},\ \frac{\partial^2 L_i}{\partial F_{i,1}^2},\ \frac{\partial^2 L_i}{\partial F_{i,2}^2}\right) = \frac{K-1}{K}\,\mathrm{diag}\big(p_0(1-p_0),\ p_1(1-p_1),\ p_2(1-p_2)\big)$.

32. Derivatives Under the Sum-to-Zero Constraint
The loss function: $L_i = -\sum_{k=0}^{K-1} r_{i,k} \log p_{i,k}$.
The probability model and sum-to-zero constraint: $p_{i,k} = \frac{e^{F_{i,k}}}{\sum_{s=0}^{K-1} e^{F_{i,s}}}$, $\sum_{k=0}^{K-1} F_{i,k} = 0$.
Without loss of generality, we assume $k = 0$ is the base class: $F_{i,0} = -\sum_{k=1}^{K-1} F_{i,k}$.

33. New derivatives:
$\frac{\partial L_i}{\partial F_{i,k}} = (r_{i,0} - p_{i,0}) - (r_{i,k} - p_{i,k})$,
$\frac{\partial^2 L_i}{\partial F_{i,k}^2} = p_{i,0}(1 - p_{i,0}) + p_{i,k}(1 - p_{i,k}) + 2 p_{i,0} p_{i,k}$.
MART and LogitBoost used:
$\frac{\partial L_i}{\partial F_{i,k}} = -(r_{i,k} - p_{i,k})$, $\frac{\partial^2 L_i}{\partial F_{i,k}^2} = p_{i,k}(1 - p_{i,k})$.
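A small helper (ours, mirroring slide 33) that evaluates these base-class derivatives for a generic base class b:

import numpy as np

def abc_derivatives(r, p, k, b):
    """First and second derivatives w.r.t. F_{i,k} when class b is the base
    class (k != b):
        dL/dF_k    = (r_b - p_b) - (r_k - p_k)
        d2L/dF_k^2 = p_b(1-p_b) + p_k(1-p_k) + 2 p_b p_k
    r, p: (N, K) arrays of label indicators and class probabilities."""
    g = (r[:, b] - p[:, b]) - (r[:, k] - p[:, k])
    h = p[:, b] * (1 - p[:, b]) + p[:, k] * (1 - p[:, k]) + 2 * p[:, b] * p[:, k]
    return g, h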

34. Assume class 0 is the base class. The determinant of the Hessian is
$\begin{vmatrix} \frac{\partial^2 L_i}{\partial F_{i,1}^2} & \frac{\partial^2 L_i}{\partial F_{i,1}\partial F_{i,2}} \\ \frac{\partial^2 L_i}{\partial F_{i,2}\partial F_{i,1}} & \frac{\partial^2 L_i}{\partial F_{i,2}^2} \end{vmatrix} = \begin{vmatrix} p_0(1-p_0) + p_1(1-p_1) + 2p_0 p_1 & p_0 - p_0^2 + p_0 p_1 + p_0 p_2 - p_1 p_2 \\ p_0 - p_0^2 + p_0 p_1 + p_0 p_2 - p_1 p_2 & p_0(1-p_0) + p_2(1-p_2) + 2p_0 p_2 \end{vmatrix}$
$= p_0 p_1 + p_0 p_2 + p_1 p_2 - p_0 p_1^2 - p_0 p_2^2 - p_1 p_2^2 - p_2 p_1^2 - p_1 p_0^2 - p_2 p_0^2 + 6 p_0 p_1 p_2$,
which is independent of the choice of the base class.
However, a diagonal approximation appears to be a must when using trees. Thus the choice of base class matters.
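A quick numerical sanity check of our own that the determinant is the same for every choice of the base class and matches the expanded expression above:

import numpy as np

def abc_hessian_det(p, b):
    """Determinant of the 2x2 Hessian (K = 3) when class b is the base class,
    assembled from the entries on slide 34."""
    k1, k2 = [k for k in range(3) if k != b]
    d1 = p[b]*(1 - p[b]) + p[k1]*(1 - p[k1]) + 2*p[b]*p[k1]
    d2 = p[b]*(1 - p[b]) + p[k2]*(1 - p[k2]) + 2*p[b]*p[k2]
    off = p[b] - p[b]**2 + p[b]*p[k1] + p[b]*p[k2] - p[k1]*p[k2]
    return d1 * d2 - off**2

p = np.array([0.2, 0.3, 0.5])
p0, p1, p2 = p
expanded = (p0*p1 + p0*p2 + p1*p2 - p0*p1**2 - p0*p2**2 - p1*p2**2
            - p2*p1**2 - p1*p0**2 - p2*p0**2 + 6*p0*p1*p2)
print([round(abc_hessian_det(p, b), 10) for b in range(3)])  # identical for each base class
print(round(expanded, 10))                                   # matches the value above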

35. Adaptive Base Class Boost (ABC-Boost)
Two key ideas of ABC-Boost:
1. Formulate the multi-class boosting algorithm by considering a base class. One does not have to train for the base class, which is inferred from the sum-to-zero constraint $\sum_{k=0}^{K-1} F_{i,k} = 0$.
2. At each boosting step, adaptively choose the base class.

36. How to choose the base class?
We should choose the base class based on performance (training loss). How?
(One idea) Exhaustively search over all K base classes and choose the base class that leads to the best performance (smallest training loss), as sketched below.
This is computationally expensive (but not too bad, unless K is really large), and good performance can be achieved.
Many other ideas are possible.
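A sketch of the exhaustive-search idea; train_one_boosting_step is a hypothetical callable standing in for one full boosting iteration with a given base class:

def choose_base_class(train_one_boosting_step, K):
    """Exhaustive-search base-class selection (one of the ideas on this slide).
    `train_one_boosting_step(b)` runs one boosting iteration with base class b
    and returns the resulting training loss; the base class with the smallest
    training loss wins."""
    losses = [train_one_boosting_step(b) for b in range(K)]
    return min(range(K), key=lambda b: losses[b])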

37. ABC-MART = ABC-Boost + MART.
ABC-LogitBoost = ABC-Boost + Robust LogitBoost.

38. The Original MART Algorithm
1: $F_{i,k} = 0$, $p_{i,k} = \frac{1}{K}$, $k = 0$ to $K-1$, $i = 1$ to $N$
2: For $m = 1$ to $M$ Do
3:   For $k = 0$ to $K-1$ Do
4:     $\{R_{j,k,m}\}_{j=1}^{J}$ = $J$-terminal-node regression tree from $\{r_{i,k} - p_{i,k},\ \mathbf{x}_i\}_{i=1}^{N}$
5:     $\beta_{j,k,m} = \frac{K-1}{K} \frac{\sum_{\mathbf{x}_i \in R_{j,k,m}} r_{i,k} - p_{i,k}}{\sum_{\mathbf{x}_i \in R_{j,k,m}} (1 - p_{i,k}) p_{i,k}}$
6:     $F_{i,k} = F_{i,k} + \nu \sum_{j=1}^{J} \beta_{j,k,m} 1_{\mathbf{x}_i \in R_{j,k,m}}$
7:   End
8:   $p_{i,k} = \exp(F_{i,k}) / \sum_{s=0}^{K-1} \exp(F_{i,s})$, $k = 0$ to $K-1$, $i = 1$ to $N$
9: End

39. ABC-MART
1: $F_{i,k} = 0$, $k = 0$ to $K-1$, $i = 1$ to $N$
2: For $m = 1$ to $M$ Do
     For $b = 0$ to $K-1$ Do
3:     For $k = 0$ to $K-1$ (and $k \neq b$) Do
4:       $\{R_{j,k,m}\}_{j=1}^{J}$ = $J$-terminal-node tree from $\{(r_{i,k} - p_{i,k}) - (r_{i,b} - p_{i,b}),\ \mathbf{x}_i\}_{i=1}^{N}$
5:       $\beta_{j,k,m} = \frac{\sum_{\mathbf{x}_i \in R_{j,k,m}} (r_{i,k} - p_{i,k}) - (r_{i,b} - p_{i,b})}{\sum_{\mathbf{x}_i \in R_{j,k,m}} (1 - p_{i,k}) p_{i,k} + (1 - p_{i,b}) p_{i,b} + 2 p_{i,k} p_{i,b}}$
6:       $F_{i,k} = F_{i,k} + \nu \sum_{j=1}^{J} \beta_{j,k,m} 1_{\mathbf{x}_i \in R_{j,k,m}}$
7:     End
8:     $p_{i,k} = \exp(F_{i,k}) / \sum_{s=0}^{K-1} \exp(F_{i,s})$
9: End
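Below is an illustrative sketch of one ABC-MART round for a fixed base class b, using scikit-learn trees in place of the authors' tree implementation; the exhaustive base-class search in the algorithm above would call this once per candidate b (on a copy of F) and keep the base class with the smallest training loss.

import numpy as np
from sklearn.tree import DecisionTreeRegressor

def abc_mart_round(X, F, r, b, J=20, nu=0.1):
    """One ABC-MART boosting round for base class b (a sketch of Lines 3-8).
    For every k != b a J-leaf regression tree is fit to the ABC residual
    (r_k - p_k) - (r_b - p_b); terminal-node values follow the one-step
    Newton update with the ABC second derivative, and F_b is recovered from
    the sum-to-zero constraint.  r: (N, K) one-hot label indicators."""
    F = F.copy()
    N, K = F.shape
    p = np.exp(F - F.max(axis=1, keepdims=True))
    p /= p.sum(axis=1, keepdims=True)
    for k in range(K):
        if k == b:
            continue
        resid = (r[:, k] - p[:, k]) - (r[:, b] - p[:, b])
        hess = (p[:, k] * (1 - p[:, k]) + p[:, b] * (1 - p[:, b])
                + 2 * p[:, k] * p[:, b])
        tree = DecisionTreeRegressor(max_leaf_nodes=J)
        tree.fit(X, resid)
        leaf = tree.apply(X)                      # terminal node of each sample
        for node in np.unique(leaf):
            mask = leaf == node
            beta = resid[mask].sum() / max(hess[mask].sum(), 1e-12)
            F[mask, k] += nu * beta
    F[:, b] = -np.sum(np.delete(F, b, axis=1), axis=1)   # sum-to-zero constraint
    return F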

40. Datasets
UCI-Covertype: 581,012 samples in total. Two datasets were generated: Covertype290k and Covertype145k.
UCI-Poker: originally 25,010 training samples and 1 million test samples. Datasets: Poker25kT1, Poker25kT2, Poker525k, Poker275k, Poker150k, Poker100k.
MNIST: originally 60,000 training samples and 10,000 test samples. MNIST10k swapped the training with the test samples.
Many variations of MNIST: the original MNIST is a well-known easy problem. ( lisa/twiki/bin/view.cgi/public/DeepVsShallowComparisonICML2007) created a variety of much more difficult datasets by adding various background (correlated) noise, background images, rotations, etc.
UCI-Letter: 20,000 samples in total.

41. Dataset Summary
[Table: for each dataset (Covertype290k, Covertype145k, Poker525k, Poker275k, Poker150k, Poker100k, Poker25kT1, Poker25kT2, Mnist10k, M-Basic, M-Rotate, M-Image, M-Rand, M-RotImg, M-Noise1 through M-Noise6, Letter15k, Letter4k, Letter2k): the number of classes K, # training samples, # test samples, and # features.]

42. Summary of Test Misclassification Errors
[Table: test misclassification errors of mart, abc-mart, (robust) logitboost, abc-logitboost, and logistic regression, together with the number of test samples, on each dataset from Covertype290k through Letter2k.]

43. P-Values
P1: for testing if abc-mart has significantly lower error rates than mart.
P2: for testing if (robust) logitboost has significantly lower error rates than mart.
P3: for testing if abc-logitboost has significantly lower error rates than abc-mart.
P4: for testing if abc-logitboost has significantly lower error rates than (robust) logitboost.
P-values are computed using binomial distributions and normal approximations.

44. Summary of Test P-Values
[Table: the P-values P1 through P4 for each dataset from Covertype290k through Letter2k.]

45. Comparisons with SVM and Deep Learning
Datasets: M-Noise1 to M-Noise6. Results for SVM, Neural Nets, and Deep Learning are from lisa/twiki/bin/view.cgi/public/DeepVsShallowComparisonICML2007.
[Figure: two panels of error rate (%) vs. degree of correlation; one panel shows SAA-3, DBN-3, and SVM-rbf.]

46. Comparisons with SVM and Deep Learning
Datasets: M-Noise1 to M-Noise6.
[Figure: two panels of error rate (%) vs. degree of correlation; one panel shows SAA-3, DBN-3, and SVM-rbf.]

47. More Comparisons with SVM and Deep Learning

                 M-Basic   M-Rotate   M-Image   M-Rand   M-RotImg
SVM-RBF          3.05%     11.11%     22.61%    14.58%   55.18%
SVM-POLY         3.69%     15.42%     24.01%    16.62%   56.41%
NNET             4.69%     18.11%     27.41%    20.04%   62.16%
DBN-3                      10.30%     16.31%     6.73%   47.39%
SAA-3                      10.30%     23.00%    11.28%   51.93%
DBN-1                      14.69%     16.15%     9.80%   52.21%
mart             4.12%     15.35%     11.64%    13.15%   49.82%
abc-mart         3.69%     13.27%      9.45%    10.60%   46.14%
logitboost       3.45%     13.63%      9.41%    10.04%   45.92%
abc-logitboost   3.20%     11.92%      8.54%     9.45%   44.69%

48. Training Loss vs. Boosting Iterations
[Figure: training loss vs. boosting iterations on Covertype290k (J = 20) and Poker525k (J = 20).]

49. Test Errors vs. Boosting Iterations
[Figure: test errors vs. boosting iterations on Mnist10k (J = 20) and Letter15k (J = 20).]

50. [Figure: test errors vs. boosting iterations on Letter4k (J = 20) and Letter2k (J = 20).]

51. [Figure: test errors vs. boosting iterations on Covertype290k, Covertype145k, Poker525k, and Poker275k (J = 20).]

52. [Figure: test errors vs. boosting iterations on Poker150k (J = 20), Poker100k (J = 20), Poker25kT1 (J = 6), and Poker25kT2 (J = 6).]

53. Detailed Experiment Results on Mnist10k
$J \in \{4, 6, 8, 10, 12, 14, 16, 18, 20, 24, 30, 40, 50\}$, $\nu \in \{0.04, 0.06, 0.08, 0.1\}$.
The goal is to show the improvements at all reasonable combinations of $J$ and $\nu$.

54. [Table: test errors of mart and abc-mart on Mnist10k for every combination of $J \in \{4, \ldots, 50\}$ and $\nu \in \{0.04, 0.06, 0.08, 0.1\}$.]

55. [Table: test errors of (robust) logitboost and abc-logitboost on Mnist10k for every combination of $J$ and $\nu$.]

56. [Table: the P1 values on Mnist10k for every combination of $J$ and $\nu$.]

57. [Table: the P2 values on Mnist10k for every combination of $J$ and $\nu$.]

58. [Table: the P3 values on Mnist10k for every combination of $J$ and $\nu$.]

59. [Table: the P4 values on Mnist10k for every combination of $J$ and $\nu$.]

60. [Figure: test errors vs. boosting iterations on Mnist10k for J = 4, 6, 8 (nine panels over three values of ν).]

61. [Figure: test errors vs. boosting iterations on Mnist10k for J = 10, 12, 14 (nine panels over three values of ν).]

62. [Figure: test errors vs. boosting iterations on Mnist10k for J = 16, 18, 20 (nine panels over three values of ν).]

63. [Figure: test errors vs. boosting iterations on Mnist10k for J = 24, 30, 40 (nine panels over three values of ν).]

64. [Figure: test errors vs. boosting iterations on Mnist10k for J = 50 (three panels over three values of ν).]
Want to See More? See the paper: ABC-Boost: Adaptive Base Class Boost for Multi-class Classification, Ping Li, Department of Statistical Science, Cornell University.
