ABC-LogitBoost for Multi-Class Classification


1. ABC-LogitBoost for Multi-Class Classification
Ping Li, Department of Statistical Science, Cornell University
(ABC-Boost lecture, BTRY 6520, Fall)

2. What is Classification? An Example: USPS Handwritten Zipcode Recognition
[Figure: handwritten digit samples from three different writers (Person 1, Person 2, Person 3).]
The task: teach the machine to automatically recognize the 10 digits.

3. Multi-Class Classification
Given a training data set $\{y_i, X_i\}_{i=1}^N$, $X_i \in \mathbb{R}^p$, $y_i \in \{0, 1, 2, \ldots, K-1\}$, the task is to learn a function to predict the class label $y_i$ from $X_i$.
K = 2: binary classification. K > 2: multi-class classification.
Many important practical problems can be cast as (multi-class) classification. For example: Li, Burges, and Wu, NIPS 2007, Learning to Rank Using Multiple Classification and Gradient Boosting.

4. Logistic Regression for Classification
First learn the class probabilities
$\hat{p}_k = \Pr\{y = k \mid X\}$, $k = 0, 1, \ldots, K-1$, with $\sum_{k=0}^{K-1} \hat{p}_k = 1$ (only $K-1$ degrees of freedom).
Then assign the class label according to $\hat{y}_X = \operatorname{argmax}_k \hat{p}_k$.

5. Multinomial Logit Probability Model
$p_k = \frac{e^{F_k}}{\sum_{s=0}^{K-1} e^{F_s}}$,
where $F_k = F_k(\mathbf{x})$ is the function to be learned from the data.
Classical logistic regression: $F(\mathbf{x}) = \beta^T \mathbf{x}$. The task is to learn the coefficients $\beta$.
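To make the model concrete, here is a minimal NumPy sketch of the multinomial logit map from the scores $F_k$ to the probabilities $p_k$; the function name class_probabilities and the max-subtraction trick are our own illustration, not from the lecture.

import numpy as np

def class_probabilities(F):
    """Multinomial logit: p_k = exp(F_k) / sum_s exp(F_s).
    F: array of shape (N, K) holding F_k(x_i) for each sample i and class k.
    Subtracting the row-wise max keeps exp() from overflowing; it does not
    change p_k because the model is invariant to adding a constant to all F_k."""
    F = F - F.max(axis=1, keepdims=True)
    e = np.exp(F)
    return e / e.sum(axis=1, keepdims=True)

# Example: 2 samples, K = 3 classes
F = np.array([[0.5, -1.0, 0.2],
              [2.0,  0.0, 0.0]])
p = class_probabilities(F)
print(p)                    # each row sums to 1
print(p.argmax(axis=1))     # predicted labels, as on slide 4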

6. Flexible additive modeling:
$F(\mathbf{x}) = F^{(M)}(\mathbf{x}) = \sum_{m=1}^{M} \rho_m h(\mathbf{x}; \mathbf{a}_m)$,
where $h(\mathbf{x}; \mathbf{a})$ is a pre-specified function (e.g., trees). The task is to learn the parameters $\rho_m$ and $\mathbf{a}_m$.
Both LogitBoost (Friedman et al., 2000) and MART (Multiple Additive Regression Trees, Friedman 2001) adopted this model.

7. Learning Logistic Regression by Maximum Likelihood
Seek $F_{i,k}$ to maximize the multinomial likelihood. Suppose $y_i = k$; then
$\text{Lik} \propto p_{i,0}^{0} \cdots p_{i,k}^{1} \cdots p_{i,K-1}^{0} = p_{i,k}$,
or equivalently, maximize the log-likelihood $\log \text{Lik} \propto \log p_{i,k}$,
or equivalently, minimize the negative log-likelihood loss $L_i = -\log p_{i,k}$ (for $y_i = k$).

8. The Negative Log-Likelihood Loss
$L = \sum_{i=1}^{N} L_i = \sum_{i=1}^{N} \left\{ -\sum_{k=0}^{K-1} r_{i,k} \log p_{i,k} \right\}$,
where $r_{i,k} = 1$ if $y_i = k$ and $r_{i,k} = 0$ otherwise.

9. Two Basic Optimization Methods for Maximum Likelihood
1. Newton's method: uses the first and second derivatives of the loss function. This is the method in LogitBoost.
2. Gradient descent: only uses the first-order derivative of the loss function.
MART used a creative combination of gradient descent and Newton's method.

10. Derivatives Used in LogitBoost and MART
The loss function: $L = \sum_{i=1}^{N} L_i = \sum_{i=1}^{N} \left\{ -\sum_{k=0}^{K-1} r_{i,k} \log p_{i,k} \right\}$.
The first derivative: $\frac{\partial L_i}{\partial F_{i,k}} = -(r_{i,k} - p_{i,k})$.
The second derivative: $\frac{\partial^2 L_i}{\partial F_{i,k}^2} = p_{i,k}(1 - p_{i,k})$.
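A small illustrative helper (ours, not the authors' code) that evaluates the negative log-likelihood together with the diagonal first and second derivatives above:

import numpy as np

def nll_and_diagonal_derivatives(F, y):
    """Negative multinomial log-likelihood and the diagonal derivatives
    used by LogitBoost / MART:
        dL_i/dF_{i,k}    = -(r_{i,k} - p_{i,k})
        d2L_i/dF_{i,k}^2 =  p_{i,k} (1 - p_{i,k})
    F: (N, K) scores; y: (N,) integer labels in {0, ..., K-1}."""
    N, K = F.shape
    p = np.exp(F - F.max(axis=1, keepdims=True))
    p /= p.sum(axis=1, keepdims=True)
    r = np.zeros((N, K))
    r[np.arange(N), y] = 1.0            # r_{i,k} = 1 iff y_i = k
    loss = -np.sum(r * np.log(p))       # L = -sum_i log p_{i, y_i}
    grad = -(r - p)                     # first derivative
    hess = p * (1.0 - p)                # second derivative (diagonal)
    return loss, grad, hess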

11. The Original LogitBoost Algorithm
1: $F_{i,k} = 0$, $p_{i,k} = \frac{1}{K}$, $k = 0$ to $K-1$, $i = 1$ to $N$
2: For $m = 1$ to $M$ Do
3:   For $k = 0$ to $K-1$ Do
4:     $w_{i,k} = p_{i,k}(1 - p_{i,k})$, $z_{i,k} = \frac{r_{i,k} - p_{i,k}}{p_{i,k}(1 - p_{i,k})}$
5:     Fit the function $f_{i,k}$ by a weighted least-squares of $z_{i,k}$ to $\mathbf{x}_i$ with weights $w_{i,k}$
6:     $F_{i,k} = F_{i,k} + \nu \frac{K-1}{K}\left( f_{i,k} - \frac{1}{K}\sum_{k=0}^{K-1} f_{i,k} \right)$
7:   End
8:   $p_{i,k} = \exp(F_{i,k}) / \sum_{s=0}^{K-1} \exp(F_{i,s})$, $k = 0$ to $K-1$, $i = 1$ to $N$
9: End

12. The Original MART Algorithm
1: $F_{i,k} = 0$, $p_{i,k} = \frac{1}{K}$, $k = 0$ to $K-1$, $i = 1$ to $N$
2: For $m = 1$ to $M$ Do
3:   For $k = 0$ to $K-1$ Do
4:     $\{R_{j,k,m}\}_{j=1}^{J}$ = $J$-terminal-node regression tree from $\{r_{i,k} - p_{i,k},\ \mathbf{x}_i\}_{i=1}^{N}$
5:     $\beta_{j,k,m} = \frac{K-1}{K} \frac{\sum_{\mathbf{x}_i \in R_{j,k,m}} r_{i,k} - p_{i,k}}{\sum_{\mathbf{x}_i \in R_{j,k,m}} (1 - p_{i,k}) p_{i,k}}$
6:     $F_{i,k} = F_{i,k} + \nu \sum_{j=1}^{J} \beta_{j,k,m} 1_{\mathbf{x}_i \in R_{j,k,m}}$
7:   End
8:   $p_{i,k} = \exp(F_{i,k}) / \sum_{s=0}^{K-1} \exp(F_{i,s})$, $k = 0$ to $K-1$, $i = 1$ to $N$
9: End

13. Comparing LogitBoost with MART
LogitBoost used first and second derivatives to construct the trees. MART only used the first-order information to construct the trees. Both used second-order information to update the values of the terminal nodes.
LogitBoost was believed to have numerical instability problems.

14. The Numerical Issue in LogitBoost
4:     $w_{i,k} = p_{i,k}(1 - p_{i,k})$, $z_{i,k} = \frac{r_{i,k} - p_{i,k}}{p_{i,k}(1 - p_{i,k})}$
5:     Fit the function $f_{i,k}$ by a weighted least-squares of $z_{i,k}$ to $\mathbf{x}_i$ with weights $w_{i,k}$
6:     $F_{i,k} = F_{i,k} + \nu \frac{K-1}{K}\left( f_{i,k} - \frac{1}{K}\sum_{k=0}^{K-1} f_{i,k} \right)$
The instability issue: when $p_{i,k}$ is close to 0 or 1, the response $z_{i,k} = \frac{r_{i,k} - p_{i,k}}{p_{i,k}(1 - p_{i,k})}$ may approach infinity.

15. The Numerical Issue in LogitBoost
(Friedman et al., 2000) used regression trees and suggested some crucial implementation protections:
In Line 4, compute $z_{i,k}$ by $\frac{1}{p_{i,k}}$ (if $r_{i,k} = 1$) or $\frac{-1}{1 - p_{i,k}}$ (if $r_{i,k} = 0$).
Bound $|z_{i,k}|$ by $z_{\max} \in [2, 4]$.
Robust LogitBoost avoids this pointwise thresholding and is essentially free of numerical problems. It turns out the numerical issue does not really exist, especially when the trees are not too large.

16. Tree-Splitting Using the Second-Order Information
Feature values: $x_i$, $i = 1$ to $N$; assume $x_1 \le x_2 \le \ldots \le x_N$.
Weight values: $w_i$, $i = 1$ to $N$. Response values: $z_i$, $i = 1$ to $N$.
We seek the index $s$, $1 \le s < N$, to maximize the gain of weighted SE:
$\mathrm{Gain}(s) = SE_T - (SE_L + SE_R) = \sum_{i=1}^{N} (z_i - \bar{z})^2 w_i - \left[ \sum_{i=1}^{s} (z_i - \bar{z}_L)^2 w_i + \sum_{i=s+1}^{N} (z_i - \bar{z}_R)^2 w_i \right]$,
where $\bar{z} = \frac{\sum_{i=1}^{N} z_i w_i}{\sum_{i=1}^{N} w_i}$, $\bar{z}_L = \frac{\sum_{i=1}^{s} z_i w_i}{\sum_{i=1}^{s} w_i}$, $\bar{z}_R = \frac{\sum_{i=s+1}^{N} z_i w_i}{\sum_{i=s+1}^{N} w_i}$.

17. After simplification, we obtain
$\mathrm{Gain}(s) = \frac{\left[\sum_{i=1}^{s} z_i w_i\right]^2}{\sum_{i=1}^{s} w_i} + \frac{\left[\sum_{i=s+1}^{N} z_i w_i\right]^2}{\sum_{i=s+1}^{N} w_i} - \frac{\left[\sum_{i=1}^{N} z_i w_i\right]^2}{\sum_{i=1}^{N} w_i}$
$= \frac{\left[\sum_{i=1}^{s} (r_{i,k} - p_{i,k})\right]^2}{\sum_{i=1}^{s} p_{i,k}(1 - p_{i,k})} + \frac{\left[\sum_{i=s+1}^{N} (r_{i,k} - p_{i,k})\right]^2}{\sum_{i=s+1}^{N} p_{i,k}(1 - p_{i,k})} - \frac{\left[\sum_{i=1}^{N} (r_{i,k} - p_{i,k})\right]^2}{\sum_{i=1}^{N} p_{i,k}(1 - p_{i,k})}$,
recalling $w_i = p_{i,k}(1 - p_{i,k})$ and $z_i = \frac{r_{i,k} - p_{i,k}}{p_{i,k}(1 - p_{i,k})}$.
This procedure is numerically stable.
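The simplified form lends itself to a prefix-sum implementation. The sketch below is our own (assuming strictly positive weights and samples already sorted by the feature being split on); it evaluates the second-order gain at every split point:

import numpy as np

def best_split_gain(g, w):
    """Second-order split gain of (Robust) LogitBoost (slide 17):
        Gain(s) = [sum_{i<=s} g_i]^2 / sum_{i<=s} w_i
                + [sum_{i>s}  g_i]^2 / sum_{i>s}  w_i
                - [sum_i      g_i]^2 / sum_i      w_i,
    with g_i = r_{i,k} - p_{i,k} and w_i = p_{i,k}(1 - p_{i,k}); there is no
    per-sample division by p(1-p), which is why it is numerically stable.
    g, w: 1-D arrays sorted by the feature value.
    Returns (best_gain, s): the split puts samples 0..s in the left node."""
    cum_g, cum_w = np.cumsum(g), np.cumsum(w)
    tot_g, tot_w = cum_g[-1], cum_w[-1]
    left = cum_g[:-1] ** 2 / cum_w[:-1]
    right = (tot_g - cum_g[:-1]) ** 2 / (tot_w - cum_w[:-1])
    gain = left + right - tot_g ** 2 / tot_w
    s = int(np.argmax(gain))
    return float(gain[s]), s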

18. MART only used the first-order information to construct the trees:
$\mathrm{MARTGain}(s) = \frac{1}{s}\left[\sum_{i=1}^{s} (r_{i,k} - p_{i,k})\right]^2 + \frac{1}{N-s}\left[\sum_{i=s+1}^{N} (r_{i,k} - p_{i,k})\right]^2 - \frac{1}{N}\left[\sum_{i=1}^{N} (r_{i,k} - p_{i,k})\right]^2$,
which can also be derived by letting the weights $w_{i,k} = 1$ and the response $z_{i,k} = r_{i,k} - p_{i,k}$.
LogitBoost used more information and could be more accurate on many datasets.

19. Update Terminal Node Values
In LogitBoost: $\frac{\sum_{\text{node}} z_{i,k} w_{i,k}}{\sum_{\text{node}} w_{i,k}} = \frac{\sum_{\text{node}} (r_{i,k} - p_{i,k})}{\sum_{\text{node}} p_{i,k}(1 - p_{i,k})}$,
which is the same as in MART.

20. Robust LogitBoost
1: $F_{i,k} = 0$, $p_{i,k} = \frac{1}{K}$, $k = 0$ to $K-1$, $i = 1$ to $N$
2: For $m = 1$ to $M$ Do
3:   For $k = 0$ to $K-1$ Do
4:     $\{R_{j,k,m}\}_{j=1}^{J}$ = $J$-terminal-node regression tree from $\{r_{i,k} - p_{i,k},\ \mathbf{x}_i\}_{i=1}^{N}$, with weights $p_{i,k}(1 - p_{i,k})$
5:     $\beta_{j,k,m} = \frac{K-1}{K} \frac{\sum_{\mathbf{x}_i \in R_{j,k,m}} r_{i,k} - p_{i,k}}{\sum_{\mathbf{x}_i \in R_{j,k,m}} (1 - p_{i,k}) p_{i,k}}$
6:     $F_{i,k} = F_{i,k} + \nu \sum_{j=1}^{J} \beta_{j,k,m} 1_{\mathbf{x}_i \in R_{j,k,m}}$
7:   End
8:   $p_{i,k} = \exp(F_{i,k}) / \sum_{s=0}^{K-1} \exp(F_{i,s})$, $k = 0$ to $K-1$, $i = 1$ to $N$
9: End
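For intuition, here is a compact sketch of the Robust LogitBoost loop using scikit-learn regression trees as the weighted least-squares fitter. It is an illustration under our own simplifications (weights are clipped away from zero, and each leaf's value is the weighted mean produced by the tree, which equals the ratio in Line 5 up to the $(K-1)/K$ factor applied in the update); it is not the authors' implementation.

import numpy as np
from sklearn.tree import DecisionTreeRegressor

def robust_logitboost(X, y, K, M=100, J=20, nu=0.1):
    """Minimal sketch of the Robust LogitBoost loop. Each class-k tree is a
    weighted least-squares fit of z = (r - p) / (p(1-p)) with weights
    w = p(1-p); such a tree splits on the second-order gain of slide 17 and
    its leaf values equal sum(r - p) / sum(p(1-p)), matching Lines 4-6."""
    N = X.shape[0]
    F = np.zeros((N, K))
    r = np.zeros((N, K))
    r[np.arange(N), y] = 1.0
    trees = []
    for m in range(M):
        # probabilities from the current scores (Line 8 of the previous round)
        p = np.exp(F - F.max(axis=1, keepdims=True))
        p /= p.sum(axis=1, keepdims=True)
        round_trees = []
        for k in range(K):
            w = np.clip(p[:, k] * (1.0 - p[:, k]), 1e-12, None)
            z = (r[:, k] - p[:, k]) / w
            tree = DecisionTreeRegressor(max_leaf_nodes=J)
            tree.fit(X, z, sample_weight=w)
            F[:, k] += nu * (K - 1.0) / K * tree.predict(X)
            round_trees.append(tree)
        trees.append(round_trees)
    return trees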

21. Experiments on Binary Classification
(Multi-class classification is even more interesting!)
Data:
IJCNN1: this dataset was used in a competition; LibSVM was the winner.
Forest100k and Forest521k: the two largest datasets from Bordes et al., JMLR 2005, Fast Kernel Classifiers with Online and Active Learning.

22. IJCNN1 Test Errors
[Figure: test misclassification error vs. boosting iterations ($J = 20$), comparing MART, LibSVM, and Robust LogitBoost.]

23. Forest100k Test Errors
[Figure: test misclassification error vs. boosting iterations ($J = 20$, $\nu = 0.1$), comparing MART, SVM, and Robust LogitBoost.]

24. Forest521k Test Errors
[Figure: test misclassification error vs. boosting iterations ($J = 20$), comparing SVM, MART, and Robust LogitBoost.]

25. ABC-Boost for Multi-Class Classification
ABC = Adaptive Base Class
ABC-MART = ABC-Boost + MART
ABC-LogitBoost = ABC-Boost + (Robust) LogitBoost
The key to the success of ABC-Boost is the use of better derivatives.

26. Review: Components of Logistic Regression
The multinomial probability model: $p_k = \frac{e^{F_k}}{\sum_{s=0}^{K-1} e^{F_s}}$, $\sum_{k=0}^{K-1} p_k = 1$, where $F_k = F_k(\mathbf{x})$ is the function to be learned from the data.
The sum-to-zero constraint $\sum_{k=0}^{K-1} F_k(\mathbf{x}) = 0$ is commonly used to obtain a unique solution (only $K-1$ degrees of freedom).

27. Why the sum-to-zero constraint?
$\frac{e^{F_{i,k} + C}}{\sum_{s=0}^{K-1} e^{F_{i,s} + C}} = \frac{e^{C} e^{F_{i,k}}}{e^{C} \sum_{s=0}^{K-1} e^{F_{i,s}}} = \frac{e^{F_{i,k}}}{\sum_{s=0}^{K-1} e^{F_{i,s}}} = p_{i,k}$.
For identifiability, one should impose a constraint. One popular choice is to assume $\sum_{k=0}^{K-1} F_{i,k} = \text{const}$, equivalent to $\sum_{k=0}^{K-1} F_{i,k} = 0$.
This is the assumption used in many papers, including LogitBoost and MART.
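A tiny numerical check of this shift invariance (illustrative only; the constant C is arbitrary):

import numpy as np

def softmax(F):
    e = np.exp(F - F.max())
    return e / e.sum()

F = np.array([1.0, -0.5, 2.0])
C = 7.3                                          # any constant shift
print(np.allclose(softmax(F), softmax(F + C)))   # True: p_k is unchanged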

28. The negative log-likelihood loss
$L = \sum_{i=1}^{N} L_i = \sum_{i=1}^{N} \left\{ -\sum_{k=0}^{K-1} r_{i,k} \log p_{i,k} \right\}$,
where $r_{i,k} = 1$ if $y_i = k$ and 0 otherwise, so that $\sum_{k=0}^{K-1} r_{i,k} = 1$.

29. Derivatives used in LogitBoost and MART:
$\frac{\partial L_i}{\partial F_{i,k}} = -(r_{i,k} - p_{i,k})$, $\frac{\partial^2 L_i}{\partial F_{i,k}^2} = p_{i,k}(1 - p_{i,k})$,
which could be derived without imposing any constraints on $F_k$.

30. Singularity of the Hessian without the Sum-to-Zero Constraint
Without the sum-to-zero constraint $\sum_{k=0}^{K-1} F_{i,k} = 0$:
$\frac{\partial L_i}{\partial F_{i,k}} = -(r_{i,k} - p_{i,k})$, $\frac{\partial^2 L_i}{\partial F_{i,k}^2} = p_{i,k}(1 - p_{i,k})$.
For example, when $K = 3$,
$\begin{vmatrix} \frac{\partial^2 L_i}{\partial F_{i,0}^2} & \frac{\partial^2 L_i}{\partial F_{i,0}\partial F_{i,1}} & \frac{\partial^2 L_i}{\partial F_{i,0}\partial F_{i,2}} \\ \frac{\partial^2 L_i}{\partial F_{i,1}\partial F_{i,0}} & \frac{\partial^2 L_i}{\partial F_{i,1}^2} & \frac{\partial^2 L_i}{\partial F_{i,1}\partial F_{i,2}} \\ \frac{\partial^2 L_i}{\partial F_{i,2}\partial F_{i,0}} & \frac{\partial^2 L_i}{\partial F_{i,2}\partial F_{i,1}} & \frac{\partial^2 L_i}{\partial F_{i,2}^2} \end{vmatrix} = \begin{vmatrix} p_0(1-p_0) & -p_0 p_1 & -p_0 p_2 \\ -p_1 p_0 & p_1(1-p_1) & -p_1 p_2 \\ -p_2 p_0 & -p_2 p_1 & p_2(1-p_2) \end{vmatrix} = 0$.
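This singularity is easy to verify numerically: the unconstrained Hessian of $L_i$ is $\mathrm{diag}(p) - p p^T$, and the all-ones vector lies in its null space (a small check of our own):

import numpy as np

p = np.array([0.2, 0.3, 0.5])        # any probability vector, K = 3
H = np.diag(p) - np.outer(p, p)      # H_kk = p_k(1-p_k), H_kj = -p_k p_j
print(np.linalg.det(H))              # ~0: the unconstrained Hessian is singular
print(H @ np.ones(3))                # the all-ones vector is in the null space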

31. Diagonal Approximation
(Friedman et al. 2000 and Friedman 2001) used a diagonal approximation of the Hessian; for $K = 3$:
$\frac{K-1}{K}\,\mathrm{diag}\!\left(\frac{\partial^2 L_i}{\partial F_{i,0}^2},\ \frac{\partial^2 L_i}{\partial F_{i,1}^2},\ \frac{\partial^2 L_i}{\partial F_{i,2}^2}\right) = \frac{K-1}{K}\,\mathrm{diag}\big(p_0(1-p_0),\ p_1(1-p_1),\ p_2(1-p_2)\big)$.

32. Derivatives Under the Sum-to-Zero Constraint
The loss function: $L_i = -\sum_{k=0}^{K-1} r_{i,k} \log p_{i,k}$.
The probability model and sum-to-zero constraint: $p_{i,k} = \frac{e^{F_{i,k}}}{\sum_{s=0}^{K-1} e^{F_{i,s}}}$, $\sum_{k=0}^{K-1} F_{i,k} = 0$.
Without loss of generality, we assume $k = 0$ is the base class: $F_{i,0} = -\sum_{k=1}^{K-1} F_{i,k}$.

33. New derivatives:
$\frac{\partial L_i}{\partial F_{i,k}} = (r_{i,0} - p_{i,0}) - (r_{i,k} - p_{i,k})$,
$\frac{\partial^2 L_i}{\partial F_{i,k}^2} = p_{i,0}(1 - p_{i,0}) + p_{i,k}(1 - p_{i,k}) + 2 p_{i,0} p_{i,k}$.
MART and LogitBoost used:
$\frac{\partial L_i}{\partial F_{i,k}} = -(r_{i,k} - p_{i,k})$, $\frac{\partial^2 L_i}{\partial F_{i,k}^2} = p_{i,k}(1 - p_{i,k})$.
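A small helper (ours, mirroring slide 33) that evaluates these base-class derivatives for a generic base class b:

import numpy as np

def abc_derivatives(r, p, k, b):
    """First and second derivatives w.r.t. F_{i,k} when class b is the base
    class (k != b):
        dL/dF_k    = (r_b - p_b) - (r_k - p_k)
        d2L/dF_k^2 = p_b(1-p_b) + p_k(1-p_k) + 2 p_b p_k
    r, p: (N, K) arrays of label indicators and class probabilities."""
    g = (r[:, b] - p[:, b]) - (r[:, k] - p[:, k])
    h = p[:, b] * (1 - p[:, b]) + p[:, k] * (1 - p[:, k]) + 2 * p[:, b] * p[:, k]
    return g, h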

34. Assume class 0 is the base class. The determinant of the Hessian is
$\begin{vmatrix} \frac{\partial^2 L_i}{\partial F_{i,1}^2} & \frac{\partial^2 L_i}{\partial F_{i,1}\partial F_{i,2}} \\ \frac{\partial^2 L_i}{\partial F_{i,2}\partial F_{i,1}} & \frac{\partial^2 L_i}{\partial F_{i,2}^2} \end{vmatrix} = \begin{vmatrix} p_0(1-p_0) + p_1(1-p_1) + 2p_0 p_1 & p_0 - p_0^2 + p_0 p_1 + p_0 p_2 - p_1 p_2 \\ p_0 - p_0^2 + p_0 p_1 + p_0 p_2 - p_1 p_2 & p_0(1-p_0) + p_2(1-p_2) + 2p_0 p_2 \end{vmatrix}$
$= p_0 p_1 + p_0 p_2 + p_1 p_2 - p_0 p_1^2 - p_0 p_2^2 - p_1 p_2^2 - p_2 p_1^2 - p_1 p_0^2 - p_2 p_0^2 + 6 p_0 p_1 p_2$,
which is independent of the choice of the base class.
However, a diagonal approximation appears to be a must when using trees. Thus the choice of base class matters.
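A quick numerical sanity check of our own that the determinant is the same for every choice of the base class and matches the expanded expression above:

import numpy as np

def abc_hessian_det(p, b):
    """Determinant of the 2x2 Hessian (K = 3) when class b is the base class,
    assembled from the entries on slide 34."""
    k1, k2 = [k for k in range(3) if k != b]
    d1 = p[b]*(1 - p[b]) + p[k1]*(1 - p[k1]) + 2*p[b]*p[k1]
    d2 = p[b]*(1 - p[b]) + p[k2]*(1 - p[k2]) + 2*p[b]*p[k2]
    off = p[b] - p[b]**2 + p[b]*p[k1] + p[b]*p[k2] - p[k1]*p[k2]
    return d1 * d2 - off**2

p = np.array([0.2, 0.3, 0.5])
p0, p1, p2 = p
expanded = (p0*p1 + p0*p2 + p1*p2 - p0*p1**2 - p0*p2**2 - p1*p2**2
            - p2*p1**2 - p1*p0**2 - p2*p0**2 + 6*p0*p1*p2)
print([round(abc_hessian_det(p, b), 10) for b in range(3)])  # identical for each base class
print(round(expanded, 10))                                   # matches the value above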

35. Adaptive Base Class Boost (ABC-Boost)
Two key ideas of ABC-Boost:
1. Formulate the multi-class boosting algorithm by considering a base class. One does not have to train for the base class, which is inferred from the sum-to-zero constraint $\sum_{k=0}^{K-1} F_{i,k} = 0$.
2. At each boosting step, adaptively choose the base class.

36. How to choose the base class?
We should choose the base class based on performance (training loss). How?
(One idea) Exhaustively search over all K base classes and choose the base class that leads to the best performance (smallest training loss), as sketched below.
This is computationally expensive (but not too bad, unless K is really large), and good performance can be achieved.
Many other ideas are possible.
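A sketch of the exhaustive-search idea; train_one_boosting_step is a hypothetical callable standing in for one full boosting iteration with a given base class:

def choose_base_class(train_one_boosting_step, K):
    """Exhaustive-search base-class selection (one of the ideas on this slide).
    `train_one_boosting_step(b)` runs one boosting iteration with base class b
    and returns the resulting training loss; the base class with the smallest
    training loss wins."""
    losses = [train_one_boosting_step(b) for b in range(K)]
    return min(range(K), key=lambda b: losses[b])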

37. ABC-MART = ABC-Boost + MART.
ABC-LogitBoost = ABC-Boost + Robust LogitBoost.

38. The Original MART Algorithm
1: $F_{i,k} = 0$, $p_{i,k} = \frac{1}{K}$, $k = 0$ to $K-1$, $i = 1$ to $N$
2: For $m = 1$ to $M$ Do
3:   For $k = 0$ to $K-1$ Do
4:     $\{R_{j,k,m}\}_{j=1}^{J}$ = $J$-terminal-node regression tree from $\{r_{i,k} - p_{i,k},\ \mathbf{x}_i\}_{i=1}^{N}$
5:     $\beta_{j,k,m} = \frac{K-1}{K} \frac{\sum_{\mathbf{x}_i \in R_{j,k,m}} r_{i,k} - p_{i,k}}{\sum_{\mathbf{x}_i \in R_{j,k,m}} (1 - p_{i,k}) p_{i,k}}$
6:     $F_{i,k} = F_{i,k} + \nu \sum_{j=1}^{J} \beta_{j,k,m} 1_{\mathbf{x}_i \in R_{j,k,m}}$
7:   End
8:   $p_{i,k} = \exp(F_{i,k}) / \sum_{s=0}^{K-1} \exp(F_{i,s})$, $k = 0$ to $K-1$, $i = 1$ to $N$
9: End

39. ABC-MART
1: $F_{i,k} = 0$, $k = 0$ to $K-1$, $i = 1$ to $N$
2: For $m = 1$ to $M$ Do
     For $b = 0$ to $K-1$ Do
3:     For $k = 0$ to $K-1$ (and $k \neq b$) Do
4:       $\{R_{j,k,m}\}_{j=1}^{J}$ = $J$-terminal-node tree from $\{(r_{i,k} - p_{i,k}) - (r_{i,b} - p_{i,b}),\ \mathbf{x}_i\}_{i=1}^{N}$
5:       $\beta_{j,k,m} = \frac{\sum_{\mathbf{x}_i \in R_{j,k,m}} (r_{i,k} - p_{i,k}) - (r_{i,b} - p_{i,b})}{\sum_{\mathbf{x}_i \in R_{j,k,m}} (1 - p_{i,k}) p_{i,k} + (1 - p_{i,b}) p_{i,b} + 2 p_{i,k} p_{i,b}}$
6:       $F_{i,k} = F_{i,k} + \nu \sum_{j=1}^{J} \beta_{j,k,m} 1_{\mathbf{x}_i \in R_{j,k,m}}$
7:     End
8:     $p_{i,k} = \exp(F_{i,k}) / \sum_{s=0}^{K-1} \exp(F_{i,s})$
9: End
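Below is an illustrative sketch of one ABC-MART round for a fixed base class b, using scikit-learn trees in place of the authors' tree implementation; the exhaustive base-class search in the algorithm above would call this once per candidate b (on a copy of F) and keep the base class with the smallest training loss.

import numpy as np
from sklearn.tree import DecisionTreeRegressor

def abc_mart_round(X, F, r, b, J=20, nu=0.1):
    """One ABC-MART boosting round for base class b (a sketch of Lines 3-8).
    For every k != b a J-leaf regression tree is fit to the ABC residual
    (r_k - p_k) - (r_b - p_b); terminal-node values follow the one-step
    Newton update with the ABC second derivative, and F_b is recovered from
    the sum-to-zero constraint.  r: (N, K) one-hot label indicators."""
    F = F.copy()
    N, K = F.shape
    p = np.exp(F - F.max(axis=1, keepdims=True))
    p /= p.sum(axis=1, keepdims=True)
    for k in range(K):
        if k == b:
            continue
        resid = (r[:, k] - p[:, k]) - (r[:, b] - p[:, b])
        hess = (p[:, k] * (1 - p[:, k]) + p[:, b] * (1 - p[:, b])
                + 2 * p[:, k] * p[:, b])
        tree = DecisionTreeRegressor(max_leaf_nodes=J)
        tree.fit(X, resid)
        leaf = tree.apply(X)                      # terminal node of each sample
        for node in np.unique(leaf):
            mask = leaf == node
            beta = resid[mask].sum() / max(hess[mask].sum(), 1e-12)
            F[mask, k] += nu * beta
    F[:, b] = -np.sum(np.delete(F, b, axis=1), axis=1)   # sum-to-zero constraint
    return F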

40. Datasets
UCI-Covertype: 581,012 samples in total. Two datasets were generated: Covertype290k and Covertype145k.
UCI-Poker: originally 25,010 training samples and 1 million test samples. Datasets: Poker25kT1, Poker25kT2, Poker525k, Poker275k, Poker150k, Poker100k.
MNIST: originally 60,000 training samples and 10,000 test samples. MNIST10k swapped the training with the test samples.
Many variations of MNIST: the original MNIST is a well-known easy problem. ( lisa/twiki/bin/view.cgi/public/DeepVsShallowComparisonICML2007) created a variety of much more difficult datasets by adding various background (correlated) noise, background images, rotations, etc.
UCI-Letter: 20,000 samples in total.

41. Dataset Summary
[Table: for each dataset (Covertype290k, Covertype145k, Poker525k, Poker275k, Poker150k, Poker100k, Poker25kT1, Poker25kT2, Mnist10k, M-Basic, M-Rotate, M-Image, M-Rand, M-RotImg, M-Noise1 through M-Noise6, Letter15k, Letter4k, Letter2k): the number of classes K, # training samples, # test samples, and # features.]

42. Summary of Test Misclassification Errors
[Table: test misclassification errors of mart, abc-mart, (robust) logitboost, abc-logitboost, and logistic regression, together with the number of test samples, on each dataset from Covertype290k through Letter2k.]

43. P-Values
P1: for testing if abc-mart has significantly lower error rates than mart.
P2: for testing if (robust) logitboost has significantly lower error rates than mart.
P3: for testing if abc-logitboost has significantly lower error rates than abc-mart.
P4: for testing if abc-logitboost has significantly lower error rates than (robust) logitboost.
P-values are computed using binomial distributions and normal approximations.

44. Summary of Test P-Values
[Table: the P-values P1 through P4 for each dataset from Covertype290k through Letter2k.]

45. Comparisons with SVM and Deep Learning
Datasets: M-Noise1 to M-Noise6. Results for SVM, Neural Nets, and Deep Learning are from lisa/twiki/bin/view.cgi/public/DeepVsShallowComparisonICML2007.
[Figure: two panels of error rate (%) vs. degree of correlation; one panel shows SAA-3, DBN-3, and SVM-rbf.]

46. Comparisons with SVM and Deep Learning
Datasets: M-Noise1 to M-Noise6.
[Figure: two panels of error rate (%) vs. degree of correlation; one panel shows SAA-3, DBN-3, and SVM-rbf.]

47. More Comparisons with SVM and Deep Learning

                 M-Basic   M-Rotate   M-Image   M-Rand   M-RotImg
SVM-RBF          3.05%     11.11%     22.61%    14.58%   55.18%
SVM-POLY         3.69%     15.42%     24.01%    16.62%   56.41%
NNET             4.69%     18.11%     27.41%    20.04%   62.16%
DBN-3                      10.30%     16.31%     6.73%   47.39%
SAA-3                      10.30%     23.00%    11.28%   51.93%
DBN-1                      14.69%     16.15%     9.80%   52.21%
mart             4.12%     15.35%     11.64%    13.15%   49.82%
abc-mart         3.69%     13.27%      9.45%    10.60%   46.14%
logitboost       3.45%     13.63%      9.41%    10.04%   45.92%
abc-logitboost   3.20%     11.92%      8.54%     9.45%   44.69%

48. Training Loss vs. Boosting Iterations
[Figure: training loss vs. boosting iterations on Covertype290k (J = 20) and Poker525k (J = 20).]

49. Test Errors vs. Boosting Iterations
[Figure: test errors vs. boosting iterations on Mnist10k (J = 20) and Letter15k (J = 20).]

50. [Figure: test errors vs. boosting iterations on Letter4k (J = 20) and Letter2k (J = 20).]

51. [Figure: test errors vs. boosting iterations on Covertype290k, Covertype145k, Poker525k, and Poker275k (J = 20).]

52. [Figure: test errors vs. boosting iterations on Poker150k (J = 20), Poker100k (J = 20), Poker25kT1 (J = 6), and Poker25kT2 (J = 6).]

53. Detailed Experiment Results on Mnist10k
$J \in \{4, 6, 8, 10, 12, 14, 16, 18, 20, 24, 30, 40, 50\}$, $\nu \in \{0.04, 0.06, 0.08, 0.1\}$.
The goal is to show the improvements at all reasonable combinations of $J$ and $\nu$.

54. [Table: test errors of mart and abc-mart on Mnist10k for every combination of $J \in \{4, \ldots, 50\}$ and $\nu \in \{0.04, 0.06, 0.08, 0.1\}$.]

55. [Table: test errors of (robust) logitboost and abc-logitboost on Mnist10k for every combination of $J$ and $\nu$.]

56. [Table: the P1 values on Mnist10k for every combination of $J$ and $\nu$.]

57. [Table: the P2 values on Mnist10k for every combination of $J$ and $\nu$.]

58. [Table: the P3 values on Mnist10k for every combination of $J$ and $\nu$.]

59. [Table: the P4 values on Mnist10k for every combination of $J$ and $\nu$.]

60. [Figure: test errors vs. boosting iterations on Mnist10k for J = 4, 6, 8 (nine panels over three values of ν).]

61. [Figure: test errors vs. boosting iterations on Mnist10k for J = 10, 12, 14 (nine panels over three values of ν).]

62. [Figure: test errors vs. boosting iterations on Mnist10k for J = 16, 18, 20 (nine panels over three values of ν).]

63. [Figure: test errors vs. boosting iterations on Mnist10k for J = 24, 30, 40 (nine panels over three values of ν).]

64. [Figure: test errors vs. boosting iterations on Mnist10k for J = 50 (three panels over three values of ν).]
Want to See More? See the paper: ABC-Boost: Adaptive Base Class Boost for Multi-class Classification, Ping Li, Department of Statistical Science, Cornell University.
