Application of Machine Learning in Manufacturing Industry

MSc Degree Thesis

Written by: Hsinyi Lin
Master of Science in Mathematics

Supervisor: Lukács András
Institute of Mathematics

Eötvös Loránd University
Faculty of Science
Budapest, 2017

ABSTRACT

In the high-tech manufacturing industry, a large amount of data is collected from the production line every day. These data can be measurements, categorical attributes, or time-related values. Before a product is delivered to the customer, it has to pass a quality examination. The task in this thesis is to use a given training set, together with several boosting algorithms available in Python, to build a model that predicts the quality test result. Building such a model amounts to solving a highly imbalanced binary classification problem, since for a mature product the number of Fail cases is very small compared to the number of Pass cases. Boosting is commonly used for this type of problem in the Kaggle community; its general idea is to combine many weak learners (base classifiers) sequentially into a strong learner. In the programming, the model is trained under certain conditions, meaning that some parameters are fixed, so an understanding of each parameter is needed in order to find optimal values. To optimize the model, we minimize an objective function that depends on the chosen algorithm. AdaBoost minimizes an upper bound of the empirical error. LogitBoost minimizes a weighted least-squares regression to a residual that is updated by Newton's method. Gradient boosting minimizes a user-defined differentiable convex loss function that measures the dissimilarity between the real class and the prediction. In addition to the loss function, XGB takes a regularization term into account and uses a second-order Taylor expansion to approximate the loss function, which helps prevent over-fitting. Solving the problem with the given training set consists of two steps: (1) use line search to find optimal parameters; (2) find the optimal threshold that decides the result of the classification. The purpose of the thesis is to demonstrate the procedure of solving the problem; since the computational capacity of a laptop is limited, only a very small part of the original data is taken as the training set. The model trained on this reduced training set is therefore not as competitive as the higher-ranked models in the competition. When doing a line search to optimize a specific parameter, it is not always easy to find a proper range of values, so it is important to rely on past empirical experience. In this thesis, several models with different objective functions and different training sets are built, but no single method is always the best. In Kaggle competitions, most participants win by combining many methods rather than using a single one.

Contents

Abstract
List of Figures
List of Tables

1 Introduction
  1.1 Motivation
  1.2 Outline

2 Boosting Methods
  2.1 Binary classification
  2.2 General idea
  2.3 Adaptive boosting
    2.3.1 Algorithm
    2.3.2 Derivation of each step
  2.4 Logit boosting
    2.4.1 Algorithm
    2.4.2 Derivation of each step
  2.5 Gradient boosting
    2.5.1 Algorithm
    2.5.2 Derivation
  2.6 XGB - Extreme Gradient Boosting
    2.6.1 Algorithm
    2.6.2 Derivation

3 Measurement and Experiment
  3.1 Measurement of goodness of the classifier
    3.1.1 ROC curve
    3.1.2 Sigmoid function
  3.2 Data preparation
  3.3 Implementation
    3.3.1 Optimization of parameters
    3.3.2 Optimal threshold
    3.3.3 Regularization and over-fitting
    3.3.4 Model with different training set and algorithm

Summary
Bibliography

List of Figures

1 Receiver operating characteristics curve
2 Sigmoid function maps the score to [0, 1]
3 Partial process flow
4 Default values: M = 100, λ = 1, γ = 0, objective = logistic, ν = 0.1
5 max depth = 2, other parameters at default values: λ = 1, γ = 0, objective = logistic, ν = 0.1
6 M = 103, max depth = 2, λ = 1, γ = 0, objective = logistic
7 M = 103, max depth = 2, λ = 1, objective = logistic
8 M = 103, max depth = 2, λ = 1, objective = logistic
9 M = 103, max depth = 2, λ = 1, objective = logistic
10 M = 103, max depth = 2, γ = 0, objective = logistic
11 ROC curve under optimal condition
12 Confusion matrix at the selected threshold
13 XGB with regularization (λ = 1)
14 XGB without regularization (λ = 0, γ = 0)
15 Overfitting: XGB with regularization improves the overfitting problem

List of Tables

1 Summary of different loss functions for gradient boosting
2 Summary of different loss functions for XGB
3 Confusion matrix obtained under a certain threshold
4 Size of the training sets
5 Optimal condition for different training sets using the AdaBoost algorithm
6 Optimal condition for different training sets using gradient boosting
7 Optimal condition for different training sets using XGB

1 Introduction

1.1 Motivation

A few years of working experience in the high-tech manufacturing industry made me realize how fast data are generated on a production line. The limited capacity of the repository forces old data to be purged while new data are still being generated. It is a tough task to analyse all the collected data and extract useful information from such a big data set with traditional statistical methods before the data are purged. In order to use the collected data effectively, more and more companies have started to introduce machine learning into their analytical tools. Recently, many companies have started to provide their data and hold competitions on Kaggle, the largest community platform for people working in fields related to machine learning [1]. By holding competitions, they can collect ideas from the innovative and efficient methods proposed by participants from all over the world.

In the manufacturing industry, one of the typical questions is how to predict quality test results based on past data. The quality test result decides whether the company can profit from a product, because every product has to pass all quality tests before being delivered to customers. If the result can be predicted precisely by a model built from given features, it helps the company save cost: if the predicted result is Fail, we can simply scrap the product and stop the remaining process steps, because of the high risk of scrapping it anyway after finishing the whole procedure. Such a model also tells us which features are important for the prediction, so we can focus on the yield improvement of the corresponding process steps.

To build models for this binary classification problem, we introduce boosting methods, which are among the most commonly used approaches in the Kaggle community. Most participants won competitions by using boosting ideas to construct their models. Many kinds of boosting methods have been proposed, such as AdaBoost, LogitBoost, gradient boosting and extreme gradient boosting, and implementations of several of them are available as Python packages.

1.2 Outline

This thesis is organized in the following way. In Chapter 2, the general idea of boosting methods and four specific methods, AdaBoost, LogitBoost, gradient boosting and extreme gradient boosting, are introduced. Each method includes the detailed algorithm, the underlying concept and its derivation. The idea of the loss function and the regularization term that appears in extreme gradient boosting are also presented. In Chapter 3, the ROC curve and the sigmoid function are first introduced for evaluating the goodness of a model. Data preparation and the implementation environment are also described before the programming part. The demonstration of solving the highly imbalanced classification problem with extreme gradient boosting, used through its scikit-learn compatible Python interface, consists of two steps: (1) use a line search strategy to find the optimal condition; (2) find the threshold that maximizes the Matthews correlation coefficient. After demonstrating the procedure, I show how the over-fitting improves when the regularization term is taken into account. At the end of the chapter, I present the results of models trained by the different boosting algorithms on the different training sets and draw a conclusion for Chapter 3.

2 Boosting Methods

2.1 Binary classification

Classification is one of the supervised learning problems in machine learning. The task is to find a model F that fits a given labeled training set {(x_i, y_i), i = 1, 2, ..., N} and to use it to predict a set with unknown class labels. The labeled training set consists of features x_i and class labels y_i. The problem of predicting quality test results can be regarded as a highly imbalanced binary classification problem, because the prediction has only two possible outcomes, Pass and Fail, and the class distribution is far from uniform due to the big disparity between the two classes. In practical applications, the two class labels are usually encoded as -1/1 or 0/1, depending on the method being used. In the manufacturing industry, features can be any data measured during the process, or time values. To formalize the problem, we denote the model F : x -> F(x), where F(x) is the prediction for given features x. Finding the optimal F is equivalent to finding the model that minimizes the dissimilarity between y_i and F(x_i).

2.2 General idea

The general idea of boosting methods [2] [3] [4] is to sequentially ensemble many weak learners (base classifiers) f_m, each of which predicts only slightly better than flipping a fair coin (random guessing), into a strong classifier F. Most boosting methods use CART (classification and regression trees) as base classifiers. In general, finding the optimal classifier in the m-th iteration is equivalent to finding the f_m that minimizes a given objective function. In AdaBoost, the goal is to minimize an upper bound of the empirical error of the ensemble classifier [5]; since the empirical error is bounded by the exponential loss, this is the same as taking the exponential loss as objective function. LogitBoost minimizes the least squares of a residual estimated by a Newton update. The goal of gradient boosting is to minimize a loss function that can be interpreted as the dissimilarity between the real class label y and the predicted value F(x); the loss function can be any differentiable convex function. Unlike gradient boosting, the objective function of extreme gradient boosting has one more term, a regularization term, which can prevent over-fitting effectively.

2.3 Adaptive boosting

AdaBoost was the most popular boosting method until extreme gradient boosting appeared. It combines many weak classifiers, trained sequentially on weighted training sets, into a strong classifier. Initially, every sample in the training set is given an equal weight. In each stage, we increase the weights of the samples misclassified by the current ensemble and obtain a new weighted training set for the next weak learner. This emphasizes the importance of the misclassified instances, so that the next weak classifier focuses on fitting them. Because the weights sum to one, the weights of the correctly classified samples are decreased at the same time. In practice, using stumps (the simplest decision trees, with one node and two leaves) as weak classifiers already gives quite good accuracy and efficiency. The most commonly used version of AdaBoost is discrete AdaBoost, first introduced in [Freund and Schapire (1996b)]. It was developed to solve binary classification problems with class labels y_i in {-1, 1} for all i. Real AdaBoost came up two years later.

2.3.1 Algorithm

The following two algorithms are variants of AdaBoost. Algorithm 1 is the discrete version presented in 1996 by Freund and Schapire, and Algorithm 2 is real AdaBoost. They differ slightly in how the value added at iteration m is calculated: discrete AdaBoost uses a weak classifier f_m with values in {-1, 1} and estimates its coefficient from the weighted training error, while real AdaBoost uses a weak classifier f_m that returns a real number estimated from the class probability. Both algorithms increase the weights of the misclassified samples by multiplying with an exponential factor. The detailed explanation is given in the derivation part.

Algorithm 1 Discrete AdaBoost
1: Initialize w_i = 1/N, i = 1, 2, ..., N   (equal weight for each sample)
2: while m <= M do
3:   Fit a classifier f_m(x) in {-1, 1} with minimum weighted training error
4:   Compute ε_m = E_w[1(y ≠ f_m(x))]   (weighted training error of f_m)
5:   c_m = log((1 - ε_m)/ε_m)   (coefficient of the current weak learner)
6:   w_i <- w_i e^{c_m 1(y_i ≠ f_m(x_i))}, i = 1, 2, ..., N   (increase the weights of the misclassified samples)
7:   Re-normalize so that Σ_{i=1}^N w_i = 1
8: end while
9: return sign[F(x)], where F(x) = Σ_{m=1}^M c_m f_m(x)   (output the sign of the ensemble result)

w_i: weight of sample i
f_m: optimal classifier of the m-th iteration
ε_m: weighted training error of f_m
c_m: coefficient of classifier f_m in the ensemble
F: final ensemble classifier
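To make Algorithm 1 concrete, here is a minimal Python sketch of discrete AdaBoost written for illustration, using decision stumps from scikit-learn as weak classifiers. It is not the library's own AdaBoostClassifier, and the arrays X, y (with labels in {-1, +1}) are assumed to be given.

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def discrete_adaboost(X, y, M=100):
    """Discrete AdaBoost (Algorithm 1); y must take values in {-1, +1}."""
    N = len(y)
    w = np.full(N, 1.0 / N)                           # step 1: equal initial weights
    stumps, coeffs = [], []
    for m in range(M):
        stump = DecisionTreeClassifier(max_depth=1)   # weak learner: a stump
        stump.fit(X, y, sample_weight=w)              # step 3: fit under the current weights
        miss = (stump.predict(X) != y).astype(float)
        eps = np.clip(np.sum(w * miss), 1e-10, 1 - 1e-10)   # step 4: weighted training error
        c = np.log((1 - eps) / eps)                   # step 5: coefficient c_m
        w = w * np.exp(c * miss)                      # step 6: up-weight misclassified samples
        w = w / w.sum()                               # step 7: re-normalize
        stumps.append(stump)
        coeffs.append(c)
    def F(X_new):                                     # step 9: sign of the weighted vote
        score = sum(c * s.predict(X_new) for c, s in zip(coeffs, stumps))
        return np.sign(score)
    return F

In practice one would normally call sklearn.ensemble.AdaBoostClassifier instead of re-implementing the loop; the sketch only mirrors the steps of Algorithm 1 in the notation used above.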

Algorithm 2 Real AdaBoost
1: Initialize w_i = 1/N, i = 1, 2, ..., N   (equal weight for each sample)
2: while m <= M do
3:   Find the optimal partition and get the class probability p_m(y = 1 | x)   (train a classifier with minimum weighted training error and read off its class probability)
4:   Compute f_m(x) = ½ ln( p_m(y = 1 | x) / (1 - p_m(y = 1 | x)) )   (score of each leaf)
5:   w_i <- w_i e^{-y_i f_m(x_i)}, i = 1, 2, ..., N   (update the weight of each sample)
6:   Re-normalize so that Σ_{i=1}^N w_i = 1
7: end while
8: return sign[F(x)], where F(x) = Σ_{m=1}^M f_m(x)   (output the sign of the ensemble result)

w_i: weight of sample i
p_m(y = 1 | x): class conditional probability of each leaf at the m-th iteration
f_m: classifier with the optimal score at the m-th iteration
F: final ensemble classifier

2.3.2 Derivation of each step

First, each sample gets an equal weight w_1(i) = 1/N, where N is the number of samples. Then comes the iteration step m. The m-th stage of discrete AdaBoost includes three steps: (1) find an optimal classifier f_m with minimal weighted training error; (2) find the optimal coefficient c_m of the weak classifier f_m; (3) update the weights for the next iteration.

In practice we usually look for a classifier f_m(x) with minimum weighted training error under a given condition, which can itself be optimized by a line search. The most popular condition is to take a stump, the simplest decision tree, as base classifier, because of its efficiency and the accuracy of the resulting final classifier:

f_m = arg min_f E_w[1(y ≠ f(x))] = arg min_f Σ_{i=1}^N w_m(i) 1(y_i ≠ f(x_i)).

Next we have to find the c_m that minimizes the empirical training error (1/N) Σ_{i=1}^N 1(y_i F(x_i) < 0) for the known f_m. To do so, we minimize its upper bound: since 1(y_i ≠ F(x_i)) ≤ e^{-k y_i F(x_i)} for any k > 0, the objective function becomes (1/N) Σ_i e^{-k y_i F(x_i)}. According to the definition in the algorithm, the weights of the misclassified samples are increased; in fact, the weights of the correctly classified samples are also reduced because of the re-normalization. The updated weight for the next iteration can be expressed as

w_{m+1}(i) = w_m(i) e^{c_m 1(y_i ≠ f_m(x_i))} / Σ_{j=1}^N w_m(j) e^{c_m 1(y_j ≠ f_m(x_j))}.   (1)

The denominator is the normalization term and simplifies to (1 - ε_m) + ε_m e^{c_m}:

Σ_{i=1}^N w_m(i) e^{c_m 1(y_i ≠ f_m(x_i))} = Σ_{y_i = f_m(x_i)} w_m(i) + e^{c_m} Σ_{y_i ≠ f_m(x_i)} w_m(i) = (1 - ε_m) + ε_m e^{c_m},

where the weighted training error is ε_m = Σ_{y_i ≠ f_m(x_i)} w_m(i). When y_i = f_m(x_i), which is the same as y_i f_m(x_i) = 1,

w_{m+1}(i) = w_m(i) / ((1 - ε_m) + ε_m e^{c_m}) = w_m(i) e^{-c_m/2} / ((1 - ε_m) e^{-c_m/2} + ε_m e^{c_m/2});   (2)

otherwise y_i ≠ f_m(x_i) (i.e. y_i f_m(x_i) = -1) and

w_{m+1}(i) = w_m(i) e^{c_m} / ((1 - ε_m) + ε_m e^{c_m}) = w_m(i) e^{c_m/2} / ((1 - ε_m) e^{-c_m/2} + ε_m e^{c_m/2}).   (3)

Equations (2) and (3) can be combined into a single equation,

w_{m+1}(i) = w_m(i) e^{-c_m y_i f_m(x_i)/2} / Z_m,   (4)

where Z_m = (1 - ε_m) e^{-c_m/2} + ε_m e^{c_m/2}. From equation (4) we get

e^{-c_m y_i f_m(x_i)/2} = (w_{m+1}(i) / w_m(i)) Z_m.   (5)

By (5), choosing k = 1/2, the upper bound of the empirical error can be written as

(1/N) Σ_{i=1}^N e^{-y_i F(x_i)/2} = Π_{m=1}^M Z_m.   (6)

The derivation is as follows:

(1/N) Σ_i e^{-y_i F(x_i)/2} = (1/N) Σ_i e^{-y_i Σ_{m=1}^M c_m f_m(x_i)/2} = (1/N) Σ_i Π_{m=1}^M e^{-y_i c_m f_m(x_i)/2} = (1/N) Σ_i Π_{m=1}^M (w_{m+1}(i)/w_m(i)) Z_m = Π_{m=1}^M Z_m,

using Σ_i w_{M+1}(i) = 1 and w_1(i) = 1/N. Thus, by equation (6), minimizing the upper bound of the empirical error is equivalent to minimizing Z_m at each iteration. The minimum of Z_m occurs when ∂Z_m/∂c_m = 0:

∂Z_m/∂c_m = -½(1 - ε_m) e^{-c_m/2} + ½ ε_m e^{c_m/2} = 0   ⟹   e^{c_m} = (1 - ε_m)/ε_m,

and

c_m = log((1 - ε_m)/ε_m).   (7)

With this c_m,

Z_m = (1 - ε_m) e^{-½ log((1-ε_m)/ε_m)} + ε_m e^{½ log((1-ε_m)/ε_m)} = (1 - ε_m) sqrt(ε_m/(1-ε_m)) + ε_m sqrt((1-ε_m)/ε_m) = 2 sqrt(ε_m (1 - ε_m)).

Thus the empirical error is bounded:

(1/N) Σ_{i=1}^N 1(y_i F(x_i) < 0) ≤ Π_{m=1}^M 2 sqrt(ε_m (1 - ε_m)).   (8)

This result can be interpreted as follows. When c_m = 0, the classifier f_m is the same as random guessing, because its weighted training error ε_m is 0.5, so it is useless for the final prediction. c_m > 0 means f_m has a better chance to predict correctly (ε_m < 0.5), so we multiply it by a positive coefficient to emphasize its positive influence on the ensemble. If c_m < 0 then, not surprisingly, f_m is more likely to predict wrongly (ε_m > 0.5); we reverse its prediction and it still helps the ensemble.

In real AdaBoost, f_m(x) is a real number rather than a value in {-1, 1}. To find the optimal value of f_m(x), note that the weighted training error satisfies E_w[1(y f_m(x) < 0)] ≤ E_w[e^{-y f_m(x)}], so minimizing the weighted training error is replaced by minimizing its upper bound E_w[e^{-y f_m(x)}]:

E_w[e^{-y f_m(x)}] = P_w(y = 1 | x) e^{-f_m(x)} + P_w(y = -1 | x) e^{f_m(x)},
∂E_w[e^{-y f_m(x)}]/∂f_m(x) = -P_w(y = 1 | x) e^{-f_m(x)} + P_w(y = -1 | x) e^{f_m(x)} = 0,
f_m(x) = ½ log( P_w(y = 1 | x) / P_w(y = -1 | x) ).

Hence the optimal base classifier f_m is half the logarithm of the ratio of the two class probabilities. Lastly, both algorithms output the sign of the ensemble value.

2.4 Logit boosting

LogitBoost is a method used only for binary classification problems with class labels y ∈ {0, 1}, while the other methods use class labels y ∈ {-1, 1}. It fits an additive logistic model to a symmetric likelihood function using an adaptive Newton approach. At every iteration step, a working response z_m(i), i = 1, 2, ..., N, is computed for each sample from y_i and the estimated class probability.

The working response can also be regarded as an approximate residual between the real class label and the ensemble value; its formula is derived from Newton's method. The optimal regression tree f_m is obtained by minimizing a weighted least-squares regression to z_m(i). In the last step the algorithm outputs the sign of the ensemble value.

2.4.1 Algorithm

Algorithm 3 LogitBoost
1: Initialize w_i = 1/N, p_0(x_i) = 1/2, i = 1, 2, ..., N, and F_0(x) = 0   (equal weights and initial score of the single-node tree)
2: while m <= M do
3:   z_m(i) = (y_i - p_{m-1}(x_i)) / (p_{m-1}(x_i)(1 - p_{m-1}(x_i)))   (compute the working response)
4:   w_m(i) = p_{m-1}(x_i)(1 - p_{m-1}(x_i))   (update the weight of each instance)
5:   Fit f_m(x) by a weighted least-squares regression of z_m(i) to x_i using the weights w_m(i), i.e. minimize E_w[(f_m(x) - z)²]
6:   F_m(x) = F_{m-1}(x) + ½ f_m(x), and p_m(x) = e^{F_m(x)} / (e^{F_m(x)} + e^{-F_m(x)})   (estimate the class probability for the next working response)
7: end while
8: return sign[F(x)]   (output the sign of the ensemble result)

w_m(i): weight of sample i, updated from the previous class probability
z_m(i): the Newton update, or estimated residual
f_m: regression function minimizing the weighted least squares to z
F: final ensemble classifier

2.4.2 Derivation of each step

First, every training instance gets equal weight 1/N, and we initialize F(x) = 0 and p(x_i) = 1/2, which means the probability of y = 1 for every instance is the same as random guessing. In the m-th iteration the goal is to fit f_m by minimizing the weighted least-squares regression to z_m(i); F(x) is gradually improved by adding a multiple of f(x) after each iteration. Here z_m(i) is regarded as a residual (Newton update) relative to the previously estimated probability of y = 1 given x = x_i. The questions we care about most are why

z_m(i) = (y_i - p_{m-1}(x_i)) / (p_{m-1}(x_i)(1 - p_{m-1}(x_i)))

and how to find the f(x) that maximizes the expected log-likelihood. The logistic log-likelihood is

l(p(x)) = y log p(x) + (1 - y) log(1 - p(x)),

where

p(x) = e^{F(x)} / (e^{F(x)} + e^{-F(x)}) = 1 / (1 + e^{-2F(x)}),   (9)

so that

l(p(x)) = l(F(x)) = y log[ e^{F(x)} / (e^{F(x)} + e^{-F(x)}) ] + (1 - y) log[ e^{-F(x)} / (e^{F(x)} + e^{-F(x)}) ] = 2yF(x) - log(1 + e^{2F(x)}).

To find the optimal f(x) for the current iteration, we maximize the expected log-likelihood,

f = arg max_f E[l(F(x) + f(x))],

which occurs when ∂E[l(F(x) + f(x))]/∂f(x) = 0. Denote

g(F(x) + f(x)) = ∂E[l(F(x) + f(x))]/∂f(x) = E[ 2y - 2e^{2(F(x)+f(x))}/(1 + e^{2(F(x)+f(x))}) ],   (10)

h(F(x) + f(x)) = ∂g(F(x) + f(x))/∂f(x) = E[ -4e^{2(F(x)+f(x))}/(1 + e^{2(F(x)+f(x))})² ].   (11)

The task is thus reduced to solving g(F + f) = 0. LogitBoost uses Newton's method to find an approximate solution. Newton's method finds an approximate root r of a function q, i.e. q(r) ≈ 0. By Taylor expansion, q(r) = q(r_0) + q'(r_0)(r - r_0) + O((r - r_0)²). It may not be easy to find the exact root of q(r) = 0, but we can find an r with 0 < |q(r)| < |q(r_0)|, which means r is closer to the root than r_0, and repeat this step until q(r) ≈ 0. If r_0 is already a point near the root and q(r) ≈ q(r_0) + q'(r_0)(r - r_0) ≈ 0, then

r ≈ r_0 - q(r_0)/q'(r_0),   (12)

and r is a better approximation than r_0. To solve g(F + f) = 0, consider r_0 = F(x) and r = F(x) + f(x). By (10), (11) and (12),

F(x) + f(x) ≈ F(x) - g(F(x))/h(F(x)).

Since

g(F(x))/h(F(x)) = E[ 2y - 2e^{2F(x)}/(1 + e^{2F(x)}) | x ] / E[ -4e^{2F(x)}/(1 + e^{2F(x)})² | x ] = -E[ y - 1/(1 + e^{-2F(x)}) | x ] / ( 2E[ e^{2F(x)}/(1 + e^{2F(x)})² | x ] ),   (13)

and, by (9), p(x) = 1/(1 + e^{-2F(x)}) and e^{2F(x)}/(1 + e^{2F(x)})² = p(x)(1 - p(x)), equation (13) becomes

-g(F(x))/h(F(x)) = ½ E[ (y - p(x)) / (p(x)(1 - p(x))) | x ].   (14)

Therefore

F(x) + f(x) ≈ F(x) + ½ E[ (y - p(x)) / (p(x)(1 - p(x))) | x ].   (15)

The conditional expectation is estimated by fitting a regression tree to the working response by weighted least squares with weights w(x) = p(x)(1 - p(x)), which come from the Newton (second-derivative) term:

f_m = arg min_f E_w[ (f(x) - z)² ],   (16)

with z = (y - p(x))/(p(x)(1 - p(x))). Thus we denote z_m(i) = (y_i - p_{m-1}(x_i))/(p_{m-1}(x_i)(1 - p_{m-1}(x_i))), fit f_m by weighted least squares to z, and update F_m = F_{m-1} + ½ f_m, as in Algorithm 3.

2.5 Gradient boosting

Gradient boosting [6] [8] builds on the idea of function estimation: the goal is to find an additive function that fits the training data. The objective is a loss function that can be interpreted as the dissimilarity between the real class label and the predicted value of the additive function. In every iteration, the negative gradient of the loss function is used to approximate the residual of the previous iteration, and the goal is to find a regression tree (classifier) f and the score (function value) of each of its leaves that minimize the sum over all instances of the loss L(y, F_{m-1}(x) + f_m(x)). Because a gradient has to be computed, the loss function is required to be a differentiable convex function. In a binary classification problem with class labels y ∈ {-1, 1}, the sign of the final ensemble score is output as the prediction.
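The gradient boosting procedure just described is implemented in scikit-learn as GradientBoostingClassifier. A hedged usage sketch on synthetic imbalanced data follows; the parameter values are illustrative only.

from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

# synthetic, highly imbalanced toy data standing in for the production data (about 1% Fail)
X, y = make_classification(n_samples=10000, n_features=20, weights=[0.99, 0.01], random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.2, stratify=y, random_state=0)

gb = GradientBoostingClassifier(
    n_estimators=100,     # M, the number of regression trees
    learning_rate=0.1,    # ν, the shrinkage applied to each tree
    max_depth=2,          # depth of each tree
)                         # the default loss is the binomial deviance (logistic loss), as in this section
gb.fit(X_tr, y_tr)
print("validation AUC:", roc_auc_score(y_va, gb.predict_proba(X_va)[:, 1]))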

2.5.1 Algorithm

Algorithm 4 Gradient boosting
1: F_0(x) = arg min_γ Σ_{i=1}^N ψ(y_i, γ)   (give an initial value to F)
2: while m <= M do
3:   ỹ_im = -[ ∂ψ(y_i, F(x_i))/∂F(x_i) ]_{F = F_{m-1}}   (negative gradient of the loss function)
4:   {R_lm}_{l=1}^L = L disjoint regions trained on {(x_i, ỹ_im)}_{i=1}^N   (use the negative gradients as new training labels to train a new tree)
5:   γ_lm = arg min_γ Σ_{x_i ∈ R_lm} ψ(y_i, F_{m-1}(x_i) + γ)   (optimal score of each leaf)
6:   F_m(x) = F_{m-1}(x) + ν Σ_{l=1}^L γ_lm 1(x ∈ R_lm)   (add the base classifier to the additive model)
7: end while

F_0: initial value of the ensemble classifier
ψ: loss function
ỹ_im: estimated residual of sample i at the m-th iteration
R_lm: samples classified to the leaf with index l at the m-th iteration
γ_lm: optimal score of leaf l under the L-disjoint partition
ν: learning rate
F_m: ensemble score after the m-th iteration

2.5.2 Derivation

In gradient boosting the user can choose an arbitrary differentiable convex function as the loss. First, an initial value is given to the ensemble classifier; it depends on the loss function being used. The initial classifier can be regarded as a tree with a single node, i.e. all samples are classified to that node, and the initial value F_0 is the optimal score of this node. Denoting the score of the single-node tree by γ, the minimum of

F_0(x) = arg min_γ Σ_{i=1}^N ψ(y_i, γ)   (17)

occurs where

∂[ Σ_{i=1}^N ψ(y_i, γ) ]/∂γ = 0.   (18)

By equations (17) and (18), F_0 for the different loss functions is derived as follows.

Least squares loss ½(y - F(x))²:

∂[ Σ_{i=1}^N ½(y_i - γ)² ]/∂γ = Nγ - Σ_{i=1}^N y_i = 0   ⟹   F_0(x) = (1/N) Σ_{i=1}^N y_i,

i.e. F_0 is the average of all y_i in the training set.

Exponential loss e^{-yF(x)}:

∂[ Σ_{i=1}^N e^{-y_i γ} ]/∂γ = -Σ_{i=1}^N y_i e^{-y_i γ} = -e^{-γ} Σ_i 1(y_i = 1) + e^{γ} Σ_i 1(y_i = -1) = -e^{-γ} N P(y = 1) + e^{γ} N P(y = -1) = 0,

F_0(x) = γ_0 = ½ log( P(y = 1) / P(y = -1) ).

Logistic loss log(e^{-2yF(x)} + 1):

∂[ Σ_{i=1}^N log(1 + e^{-2y_i γ}) ]/∂γ = -Σ_{i=1}^N 2y_i/(1 + e^{2y_i γ}) = -2N[ P(y = 1)/(1 + e^{2γ}) - P(y = -1)/(1 + e^{-2γ}) ] = 0,

F_0(x) = ½ log( P(y = 1) / P(y = -1) ).
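As a small worked example (with a made-up class ratio of the same order as the Fail rate discussed in Chapter 3): if P(y = 1) = 0.01 and P(y = -1) = 0.99, the exponential and logistic losses both give F_0 = ½ log(0.01/0.99) ≈ -2.30, so the ensemble starts from the prior log-odds of the minority class.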

Thus, for the exponential and the logistic loss, F_0 is half the logarithm of the ratio of the two class probabilities.

Second, the m-th iteration consists of four steps: (1) calculate the negative gradient ỹ_im of the loss function, which approximates the residual of each sample; (2) train a classifier f_m that partitions the features {x_i, i = 1, 2, ..., N} into L disjoint regions R_lm, l = 1, ..., L, using the training set {(x_i, ỹ_im)}; (3) find the optimal score γ_lm of each leaf l, minimizing the loss of that leaf; (4) update the ensemble score.

(1) Calculate ỹ_im for the different loss functions,

ỹ_im = -[ ∂ψ(y_i, F(x_i))/∂F(x_i) ]_{F = F_{m-1}}.   (19)

Least squares loss:
ỹ_im = -[ ∂ ½(y_i - F(x_i))² / ∂F(x_i) ]_{F = F_{m-1}} = y_i - F_{m-1}(x_i).   (20)

Exponential loss:
ỹ_im = -[ ∂ e^{-y_i F(x_i)} / ∂F(x_i) ]_{F = F_{m-1}} = y_i e^{-y_i F_{m-1}(x_i)}.   (21)

Logistic loss:
ỹ_im = -[ ∂ log(1 + e^{-2y_i F(x_i)}) / ∂F(x_i) ]_{F = F_{m-1}} = 2y_i / (1 + e^{2y_i F_{m-1}(x_i)}).   (22)

(2) Train a tree f_m with L leaves on the training set {(x_i, ỹ_im)}; L may vary from iteration to iteration.

(3) Find the optimal score f_m(x_i) = γ_lm, x_i ∈ R_lm, of each leaf l, minimizing the loss of the leaf Σ_{x_i ∈ R_lm} ψ(y_i, F_m(x_i)), where

Σ_{x_i ∈ R_lm} ψ(y_i, F_m(x_i)) = Σ_{x_i ∈ R_lm} ψ(y_i, F_{m-1}(x_i) + γ_lm),   (23)

γ_lm = arg min_γ Σ_{x_i ∈ R_lm} ψ(y_i, F_{m-1}(x_i) + γ).   (24)

The minimum occurs where

∂[ Σ_{x_i ∈ R_lm} ψ(y_i, F_{m-1}(x_i) + γ) ]/∂γ = 0.   (25)

Least squares loss: by equation (20), Σ_{x_i ∈ R_lm} ½(y_i - F_{m-1}(x_i) - γ)² can be rewritten as Σ_{x_i ∈ R_lm} ½(γ² - 2ỹ_im γ + ỹ_im²). By equations (24) and (25),

∂[ Σ_{x_i ∈ R_lm} ½(γ² - 2ỹ_im γ + ỹ_im²) ]/∂γ = N_l γ - Σ_{x_i ∈ R_lm} ỹ_im = 0   ⟹   γ_lm = (1/N_l) Σ_{x_i ∈ R_lm} ỹ_im,

where N_l = N(R_lm) is the number of samples classified to leaf l.

Exponential loss: by equations (21), (24) and (25),

∂[ Σ_{x_i ∈ R_lm} e^{-y_i (F_{m-1}(x_i) + γ)} ]/∂γ = -Σ_{x_i ∈ R_lm} y_i e^{-y_i F_{m-1}(x_i)} e^{-y_i γ} = -Σ_{x_i ∈ R_lm} ỹ_im e^{-y_i γ} = 0,
-e^{-γ} Σ_{x_i ∈ R_lm} ỹ_im 1(y_i = 1) - e^{γ} Σ_{x_i ∈ R_lm} ỹ_im 1(y_i = -1) = 0,
γ_lm = ½ log[ Σ_{x_i ∈ R_lm} ỹ_im 1(y_i = 1) / ( -Σ_{x_i ∈ R_lm} ỹ_im 1(y_i = -1) ) ],

where the argument of the logarithm is positive because ỹ_im has the same sign as y_i.

Logistic loss: by equation (24) we would need to solve

Σ_{x_i ∈ R_lm} 2y_i / (1 + e^{2y_i (F_{m-1}(x_i) + γ)}) = 0,

but there is no closed-form solution; in practice it is solved numerically.

(4) Add the score of the leaf that x falls into to the ensemble classifier,

F_m(x) = F_{m-1}(x) + ν Σ_{l=1}^L γ_lm 1(x ∈ R_lm).

Here ν is the learning rate, the coefficient applied to γ_lm when it is added to the ensemble; in practical applications it is a parameter that has to be tuned so that the model is built conservatively.

Lastly, the output of the prediction depends on the loss function. For least squares, the sign of the ensemble result is output. For the logistic and exponential losses, the sigmoid function, introduced in the next chapter, maps the ensemble value to [0, 1], and the prediction is 1 if the value is larger than 0.5. Table 1 summarizes the results for the different loss functions.

Table 1: Summary of the different loss functions for gradient boosting

               | Least squares             | Exponential loss                                              | Logistic loss
loss function  | ½(y - F(x))²              | e^{-yF(x)}                                                    | log(e^{-2yF(x)} + 1)
F_0(x)         | (1/N) Σ_i y_i             | ½ log[P(y=1)/P(y=-1)]                                         | ½ log[P(y=1)/P(y=-1)]
ỹ_im           | y_i - F_{m-1}(x_i)        | y_i e^{-y_i F_{m-1}(x_i)}                                     | 2y_i/(1 + e^{2y_i F_{m-1}(x_i)})
γ_lm           | (1/N_l) Σ_{R_lm} ỹ_im     | ½ log[Σ_{R_lm} ỹ_im 1(y_i=1) / (-Σ_{R_lm} ỹ_im 1(y_i=-1))]    | no closed form

2.6 XGB - Extreme Gradient Boosting

XGB, also called extreme gradient boosting, first appeared in Kaggle competitions and was proposed by Tianqi Chen [7] at the University of Washington. The idea of XGB originates from gradient boosting; the biggest difference between the two is the objective function. In addition to the training loss, the objective function of XGB contains a regularization term, which does not exist in traditional gradient boosting and which can prevent over-fitting. Like the other boosting methods, the final classifier is the ensemble of many base classifiers, and the ensemble always gives a better prediction than the previous iteration. XGB uses a second-order Taylor expansion to approximate the objective function around the previous ensemble. We discuss the binary classification problem with labels y ∈ {-1, 1} and training set {(x_i, y_i), i = 1, 2, ..., N}, where x_i is a vector with d dimensions (d features).

2.6.1 Algorithm

Algorithm 5 Extreme gradient boosting
1: Initialize F_0(x_i) = γ_0, i = 1, 2, ..., N   (give an initial value to F)
2: while m <= M do
3:   g_im = [ ∂ψ(y_i, F(x_i))/∂F(x_i) ]_{F = F_{m-1}},  h_im = [ ∂²ψ(y_i, F(x_i))/∂F(x_i)² ]_{F = F_{m-1}}   (first and second derivatives for the second-order Taylor approximation)
4:   Train a new classifier that partitions {x_i}_{i=1}^N into L disjoint regions {R_lm}_{l=1}^L, using the training set {(x_i, (g_im, h_im))}
5:   w_lm = -( Σ_{x_i ∈ R_lm} g_im ) / ( Σ_{x_i ∈ R_lm} h_im + λ )   (optimal score of each region)
6:   F_m(x) = F_{m-1}(x) + ν Σ_{l=1}^L w_lm 1(x ∈ R_lm)   (add the base classifier to the additive model)
7: end while

γ_0: initial score of each sample
g_im: first derivative of the loss function for sample i at the m-th iteration
h_im: second derivative of the loss function for sample i at the m-th iteration
R_lm: samples classified to region l at the m-th iteration
w_lm: optimal score of leaf l at the m-th iteration
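The leaf-weight formula in step 5 can be checked numerically with a few lines of Python. The sketch below is illustrative only: it takes made-up previous scores and labels for the samples falling into one leaf, computes g_i and h_i for the logistic loss, and evaluates the optimal weight w = -Σg/(Σh + λ) together with the leaf term that enters the split criterion derived below.

import numpy as np

# made-up values for one leaf: previous ensemble scores and true labels in {-1, +1}
F_prev = np.array([0.3, -0.1, 0.0, 0.2])
y      = np.array([ 1,   -1,   1,  -1 ])
lam    = 1.0                                                        # λ, L2 regularization coefficient

# logistic loss ψ(y, F) = log(1 + exp(-2 y F))
g = -2 * y / (1 + np.exp(2 * y * F_prev))                           # first derivative  ∂ψ/∂F
h = 4 * np.exp(2 * y * F_prev) / (1 + np.exp(2 * y * F_prev))**2    # second derivative ∂²ψ/∂F²

w_leaf    = -g.sum() / (h.sum() + lam)                              # optimal leaf weight, step 5 of Algorithm 5
leaf_term = 0.5 * g.sum()**2 / (h.sum() + lam)                      # the leaf's contribution to the split gain in (38)
print(w_leaf, leaf_term)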

2.6.2 Derivation

The objective function of XGB consists of two terms, the loss function and the regularization term. Denote the objective at the m-th iteration by L^(m):

L^(m) = Σ_{i=1}^N ψ(y_i, F_{m-1}(x_i) + f_m(x_i)) + Ω(f_m),   (26)

Ω(f_m) = γL + ½ λ ||w||²,   (27)

where the two terms are the loss function and the regularization term, respectively. Here f_m is a regression tree trained on {(x_i, (g_im, h_im))}; it partitions the samples {x_i, i = 1, 2, ..., N} into L disjoint regions {R_lm}_{l=1}^L. The goal is to minimize the objective L^(m) under a fixed partition {R_lm}_{l=1}^L and to find the optimal weight w_lm of each region l such that f_m(x_i ∈ R_lm) = w_lm. The regions can be regarded as leaves. Unlike traditional gradient boosting, XGB approximates the loss by its second-order Taylor expansion, so ψ is required to be twice differentiable:

ψ(y_i, F_{m-1}(x_i) + f_m(x_i)) ≈ ψ(y_i, F_{m-1}(x_i)) + g_im f_m(x_i) + ½ h_im f_m²(x_i),   (28)

where

g_im = [ ∂ψ(y_i, F)/∂F ]_{F = F_{m-1}(x_i)},   (29)
h_im = [ ∂²ψ(y_i, F)/∂F² ]_{F = F_{m-1}(x_i)}.   (30)

Equation (26) becomes

L^(m) ≈ Σ_{i=1}^N [ ψ(y_i, F_{m-1}(x_i)) + g_im f_m(x_i) + ½ h_im f_m²(x_i) ] + γL + ½ λ ||w||².   (31)

Since ψ(y_i, F_{m-1}(x_i)) is known, this is equivalent to optimizing

L̃^(m) = Σ_{i=1}^N [ g_im f_m(x_i) + ½ h_im f_m²(x_i) ] + γL + ½ λ ||w||².   (32)

With f_m(x_i ∈ R_lm) = w_l, equation (32) can be rewritten as

L̃^(m) = Σ_{l=1}^L [ w_l Σ_{x_i ∈ R_lm} g_im + ½ w_l² ( Σ_{x_i ∈ R_lm} h_im + λ ) + γ ].

Let

L̃_l^(m) = w_l ( Σ_{x_i ∈ R_lm} g_im ) + ½ w_l² ( Σ_{x_i ∈ R_lm} h_im + λ ) + γ.   (33)

Finding the optimal regression tree f_m is equivalent to finding, for each leaf l, the optimal w_l that minimizes L̃_l^(m):

w_lm = arg min_{w_l} L̃_l^(m).   (34)

The minimum occurs when ∂L̃_l^(m)/∂w_l = 0. Denote a_lm = ½( Σ_{x_i ∈ R_lm} h_im + λ ) and b_lm = Σ_{x_i ∈ R_lm} g_im. Equation (33) becomes

L̃_l^(m) = a_lm w_l² + b_lm w_l + γ,

and, since ∂L̃_l^(m)/∂w_l = 2a_lm w_l + b_lm = 0, it turns out that

w_lm = -b_lm/(2a_lm) = -( Σ_{x_i ∈ R_lm} g_im ) / ( Σ_{x_i ∈ R_lm} h_im + λ ),   (35)

Optimal(L̃_l^(m)) = γ - b_lm²/(4a_lm) = γ - ( Σ_{x_i ∈ R_lm} g_im )² / ( 2( Σ_{x_i ∈ R_lm} h_im + λ ) ),

Optimal(L̃^(m)) = Σ_{l=1}^L Optimal(L̃_l^(m)) = γL - ½ Σ_{l=1}^L ( Σ_{x_i ∈ R_lm} g_im )² / ( Σ_{x_i ∈ R_lm} h_im + λ ),   (36)

which is important for the later derivation of the splitting criterion used when training the new classifier in each iteration.

In the algorithm, an initial value is given to the ensemble classifier; in practical applications it can be defined by the user, or simply the same value as in gradient boosting can be used. The m-th iteration includes four steps: (1) compute the first derivative g_im and the second derivative h_im from y_i and the ensemble value F_{m-1}(x_i) of the previous iteration for all samples; (2) fit a classifier with the training set {(x_i, (g_im, h_im))}; (3) compute the optimal scores w_lm; (4) add the scores to the ensemble classifier.

(1) Compute g_im and h_im for the different loss functions by equations (29) and (30).

Least squares loss (y - F(x))²:
g_im = 2F_{m-1}(x_i) - 2y_i,   h_im = 2.

Exponential loss e^{-yF(x)}:
g_im = -y_i e^{-y_i F_{m-1}(x_i)},   h_im = y_i² e^{-y_i F_{m-1}(x_i)} = e^{-y_i F_{m-1}(x_i)},
because y_i² = 1 for y_i ∈ {-1, 1}.

Logistic loss log(1 + e^{-2yF(x)}):
g_im = -2y_i e^{-2y_i F_{m-1}(x_i)} / (1 + e^{-2y_i F_{m-1}(x_i)}) = -2y_i / (1 + e^{2y_i F_{m-1}(x_i)}),
h_im = 4y_i² e^{2y_i F_{m-1}(x_i)} / (1 + e^{2y_i F_{m-1}(x_i)})² = 4e^{2y_i F_{m-1}(x_i)} / (1 + e^{2y_i F_{m-1}(x_i)})².

(2) and (3) Train a classifier that partitions {x_i, i = 1, 2, ..., N} into {R_lm}_{l=1}^L and find the optimal scores w_lm that minimize the objective function. In step (2), g_im and h_im are used in the stopping/splitting criterion for training the new classifier with L leaves: the classifier stops splitting after evaluating the effect of further splitting each leaf. To measure the goodness of the classifier we can use an idea similar to decision trees: grow a tree from a single node and stop when the impurity (entropy) would increase after a split, keeping the tree as it was before that split. In XGB, the objective value L̃^(m) plays the role of this impurity index: if the objective value would increase after splitting leaf l, leaf l is not split; otherwise splitting continues until no leaf can be split further.

To formulate the stopping criterion, partition the samples in R_lm into R_L and R_R with leaf indices l_L and l_R. If l is a leaf of the optimal classifier, no possible partition (split) of R_lm decreases the objective value. Since the objective values of the other leaves do not change when R_lm is split, the difference in objectives equals the difference between Optimal(L̃_l^(m)) and Optimal(L̃_{l_L}^(m) + L̃_{l_R}^(m)). Thus the stopping criterion can be formulated as

δL^(m) = Optimal(L̃_l^(m)) - Optimal(L̃_{l_L}^(m) + L̃_{l_R}^(m)) < 0,   (37)

where

Optimal(L̃_l^(m)) = γ - ( Σ_{x_i ∈ R_lm} g_im )² / ( 2( Σ_{x_i ∈ R_lm} h_im + λ ) ),
Optimal(L̃_{l_L}^(m) + L̃_{l_R}^(m)) = 2γ - ( Σ_{x_i ∈ R_L} g_im )² / ( 2( Σ_{x_i ∈ R_L} h_im + λ ) ) - ( Σ_{x_i ∈ R_R} g_im )² / ( 2( Σ_{x_i ∈ R_R} h_im + λ ) ).

Inequality (37) becomes

( Σ_{x_i ∈ R_L} g_im )² / ( 2( Σ_{x_i ∈ R_L} h_im + λ ) ) + ( Σ_{x_i ∈ R_R} g_im )² / ( 2( Σ_{x_i ∈ R_R} h_im + λ ) ) - ( Σ_{x_i ∈ R_lm} g_im )² / ( 2( Σ_{x_i ∈ R_lm} h_im + λ ) ) - γ < 0.   (38)

If l is a leaf of the optimal classifier, all possible partitions satisfy inequality (38). If δL^(m) > 0, which means l can be split further, we have to find the best split of leaf l. In decision trees the best split is the partition that reduces the impurity (entropy) score the most; the exact greedy split finding algorithm in XGB follows the same idea and chooses the split that maximizes δL^(m), i.e. decreases the objective value the most. In the exact greedy algorithm, the samples in R_lm are sorted by x_ik, the value of the k-th feature, and the candidate partitions R_L = {x_s : x_sk ≤ x_ik} and R_R = {x_j : x_jk > x_ik} are evaluated, always keeping the partition with the larger δL^(m). Repeating this for every feature finally gives the best split of leaf l. After obtaining L and {R_lm}_{l=1}^L, the optimal weight w_lm of each leaf l is calculated by equation (35).

In the last step the ensemble classifier is output. For the least squares loss and a binary classification problem with y_i ∈ {-1, 1}, the prediction is the sign of the ensemble value; for the exponential and logistic losses, the sigmoid function maps the value to [0, 1] and the prediction is 1 if the value is larger than 0.5. Table 2 summarizes the results for the different objective functions.

Table 2: Summary of the different loss functions for XGB

               | Least squares                          | Exponential loss                                              | Logistic loss
loss function  | (y - F(x))²                            | e^{-yF(x)}                                                    | log(e^{-2yF(x)} + 1)
g_im           | 2(F_{m-1}(x_i) - y_i)                  | -y_i e^{-y_i F_{m-1}(x_i)}                                    | -2y_i/(1 + e^{2y_i F_{m-1}(x_i)})
h_im           | 2                                      | e^{-y_i F_{m-1}(x_i)}                                         | 4e^{2y_i F_{m-1}(x_i)}/(1 + e^{2y_i F_{m-1}(x_i)})²
w_lm           | 2 Σ(y_i - F_{m-1}(x_i)) / (2N_l + λ)   | Σ y_i e^{-y_i F_{m-1}(x_i)} / ( Σ e^{-y_i F_{m-1}(x_i)} + λ ) | Σ 2y_i/(1 + e^{2y_i F_{m-1}(x_i)}) / ( Σ 4e^{2y_i F_{m-1}(x_i)}/(1 + e^{2y_i F_{m-1}(x_i)})² + λ )

(all sums are taken over x_i ∈ R_lm)
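The method summarized in Table 2 is available through the xgboost Python package, which provides a scikit-learn compatible interface. A hedged usage sketch on synthetic imbalanced data follows; the parameter names map to the notation of this chapter, and the values are illustrative only.

from xgboost import XGBClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

# synthetic, highly imbalanced data standing in for the production data (about 1% Fail)
X, y = make_classification(n_samples=20000, n_features=30, weights=[0.99, 0.01], random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.2, stratify=y, random_state=0)

model = XGBClassifier(
    n_estimators=103,             # M, number of boosting rounds
    max_depth=2,                  # depth of each tree
    learning_rate=0.1,            # ν
    reg_lambda=1.0,               # λ, L2 regularization on the leaf weights
    gamma=0.0,                    # γ, minimum loss reduction required to split a leaf
    objective="binary:logistic",  # logistic loss
)
model.fit(X_tr, y_tr)
print("validation AUC:", roc_auc_score(y_va, model.predict_proba(X_va)[:, 1]))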

3 Measurement and Experiment

3.1 Measurement of goodness of the classifier

In general, the goodness of a model can be measured by accuracy, the ratio between the number of correct predictions and the total count. In some cases, however, accuracy does not truly represent the goodness of the model, for example in quality test prediction for a high-yield product, where Fail samples are rare compared to Pass samples. In the dataset used for demonstration in this thesis, the Fail samples are less than 1% of all products, so the two classes have a big disparity. A model trained on such an imbalanced dataset tends to predict the majority class, yet it can still reach a high accuracy, which varies with the class ratio of the given test set. For example, if the ratio between the two classes (Fail/Pass) is 0.01 (1:100), the model will be prone to predict Pass, and the accuracy is higher than 99% when the class distribution of the test set is the same as that of the training set; but on a test set with more failure cases, such a model cannot predict precisely. To effectively measure the goodness of a model learned from an imbalanced two-class dataset, the area under the ROC curve is used throughout this thesis.

3.1.1 ROC curve

The ROC (receiver operating characteristics) curve is used for measuring the goodness of a binary classifier. Its x axis and y axis represent the false positive rate and the true positive rate, respectively. The area under the curve, called the AUC, is an index of the goodness of the model. In binary classification we call the two classes + and -; the combination of the predicted result and the true class label gives four different cases, TP (true positive), FP (false positive), TN (true negative) and FN (false negative), which can be arranged in a confusion matrix as in Table 3. A true positive means a condition-positive case is predicted as positive, and a false positive means a negative case is predicted as positive. To draw the ROC curve, we adjust the threshold and obtain different pairs of false positive rate and true positive rate (FPR, TPR), defined by equations (39) and (40). For example, if an instance's probability of being + is p and p > threshold, it is classified to the positive class; thus, the higher the threshold, the smaller the false positive rate.
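Computing the ROC curve and the AUC is straightforward with scikit-learn; a small sketch on made-up labels and scores (purely illustrative values):

import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

# made-up true labels and predicted scores (e.g. sigmoid-mapped ensemble values)
y_true  = np.array([0, 0, 1, 0, 1, 0, 0, 1, 0, 0])
y_score = np.array([0.1, 0.3, 0.8, 0.2, 0.6, 0.4, 0.05, 0.9, 0.7, 0.15])

fpr, tpr, thresholds = roc_curve(y_true, y_score)   # one (FPR, TPR) pair per threshold
auc = roc_auc_score(y_true, y_score)                # area under the ROC curve
print("AUC:", auc)
for f, t, th in zip(fpr, tpr, thresholds):
    print(f"threshold={th:.2f}  FPR={f:.2f}  TPR={t:.2f}")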

Figure 1: Receiver operating characteristics curve

Table 3: Confusion matrix obtained under a certain threshold.

             | Predicted +          | Predicted -
Condition +  | True positive (TP)   | False negative (FN)
Condition -  | False positive (FP)  | True negative (TN)

TPR = TP / (TP + FN)   (39)
FPR = FP / (FP + TN)   (40)

3.1.2 Sigmoid function

The curve of the sigmoid function, an antisymmetric function, looks like an S, as shown in Fig 2. A special case of the sigmoid function is the logistic function, which maps a real number to [0, 1]:

S(x) = 1 / (1 + e^{-x}).   (41)

Some boosting methods return the sign of the ensemble classifier F(x) as the prediction result, where F(x) is an additive logistic function. In that case the threshold is implicitly chosen as 0.5: as we can see in Fig 2, S(x) > 0.5 when x is positive. To be able to cut at different thresholds and draw the ROC curve, the sigmoid function is used in this thesis to map F to [0, 1]. Discrete AdaBoost is not suitable for this, because after mapping by the sigmoid function the returned value behaves like an accuracy (1 - error) rather than a class probability. Basically this is used only when the objective function is exponential or logistic, as in real AdaBoost, gradient boosting and extreme gradient boosting.

Figure 2: Sigmoid function maps the score to [0, 1]

3.2 Data preparation

The original dataset consists of three parts: categorical, numerical and timestamp data, with 4267 columns in total. Because of the limited computational capacity of a laptop, only numerical and time-related data from a partial process flow are used for the later demonstration. Before starting to train the model, I construct three training sets as in Table 4. The first training set is built by cutting a partial consecutive process flow and leaving out the rows with missing values. The dataset does not reveal the real physical meaning of each feature in the numerical data, but there is still some useful information, such as the process sequence of each product through the machines. Some process steps can be carried out by more than one machine: for example, in Fig 3, B1 and B2 are equivalent machines, so at this station a product can be processed either in B1 or in B2, except for some special cases that need rework. In the first training set, data with the same feature measured on equivalent machines were placed in different columns, so I merged them into the same column.

The second training set is constructed by adding some extra features to the first one: the machine number and the machine idling time before the product starts processing. If the product passes through B1, the added feature is B1. It is possible that a machine has a long queueing time before a product arrives; the machine's idling time is the time difference between two consecutive products processed on this machine. There are many missing values in each column, which means the sampling rate of the measurements is not 100%. For the third training set, two consecutive partial process flows are cut and the missing values are kept; XGB is used for the later demonstration with this set, because among the packages used here only XGB supports cells with missing values.

Table 4: Size of the training sets

Training set 1: partial process flow without missing data
Training set 2: partial process flow and extra added features
Training set 3: partial process flow with missing data

Figure 3: Partial process flow

3.3 Implementation

The demonstration is implemented on an ASUS laptop with Windows 10 (64-bit), 4 GB RAM and an x64 processor. I code in Python, and the programming part is executed in a Jupyter notebook, an online interactive platform supporting multiple programming languages in which each cell can be executed independently. The version of Python is 3.3 and of Jupyter notebook 4.1. Compared to the original dataset, the training set used for the demonstration is very small, so the prediction results are not as good as the higher-ranked results in the competition; the purpose of this thesis is not high accuracy but the demonstration of the procedure. The programming uses the scikit-learn package together with the xgboost package: most of the methods, such as AdaBoost and gradient boosting, can be found in scikit-learn, while extreme gradient boosting is provided by xgboost. To build the model, the procedure includes two steps: (1) find the optimal parameters to train the model; (2) find an optimal threshold so that the result of the prediction can be output.

3.3.1 Optimization of parameters

In the programming, the model is optimized under a given condition with fixed parameters. To find these optimal parameters before training the model, I use a line search strategy, a method that looks for a local extreme value by adjusting one specific parameter while the other parameters are kept fixed. I demonstrate the procedure using extreme gradient boosting with the third training set of Table 4, because XGB is the only one of the four methods able to handle a training set with missing data. The line search strategy is shown in Fig 4 to Fig 8. The order in which the parameters are picked for maximizing the AUC does not matter much. First, we fix all parameters except the maximal depth of the base classifier and look for a local extreme by adjusting the maximal depth. Fig 4 shows that the local maximum of the ROC AUC occurs at max depth = 2.
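A line search over one parameter is just a loop. The sketch below is illustrative: it uses a synthetic dataset and cross-validated AUC as the score while searching the maximal tree depth with the other parameters fixed at their defaults; the same loop is then repeated for the number of estimators, the learning rate, γ and λ.

import numpy as np
from xgboost import XGBClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=10000, n_features=30, weights=[0.99, 0.01], random_state=0)

best_depth, best_auc = None, -np.inf
for depth in range(1, 11):                                # candidate values for max_depth
    model = XGBClassifier(n_estimators=100, max_depth=depth, learning_rate=0.1,
                          reg_lambda=1.0, gamma=0.0, objective="binary:logistic")
    auc = cross_val_score(model, X, y, cv=3, scoring="roc_auc").mean()
    if auc > best_auc:
        best_depth, best_auc = depth, auc
    print(f"max_depth={depth}: CV AUC={auc:.4f}")
print("best max_depth:", best_depth)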

Figure 4: Default values: M = 100, λ = 1, γ = 0, objective = logistic, ν = 0.1

Secondly, we fix max depth = 2 and pick the number of base classifiers (n_estimators) for the line search, looking for the value that maximizes the ROC AUC while the other parameters stay fixed. In Fig 5, the maximum of the AUC occurs at M = 103.

Figure 5: max depth = 2, other parameters at default values: λ = 1, γ = 0, objective = logistic, ν = 0.1

Figure 6: M = 103, max depth = 2, λ = 1, γ = 0, objective = logistic

Fig 6 shows that the maximum AUC is reached at learning rate ν = 0.1, and the AUC keeps trending down after 0.1; from empirical experience, ν is usually between 0.01 and 0.2. γ is a parameter that only appears in XGB. Since the AUC did not change between 0.01 and 0.2 during the programming, I took 0.2 as the increment of each iteration. The maximum occurs at a small value of γ. In practice γ may be either very close to zero or very far from zero, but the result in Fig 7 shows that the model is almost the same as random guessing when γ is very large.

Figure 7: M = 103, max depth = 2, λ = 1, objective = logistic

Figure 8: M = 103, max depth = 2, λ = 1, objective = logistic

Figure 9: M = 103, max depth = 2, λ = 1, objective = logistic

λ is also a parameter that only appears in XGB; it is the coefficient of the L2 regularization.

Figure 10: M = 103, max depth = 2, γ = 0, objective = logistic

Finally, we obtain the optimal condition and its AUC; the optimal parameter values are summarized in the corresponding table.

3.3.2 Optimal threshold

The ROC curve under the optimal condition is shown in Fig 11. To find an optimal threshold, we maximize the MCC (Matthews correlation coefficient), which represents the correlation between the two imbalanced classes well; its value is always between -1 and 1 and it can be computed by equation (42) from the confusion matrix obtained at a given threshold:

MCC = (TP·TN - FP·FN) / sqrt((TP + FP)(TP + FN)(TN + FP)(TN + FN))   (42)

Figure 11: ROC curve under optimal condition
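The search for the MCC-maximizing threshold can be written as a simple scan over candidate thresholds; matthews_corrcoef is scikit-learn's implementation of equation (42), and the label/score arrays here are made up for illustration.

import numpy as np
from sklearn.metrics import matthews_corrcoef

def best_threshold(y_true, y_score, n_grid=200):
    """Scan candidate thresholds and return the one maximizing the MCC."""
    thresholds = np.linspace(y_score.min(), y_score.max(), n_grid)
    mccs = [matthews_corrcoef(y_true, (y_score >= t).astype(int)) for t in thresholds]
    best = int(np.argmax(mccs))
    return thresholds[best], mccs[best]

# made-up scores; in the thesis y_score comes from the fitted XGB model on the test set
y_true  = np.array([0, 0, 1, 0, 1, 0, 0, 1, 0, 0])
y_score = np.array([0.1, 0.3, 0.8, 0.2, 0.6, 0.4, 0.05, 0.9, 0.7, 0.15])
print(best_threshold(y_true, y_score))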

Figure 12: Confusion matrix at the selected threshold

3.3.3 Regularization and over-fitting

The most significant difference between the XGB algorithm and the other boosting methods is the regularization term, which can prevent the over-fitting problem. Fig 13 and 14 show that the model becomes slightly more stable when the number of estimators is larger than 100, while the training error continues to decay in the meantime. Fig 15 shows that not only the over-fitting problem but also the AUC improves when the regularization is taken into account.
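A hedged sketch of the kind of comparison behind these figures: the same XGB model is trained with and without regularization, and the training and validation AUC are tracked as the number of trees grows (synthetic data; all parameter values are illustrative).

from xgboost import XGBClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

X, y = make_classification(n_samples=20000, n_features=30, weights=[0.99, 0.01],
                           flip_y=0.02, random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.3, stratify=y, random_state=0)

for name, params in [("with regularization", dict(reg_lambda=1.0, gamma=0.2)),
                     ("without regularization", dict(reg_lambda=0.0, gamma=0.0))]:
    for n in (50, 100, 200, 400):                       # growing number of estimators M
        model = XGBClassifier(n_estimators=n, max_depth=2, learning_rate=0.1,
                              objective="binary:logistic", **params)
        model.fit(X_tr, y_tr)
        auc_tr = roc_auc_score(y_tr, model.predict_proba(X_tr)[:, 1])
        auc_va = roc_auc_score(y_va, model.predict_proba(X_va)[:, 1])
        print(f"{name}, M={n}: train AUC={auc_tr:.3f}, validation AUC={auc_va:.3f}")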


More information

VBM683 Machine Learning

VBM683 Machine Learning VBM683 Machine Learning Pinar Duygulu Slides are adapted from Dhruv Batra Bias is the algorithm's tendency to consistently learn the wrong thing by not taking into account all the information in the data

More information

Methods and Criteria for Model Selection. CS57300 Data Mining Fall Instructor: Bruno Ribeiro

Methods and Criteria for Model Selection. CS57300 Data Mining Fall Instructor: Bruno Ribeiro Methods and Criteria for Model Selection CS57300 Data Mining Fall 2016 Instructor: Bruno Ribeiro Goal } Introduce classifier evaluation criteria } Introduce Bias x Variance duality } Model Assessment }

More information

Machine Learning Linear Classification. Prof. Matteo Matteucci

Machine Learning Linear Classification. Prof. Matteo Matteucci Machine Learning Linear Classification Prof. Matteo Matteucci Recall from the first lecture 2 X R p Regression Y R Continuous Output X R p Y {Ω 0, Ω 1,, Ω K } Classification Discrete Output X R p Y (X)

More information

Neural Networks and Deep Learning

Neural Networks and Deep Learning Neural Networks and Deep Learning Professor Ameet Talwalkar November 12, 2015 Professor Ameet Talwalkar Neural Networks and Deep Learning November 12, 2015 1 / 16 Outline 1 Review of last lecture AdaBoost

More information

Evaluation. Andrea Passerini Machine Learning. Evaluation

Evaluation. Andrea Passerini Machine Learning. Evaluation Andrea Passerini passerini@disi.unitn.it Machine Learning Basic concepts requires to define performance measures to be optimized Performance of learning algorithms cannot be evaluated on entire domain

More information

Statistics and learning: Big Data

Statistics and learning: Big Data Statistics and learning: Big Data Learning Decision Trees and an Introduction to Boosting Sébastien Gadat Toulouse School of Economics February 2017 S. Gadat (TSE) SAD 2013 1 / 30 Keywords Decision trees

More information

Boosting with decision stumps and binary features

Boosting with decision stumps and binary features Boosting with decision stumps and binary features Jason Rennie jrennie@ai.mit.edu April 10, 2003 1 Introduction A special case of boosting is when features are binary and the base learner is a decision

More information

Machine Learning for NLP

Machine Learning for NLP Machine Learning for NLP Linear Models Joakim Nivre Uppsala University Department of Linguistics and Philology Slides adapted from Ryan McDonald, Google Research Machine Learning for NLP 1(26) Outline

More information

Applied Machine Learning Annalisa Marsico

Applied Machine Learning Annalisa Marsico Applied Machine Learning Annalisa Marsico OWL RNA Bionformatics group Max Planck Institute for Molecular Genetics Free University of Berlin 22 April, SoSe 2015 Goals Feature Selection rather than Feature

More information

Big Data Analytics. Special Topics for Computer Science CSE CSE Feb 24

Big Data Analytics. Special Topics for Computer Science CSE CSE Feb 24 Big Data Analytics Special Topics for Computer Science CSE 4095-001 CSE 5095-005 Feb 24 Fei Wang Associate Professor Department of Computer Science and Engineering fei_wang@uconn.edu Prediction III Goal

More information

COMS 4771 Lecture Boosting 1 / 16

COMS 4771 Lecture Boosting 1 / 16 COMS 4771 Lecture 12 1. Boosting 1 / 16 Boosting What is boosting? Boosting: Using a learning algorithm that provides rough rules-of-thumb to construct a very accurate predictor. 3 / 16 What is boosting?

More information

Announcements Kevin Jamieson

Announcements Kevin Jamieson Announcements My office hours TODAY 3:30 pm - 4:30 pm CSE 666 Poster Session - Pick one First poster session TODAY 4:30 pm - 7:30 pm CSE Atrium Second poster session December 12 4:30 pm - 7:30 pm CSE Atrium

More information

Model Accuracy Measures

Model Accuracy Measures Model Accuracy Measures Master in Bioinformatics UPF 2017-2018 Eduardo Eyras Computational Genomics Pompeu Fabra University - ICREA Barcelona, Spain Variables What we can measure (attributes) Hypotheses

More information

SUPERVISED LEARNING: INTRODUCTION TO CLASSIFICATION

SUPERVISED LEARNING: INTRODUCTION TO CLASSIFICATION SUPERVISED LEARNING: INTRODUCTION TO CLASSIFICATION 1 Outline Basic terminology Features Training and validation Model selection Error and loss measures Statistical comparison Evaluation measures 2 Terminology

More information

Hierarchical Boosting and Filter Generation

Hierarchical Boosting and Filter Generation January 29, 2007 Plan Combining Classifiers Boosting Neural Network Structure of AdaBoost Image processing Hierarchical Boosting Hierarchical Structure Filters Combining Classifiers Combining Classifiers

More information

CS534 Machine Learning - Spring Final Exam

CS534 Machine Learning - Spring Final Exam CS534 Machine Learning - Spring 2013 Final Exam Name: You have 110 minutes. There are 6 questions (8 pages including cover page). If you get stuck on one question, move on to others and come back to the

More information

Boosting: Foundations and Algorithms. Rob Schapire

Boosting: Foundations and Algorithms. Rob Schapire Boosting: Foundations and Algorithms Rob Schapire Example: Spam Filtering problem: filter out spam (junk email) gather large collection of examples of spam and non-spam: From: yoav@ucsd.edu Rob, can you

More information

CSE 417T: Introduction to Machine Learning. Final Review. Henry Chai 12/4/18

CSE 417T: Introduction to Machine Learning. Final Review. Henry Chai 12/4/18 CSE 417T: Introduction to Machine Learning Final Review Henry Chai 12/4/18 Overfitting Overfitting is fitting the training data more than is warranted Fitting noise rather than signal 2 Estimating! "#$

More information

Recitation 9. Gradient Boosting. Brett Bernstein. March 30, CDS at NYU. Brett Bernstein (CDS at NYU) Recitation 9 March 30, / 14

Recitation 9. Gradient Boosting. Brett Bernstein. March 30, CDS at NYU. Brett Bernstein (CDS at NYU) Recitation 9 March 30, / 14 Brett Bernstein CDS at NYU March 30, 2017 Brett Bernstein (CDS at NYU) Recitation 9 March 30, 2017 1 / 14 Initial Question Intro Question Question Suppose 10 different meteorologists have produced functions

More information

Machine Learning Ensemble Learning I Hamid R. Rabiee Jafar Muhammadi, Alireza Ghasemi Spring /

Machine Learning Ensemble Learning I Hamid R. Rabiee Jafar Muhammadi, Alireza Ghasemi Spring / Machine Learning Ensemble Learning I Hamid R. Rabiee Jafar Muhammadi, Alireza Ghasemi Spring 2015 http://ce.sharif.edu/courses/93-94/2/ce717-1 / Agenda Combining Classifiers Empirical view Theoretical

More information

FINAL: CS 6375 (Machine Learning) Fall 2014

FINAL: CS 6375 (Machine Learning) Fall 2014 FINAL: CS 6375 (Machine Learning) Fall 2014 The exam is closed book. You are allowed a one-page cheat sheet. Answer the questions in the spaces provided on the question sheets. If you run out of room for

More information

Gradient Boosting, Continued

Gradient Boosting, Continued Gradient Boosting, Continued David Rosenberg New York University December 26, 2016 David Rosenberg (New York University) DS-GA 1003 December 26, 2016 1 / 16 Review: Gradient Boosting Review: Gradient Boosting

More information

Statistical Machine Learning from Data

Statistical Machine Learning from Data Samy Bengio Statistical Machine Learning from Data 1 Statistical Machine Learning from Data Ensembles Samy Bengio IDIAP Research Institute, Martigny, Switzerland, and Ecole Polytechnique Fédérale de Lausanne

More information

Machine Learning, Fall 2012 Homework 2

Machine Learning, Fall 2012 Homework 2 0-60 Machine Learning, Fall 202 Homework 2 Instructors: Tom Mitchell, Ziv Bar-Joseph TA in charge: Selen Uguroglu email: sugurogl@cs.cmu.edu SOLUTIONS Naive Bayes, 20 points Problem. Basic concepts, 0

More information

A Brief Introduction to Adaboost

A Brief Introduction to Adaboost A Brief Introduction to Adaboost Hongbo Deng 6 Feb, 2007 Some of the slides are borrowed from Derek Hoiem & Jan ˇSochman. 1 Outline Background Adaboost Algorithm Theory/Interpretations 2 What s So Good

More information

Lecture 8. Instructor: Haipeng Luo

Lecture 8. Instructor: Haipeng Luo Lecture 8 Instructor: Haipeng Luo Boosting and AdaBoost In this lecture we discuss the connection between boosting and online learning. Boosting is not only one of the most fundamental theories in machine

More information

Learning Ensembles. 293S T. Yang. UCSB, 2017.

Learning Ensembles. 293S T. Yang. UCSB, 2017. Learning Ensembles 293S T. Yang. UCSB, 2017. Outlines Learning Assembles Random Forest Adaboost Training data: Restaurant example Examples described by attribute values (Boolean, discrete, continuous)

More information

Data Mining and Analysis: Fundamental Concepts and Algorithms

Data Mining and Analysis: Fundamental Concepts and Algorithms Data Mining and Analysis: Fundamental Concepts and Algorithms dataminingbook.info Mohammed J. Zaki 1 Wagner Meira Jr. 2 1 Department of Computer Science Rensselaer Polytechnic Institute, Troy, NY, USA

More information

10701/15781 Machine Learning, Spring 2007: Homework 2

10701/15781 Machine Learning, Spring 2007: Homework 2 070/578 Machine Learning, Spring 2007: Homework 2 Due: Wednesday, February 2, beginning of the class Instructions There are 4 questions on this assignment The second question involves coding Do not attach

More information

Feature Engineering, Model Evaluations

Feature Engineering, Model Evaluations Feature Engineering, Model Evaluations Giri Iyengar Cornell University gi43@cornell.edu Feb 5, 2018 Giri Iyengar (Cornell Tech) Feature Engineering Feb 5, 2018 1 / 35 Overview 1 ETL 2 Feature Engineering

More information

Lecture 4 Discriminant Analysis, k-nearest Neighbors

Lecture 4 Discriminant Analysis, k-nearest Neighbors Lecture 4 Discriminant Analysis, k-nearest Neighbors Fredrik Lindsten Division of Systems and Control Department of Information Technology Uppsala University. Email: fredrik.lindsten@it.uu.se fredrik.lindsten@it.uu.se

More information

Linear Classification. CSE 6363 Machine Learning Vassilis Athitsos Computer Science and Engineering Department University of Texas at Arlington

Linear Classification. CSE 6363 Machine Learning Vassilis Athitsos Computer Science and Engineering Department University of Texas at Arlington Linear Classification CSE 6363 Machine Learning Vassilis Athitsos Computer Science and Engineering Department University of Texas at Arlington 1 Example of Linear Classification Red points: patterns belonging

More information

Machine Learning, Midterm Exam: Spring 2008 SOLUTIONS. Q Topic Max. Score Score. 1 Short answer questions 20.

Machine Learning, Midterm Exam: Spring 2008 SOLUTIONS. Q Topic Max. Score Score. 1 Short answer questions 20. 10-601 Machine Learning, Midterm Exam: Spring 2008 Please put your name on this cover sheet If you need more room to work out your answer to a question, use the back of the page and clearly mark on the

More information

COMS 4721: Machine Learning for Data Science Lecture 13, 3/2/2017

COMS 4721: Machine Learning for Data Science Lecture 13, 3/2/2017 COMS 4721: Machine Learning for Data Science Lecture 13, 3/2/2017 Prof. John Paisley Department of Electrical Engineering & Data Science Institute Columbia University BOOSTING Robert E. Schapire and Yoav

More information

Final Overview. Introduction to ML. Marek Petrik 4/25/2017

Final Overview. Introduction to ML. Marek Petrik 4/25/2017 Final Overview Introduction to ML Marek Petrik 4/25/2017 This Course: Introduction to Machine Learning Build a foundation for practice and research in ML Basic machine learning concepts: max likelihood,

More information

CS7267 MACHINE LEARNING

CS7267 MACHINE LEARNING CS7267 MACHINE LEARNING ENSEMBLE LEARNING Ref: Dr. Ricardo Gutierrez-Osuna at TAMU, and Aarti Singh at CMU Mingon Kang, Ph.D. Computer Science, Kennesaw State University Definition of Ensemble Learning

More information

Universität Potsdam Institut für Informatik Lehrstuhl Maschinelles Lernen. Decision Trees. Tobias Scheffer

Universität Potsdam Institut für Informatik Lehrstuhl Maschinelles Lernen. Decision Trees. Tobias Scheffer Universität Potsdam Institut für Informatik Lehrstuhl Maschinelles Lernen Decision Trees Tobias Scheffer Decision Trees One of many applications: credit risk Employed longer than 3 months Positive credit

More information

CS229 Supplemental Lecture notes

CS229 Supplemental Lecture notes CS229 Supplemental Lecture notes John Duchi 1 Boosting We have seen so far how to solve classification (and other) problems when we have a data representation already chosen. We now talk about a procedure,

More information

SVAN 2016 Mini Course: Stochastic Convex Optimization Methods in Machine Learning

SVAN 2016 Mini Course: Stochastic Convex Optimization Methods in Machine Learning SVAN 2016 Mini Course: Stochastic Convex Optimization Methods in Machine Learning Mark Schmidt University of British Columbia, May 2016 www.cs.ubc.ca/~schmidtm/svan16 Some images from this lecture are

More information

Generalized Boosted Models: A guide to the gbm package

Generalized Boosted Models: A guide to the gbm package Generalized Boosted Models: A guide to the gbm package Greg Ridgeway April 15, 2006 Boosting takes on various forms with different programs using different loss functions, different base models, and different

More information

Variations of Logistic Regression with Stochastic Gradient Descent

Variations of Logistic Regression with Stochastic Gradient Descent Variations of Logistic Regression with Stochastic Gradient Descent Panqu Wang(pawang@ucsd.edu) Phuc Xuan Nguyen(pxn002@ucsd.edu) January 26, 2012 Abstract In this paper, we extend the traditional logistic

More information

Robotics 2 AdaBoost for People and Place Detection

Robotics 2 AdaBoost for People and Place Detection Robotics 2 AdaBoost for People and Place Detection Giorgio Grisetti, Cyrill Stachniss, Kai Arras, Wolfram Burgard v.1.0, Kai Arras, Oct 09, including material by Luciano Spinello and Oscar Martinez Mozos

More information

Nonlinear Classification

Nonlinear Classification Nonlinear Classification INFO-4604, Applied Machine Learning University of Colorado Boulder October 5-10, 2017 Prof. Michael Paul Linear Classification Most classifiers we ve seen use linear functions

More information

Regularization. CSCE 970 Lecture 3: Regularization. Stephen Scott and Vinod Variyam. Introduction. Outline

Regularization. CSCE 970 Lecture 3: Regularization. Stephen Scott and Vinod Variyam. Introduction. Outline Other Measures 1 / 52 sscott@cse.unl.edu learning can generally be distilled to an optimization problem Choose a classifier (function, hypothesis) from a set of functions that minimizes an objective function

More information

LogitBoost with Trees Applied to the WCCI 2006 Performance Prediction Challenge Datasets

LogitBoost with Trees Applied to the WCCI 2006 Performance Prediction Challenge Datasets Applied to the WCCI 2006 Performance Prediction Challenge Datasets Roman Lutz Seminar für Statistik, ETH Zürich, Switzerland 18. July 2006 Overview Motivation 1 Motivation 2 The logistic framework The

More information

Logistic Regression and Boosting for Labeled Bags of Instances

Logistic Regression and Boosting for Labeled Bags of Instances Logistic Regression and Boosting for Labeled Bags of Instances Xin Xu and Eibe Frank Department of Computer Science University of Waikato Hamilton, New Zealand {xx5, eibe}@cs.waikato.ac.nz Abstract. In

More information

i=1 = H t 1 (x) + α t h t (x)

i=1 = H t 1 (x) + α t h t (x) AdaBoost AdaBoost, which stands for ``Adaptive Boosting", is an ensemble learning algorithm that uses the boosting paradigm []. We will discuss AdaBoost for binary classification. That is, we assume that

More information

Stephen Scott.

Stephen Scott. 1 / 35 (Adapted from Ethem Alpaydin and Tom Mitchell) sscott@cse.unl.edu In Homework 1, you are (supposedly) 1 Choosing a data set 2 Extracting a test set of size > 30 3 Building a tree on the training

More information

Kernel Logistic Regression and the Import Vector Machine

Kernel Logistic Regression and the Import Vector Machine Kernel Logistic Regression and the Import Vector Machine Ji Zhu and Trevor Hastie Journal of Computational and Graphical Statistics, 2005 Presented by Mingtao Ding Duke University December 8, 2011 Mingtao

More information

Decision Tree Learning Lecture 2

Decision Tree Learning Lecture 2 Machine Learning Coms-4771 Decision Tree Learning Lecture 2 January 28, 2008 Two Types of Supervised Learning Problems (recap) Feature (input) space X, label (output) space Y. Unknown distribution D over

More information

Boosting & Deep Learning

Boosting & Deep Learning Boosting & Deep Learning Ensemble Learning n So far learning methods that learn a single hypothesis, chosen form a hypothesis space that is used to make predictions n Ensemble learning à select a collection

More information

Metric Embedding of Task-Specific Similarity. joint work with Trevor Darrell (MIT)

Metric Embedding of Task-Specific Similarity. joint work with Trevor Darrell (MIT) Metric Embedding of Task-Specific Similarity Greg Shakhnarovich Brown University joint work with Trevor Darrell (MIT) August 9, 2006 Task-specific similarity A toy example: Task-specific similarity A toy

More information

Linear Models in Machine Learning

Linear Models in Machine Learning CS540 Intro to AI Linear Models in Machine Learning Lecturer: Xiaojin Zhu jerryzhu@cs.wisc.edu We briefly go over two linear models frequently used in machine learning: linear regression for, well, regression,

More information

Introduction to Machine Learning

Introduction to Machine Learning Introduction to Machine Learning Thomas G. Dietterich tgd@eecs.oregonstate.edu 1 Outline What is Machine Learning? Introduction to Supervised Learning: Linear Methods Overfitting, Regularization, and the

More information

CS 6375 Machine Learning

CS 6375 Machine Learning CS 6375 Machine Learning Decision Trees Instructor: Yang Liu 1 Supervised Classifier X 1 X 2. X M Ref class label 2 1 Three variables: Attribute 1: Hair = {blond, dark} Attribute 2: Height = {tall, short}

More information

Logistic Regression Logistic

Logistic Regression Logistic Case Study 1: Estimating Click Probabilities L2 Regularization for Logistic Regression Machine Learning/Statistics for Big Data CSE599C1/STAT592, University of Washington Carlos Guestrin January 10 th,

More information

A Gentle Introduction to Gradient Boosting. Cheng Li College of Computer and Information Science Northeastern University

A Gentle Introduction to Gradient Boosting. Cheng Li College of Computer and Information Science Northeastern University A Gentle Introduction to Gradient Boosting Cheng Li chengli@ccs.neu.edu College of Computer and Information Science Northeastern University Gradient Boosting a powerful machine learning algorithm it can

More information

Open Problem: A (missing) boosting-type convergence result for ADABOOST.MH with factorized multi-class classifiers

Open Problem: A (missing) boosting-type convergence result for ADABOOST.MH with factorized multi-class classifiers JMLR: Workshop and Conference Proceedings vol 35:1 8, 014 Open Problem: A (missing) boosting-type convergence result for ADABOOST.MH with factorized multi-class classifiers Balázs Kégl LAL/LRI, University

More information

Bayesian Decision Theory

Bayesian Decision Theory Introduction to Pattern Recognition [ Part 4 ] Mahdi Vasighi Remarks It is quite common to assume that the data in each class are adequately described by a Gaussian distribution. Bayesian classifier is

More information

Midterm, Fall 2003

Midterm, Fall 2003 5-78 Midterm, Fall 2003 YOUR ANDREW USERID IN CAPITAL LETTERS: YOUR NAME: There are 9 questions. The ninth may be more time-consuming and is worth only three points, so do not attempt 9 unless you are

More information

Lecture 3: Decision Trees

Lecture 3: Decision Trees Lecture 3: Decision Trees Cognitive Systems - Machine Learning Part I: Basic Approaches of Concept Learning ID3, Information Gain, Overfitting, Pruning last change November 26, 2014 Ute Schmid (CogSys,

More information

CS6375: Machine Learning Gautam Kunapuli. Decision Trees

CS6375: Machine Learning Gautam Kunapuli. Decision Trees Gautam Kunapuli Example: Restaurant Recommendation Example: Develop a model to recommend restaurants to users depending on their past dining experiences. Here, the features are cost (x ) and the user s

More information

10-701/ Machine Learning - Midterm Exam, Fall 2010

10-701/ Machine Learning - Midterm Exam, Fall 2010 10-701/15-781 Machine Learning - Midterm Exam, Fall 2010 Aarti Singh Carnegie Mellon University 1. Personal info: Name: Andrew account: E-mail address: 2. There should be 15 numbered pages in this exam

More information

Midterm. Introduction to Machine Learning. CS 189 Spring Please do not open the exam before you are instructed to do so.

Midterm. Introduction to Machine Learning. CS 189 Spring Please do not open the exam before you are instructed to do so. CS 89 Spring 07 Introduction to Machine Learning Midterm Please do not open the exam before you are instructed to do so. The exam is closed book, closed notes except your one-page cheat sheet. Electronic

More information

Performance Evaluation

Performance Evaluation Performance Evaluation David S. Rosenberg Bloomberg ML EDU October 26, 2017 David S. Rosenberg (Bloomberg ML EDU) October 26, 2017 1 / 36 Baseline Models David S. Rosenberg (Bloomberg ML EDU) October 26,

More information

CSCI-567: Machine Learning (Spring 2019)

CSCI-567: Machine Learning (Spring 2019) CSCI-567: Machine Learning (Spring 2019) Prof. Victor Adamchik U of Southern California Mar. 19, 2019 March 19, 2019 1 / 43 Administration March 19, 2019 2 / 43 Administration TA3 is due this week March

More information