Application of machine learning in manufacturing industry

Application of machine learning in manufacturing industry

MSc Degree Thesis

Written by: Hsinyi Lin
Master of Science in Mathematics

Supervisor: Lukács András
Institute of Mathematics

Eötvös Loránd University
Faculty of Science

Budapest, 2017

ABSTRACT

In the high-tech manufacturing industry, a large amount of data is collected from the production line every day. These data can be measurements, categorical attributes, or time-related values. Before a product is delivered to the customer, it has to pass a quality examination. The task in this thesis is to combine a given training set with different boosting algorithms available in Python packages to build a model that predicts the quality test result. Building such a model amounts to solving a highly imbalanced binary classification problem, since for a mature product the number of Fail cases is very small compared to the number of Pass cases. Boosting is commonly used for this type of problem in the Kaggle community; its general idea is to combine many weak learners (base classifiers) sequentially into a strong learner. In practice the model is trained under certain conditions, which means some parameters are fixed, so an understanding of each parameter is needed to find the optimal ones. To optimize the model, we minimize an objective function that depends on the particular algorithm. AdaBoost minimizes an upper bound of the empirical error. LogitBoost minimizes a weighted least-squares regression of residuals updated by Newton's method. Gradient boosting minimizes a user-defined differentiable convex loss function that represents the dissimilarity between the real class and the prediction. In addition to the loss function, XGB takes a regularization term into account and uses a second-order Taylor expansion to approximate the loss function, which helps to prevent over-fitting. Solving the problem with the given training set includes two steps: (1) use line search to find the optimal parameters; (2) find the optimal threshold that decides the classification result. The purpose of the thesis is to demonstrate this procedure, and the computational capacity of the laptop is limited, so only a very small part of the original data is taken as the training set; the resulting model, trained with less information, is not as competitive as the higher-ranked models in the competition. While doing a line search to optimize a specific parameter, it may not be easy to find a proper range of values, so it is important to rely on past empirical experience. In this thesis several models with different objective functions and different training sets are built, but no single method is always the best: in Kaggle competitions, most participants win by combining many methods rather than using a single one.

Contents

Abstract
Table of contents
List of figures
List of tables
1 Introduction
  1.1 Motivation
  1.2 Outline
2 Boosting Methods
  2.1 Binary classification
  2.2 General idea
  2.3 Adaptive boosting
    2.3.1 Algorithm
    2.3.2 Derivation of each step
  2.4 Logit boosting
    2.4.1 Algorithm
    2.4.2 Derivation of each step
  2.5 Gradient boosting
    2.5.1 Algorithm
    2.5.2 Derivation
  2.6 XGB - Extreme Gradient Boosting
    2.6.1 Algorithm
    2.6.2 Derivation
3 Measurement and Experiment
  3.1 Measurement of goodness of the classifier
    3.1.1 ROC curve
    3.1.2 Sigmoid function
  3.2 Data preparation
  3.3 Implementation
    3.3.1 Optimization of parameters
    3.3.2 Optimal threshold
    3.3.3 Regularization and over-fitting
    3.3.4 Model with different training set and algorithm
  3.4 Summary
Bibliography

List of Figures

1  Receiver operating characteristics curve
2  Sigmoid function maps the score to [0,1]
3  Partial process flow
4  Default values: M = 100, λ = 1, γ = 0, objective = logistic, ν = 0.1
5  max_depth = 2, and default values λ = 1, γ = 0, objective = logistic, ν = 0.1
6  M = 103, max_depth = 2, λ = 1, γ = 0, objective = logistic
7  M = 103, max_depth = 2, λ = 1, objective = logistic
8  M = 103, max_depth = 2, λ = 1, objective = logistic
9  M = 103, max_depth = 2, λ = 1, objective = logistic
10 M = 103, max_depth = 2, γ = 0, objective = logistic
11 ROC curve under the optimal condition
12 Confusion matrix with the threshold equal to 0.0644
13 XGB with regularization, λ = 1, γ = 0.1
14 XGB without regularization, λ = 0, γ = 0
15 Overfitting: XGB with regularization improved the overfitting problem

List of Tables

1 Summary of the different loss functions for gradient boosting
2 Summary of the different loss functions for XGB
3 Confusion matrix obtained under a certain threshold
4 Size of the training sets
5 Optimal conditions for the different training sets using the AdaBoost algorithms
6 Optimal conditions for the different training sets using gradient boosting
7 Optimal conditions for the different training sets using XGB

1 Introduction

1.1 Motivation

A few years of working experience in the high-tech manufacturing industry made me realize how fast data are generated on a production line. The limited capacity of the repository forces old data to be purged while new data are being generated, and it is a tough task to analyse all the collected data and extract useful information from such a big data set with traditional statistical methods before the data are purged. In order to use the collected data effectively, more and more companies have started to introduce machine learning concepts into their analytical tools. Recently, many companies have started to provide their data and hold competitions on Kaggle, the largest community platform for people working in fields related to machine learning [1]. By holding competitions, they can collect ideas and concepts from the innovative and efficient methods proposed by participants from all over the world.

In the manufacturing industry, one of the typical questions is how to predict quality test results based on past data. The quality test result decides whether the company can gain profit from the product, because every product needs to pass all quality tests before being delivered to customers. If the result can be predicted precisely from certain given features by a trained model, the company can save cost: if the predicted result is Fail, the product can simply be scrapped and the remaining process steps stopped, because of the high risk that it would be scrapped anyway after finishing the whole procedure. Such a model also shows which features are important for the prediction, so we can focus on the yield improvement of certain process steps.

To build models for this binary classification problem, we introduce boosting methods, which are the most commonly used approaches in the Kaggle community; most participants won competitions by using the idea of boosting to construct their models. Many kinds of boosting methods have been proposed, such as AdaBoost, LogitBoost, gradient boosting and extreme gradient boosting, and several of them are already available in Python packages.

1.2 Outline

This thesis is organized in the following way. In Chapter 2, the general idea of boosting and four boosting methods, namely AdaBoost, LogitBoost, gradient boosting and extreme gradient boosting, are introduced. Each method is presented with its algorithm, the underlying concept and the derivation. The idea of the loss function and of the regularization term that appears in extreme gradient boosting is also shown. In Chapter 3, the ROC curve and the sigmoid function are first introduced for evaluating the goodness of the model, and the data preparation and the implementation environment are presented before the programming part. The demonstration of solving the highly imbalanced classification problem with extreme gradient boosting in Python consists of two steps: (1) use a line-search strategy to find the optimal condition; (2) find the threshold that maximizes the Matthews correlation coefficient. After demonstrating the procedure, I show the improvement of over-fitting when the regularization is taken into account. At the end of the chapter, I present the results of models trained by the different boosting algorithms on different training sets and give a conclusion for Chapter 3.

2 Boosting Methods

2.1 Binary classification

Classification is one of the supervised learning problems in machine learning. The task is to find a model $F$ that fits a given labeled training set $\{(x_i, y_i)\}_{i=1}^N$ and to use it to predict a set with unknown class labels. The labeled training set consists of features $x_i$ and class labels $y_i$. The problem of predicting quality test results can be regarded as a highly imbalanced binary classification problem: the prediction has only two possible outcomes, Pass and Fail, and the class distribution is far from uniform because of the big disparity between the two classes. In practical applications the labels of the two classes are usually encoded as $-1/1$ or $0/1$, depending on the method being used. In the manufacturing industry, a feature can be any value measured during the process or any time value. To formalize the problem, we denote the model $F : x \mapsto F(x)$, where $F(x)$ is the prediction given the features $x$. Finding the optimal $F$ is equivalent to finding the model that minimizes the dissimilarity between $y_i$ and $F(x_i)$.

2.2 General idea

The general idea of boosting methods [2] [3] [4] is to sequentially combine many weak learners (base classifiers) $f_m$, each predicting only slightly better than random guessing (flipping a fair coin), into a strong classifier $F$. Most boosting methods use CART (classification and regression trees) as base classifiers. In general, finding the optimal classifier in the $m$-th iteration is equivalent to finding an $f_m$ that minimizes a given objective function. In AdaBoost the goal is to minimize an upper bound of the empirical error of the ensemble classifier [5]; since the empirical error is bounded by the exponential loss, this is the same as taking the exponential loss as the objective function. LogitBoost minimizes a weighted least-squares fit of residuals estimated by a Newton update. The goal of gradient boosting is to minimize a loss function, which can be interpreted as the dissimilarity between the real class label $y$ and the predicted value $F(x)$; the loss can be any differentiable convex function. Unlike gradient boosting, the objective function of extreme gradient boosting has one more term, a regularization term, which prevents over-fitting effectively.

2.3 Adaptive boosting

AdaBoost was the most popular boosting method until extreme gradient boosting appeared. It combines many weak classifiers, trained sequentially on weighted training sets, into a strong classifier. Initially, every sample in the training set is given an equal weight. In each stage, we increase the weights of the samples misclassified by the current ensemble and obtain a new weighted training set for the next weak learner. This emphasizes the misclassified instances, so that the next weak classifier focuses on fitting them. Because the weights sum to one, the weights of the correctly classified samples decrease at the same time. In practice, using stumps (the simplest decision trees, with one node and two leaves) as weak classifiers already gives quite good accuracy and efficiency. The most commonly used version is discrete AdaBoost, first introduced by Freund and Schapire (1996b); it was developed to solve binary classification problems with class labels $y_i \in \{-1, 1\}$. Real AdaBoost came two years later.

2.3.1 Algorithm

The following two algorithms are variants of AdaBoost. Algorithm 1 is the discrete version presented in 1996 by Freund and Schapire, and Algorithm 2 is real AdaBoost. They differ slightly in how the value added at the $m$-th iteration is computed: discrete AdaBoost uses a weak classifier $f_m \in \{-1, 1\}$ and estimates its coefficient from the weighted training error, while real AdaBoost uses a weak classifier $f_m$ that returns a real number estimated from the class probability. Both algorithms increase the weights of the misclassified samples by multiplying with an exponential factor. The detailed explanation is given in the derivation part.

Algorithm 1 Discrete AdaBoost
1: Initialize $w_i \leftarrow 1/N$, $i = 1, 2, \dots, N$  (equal weight for each sample)
2: for $m = 1, \dots, M$ do
3:   Fit a classifier $f_m(x) \in \{-1, 1\}$ with minimum weighted training error
4:   Compute $\varepsilon_m = E_w\left[\mathbb{1}(y \neq f_m(x))\right]$  (weighted training error of $f_m$)
5:   $c_m = \log\dfrac{1 - \varepsilon_m}{\varepsilon_m}$  (coefficient of the current weak learner)
6:   $w_i \leftarrow w_i \, e^{c_m \mathbb{1}(y_i \neq f_m(x_i))}$, $i = 1, 2, \dots, N$  (increase the weights of the misclassified samples)
7:   Re-normalize such that $\sum_{i=1}^N w_i = 1$
8: end for
9: return $\operatorname{sign}[F(x)]$, where $F(x) = \sum_{m=1}^M c_m f_m(x)$  (sign of the ensemble score)

$w_i$: weight of sample $i$; $f_m$: optimal classifier of the $m$-th iteration; $\varepsilon_m$: weighted training error of $f_m$; $c_m$: coefficient of $f_m$ in the ensemble; $F$: final ensemble classifier.

Algorithm 2 Real AdaBoost
1: Initialize $w_i \leftarrow 1/N$, $i = 1, 2, \dots, N$  (equal weight for each sample)
2: for $m = 1, \dots, M$ do
3:   Find the optimal partition and obtain the class probability $p_m(y = 1 \mid x)$  (train a classifier with minimum weighted training error)
4:   Compute $f_m(x) \leftarrow \dfrac{1}{2} \ln \dfrac{p_m(y = 1 \mid x)}{1 - p_m(y = 1 \mid x)}$  (score of each leaf)
5:   $w_i \leftarrow w_i \, e^{-y_i f_m(x_i)}$, $i = 1, 2, \dots, N$  (update the weight of each sample)
6:   Re-normalize such that $\sum_{i=1}^N w_i = 1$
7: end for
8: return $\operatorname{sign}[F(x)]$, where $F(x) = \sum_{m=1}^M f_m(x)$  (sign of the ensemble score)

$w_i$: weight of sample $i$; $p_m(y = 1 \mid x)$: class probability of each leaf at the $m$-th iteration; $f_m$: classifier with the optimal score at the $m$-th iteration; $F$: final ensemble classifier.

2.3.2 Derivation of each step

Firstly, every sample gets the equal weight $w_1(i) = 1/N$, where $N$ is the number of samples. Then comes the iteration step $m$. Discrete AdaBoost has three steps at the $m$-th stage: (1) find an optimal classifier $f_m$ with minimal weighted training error; (2) find the optimal coefficient $c_m$ of the weak classifier $f_m$; (3) update the weights for the next iteration.

In practice we usually find a classifier $f_m(x)$ with minimum weighted training error under a certain given condition, which can itself be optimized by a line search. For example, the most popular choice is to use a stump, the simplest decision tree, as the base classifier, because of its efficiency and the accuracy of the final classifier:

$$f_m(x) = \arg\min_{f_m} E_w\left[\mathbb{1}(y \neq f_m(x))\right] = \arg\min_{f_m} \sum_{i=1}^N w_m(i)\, \mathbb{1}(y_i \neq f_m(x_i)).$$

Next we find the optimal $c_m$, which minimizes the empirical training error $\frac{1}{N}\sum_{i=1}^N \mathbb{1}(y_i F(x_i) < 0)$ while $f_m$ is known. To minimize the empirical training error, we minimize its upper bound: since $\mathbb{1}(y_i F(x_i) < 0) \le e^{-k\, y_i F(x_i)}$ for any $k > 0$, the objective function becomes $\frac{1}{N}\sum_{i=1}^N e^{-k\, y_i F(x_i)}$.

According to the algorithm, the weights of the misclassified samples are increased; in fact the weights of the correctly classified samples are also reduced because of the re-normalization. The updated weight for the next iteration can be expressed as

$$w_{m+1}(i) = \frac{w_m(i)\, e^{c_m \mathbb{1}(y_i \neq f_m(x_i))}}{\sum_{j=1}^N w_m(j)\, e^{c_m \mathbb{1}(y_j \neq f_m(x_j))}}. \qquad (1)$$

The denominator is the normalization term and it can be simplified to $(1 - \varepsilon_m) + \varepsilon_m e^{c_m}$:

$$\sum_{i=1}^N w_m(i)\, e^{c_m \mathbb{1}(y_i \neq f_m(x_i))} = \sum_{y_i = f_m(x_i)} w_m(i) + e^{c_m} \sum_{y_i \neq f_m(x_i)} w_m(i) = (1 - \varepsilon_m) + \varepsilon_m e^{c_m},$$

where the weighted training error is $\varepsilon_m = \sum_{y_i \neq f_m(x_i)} w_m(i)$.

When $y_i = f_m(x_i)$, which is the same as $y_i f_m(x_i) = 1$,

$$w_{m+1}(i) = \frac{w_m(i)}{(1 - \varepsilon_m) + \varepsilon_m e^{c_m}} = \frac{w_m(i)\, e^{-c_m/2}}{(1 - \varepsilon_m)\, e^{-c_m/2} + \varepsilon_m\, e^{c_m/2}}. \qquad (2)$$

Otherwise, $y_i \neq f_m(x_i)$ (i.e. $y_i f_m(x_i) = -1$),

$$w_{m+1}(i) = \frac{w_m(i)\, e^{c_m}}{(1 - \varepsilon_m) + \varepsilon_m e^{c_m}} = \frac{w_m(i)\, e^{c_m/2}}{(1 - \varepsilon_m)\, e^{-c_m/2} + \varepsilon_m\, e^{c_m/2}}. \qquad (3)$$

Equations (2) and (3) can be combined into a single equation,

$$w_{m+1}(i) = \frac{w_m(i)\, e^{-c_m y_i f_m(x_i)/2}}{Z_m}, \qquad (4)$$

where $Z_m = (1 - \varepsilon_m)\, e^{-c_m/2} + \varepsilon_m\, e^{c_m/2}$. From equation (4) we get

$$e^{-c_m y_i f_m(x_i)/2} = \frac{w_{m+1}(i)}{w_m(i)}\, Z_m. \qquad (5)$$

By (5), choosing $k = \tfrac{1}{2}$, the upper bound of the empirical error can be written as

$$\frac{1}{N} \sum_{i=1}^N e^{-y_i F(x_i)/2} = \prod_{m=1}^M Z_m. \qquad (6)$$

The derivation is the following:

$$\frac{1}{N} \sum_{i=1}^N e^{-y_i F(x_i)/2} = \frac{1}{N} \sum_{i=1}^N e^{-y_i \sum_{m=1}^M c_m f_m(x_i)/2} = \frac{1}{N} \sum_{i=1}^N \prod_{m=1}^M e^{-y_i c_m f_m(x_i)/2} = \frac{1}{N} \sum_{i=1}^N \prod_{m=1}^M \frac{w_{m+1}(i)}{w_m(i)}\, Z_m = \prod_{m=1}^M Z_m,$$

using $\sum_{i=1}^N w_{M+1}(i) = 1$ and $w_1(i) = 1/N$. Thus, by equation (6), minimizing the upper bound of the empirical error is equivalent to minimizing $Z_m$ at each iteration. The minimum of $Z_m$ occurs when $\partial Z_m / \partial c_m = 0$:

$$\frac{\partial Z_m}{\partial c_m} = -\frac{1}{2}(1 - \varepsilon_m)\, e^{-c_m/2} + \frac{1}{2}\varepsilon_m\, e^{c_m/2} = 0 \;\Rightarrow\; e^{c_m} = \frac{1 - \varepsilon_m}{\varepsilon_m},$$

so that

$$c_m = \log\frac{1 - \varepsilon_m}{\varepsilon_m}. \qquad (7)$$

With this $c_m$,

$$Z_m = (1 - \varepsilon_m)\, e^{-\frac{1}{2}\log\frac{1-\varepsilon_m}{\varepsilon_m}} + \varepsilon_m\, e^{\frac{1}{2}\log\frac{1-\varepsilon_m}{\varepsilon_m}} = (1 - \varepsilon_m)\sqrt{\frac{\varepsilon_m}{1 - \varepsilon_m}} + \varepsilon_m\sqrt{\frac{1 - \varepsilon_m}{\varepsilon_m}} = 2\sqrt{\varepsilon_m(1 - \varepsilon_m)}.$$

Thus the empirical error is bounded:

$$\frac{1}{N} \sum_{i=1}^N \mathbb{1}(y_i F(x_i) < 0) \le \prod_{m=1}^M 2\sqrt{\varepsilon_m(1 - \varepsilon_m)}. \qquad (8)$$

This result can be interpreted as follows. When $c_m = 0$, the classifier $f_m$ is the same as random guessing, because its weighted training error $\varepsilon_m$ is $0.5$, so it is useless for the final prediction. $c_m > 0$ means $f_m$ has more chance to predict correctly ($\varepsilon_m < 0.5$), so we can multiply it by a positive coefficient to emphasize its positive influence on the ensemble. If $c_m < 0$, then, not surprisingly, $f_m$ is more likely to predict wrongly ($\varepsilon_m > 0.5$); we reverse its prediction and it still helps the ensemble.

In real AdaBoost, $f_m$ is a real number rather than a value in $\{-1, 1\}$. To find the optimal value of $f_m(x)$, note that the weighted training error satisfies $E_w[\mathbb{1}(y f_m(x) < 0)] \le E_w[e^{-y f_m(x)}]$, so minimizing the weighted training error is equivalent to minimizing its upper bound $E_w[e^{-y f_m(x)}]$:

$$E_w\left[e^{-y f_m(x)}\right] = P_w(y = 1 \mid x)\, e^{-f_m(x)} + P_w(y = -1 \mid x)\, e^{f_m(x)},$$

$$\frac{\partial E_w\left[e^{-y f_m(x)}\right]}{\partial f_m(x)} = -P_w(y = 1 \mid x)\, e^{-f_m(x)} + P_w(y = -1 \mid x)\, e^{f_m(x)} = 0,$$

$$f_m(x) = \frac{1}{2} \log\frac{P_w(y = 1 \mid x)}{P_w(y = -1 \mid x)}.$$

Hence the optimal base classifier $f_m$ is half of the logarithm of the ratio of the two class probabilities. Lastly, both algorithms output the sign of the ensemble value.
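To make Algorithm 1 concrete, the following is a minimal from-scratch sketch of discrete AdaBoost with decision stumps, built on scikit-learn's DecisionTreeClassifier. It only illustrates the update rules derived above; the function and variable names are my own, and it is not the AdaBoostClassifier implementation used later in the experiments.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def discrete_adaboost(X, y, M=100):
    """Sketch of Algorithm 1; y must be encoded as -1/+1."""
    N = len(y)
    w = np.full(N, 1.0 / N)                           # equal initial weights
    stumps, coefs = [], []
    for m in range(M):
        stump = DecisionTreeClassifier(max_depth=1)   # weak learner: a stump
        stump.fit(X, y, sample_weight=w)
        miss = stump.predict(X) != y
        eps = np.clip(np.dot(w, miss), 1e-10, 1 - 1e-10)  # weighted training error
        c = np.log((1 - eps) / eps)                   # coefficient c_m, eq. (7)
        w = w * np.exp(c * miss)                      # boost the weights of misclassified samples
        w = w / w.sum()                               # re-normalize so the weights sum to one
        stumps.append(stump)
        coefs.append(c)

    def predict(X_new):
        F = sum(c * s.predict(X_new) for c, s in zip(coefs, stumps))
        return np.sign(F)                             # sign of the ensemble score
    return predict
```

The returned predict function evaluates sign[F(x)] with F(x) = Σ_m c_m f_m(x), exactly the last line of Algorithm 1.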

2.4 Logit boosting

LogitBoost is a method used only for binary classification problems with class labels $y^* \in \{0, 1\}$, while the other methods use class labels $y \in \{-1, 1\}$. It fits an additive symmetric logistic likelihood function by an adaptive Newton approach. At every iteration, a working response $z_i$, $i = 1, 2, \dots, N$, is calculated for each sample from its label $y_i^*$ and its estimated class probability. The working response can be regarded as an approximate residual between the real class label and the ensemble value; its formula is derived using Newton's method. The optimal regression tree $f_m$ is obtained by minimizing the weighted least-squares regression of the $z_i$. In the last step, the algorithm outputs the sign of the ensemble value.

2.4.1 Algorithm

Algorithm 3 LogitBoost
1: Initialize $w_i \leftarrow 1/N$, $p_0(x_i) = \tfrac{1}{2}$, $i = 1, 2, \dots, N$, and $F_0(x_i) = 0$  (equal weights and the score of a single-node tree)
2: for $m = 1, \dots, M$ do
3:   $z_m(i) = \dfrac{y_i^* - p_{m-1}(x_i)}{p_{m-1}(x_i)\,(1 - p_{m-1}(x_i))}$  (compute the working response)
4:   $w_m(i) = p_{m-1}(x_i)\,(1 - p_{m-1}(x_i))$  (update the weight of each instance)
5:   Fit the function $f_m(x)$ by minimizing $E_w\left[(f_m(x) - z)^2\right]$, a weighted least-squares regression of the $z_m(i)$ on the $x_i$ with weights $w_m(i)$
6:   $F_m(x) = F_{m-1}(x) + \tfrac{1}{2} f_m(x)$, and $p_m(x) = \dfrac{e^{F_m(x)}}{e^{F_m(x)} + e^{-F_m(x)}}$  (estimate the class probability for the next working response)
7: end for
8: return $\operatorname{sign}[F(x)]$  (sign of the ensemble score)

$w_m(i)$: weight of sample $i$, updated from the previous class probability; $z_m(i)$: Newton update, or estimated residual; $f_m$: optimal regressor minimizing the weighted least-squares error of the $z_i$; $F$: final ensemble classifier.

2.4.2 Derivation of each step

Firstly, every training instance gets the equal weight $1/N$, and we initialize $F(x) = 0$ and $p(x_i) = \tfrac{1}{2}$, which means the probability of $y^* = 1$ for every instance is the same as random guessing. In the $m$-th iteration the goal is to fit $f_m$ by minimizing the weighted least-squares regression of the $z_i$; $F(x)$ is gradually improved by adding $f(x)$ after each iteration. Here $z_m(i)$ is regarded as a residual (Newton update) with respect to the previously estimated probability of $y^* = 1$ given $x = x_i$. The questions we care about most are why $z_m(i) = \dfrac{y_i^* - p_{m-1}(x_i)}{(1 - p_{m-1}(x_i))\, p_{m-1}(x_i)}$ and how to find an optimal $f(x)$ that maximizes the logistic likelihood in the next step.

The logistic likelihood function is

$$l(p(x)) = y^* \log p(x) + (1 - y^*) \log(1 - p(x)),$$

where

$$p(x) = \frac{e^{F(x)}}{e^{F(x)} + e^{-F(x)}} = \frac{1}{e^{-2F(x)} + 1}. \qquad (9)$$

Therefore

$$l(p(x)) = l(F(x)) = y^* \log\frac{e^{F(x)}}{e^{F(x)} + e^{-F(x)}} + (1 - y^*) \log\frac{e^{-F(x)}}{e^{F(x)} + e^{-F(x)}} = 2 y^* F(x) - \log\left(1 + e^{2F(x)}\right).$$

To find an optimal $f(x)$ for the current iteration, we need to maximize the expected logistic likelihood, i.e.

$$f(x) = \arg\max_f E\left[l(F(x) + f(x))\right], \qquad \frac{\partial E\left[l(F(x) + f(x))\right]}{\partial f(x)} = 0.$$

Here we denote

$$g(F(x) + f(x)) = \frac{\partial E\left[l(F(x) + f(x))\right]}{\partial f(x)} = E\left[2y^* - \frac{2}{1 + e^{-2(F(x) + f(x))}}\right], \qquad (10)$$

$$h(F(x) + f(x)) = \frac{\partial^2 E\left[l(F(x) + f(x))\right]}{\partial f(x)^2} = E\left[-\frac{4\, e^{-2(F(x) + f(x))}}{\left(1 + e^{-2(F(x) + f(x))}\right)^2}\right]. \qquad (11)$$

Thus the problem can be simplified to solving $g(F + f) = 0$. LogitBoost uses Newton's method to find an approximate solution. Newton's method finds an approximate root $r$ of a function $q$ such that $q(r) \approx 0$. By the Taylor expansion, $q(r) = q(r_0) + q'(r_0)(r - r_0) + O((r - r_0)^2)$. It may not be easy to find the exact root of $q(r) = 0$, but we can find an $r$ such that $|q(r)| < |q(r_0)|$, which means $r$ is closer to the root than $r_0$, and repeat this approach until $q(r) \approx 0$. If $r_0$ is already a point near the root and $q(r) \approx q(r_0) + q'(r_0)(r - r_0) \approx 0$, then

$$r \approx r_0 - \frac{q(r_0)}{q'(r_0)}, \qquad (12)$$

and $r$ is a better approximation than $r_0$. To solve $g(F + f) = 0$, we take $r_0 = F(x)$ and $r = F(x) + f(x)$. By (10), (11) and (12),

$$F(x) + f(x) \approx F(x) - \left.\frac{g(F(x) + f(x))}{h(F(x) + f(x))}\right|_{f(x) = 0} = F(x) - \frac{g(F(x))}{h(F(x))}.$$

Using (9),

$$-\frac{g(F(x))}{h(F(x))} = \frac{E\left[2y^* - \dfrac{2}{1 + e^{-2F(x)}} \,\middle|\, x\right]}{E\left[\dfrac{4\, e^{-2F(x)}}{\left(1 + e^{-2F(x)}\right)^2} \,\middle|\, x\right]} = \frac{E\left[y^* - \dfrac{1}{1 + e^{-2F(x)}} \,\middle|\, x\right]}{2\, E\left[\dfrac{e^{-2F(x)}}{1 + e^{-2F(x)}} \cdot \dfrac{1}{1 + e^{-2F(x)}} \,\middle|\, x\right]}, \qquad (13)$$

and (13) becomes

$$-\frac{g(F(x))}{h(F(x))} = \frac{1}{2}\, E\left[\frac{y^* - p(x)}{(1 - p(x))\, p(x)} \,\middle|\, x\right]. \qquad (14)$$

Hence

$$F(x) + f(x) \approx F(x) + \frac{1}{2}\, E\left[\frac{y^* - p(x)}{(1 - p(x))\, p(x)} \,\middle|\, x\right], \qquad (15)$$

$$f(x) = \arg\min_f E_w\left[\left(f(x) - \frac{1}{2} \cdot \frac{y^* - p(x)}{(1 - p(x))\, p(x)}\right)^2\right]. \qquad (16)$$

Thus we denote $z_m(i) = \dfrac{y_i^* - p_{m-1}(x_i)}{(1 - p_{m-1}(x_i))\, p_{m-1}(x_i)}$ and minimize the weighted least-squares error of $z$ to find the optimal $f_m(x)$.
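Since scikit-learn does not ship a LogitBoost estimator, the following is only a minimal sketch of Algorithm 3 using a regression tree as the weighted least-squares learner; the labels are assumed to be encoded as 0/1, the clipping constants are a common numerical safeguard rather than part of the derivation, and the names are my own.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def logitboost(X, y01, M=100, max_depth=2):
    """Sketch of Algorithm 3; y01 must be encoded as 0/1."""
    N = len(y01)
    F = np.zeros(N)                              # F_0(x) = 0
    p = np.full(N, 0.5)                          # p_0(x) = 1/2
    trees = []
    for m in range(M):
        w = np.clip(p * (1.0 - p), 1e-10, None)  # weights w_m(i) = p(1 - p)
        z = np.clip((y01 - p) / w, -4.0, 4.0)    # working response z_m(i), clipped for stability
        tree = DecisionTreeRegressor(max_depth=max_depth)
        tree.fit(X, z, sample_weight=w)          # weighted least-squares fit of z
        F = F + 0.5 * tree.predict(X)            # F_m = F_{m-1} + f_m / 2
        p = 1.0 / (1.0 + np.exp(-2.0 * F))       # p_m(x) = e^F / (e^F + e^{-F})
        trees.append(tree)

    def predict(X_new):
        F_new = 0.5 * sum(t.predict(X_new) for t in trees)
        return (F_new > 0).astype(int)           # sign of the ensemble score, reported as 0/1
    return predict
```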

2.5 Gradient boosting

Gradient boosting [6] [8] is a method based on the idea of function estimation. The goal is to find an additive function that fits the training data. In gradient boosting the objective function is a loss function, which can be interpreted as the dissimilarity between the real class label and the predicted value of the additive function. In every iteration, the negative gradient of the loss function is used to approximate the residual of the previous iteration, and the goal is to find a regression tree (or classifier) $f$ and the score (function value) of each of its leaves that minimize the sum over all instances of the loss $\psi(y, F_{m-1}(x) + f_m(x))$. Because the gradient has to be computed, the loss function is required to be a differentiable convex function. In a binary classification problem with class labels $y \in \{-1, 1\}$, the sign of the final ensemble score is output as the prediction.

2.5.1 Algorithm

Algorithm 4 Gradient boosting
1: $F_0(x) = \arg\min_\gamma \sum_{i=1}^N \psi(y_i, \gamma)$  (initial value of the ensemble)
2: for $m = 1, \dots, M$ do
3:   $\tilde{y}_{im} = -\left[\dfrac{\partial \psi(y_i, F(x_i))}{\partial F(x_i)}\right]_{F(x_i) = F_{m-1}(x_i)}$  (negative gradient of the loss function)
4:   Train a tree on $\{(x_i, \tilde{y}_{im})\}_{i=1}^N$, giving $L$ disjoint regions $\{R_{lm}\}_{l=1}^L$  (the negative gradients are used as the new training labels)
5:   $\gamma_{lm} = \arg\min_\gamma \sum_{x_i \in R_{lm}} \psi(y_i, F_{m-1}(x_i) + \gamma)$  (optimal score of each leaf)
6:   $F_m(x) = F_{m-1}(x) + \nu \sum_{l=1}^L \gamma_{lm}\, \mathbb{1}(x \in R_{lm})$  (add the base classifier to the additive model)
7: end for

$F_0$: initial value of the ensemble classifier; $\psi$: loss function; $\tilde{y}_{im}$: estimated residual of sample $i$ at the $m$-th iteration; $R_{lm}$: the samples classified to the leaf with index $l$; $\gamma_{lm}$: optimal score of leaf $l$ under the $L$-region partition; $\nu$: learning rate; $F_m$: ensemble score after the $m$-th iteration.

2.5.2 Derivation

In the gradient boosting algorithm the user can choose an arbitrary differentiable convex function as the loss. Firstly, an initial value is given to the ensemble classifier; it depends on the loss function being used. The initial classifier can be regarded as a tree with a single node, i.e. all samples are classified to that node, and the initial value $F_0$ is the optimal score of the single node. Denote the score of the single-node tree by $\gamma$. The minimum of

$$F_0(x) = \arg\min_\gamma \sum_{i=1}^N \psi(y_i, \gamma) \qquad (17)$$

occurs at

$$\frac{\partial \sum_{i=1}^N \psi(y_i, \gamma)}{\partial \gamma} = 0. \qquad (18)$$

By equations (17) and (18), $F_0$ can be derived for the different loss functions as follows.

Least-squares loss $\tfrac{1}{2}(y - F(x))^2$:

$$\frac{\partial \sum_{i=1}^N \tfrac{1}{2}(y_i - \gamma)^2}{\partial \gamma} = N\gamma - \sum_{i=1}^N y_i = 0 \;\Rightarrow\; F_0(x) = \frac{\sum_{i=1}^N y_i}{N},$$

so $F_0$ is the average of all $y_i$ in the training set.

Exponential loss $e^{-yF(x)}$:

$$\frac{\partial \sum_{i=1}^N e^{-y_i\gamma}}{\partial \gamma} = -\sum_{i=1}^N y_i\, e^{-y_i\gamma} = 0 \;\Rightarrow\; -e^{-\gamma} N P(y = 1) + e^{\gamma} N P(y = -1) = 0 \;\Rightarrow\; F_0(x) = \gamma_0 = \frac{1}{2}\log\frac{P(y = 1)}{P(y = -1)}.$$

Logistic loss $\log(e^{-2yF(x)} + 1)$:

$$\frac{\partial \sum_{i=1}^N \log(1 + e^{-2y_i\gamma})}{\partial \gamma} = -\sum_{i=1}^N \frac{2y_i}{1 + e^{2y_i\gamma}} = -2N\left[\frac{P(y = 1)}{1 + e^{2\gamma}} - \frac{P(y = -1)}{1 + e^{-2\gamma}}\right] = 0 \;\Rightarrow\; F_0(x) = \frac{1}{2}\log\frac{P(y = 1)}{P(y = -1)}.$$

For the exponential and logistic losses, $F_0$ is half of the logarithm of the ratio of the two class probabilities.

Secondly, the $m$-th iteration consists of four steps: (1) calculate the negative gradient $\tilde{y}_{im}$ of the loss function to approximate the residual of each sample; (2) train a classifier $f_m$ that partitions the features $\{x_i\}_{i=1}^N$ into $L$ disjoint regions $R_{lm}$, $l = 1, \dots, L$, using the training set $\{(x_i, \tilde{y}_{im})\}$; (3) find the optimal score $\gamma_{lm}$ of each leaf $l$, which minimizes the loss function of that leaf; (4) update the ensemble score.

(1) Calculate $\tilde{y}_{im}$ for the different loss functions:

$$\tilde{y}_{im} = -\left[\frac{\partial \psi(y_i, F(x_i))}{\partial F(x_i)}\right]_{F(x_i) = F_{m-1}(x_i)}. \qquad (19)$$

Least-squares loss:

$$\tilde{y}_{im} = -\left[\frac{\partial\, \tfrac{1}{2}(y_i - F(x_i))^2}{\partial F(x_i)}\right]_{F(x_i) = F_{m-1}(x_i)} = y_i - F_{m-1}(x_i). \qquad (20)$$

Exponential loss:

$$\tilde{y}_{im} = -\left[\frac{\partial\, e^{-y_i F(x_i)}}{\partial F(x_i)}\right]_{F(x_i) = F_{m-1}(x_i)} = y_i\, e^{-y_i F_{m-1}(x_i)}. \qquad (21)$$

Logistic loss:

$$\tilde{y}_{im} = -\left[\frac{\partial \log(1 + e^{-2y_i F(x_i)})}{\partial F(x_i)}\right]_{F(x_i) = F_{m-1}(x_i)} = \frac{2y_i}{1 + e^{2y_i F_{m-1}(x_i)}}. \qquad (22)$$

(2) Train a tree $f_m$ with $L$ leaves on the training set $\{(x_i, \tilde{y}_{im})\}$; $L$ may vary between iterations.

(3) Find the optimal score $f_m(x_i) = \gamma_{lm}$, $x_i \in R_{lm}$, of each leaf $l$, minimizing the loss of the leaf:

$$\sum_{x_i \in R_{lm}} \psi(y_i, F_m(x_i)) = \sum_{x_i \in R_{lm}} \psi(y_i, F_{m-1}(x_i) + \gamma_{lm}), \qquad (23)$$

$$\gamma_{lm} = \arg\min_\gamma \sum_{x_i \in R_{lm}} \psi(y_i, F_{m-1}(x_i) + \gamma). \qquad (24)$$

The minimum occurs where

$$\frac{\partial \sum_{x_i \in R_{lm}} \psi(y_i, F_{m-1}(x_i) + \gamma)}{\partial \gamma} = 0. \qquad (25)$$

Least-squares loss: by equation (20), $\tfrac{1}{2}(y_i - F_{m-1}(x_i) - \gamma)^2$ can be rewritten as $\tfrac{1}{2}(\gamma^2 - 2\tilde{y}_{im}\gamma + \tilde{y}_{im}^2)$, and by (24) and (25)

$$\frac{\partial \sum_{x_i \in R_{lm}} \tfrac{1}{2}\left(\gamma^2 - 2\tilde{y}_{im}\gamma + \tilde{y}_{im}^2\right)}{\partial \gamma} = N_l\,\gamma - \sum_{x_i \in R_{lm}} \tilde{y}_{im} = 0 \;\Rightarrow\; \gamma_{lm} = \frac{\sum_{x_i \in R_{lm}} \tilde{y}_{im}}{N_l},$$

where $N_l = N(R_{lm})$ is the number of samples classified to leaf $l$.

Exponential loss: by equations (21), (24) and (25),

$$\frac{\partial \sum_{x_i \in R_{lm}} e^{-y_i(F_{m-1}(x_i) + \gamma)}}{\partial \gamma} = -\sum_{x_i \in R_{lm}} y_i\, e^{-y_i F_{m-1}(x_i)}\, e^{-y_i\gamma} = -\sum_{x_i \in R_{lm}} \tilde{y}_{im}\, e^{-y_i\gamma} = 0,$$

$$e^{-\gamma} \sum_{x_i \in R_{lm}} \tilde{y}_{im}\, \mathbb{1}(y_i = 1) + e^{\gamma} \sum_{x_i \in R_{lm}} \tilde{y}_{im}\, \mathbb{1}(y_i = -1) = 0,$$

$$\gamma_{lm} = \frac{1}{2}\log\frac{\sum_{x_i \in R_{lm}} \tilde{y}_{im}\, \mathbb{1}(y_i = 1)}{-\sum_{x_i \in R_{lm}} \tilde{y}_{im}\, \mathbb{1}(y_i = -1)}$$

(note that $\tilde{y}_{im} < 0$ when $y_i = -1$, so the argument of the logarithm is positive).

Logistic loss: by equation (24) we need to solve

$$\sum_{x_i \in R_{lm}} \frac{2y_i}{1 + e^{2y_i(F_{m-1}(x_i) + \gamma)}} = 0,$$

but there is no closed form for the solution yet; in practice it is solved numerically.

(4) Add the score to the ensemble classifier, according to which leaf $x$ belongs to:

$$F_m(x) = F_{m-1}(x) + \nu \sum_{l=1}^L \gamma_{lm}\, \mathbb{1}(x \in R_{lm}).$$

Here $\nu$ is the learning rate, a coefficient applied to $\gamma_{lm}$ when it is added to the ensemble; in practical applications it is a parameter that needs to be tuned to build the model conservatively.

Lastly, the prediction output depends on the loss function. For least squares, the sign of the ensemble value is output. For the logistic and exponential losses, the sigmoid function, introduced in the next chapter, maps the ensemble value to $[0, 1]$, and the prediction is $1$ if the result is larger than $0.5$. Table 1 summarizes the results for the different loss functions.

Table 1: Summary of the different loss functions for gradient boosting

Least-squares loss $\psi(y, F) = \tfrac{1}{2}(y - F)^2$:
  $F_0(x) = \frac{1}{N}\sum_{i=1}^N y_i$,  $\tilde{y}_{im} = y_i - F_{m-1}(x_i)$,  $\gamma_{lm} = \sum_{x_i \in R_{lm}} \tilde{y}_{im} / N_l$.
Exponential loss $\psi(y, F) = e^{-yF}$:
  $F_0(x) = \frac{1}{2}\log\frac{P(y=1)}{P(y=-1)}$,  $\tilde{y}_{im} = y_i\, e^{-y_i F_{m-1}(x_i)}$,  $\gamma_{lm} = \frac{1}{2}\log\frac{\sum_{R_{lm}} \tilde{y}_{im}\mathbb{1}(y_i=1)}{-\sum_{R_{lm}} \tilde{y}_{im}\mathbb{1}(y_i=-1)}$.
Logistic loss $\psi(y, F) = \log(e^{-2yF} + 1)$:
  $F_0(x) = \frac{1}{2}\log\frac{P(y=1)}{P(y=-1)}$,  $\tilde{y}_{im} = \frac{2y_i}{1 + e^{2y_i F_{m-1}(x_i)}}$,  $\gamma_{lm}$: no closed form.
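As an illustration of Algorithm 4, here is a minimal sketch of gradient boosting with the least-squares loss, the one case where the leaf score has the simple closed form of equation (25); a squared-error regression tree already returns the leaf mean, so no explicit leaf-score step is needed. The names are my own and this is not the library implementation used in the experiments.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gradient_boost_ls(X, y, M=100, nu=0.1, max_depth=2):
    """Sketch of Algorithm 4 for the least-squares loss psi(y, F) = (y - F)^2 / 2."""
    F0 = y.mean()                                # F_0 = average of the y_i, eq. (17)-(18)
    F = np.full(len(y), F0)
    trees = []
    for m in range(M):
        residual = y - F                         # negative gradient y_i - F_{m-1}(x_i), eq. (20)
        tree = DecisionTreeRegressor(max_depth=max_depth)
        tree.fit(X, residual)                    # the fitted tree defines the regions R_lm
        # For least squares the optimal leaf score gamma_lm is the mean residual of the
        # leaf, eq. (25), which is exactly what a squared-error regression tree predicts.
        F = F + nu * tree.predict(X)             # F_m = F_{m-1} + nu * gamma_lm
        trees.append(tree)

    def predict(X_new):
        return F0 + nu * sum(t.predict(X_new) for t in trees)
    return predict
```

For the exponential or logistic loss the same loop applies, but the leaf values have to be recomputed from the formulas of Table 1 instead of reusing the tree's own predictions.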

2.6 XGB - Extreme Gradient Boosting

XGB, also called extreme gradient boosting, has been developed since 2014. It first appeared in Kaggle competitions and was proposed by Tianqi Chen [7] at the University of Washington. The idea of XGB originates from gradient boosting; the biggest difference between gradient boosting and XGB is the objective function. In addition to the training loss, the objective function of XGB has a regularization term which does not exist in traditional gradient boosting and which prevents over-fitting. Like the other boosting methods, the final classifier is an ensemble of many base classifiers, and the ensemble classifier always gives a better prediction than that of the previous iteration. XGB uses a second-order Taylor expansion around the previous ensemble to approximate the objective function. We discuss the binary classification problem with labels $y \in \{-1, 1\}$ and training set $\{(x_i, y_i),\ i = 1, \dots, N\}$, where $x_i$ is a vector with $d$ dimensions ($d$ features).

2.6.1 Algorithm

Algorithm 5 Extreme gradient boosting
1: Initialize $F_0(x_i) = \gamma_0$, $i = 1, 2, \dots, N$  (initial value of the ensemble)
2: for $m = 1, \dots, M$ do
3:   $g_{im} = \left[\dfrac{\partial \psi(y_i, F(x_i))}{\partial F(x_i)}\right]_{F(x_i) = F_{m-1}(x_i)}$, $\quad h_{im} = \left[\dfrac{\partial^2 \psi(y_i, F(x_i))}{\partial F(x_i)^2}\right]_{F(x_i) = F_{m-1}(x_i)}$  (first and second derivatives for the second-order Taylor approximation)
4:   Train a classifier that partitions $\{x_i\}_{i=1}^N$ into $L$ disjoint regions $\{R_{lm}\}_{l=1}^L$, using $\{x_i, (g_{im}, h_{im})\}$
5:   $w_{lm} = -\dfrac{\sum_{x_i \in R_{lm}} g_{im}}{\sum_{x_i \in R_{lm}} h_{im} + \lambda}$  (optimal score of each region)
6:   $F_m(x) = F_{m-1}(x) + \nu \sum_{l=1}^L w_{lm}\, \mathbb{1}(x \in R_{lm})$  (add the base classifier to the additive model)
7: end for

$\gamma_0$: initial score of each sample; $g_{im}$: first derivative of the loss of sample $i$ at the $m$-th iteration; $h_{im}$: second derivative of the loss of sample $i$ at the $m$-th iteration; $R_{lm}$: the $x_i$ classified to region $l$; $w_{lm}$: optimal score of leaf $l$.

2.6.2 Derivation

The objective function of XGB consists of two terms, the loss function and the regularization term. Denote the objective function at the $m$-th iteration by $\mathcal{L}^{(m)}$:

$$\mathcal{L}^{(m)} = \sum_{i=1}^N \psi(y_i, F_{m-1}(x_i) + f_m(x_i)) + \Omega(f_m), \qquad (26)$$

$$\Omega(f_m) = \gamma L + \frac{1}{2}\lambda \lVert w \rVert^2, \qquad (27)$$

the loss function and the regularization term respectively. $f_m$ is a regression tree trained from $\{x_i, (g_{im}, h_{im})\}$; it partitions all samples $\{x_i,\ i = 1, \dots, N\}$ into $L$ disjoint regions $\{R_{lm}\}_{l=1}^L$, which can be regarded as leaves. The goal is to minimize the objective $\mathcal{L}^{(m)}$ under a fixed partition $\{R_{lm}\}_{l=1}^L$ and to find the optimal weight $w_{lm}$ of each region $l$ such that $f_m(x_i \in R_{lm}) = w_{lm}$. Differently from traditional gradient boosting, XGB uses the second-order Taylor expansion to approximate the loss function, so $\psi$ is required to be twice differentiable:

$$\psi(y_i, F_{m-1}(x_i) + f_m(x_i)) \approx \psi(y_i, F_{m-1}(x_i)) + g_{im}\, f_m(x_i) + \frac{1}{2} h_{im}\, f_m^2(x_i), \qquad (28)$$

where

$$g_{im} = \partial_F\, \psi(y_i, F)\big|_{F = F_{m-1}(x_i)}, \qquad (29)$$

$$h_{im} = \partial_F^2\, \psi(y_i, F)\big|_{F = F_{m-1}(x_i)}. \qquad (30)$$

Equation (26) becomes

$$\mathcal{L}^{(m)} \approx \sum_{i=1}^N \left[\psi(y_i, F_{m-1}(x_i)) + g_{im}\, f_m(x_i) + \frac{1}{2} h_{im}\, f_m^2(x_i)\right] + \gamma L + \frac{1}{2}\lambda \lVert w \rVert^2. \qquad (31)$$

Since $\psi(y_i, F_{m-1}(x_i))$ is known, it is equivalent to optimize

$$\tilde{\mathcal{L}}^{(m)} = \sum_{i=1}^N \left[g_{im}\, f_m(x_i) + \frac{1}{2} h_{im}\, f_m^2(x_i)\right] + \gamma L + \frac{1}{2}\lambda \lVert w \rVert^2. \qquad (32)$$

Let $f_m(x_i \in R_{lm}) = w_l$; equation (32) can be rewritten as

$$\tilde{\mathcal{L}}^{(m)} = \sum_{l=1}^L \left[ w_l \sum_{x_i \in R_{lm}} g_{im} + \frac{1}{2} w_l^2 \left(\sum_{x_i \in R_{lm}} h_{im} + \lambda\right) + \gamma \right].$$

Let

$$\tilde{\mathcal{L}}_l^{(m)} = w_l \left(\sum_{x_i \in R_{lm}} g_{im}\right) + \frac{1}{2} w_l^2 \left(\sum_{x_i \in R_{lm}} h_{im} + \lambda\right) + \gamma. \qquad (33)$$

Finding the optimal regression tree $f_m$ is equivalent to finding, for each leaf $l$, the optimal $w_l$ that minimizes $\tilde{\mathcal{L}}_l^{(m)}$:

$$w_{lm} = \arg\min_{w_l} \tilde{\mathcal{L}}_l^{(m)}, \qquad (34)$$

and its minimum occurs where $\partial \tilde{\mathcal{L}}_l^{(m)} / \partial w_l = 0$. Denote $a_{lm} = \frac{1}{2}\left(\sum_{x_i \in R_{lm}} h_{im} + \lambda\right)$ and $b_{lm} = \sum_{x_i \in R_{lm}} g_{im}$. Equation (33) becomes

$$\tilde{\mathcal{L}}_l^{(m)} = a_{lm} w_l^2 + b_{lm} w_l + \gamma,$$

and from

$$\frac{\partial \tilde{\mathcal{L}}_l^{(m)}}{\partial w_l} = 2 a_{lm} w_l + b_{lm} = 0$$

we get

$$w_{lm} = -\frac{b_{lm}}{2 a_{lm}} = -\frac{\sum_{x_i \in R_{lm}} g_{im}}{\sum_{x_i \in R_{lm}} h_{im} + \lambda}. \qquad (35)$$

The corresponding optimal values are

$$\operatorname{Optimal}\left(\tilde{\mathcal{L}}_l^{(m)}\right) = \gamma - \frac{b_{lm}^2}{4 a_{lm}} = \gamma - \frac{\left(\sum_{x_i \in R_{lm}} g_{im}\right)^2}{2\left(\sum_{x_i \in R_{lm}} h_{im} + \lambda\right)},$$

$$\operatorname{Optimal}\left(\tilde{\mathcal{L}}^{(m)}\right) = \gamma L - \frac{1}{2} \sum_{l=1}^L \frac{\left(\sum_{x_i \in R_{lm}} g_{im}\right)^2}{\sum_{x_i \in R_{lm}} h_{im} + \lambda}, \qquad (36)$$

which is important for the later derivation of the splitting criterion used to train the new classifier in each iteration.

In the algorithm, an initial value is given to the ensemble classifier; in practical applications it can be defined by the user, or simply the same value as in gradient boosting can be used. The $m$-th iteration includes four steps: (1) compute the first derivative $g_{im}$ and the second derivative $h_{im}$ for all samples, using $y_i$ and the ensemble value $F_{m-1}(x_i)$ of the previous iteration; (2) fit a classifier to the training set $\{x_i, (g_{im}, h_{im})\}$; (3) compute the optimal scores $w_{lm}$; (4) add the scores to the ensemble classifier.

(1) Compute $g_{im}$ and $h_{im}$ for the different loss functions by equations (29) and (30).

Least-squares loss $(y - F(x))^2$:

$$g_{im} = 2\left(F_{m-1}(x_i) - y_i\right), \qquad h_{im} = 2.$$

Exponential loss $e^{-yF(x)}$:

$$g_{im} = -y_i\, e^{-y_i F_{m-1}(x_i)}, \qquad h_{im} = y_i^2\, e^{-y_i F_{m-1}(x_i)} = e^{-y_i F_{m-1}(x_i)},$$

because $y_i^2 = 1$ for $y_i \in \{-1, 1\}$.

Logistic loss $\log(1 + e^{-2yF(x)})$:

$$g_{im} = \frac{-2 y_i\, e^{-2 y_i F_{m-1}(x_i)}}{1 + e^{-2 y_i F_{m-1}(x_i)}} = \frac{-2 y_i}{1 + e^{2 y_i F_{m-1}(x_i)}}, \qquad h_{im} = \frac{4\, e^{2 y_i F_{m-1}(x_i)}}{\left(1 + e^{2 y_i F_{m-1}(x_i)}\right)^2}.$$

(2) and (3) Train a classifier that partitions $\{x_i,\ i = 1, \dots, N\}$ into $\{R_{lm}\}_{l=1}^L$ and find the optimal scores $w_{lm}$ minimizing the objective function. In step (2), $g_{im}$ and $h_{im}$ are used in the stopping/splitting criterion for training the new classifier with $L$ leaves: the classifier stops splitting after evaluating the effect of further splitting each leaf. To measure the goodness of the classifier we can use a similar idea as in decision trees: grow a tree from a single node and stop when the impurity (entropy) would increase after a split, so the optimal classifier is the one before that split. In XGB, the objective value $\tilde{\mathcal{L}}^{(m)}$ is the index representing the goodness of a classifier: if the objective value increases after splitting leaf $l$, then $l$ is not split; otherwise splitting continues until no leaf can be split further.

To formulate the stopping criterion, we partition the samples in $R_{lm}$ into $R_L$ and $R_R$, with leaf indices $l_L$ and $l_R$. If $l$ is a leaf of the optimal classifier, no possible partition (split) of $R_{lm}$ decreases the objective value. Since the objective values of the other leaves do not change when $R_{lm}$ is split, the difference of the objectives is equal to the difference between $\operatorname{Optimal}(\tilde{\mathcal{L}}_l^{(m)})$ and $\operatorname{Optimal}(\tilde{\mathcal{L}}_{l_L}^{(m)} + \tilde{\mathcal{L}}_{l_R}^{(m)})$. Thus the stopping criterion can be formulated as

$$\delta \tilde{\mathcal{L}}^{(m)} = \operatorname{Optimal}\left(\tilde{\mathcal{L}}_l^{(m)}\right) - \operatorname{Optimal}\left(\tilde{\mathcal{L}}_{l_L}^{(m)} + \tilde{\mathcal{L}}_{l_R}^{(m)}\right) < 0, \qquad (37)$$

where

$$\operatorname{Optimal}\left(\tilde{\mathcal{L}}_l^{(m)}\right) = \gamma - \frac{\left(\sum_{x_i \in R_{lm}} g_{im}\right)^2}{2\left(\sum_{x_i \in R_{lm}} h_{im} + \lambda\right)},$$

and

$$\operatorname{Optimal}\left(\tilde{\mathcal{L}}_{l_L}^{(m)} + \tilde{\mathcal{L}}_{l_R}^{(m)}\right) = 2\gamma - \frac{\left(\sum_{x_i \in R_L} g_{im}\right)^2}{2\left(\sum_{x_i \in R_L} h_{im} + \lambda\right)} - \frac{\left(\sum_{x_i \in R_R} g_{im}\right)^2}{2\left(\sum_{x_i \in R_R} h_{im} + \lambda\right)}.$$

Inequality (37) becomes

$$\frac{\left(\sum_{x_i \in R_L} g_{im}\right)^2}{2\left(\sum_{x_i \in R_L} h_{im} + \lambda\right)} + \frac{\left(\sum_{x_i \in R_R} g_{im}\right)^2}{2\left(\sum_{x_i \in R_R} h_{im} + \lambda\right)} - \frac{\left(\sum_{x_i \in R_{lm}} g_{im}\right)^2}{2\left(\sum_{x_i \in R_{lm}} h_{im} + \lambda\right)} - \gamma < 0. \qquad (38)$$

If $l$ is a leaf of the optimal classifier, every possible partition satisfies inequality (38). If $\delta \tilde{\mathcal{L}}^{(m)} > 0$, $l$ can be split further, and we have to find the best split of leaf $l$. In decision trees the best split is the partition that reduces the impurity (entropy) score the most; the exact greedy algorithm for split finding in XGB uses the same idea and chooses the split that maximizes $\delta \tilde{\mathcal{L}}^{(m)}$, i.e. decreases the objective value the most. The exact greedy algorithm sorts the samples in $R_{lm}$ by $x_{ik}$, the value of the $k$-th feature, considers the partitions $R_L = \{x_s : x_{sk} \le x_{ik}\}$ and $R_R = \{x_j : x_{jk} > x_{ik}\}$, and always keeps the partition with the larger $\delta \tilde{\mathcal{L}}^{(m)}$. Repeating this for every feature finally yields the best split of leaf $l$. After obtaining $L$ and $\{R_{lm}\}_{l=1}^L$, the optimal weight $w_{lm}$ of each leaf $l$ is computed by equation (35).

In the last step, the ensemble classifier is output. For the least-squares loss the prediction is the sign of the ensemble value (for a binary classification problem with $y_i \in \{-1, 1\}$). For the exponential and logistic losses, the sigmoid function can be used to map the value to $[0, 1]$, returning $1$ if the value is larger than $0.5$. Table 2 summarizes the results for the different objective functions.

Table 2: Summary of the different loss functions for XGB

Least-squares loss $\psi(y, F) = (y - F)^2$:
  $g_{im} = 2(F_{m-1}(x_i) - y_i)$,  $h_{im} = 2$,  $w_{lm} = \dfrac{\sum_{x_i \in R_{lm}} 2(y_i - F_{m-1}(x_i))}{2 N_l + \lambda}$.
Exponential loss $\psi(y, F) = e^{-yF}$:
  $g_{im} = -y_i\, e^{-y_i F_{m-1}(x_i)}$,  $h_{im} = e^{-y_i F_{m-1}(x_i)}$,  $w_{lm} = \dfrac{\sum_{x_i \in R_{lm}} y_i\, e^{-y_i F_{m-1}(x_i)}}{\sum_{x_i \in R_{lm}} e^{-y_i F_{m-1}(x_i)} + \lambda}$.
Logistic loss $\psi(y, F) = \log(e^{-2yF} + 1)$:
  $g_{im} = \dfrac{-2 y_i}{1 + e^{2 y_i F_{m-1}(x_i)}}$,  $h_{im} = \dfrac{4\, e^{2 y_i F_{m-1}(x_i)}}{\left(1 + e^{2 y_i F_{m-1}(x_i)}\right)^2}$,  $w_{lm} = \dfrac{\sum_{x_i \in R_{lm}} \dfrac{2 y_i}{1 + e^{2 y_i F_{m-1}(x_i)}}}{\sum_{x_i \in R_{lm}} \dfrac{4\, e^{2 y_i F_{m-1}(x_i)}}{\left(1 + e^{2 y_i F_{m-1}(x_i)}\right)^2} + \lambda}$.
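The two quantities that drive XGB's tree construction are the optimal leaf weight of equation (35) and the split gain behind inequality (38). The following sketch computes them for the logistic loss; the helper names are my own and this is not the code of the xgboost library.

```python
import numpy as np

def grad_hess_logistic(y, F):
    """g_im and h_im for the logistic loss log(1 + exp(-2 y F)), with y in {-1, +1}."""
    g = -2.0 * y / (1.0 + np.exp(2.0 * y * F))
    h = 4.0 * np.exp(2.0 * y * F) / (1.0 + np.exp(2.0 * y * F)) ** 2
    return g, h

def leaf_weight(g, h, lam):
    """Optimal leaf score w_lm = -sum(g) / (sum(h) + lambda), equation (35)."""
    return -g.sum() / (h.sum() + lam)

def split_gain(g, h, left_mask, lam, gamma):
    """delta of inequality (38): a positive value means the split decreases the objective."""
    gL, hL = g[left_mask], h[left_mask]
    gR, hR = g[~left_mask], h[~left_mask]
    return (gL.sum() ** 2 / (hL.sum() + lam)
            + gR.sum() ** 2 / (hR.sum() + lam)
            - g.sum() ** 2 / (h.sum() + lam)) / 2.0 - gamma
```

The exact greedy algorithm simply evaluates split_gain for every sorted value of every feature and keeps the split with the largest positive gain; when no split has positive gain, the leaf is kept and its weight is given by leaf_weight.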

3 Measurement and Experiment

3.1 Measurement of goodness of the classifier

In general, the goodness of a model can be measured by the accuracy, the ratio between the number of correct predictions and the total count. In some specific cases, however, the accuracy may not truly represent the goodness of the model, for example for the quality test prediction of a high-yield product, where there are only rare Fail samples compared to the Pass samples. In the dataset used for the demonstration in this thesis, the Fail samples of the quality test make up less than 1% of all products, i.e. the two classes have a big disparity. A model trained on such an imbalanced dataset will tend to predict the majority class, yet it can still reach a high accuracy, which moreover varies with the class ratio of the given test set. For example, if the ratio between the two classes (Fail:Pass) is 0.01 (1:100), the model will be prone to predict the result as Pass, and the accuracy is higher than 99% as long as the class distribution of the test set is the same as that of the training set; but on a test set with more failure cases, such a model cannot predict precisely. To effectively measure the goodness of a model learned from an imbalanced two-class data set, the AUC of the ROC curve is used throughout this thesis.

3.1.1 ROC curve

The ROC (receiver operating characteristics) curve is a curve used for measuring the goodness of a binary classifier. Its x axis and y axis represent the false positive rate and the true positive rate, respectively. The area under the curve, called the AUC, can serve as an index of the goodness of the model. In binary classification we call the two classes + and -, and the combination of the predicted result and the true class label has four different cases, TP (true positive), FP (false positive), TN (true negative) and FN (false negative), which can be represented by a confusion matrix as in Table 3. A true positive means a condition-positive case predicted as positive, and a false positive means a negative case predicted as positive. To draw the ROC curve, we adjust the threshold and obtain different pairs of false positive rate and true positive rate (FPR, TPR), defined by equations (39) and (40). For example, if an instance's probability of being + is $p$ and $p$ is larger than the threshold, then the instance is classified to the positive class. Thus, the higher the threshold, the smaller the false positive rate.

Figure 1: Receiver operating characteristics curve

Table 3: Confusion matrix obtained under a certain threshold.

                 Predicted +              Predicted -
  Condition +    True positive (TP)       False negative (FN)
  Condition -    False positive (FP)      True negative (TN)

$$TPR = \frac{TP}{TP + FN} \qquad (39)$$

$$FPR = \frac{FP}{FP + TN} \qquad (40)$$

3.1.2 Sigmoid function

The curve of the sigmoid function, an antisymmetric function, looks like an S, as shown in Fig 2. A special case of the sigmoid function is the logistic function, which maps a real number to $[0, 1]$:

$$S(x) = \frac{1}{1 + e^{-x}}. \qquad (41)$$

Some boosting methods return the sign of the ensemble classifier $F(x)$ as the prediction result, where $F(x)$ is an additive logistic function; in this sense the threshold has already been chosen as 0.5, since, as can be seen in Fig 2, $S(x) > 0.5$ when $x$ is positive. To cut different thresholds for drawing the ROC curve, the sigmoid function is used in this thesis to map $F$ to $[0, 1]$. Discrete AdaBoost is not suitable for this, because after the sigmoid mapping its returned value behaves more like an accuracy ($1 -$ error) than a class probability. Basically, this mapping is used only when the objective function is exponential or logistic, as in real AdaBoost, gradient boosting and extreme gradient boosting.

Figure 2: Sigmoid function maps the score to [0,1]
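A short sketch of how the score-to-probability mapping of equation (41) and the (FPR, TPR) pairs of equations (39)-(40) can be obtained with scikit-learn; y_true and scores are placeholder arrays standing in for the validation labels and the real-valued ensemble output.

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

def sigmoid(x):
    """Equation (41): maps a real-valued ensemble score to [0, 1]."""
    return 1.0 / (1.0 + np.exp(-x))

# Placeholders for the 0/1 labels and the raw ensemble scores F(x) of a validation set.
y_true = np.array([0, 0, 1, 0, 1, 0, 0, 1])
scores = np.array([-2.1, -0.3, 0.4, -1.0, 1.7, -0.8, -1.5, 0.1])

proba = sigmoid(scores)                            # predicted probability of the positive class
fpr, tpr, thresholds = roc_curve(y_true, proba)    # (FPR, TPR) pairs for all thresholds
auc = roc_auc_score(y_true, proba)                 # area under the ROC curve
print(auc)
```

Because the sigmoid is monotone, the mapping does not change the AUC itself; it only puts the scores on the [0,1] scale needed for cutting thresholds.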

3.2 Data preparation

The original dataset includes three parts: categorical, numerical and timestamp data. The three datasets together have 4267 columns and 1,183,748 rows. Because of the limited computational capacity of the laptop, only numerical and time-related data from a partial process flow are used in the later demonstration. Before starting to train the model, I construct three training sets as listed in Table 4.

The first training set is constructed by cutting a partial consecutive process flow and leaving out the rows with missing values. The dataset does not reveal the real physical meaning of the features in the numerical data, but it still contains useful information, such as the process sequence of each product through the machines. Some process steps can be done on more than one machine: for example, in Fig 3, B1 and B2 are equivalent machines, and at this station a product can be processed either in B1 or in B2, except for some special cases that need rework. In the raw data, the same feature measured on equivalent machines is put into different columns, so for the first training set I merged such columns into a single column.

The second training set is constructed by adding extra features to the first one: the machine number and the idling time of the machine before the product starts processing. If the product passes through B1, the added machine feature is B1. A machine may stand idle for a long time before a product arrives; its idling time is computed as the time difference between two consecutive products processed on that machine.

Since there are many missing values in the columns (the sampling rate of the measurements is not 100%), the third training set is constructed by cutting two consecutive partial process flows and keeping the rows with missing data. Later I use XGB for this demonstration, because among all the packages only XGB supports cells with missing values.

Table 4: Size of the training sets

  Training set   # of rows   # of columns   Note
  1              80000       42             Partial process flow without missing data
  2              80000       56             Partial process flow and extra added features
  3              22000       267            Partial process flow with missing data

Figure 3: Partial process flow
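The merging of equivalent-machine columns and the idle-time feature described above could be built roughly as in the following pandas sketch; the column names (value_B1, value_B2, timestamp) are hypothetical, since the real columns of the data set are anonymized.

```python
import pandas as pd

df = pd.DataFrame({
    "product_id": [1, 2, 3, 4],
    "value_B1":  [0.12, None, 0.31, None],   # measurement if the product went through B1
    "value_B2":  [None, 0.25, None, 0.40],   # the same measurement if it went through B2
    "timestamp": [10.0, 12.5, 13.0, 19.0],   # time the product entered the station
})

# Merge the two equivalent-machine columns into a single feature column and keep the
# machine number as an extra categorical feature (second training set).
df["value"] = df["value_B1"].fillna(df["value_B2"])
df["machine"] = df["value_B1"].notna().map({True: "B1", False: "B2"})

# Idling time of the machine before this product: the time difference between two
# consecutive products processed on the same machine.
df = df.sort_values("timestamp")
df["idle_time"] = df.groupby("machine")["timestamp"].diff()

print(df[["product_id", "value", "machine", "idle_time"]])
```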

3.3 Implementation

The demonstration is implemented on an ASUS laptop in an environment with Windows 10 (64-bit), 4 GB RAM and an x64 processor. The code is written in Python and executed in Jupyter Notebook, an interactive platform that supports multiple programming languages and allows each cell to be executed independently. The Python version is 3.3 and the Jupyter Notebook version is 4.1. Compared to the original data set, the training set used for the demonstration is very small, so the prediction results are not as good as the higher-ranked results in the competition; since the purpose of this thesis is to demonstrate the procedure rather than to reach high accuracy, this is acceptable. The programming mainly uses the scikit-learn package, where most of the methods such as AdaBoost and gradient boosting can be found; extreme gradient boosting is provided by the separate xgboost package, which offers a scikit-learn-compatible interface. Building the model consists of two steps: (1) find the optimal parameters to train the model; (2) find an optimal threshold so that the prediction result can be output.

3.3.1 Optimization of parameters

In the program, the model is optimized under a given condition, i.e. with fixed parameters. To find these optimal parameters before training the final model, I use a line-search strategy, which looks for a local extremum by adjusting one specific parameter while the other parameters are kept fixed. I demonstrate the procedure using extreme gradient boosting on the third training set of Table 4, because XGB is the only one of the four methods able to handle a training set with missing data. The line search is shown in Figures 4 to 10; the order in which the parameters are picked to maximize the AUC does not matter.

Firstly, we fix all parameters except the maximal depth of the base classifier and adjust it to find a local extremum. Fig 4 shows that the local maximum of the ROC AUC occurs at max_depth = 2.

Figure 4: Default values: M = 100, λ = 1, γ = 0, objective = logistic, ν = 0.1

Secondly, we fix max_depth = 2 and do a line search over the number of base classifiers (n_estimators) to find the value that maximizes the AUC while the other parameters are fixed. In Fig 5, the maximum AUC occurs at 103.

Figure 5: max_depth = 2, and default values λ = 1, γ = 0, objective = logistic, ν = 0.1

Figure 6: M = 103, max_depth = 2, λ = 1, γ = 0, objective = logistic

Fig 6 shows that the learning rate ν attains the maximum AUC at 0.1 and that the AUC keeps trending down beyond 0.1; empirical experience suggests values between 0.01 and 0.2. γ is a parameter that only appears in XGB. Since the AUC did not change for values between 0.01 and 0.2 during the runs, I took 0.2 as the increment of each step; the maximum occurs for γ between 0 and 0.2. In practice γ may end up either very close to zero or very far from zero, but the result in Fig 7 shows that the model is almost the same as random guessing when γ is very large.

Figure 7: M = 103, max_depth = 2, λ = 1, objective = logistic

Figure 8: M = 103, max_depth = 2, λ = 1, objective = logistic

Figure 9: M = 103, max_depth = 2, λ = 1, objective = logistic

λ is also a parameter that only appears in XGB; it is the coefficient of the L2 regularization.

Figure 10: M = 103, max_depth = 2, γ = 0, objective = logistic

Finally, we obtain the optimal condition, with an AUC of 0.68; the optimal conditions found for the different training sets are summarized in Tables 5-7.
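One step of the line-search strategy (here over max_depth, with everything else fixed) might look like the following, using the xgboost package's XGBClassifier and a held-out validation split scored by AUC; the data here are random placeholders for the real training set, and the fixed parameter values are only examples.

```python
import numpy as np
import xgboost as xgb
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Placeholders for the prepared feature matrix and 0/1 quality-test labels.
X = np.random.rand(1000, 20)
y = np.random.randint(0, 2, 1000)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

best_depth, best_auc = None, -np.inf
for depth in range(1, 8):                        # line search over max_depth only
    model = xgb.XGBClassifier(
        n_estimators=100, max_depth=depth, learning_rate=0.1,
        reg_lambda=1.0, gamma=0.0, objective="binary:logistic",
    )
    model.fit(X_tr, y_tr)
    auc = roc_auc_score(y_val, model.predict_proba(X_val)[:, 1])
    if auc > best_auc:
        best_depth, best_auc = depth, auc
print(best_depth, best_auc)
```

The same loop is then repeated for n_estimators, the learning rate, γ and λ in turn, each time fixing the best values found so far.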

3.3.2 Optimal threshold

The ROC curve under the optimal condition is shown in Fig 11. To find the optimal threshold, we maximize the MCC (Matthews correlation coefficient), which represents the correlation between the two imbalanced classes well; its value always lies between -1 and 1. It can be computed by equation (42) from the confusion matrix obtained under a given threshold:

$$MCC = \frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}. \qquad (42)$$

Figure 11: ROC curve under the optimal condition

Figure 12: Confusion matrix with the threshold equal to 0.0644
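The threshold search that maximizes the MCC of equation (42) can be sketched as follows with scikit-learn's matthews_corrcoef; y_true and proba are placeholders for the validation labels and the predicted probabilities of the Fail class.

```python
import numpy as np
from sklearn.metrics import matthews_corrcoef

# Placeholders for the 0/1 labels and the predicted probabilities of class 1 (Fail).
y_true = np.array([0, 0, 0, 1, 0, 1, 0, 0])
proba = np.array([0.02, 0.10, 0.04, 0.30, 0.05, 0.07, 0.01, 0.03])

best_t, best_mcc = 0.5, -1.0
for t in np.linspace(0.01, 0.99, 99):            # scan candidate thresholds
    pred = (proba > t).astype(int)
    mcc = matthews_corrcoef(y_true, pred)        # equation (42)
    if mcc > best_mcc:
        best_t, best_mcc = t, mcc
print(best_t, best_mcc)
```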

3.3.3 Regularization and over-fitting

The most significant difference between the XGB algorithm and other boosting methods is the regularization term, which can prevent over-fitting. Figures 13 and 14 show that the model becomes slightly more stable once the number of estimators is larger than 100, while the training error keeps decaying in the meantime. Figure 15 shows that taking the regularization into account improves not only the over-fitting problem but also the AUC.
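A minimal way to reproduce the kind of comparison shown in Figures 13-15 is to train the same model twice, with and without the regularization terms, and watch the gap between training and validation AUC; as before, the data below are random placeholders for the real training set.

```python
import numpy as np
import xgboost as xgb
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X = np.random.rand(2000, 20)                      # placeholder features and 0/1 labels
y = np.random.randint(0, 2, 2000)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

for lam, gam in [(1.0, 0.1), (0.0, 0.0)]:         # with and without regularization
    model = xgb.XGBClassifier(n_estimators=300, max_depth=2, learning_rate=0.1,
                              reg_lambda=lam, gamma=gam, objective="binary:logistic")
    model.fit(X_tr, y_tr)
    auc_tr = roc_auc_score(y_tr, model.predict_proba(X_tr)[:, 1])
    auc_val = roc_auc_score(y_val, model.predict_proba(X_val)[:, 1])
    # A large gap between training and validation AUC is the over-fitting symptom
    # that the regularization term is meant to reduce.
    print(f"lambda={lam}, gamma={gam}: train AUC={auc_tr:.3f}, val AUC={auc_val:.3f}")
```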