CSCI-567: Machine Learning (Spring 2019)

CSCI-567: Machine Learning (Spring 2019). Prof. Victor Adamchik, University of Southern California. March 19, 2019.

Administration

- TA3 is due this week.
- TA4 will be available next week.
- PA4 (Clustering, Markov chains) will be available in two weeks.

Outline

1. Boosting
2. Gaussian mixture models

Top 10 Algorithms in Machine Learning...

You should know (in 2019):
- k-Nearest Neighbors
- Decision Trees and Random Forests
- Naive Bayes
- Linear and Logistic Regression
- Artificial Neural Networks
- SVM
- Clustering (k-means)
- Boosting
- Dimensionality Reduction Algorithms
- Markov Chains

Outline

1. Boosting
   - Examples
   - AdaBoost
   - Derivation of AdaBoost
2. Gaussian mixture models

Introduction

Boosting is a meta-algorithm: it takes a base algorithm (for classification, regression, ranking, etc.) as input and boosts its accuracy.

- Main idea: combine weak rules of thumb (e.g., 51% accuracy) to form a highly accurate predictor (e.g., 99% accuracy).
- Works very well in practice (especially in combination with trees).
- Often resistant to overfitting.
- Has strong theoretical guarantees.

We again focus on binary classification.

A simple example

Email spam detection. Given a training set like:
- ("Want to make money fast? ...", spam)
- ("Viterbi Research Gist ...", not spam)

The procedure:
- First obtain a classifier by applying a base algorithm, which can be a rather simple/weak one, like a decision stump: e.g., contains the word "money" => spam.
- Reweight the examples so that difficult ones get more attention, e.g., spam that doesn't contain the word "money".
- Obtain another classifier by applying the same base algorithm: e.g., empty "to" address => spam.
- Repeat...
- The final classifier is the (weighted) majority vote of all weak classifiers.

The base algorithm

A base algorithm $A$ (also called a weak learning algorithm or oracle) takes a training set $S$ weighted by $D$ as input and outputs a classifier $h \leftarrow A(S, D)$.

- This can be any off-the-shelf classification algorithm (e.g., decision trees, logistic regression, neural nets).
- Many algorithms can deal with a weighted training set: for an algorithm that minimizes some loss, we can simply replace the total loss by the weighted total loss (see the sketch below).
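As a generic illustration of the last point (a sketch, not a formula from the slides): if the base algorithm normally minimizes a total loss $\sum_n \ell(h(x_n), y_n)$, its weighted version minimizes

$$
\min_{h \in \mathcal{H}} \; \sum_{n=1}^{N} D(n)\, \ell\big(h(x_n), y_n\big),
$$

which recovers the unweighted objective when $D(n) = 1/N$.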

Boosting Algorithms

Given:
- a training set $S$
- a base algorithm $A$

Two things specify a boosting algorithm:
- how to reweight the examples?
- how to combine all the weak classifiers?

AdaBoost is one of the most successful boosting algorithms.

The AdaBoost Algorithm, 1990

Given $N$ samples $\{(x_n, y_n)\}$ with $y_n \in \{+1, -1\}$, and a base algorithm $A$.

Initialize $D_1(n) = \frac{1}{N}$ to be uniform.

For $t = 1, \ldots, T$:
- Train a weak classifier $h_t \leftarrow A(S, D_t)$ based on the current weights $D_t(n)$, by minimizing the weighted classification error
  $\epsilon_t = \sum_n D_t(n)\,\mathbb{I}[y_n \neq h_t(x_n)]$
- Calculate the importance of $h_t$ as
  $\beta_t = \frac{1}{2}\ln\!\left(\frac{1-\epsilon_t}{\epsilon_t}\right)$
  (note that $\beta_t > 0 \Leftrightarrow \epsilon_t < 0.5$)

The Betas

Calculate the importance of $h_t$ as $\beta_t = \frac{1}{2}\ln\!\left(\frac{1-\epsilon_t}{\epsilon_t}\right)$, noting that $\beta_t > 0 \Leftrightarrow \epsilon_t < 0.5$.
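For intuition, a few values computed directly from this formula (not taken from the slide): a weak classifier at chance level receives zero weight, and the weight grows as the weighted error shrinks.

$$
\epsilon_t = 0.5 \;\Rightarrow\; \beta_t = \tfrac{1}{2}\ln(1) = 0, \qquad
\epsilon_t = 0.3 \;\Rightarrow\; \beta_t = \tfrac{1}{2}\ln\!\big(\tfrac{0.7}{0.3}\big) \approx 0.42, \qquad
\epsilon_t = 0.1 \;\Rightarrow\; \beta_t = \tfrac{1}{2}\ln(9) \approx 1.10.
$$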

The AdaBoost Algorithm (continued)

For $t = 1, \ldots, T$:
- Train a weak classifier $h_t \leftarrow A(S, D_t)$.
- Calculate $\beta_t$.
- Update the weights:
  $\tilde{D}_{t+1}(n) = D_t(n)\, e^{-\beta_t y_n h_t(x_n)} = \begin{cases} D_t(n)\, e^{-\beta_t} & \text{if } h_t(x_n) = y_n \\ D_t(n)\, e^{\beta_t} & \text{otherwise} \end{cases}$
  and normalize them so that $D_{t+1}(n) = \tilde{D}_{t+1}(n) \big/ \sum_{n'} \tilde{D}_{t+1}(n')$.

Output the final classifier:
$H(x) = \operatorname{sign}\!\left(\sum_{t=1}^{T} \beta_t h_t(x)\right)$
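To make the steps above concrete, here is a minimal Python sketch of the AdaBoost loop. It is illustrative code, not from the course materials; the base algorithm is passed in as a function `base_learner(X, y, D)` that returns a classifier `h` with `h(X)` producing labels in {-1, +1}.

```python
import numpy as np

def adaboost(X, y, base_learner, T):
    """AdaBoost sketch. y holds labels in {-1, +1};
    base_learner(X, y, D) returns a classifier h with h(X) in {-1, +1}^N."""
    N = len(y)
    D = np.full(N, 1.0 / N)                     # D_1: uniform weights
    classifiers, betas = [], []
    for _ in range(T):
        h = base_learner(X, y, D)               # h_t = A(S, D_t)
        pred = h(X)
        eps = np.sum(D[pred != y])              # weighted error epsilon_t
        beta = 0.5 * np.log((1 - eps) / max(eps, 1e-12))  # importance beta_t
        D = D * np.exp(-beta * y * pred)        # e^{-beta} if correct, e^{+beta} if wrong
        D = D / D.sum()                         # normalize: sum_n D_{t+1}(n) = 1
        classifiers.append(h)
        betas.append(beta)

    def H(X_new):                               # final weighted majority vote
        return np.sign(sum(b * h(X_new) for h, b in zip(classifiers, betas)))
    return H
```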

Example

- 10 data points in $\mathbb{R}^2$. The size of + or − indicates the weight, which starts from the uniform $D_1$.
- The base algorithm is a decision stump.
- Observe that no single stump can predict very accurately on this dataset.

Round 1 ($t = 1$): classifier $h_1$ and updated weights $D_2$

- 3 points are misclassified (circled): $\epsilon_1 = 0.3$.
- $\beta_1 = \frac{1}{2}\ln\!\left(\frac{1-\epsilon_1}{\epsilon_1}\right) \approx 0.42$.
- $D_2$ puts more weight on those examples.

Round 2 ($t = 2$): classifier $h_2$ and updated weights $D_3$

- 3 points are misclassified (circled): $\epsilon_2 = 0.21$, $\beta_2 \approx 0.65$.
- $D_3$ puts more weight on those examples.

Round 3 ($t = 3$): classifier $h_3$

- Again 3 points are misclassified (circled): $\epsilon_3 = 0.14$, $\beta_3 \approx 0.92$.

Final classifier: combining 3 classifiers

$H(x) = \operatorname{sign}\big(0.42\, h_1(x) + 0.65\, h_2(x) + 0.92\, h_3(x)\big)$

All data points are now classified correctly, even though each weak classifier makes 3 mistakes.

Overfitting

When $T$ is large, the model is very complicated and overfitting can happen.

Resistance to overfitting

However, very often AdaBoost is resistant to overfitting. This used to be a mystery, but by now a rigorous theory has been developed to explain the phenomenon.

Why does AdaBoost work?

In fact, AdaBoost also follows the general framework of minimizing a surrogate loss.

Step 1: the model class that AdaBoost considers is
$\left\{ \operatorname{sgn}(f(\cdot)) \;\middle|\; f(\cdot) = \sum_{t=1}^{T} \beta_t h_t(\cdot) \text{ for some } \beta_t \ge 0 \text{ and } h_t \in \mathcal{H} \right\}$
where $\mathcal{H}$ is the set of models considered by the base algorithm.

Step 2: the loss that AdaBoost minimizes is the exponential loss
$\sum_{n=1}^{N} \exp(-y_n f(x_n))$

Greedy minimization

Step 3: AdaBoost minimizes the exponential loss greedily, that is, it finds $\beta_t, h_t$ one pair at a time for $t = 1, \ldots, T$.

Specifically, let $f_t = \sum_{\tau=1}^{t} \beta_\tau h_\tau$. Suppose we have found $f_{t-1}$; what should $f_t$ be?
$f_t = \sum_{\tau=1}^{t-1} \beta_\tau h_\tau + \beta_t h_t = f_{t-1} + \beta_t h_t.$

Greedily, we want to find $\beta_t, h_t$ to minimize
$\sum_{n=1}^{N} \exp(-y_n f_t(x_n)) = \sum_{n=1}^{N} \exp(-y_n f_{t-1}(x_n)) \exp(-y_n \beta_t h_t(x_n))$

Next, we use the definition of the weights (the weight-update step above).

Greedy minimization (continued)

Claim: $\exp(-y_n f_{t-1}(x_n)) \propto D_t(n)$.

Proof.
$D_t(n) \propto D_{t-1}(n) \exp(-y_n \beta_{t-1} h_{t-1}(x_n))$
$\;\propto D_{t-2}(n) \exp(-y_n \beta_{t-2} h_{t-2}(x_n)) \exp(-y_n \beta_{t-1} h_{t-1}(x_n))$
$\;\propto \cdots$
$\;\propto D_1(n) \exp(-y_n \beta_1 h_1(x_n) - \cdots - y_n \beta_{t-1} h_{t-1}(x_n))$
$\;\propto \exp(-y_n f_{t-1}(x_n))$

Remark: all weights $D_t(n)$ are normalized, $\sum_n D_t(n) = 1$.

Greedy minimization (continued)

So the goal becomes finding $\beta_t \ge 0$ and $h_t \in \mathcal{H}$ that minimize
$\operatorname*{argmin}_{\beta_t, h_t} \sum_{n=1}^{N} \exp(-y_n f_t(x_n)) = \operatorname*{argmin}_{\beta_t, h_t} \sum_{n=1}^{N} D_t(n) \exp(-y_n \beta_t h_t(x_n))$

We decompose the weighted loss into two parts:
$\sum_{n=1}^{N} D_t(n) \exp(-y_n \beta_t h_t(x_n))$
$= \sum_{n: y_n \neq h_t(x_n)} D_t(n)\, e^{\beta_t} + \sum_{n: y_n = h_t(x_n)} D_t(n)\, e^{-\beta_t}$
$= \epsilon_t e^{\beta_t} + (1 - \epsilon_t) e^{-\beta_t}$  (recall $\epsilon_t = \sum_{n: y_n \neq h_t(x_n)} D_t(n)$)
$= \epsilon_t (e^{\beta_t} - e^{-\beta_t}) + e^{-\beta_t}$

We find $h_t$ by minimizing the weighted classification error $\epsilon_t$.

Minimizing the weighted classification error

Thus, we want to choose $h_t$ such that
$h_t = \operatorname*{argmin}_{h_t \in \mathcal{H}} \; \epsilon_t = \operatorname*{argmin}_{h_t \in \mathcal{H}} \sum_{n: y_n \neq h_t(x_n)} D_t(n)$

This is exactly the first step of the AdaBoost algorithm above: train a weak classifier based on the current weights $D_t(n)$.

Greedy minimization (continued)

When $h_t$ (and thus $\epsilon_t$) is fixed, we then find $\beta_t$ to minimize
$\epsilon_t (e^{\beta_t} - e^{-\beta_t}) + e^{-\beta_t}$

Taking the derivative with respect to $\beta_t$ and setting it to zero gives the optimal
$\beta_t^{*} = \frac{1}{2}\ln\!\left(\frac{1-\epsilon_t}{\epsilon_t}\right)$
which is precisely the importance weight used by the AdaBoost algorithm.

Exercise: verify the solution $\beta_t^{*}$ (a check follows below).
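As a check of this step, setting the derivative to zero:

$$
\frac{d}{d\beta_t}\Big[\epsilon_t e^{\beta_t} + (1-\epsilon_t)e^{-\beta_t}\Big]
  = \epsilon_t e^{\beta_t} - (1-\epsilon_t)e^{-\beta_t} = 0
\;\Longrightarrow\;
e^{2\beta_t} = \frac{1-\epsilon_t}{\epsilon_t}
\;\Longrightarrow\;
\beta_t^{*} = \frac{1}{2}\ln\!\left(\frac{1-\epsilon_t}{\epsilon_t}\right),
$$

and the second derivative $\epsilon_t e^{\beta_t} + (1-\epsilon_t)e^{-\beta_t}$ is positive, so this stationary point is indeed a minimum.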

Updating the weights

Now that we have improved our classifier into
$f_t(x_n) = f_{t-1}(x_n) + \beta_t h_t(x_n)$

at the $t$-th iteration we need to compute the weights associated with this classifier:
$D_{t+1}(n) \propto e^{-y_n f_t(x_n)} = e^{-y_n [f_{t-1}(x_n) + \beta_t h_t(x_n)]} \propto D_t(n)\, e^{-y_n \beta_t h_t(x_n)} = \begin{cases} D_t(n)\, e^{\beta_t} & \text{if } y_n \neq h_t(x_n) \\ D_t(n)\, e^{-\beta_t} & \text{if } y_n = h_t(x_n) \end{cases}$

which is precisely the weight-update step of the AdaBoost algorithm above.

Remarks

Note that the AdaBoost algorithm itself never specifies how we obtain $h_t$, as long as it minimizes the weighted classification error
$\epsilon_t = \sum_n D_t(n)\,\mathbb{I}[y_n \neq h_t(x_n)]$
In this sense AdaBoost is a meta-algorithm and can be used with any classifier for which we can do the above.

Exercise: how do we choose the decision stump at the second round, given the weights $D_2$ of the example above? We can simply enumerate all possible ways of placing vertical and horizontal lines to separate the data points into two classes and pick the one with the smallest weighted classification error (a sketch follows below).
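A possible implementation of this enumeration as a base learner for the AdaBoost sketch above; the names and structure are illustrative assumptions, not the course's code.

```python
import numpy as np

def stump_learner(X, y, D):
    """Return the axis-aligned decision stump with the smallest weighted error under D."""
    best_err, best = np.inf, None
    for j in range(X.shape[1]):                   # each feature (vertical/horizontal split)
        for thr in np.unique(X[:, j]):            # each candidate threshold
            for sign in (+1, -1):                 # which side of the split is labeled +1
                pred = sign * np.where(X[:, j] > thr, 1, -1)
                err = np.sum(D[pred != y])        # weighted classification error
                if err < best_err:
                    best_err, best = err, (j, thr, sign)
    j, thr, sign = best
    return lambda Xq, j=j, thr=thr, sign=sign: sign * np.where(Xq[:, j] > thr, 1, -1)

# Example usage with the AdaBoost sketch above (3 rounds, as in the worked example):
# H = adaboost(X, y, stump_learner, T=3)
```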

Summary of boosting

- The key idea of boosting is to combine weak predictors into a strong one.
- There are many boosting algorithms; AdaBoost is the most classic one.
- AdaBoost greedily minimizes the exponential loss.
- AdaBoost tends not to overfit.

Taxonomy of ML Models

There are two kinds of classification models in machine learning: generative models and discriminative models.

Discriminative models:
- Examples: nearest neighbor, traditional neural networks, SVM.
- We learn $f(\cdot)$ on a data set $\{(x_i, y_i)\}$ to output the most likely $y$ for an unseen $x$.
- Having $f(\cdot)$, we know how to discriminate unseen $x$'s from different classes.
- We learn the decision boundary between the classes.
- We have no idea how the data was generated.

Taxonomy of ML Models (continued)

Generative models:
- Examples: Naïve Bayes, Gaussian mixture models, hidden Markov models, generative adversarial networks (GANs).
- Widely used in unsupervised machine learning.
- A probabilistic way to think about how the data might have been generated.
- Learn the joint probability distribution $P(x, y)$ and predict $P(y \mid x)$ with the help of Bayes' theorem (see below).
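Concretely, once the joint distribution $P(x, y)$ is learned, prediction follows from Bayes' theorem:

$$
p(y \mid x) \;=\; \frac{p(x, y)}{p(x)} \;=\; \frac{p(x \mid y)\, p(y)}{\sum_{y'} p(x \mid y')\, p(y')}.
$$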

Outline

1. Boosting
2. Gaussian mixture models
   - Motivation and Model
   - EM algorithm

Gaussian mixture models

Gaussian mixture models (GMMs) are a probabilistic approach to clustering:
- more explanatory than minimizing the K-means objective
- can be seen as a soft version of K-means

To fit GMMs, we will introduce a powerful method for learning probabilistic models: the Expectation-Maximization (EM) algorithm.

A generative model

For classification, we discussed the sigmoid model to explain how the labels are generated. Similarly, for clustering, we want to come up with a probabilistic model $p$ that explains how the data is generated; that is, each point is an independent sample $x \sim p$.

What probabilistic model generates data like this?

Gaussian mixture models: intuition

We will model each region with a Gaussian distribution; this leads to the idea of Gaussian mixture models (GMMs). The problem we now face is that (i) we do not know which (color) region a data point comes from, and (ii) we do not know the parameters of the Gaussian distribution in each region. We need to find all of them from the unsupervised data $D = \{x_n\}_{n=1}^{N}$.

GMM: formal definition

A GMM has the following density function:
$p(x) = \sum_{k=1}^{K} \omega_k\, N(x \mid \mu_k, \Sigma_k) = \sum_{k=1}^{K} \omega_k\, \frac{1}{\sqrt{(2\pi)^D |\Sigma_k|}}\, e^{-\frac{1}{2}(x-\mu_k)^{\mathrm{T}} \Sigma_k^{-1} (x-\mu_k)}$

where
- $K$: the number of Gaussian components (the same as the number of clusters we want)
- $\mu_k$ and $\Sigma_k$: mean and covariance matrix of the $k$-th Gaussian
- $\omega_1, \ldots, \omega_K$: mixture weights; they represent how much each component contributes to the final distribution and satisfy two properties: $\omega_k > 0$ for all $k$, and $\sum_k \omega_k = 1$.
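As a sanity check of the definition, here is a small sketch that evaluates this density directly from the formula above; the example parameters at the bottom are made up for illustration.

```python
import numpy as np

def gaussian_pdf(x, mu, Sigma):
    """Multivariate Gaussian density N(x | mu, Sigma) for a single point x."""
    D = len(mu)
    diff = x - mu
    norm = np.sqrt((2 * np.pi) ** D * np.linalg.det(Sigma))
    return np.exp(-0.5 * diff @ np.linalg.solve(Sigma, diff)) / norm

def gmm_pdf(x, weights, means, covs):
    """GMM density p(x) = sum_k w_k N(x | mu_k, Sigma_k)."""
    return sum(w * gaussian_pdf(x, mu, S) for w, mu, S in zip(weights, means, covs))

# Illustrative example with K = 2 components in 2D:
weights = [0.3, 0.7]
means = [np.array([0.0, 0.0]), np.array([3.0, 3.0])]
covs = [np.eye(2), 0.5 * np.eye(2)]
print(gmm_pdf(np.array([1.0, 1.0]), weights, means, covs))
```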

Another view

By introducing a latent variable $z \in [K]$ that indicates cluster membership, we can see $p$ as a marginal distribution:
$p(x) = \sum_{k=1}^{K} p(x, z = k) = \sum_{k=1}^{K} p(z = k)\, p(x \mid z = k) = \sum_{k=1}^{K} \omega_k\, N(x \mid \mu_k, \Sigma_k)$

Here $x$ and $z$ are both random variables drawn from the model:
- $x$ is observed
- $z$ is unobserved/latent

An example

The conditional distributions are
$p(x \mid z = \text{red}) = N(x \mid \mu_1, \Sigma_1)$
$p(x \mid z = \text{blue}) = N(x \mid \mu_2, \Sigma_2)$
$p(x \mid z = \text{green}) = N(x \mid \mu_3, \Sigma_3)$

The marginal distribution is
$p(x) = p(\text{red})\, N(x \mid \mu_1, \Sigma_1) + p(\text{blue})\, N(x \mid \mu_2, \Sigma_2) + p(\text{green})\, N(x \mid \mu_3, \Sigma_3)$

Learning GMMs

Learning a GMM means finding all the parameters $\theta = \{\omega_k, \mu_k, \Sigma_k\}_{k=1}^{K}$.

In the process, we will also learn the latent variables $z_n$:
$p(z_n = k \mid x_n) \triangleq \gamma_{nk} \in [0, 1]$
i.e. a soft assignment of each point to each cluster, as opposed to the hard assignment made by K-means.

GMMs are more explanatory than K-means:
- both learn the cluster centers $\mu_k$;
- in addition, a GMM learns the cluster weights $\omega_k$ and covariances $\Sigma_k$, so we can predict the probability of seeing a new point, and we can generate synthetic data.

How do we learn these parameters?

An obvious attempt is maximum-likelihood estimation (MLE): find
$\operatorname*{argmax}_{\theta} \ln \prod_{n=1}^{N} p(x_n; \theta) = \operatorname*{argmax}_{\theta} \sum_{n=1}^{N} \ln p(x_n; \theta) \triangleq \operatorname*{argmax}_{\theta} P(\theta)$

This is called the incomplete likelihood (since the $z_n$'s are unobserved), and it is intractable in general (a non-concave problem). One solution is to still apply GD/SGD, but a much more effective approach is the Expectation-Maximization (EM) algorithm.
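For the GMM in particular, writing out the objective shows where the difficulty comes from: the logarithm sits outside the sum over components, so the objective does not decompose across components and is non-concave in $\theta$:

$$
P(\theta) \;=\; \sum_{n=1}^{N} \ln p(x_n;\theta)
         \;=\; \sum_{n=1}^{N} \ln\!\left( \sum_{k=1}^{K} \omega_k\, N(x_n \mid \mu_k, \Sigma_k) \right).
$$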

Preview of EM for learning GMMs

Step 0: Initialize $\omega_k, \mu_k, \Sigma_k$ for each $k \in [K]$.

Step 1 (E-step): update the soft assignments (fixing the parameters):
$\gamma_{nk} = p(z_n = k \mid x_n) \propto \omega_k\, N(x_n \mid \mu_k, \Sigma_k)$

Step 2 (M-step): update the model parameters (fixing the assignments):
$\omega_k = \frac{\sum_n \gamma_{nk}}{N}, \qquad \mu_k = \frac{\sum_n \gamma_{nk}\, x_n}{\sum_n \gamma_{nk}}, \qquad \Sigma_k = \frac{1}{\sum_n \gamma_{nk}} \sum_n \gamma_{nk} (x_n - \mu_k)(x_n - \mu_k)^{\mathrm{T}}$

Step 3: return to Step 1 if not converged.
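A minimal Python sketch of these E-/M-step updates (illustrative code, not the course's implementation; the small `1e-6` ridge added to the covariances is an extra numerical safeguard, not part of the slide's updates):

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(X, K, n_iters=100, seed=0):
    """EM for a GMM on data X of shape (N, D); returns (weights, means, covs)."""
    rng = np.random.default_rng(seed)
    N, D = X.shape
    # Step 0: initialize omega_k, mu_k, Sigma_k
    weights = np.full(K, 1.0 / K)
    means = X[rng.choice(N, size=K, replace=False)]
    covs = np.stack([np.cov(X.T) + 1e-6 * np.eye(D)] * K)   # small ridge for stability

    for _ in range(n_iters):
        # E-step: gamma_nk proportional to omega_k * N(x_n | mu_k, Sigma_k)
        gamma = np.column_stack([
            weights[k] * multivariate_normal.pdf(X, mean=means[k], cov=covs[k])
            for k in range(K)
        ])
        gamma /= gamma.sum(axis=1, keepdims=True)

        # M-step: re-estimate the parameters given the soft assignments
        Nk = gamma.sum(axis=0)
        weights = Nk / N
        means = (gamma.T @ X) / Nk[:, None]
        for k in range(K):
            diff = X - means[k]
            covs[k] = (gamma[:, k, None] * diff).T @ diff / Nk[k] + 1e-6 * np.eye(D)

    return weights, means, covs
```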

Demo

Generate 50 data points from a mixture of 2 Gaussians with
$\omega_1 = 0.3,\; \mu_1 = 0.8,\; \Sigma_1 = 0.52$ and $\omega_2 = 0.7,\; \mu_2 = 1.2,\; \Sigma_2 = 0.35$.

In the accompanying figure (not reproduced here):
- the histogram represents the data;
- the red curve represents the ground-truth density $p(x) = \sum_{k=1}^{K} \omega_k N(x \mid \mu_k, \Sigma_k)$;
- the blue curve represents the learned density at a specific round.

EM demo.pdf shows how the blue curve moves towards the red curve quickly via EM.
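Following the latent-variable view from earlier (draw $z$ first, then $x$ given $z$), the demo data could be generated with a sketch like the following; the parameter values are taken as transcribed above and the code is illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
weights = np.array([0.3, 0.7])           # omega_1, omega_2
means = np.array([0.8, 1.2])             # mu_1, mu_2 (values as transcribed)
variances = np.array([0.52, 0.35])       # Sigma_1, Sigma_2 (scalar variances in 1D)

z = rng.choice(2, size=50, p=weights)              # latent component for each point
x = rng.normal(means[z], np.sqrt(variances[z]))    # draw x | z from N(mu_z, Sigma_z)
```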

EM algorithm

In general, EM is a heuristic for solving MLE with latent variables (not just for GMMs), i.e. for finding the maximizer of
$P(\theta) = \sum_{n=1}^{N} \ln p(x_n; \theta)$
where
- $\theta$ is the parameter of a general probabilistic model,
- the $x_n$'s are observed random variables,
- the $z_n$'s are latent variables.

Again, directly optimizing this objective is intractable.

High-level idea

Keep maximizing a lower bound of $P$ that is more manageable.
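For reference, the lower bound in question is the standard one obtained from Jensen's inequality (stated here as background; presumably developed in the following lecture). For any distribution $q_n$ over the latent variable $z_n$,

$$
\ln p(x_n;\theta)
  \;=\; \ln \sum_{k=1}^{K} q_n(k)\,\frac{p(x_n, z_n = k;\theta)}{q_n(k)}
  \;\ge\; \sum_{k=1}^{K} q_n(k)\,\ln\frac{p(x_n, z_n = k;\theta)}{q_n(k)},
$$

since $\ln$ is concave. The E-step tightens this bound by choosing $q_n(k) = p(z_n = k \mid x_n; \theta)$, and the M-step maximizes the bound over $\theta$.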