CSCI-567: Machine Learning (Spring 2019)
Prof. Victor Adamchik, U of Southern California. March 19, 2019.
Administration
- TA3 is due this week.
- TA4 will be available next week.
- PA4 (clustering, Markov chains) will be available in two weeks.
Outline
1 Boosting
2 Gaussian mixture models
Top 10 Algorithms in Machine Learning
You should know (in 2019):
- k-nearest neighbors
- Decision trees and random forests
- Naive Bayes
- Linear and logistic regression
- Artificial neural networks
- SVM
- Clustering (k-means)
- Boosting
- Dimensionality reduction algorithms
- Markov chains
Outline
1 Boosting: examples, AdaBoost, derivation of AdaBoost
2 Gaussian mixture models
Introduction
Boosting is a meta-algorithm: it takes a base algorithm (classification, regression, ranking, etc.) as input and boosts its accuracy.
- Main idea: combine weak rules of thumb (e.g. 51% accuracy) to form a highly accurate predictor (e.g. 99% accuracy).
- Works very well in practice (especially in combination with trees).
- Is often resistant to overfitting.
- Has strong theoretical guarantees.
We again focus on binary classification.
A simple example
Spam detection:
- Given a training set like:
  ("Want to make money fast?...", spam)
  ("Viterbi Research Gist...", not spam)
- First obtain a classifier by applying a base algorithm, which can be a rather simple/weak one, like a decision stump: e.g. contains the word "money" → spam.
- Reweight the examples so that difficult ones get more attention: e.g. spam that doesn't contain the word "money".
- Obtain another classifier by applying the same base algorithm: e.g. empty "To" address → spam.
- Repeat...
- The final classifier is the (weighted) majority vote of all weak classifiers.
The base algorithm
A base algorithm A (also called a weak learning algorithm/oracle) takes a training set S weighted by D as input and outputs a classifier h = A(S, D).
- This can be any off-the-shelf classification algorithm (e.g. decision trees, logistic regression, neural nets).
- Many algorithms can deal with a weighted training set: for an algorithm that minimizes some loss, we can simply replace the total loss by the weighted total loss.
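As a concrete illustration of the last point, here is a minimal sketch (the function name, toy data, and step size are my own, not from the slides) of logistic regression trained by gradient descent on a weighted total loss instead of the plain total loss:

```python
import numpy as np

def weighted_logistic_regression(X, y, D, lr=0.5, iters=2000):
    """Fit w by gradient descent on the *weighted* logistic loss
    sum_n D(n) * log(1 + exp(-y_n w.x_n)), with labels y_n in {-1, +1}."""
    w = np.zeros(X.shape[1])
    for _ in range(iters):
        margins = y * (X @ w)
        # gradient: sum_n D(n) * (-y_n x_n) * sigmoid(-margin_n)
        g = -(D * y * (1.0 / (1.0 + np.exp(margins)))) @ X
        w -= lr * g
    return w

# toy 1-D data with a bias feature: positives lie at x > 0
X = np.array([[-2.0, 1.0], [-1.0, 1.0], [1.0, 1.0], [2.0, 1.0]])
y = np.array([-1, -1, 1, 1])
D = np.ones(4) / 4          # uniform weights; boosting will change these
w = weighted_logistic_regression(X, y, D)
```

Passing non-uniform D makes the same algorithm focus on the heavily weighted examples, which is exactly the hook that boosting needs.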
Boosting algorithms
Given:
- a training set S
- a base algorithm A
Two things specify a boosting algorithm:
- how to reweight the examples?
- how to combine all the weak classifiers?
AdaBoost is one of the most successful boosting algorithms.
The AdaBoost algorithm (Freund and Schapire, 1995)
Given N samples {x_n, y_n}, where y_n ∈ {+1, −1}, and a base algorithm A.
Initialize D_1(n) = 1/N to be uniform.
For t = 1, ..., T:
- Train a weak classifier h_t = A(S, D_t) based on the current weights D_t(n), by minimizing the weighted classification error
    ε_t = Σ_n D_t(n) I[y_n ≠ h_t(x_n)]
- Calculate the importance of h_t as
    β_t = (1/2) ln((1 − ε_t)/ε_t)    (β_t > 0 ⟺ ε_t < 0.5)
The AdaBoost algorithm (continued)
For t = 1, ..., T:
- Train a weak classifier h_t = A(S, D_t).
- Calculate β_t.
- Update the weights
    D_{t+1}(n) = D_t(n) e^{−β_t y_n h_t(x_n)}
               = D_t(n) e^{−β_t} if h_t(x_n) = y_n, and D_t(n) e^{β_t} otherwise,
  and normalize them so that Σ_n D_{t+1}(n) = 1.
Output the final classifier:
    H(x) = sgn(Σ_{t=1}^T β_t h_t(x))
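The full loop can be sketched end-to-end in Python. This is a minimal illustration with decision stumps as the base algorithm; all function names and the toy data are my own, not from the slides:

```python
import numpy as np

def stump_predict(X, feat, thresh, sign):
    """Decision stump: predict `sign` where X[:, feat] > thresh, else -sign."""
    return np.where(X[:, feat] > thresh, sign, -sign)

def best_stump(X, y, D):
    """Base algorithm A(S, D): pick the stump minimizing the weighted error
    eps = sum_n D(n) * I[y_n != h(x_n)]."""
    best, best_eps = None, np.inf
    for feat in range(X.shape[1]):
        for thresh in np.unique(X[:, feat]) - 0.5:
            for sign in (+1, -1):
                eps = D @ (stump_predict(X, feat, thresh, sign) != y)
                if eps < best_eps:
                    best_eps, best = eps, (feat, thresh, sign)
    return best, best_eps

def adaboost(X, y, T=5):
    N = len(y)
    D = np.ones(N) / N                      # D_1: uniform weights
    stumps, betas = [], []
    for _ in range(T):
        (feat, thresh, sign), eps = best_stump(X, y, D)
        if eps == 0 or eps >= 0.5:          # perfect, or no better than chance
            if eps == 0:
                stumps.append((feat, thresh, sign)); betas.append(1.0)
            break
        beta = 0.5 * np.log((1 - eps) / eps)
        h = stump_predict(X, feat, thresh, sign)
        D = D * np.exp(-beta * y * h)       # up-weight the mistakes
        D /= D.sum()                        # normalize
        stumps.append((feat, thresh, sign)); betas.append(beta)
    return stumps, betas

def predict(X, stumps, betas):
    f = sum(b * stump_predict(X, *s) for s, b in zip(stumps, betas))
    return np.sign(f)

# toy 1-D "interval" data: no single stump classifies it perfectly,
# but a few boosted stumps do
X = np.array([[0.0], [1.0], [2.0], [3.0], [4.0]])
y = np.array([-1, 1, 1, 1, -1])
stumps, betas = adaboost(X, y, T=3)
```

The `eps >= 0.5` guard reflects the condition β_t > 0 ⟺ ε_t < 0.5: a weak learner no better than chance contributes nothing.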
Example
10 data points in R². The size of + or − indicates the weight, which starts from uniform (D_1).
The base algorithm is a decision stump.
Observe that no single stump can predict very accurately for this dataset.
Round 1 (t = 1):
- 3 points misclassified (circled): ε_1 = 0.3
- β_1 = (1/2) ln((1 − ε_1)/ε_1) ≈ 0.42
- D_2 puts more weight on those examples
(figure: h_1 and D_2)
Round 2 (t = 2):
- 3 points misclassified (circled): ε_2 = 0.21
- β_2 ≈ 0.66
- D_3 puts more weight on those examples
(figure: h_2 and D_3)
Round 3 (t = 3):
- again 3 points misclassified (circled): ε_3 = 0.14
- β_3 ≈ 0.92
(figure: h_3)
Final classifier: combining the 3 classifiers
H(x) = sgn(β_1 h_1(x) + β_2 h_2(x) + β_3 h_3(x))
All data points are now classified correctly, even though each weak classifier makes 3 mistakes.
Overfitting
When T is large, the model is very complicated and overfitting can happen.
Resistance to overfitting
However, very often AdaBoost is resistant to overfitting.
This used to be a mystery, but by now rigorous theory has been developed to explain the phenomenon.
Why does AdaBoost work?
In fact, AdaBoost also follows the general framework of minimizing a surrogate loss.
Step 1: the model that AdaBoost considers is
    { sgn(f(·)) : f(·) = Σ_{t=1}^T β_t h_t(·) for some β_t ≥ 0 and h_t ∈ H }
where H is the set of models considered by the base algorithm.
Step 2: the loss that AdaBoost minimizes is the exponential loss
    Σ_{n=1}^N exp(−y_n f(x_n))
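A quick sketch (names and toy scores are mine, not from the slides) of why the exponential loss is a usable surrogate: it upper-bounds the 0-1 loss pointwise, since exp(−y f) ≥ 1 whenever sgn(f) ≠ y:

```python
import numpy as np

def exp_loss(f_vals, y):
    """Exponential loss: sum_n exp(-y_n * f(x_n))."""
    return np.exp(-y * f_vals).sum()

def zero_one_loss(f_vals, y):
    """Number of sign mistakes of the combined score f."""
    return int((np.sign(f_vals) != y).sum())

y = np.array([1, -1, 1, -1])
f = np.array([2.0, -0.5, -0.1, 1.5])   # hypothetical combined scores
# exp(-y*f) >= 1 on every mistake, so exp_loss >= zero_one_loss
surrogate, mistakes = exp_loss(f, y), zero_one_loss(f, y)
```

Driving the surrogate down therefore drives the training error down as well.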
Greedy minimization
Step 3: AdaBoost minimizes the exponential loss by a greedy approach, that is, it finds β_t, h_t one by one for t = 1, ..., T.
Specifically, let f_t = Σ_{τ=1}^t β_τ h_τ. Suppose we have found f_{t−1}; what should f_t be?
    f_t = Σ_{τ=1}^{t−1} β_τ h_τ + β_t h_t = f_{t−1} + β_t h_t.
Greedily, we want to find β_t, h_t to minimize
    Σ_{n=1}^N exp(−y_n f_t(x_n)) = Σ_{n=1}^N exp(−y_n f_{t−1}(x_n)) exp(−y_n β_t h_t(x_n))
Next, we use the definition of the weights from the algorithm above.
Greedy minimization
Claim: exp(−y_n f_{t−1}(x_n)) ∝ D_t(n).
Proof.
    D_t(n) ∝ D_{t−1}(n) exp(−y_n β_{t−1} h_{t−1}(x_n))
           ∝ D_{t−2}(n) exp(−y_n β_{t−2} h_{t−2}(x_n)) exp(−y_n β_{t−1} h_{t−1}(x_n))
           ∝ D_1(n) exp(−y_n β_1 h_1(x_n) − ... − y_n β_{t−1} h_{t−1}(x_n))
           ∝ exp(−y_n f_{t−1}(x_n))
Remark. All weights D_t(n) are normalized: Σ_n D_t(n) = 1.
Greedy minimization
So the goal becomes finding β_t ≥ 0 and h_t ∈ H that minimize
    argmin_{β_t, h_t} Σ_{n=1}^N exp(−y_n f_t(x_n)) = argmin_{β_t, h_t} Σ_{n=1}^N D_t(n) exp(−y_n β_t h_t(x_n))
We decompose the weighted loss function into two parts:
    Σ_{n=1}^N D_t(n) exp(−y_n β_t h_t(x_n))
    = Σ_{n: y_n ≠ h_t(x_n)} D_t(n) e^{β_t} + Σ_{n: y_n = h_t(x_n)} D_t(n) e^{−β_t}
    = ε_t e^{β_t} + (1 − ε_t) e^{−β_t}    (recall ε_t = Σ_{n: y_n ≠ h_t(x_n)} D_t(n))
    = ε_t (e^{β_t} − e^{−β_t}) + e^{−β_t}
We find h_t by minimizing the weighted classification error ε_t.
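The decomposition can be checked numerically; a small sketch (variable names and toy values are my own) confirms that the weighted exponential loss equals ε_t e^{β_t} + (1 − ε_t) e^{−β_t}:

```python
import numpy as np

# Toy check of the identity
#   sum_n D(n) exp(-y_n * beta * h(x_n)) = eps*e^beta + (1-eps)*e^{-beta}
# which holds because y_n * h(x_n) is +1 on correct points and -1 on mistakes.
rng = np.random.default_rng(0)
N = 20
D = rng.random(N); D /= D.sum()        # normalized weights
y = rng.choice([-1, 1], size=N)        # labels
h = rng.choice([-1, 1], size=N)        # weak-classifier predictions
beta = 0.7

lhs = (D * np.exp(-y * beta * h)).sum()
eps = D[y != h].sum()                  # weighted classification error
rhs = eps * np.exp(beta) + (1 - eps) * np.exp(-beta)
```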
Minimizing the weighted classification error
Thus, we want to choose h_t such that
    h_t = argmin_{h} ε_t = argmin_{h} Σ_{n: y_n ≠ h(x_n)} D_t(n)
This is exactly the first step of the AdaBoost algorithm: train a weak classifier based on the current weights D_t(n).
Greedy minimization
When h_t (and thus ε_t) is fixed, we then find β_t to minimize
    ε_t (e^{β_t} − e^{−β_t}) + e^{−β_t}
We take the derivative with respect to β_t, set it to zero, and derive the optimal
    β_t* = (1/2) ln((1 − ε_t)/ε_t)
which is precisely the corresponding step of the AdaBoost algorithm.
Exercise: verify the solution β_t*.
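One way to verify the solution is numerically. The sketch below (my own check, not from the slides) confirms that β_t* minimizes the objective, and that the minimum value works out to 2√(ε_t(1 − ε_t)):

```python
import numpy as np

# Objective from the derivation: g(beta) = eps*(e^beta - e^{-beta}) + e^{-beta},
# which simplifies to eps*e^beta + (1-eps)*e^{-beta}.
eps = 0.3
g = lambda b: eps * (np.exp(b) - np.exp(-b)) + np.exp(-b)

beta_star = 0.5 * np.log((1 - eps) / eps)   # claimed optimum

# brute-force the minimizer on a fine grid and compare
grid = np.linspace(-3, 3, 100001)
beta_num = grid[np.argmin(g(grid))]
```

Plugging β_t* back in gives g(β_t*) = 2√(ε_t(1 − ε_t)) < 1 whenever ε_t < 0.5, which is why each round shrinks the exponential loss.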
Updating the weights
Now we have improved our classifier to f_t(x_n) = f_{t−1}(x_n) + β_t h_t(x_n).
At the t-th iteration, we need to compute the weights for this classifier, which are
    D_{t+1}(n) ∝ exp(−y_n f_t(x_n))
             = exp(−y_n [f_{t−1}(x_n) + β_t h_t(x_n)])
             ∝ D_t(n) exp(−y_n β_t h_t(x_n))
             = D_t(n) e^{β_t} if y_n ≠ h_t(x_n), and D_t(n) e^{−β_t} if y_n = h_t(x_n)
which is precisely the last step of the AdaBoost algorithm.
Remarks
Note that the AdaBoost algorithm itself never specifies how we get h_t(x), as long as it minimizes the weighted classification error
    ε_t = Σ_n D_t(n) I[y_n ≠ h_t(x_n)]
In this respect, AdaBoost is a meta-algorithm and can be used with any classifier for which we can do the above.
Exercise. How do we choose the decision stump given the weights D_2 at the second round? (figure: h_1 and D_2)
We can simply enumerate all possible ways of putting vertical and horizontal lines to separate the data points into two classes, and pick the one with the smallest weighted classification error!
Summary for boosting
- The key idea of boosting is to combine weak predictors into a strong one.
- There are many boosting algorithms; AdaBoost is the most classic one.
- AdaBoost greedily minimizes the exponential loss.
- AdaBoost tends not to overfit.
Taxonomy of ML models
There are two kinds of classification models in machine learning: generative models and discriminative models.
Discriminative models:
- e.g. nearest neighbor, traditional neural networks, SVM.
- we learn f() on a data set (x_i, y_i) to output the most likely y for an unseen x.
- having f(), we know how to discriminate unseen x's from different classes.
- we learn the decision boundary between the classes.
- we have no idea how the data is generated.
Generative models:
- e.g. Naïve Bayes, Gaussian mixture models, hidden Markov models, generative adversarial networks (GANs).
- widely used in unsupervised machine learning.
- a probabilistic way to think about how the data might have been generated.
- learn the joint probability distribution P(x, y) and predict P(y | x) with the help of Bayes' theorem.
Outline
1 Boosting
2 Gaussian mixture models: motivation and model, EM algorithm
Gaussian mixture models
Gaussian mixture models (GMMs) are a probabilistic approach to clustering:
- more explanatory than minimizing the K-means objective
- can be seen as a soft version of K-means
To solve GMMs, we will introduce a powerful method for learning probabilistic models: the Expectation-Maximization (EM) algorithm.
A generative model
For classification, we discussed the sigmoid model to explain how the labels are generated.
Similarly, for clustering, we want to come up with a probabilistic model p to explain how the data is generated: each point is an independent sample x ~ p.
What probabilistic model generates data like this?
Gaussian mixture models: intuition
We model each region with a Gaussian distribution; this leads to the idea of Gaussian mixture models (GMMs). The problem we now face is that (i) we do not know which (color) region a data point comes from, and (ii) we do not know the parameters of the Gaussian distribution in each region. We need to find all of them from the unsupervised data D = {x_n}, n = 1, ..., N.
GMM: formal definition

A GMM has the following density function:

p(x) = Σ_{k=1}^K ω_k N(x | µ_k, Σ_k) = Σ_{k=1}^K ω_k · 1/√((2π)^D |Σ_k|) · exp(−(1/2)(x − µ_k)^T Σ_k^{−1} (x − µ_k))

where
K: the number of Gaussian components (the same as the number of clusters we want)
µ_k and Σ_k: the mean and covariance matrix of the k-th Gaussian
ω_1, ..., ω_K: the mixture weights; they represent how much each component contributes to the final distribution, and satisfy two properties: ω_k > 0 for all k, and Σ_k ω_k = 1
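As a sketch, the density above can be evaluated directly with NumPy (the mixture parameters below are illustrative, not from the lecture):

```python
import numpy as np

def gaussian_pdf(x, mu, Sigma):
    # N(x | mu, Sigma) = exp(-(1/2)(x-mu)^T Sigma^{-1} (x-mu)) / sqrt((2*pi)^D |Sigma|)
    D = len(mu)
    diff = x - mu
    norm = np.sqrt((2 * np.pi) ** D * np.linalg.det(Sigma))
    return np.exp(-0.5 * diff @ np.linalg.solve(Sigma, diff)) / norm

def gmm_density(x, weights, means, covs):
    # p(x) = sum_k omega_k N(x | mu_k, Sigma_k)
    return sum(w * gaussian_pdf(x, mu, S)
               for w, mu, S in zip(weights, means, covs))

# A toy 2-component mixture in 2D (illustrative parameters).
weights = [0.4, 0.6]
means = [np.zeros(2), np.array([3.0, 3.0])]
covs = [np.eye(2), 0.5 * np.eye(2)]
p0 = gmm_density(np.zeros(2), weights, means, covs)
```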
Another view

By introducing a latent variable z ∈ [K], which indicates cluster membership, we can see p as a marginal distribution:

p(x) = Σ_{k=1}^K p(x, z = k) = Σ_{k=1}^K p(z = k) p(x | z = k) = Σ_{k=1}^K ω_k N(x | µ_k, Σ_k)

x and z are both random variables drawn from the model:
x is observed
z is unobserved/latent
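This two-step view also gives the standard way to sample from a GMM: draw the latent z first, then draw x given z. A minimal sketch (the parameters are again illustrative):

```python
import numpy as np

def sample_gmm(n, weights, means, covs, seed=0):
    # Ancestral sampling: z ~ Categorical(omega), then x | z = k ~ N(mu_k, Sigma_k).
    rng = np.random.default_rng(seed)
    z = rng.choice(len(weights), size=n, p=weights)   # latent cluster memberships
    x = np.stack([rng.multivariate_normal(means[k], covs[k]) for k in z])
    return x, z

# Draw 1000 points from a toy 2-component mixture in 2D.
weights = [0.3, 0.7]
means = [np.zeros(2), np.array([4.0, 4.0])]
covs = [np.eye(2), np.eye(2)]
x, z = sample_gmm(1000, weights, means, covs)
```

Note that z is drawn but then discarded if we only care about x — exactly the marginalization in the formula above.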
An example

The conditional distributions are
p(x | z = red) = N(x | µ_1, Σ_1)
p(x | z = blue) = N(x | µ_2, Σ_2)
p(x | z = green) = N(x | µ_3, Σ_3)

The marginal distribution is
p(x) = p(red) N(x | µ_1, Σ_1) + p(blue) N(x | µ_2, Σ_2) + p(green) N(x | µ_3, Σ_3)
Learning GMMs

Learning a GMM means finding all the parameters θ = {ω_k, µ_k, Σ_k}_{k=1}^K. In the process, we will learn the latent variables z_n as well:

p(z_n = k | x_n) ≜ γ_nk ∈ [0, 1]

i.e., a soft assignment of each point to each cluster, as opposed to the hard assignment made by K-means.

GMM is more explanatory than K-means:
both learn the cluster centers µ_k
in addition, GMM learns the cluster weights ω_k and covariances Σ_k, so we can predict the probability of seeing a new point, and we can generate synthetic data
How to learn these parameters?

An obvious attempt is maximum-likelihood estimation (MLE): find

argmax_θ ln Π_{n=1}^N p(x_n; θ) = argmax_θ Σ_{n=1}^N ln p(x_n; θ) ≜ argmax_θ P(θ)

This is called the incomplete likelihood (since the z_n's are unobserved), and maximizing it is intractable in general (a non-concave problem). One solution is to still apply GD/SGD, but a much more effective approach is the Expectation-Maximization (EM) algorithm.
Preview of EM for learning GMMs

Step 0: Initialize ω_k, µ_k, Σ_k for each k ∈ [K]

Step 1 (E-step): update the soft assignments (fixing the parameters)
γ_nk = p(z_n = k | x_n) ∝ ω_k N(x_n | µ_k, Σ_k)

Step 2 (M-step): update the model parameters (fixing the assignments)
ω_k = (Σ_n γ_nk) / N
µ_k = (Σ_n γ_nk x_n) / (Σ_n γ_nk)
Σ_k = (1 / Σ_n γ_nk) Σ_n γ_nk (x_n − µ_k)(x_n − µ_k)^T

Step 3: return to Step 1 if not converged
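The steps above can be sketched in NumPy as follows. This is a bare-bones illustration, not a robust implementation: a fixed iteration count instead of a convergence check, a simple deterministic initialization (k-means initialization is common in practice), and a small ridge on the covariances for numerical stability.

```python
import numpy as np

def log_gauss(X, mu, Sigma):
    # log N(x_n | mu, Sigma) for every row x_n of X
    D = X.shape[1]
    diff = X - mu
    sol = np.linalg.solve(Sigma, diff.T).T
    quad = np.einsum('nd,nd->n', diff, sol)
    return -0.5 * (quad + D * np.log(2 * np.pi) + np.log(np.linalg.det(Sigma)))

def em_gmm(X, K, n_iter=50):
    N, D = X.shape
    # Step 0: uniform weights, means at K data points spread through the
    # dataset, covariances at the (regularized) data covariance.
    w = np.full(K, 1.0 / K)
    mu = X[np.linspace(0, N - 1, K).astype(int)].copy()
    Sigma = np.array([np.cov(X.T) + 1e-6 * np.eye(D) for _ in range(K)])
    for _ in range(n_iter):
        # Step 1 (E-step): gamma_nk ∝ w_k N(x_n | mu_k, Sigma_k), in log space.
        logp = np.stack([np.log(w[k]) + log_gauss(X, mu[k], Sigma[k])
                         for k in range(K)], axis=1)
        logp -= logp.max(axis=1, keepdims=True)   # stabilize before exponentiating
        gamma = np.exp(logp)
        gamma /= gamma.sum(axis=1, keepdims=True)
        # Step 2 (M-step): weighted MLE updates.
        Nk = gamma.sum(axis=0)
        w = Nk / N
        mu = (gamma.T @ X) / Nk[:, None]
        for k in range(K):
            diff = X - mu[k]
            Sigma[k] = (gamma[:, k, None] * diff).T @ diff / Nk[k] + 1e-6 * np.eye(D)
    return w, mu, Sigma
```

On two well-separated clusters, a handful of iterations already recovers the weights, means, and covariances.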
Demo

Generate 50 data points from a mixture of 2 Gaussians with
ω_1 = 0.3, µ_1 = 0.8, Σ_1 = 0.52
ω_2 = 0.7, µ_2 = 1.2, Σ_2 = 0.35

The histogram represents the data; the red curve represents the ground-truth density p(x) = Σ_{k=1}^K ω_k N(x | µ_k, Σ_k); the blue curve represents the learned density at a specific round. EM demo.pdf shows how the blue curve moves towards the red curve quickly via EM.
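The demo's setup can be reproduced with a short script — a sketch of the sampling step and the ground-truth density only (the lecture's plots themselves are not reproduced here):

```python
import numpy as np

rng = np.random.default_rng(0)
w = np.array([0.3, 0.7])      # omega_1, omega_2
mu = np.array([0.8, 1.2])     # mu_1, mu_2
var = np.array([0.52, 0.35])  # Sigma_1, Sigma_2 (scalar variances, since D = 1)

# Sample 50 points: z ~ Categorical(w), then x | z ~ N(mu_z, var_z).
z = rng.choice(2, size=50, p=w)
x = rng.normal(mu[z], np.sqrt(var[z]))

def true_density(t):
    # Ground-truth p(t) = sum_k w_k N(t | mu_k, var_k)  (the red curve)
    return sum(wk / np.sqrt(2 * np.pi * vk) * np.exp(-(t - mk) ** 2 / (2 * vk))
               for wk, mk, vk in zip(w, mu, var))
```

Plotting a histogram of x together with true_density over a grid reproduces the demo's starting picture.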
EM algorithm

In general, EM is a heuristic to solve MLE with latent variables (not just for GMMs), i.e., to find the maximizer of

P(θ) = Σ_{n=1}^N ln p(x_n; θ)

where
θ is the set of parameters of a general probabilistic model
the x_n's are observed random variables
the z_n's are latent variables

Again, directly maximizing this objective is intractable.
High level idea

Keep maximizing a lower bound of P that is more manageable.
More informationMachine Learning, Fall 2011: Homework 5
0-60 Machine Learning, Fall 0: Homework 5 Machine Learning Department Carnegie Mellon University Due:??? Instructions There are 3 questions on this assignment. Please submit your completed homework to
More informationEnsemble Methods. Charles Sutton Data Mining and Exploration Spring Friday, 27 January 12
Ensemble Methods Charles Sutton Data Mining and Exploration Spring 2012 Bias and Variance Consider a regression problem Y = f(x)+ N(0, 2 ) With an estimate regression function ˆf, e.g., ˆf(x) =w > x Suppose
More informationWe Live in Exciting Times. CSCI-567: Machine Learning (Spring 2019) Outline. Outline. ACM (an international computing research society) has named
We Live in Exciting Times ACM (an international computing research society) has named CSCI-567: Machine Learning (Spring 2019) Prof. Victor Adamchik U of Southern California Apr. 2, 2019 Yoshua Bengio,
More informationStatistical learning. Chapter 20, Sections 1 4 1
Statistical learning Chapter 20, Sections 1 4 Chapter 20, Sections 1 4 1 Outline Bayesian learning Maximum a posteriori and maximum likelihood learning Bayes net learning ML parameter learning with complete
More informationFinal Examination CS 540-2: Introduction to Artificial Intelligence
Final Examination CS 540-2: Introduction to Artificial Intelligence May 7, 2017 LAST NAME: SOLUTIONS FIRST NAME: Problem Score Max Score 1 14 2 10 3 6 4 10 5 11 6 9 7 8 9 10 8 12 12 8 Total 100 1 of 11
More informationClassification: The rest of the story
U NIVERSITY OF ILLINOIS AT URBANA-CHAMPAIGN CS598 Machine Learning for Signal Processing Classification: The rest of the story 3 October 2017 Today s lecture Important things we haven t covered yet Fisher
More informationClustering and Gaussian Mixtures
Clustering and Gaussian Mixtures Oliver Schulte - CMPT 883 2 4 6 8 1 12 14 16 18 2 4 6 8 1 12 14 16 18 5 1 15 2 25 5 1 15 2 25 2 4 6 8 1 12 14 2 4 6 8 1 12 14 5 1 15 2 25 5 1 15 2 25 detected tures detected
More information