CSCI-567: Machine Learning (Spring 2019)
Prof. Victor Adamchik, U of Southern California. March 19, 2019.
Administration
- TA3 is due this week.
- TA4 will be available next week.
- PA4 (clustering, Markov chains) will be available in two weeks.
Outline
1 Boosting
2 Gaussian mixture models
Top 10 Algorithms in Machine Learning
You should know (in 2019):
- k-nearest neighbors
- Decision trees and random forests
- Naive Bayes
- Linear and logistic regression
- Artificial neural networks
- SVM
- Clustering (k-means)
- Boosting
- Dimensionality reduction algorithms
- Markov chains
Outline
1 Boosting: examples, AdaBoost, derivation of AdaBoost
2 Gaussian mixture models
Introduction
Boosting is a meta-algorithm: it takes a base algorithm (classification, regression, ranking, etc.) as input and boosts its accuracy.
- Main idea: combine weak rules of thumb (e.g. 51% accuracy) to form a highly accurate predictor (e.g. 99% accuracy).
- Works very well in practice (especially in combination with trees).
- Is often resistant to overfitting.
- Has strong theoretical guarantees.
We again focus on binary classification.
A simple example
Spam detection:
- Given a training set like:
  ("Want to make money fast?...", spam)
  ("Viterbi Research Gist...", not spam)
- First obtain a classifier by applying a base algorithm, which can be a rather simple/weak one, like a decision stump: e.g. contains the word "money" → spam.
- Reweight the examples so that difficult ones get more attention: e.g. spam that doesn't contain the word "money".
- Obtain another classifier by applying the same base algorithm: e.g. empty "To" address → spam.
- Repeat...
- The final classifier is the (weighted) majority vote of all weak classifiers.
The base algorithm
A base algorithm A (also called a weak learning algorithm/oracle) takes a training set S weighted by D as input and outputs a classifier h = A(S, D).
- This can be any off-the-shelf classification algorithm (e.g. decision trees, logistic regression, neural nets).
- Many algorithms can deal with a weighted training set: for an algorithm that minimizes some loss, we can simply replace the total loss by the weighted total loss.
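As a concrete illustration of the last point, here is a minimal sketch (the function name, toy data, and step size are my own, not from the slides) of logistic regression trained by gradient descent on a weighted total loss instead of the plain total loss:

```python
import numpy as np

def weighted_logistic_regression(X, y, D, lr=0.5, iters=2000):
    """Fit w by gradient descent on the *weighted* logistic loss
    sum_n D(n) * log(1 + exp(-y_n w.x_n)), with labels y_n in {-1, +1}."""
    w = np.zeros(X.shape[1])
    for _ in range(iters):
        margins = y * (X @ w)
        # gradient: sum_n D(n) * (-y_n x_n) * sigmoid(-margin_n)
        g = -(D * y * (1.0 / (1.0 + np.exp(margins)))) @ X
        w -= lr * g
    return w

# toy 1-D data with a bias feature: positives lie at x > 0
X = np.array([[-2.0, 1.0], [-1.0, 1.0], [1.0, 1.0], [2.0, 1.0]])
y = np.array([-1, -1, 1, 1])
D = np.ones(4) / 4          # uniform weights; boosting will change these
w = weighted_logistic_regression(X, y, D)
```

Passing non-uniform D makes the same algorithm focus on the heavily weighted examples, which is exactly the hook that boosting needs.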
Boosting algorithms
Given:
- a training set S
- a base algorithm A
Two things specify a boosting algorithm:
- how to reweight the examples?
- how to combine all the weak classifiers?
AdaBoost is one of the most successful boosting algorithms.
The AdaBoost algorithm (Freund and Schapire, 1995)
Given N samples {x_n, y_n}, where y_n ∈ {+1, −1}, and a base algorithm A.
Initialize D_1(n) = 1/N to be uniform.
For t = 1, ..., T:
- Train a weak classifier h_t = A(S, D_t) based on the current weights D_t(n), by minimizing the weighted classification error
    ε_t = Σ_n D_t(n) I[y_n ≠ h_t(x_n)]
- Calculate the importance of h_t as
    β_t = (1/2) ln((1 − ε_t)/ε_t)    (β_t > 0 ⟺ ε_t < 0.5)
The AdaBoost algorithm (continued)
For t = 1, ..., T:
- Train a weak classifier h_t = A(S, D_t).
- Calculate β_t.
- Update the weights
    D_{t+1}(n) = D_t(n) e^{−β_t y_n h_t(x_n)}
               = D_t(n) e^{−β_t} if h_t(x_n) = y_n, and D_t(n) e^{β_t} otherwise,
  and normalize them so that Σ_n D_{t+1}(n) = 1.
Output the final classifier:
    H(x) = sgn(Σ_{t=1}^T β_t h_t(x))
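The full loop can be sketched end-to-end in Python. This is a minimal illustration with decision stumps as the base algorithm; all function names and the toy data are my own, not from the slides:

```python
import numpy as np

def stump_predict(X, feat, thresh, sign):
    """Decision stump: predict `sign` where X[:, feat] > thresh, else -sign."""
    return np.where(X[:, feat] > thresh, sign, -sign)

def best_stump(X, y, D):
    """Base algorithm A(S, D): pick the stump minimizing the weighted error
    eps = sum_n D(n) * I[y_n != h(x_n)]."""
    best, best_eps = None, np.inf
    for feat in range(X.shape[1]):
        for thresh in np.unique(X[:, feat]) - 0.5:
            for sign in (+1, -1):
                eps = D @ (stump_predict(X, feat, thresh, sign) != y)
                if eps < best_eps:
                    best_eps, best = eps, (feat, thresh, sign)
    return best, best_eps

def adaboost(X, y, T=5):
    N = len(y)
    D = np.ones(N) / N                      # D_1: uniform weights
    stumps, betas = [], []
    for _ in range(T):
        (feat, thresh, sign), eps = best_stump(X, y, D)
        if eps == 0 or eps >= 0.5:          # perfect, or no better than chance
            if eps == 0:
                stumps.append((feat, thresh, sign)); betas.append(1.0)
            break
        beta = 0.5 * np.log((1 - eps) / eps)
        h = stump_predict(X, feat, thresh, sign)
        D = D * np.exp(-beta * y * h)       # up-weight the mistakes
        D /= D.sum()                        # normalize
        stumps.append((feat, thresh, sign)); betas.append(beta)
    return stumps, betas

def predict(X, stumps, betas):
    f = sum(b * stump_predict(X, *s) for s, b in zip(stumps, betas))
    return np.sign(f)

# toy 1-D "interval" data: no single stump classifies it perfectly,
# but a few boosted stumps do
X = np.array([[0.0], [1.0], [2.0], [3.0], [4.0]])
y = np.array([-1, 1, 1, 1, -1])
stumps, betas = adaboost(X, y, T=3)
```

The `eps >= 0.5` guard reflects the condition β_t > 0 ⟺ ε_t < 0.5: a weak learner no better than chance contributes nothing.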
Example
10 data points in R². The size of + or − indicates the weight, which starts from uniform (D_1).
The base algorithm is a decision stump.
Observe that no single stump can predict very accurately for this dataset.
Round 1 (t = 1):
- 3 points misclassified (circled): ε_1 = 0.3
- β_1 = (1/2) ln((1 − ε_1)/ε_1) ≈ 0.42
- D_2 puts more weight on those examples
(figure: h_1 and D_2)
Round 2 (t = 2):
- 3 points misclassified (circled): ε_2 = 0.21
- β_2 ≈ 0.66
- D_3 puts more weight on those examples
(figure: h_2 and D_3)
Round 3 (t = 3):
- again 3 points misclassified (circled): ε_3 = 0.14
- β_3 ≈ 0.92
(figure: h_3)
Final classifier: combining the 3 classifiers
H(x) = sgn(β_1 h_1(x) + β_2 h_2(x) + β_3 h_3(x))
All data points are now classified correctly, even though each weak classifier makes 3 mistakes.
Overfitting
When T is large, the model is very complicated and overfitting can happen.
Resistance to overfitting
However, very often AdaBoost is resistant to overfitting.
This used to be a mystery, but by now rigorous theory has been developed to explain the phenomenon.
Why does AdaBoost work?
In fact, AdaBoost also follows the general framework of minimizing a surrogate loss.
Step 1: the model that AdaBoost considers is
    { sgn(f(·)) : f(·) = Σ_{t=1}^T β_t h_t(·) for some β_t ≥ 0 and h_t ∈ H }
where H is the set of models considered by the base algorithm.
Step 2: the loss that AdaBoost minimizes is the exponential loss
    Σ_{n=1}^N exp(−y_n f(x_n))
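A quick sketch (names and toy scores are mine, not from the slides) of why the exponential loss is a usable surrogate: it upper-bounds the 0-1 loss pointwise, since exp(−y f) ≥ 1 whenever sgn(f) ≠ y:

```python
import numpy as np

def exp_loss(f_vals, y):
    """Exponential loss: sum_n exp(-y_n * f(x_n))."""
    return np.exp(-y * f_vals).sum()

def zero_one_loss(f_vals, y):
    """Number of sign mistakes of the combined score f."""
    return int((np.sign(f_vals) != y).sum())

y = np.array([1, -1, 1, -1])
f = np.array([2.0, -0.5, -0.1, 1.5])   # hypothetical combined scores
# exp(-y*f) >= 1 on every mistake, so exp_loss >= zero_one_loss
surrogate, mistakes = exp_loss(f, y), zero_one_loss(f, y)
```

Driving the surrogate down therefore drives the training error down as well.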
Greedy minimization
Step 3: AdaBoost minimizes the exponential loss by a greedy approach, that is, it finds β_t, h_t one by one for t = 1, ..., T.
Specifically, let f_t = Σ_{τ=1}^t β_τ h_τ. Suppose we have found f_{t−1}; what should f_t be?
    f_t = Σ_{τ=1}^{t−1} β_τ h_τ + β_t h_t = f_{t−1} + β_t h_t.
Greedily, we want to find β_t, h_t to minimize
    Σ_{n=1}^N exp(−y_n f_t(x_n)) = Σ_{n=1}^N exp(−y_n f_{t−1}(x_n)) exp(−y_n β_t h_t(x_n))
Next, we use the definition of the weights from the algorithm above.
Greedy minimization
Claim: exp(−y_n f_{t−1}(x_n)) ∝ D_t(n).
Proof.
    D_t(n) ∝ D_{t−1}(n) exp(−y_n β_{t−1} h_{t−1}(x_n))
           ∝ D_{t−2}(n) exp(−y_n β_{t−2} h_{t−2}(x_n)) exp(−y_n β_{t−1} h_{t−1}(x_n))
           ∝ D_1(n) exp(−y_n β_1 h_1(x_n) − ... − y_n β_{t−1} h_{t−1}(x_n))
           ∝ exp(−y_n f_{t−1}(x_n))
Remark. All weights D_t(n) are normalized: Σ_n D_t(n) = 1.
Greedy minimization
So the goal becomes finding β_t ≥ 0 and h_t ∈ H that minimize
    argmin_{β_t, h_t} Σ_{n=1}^N exp(−y_n f_t(x_n)) = argmin_{β_t, h_t} Σ_{n=1}^N D_t(n) exp(−y_n β_t h_t(x_n))
We decompose the weighted loss function into two parts:
    Σ_{n=1}^N D_t(n) exp(−y_n β_t h_t(x_n))
    = Σ_{n: y_n ≠ h_t(x_n)} D_t(n) e^{β_t} + Σ_{n: y_n = h_t(x_n)} D_t(n) e^{−β_t}
    = ε_t e^{β_t} + (1 − ε_t) e^{−β_t}    (recall ε_t = Σ_{n: y_n ≠ h_t(x_n)} D_t(n))
    = ε_t (e^{β_t} − e^{−β_t}) + e^{−β_t}
We find h_t by minimizing the weighted classification error ε_t.
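The decomposition can be checked numerically; a small sketch (variable names and toy values are my own) confirms that the weighted exponential loss equals ε_t e^{β_t} + (1 − ε_t) e^{−β_t}:

```python
import numpy as np

# Toy check of the identity
#   sum_n D(n) exp(-y_n * beta * h(x_n)) = eps*e^beta + (1-eps)*e^{-beta}
# which holds because y_n * h(x_n) is +1 on correct points and -1 on mistakes.
rng = np.random.default_rng(0)
N = 20
D = rng.random(N); D /= D.sum()        # normalized weights
y = rng.choice([-1, 1], size=N)        # labels
h = rng.choice([-1, 1], size=N)        # weak-classifier predictions
beta = 0.7

lhs = (D * np.exp(-y * beta * h)).sum()
eps = D[y != h].sum()                  # weighted classification error
rhs = eps * np.exp(beta) + (1 - eps) * np.exp(-beta)
```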
Minimizing the weighted classification error
Thus, we want to choose h_t such that
    h_t = argmin_{h} ε_t = argmin_{h} Σ_{n: y_n ≠ h(x_n)} D_t(n)
This is exactly the first step of the AdaBoost algorithm: train a weak classifier based on the current weights D_t(n).
Greedy minimization
When h_t (and thus ε_t) is fixed, we then find β_t to minimize
    ε_t (e^{β_t} − e^{−β_t}) + e^{−β_t}
We take the derivative with respect to β_t, set it to zero, and derive the optimal
    β_t* = (1/2) ln((1 − ε_t)/ε_t)
which is precisely the corresponding step of the AdaBoost algorithm.
Exercise: verify the solution β_t*.
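One way to verify the solution is numerically. The sketch below (my own check, not from the slides) confirms that β_t* minimizes the objective, and that the minimum value works out to 2√(ε_t(1 − ε_t)):

```python
import numpy as np

# Objective from the derivation: g(beta) = eps*(e^beta - e^{-beta}) + e^{-beta},
# which simplifies to eps*e^beta + (1-eps)*e^{-beta}.
eps = 0.3
g = lambda b: eps * (np.exp(b) - np.exp(-b)) + np.exp(-b)

beta_star = 0.5 * np.log((1 - eps) / eps)   # claimed optimum

# brute-force the minimizer on a fine grid and compare
grid = np.linspace(-3, 3, 100001)
beta_num = grid[np.argmin(g(grid))]
```

Plugging β_t* back in gives g(β_t*) = 2√(ε_t(1 − ε_t)) < 1 whenever ε_t < 0.5, which is why each round shrinks the exponential loss.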
Updating the weights
Now we have improved our classifier to f_t(x_n) = f_{t−1}(x_n) + β_t h_t(x_n).
At the t-th iteration, we need to compute the weights for this classifier, which are
    D_{t+1}(n) ∝ exp(−y_n f_t(x_n))
             = exp(−y_n [f_{t−1}(x_n) + β_t h_t(x_n)])
             ∝ D_t(n) exp(−y_n β_t h_t(x_n))
             = D_t(n) e^{β_t} if y_n ≠ h_t(x_n), and D_t(n) e^{−β_t} if y_n = h_t(x_n)
which is precisely the last step of the AdaBoost algorithm.
Remarks
Note that the AdaBoost algorithm itself never specifies how we get h_t(x), as long as it minimizes the weighted classification error
    ε_t = Σ_n D_t(n) I[y_n ≠ h_t(x_n)]
In this respect, AdaBoost is a meta-algorithm and can be used with any classifier for which we can do the above.
Exercise. How do we choose the decision stump given the weights D_2 at the second round? (figure: h_1 and D_2)
We can simply enumerate all possible ways of putting vertical and horizontal lines to separate the data points into two classes, and pick the one with the smallest weighted classification error!
Summary for boosting
- The key idea of boosting is to combine weak predictors into a strong one.
- There are many boosting algorithms; AdaBoost is the most classic one.
- AdaBoost greedily minimizes the exponential loss.
- AdaBoost tends not to overfit.
Taxonomy of ML models
There are two kinds of classification models in machine learning: generative models and discriminative models.
Discriminative models:
- e.g. nearest neighbor, traditional neural networks, SVM.
- we learn f() on a data set (x_i, y_i) to output the most likely y for an unseen x.
- having f(), we know how to discriminate unseen x's from different classes.
- we learn the decision boundary between the classes.
- we have no idea how the data is generated.
Generative models:
- e.g. Naïve Bayes, Gaussian mixture models, hidden Markov models, generative adversarial networks (GANs).
- widely used in unsupervised machine learning.
- a probabilistic way to think about how the data might have been generated.
- learn the joint probability distribution P(x, y) and predict P(y | x) with the help of Bayes' theorem.
Outline
1 Boosting
2 Gaussian mixture models: motivation and model, EM algorithm
Gaussian mixture models
Gaussian mixture models (GMMs) are a probabilistic approach to clustering:
- more explanatory than minimizing the K-means objective
- can be seen as a soft version of K-means
To solve GMMs, we will introduce a powerful method for learning probabilistic models: the Expectation-Maximization (EM) algorithm.
A generative model
For classification, we discussed the sigmoid model to explain how the labels are generated.
Similarly, for clustering, we want to come up with a probabilistic model p to explain how the data is generated: each point is an independent sample x ~ p.
What probabilistic model generates data like this?
Gaussian mixture models: intuition
We model each region with a Gaussian distribution; this leads to the idea of Gaussian mixture models (GMMs). The problem we now face is that (i) we do not know which (color) region a data point comes from, and (ii) we do not know the parameters of the Gaussian distribution in each region. We need to find all of them from the unsupervised data D = {x_n}, n = 1, ..., N.
GMM: formal definition

A GMM has the following density function:

p(x) = Σ_{k=1}^K ω_k N(x | µ_k, Σ_k) = Σ_{k=1}^K ω_k · 1/√((2π)^D |Σ_k|) · exp(−(1/2)(x − µ_k)^T Σ_k^{−1} (x − µ_k))

where
K: the number of Gaussian components (the same as the number of clusters we want)
µ_k and Σ_k: the mean and covariance matrix of the k-th Gaussian
ω_1, ..., ω_K: the mixture weights; they represent how much each component contributes to the final distribution, and satisfy two properties: ω_k > 0 for all k, and Σ_k ω_k = 1
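As a sketch, the density above can be evaluated directly with NumPy (the mixture parameters below are illustrative, not from the lecture):

```python
import numpy as np

def gaussian_pdf(x, mu, Sigma):
    # N(x | mu, Sigma) = exp(-(1/2)(x-mu)^T Sigma^{-1} (x-mu)) / sqrt((2*pi)^D |Sigma|)
    D = len(mu)
    diff = x - mu
    norm = np.sqrt((2 * np.pi) ** D * np.linalg.det(Sigma))
    return np.exp(-0.5 * diff @ np.linalg.solve(Sigma, diff)) / norm

def gmm_density(x, weights, means, covs):
    # p(x) = sum_k omega_k N(x | mu_k, Sigma_k)
    return sum(w * gaussian_pdf(x, mu, S)
               for w, mu, S in zip(weights, means, covs))

# A toy 2-component mixture in 2D (illustrative parameters).
weights = [0.4, 0.6]
means = [np.zeros(2), np.array([3.0, 3.0])]
covs = [np.eye(2), 0.5 * np.eye(2)]
p0 = gmm_density(np.zeros(2), weights, means, covs)
```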
Another view

By introducing a latent variable z ∈ [K], which indicates cluster membership, we can see p as a marginal distribution:

p(x) = Σ_{k=1}^K p(x, z = k) = Σ_{k=1}^K p(z = k) p(x | z = k) = Σ_{k=1}^K ω_k N(x | µ_k, Σ_k)

x and z are both random variables drawn from the model:
x is observed
z is unobserved/latent
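This two-step view also gives the standard way to sample from a GMM: draw the latent z first, then draw x given z. A minimal sketch (the parameters are again illustrative):

```python
import numpy as np

def sample_gmm(n, weights, means, covs, seed=0):
    # Ancestral sampling: z ~ Categorical(omega), then x | z = k ~ N(mu_k, Sigma_k).
    rng = np.random.default_rng(seed)
    z = rng.choice(len(weights), size=n, p=weights)   # latent cluster memberships
    x = np.stack([rng.multivariate_normal(means[k], covs[k]) for k in z])
    return x, z

# Draw 1000 points from a toy 2-component mixture in 2D.
weights = [0.3, 0.7]
means = [np.zeros(2), np.array([4.0, 4.0])]
covs = [np.eye(2), np.eye(2)]
x, z = sample_gmm(1000, weights, means, covs)
```

Note that z is drawn but then discarded if we only care about x — exactly the marginalization in the formula above.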
An example

The conditional distributions are
p(x | z = red) = N(x | µ_1, Σ_1)
p(x | z = blue) = N(x | µ_2, Σ_2)
p(x | z = green) = N(x | µ_3, Σ_3)

The marginal distribution is
p(x) = p(red) N(x | µ_1, Σ_1) + p(blue) N(x | µ_2, Σ_2) + p(green) N(x | µ_3, Σ_3)
Learning GMMs

Learning a GMM means finding all the parameters θ = {ω_k, µ_k, Σ_k}_{k=1}^K. In the process, we will learn the latent variables z_n as well:

p(z_n = k | x_n) ≜ γ_nk ∈ [0, 1]

i.e., a soft assignment of each point to each cluster, as opposed to the hard assignment made by K-means.

GMM is more explanatory than K-means:
both learn the cluster centers µ_k
in addition, GMM learns the cluster weights ω_k and covariances Σ_k, so we can predict the probability of seeing a new point, and we can generate synthetic data
How to learn these parameters?

An obvious attempt is maximum-likelihood estimation (MLE): find

argmax_θ ln Π_{n=1}^N p(x_n; θ) = argmax_θ Σ_{n=1}^N ln p(x_n; θ) ≜ argmax_θ P(θ)

This is called the incomplete likelihood (since the z_n's are unobserved), and maximizing it is intractable in general (a non-concave problem). One solution is to still apply GD/SGD, but a much more effective approach is the Expectation-Maximization (EM) algorithm.
Preview of EM for learning GMMs

Step 0: Initialize ω_k, µ_k, Σ_k for each k ∈ [K]

Step 1 (E-step): update the soft assignments (fixing the parameters)
γ_nk = p(z_n = k | x_n) ∝ ω_k N(x_n | µ_k, Σ_k)

Step 2 (M-step): update the model parameters (fixing the assignments)
ω_k = (Σ_n γ_nk) / N
µ_k = (Σ_n γ_nk x_n) / (Σ_n γ_nk)
Σ_k = (1 / Σ_n γ_nk) Σ_n γ_nk (x_n − µ_k)(x_n − µ_k)^T

Step 3: return to Step 1 if not converged
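The steps above can be sketched in NumPy as follows. This is a bare-bones illustration, not a robust implementation: a fixed iteration count instead of a convergence check, a simple deterministic initialization (k-means initialization is common in practice), and a small ridge on the covariances for numerical stability.

```python
import numpy as np

def log_gauss(X, mu, Sigma):
    # log N(x_n | mu, Sigma) for every row x_n of X
    D = X.shape[1]
    diff = X - mu
    sol = np.linalg.solve(Sigma, diff.T).T
    quad = np.einsum('nd,nd->n', diff, sol)
    return -0.5 * (quad + D * np.log(2 * np.pi) + np.log(np.linalg.det(Sigma)))

def em_gmm(X, K, n_iter=50):
    N, D = X.shape
    # Step 0: uniform weights, means at K data points spread through the
    # dataset, covariances at the (regularized) data covariance.
    w = np.full(K, 1.0 / K)
    mu = X[np.linspace(0, N - 1, K).astype(int)].copy()
    Sigma = np.array([np.cov(X.T) + 1e-6 * np.eye(D) for _ in range(K)])
    for _ in range(n_iter):
        # Step 1 (E-step): gamma_nk ∝ w_k N(x_n | mu_k, Sigma_k), in log space.
        logp = np.stack([np.log(w[k]) + log_gauss(X, mu[k], Sigma[k])
                         for k in range(K)], axis=1)
        logp -= logp.max(axis=1, keepdims=True)   # stabilize before exponentiating
        gamma = np.exp(logp)
        gamma /= gamma.sum(axis=1, keepdims=True)
        # Step 2 (M-step): weighted MLE updates.
        Nk = gamma.sum(axis=0)
        w = Nk / N
        mu = (gamma.T @ X) / Nk[:, None]
        for k in range(K):
            diff = X - mu[k]
            Sigma[k] = (gamma[:, k, None] * diff).T @ diff / Nk[k] + 1e-6 * np.eye(D)
    return w, mu, Sigma
```

On two well-separated clusters, a handful of iterations already recovers the weights, means, and covariances.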
Demo

Generate 50 data points from a mixture of 2 Gaussians with
ω_1 = 0.3, µ_1 = 0.8, Σ_1 = 0.52
ω_2 = 0.7, µ_2 = 1.2, Σ_2 = 0.35

The histogram represents the data; the red curve represents the ground-truth density p(x) = Σ_{k=1}^K ω_k N(x | µ_k, Σ_k); the blue curve represents the learned density at a specific round. EM demo.pdf shows how the blue curve moves towards the red curve quickly via EM.
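The demo's setup can be reproduced with a short script — a sketch of the sampling step and the ground-truth density only (the lecture's plots themselves are not reproduced here):

```python
import numpy as np

rng = np.random.default_rng(0)
w = np.array([0.3, 0.7])      # omega_1, omega_2
mu = np.array([0.8, 1.2])     # mu_1, mu_2
var = np.array([0.52, 0.35])  # Sigma_1, Sigma_2 (scalar variances, since D = 1)

# Sample 50 points: z ~ Categorical(w), then x | z ~ N(mu_z, var_z).
z = rng.choice(2, size=50, p=w)
x = rng.normal(mu[z], np.sqrt(var[z]))

def true_density(t):
    # Ground-truth p(t) = sum_k w_k N(t | mu_k, var_k)  (the red curve)
    return sum(wk / np.sqrt(2 * np.pi * vk) * np.exp(-(t - mk) ** 2 / (2 * vk))
               for wk, mk, vk in zip(w, mu, var))
```

Plotting a histogram of x together with true_density over a grid reproduces the demo's starting picture.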
EM algorithm

In general, EM is a heuristic to solve MLE with latent variables (not just for GMMs), i.e., to find the maximizer of

P(θ) = Σ_{n=1}^N ln p(x_n; θ)

where
θ is the set of parameters of a general probabilistic model
the x_n's are observed random variables
the z_n's are latent variables

Again, directly maximizing this objective is intractable.
High level idea

Keep maximizing a lower bound of P that is more manageable.
More informationMachine Learning, Fall 2011: Homework 5
0-60 Machine Learning, Fall 0: Homework 5 Machine Learning Department Carnegie Mellon University Due:??? Instructions There are 3 questions on this assignment. Please submit your completed homework to
More informationEnsemble Methods. Charles Sutton Data Mining and Exploration Spring Friday, 27 January 12
Ensemble Methods Charles Sutton Data Mining and Exploration Spring 2012 Bias and Variance Consider a regression problem Y = f(x)+ N(0, 2 ) With an estimate regression function ˆf, e.g., ˆf(x) =w > x Suppose
More informationWe Live in Exciting Times. CSCI-567: Machine Learning (Spring 2019) Outline. Outline. ACM (an international computing research society) has named
We Live in Exciting Times ACM (an international computing research society) has named CSCI-567: Machine Learning (Spring 2019) Prof. Victor Adamchik U of Southern California Apr. 2, 2019 Yoshua Bengio,
More informationStatistical learning. Chapter 20, Sections 1 4 1
Statistical learning Chapter 20, Sections 1 4 Chapter 20, Sections 1 4 1 Outline Bayesian learning Maximum a posteriori and maximum likelihood learning Bayes net learning ML parameter learning with complete
More informationFinal Examination CS 540-2: Introduction to Artificial Intelligence
Final Examination CS 540-2: Introduction to Artificial Intelligence May 7, 2017 LAST NAME: SOLUTIONS FIRST NAME: Problem Score Max Score 1 14 2 10 3 6 4 10 5 11 6 9 7 8 9 10 8 12 12 8 Total 100 1 of 11
More informationClassification: The rest of the story
U NIVERSITY OF ILLINOIS AT URBANA-CHAMPAIGN CS598 Machine Learning for Signal Processing Classification: The rest of the story 3 October 2017 Today s lecture Important things we haven t covered yet Fisher
More informationClustering and Gaussian Mixtures
Clustering and Gaussian Mixtures Oliver Schulte - CMPT 883 2 4 6 8 1 12 14 16 18 2 4 6 8 1 12 14 16 18 5 1 15 2 25 5 1 15 2 25 2 4 6 8 1 12 14 2 4 6 8 1 12 14 5 1 15 2 25 5 1 15 2 25 detected tures detected
More information