CSCI-567: Machine Learning (Spring 2019)


CSCI-567: Machine Learning (Spring 2019). Prof. Victor Adamchik, University of Southern California. March 19, 2019.

Administration
- TA3 is due this week
- TA4 will be available next week
- PA4 (Clustering, Markov chains) will be available in two weeks

Outline
1. Boosting
2. Gaussian mixture models

Top 10 Algorithms in Machine Learning... You should know (in 2019):
- k-Nearest Neighbors
- Decision Trees and Random Forests
- Naive Bayes
- Linear and Logistic Regression
- Artificial Neural Networks
- SVM
- Clustering (k-means)
- Boosting
- Dimensionality Reduction Algorithms
- Markov Chains

Outline
1. Boosting: Examples, AdaBoost, Derivation of AdaBoost
2. Gaussian mixture models

Introduction
Boosting is a meta-algorithm: it takes a base algorithm (classification, regression, ranking, etc.) as input and boosts its accuracy.
- main idea: combine weak rules of thumb (e.g. 51% accuracy) to form a highly accurate predictor (e.g. 99% accuracy)
- works very well in practice (especially in combination with trees)
- is often resistant to overfitting
- has strong theoretical guarantees
We again focus on binary classification.

A simple example: spam detection
- given a training set like:
  ("Want to make money fast? ...", spam)
  ("Viterbi Research Gist ...", not spam)
- first obtain a classifier by applying a base algorithm, which can be a rather simple/weak one, like a decision stump: e.g. contains the word "money" ⇒ spam
- reweight the examples so that difficult ones get more attention, e.g. spam that doesn't contain the word "money"
- obtain another classifier by applying the same base algorithm: e.g. empty "To:" address ⇒ spam
- repeat ...
- the final classifier is the (weighted) majority vote of all the weak classifiers

The base algorithm
A base algorithm A (also called a weak learning algorithm or oracle) takes a training set S weighted by D as input, and outputs a classifier h ← A(S, D).
- this can be any off-the-shelf classification algorithm (e.g. decision trees, logistic regression, neural nets, etc.)
- many algorithms can deal with a weighted training set: for an algorithm that minimizes some loss, we can simply replace the total loss by the weighted total loss (see the sketch below)
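To make that last point concrete, here is a minimal sketch of a weighted total loss, assuming a logistic loss and labels in {−1, +1}; the function name and the toy data are illustrative, not from the lecture.

```python
import numpy as np

def weighted_logistic_loss(w, X, y, D):
    """Weighted total loss: each example's loss is scaled by its weight D(n).

    X: (N, d) features, y: (N,) labels in {-1, +1},
    D: (N,) nonnegative weights summing to 1, w: (d,) parameters.
    """
    margins = y * (X @ w)
    per_example = np.log1p(np.exp(-margins))   # logistic loss per example
    return np.sum(D * per_example)             # replaces the unweighted sum

# Uniform weights recover the usual (averaged) total loss:
rng = np.random.default_rng(0)
X = rng.normal(size=(8, 3))
y = rng.choice([-1, 1], size=8)
w = np.zeros(3)
D = np.full(8, 1 / 8)
print(weighted_logistic_loss(w, X, y, D))      # = log(2) at w = 0
```

With uniform weights D(n) = 1/N this reduces to the usual averaged loss; a base algorithm run on (S, D) simply minimizes this weighted objective instead.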

Boosting Algorithms
Given: a training set S and a base algorithm A.
Two things need to be specified for a boosting algorithm:
- how to reweight the examples?
- how to combine all the weak classifiers?
AdaBoost is one of the most successful boosting algorithms.

The AdaBoost Algorithm (Freund and Schapire, 1995)
Given N samples {(x_n, y_n)} with y_n ∈ {+1, −1}, and a base algorithm A.
Initialize D_1(n) = 1/N for all n (uniform weights).
For t = 1, ..., T:
- Train a weak classifier h_t ← A(S, D_t) based on the current weights D_t(n), by minimizing the weighted classification error
  ε_t = Σ_n D_t(n) I[y_n ≠ h_t(x_n)]
- Calculate the importance of h_t as
  β_t = (1/2) ln((1 − ε_t)/ε_t)    (note β_t > 0 ⇔ ε_t < 0.5)

The Betas
β_t = (1/2) ln((1 − ε_t)/ε_t); β_t > 0 exactly when ε_t < 0.5, and β_t grows as ε_t shrinks, so more accurate weak classifiers receive larger importance.

The AdaBoost Algorithm (continued)
For t = 1, ..., T:
- Train a weak classifier h_t ← A(S, D_t)
- Calculate β_t
- Update the weights
  D_{t+1}(n) = D_t(n) e^{−β_t y_n h_t(x_n)} = D_t(n) e^{−β_t} if h_t(x_n) = y_n, and D_t(n) e^{β_t} otherwise,
  and normalize them: D_{t+1}(n) ← D_{t+1}(n) / Σ_{n'} D_{t+1}(n').
Output the final classifier:
  H(x) = sign(Σ_{t=1}^T β_t h_t(x))
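To make these steps concrete, here is a minimal sketch of the loop in Python, assuming scikit-learn is available and using depth-1 decision trees (stumps) as the base algorithm A; function and variable names are my own, not from the lecture.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost(X, y, T=50):
    """AdaBoost with decision stumps. X: (N, d), y: (N,) in {-1, +1}."""
    N = X.shape[0]
    D = np.full(N, 1.0 / N)                    # D_1: uniform weights
    stumps, betas = [], []
    for t in range(T):
        # Train a weak classifier h_t on the weighted data (approximately
        # minimizes the weighted classification error eps_t).
        h = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=D)
        pred = h.predict(X)
        eps = np.sum(D * (pred != y))
        eps = np.clip(eps, 1e-10, 1 - 1e-10)   # avoid division by zero / log(0)
        beta = 0.5 * np.log((1 - eps) / eps)   # importance of h_t
        # Reweight: correct examples shrink by e^{-beta}, mistakes grow by e^{beta}.
        D = D * np.exp(-beta * y * pred)
        D = D / D.sum()                        # normalize
        stumps.append(h)
        betas.append(beta)

    def H(Xnew):
        # Final classifier: sign of the weighted vote of all weak classifiers.
        scores = sum(b * h.predict(Xnew) for b, h in zip(betas, stumps))
        return np.sign(scores)

    return H
```

The returned H implements the weighted vote sign(Σ_t β_t h_t(x)); the clipping of ε_t is only there to keep the sketch from dividing by zero on a perfectly separable weighted sample.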

Example
- 10 data points in R²; the size of + or − indicates the weight, which starts from the uniform D_1
- the base algorithm is a decision stump
- observe that no stump can predict very accurately for this dataset

Round 1: t = 1 (h_1, D_2)
- 3 points are misclassified (circled): ε_1 = 0.3
- β_1 = (1/2) ln((1 − ε_1)/ε_1) ≈ 0.42
- D_2 puts more weight on those misclassified examples

Round 2: t = 2 (h_2, D_3)
- 3 points are misclassified (circled): ε_2 = 0.21
- β_2 = (1/2) ln((1 − ε_2)/ε_2) ≈ 0.66
- D_3 puts more weight on those misclassified examples

Round 3: t = 3 (h_3)
- again 3 points are misclassified (circled): ε_3 = 0.14
- β_3 = (1/2) ln((1 − ε_3)/ε_3) ≈ 0.92

Final classifier: combining the 3 classifiers
H_final(x) = sign(β_1 h_1(x) + β_2 h_2(x) + β_3 h_3(x))
All data points are now classified correctly, even though each weak classifier makes 3 mistakes.
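As a quick numeric check of the β values quoted in this example (the rounding is mine; the slide's 0.92 for round 3 presumably comes from ε_3 itself being rounded):

```python
import numpy as np

# Weighted errors from the three rounds of the example above.
for t, eps in enumerate([0.3, 0.21, 0.14], start=1):
    beta = 0.5 * np.log((1 - eps) / eps)
    print(f"round {t}: eps = {eps:.2f}, beta = {beta:.2f}")
# round 1: eps = 0.30, beta = 0.42
# round 2: eps = 0.21, beta = 0.66
# round 3: eps = 0.14, beta = 0.91
```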

Overfitting
When T is large, the model is very complicated and overfitting can happen.

Resistance to overfitting
However, very often AdaBoost is resistant to overfitting.
This used to be a mystery, but by now a rigorous theory has been developed to explain the phenomenon.

Why AdaBoost works
In fact, AdaBoost also follows the general framework of minimizing some surrogate loss.
Step 1: the model that AdaBoost considers is
  { sgn(f(·)) : f(·) = Σ_{t=1}^T β_t h_t(·) for some β_t ≥ 0 and h_t ∈ H }
where H is the set of models considered by the base algorithm.
Step 2: the loss that AdaBoost minimizes is the exponential loss
  Σ_{n=1}^N exp(−y_n f(x_n))
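A one-line justification for this choice of surrogate, which is not spelled out on the slide but is a standard fact: for y_n ∈ {+1, −1} the exponential loss upper-bounds the 0-1 loss of sgn(f),

```latex
\mathbb{I}\big[\operatorname{sgn}(f(x_n)) \neq y_n\big] \;\le\; \exp\big(-y_n f(x_n)\big)
\quad\Longrightarrow\quad
\sum_{n=1}^{N}\mathbb{I}\big[\operatorname{sgn}(f(x_n)) \neq y_n\big]
\;\le\; \sum_{n=1}^{N}\exp\big(-y_n f(x_n)\big),
```

so driving the exponential loss down also drives the training error down.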

Greedy minimization
Step 3: AdaBoost minimizes the exponential loss by a greedy approach, that is, it finds β_t, h_t one by one for t = 1, ..., T.
Specifically, let f_t = Σ_{τ=1}^t β_τ h_τ. Suppose we have found f_{t−1}; what should f_t be?
  f_t = Σ_{τ=1}^{t−1} β_τ h_τ + β_t h_t = f_{t−1} + β_t h_t.
Greedily, we want to find β_t, h_t to minimize
  Σ_{n=1}^N exp(−y_n f_t(x_n)) = Σ_{n=1}^N exp(−y_n f_{t−1}(x_n)) exp(−y_n β_t h_t(x_n)).
Next, we use the definition of the weights from the AdaBoost algorithm above.

Greedy minimization
Claim: exp(−y_n f_{t−1}(x_n)) ∝ D_t(n).
Proof.
  D_t(n) ∝ D_{t−1}(n) exp(−y_n β_{t−1} h_{t−1}(x_n))
         ∝ D_{t−2}(n) exp(−y_n β_{t−2} h_{t−2}(x_n)) exp(−y_n β_{t−1} h_{t−1}(x_n))
         ∝ ...
         ∝ D_1(n) exp(−y_n β_1 h_1(x_n) − ... − y_n β_{t−1} h_{t−1}(x_n))
         ∝ exp(−y_n f_{t−1}(x_n)).
Remark. All weights D_t(n) are normalized: Σ_n D_t(n) = 1.

Greedy minimization
So the goal becomes finding β_t ≥ 0 and h_t ∈ H that minimize
  argmin_{β_t, h_t} Σ_{n=1}^N exp(−y_n f_t(x_n)) = argmin_{β_t, h_t} Σ_{n=1}^N D_t(n) exp(−y_n β_t h_t(x_n)).
We decompose the weighted loss function into two parts:
  Σ_{n=1}^N D_t(n) exp(−y_n β_t h_t(x_n))
    = Σ_{n: y_n ≠ h_t(x_n)} D_t(n) e^{β_t} + Σ_{n: y_n = h_t(x_n)} D_t(n) e^{−β_t}
    = ε_t e^{β_t} + (1 − ε_t) e^{−β_t}        (recall ε_t = Σ_{n: y_n ≠ h_t(x_n)} D_t(n))
    = ε_t (e^{β_t} − e^{−β_t}) + e^{−β_t}.
We find h_t by minimizing the weighted classification error ε_t.

Minimizing the weighted classification error
Thus, we want to choose h_t such that
  h_t = argmin_{h ∈ H} ε_t = argmin_{h ∈ H} Σ_{n: y_n ≠ h(x_n)} D_t(n).
This is exactly the first step of the AdaBoost algorithm above: train a weak classifier based on the current weights D_t(n).

Greedy minimization
When h_t (and thus ε_t) is fixed, we then find β_t to minimize
  ε_t (e^{β_t} − e^{−β_t}) + e^{−β_t}.
We take the derivative with respect to β_t, set it to zero, and derive the optimal
  β_t* = (1/2) ln((1 − ε_t)/ε_t),
which is precisely the β_t used in the AdaBoost algorithm above. (Exercise: verify the solution β_t*; a worked version follows below.)
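For completeness, here is the short calculation behind that exercise (my own working, stated in the notation of the slides):

```latex
\frac{d}{d\beta_t}\left[\epsilon_t\left(e^{\beta_t}-e^{-\beta_t}\right)+e^{-\beta_t}\right]
 = \epsilon_t\left(e^{\beta_t}+e^{-\beta_t}\right)-e^{-\beta_t} = 0
\;\Longrightarrow\;
\epsilon_t\, e^{\beta_t} = (1-\epsilon_t)\, e^{-\beta_t}
\;\Longrightarrow\;
e^{2\beta_t} = \frac{1-\epsilon_t}{\epsilon_t}
\;\Longrightarrow\;
\beta_t^{*} = \frac{1}{2}\ln\!\left(\frac{1-\epsilon_t}{\epsilon_t}\right).
```

Since ε_t e^{β_t} + (1 − ε_t) e^{−β_t} is convex in β_t, this stationary point is indeed the minimizer.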

Updating the weights
Now that we have improved our classifier to f_t(x_n) = f_{t−1}(x_n) + β_t h_t(x_n),
for the next iteration we will need the weights corresponding to this classifier, which are
  D_{t+1}(n) ∝ e^{−y_n f_t(x_n)} = e^{−y_n [f_{t−1}(x_n) + β_t h_t(x_n)]}
            ∝ D_t(n) e^{−y_n β_t h_t(x_n)}
            = D_t(n) e^{β_t} if y_n ≠ h_t(x_n), and D_t(n) e^{−β_t} if y_n = h_t(x_n),
which is precisely the weight-update step of the AdaBoost algorithm above.

Remarks
Note that the AdaBoost algorithm itself never specifies how we obtain h_t, as long as it minimizes the weighted classification error
  ε_t = Σ_n D_t(n) I[y_n ≠ h_t(x_n)].
In this respect, AdaBoost is a meta-algorithm and can be used with any classifier for which we can do the above.
Ex. How do we choose the decision stump given the weights D_2 at the second round of the example above? We can simply enumerate all possible ways of putting vertical and horizontal lines to separate the data points into two classes, and pick the one with the smallest weighted classification error (a sketch of this enumeration follows below).
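A minimal sketch of that brute-force stump search for 2-D data; the function name and the midpoint-threshold scheme are my own choices for illustration, not from the lecture.

```python
import numpy as np

def best_stump(X, y, D):
    """Exhaustively search axis-aligned stumps (vertical/horizontal splits)
    and return the one with the smallest weighted classification error.

    X: (N, 2) points, y: (N,) labels in {-1, +1}, D: (N,) weights summing to 1.
    """
    best = (np.inf, None)                       # (weighted error, stump)
    for dim in range(X.shape[1]):               # vertical, then horizontal lines
        values = np.unique(X[:, dim])
        # candidate thresholds: midpoints between consecutive distinct values
        thresholds = (values[:-1] + values[1:]) / 2
        for thr in thresholds:
            for sign in (+1, -1):               # which side is labeled +1
                pred = np.where(X[:, dim] > thr, sign, -sign)
                err = np.sum(D * (pred != y))
                if err < best[0]:
                    best = (err, (dim, thr, sign))
    return best   # (smallest weighted error, (feature index, threshold, orientation))
```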

Summary for boosting
- The key idea of boosting is to combine weak predictors into a strong one.
- There are many boosting algorithms; AdaBoost is the most classic one.
- AdaBoost greedily minimizes the exponential loss.
- AdaBoost tends not to overfit.

Taxonomy of ML Models
There are two kinds of classification models in machine learning: generative models and discriminative models.
Discriminative models (e.g. nearest neighbor, traditional neural networks, SVM):
- we learn f(·) on a dataset {(x_i, y_i)} to output the most likely y for an unseen x
- having f(·), we know how to discriminate between unseen x's from different classes
- we learn the decision boundary between the classes
- we have no idea how the data is generated

Taxonomy of ML Models
Generative models (e.g. Naive Bayes, Gaussian mixture models, Hidden Markov models, Generative Adversarial Networks (GANs)):
- they are widely used in unsupervised machine learning
- they give a probabilistic way to think about how the data might have been generated
- they learn the joint probability distribution P(x, y) and predict P(y | x) with the help of Bayes' theorem

Outline
1. Boosting
2. Gaussian mixture models: Motivation and Model, EM algorithm

Gaussian mixture models
The Gaussian mixture model (GMM) is a probabilistic approach to clustering:
- more explanatory than minimizing the K-means objective
- can be seen as a soft version of K-means
To learn GMMs, we will introduce a powerful method for learning probabilistic models: the Expectation Maximization (EM) algorithm.

A generative model
For classification, we discussed the sigmoid model to explain how the labels are generated.
Similarly, for clustering, we want to come up with a probabilistic model p to explain how the data is generated; that is, each point is an independent sample x ∼ p.
What probabilistic model generates data like this?

Gaussian mixture models: intuition
We will model each region with a Gaussian distribution. This leads to the idea of Gaussian mixture models (GMMs).
The problem we now face is that (i) we do not know which (color) region a data point comes from, and (ii) we do not know the parameters of the Gaussian distribution in each region. We need to find all of them from the unsupervised data D = {x_n}_{n=1}^N.

GMM: formal definition
A GMM has the following density function:
  p(x) = Σ_{k=1}^K ω_k N(x | µ_k, Σ_k) = Σ_{k=1}^K ω_k · (1 / √((2π)^D |Σ_k|)) · exp(−(1/2)(x − µ_k)^T Σ_k^{−1} (x − µ_k))
where
- K is the number of Gaussian components (the same as the number of clusters we want)
- µ_k and Σ_k are the mean and covariance matrix of the k-th Gaussian
- ω_1, ..., ω_K are the mixture weights; they represent how much each component contributes to the final distribution, and satisfy ω_k > 0 for all k and Σ_{k=1}^K ω_k = 1
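As a sanity check of this formula, a small sketch that evaluates a GMM density with SciPy; the two components and their parameters are made up for illustration, not from the lecture.

```python
import numpy as np
from scipy.stats import multivariate_normal

def gmm_density(x, weights, means, covs):
    """p(x) = sum_k w_k * N(x | mu_k, Sigma_k)."""
    return sum(w * multivariate_normal.pdf(x, mean=mu, cov=cov)
               for w, mu, cov in zip(weights, means, covs))

# Illustrative 2-component GMM in R^2 (parameters are not from the lecture).
weights = [0.4, 0.6]
means = [np.array([0.0, 0.0]), np.array([3.0, 3.0])]
covs = [np.eye(2), 0.5 * np.eye(2)]
print(gmm_density(np.array([1.0, 1.0]), weights, means, covs))
```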

Another view
By introducing a latent variable z ∈ [K] that indicates cluster membership, we can see p as a marginal distribution:
  p(x) = Σ_{k=1}^K p(x, z = k) = Σ_{k=1}^K p(z = k) p(x | z = k) = Σ_{k=1}^K ω_k N(x | µ_k, Σ_k).
x and z are both random variables drawn from the model:
- x is observed
- z is unobserved/latent
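This latent-variable view also says how to sample from a GMM: draw z from the mixture weights, then draw x from the chosen Gaussian. A minimal sketch (the function and variable names are mine):

```python
import numpy as np

def sample_gmm(n, weights, means, covs, seed=0):
    """Draw n samples: z_i ~ Categorical(weights), x_i ~ N(mu_{z_i}, Sigma_{z_i})."""
    rng = np.random.default_rng(seed)
    z = rng.choice(len(weights), size=n, p=weights)          # latent assignments
    x = np.stack([rng.multivariate_normal(means[k], covs[k]) for k in z])
    return x, z
```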

An example
The conditional distributions are
  p(x | z = red) = N(x | µ_1, Σ_1),  p(x | z = blue) = N(x | µ_2, Σ_2),  p(x | z = green) = N(x | µ_3, Σ_3).
The marginal distribution is
  p(x) = p(red) N(x | µ_1, Σ_1) + p(blue) N(x | µ_2, Σ_2) + p(green) N(x | µ_3, Σ_3).

Learning GMMs
Learning a GMM means finding all the parameters θ = {ω_k, µ_k, Σ_k}_{k=1}^K.
In the process, we will learn the latent variables z_n as well, via
  p(z_n = k | x_n) ≜ γ_nk ∈ [0, 1],
i.e. a soft assignment of each point to each cluster, as opposed to the hard assignment made by K-means.
GMM is more explanatory than K-means:
- both learn the cluster centers µ_k
- in addition, GMM learns the cluster weights ω_k and covariances Σ_k, so we can predict the probability of seeing a new point and we can generate synthetic data

How to learn these parameters?
An obvious attempt is maximum-likelihood estimation (MLE): find
  argmax_θ ln Π_{n=1}^N p(x_n; θ) = argmax_θ Σ_{n=1}^N ln p(x_n; θ) ≜ argmax_θ P(θ).
This is called the incomplete likelihood (since the z_n's are unobserved), and maximizing it is intractable in general (a non-concave problem).
One solution is to still apply GD/SGD, but a much more effective approach is the Expectation Maximization (EM) algorithm.
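To see where the difficulty comes from (my own expansion, just substituting the GMM density from the earlier slide into the objective): the logarithm sits outside a sum over components,

```latex
P(\theta) \;=\; \sum_{n=1}^{N} \ln p(x_n;\theta)
\;=\; \sum_{n=1}^{N} \ln\!\left( \sum_{k=1}^{K} \omega_k\, N(x_n \mid \mu_k, \Sigma_k) \right),
```

and this log-of-a-sum does not decouple across components, which is what makes the problem non-concave in θ.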

Preview of EM for learning GMMs
Step 0: initialize ω_k, µ_k, Σ_k for each k ∈ [K].
Step 1 (E-step): update the soft assignments (fixing the parameters):
  γ_nk = p(z_n = k | x_n) ∝ ω_k N(x_n | µ_k, Σ_k).
Step 2 (M-step): update the model parameters (fixing the assignments):
  ω_k = (Σ_n γ_nk) / N
  µ_k = (Σ_n γ_nk x_n) / (Σ_n γ_nk)
  Σ_k = (1 / Σ_n γ_nk) Σ_n γ_nk (x_n − µ_k)(x_n − µ_k)^T
Step 3: return to Step 1 if not converged.
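A bare-bones rendering of these steps in Python, assuming SciPy is available for the Gaussian density; the initialization scheme and all names are my own choices, and no convergence check or numerical safeguards are included.

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(X, K, n_iters=100, seed=0):
    """EM for a GMM, following the E/M steps above.

    X: (N, D) data. Returns mixture weights, means, covariances, and the
    soft assignments gamma of shape (N, K).
    """
    N, D = X.shape
    rng = np.random.default_rng(seed)
    # Step 0: crude initialization (random data points as means, identity covs).
    w = np.full(K, 1.0 / K)
    mu = X[rng.choice(N, size=K, replace=False)]
    cov = np.stack([np.eye(D) for _ in range(K)])
    for _ in range(n_iters):
        # E-step: gamma_nk proportional to w_k * N(x_n | mu_k, Sigma_k).
        gamma = np.stack([w[k] * multivariate_normal.pdf(X, mean=mu[k], cov=cov[k])
                          for k in range(K)], axis=1)
        gamma /= gamma.sum(axis=1, keepdims=True)
        # M-step: re-estimate the parameters from the soft assignments.
        Nk = gamma.sum(axis=0)                   # effective cluster sizes
        w = Nk / N
        mu = (gamma.T @ X) / Nk[:, None]
        for k in range(K):
            diff = X - mu[k]
            cov[k] = (gamma[:, k, None] * diff).T @ diff / Nk[k]
    return w, mu, cov, gamma
```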

Demo
Generate 50 data points from a mixture of 2 Gaussians with
  ω_1 = 0.3, µ_1 = 0.8, Σ_1 = 0.52  and  ω_2 = 0.7, µ_2 = 1.2, Σ_2 = 0.35.
- the histogram represents the data
- the red curve represents the ground-truth density p(x) = Σ_{k=1}^K ω_k N(x | µ_k, Σ_k)
- the blue curve represents the learned density at a specific round
- EM demo.pdf shows how the blue curve moves towards the red curve quickly via EM
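For reference, a sketch of how such demo data can be generated, interpreting the Σ_k above as the variances of one-dimensional Gaussians (that interpretation, the seed, and the variable names are my assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
weights = np.array([0.3, 0.7])
means = np.array([0.8, 1.2])
variances = np.array([0.52, 0.35])   # interpreting Sigma_k as 1-D variances

# z_n ~ Categorical(weights), x_n ~ N(mu_{z_n}, Sigma_{z_n})
z = rng.choice(2, size=50, p=weights)
x = rng.normal(loc=means[z], scale=np.sqrt(variances[z]))
# `x` is the 50-point dataset that the histogram in the demo is built from.
```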

EM algorithm
In general, EM is a heuristic for solving MLE with latent variables (not just for GMMs), i.e. for finding the maximizer of
  P(θ) = Σ_{n=1}^N ln p(x_n; θ)
where
- θ is the parameter of a general probabilistic model
- the x_n's are observed random variables
- the z_n's are latent variables
Again, directly solving this objective is intractable.

High-level idea
Keep maximizing a lower bound of P that is more manageable (one standard such bound is shown below).
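The slide stops here; for orientation, the standard lower bound being referred to (obtained from Jensen's inequality, for any distributions q_n over the latent variables, written here for discrete z_n ∈ [K]) is

```latex
P(\theta) \;=\; \sum_{n=1}^{N} \ln \sum_{k=1}^{K} p(x_n, z_n = k;\theta)
\;\ge\; \sum_{n=1}^{N} \sum_{k=1}^{K} q_n(k)\,\ln \frac{p(x_n, z_n = k;\theta)}{q_n(k)},
```

and EM alternates between tightening this bound over q_n (the E-step) and maximizing it over θ (the M-step).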


More information

Statistical Data Mining and Machine Learning Hilary Term 2016

Statistical Data Mining and Machine Learning Hilary Term 2016 Statistical Data Mining and Machine Learning Hilary Term 2016 Dino Sejdinovic Department of Statistics Oxford Slides and other materials available at: http://www.stats.ox.ac.uk/~sejdinov/sdmml Naïve Bayes

More information

AdaBoost. Lecturer: Authors: Center for Machine Perception Czech Technical University, Prague

AdaBoost. Lecturer: Authors: Center for Machine Perception Czech Technical University, Prague AdaBoost Lecturer: Jan Šochman Authors: Jan Šochman, Jiří Matas Center for Machine Perception Czech Technical University, Prague http://cmp.felk.cvut.cz Motivation Presentation 2/17 AdaBoost with trees

More information

The AdaBoost algorithm =1/n for i =1,...,n 1) At the m th iteration we find (any) classifier h(x; ˆθ m ) for which the weighted classification error m

The AdaBoost algorithm =1/n for i =1,...,n 1) At the m th iteration we find (any) classifier h(x; ˆθ m ) for which the weighted classification error m ) Set W () i The AdaBoost algorithm =1/n for i =1,...,n 1) At the m th iteration we find (any) classifier h(x; ˆθ m ) for which the weighted classification error m m =.5 1 n W (m 1) i y i h(x i ; 2 ˆθ

More information

Final Overview. Introduction to ML. Marek Petrik 4/25/2017

Final Overview. Introduction to ML. Marek Petrik 4/25/2017 Final Overview Introduction to ML Marek Petrik 4/25/2017 This Course: Introduction to Machine Learning Build a foundation for practice and research in ML Basic machine learning concepts: max likelihood,

More information

Boosting. CAP5610: Machine Learning Instructor: Guo-Jun Qi

Boosting. CAP5610: Machine Learning Instructor: Guo-Jun Qi Boosting CAP5610: Machine Learning Instructor: Guo-Jun Qi Weak classifiers Weak classifiers Decision stump one layer decision tree Naive Bayes A classifier without feature correlations Linear classifier

More information

Machine Learning. Gaussian Mixture Models. Zhiyao Duan & Bryan Pardo, Machine Learning: EECS 349 Fall

Machine Learning. Gaussian Mixture Models. Zhiyao Duan & Bryan Pardo, Machine Learning: EECS 349 Fall Machine Learning Gaussian Mixture Models Zhiyao Duan & Bryan Pardo, Machine Learning: EECS 349 Fall 2012 1 Discriminative vs Generative Models Discriminative: Just learn a decision boundary between your

More information

Machine Learning, Midterm Exam

Machine Learning, Midterm Exam 10-601 Machine Learning, Midterm Exam Instructors: Tom Mitchell, Ziv Bar-Joseph Wednesday 12 th December, 2012 There are 9 questions, for a total of 100 points. This exam has 20 pages, make sure you have

More information

Bayesian Learning (II)

Bayesian Learning (II) Universität Potsdam Institut für Informatik Lehrstuhl Maschinelles Lernen Bayesian Learning (II) Niels Landwehr Overview Probabilities, expected values, variance Basic concepts of Bayesian learning MAP

More information

Logistic Regression. Jia-Bin Huang. Virginia Tech Spring 2019 ECE-5424G / CS-5824

Logistic Regression. Jia-Bin Huang. Virginia Tech Spring 2019 ECE-5424G / CS-5824 Logistic Regression Jia-Bin Huang ECE-5424G / CS-5824 Virginia Tech Spring 2019 Administrative Please start HW 1 early! Questions are welcome! Two principles for estimating parameters Maximum Likelihood

More information

Probabilistic classification CE-717: Machine Learning Sharif University of Technology. M. Soleymani Fall 2016

Probabilistic classification CE-717: Machine Learning Sharif University of Technology. M. Soleymani Fall 2016 Probabilistic classification CE-717: Machine Learning Sharif University of Technology M. Soleymani Fall 2016 Topics Probabilistic approach Bayes decision theory Generative models Gaussian Bayes classifier

More information

CS6220: DATA MINING TECHNIQUES

CS6220: DATA MINING TECHNIQUES CS6220: DATA MINING TECHNIQUES Matrix Data: Clustering: Part 2 Instructor: Yizhou Sun yzsun@ccs.neu.edu November 3, 2015 Methods to Learn Matrix Data Text Data Set Data Sequence Data Time Series Graph

More information

10701/15781 Machine Learning, Spring 2007: Homework 2

10701/15781 Machine Learning, Spring 2007: Homework 2 070/578 Machine Learning, Spring 2007: Homework 2 Due: Wednesday, February 2, beginning of the class Instructions There are 4 questions on this assignment The second question involves coding Do not attach

More information

COMS 4721: Machine Learning for Data Science Lecture 16, 3/28/2017

COMS 4721: Machine Learning for Data Science Lecture 16, 3/28/2017 COMS 4721: Machine Learning for Data Science Lecture 16, 3/28/2017 Prof. John Paisley Department of Electrical Engineering & Data Science Institute Columbia University SOFT CLUSTERING VS HARD CLUSTERING

More information

Final Exam, Fall 2002

Final Exam, Fall 2002 15-781 Final Exam, Fall 22 1. Write your name and your andrew email address below. Name: Andrew ID: 2. There should be 17 pages in this exam (excluding this cover sheet). 3. If you need more room to work

More information

Lecture 4 Discriminant Analysis, k-nearest Neighbors

Lecture 4 Discriminant Analysis, k-nearest Neighbors Lecture 4 Discriminant Analysis, k-nearest Neighbors Fredrik Lindsten Division of Systems and Control Department of Information Technology Uppsala University. Email: fredrik.lindsten@it.uu.se fredrik.lindsten@it.uu.se

More information

Final Exam, Spring 2006

Final Exam, Spring 2006 070 Final Exam, Spring 2006. Write your name and your email address below. Name: Andrew account: 2. There should be 22 numbered pages in this exam (including this cover sheet). 3. You may use any and all

More information

Machine Learning, Midterm Exam: Spring 2008 SOLUTIONS. Q Topic Max. Score Score. 1 Short answer questions 20.

Machine Learning, Midterm Exam: Spring 2008 SOLUTIONS. Q Topic Max. Score Score. 1 Short answer questions 20. 10-601 Machine Learning, Midterm Exam: Spring 2008 Please put your name on this cover sheet If you need more room to work out your answer to a question, use the back of the page and clearly mark on the

More information

Boosting: Foundations and Algorithms. Rob Schapire

Boosting: Foundations and Algorithms. Rob Schapire Boosting: Foundations and Algorithms Rob Schapire Example: Spam Filtering problem: filter out spam (junk email) gather large collection of examples of spam and non-spam: From: yoav@ucsd.edu Rob, can you

More information

STA414/2104. Lecture 11: Gaussian Processes. Department of Statistics

STA414/2104. Lecture 11: Gaussian Processes. Department of Statistics STA414/2104 Lecture 11: Gaussian Processes Department of Statistics www.utstat.utoronto.ca Delivered by Mark Ebden with thanks to Russ Salakhutdinov Outline Gaussian Processes Exam review Course evaluations

More information

CS7267 MACHINE LEARNING

CS7267 MACHINE LEARNING CS7267 MACHINE LEARNING ENSEMBLE LEARNING Ref: Dr. Ricardo Gutierrez-Osuna at TAMU, and Aarti Singh at CMU Mingon Kang, Ph.D. Computer Science, Kennesaw State University Definition of Ensemble Learning

More information

Machine Learning: A Statistics and Optimization Perspective

Machine Learning: A Statistics and Optimization Perspective Machine Learning: A Statistics and Optimization Perspective Nan Ye Mathematical Sciences School Queensland University of Technology 1 / 109 What is Machine Learning? 2 / 109 Machine Learning Machine learning

More information

Data Mining Techniques

Data Mining Techniques Data Mining Techniques CS 6220 - Section 2 - Spring 2017 Lecture 6 Jan-Willem van de Meent (credit: Yijun Zhao, Chris Bishop, Andrew Moore, Hastie et al.) Project Project Deadlines 3 Feb: Form teams of

More information

STATS 306B: Unsupervised Learning Spring Lecture 2 April 2

STATS 306B: Unsupervised Learning Spring Lecture 2 April 2 STATS 306B: Unsupervised Learning Spring 2014 Lecture 2 April 2 Lecturer: Lester Mackey Scribe: Junyang Qian, Minzhe Wang 2.1 Recap In the last lecture, we formulated our working definition of unsupervised

More information

Bias-Variance Tradeoff

Bias-Variance Tradeoff What s learning, revisited Overfitting Generative versus Discriminative Logistic Regression Machine Learning 10701/15781 Carlos Guestrin Carnegie Mellon University September 19 th, 2007 Bias-Variance Tradeoff

More information

Brief Introduction of Machine Learning Techniques for Content Analysis

Brief Introduction of Machine Learning Techniques for Content Analysis 1 Brief Introduction of Machine Learning Techniques for Content Analysis Wei-Ta Chu 2008/11/20 Outline 2 Overview Gaussian Mixture Model (GMM) Hidden Markov Model (HMM) Support Vector Machine (SVM) Overview

More information

Machine Learning and Bayesian Inference. Unsupervised learning. Can we find regularity in data without the aid of labels?

Machine Learning and Bayesian Inference. Unsupervised learning. Can we find regularity in data without the aid of labels? Machine Learning and Bayesian Inference Dr Sean Holden Computer Laboratory, Room FC6 Telephone extension 6372 Email: sbh11@cl.cam.ac.uk www.cl.cam.ac.uk/ sbh11/ Unsupervised learning Can we find regularity

More information

Lecture 3: Pattern Classification

Lecture 3: Pattern Classification EE E6820: Speech & Audio Processing & Recognition Lecture 3: Pattern Classification 1 2 3 4 5 The problem of classification Linear and nonlinear classifiers Probabilistic classification Gaussians, mixtures

More information

Clustering K-means. Clustering images. Machine Learning CSE546 Carlos Guestrin University of Washington. November 4, 2014.

Clustering K-means. Clustering images. Machine Learning CSE546 Carlos Guestrin University of Washington. November 4, 2014. Clustering K-means Machine Learning CSE546 Carlos Guestrin University of Washington November 4, 2014 1 Clustering images Set of Images [Goldberger et al.] 2 1 K-means Randomly initialize k centers µ (0)

More information

Generative v. Discriminative classifiers Intuition

Generative v. Discriminative classifiers Intuition Logistic Regression (Continued) Generative v. Discriminative Decision rees Machine Learning 10701/15781 Carlos Guestrin Carnegie Mellon University January 31 st, 2007 2005-2007 Carlos Guestrin 1 Generative

More information

Machine learning comes from Bayesian decision theory in statistics. There we want to minimize the expected value of the loss function.

Machine learning comes from Bayesian decision theory in statistics. There we want to minimize the expected value of the loss function. Bayesian learning: Machine learning comes from Bayesian decision theory in statistics. There we want to minimize the expected value of the loss function. Let y be the true label and y be the predicted

More information

Machine Learning Lecture 7

Machine Learning Lecture 7 Course Outline Machine Learning Lecture 7 Fundamentals (2 weeks) Bayes Decision Theory Probability Density Estimation Statistical Learning Theory 23.05.2016 Discriminative Approaches (5 weeks) Linear Discriminant

More information

Pattern Recognition and Machine Learning

Pattern Recognition and Machine Learning Christopher M. Bishop Pattern Recognition and Machine Learning ÖSpri inger Contents Preface Mathematical notation Contents vii xi xiii 1 Introduction 1 1.1 Example: Polynomial Curve Fitting 4 1.2 Probability

More information

The Boosting Approach to. Machine Learning. Maria-Florina Balcan 10/31/2016

The Boosting Approach to. Machine Learning. Maria-Florina Balcan 10/31/2016 The Boosting Approach to Machine Learning Maria-Florina Balcan 10/31/2016 Boosting General method for improving the accuracy of any given learning algorithm. Works by creating a series of challenge datasets

More information

Machine Learning, Fall 2011: Homework 5

Machine Learning, Fall 2011: Homework 5 0-60 Machine Learning, Fall 0: Homework 5 Machine Learning Department Carnegie Mellon University Due:??? Instructions There are 3 questions on this assignment. Please submit your completed homework to

More information

Ensemble Methods. Charles Sutton Data Mining and Exploration Spring Friday, 27 January 12

Ensemble Methods. Charles Sutton Data Mining and Exploration Spring Friday, 27 January 12 Ensemble Methods Charles Sutton Data Mining and Exploration Spring 2012 Bias and Variance Consider a regression problem Y = f(x)+ N(0, 2 ) With an estimate regression function ˆf, e.g., ˆf(x) =w > x Suppose

More information

We Live in Exciting Times. CSCI-567: Machine Learning (Spring 2019) Outline. Outline. ACM (an international computing research society) has named

We Live in Exciting Times. CSCI-567: Machine Learning (Spring 2019) Outline. Outline. ACM (an international computing research society) has named We Live in Exciting Times ACM (an international computing research society) has named CSCI-567: Machine Learning (Spring 2019) Prof. Victor Adamchik U of Southern California Apr. 2, 2019 Yoshua Bengio,

More information

Statistical learning. Chapter 20, Sections 1 4 1

Statistical learning. Chapter 20, Sections 1 4 1 Statistical learning Chapter 20, Sections 1 4 Chapter 20, Sections 1 4 1 Outline Bayesian learning Maximum a posteriori and maximum likelihood learning Bayes net learning ML parameter learning with complete

More information

Final Examination CS 540-2: Introduction to Artificial Intelligence

Final Examination CS 540-2: Introduction to Artificial Intelligence Final Examination CS 540-2: Introduction to Artificial Intelligence May 7, 2017 LAST NAME: SOLUTIONS FIRST NAME: Problem Score Max Score 1 14 2 10 3 6 4 10 5 11 6 9 7 8 9 10 8 12 12 8 Total 100 1 of 11

More information

Classification: The rest of the story

Classification: The rest of the story U NIVERSITY OF ILLINOIS AT URBANA-CHAMPAIGN CS598 Machine Learning for Signal Processing Classification: The rest of the story 3 October 2017 Today s lecture Important things we haven t covered yet Fisher

More information

Clustering and Gaussian Mixtures

Clustering and Gaussian Mixtures Clustering and Gaussian Mixtures Oliver Schulte - CMPT 883 2 4 6 8 1 12 14 16 18 2 4 6 8 1 12 14 16 18 5 1 15 2 25 5 1 15 2 25 2 4 6 8 1 12 14 2 4 6 8 1 12 14 5 1 15 2 25 5 1 15 2 25 detected tures detected

More information