Machine Learning: A Statistics and Optimization Perspective
1 Machine Learning: A Statistics and Optimization Perspective Nan Ye Mathematical Sciences School Queensland University of Technology 1 / 109
2 What is Machine Learning? 2 / 109
3 Machine Learning Machine learning turns data into insight, predictions and/or decisions. Numerous applications in diverse areas, including natural language processing, computer vision, recommender systems, medical diagnosis. 3 / 109
4 A Much Sought-after Technology 4 / 109
5 Enabled Applications Make reminders by talking to your phone. 5 / 109
6 Tell the car where you want to go and the car takes you there. 6 / 109
7 Check your emails, see some spam in your inbox, mark it as spam, and similar spam will not show up. 7 / 109
8 Video recommendations. 8 / 109
9 Play Go against the computer. 9 / 109
10 Tutorial Objective Essentials for crafting basic machine learning systems. Formulate applications as machine learning problems Classification, regression, density estimation, clustering... Understand and apply basic learning algorithms Least squares regression, logistic regression, support vector machines, K-means,... Theoretical understanding Position and compare the problems and algorithms in a unifying statistical framework Have fun... 10 / 109
11 Outline A statistics and optimization perspective Statistical learning theory Regression Model selection Classification Clustering 11 / 109
12 Hands-on An exercise on using WEKA. Toolkits: WEKA (Java), H2O (Java), scikit-learn (Python), CRAN (R). Some technical details are left as exercises. These are tagged with (verify). 12 / 109
13 A Statistics and Optimization Perspective Illustrations Learning a binomial distribution Learning a Gaussian distribution 13 / 109
14 Learning a Binomial Distribution I pick a coin with the probability of heads being θ. I flip it 100 times for you and you see a dataset D of 70 heads and 30 tails, can you learn θ? 14 / 109
15 Learning a Binomial Distribution I pick a coin with the probability of heads being θ. I flip it 100 times for you and you see a dataset D of 70 heads and 30 tails, can you learn θ? Maximum likelihood estimation The likelihood of θ is P(D|θ) = θ^70 (1−θ)^30. Learning θ is an optimization problem: θ_ml = arg max_θ P(D|θ) = arg max_θ ln P(D|θ) = arg max_θ (70 ln θ + 30 ln(1−θ)). 14 / 109
16 θ_ml = arg max_θ (70 ln θ + 30 ln(1−θ)). Setting the derivative of the log-likelihood to 0, we have 70/θ − 30/(1−θ) = 0, so θ_ml = 70/(70+30) = 0.7. 15 / 109
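A quick numerical sanity check of this closed form, written as a minimal Python/NumPy sketch (the 70 heads / 30 tails dataset is the one from the slide; the grid search is only there to confirm the analytic answer):

```python
import numpy as np

# Dataset from the slide: 70 heads, 30 tails.
n_heads, n_tails = 70, 30

# Log-likelihood of theta: 70 ln(theta) + 30 ln(1 - theta).
def log_likelihood(theta):
    return n_heads * np.log(theta) + n_tails * np.log(1 - theta)

# Numerical maximization over a fine grid of candidate values.
grid = np.linspace(1e-6, 1 - 1e-6, 100001)
theta_grid = grid[np.argmax(log_likelihood(grid))]

# Analytic maximum likelihood estimate: 70 / (70 + 30).
theta_ml = n_heads / (n_heads + n_tails)

print(theta_grid, theta_ml)  # both close to 0.7
```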
17 Learning a Gaussian distribution I pick a Gaussian N(µ, σ^2) and give you a bunch of data D = {x_1, ..., x_n} independently drawn from it. Can you learn µ and σ? [Figure: the density f(X) plotted against X.] 16 / 109
18 Learning a Gaussian distribution I pick a Gaussian N(µ, σ^2) and give you a bunch of data D = {x_1, ..., x_n} independently drawn from it. Can you learn µ and σ? [Figure: the density f(X) plotted against X.] P(x|µ, σ) = (1/(σ√(2π))) e^{−(x−µ)^2/(2σ^2)}. 16 / 109
19 Maximum likelihood estimation ln P(D|µ, σ) = Σ_i ln( (1/(σ√(2π))) exp(−(x_i−µ)^2/(2σ^2)) ) = −n ln(σ√(2π)) − Σ_i (x_i−µ)^2/(2σ^2). Set derivative w.r.t. µ to 0: Σ_i (x_i−µ)/σ^2 = 0, so µ_ml = (1/n) Σ_{i=1}^n x_i. Set derivative w.r.t. σ to 0: −n/σ + Σ_i (x_i−µ)^2/σ^3 = 0, so σ^2_ml = (1/n) Σ_{i=1}^n (x_i−µ_ml)^2. 17 / 109
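The Gaussian case admits the same kind of check. A minimal Python/NumPy sketch (the data here are made up, drawn from N(2, 1.5^2)) that compares the closed-form estimates with the true parameters:

```python
import numpy as np

rng = np.random.default_rng(0)
mu_true, sigma_true = 2.0, 1.5
x = rng.normal(mu_true, sigma_true, size=10000)   # D = {x_1, ..., x_n}

# Maximum likelihood estimates derived above.
n = len(x)
mu_ml = x.sum() / n                                # (1/n) sum_i x_i
sigma2_ml = ((x - mu_ml) ** 2).sum() / n           # (1/n) sum_i (x_i - mu_ml)^2

print(mu_ml, np.sqrt(sigma2_ml))  # close to 2.0 and 1.5
```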
20 What You Need to Know... Learning is... Collect some data, e.g. coin flips. Choose a hypothesis class, e.g. binomial distribution. Choose a loss function, e.g. negative log-likelihood. Choose an optimization procedure, e.g. set derivative to 0. Have fun... Statistics and optimization provide powerful tools for formulating and solving machine learning problems. 18 / 109
21 Statistical Learning Theory The framework Applications in classification, regression, and density estimation Does empirical risk minimization work? 19 / 109
22 There is nothing more practical than a good theory. Kurt Lewin...at least in the problems of statistical inference. Vladimir Vapnik 20 / 109
23 Learning... H. Simon: Any process by which a system improves its performance. M. Minsky: Learning is making useful changes in our minds. R. Michalski: Learning is constructing or modifying representations of what is being experienced. L. Valiant: Learning is the process of knowledge acquisition in the absence of explicit programming. 21 / 109
24 A Probabilistic Framework Data Training examples z_1, ..., z_n are i.i.d. drawn from a fixed but unknown distribution P(Z) on Z. e.g. outcomes of coin flips. Hypothesis space H e.g. head probability θ ∈ [0, 1]. Loss function L(z, h) measures the penalty for hypothesis h on example z. e.g. log-loss L(z, θ) = −ln P(z|θ) = { −ln(θ), z = H; −ln(1−θ), z = T }. 22 / 109
25 Expected risk The expected risk of h is E(L(Z, h)). We want to find the hypothesis with minimum expected risk, arg min_{h∈H} E(L(Z, h)). Empirical risk minimization (ERM) Minimize the empirical risk R_n(h) := (1/n) Σ_i L(z_i, h) over h ∈ H. e.g. choose θ to minimize −70 ln θ − 30 ln(1−θ). 23 / 109
26 This provides a unified formulation for many machine learning problems, which differ in the data domain Z, the choice of the hypothesis space H, and the choice of loss function L. Most algorithms that we see later can be seen as special cases of ERM. 24 / 109
27 Classification predict a discrete class Digit recognition: image to {0, 1,..., 9}. Spam filter: to {spam, not spam}. 25 / 109
28 Given D = {(x_1, y_1), ..., (x_n, y_n)} ⊆ X × Y, find a classifier f that maps an input x ∈ X to a class y ∈ Y. We usually use the 0/1 loss L((x, y), h) = I(h(x) ≠ y) = { 1, h(x) ≠ y; 0, h(x) = y }. ERM chooses the classifier with minimum classification error min_{h∈H} (1/n) Σ_i I(h(x_i) ≠ y_i). 26 / 109
29 Regression predict a numerical value Stock market prediction: predict stock price using recent trading data 27 / 109
30 Given D = {(x_1, y_1), ..., (x_n, y_n)} ⊆ X × R, find a function f that maps an input x ∈ X to a value y ∈ R. We usually use the quadratic loss L((x, y), h) = (y − h(x))^2. ERM is often called the method of least squares in this case: min_{h∈H} (1/n) Σ_i (y_i − h(x_i))^2. 28 / 109
31 Density Estimation E.g. learning a binomial distribution, or a Gaussian distribution. [Figure: the density f(X) plotted against X.] We often use the log-loss L(x, h) = −ln p(x|h). ERM is MLE in this case. 29 / 109
32 Does ERM Work? Estimation error How does the empirically best hypothesis h_n = arg min_{h∈H} R_n(h) compare with the best in the hypothesis space? Specifically, how large is the estimation error R(h_n) − inf_{h∈H} R(h)? Consistency: Does R(h_n) converge to inf_{h∈H} R(h) as n → ∞? 30 / 109
33 Does ERM Work? Estimation error How does the empirically best hypothesis h_n = arg min_{h∈H} R_n(h) compare with the best in the hypothesis space? Specifically, how large is the estimation error R(h_n) − inf_{h∈H} R(h)? Consistency: Does R(h_n) converge to inf_{h∈H} R(h) as n → ∞? If H is finite, ERM is likely to pick the function with minimal expected risk when n is large, because then R_n(h) is close to R(h) for all h ∈ H. If H is infinite, we can still show that ERM is likely to choose a near-optimal hypothesis if H has finite complexity (such as VC-dimension). 30 / 109
34 Approximation error How good is the best hypothesis in H? That is, how large is the approximation error inf_{h∈H} R(h) − inf_h R(h)? 31 / 109
35 Approximation error How good is the best hypothesis in H? That is, how large is the approximation error inf_{h∈H} R(h) − inf_h R(h)? Trade-off between estimation error and approximation error: A larger hypothesis space implies smaller approximation error, but larger estimation error. A smaller hypothesis space implies larger approximation error, but smaller estimation error. 31 / 109
36 Optimization error Is the optimization algorithm computing the empirically best hypothesis exact? 32 / 109
37 Optimization error Is the optimization algorithm computing the empirically best hypothesis exact? While ERM can be efficiently implemented in many cases, there are also computationally intractable cases, and efficient approximations are sought. The performance gap between the sub-optimal hypothesis and the empirically best hypothesis is the optimization error. 32 / 109
38 What You Need to Know... Recognise machine learning problems as special cases of the general statistical learning problem. Understand that the performance of ERM depends on the approximation error, estimation error and optimization error. 33 / 109
39 Ordinary least squares Ridge regression Basis function method Regression function Nearest neighbor regression Kernel regression Classification as regression Regression 34 / 109
40 Ordinary Least Squares Find a best fitting hyperplane for (x_1, y_1), ..., (x_n, y_n) ∈ R^d × R. 35 / 109
41 OLS finds a hyperplane minimizing the sum of squared errors β_n = arg min_{β∈R^d} Σ_{i=1}^n (x_i^T β − y_i)^2. A special case of function learning using ERM The input set is X = R^d, and the output set is Y = R. The hypothesis space is the set of hyperplanes H = {x^T β : β ∈ R^d}. Quadratic loss is used, as is typical in regression. 36 / 109
42 Empirically best hyperplane. The solution to OLS is β_n = (X^T X)^{−1} X^T y, where X is the n × d matrix with x_i as the i-th row, and y = (y_1, ..., y_n)^T. The formula holds when X^T X is non-singular. When X^T X is singular, there are infinitely many possible values for β_n. They can be obtained by solving the linear system (X^T X)β = X^T y. 37 / 109
43 Proof. The empirical risk is (ignoring a factor of 1/n) R_n(β) = Σ_{i=1}^n (x_i^T β − y_i)^2 = ||Xβ − y||_2^2. Setting the gradient of R_n to 0, ∇R_n = 2X^T(Xβ − y) = 0, (verify) we have β_n = (X^T X)^{−1} X^T y. 38 / 109
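A minimal NumPy sketch of the closed form on synthetic data (the solution is computed by solving the linear system (X^T X)β = X^T y rather than forming the inverse explicitly; the data and dimensions are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 3
X = rng.normal(size=(n, d))                    # n x d design matrix
beta_true = np.array([1.0, -2.0, 0.5])
y = X @ beta_true + 0.1 * rng.normal(size=n)   # noisy linear responses

# OLS solution beta_n = (X^T X)^{-1} X^T y, via the normal equations.
beta_n = np.linalg.solve(X.T @ X, X.T @ y)
print(beta_n)  # close to beta_true
```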
44 Optimal hyperplane. The hyperplane β* = E(XX^T)^{−1} E(XY) minimizes the expected quadratic loss among all hyperplanes. 39 / 109
45 Optimal hyperplane. The hyperplane β* = E(XX^T)^{−1} E(XY) minimizes the expected quadratic loss among all hyperplanes. Proof. The expected quadratic loss of a hyperplane β is R(β) = E((β^T X − Y)^2) = E(β^T XX^T β − 2β^T XY + Y^2) = β^T E(XX^T)β − 2β^T E(XY) + E(Y^2). Setting the gradient of R to 0, we have ∇R(β) = 2 E(XX^T)β − 2 E(XY) = 0, so β* = E(XX^T)^{−1} E(XY). 39 / 109
46 Consistency. We can show that least squares linear regression is consistent, that is, R(β_n) → R(β*) in probability, by using the law of large numbers. 40 / 109
47 Least Squares as MLE Consider the class of conditional distributions {p_β(Y|X) : β ∈ R^d}, where p_β(Y|X = x) = N(Y; x^T β, σ^2) := (1/(√(2π)σ)) e^{−(Y−x^T β)^2/(2σ^2)}, with σ being a constant. The (conditional) likelihood of β is L_n(β) = p_β(y_1|x_1) ... p_β(y_n|x_n). Maximizing the likelihood L_n(β) gives the same β_n as given by the method of least squares. (verify) 41 / 109
48 Ridge Regression When collinearity is present, the matrix X^T X may be singular or close to singular, making the solution unreliable. Ridge regression We add a quadratic/ℓ2 regularizer λ||β||_2^2 to the OLS objective, where λ > 0 is a fixed constant: β_n = arg min_{β∈R^d} ( Σ_{i=1}^n (x_i^T β − y_i)^2 + λ||β||_2^2 ). Empirically optimal hyperplane β_n = (λI + X^T X)^{−1} X^T y. (verify) The matrix λI + X^T X is non-singular (verify), and thus there is always a unique solution. 42 / 109
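A sketch of the ridge solution in the same NumPy setup (here two columns are made nearly collinear so that X^T X is close to singular; λ = 1.0 is an arbitrary choice):

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, lam = 100, 5, 1.0
X = rng.normal(size=(n, d))
# Make two columns nearly collinear so that X^T X is close to singular.
X[:, 1] = X[:, 0] + 1e-6 * rng.normal(size=n)
y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=n)

# Ridge solution beta_n = (lam * I + X^T X)^{-1} X^T y.
beta_ridge = np.linalg.solve(lam * np.eye(d) + X.T @ X, X.T @ y)
print(beta_ridge)
```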
49 Regression with a Bias So far we have only considered hyperplanes of the form y = x^T β, which pass through the origin (green line). Considering hyperplanes with a bias term, that is, hyperplanes of the form y = x^T β + b, is more useful (red line). 43 / 109
50 OLS with a bias solves (b_n, β_n) = arg min_{b∈R, β∈R^d} Σ_{i=1}^n (x_i^T β + b − y_i)^2, where b is the bias. 44 / 109
51 OLS with a bias solves (b_n, β_n) = arg min_{b∈R, β∈R^d} Σ_{i=1}^n (x_i^T β + b − y_i)^2, where b is the bias. Solution. Reduce it to regression without a bias term by replacing x^T β + b with (1, x^T)(b, β^T)^T, i.e., prepend a constant feature 1 to each input. 44 / 109
52 Ridge regression with a bias solves (b_n, β_n) = arg min_{b∈R, β∈R^d} ( Σ_{i=1}^n (x_i^T β + b − y_i)^2 + λ||β||_2^2 ). 45 / 109
53 Ridge regression with a bias solves (b_n, β_n) = arg min_{b∈R, β∈R^d} ( Σ_{i=1}^n (x_i^T β + b − y_i)^2 + λ||β||_2^2 ). Solution. Reduce it to ridge regression without a bias term as follows. Let x̂_i = x_i − x̄ and ŷ_i = y_i − ȳ, where x̄ = Σ_{i=1}^n x_i/n and ȳ = Σ_{i=1}^n y_i/n; then β_n = arg min_{β∈R^d} ( Σ_{i=1}^n (x̂_i^T β − ŷ_i)^2 + λ||β||_2^2 ), b_n = ȳ − x̄^T β_n. (verify) 45 / 109
54 Basis Function Method We can use linear regression to learn complex regression functions Choose some basis functions g 1,..., g k : R d R. Transform each input x to (g 1 (x),..., g k (x)). Perform linear regression on the transformed data. 46 / 109
55 Basis Function Method We can use linear regression to learn complex regression functions Choose some basis functions g_1, ..., g_k : R^d → R. Transform each input x to (g_1(x), ..., g_k(x)). Perform linear regression on the transformed data. Examples Linear regression: use basis functions g_1, ..., g_d with g_i(x) = x_i, and g_0(x) = 1. Quadratic functions: use basis functions of the above form, together with basis functions of the form g_ij(x) = x_i x_j for all 1 ≤ i ≤ j ≤ d. 46 / 109
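A minimal sketch of the basis function method for a one-dimensional quadratic fit (basis functions g_0(x) = 1, g_1(x) = x, g_2(x) = x^2; the data are made up):

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.uniform(-3, 3, size=100)
y = 1.0 - 2.0 * x + 0.5 * x**2 + 0.2 * rng.normal(size=100)

# Transform each scalar input x into (g_0(x), g_1(x), g_2(x)) = (1, x, x^2).
Phi = np.column_stack([np.ones_like(x), x, x**2])

# Ordinary least squares on the transformed data.
beta, *_ = np.linalg.lstsq(Phi, y, rcond=None)
print(beta)  # close to (1.0, -2.0, 0.5)
```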
56 Regression Function The minimizer of the expected quadratic loss is the regression function h*(x) = E(Y|x). 47 / 109
57 Regression Function The minimizer of the expected quadratic loss is the regression function h*(x) = E(Y|x). Proof. The expected quadratic loss of a function h is E((h(X) − Y)^2) = E_X( h(X)^2 − 2h(X) E(Y|X) + E(Y^2|X) ). Hence we can set the value of h(x) independently for each x by choosing it to minimize the expression under the expectation. This leads to h*(x) = E(Y|x). 47 / 109
58 k nearest neighbor (knn) knn approximates the regression function using h_n(x) = avg(y_i | x_i ∈ N_k(x)), which is the average of the values of the set N_k(x) of the k nearest neighbors of x in the training data. 48 / 109
59 k nearest neighbor (knn) knn approximates the regression function using h_n(x) = avg(y_i | x_i ∈ N_k(x)), which is the average of the values of the set N_k(x) of the k nearest neighbors of x in the training data. Under mild conditions, as k → ∞ and n/k → ∞, h_n(x) → h*(x), for any distribution P(X, Y). (Curse of dimensionality) The number of samples required for accurate approximation is exponential in the dimension. knn is a non-parametric method, while linear regression is a parametric method. 48 / 109
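A minimal sketch of knn regression with Euclidean distance (the helper knn_predict is a name made up for this example; brute-force neighbour search on made-up data):

```python
import numpy as np

def knn_predict(x_train, y_train, x_query, k):
    """Average the y values of the k nearest training points (Euclidean distance)."""
    dists = np.linalg.norm(x_train - x_query, axis=1)   # distances to all training points
    nearest = np.argsort(dists)[:k]                     # indices of the k nearest neighbours
    return y_train[nearest].mean()

rng = np.random.default_rng(3)
x_train = rng.uniform(0, 10, size=(200, 1))
y_train = np.sin(x_train[:, 0]) + 0.1 * rng.normal(size=200)

print(knn_predict(x_train, y_train, np.array([5.0]), k=10))  # roughly sin(5)
```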
60 Kernel Regression h_n(x) = Σ_{i=1}^n K(x, x_i) y_i / Σ_{i=1}^n K(x, x_i), where K(x, x') is a function measuring the similarity between x and x', and is often called a kernel function. 49 / 109
61 Kernel Regression h_n(x) = Σ_{i=1}^n K(x, x_i) y_i / Σ_{i=1}^n K(x, x_i), where K(x, x') is a function measuring the similarity between x and x', and is often called a kernel function. Example kernel functions Gaussian kernel K_λ(x, x') = (1/λ) exp(−||x − x'||_2^2 / (2λ)). knn kernel K_k(x, x') = I( ||x − x'|| ≤ max_{x''∈N_k(x)} ||x − x''|| ). Note that this kernel is data-dependent and non-symmetric. 49 / 109
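A sketch of kernel regression with the Gaussian kernel above (the helper names and the bandwidth λ = 0.5 are arbitrary choices; data are made up):

```python
import numpy as np

def gaussian_kernel(x, xp, lam=0.5):
    # K_lambda(x, x') = (1/lambda) * exp(-||x - x'||^2 / (2 * lambda))
    return np.exp(-np.sum((x - xp) ** 2, axis=-1) / (2 * lam)) / lam

def kernel_regression(x_train, y_train, x_query, lam=0.5):
    # h_n(x) = sum_i K(x, x_i) y_i / sum_i K(x, x_i)
    w = gaussian_kernel(x_train, x_query, lam)
    return np.sum(w * y_train) / np.sum(w)

rng = np.random.default_rng(4)
x_train = rng.uniform(0, 10, size=(200, 1))
y_train = np.sin(x_train[:, 0]) + 0.1 * rng.normal(size=200)

print(kernel_regression(x_train, y_train, np.array([5.0])))  # roughly sin(5)
```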
62 Binary Classification as Regression Label one class as −1 and the other as +1. Fit a function f(x) using least squares regression. Given a test example x, predict −1 if f(x) < 0 and +1 otherwise. 50 / 109
63 What You Need to Know... Regression function. Parametric methods: Ordinary least squares, ridge regression, basis function method. Non-parametric methods: knn, kernel regression. 51 / 109
64 Model Selection 52 / 109
65 Bias-Variance Tradeoff The predicted value Ŷ at a fixed point x can be considered a random function of the training set. Let Y be the true value at x. The expected prediction error E((Y − Ŷ)^2) is a property of the model class. 53 / 109
66 Bias-Variance Tradeoff The predicted value Ŷ at a fixed point x can be considered a random function of the training set. Let Y be the true value at x. The expected prediction error E((Y − Ŷ)^2) is a property of the model class. Bias-variance decomposition Expected prediction error E((Y − Ŷ)^2) = E((Ŷ − E(Ŷ))^2) [variance] + (E(Ŷ) − E(Y))^2 [bias squared] + E((Y − E(Y))^2) [irreducible noise]. Proof. Expand the RHS and simplify. 53 / 109
67 Bias-Variance Tradeoff The predicted value Ŷ at a fixed point x can be considered a random function of the training set. Let Y be the true value at x. The expected prediction error E((Y − Ŷ)^2) is a property of the model class. Bias-variance decomposition Expected prediction error E((Y − Ŷ)^2) = E((Ŷ − E(Ŷ))^2) [variance] + (E(Ŷ) − E(Y))^2 [bias squared] + E((Y − E(Y))^2) [irreducible noise]. Proof. Expand the RHS and simplify. Bias-variance tradeoff In general, as model complexity increases (i.e., the hypothesis becomes more complex), variance tends to increase, and bias tends to decrease. 53 / 109
68 Bias-variance Tradeoff in knn Assumption Suppose Y|X ∼ N(f(X), σ^2) for some function f and some fixed σ. In addition, suppose x_1, ..., x_n are fixed. 54 / 109
69 Bias-variance Tradeoff in knn Assumption Suppose Y|X ∼ N(f(X), σ^2) for some function f and some fixed σ. In addition, suppose x_1, ..., x_n are fixed. Bias and variance At x, Ŷ = (1/k) Σ_{x_i∈N_k(x)} y_i is predicted and the true value is Y. bias = E(Ŷ) − E(Y) = (1/k) Σ_{x_i∈N_k(x)} f(x_i) − f(x), variance = E((Ŷ − E(Ŷ))^2) = σ^2/k. 54 / 109
70 bias = (1/k) Σ_{x_i∈N_k(x)} f(x_i) − f(x), variance = σ^2/k. 1/k as a complexity measure This is because with smaller 1/k, h_n(x) = avg(y_i | x_i ∈ N_k(x)) is closer to a constant, and thus model complexity is smaller. Bias-variance trade-off As 1/k increases (or as model complexity increases), bias is likely to decrease, and variance increases. 55 / 109
72 Model Selection Assume model complexity is controlled by some parameter θ. How to pick the best value among candidates θ 0,..., θ m? 57 / 109
73 Model Selection Assume model complexity is controlled by some parameter θ. How to pick the best value among candidates θ 0,..., θ m? Using a development set Split available data into a training set T and a development set D. For each θ i, train a model on T, and test it on D. Choose the parameter with best performance. A lot of data is needed, while the amount may be limited. 57 / 109
74 K-fold cross validation Split the training data into K folds. For each θ_i, train K models, with each trained on K − 1 folds and tested on the remaining fold. Choose the parameter with best average performance. Computationally more expensive than using a development set. 58 / 109
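A sketch of K-fold cross-validation for choosing the ridge parameter λ (the helper names, candidate values, and data are arbitrary; this is only a schematic of the procedure):

```python
import numpy as np

def ridge_fit(X, y, lam):
    d = X.shape[1]
    return np.linalg.solve(lam * np.eye(d) + X.T @ X, X.T @ y)

def cv_error(X, y, lam, K=5):
    """Average held-out squared error of ridge regression over K folds."""
    n = X.shape[0]
    folds = np.array_split(np.random.default_rng(0).permutation(n), K)
    errs = []
    for fold in folds:
        train = np.setdiff1d(np.arange(n), fold)       # the other K-1 folds
        beta = ridge_fit(X[train], y[train], lam)
        errs.append(np.mean((X[fold] @ beta - y[fold]) ** 2))
    return np.mean(errs)

rng = np.random.default_rng(5)
X = rng.normal(size=(100, 4))
y = X @ np.array([1.0, 0.0, -1.0, 2.0]) + 0.5 * rng.normal(size=100)

candidates = [0.01, 0.1, 1.0, 10.0]
best_lam = min(candidates, key=lambda lam: cv_error(X, y, lam))
print(best_lam)
```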
76 What You Need to Know... The expected prediction error is a function of the model complexity. The bias-variance decomposition implies that minimizing the expected prediction error requires careful tuning of the model complexity. Using a development set and cross-validation are two basic methods for model selection. 60 / 109
77 Recap A statistics and optimization perspective Learning a binomial and a Gaussian Statistical learning theory Data, hypotheses, loss function, expected risk, empirical risk... Regression Regression function, parametric regression, non-parametric regression... Model selection Bias-variance tradeoff, development set, cross-validation... Classification Clustering 61 / 109
78 Bayes optimal classifier Nearest neighbor classifier Naive Bayes classifier Logistic regression The perceptron Support vector machines Classification 62 / 109
79 Recall Classification Output a function f : X → Y where Y is a finite set. 0/1 loss L((x, y), h) = I(y ≠ h(x)) = { 1, y ≠ h(x); 0, y = h(x) }. Expected/True risk R(h) = E(L((X, Y), h)). 63 / 109
80 Bayes Optimal Classifier The expected 0/1 loss is minimized by the Bayes optimal classifier h*(x) = arg max_{y∈Y} P(y|x). 64 / 109
81 Bayes Optimal Classifier The expected 0/1 loss is minimized by the Bayes optimal classifier h*(x) = arg max_{y∈Y} P(y|x). Proof. The expected 0/1 loss of a classifier h(x) is E(L((X, Y), h)) = E_X E_{Y|X}(I(Y ≠ h(X))) = E_X P(Y ≠ h(X) | X). Hence we can set the value of h(x) independently for each x by choosing it to minimize the expression under the expectation. This leads to h*(x) = arg max_{y∈Y} P(y|x). 64 / 109
82 The Bayes optimal classifier is h*(x) = arg max_{y∈Y} P(y|x). However, P(Y|x) is unknown... Idea. Estimate P(y|x) from data. 65 / 109
83 Nearest Neighbor Classifier Approximate P(y|x) using the label distribution in {y_i | x_i ∈ N_k(x)}, where N_k(x) consists of the k nearest examples of x (with respect to some distance measure), and predict the majority label h_n(x) = majority{y_i | x_i ∈ N_k(x)}. 66 / 109
84 Nearest Neighbor Classifier Approximate P(y|x) using the label distribution in {y_i | x_i ∈ N_k(x)}, where N_k(x) consists of the k nearest examples of x (with respect to some distance measure), and predict the majority label h_n(x) = majority{y_i | x_i ∈ N_k(x)}. Under mild conditions, as k → ∞ and n/k → ∞, h_n(x) → h*(x), for any distribution P(X, Y). (Curse of dimensionality) The number of samples required for accurate approximation is exponential in the dimension. 66 / 109
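A minimal sketch of the k nearest neighbor classifier with a majority vote (Euclidean distance, brute force; the helper name and the two-blob data are made up):

```python
import numpy as np
from collections import Counter

def knn_classify(x_train, y_train, x_query, k):
    """Predict the majority label among the k nearest training examples."""
    dists = np.linalg.norm(x_train - x_query, axis=1)
    nearest = np.argsort(dists)[:k]
    return Counter(y_train[nearest].tolist()).most_common(1)[0][0]

rng = np.random.default_rng(6)
# Two Gaussian blobs as classes 0 and 1.
x0 = rng.normal(loc=0.0, size=(50, 2))
x1 = rng.normal(loc=3.0, size=(50, 2))
x_train = np.vstack([x0, x1])
y_train = np.array([0] * 50 + [1] * 50)

print(knn_classify(x_train, y_train, np.array([2.5, 2.5]), k=15))  # most likely 1
```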
85 [Figure: decision boundaries of the 1-NN classifier, the 15-NN classifier, and the Bayes optimal classifier.] 67 / 109
86 Naive Bayes Classifier (NB) Model X = X_1 × ... × X_d, where each X_i is a finite set. A model p(X, Y) satisfies the independence assumption p(x_1, ..., x_d | y) = p(x_1|y) ... p(x_d|y). 68 / 109
87 Naive Bayes Classifier (NB) Model X = X_1 × ... × X_d, where each X_i is a finite set. A model p(X, Y) satisfies the independence assumption p(x_1, ..., x_d | y) = p(x_1|y) ... p(x_d|y). Classification An example x = (x_1, ..., x_d) is classified as y = arg max_{y∈Y} p(y|x). This is equivalent to y = arg max_{y∈Y} p(y, x) = arg max_{y∈Y} p(y) p(x_1|y) ... p(x_d|y), by the independence assumption. 68 / 109
88 Learning (MLE) The maximum likelihood Naive Bayes model is p̂(X, Y) given by p̂(y) = n_y/n, p̂(x_i|y) = n_{y,x_i}/n_y, where n_y is the number of times class y appears in the training set, and n_{y,x_i} is the number of times attribute i takes value x_i when the class label is y. (verify) 69 / 109
89 Learning (MLE) The maximum likelihood Naive Bayes model is p̂(X, Y) given by p̂(y) = n_y/n, p̂(x_i|y) = n_{y,x_i}/n_y, where n_y is the number of times class y appears in the training set, and n_{y,x_i} is the number of times attribute i takes value x_i when the class label is y. (verify) Issues The independence assumption is unlikely to be satisfied. The counts n_y may be 0, making the estimates undefined. The counts may be very small, leading to unstable estimates. 69 / 109
90 Laplace correction p̂(y) = (n_y + c_0) / Σ_{y'∈Y}(n_{y'} + c_0), p̂(x_i|y) = (n_{y,x_i} + c_1) / Σ_{x'_i∈X_i}(n_{y,x'_i} + c_1), where c_0 > 0 and c_1 > 0 are user-chosen constants. Laplace correction makes NB more stable, but still relies on the strong independence assumption. 70 / 109
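A sketch of the Naive Bayes estimates with Laplace correction on a tiny made-up dataset of binary attributes (c_0 = c_1 = 1 and the data are arbitrary choices):

```python
import numpy as np

# Tiny made-up dataset: 6 examples, 3 binary attributes, 2 classes.
X = np.array([[1, 0, 1],
              [1, 1, 1],
              [0, 0, 1],
              [0, 1, 0],
              [0, 0, 0],
              [1, 0, 0]])
y = np.array([1, 1, 1, 0, 0, 0])
classes, values, c0, c1 = [0, 1], [0, 1], 1.0, 1.0

# Class priors with Laplace correction: p(y) = (n_y + c0) / sum_y' (n_y' + c0).
n_y = {c: np.sum(y == c) for c in classes}
prior = {c: (n_y[c] + c0) / (len(y) + c0 * len(classes)) for c in classes}

# Conditionals with Laplace correction: p(x_i = v | y) = (n_{y,v} + c1) / sum_v' (n_{y,v'} + c1).
cond = {c: [{v: (np.sum(X[y == c, i] == v) + c1) / (n_y[c] + c1 * len(values))
             for v in values} for i in range(X.shape[1])] for c in classes}

def nb_classify(x):
    # arg max_y p(y) * prod_i p(x_i | y)
    score = {c: prior[c] * np.prod([cond[c][i][x[i]] for i in range(len(x))]) for c in classes}
    return max(score, key=score.get)

print(nb_classify([1, 0, 1]))  # most likely class 1
```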
91 Logistic Regression (LR) Model X = R^d. Logistic regression estimates conditional distributions of the form p(y|x, θ) = exp(x^T θ_y) / Σ_{y'∈Y} exp(x^T θ_{y'}), where θ_y = (θ_{y1}, ..., θ_{yd}) ∈ R^d, and θ is the concatenation of the θ_y's. 71 / 109
92 Logistic Regression (LR) Model X = R^d. Logistic regression estimates conditional distributions of the form p(y|x, θ) = exp(x^T θ_y) / Σ_{y'∈Y} exp(x^T θ_{y'}), where θ_y = (θ_{y1}, ..., θ_{yd}) ∈ R^d, and θ is the concatenation of the θ_y's. Classification An example x is classified as y = arg max_{y∈Y} p(y|x, θ). 71 / 109
93 Learning Training is often done by maximizing the regularized log-likelihood L(θ) = log Π_{i=1}^n p(y_i|x_i, θ) − λ||θ||_2^2. That is, the parameter estimate is θ_n = arg max_θ L(θ). L(θ) is a concave function, and can be optimized using standard numerical methods (such as L-BFGS). 72 / 109
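A sketch of binary logistic regression trained by plain gradient ascent on the regularized log-likelihood (gradient ascent stands in here for a method such as L-BFGS; the step size, λ, and data are arbitrary choices):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(7)
n, d, lam, lr = 200, 2, 0.1, 0.5
X = np.vstack([rng.normal(-1.0, 1.0, size=(n // 2, d)),
               rng.normal(+1.0, 1.0, size=(n // 2, d))])
y = np.array([0] * (n // 2) + [1] * (n // 2))

theta = np.zeros(d)
for _ in range(1000):
    p = sigmoid(X @ theta)                     # p(y = 1 | x, theta)
    grad = X.T @ (y - p) - 2 * lam * theta     # gradient of log-likelihood - lam * ||theta||^2
    theta += lr * grad / n                     # gradient ascent step

pred = (sigmoid(X @ theta) >= 0.5).astype(int)
print(np.mean(pred == y))  # training accuracy, high for well-separated classes
```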
94 Comparing knn, NB and LR All approximate the Bayes-optimal classifier: Learn an approximation P̂(X, Y) to P(X, Y) (or learn an approximation P̂(Y|X) to P(Y|X)). Choose the function h_n(x) = arg max_y P̂(y|x). knn and logistic regression estimate P(Y|X), while naive Bayes estimates P(X, Y). knn is a non-parametric method, while naive Bayes and logistic regression are both parametric methods. 73 / 109
95 Application in Digit Recognition Assume each digit image is a binary image, represented as a vector with the pixel values as its elements. Applying the algorithms knn: use the Euclidean distance between the examples. NB can be applied because each example is a discrete vector. LR can be applied because each example is a continuous vector. 74 / 109
96 The Perceptron [Figure: a perceptron with inputs x_0 = 1, x_1, ..., x_d, weights w_0, w_1, ..., w_d, a summation node Σ computing x^T w, and output y = sgn(x^T w).] A perceptron maps an input x ∈ R^{d+1} to h(x) = sgn(x^T w) = { +1, x^T w > 0; 0, x^T w = 0; −1, x^T w < 0 }. Here x = (1, x_1, ..., x_d) includes a dummy variable 1. 75 / 109
97 [Figure: + and − examples separated by a linear decision boundary.] A perceptron corresponds to a linear decision boundary (i.e., the boundary between the regions for examples of the same class). 76 / 109
98 It is NP-hard to minimize the empirical 0/1 loss of a perceptron, that is, given a training set {(x_1, y_1), ..., (x_n, y_n)}, it is NP-hard to solve min_w (1/n) Σ_i I(sgn(x_i^T w) ≠ y_i). Idea: Can we use (x_i^T w − y_i)^2 as a surrogate loss for the 0/1 loss? 77 / 109
99 Least Squares May Fail Recall: Binary classification as regression Label one class as −1 and the other as +1. Fit a function f(x) using least squares regression. Given a test example x, predict −1 if f(x) < 0 and +1 otherwise. 78 / 109
100 Least Squares May Fail Recall: Binary classification as regression Label one class as −1 and the other as +1. Fit a function f(x) using least squares regression. Given a test example x, predict −1 if f(x) < 0 and +1 otherwise. Issue Least squares fitting may not find a separating hyperplane (i.e., a hyperplane which puts the positive and negative examples on different sides of it) even when there is one. 78 / 109
101 The decision boundary learned using least squares fitting (red line) wrongly classifies the negative example, while there exist separating hyperplanes (like the blue line). 79 / 109
102 Perceptron Algorithm Require: (x_1, y_1), ..., (x_n, y_n) ∈ R^{d+1} × {−1, +1}, η ∈ (0, 1]. Ensure: Weight vector w. Randomly or smartly initialize w. while there is any misclassified example do Pick a misclassified example (x_i, y_i). w ← w + η y_i x_i. 80 / 109
103 Why the update rule w ← w + η y_i x_i? w classifies (x_i, y_i) correctly if and only if y_i x_i^T w > 0. If w classifies (x_i, y_i) wrongly, then the update rule moves y_i x_i^T w towards positive, because y_i x_i^T (w + η y_i x_i) = y_i x_i^T w + η ||x_i||^2 > y_i x_i^T w. 81 / 109
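A minimal sketch of the perceptron algorithm on linearly separable data (a dummy feature 1 is prepended to each input; η = 1 and the data are made up):

```python
import numpy as np

rng = np.random.default_rng(8)
n = 100
X = np.vstack([rng.normal(-2.0, 1.0, size=(n // 2, 2)),
               rng.normal(+2.0, 1.0, size=(n // 2, 2))])
X = np.hstack([np.ones((n, 1)), X])              # prepend dummy feature x_0 = 1
y = np.array([-1] * (n // 2) + [+1] * (n // 2))

w, eta = np.zeros(3), 1.0
for _ in range(1000):                            # cap iterations in case data is not separable
    misclassified = np.where(y * (X @ w) <= 0)[0]
    if len(misclassified) == 0:
        break                                    # all training examples correctly classified
    i = misclassified[0]                         # pick a misclassified example
    w = w + eta * y[i] * X[i]                    # perceptron update

print(w, np.all(y * (X @ w) > 0))
```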
104 Perceptron convergence theorem If the training data is linearly separable (i.e., there exists some w such that y_i x_i^T w > 0 for all i), then the perceptron algorithm terminates with all training examples correctly classified. 82 / 109
105 Perceptron convergence theorem If the training data is linearly separable (i.e., there exists some w such that y_i x_i^T w > 0 for all i), then the perceptron algorithm terminates with all training examples correctly classified. Proof. Suppose w* separates the data. We can scale w* such that y_i x_i^T w* ≥ ||x_i||_2^2 for all i. If w classifies (x_i, y_i) wrongly, then it will be updated to w' = w + η y_i x_i. We show that ||w' − w*||_2^2 ≤ ||w − w*||_2^2 − ηR^2, where R = min_i ||x_i||_2. This implies that only finitely many updates are possible. The inequality can be shown as follows. ||w' − w*||_2^2 = ||w − w* + η y_i x_i||_2^2 = ||w − w*||_2^2 + 2η y_i x_i^T (w − w*) + η^2 y_i^2 ||x_i||_2^2 ≤ ||w − w*||_2^2 − 2η(y_i x_i^T w* − y_i x_i^T w) + η ||x_i||_2^2 ≤ ||w − w*||_2^2 − 2η y_i x_i^T w* + η ||x_i||_2^2 ≤ ||w − w*||_2^2 − η y_i x_i^T w* ≤ ||w − w*||_2^2 − ηR^2. 82 / 109
106 Issues When the data is separable, the hyperplane found by the perceptron algorithm depends on the initial weight and is thus arbitrary. Convergence can be very slow, especially when the gap between the positive and negative examples is small. When the data is not separable, the algorithm does not stop, but this can be difficult to detect. 83 / 109
107 Support Vector Machines (SVMs) Separable data [Figure: a separating hyperplane w^T x + w_0 = 0 with + and − examples; a point (x_i, y_i) lies at distance y_i(w^T x_i + w_0)/||w||_2 from the hyperplane.] Geometric intuition Find a separating hyperplane with maximal margin (i.e., the minimum distance from the points to it). 84 / 109
108 Support Vector Machines (SVMs) Separable data Geometric intuition Find a separating hyperplane with maximal margin (i.e., the minimum distance from the points to it). Algebraic formulation max_{M,w,w_0} M subject to y_i(w^T x_i + w_0)/||w||_2 ≥ M, i = 1, ..., n. 84 / 109
109 Support Vector Machines (SVMs) Separable data Geometric intuition Find a separating hyperplane with maximal margin (i.e., the minimum distance from the points to it). Algebraic formulation max_{M,w,w_0} M subject to y_i(w^T x_i + w_0)/||w||_2 ≥ M, i = 1, ..., n. Equivalent formulation (add the constraint M||w||_2 = 1): min_{w,w_0} (1/2)||w||_2^2 subject to y_i(w^T x_i + w_0) ≥ 1, i = 1, ..., n. 84 / 109
110 Soft-margin SVMs Non-separable data Algebraic formulation min_{w,w_0,ξ_1,...,ξ_n} (1/2)||w||_2^2 + C Σ_i ξ_i subject to y_i(w^T x_i + w_0) ≥ 1 − ξ_i, i = 1, ..., n, ξ_i ≥ 0, i = 1, ..., n. C > 0 is a user-chosen constant. Introducing ξ_i allows (x_i, y_i) to be misclassified, with a penalty of Cξ_i added to the original objective function (1/2)||w||_2^2. An SVM always has a unique solution that can be found efficiently. 85 / 109
111 SVM as minimizing regularized hinge loss Soft-margin SVMs can be equivalently written as min_{w,w_0} (1/(2C))||w||_2^2 + Σ_i max(0, 1 − y_i(w^T x_i + w_0)), where max(0, 1 − y(w^T x + w_0)) is the hinge loss L_hinge((x, y), h) = max(0, 1 − y h(x)) of the classifier h(x) = w^T x + w_0, and upper bounds the 0/1 loss L_0/1((x, y), h) = I(y ≠ sgn(h(x))). 86 / 109
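A sketch of this regularized hinge-loss view, minimized by plain subgradient descent (one simple way to train a linear SVM; the step size, C, iteration count, and data are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(9)
n, C, lr = 200, 1.0, 0.01
X = np.vstack([rng.normal(-2.0, 1.0, size=(n // 2, 2)),
               rng.normal(+2.0, 1.0, size=(n // 2, 2))])
y = np.array([-1] * (n // 2) + [+1] * (n // 2))

w, w0 = np.zeros(2), 0.0
for _ in range(2000):
    margins = y * (X @ w + w0)
    active = margins < 1                                 # examples with non-zero hinge loss
    # Subgradient of (1/(2C))||w||^2 + sum_i max(0, 1 - y_i (w^T x_i + w_0)).
    grad_w = w / C - (y[active, None] * X[active]).sum(axis=0)
    grad_w0 = -y[active].sum()
    w -= lr * grad_w
    w0 -= lr * grad_w0

pred = np.sign(X @ w + w0)
print(np.mean(pred == y))  # training accuracy
```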
112 What You Need to Know... Bayes optimal classifier. Probabilistic classifiers: NN, NB and LR. Perceptrons and SVMs. 87 / 109
113 Clustering 88 / 109
114 Clustering Clustering is unsupervised function learning which learns a function from X to [K] = {1,..., K}. The objective generally depends on some similarity/distance measure between the items, so that similar items are grouped together. 89 / 109
115 K-means Clustering Given observations x_1, ..., x_n, find a K-clustering f (a surjection from [n] to [K]) to minimize the cost Σ_{i=1}^n ||x_i − c_{f(i)}||^2, where c_k is the centroid of cluster k, that is, the average of the x_i's with f(i) = k. 90 / 109
116 K-means algorithm Randomly initialize c 1:K, and set each f (i) = 0. repeat (Assignment) Set each f (i) to be the index of the c j closest to x i. (Update) Set each c i as the centroid of cluster i given by f. until f does not change 91 / 109
117 K-means algorithm Randomly initialize c 1:K, and set each f (i) = 0. repeat (Assignment) Set each f (i) to be the index of the c j closest to x i. (Update) Set each c i as the centroid of cluster i given by f. until f does not change Initialization Forgy method: randomly choose K observations as centroids, and initialize f as in the assignment step. Random Partition: assign a random cluster to each example. Random partition is preferable. 91 / 109
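A minimal sketch of the K-means loop with random-partition initialization (the data, K, and the fallback for empty clusters are arbitrary choices; the helper name kmeans is made up):

```python
import numpy as np

def kmeans(X, K, rng, max_iter=100):
    # Random partition initialization: assign a random cluster to each example.
    f = rng.integers(K, size=len(X))
    for _ in range(max_iter):
        # Update step: set each centroid to the mean of its cluster
        # (re-seed a centroid from a random point if its cluster is empty).
        centroids = np.array([X[f == k].mean(axis=0) if np.any(f == k)
                              else X[rng.integers(len(X))] for k in range(K)])
        # Assignment step: assign each point to the closest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_f = dists.argmin(axis=1)
        if np.array_equal(new_f, f):
            break                     # assignments stopped changing
        f = new_f
    return f, centroids

rng = np.random.default_rng(10)
X = np.vstack([rng.normal(m, 0.5, size=(50, 2)) for m in (0.0, 3.0, 6.0)])
labels, centers = kmeans(X, K=3, rng=rng)
print(centers)
```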
118 Illustration [Figure: initial centroids.] 92 / 109
119 Illustration [Figure: iteration 1, assignment step.] 92 / 109
120 Illustration [Figure: iteration 1, update step.] 92 / 109
121 Illustration [Figure: iteration 2, assignment step.] 92 / 109
122 Illustration [Figure: iteration 2, update step (converged).] 92 / 109
123 K-means as Coordinate Descent K-means is a coordinate descent algorithm for the cost function C(f, c_{1:K}) = Σ_{i=1}^n ||x_i − c_{f(i)}||^2. Specifically, we have (verify) (Assignment) f ← arg min_f C(f, c_{1:K}), where f is a K-clustering. (Update) c_{1:K} ← arg min_{c_{1:K}} C(f, c_{1:K}). 93 / 109
124 Convergence The cost decreases at each iteration before termination. The cost converges to a local minimum by the monotone convergence theorem. Convergence may be very slow, taking exponential time in some cases, but such cases do not seem to arise in practice. 94 / 109
125 Convergence The cost decreases at each iteration before termination. The cost converges to a local minimum by the monotone convergence theorem. Convergence may be very slow, taking exponential time in some cases, but such cases do not seem to arise in practice. Dealing with poor local minimum Restart multiple times and pick the minimum cost clustering found. 94 / 109
126 Convergence The cost decreases at each iteration before termination. The cost converges to a local minimum by the monotone convergence theorem. Convergence may be very slow, taking exponential time in some cases, but such cases do not seem to arise in practice. Dealing with poor local minimum Restart multiple times and pick the minimum cost clustering found. K-means gives hard clusterings. Can we give probabilistic assignments to clusters? 94 / 109
127 Soft Clustering with Gaussian Mixtures Assumption Each cluster is represented as a Gaussian N(x; µ_k, Σ_k) = (1/√(det(2πΣ_k))) exp( −(1/2)(x − µ_k)^T Σ_k^{−1} (x − µ_k) ). Cluster k has weight w_k ≥ 0, with Σ_{k=1}^K w_k = 1. 95 / 109
128 Equivalently, we assume the probability of observing x and that it is in cluster z = k is p(x, z = k | θ) = w_k N(x; µ_k, Σ_k), where θ = {w_{1:K}, µ_{1:K}, Σ_{1:K}}. 96 / 109
129 Equivalently, we assume the probability of observing x and that it is in cluster z = k is p(x, z = k | θ) = w_k N(x; µ_k, Σ_k), where θ = {w_{1:K}, µ_{1:K}, Σ_{1:K}}. The distribution p(x|θ) = Σ_k w_k N(x; µ_k, Σ_k) is called a Gaussian mixture. 96 / 109
130 Computing a soft clustering p(z = k | x, θ) = p(x, z = k | θ) / Σ_{k'=1}^K p(x, z = k' | θ). Learning a Gaussian mixture Given the observations D = {x_1, ..., x_n}, we choose θ by maximizing the log-likelihood L(D|θ) = Σ_{i=1}^n ln p(x_i|θ): max_θ L(D|θ). This is solved using the EM algorithm. 97 / 109
131 The EM Algorithm Let z_i be the random variable representing the cluster from which x_i is drawn. EM starts with some initial parameter θ^(0), and repeats the following steps: Expectation step: Q(θ | θ^(t)) = Σ_i E( ln p(x_i, z_i | θ) | D, θ^(t) ). Maximization step: θ^(t+1) = arg max_θ Q(θ | θ^(t)). In words... E-step: Expectation of the log-likelihood for the complete data, w.r.t. the conditional distribution p(z_{1:n} | D, θ^(t)). M-step: Maximization of the expectation. 98 / 109
132 The EM Algorithm Let z_i be the random variable representing the cluster from which x_i is drawn. EM starts with some initial parameter θ^(0), and repeats the following steps: Expectation step: Q(θ | θ^(t)) = Σ_i E( ln p(x_i, z_i | θ) | D, θ^(t) ). Maximization step: θ^(t+1) = arg max_θ Q(θ | θ^(t)). Data completion interpretation E-step: Create complete data (x_i, z_i) with weight p(z_i | x_i, θ^(t)), for each x_i and each z_i ∈ [K]. M-step: Perform maximum likelihood estimation on the complete data set. 98 / 109
133 EM algorithm is iterative likelihood maximization The EM algorithm iteratively improves the likelihood function: L(D|θ^(t+1)) ≥ L(D|θ^(t)). 99 / 109
134 EM algorithm is iterative likelihood maximization The EM algorithm iteratively improves the likelihood function: L(D|θ^(t+1)) ≥ L(D|θ^(t)). Proof. With some algebraic manipulation, we have Q(θ^(t+1) | θ^(t)) − Q(θ^(t) | θ^(t)) = L(D|θ^(t+1)) − L(D|θ^(t)) − Σ_i KL( p(Z_i | x_i, θ^(t)) || p(Z_i | x_i, θ^(t+1)) ), where KL(q || q') is the KL-divergence Σ_x q(x) ln(q(x)/q'(x)). The result follows by noting that the LHS is non-negative by the choice of θ^(t+1) in the M-step, and the non-negativity of the KL-divergence. 99 / 109
135 Q(θ^(t+1) | θ^(t)) − Q(θ^(t) | θ^(t)) = Σ_i E( ln p(x_i, z_i | θ^(t+1)) | D, θ^(t) ) − Σ_i E( ln p(x_i, z_i | θ^(t)) | D, θ^(t) ) = Σ_i E( ln p(x_i | θ^(t+1)) + ln p(z_i | x_i, θ^(t+1)) − ln p(x_i | θ^(t)) − ln p(z_i | x_i, θ^(t)) | D, θ^(t) ) = Σ_i ln p(x_i | θ^(t+1)) − Σ_i ln p(x_i | θ^(t)) − Σ_i E( ln ( p(z_i | x_i, θ^(t)) / p(z_i | x_i, θ^(t+1)) ) | D, θ^(t) ) = L(D|θ^(t+1)) − L(D|θ^(t)) − Σ_i KL( p(Z_i | x_i, θ^(t)) || p(Z_i | x_i, θ^(t+1)) ). 100 / 109
136 Update Equations for Gaussian Mixtures Scalar covariance matrices Assume each covariance matrix Σ_k is a scalar matrix σ_k^2 I_d. Let w_k^(t,i) = p(z_i = k | x_i, θ^(t)). Then given θ^(t), θ^(t+1) can be computed using w_k^(t+1) = Σ_i w_k^(t,i) / n, µ_k^(t+1) = Σ_i w_k^(t,i) x_i / Σ_i w_k^(t,i), (σ_k^(t+1))^2 = Σ_i Σ_j w_k^(t,i) (x_ij − µ_kj^(t+1))^2 / (d Σ_i w_k^(t,i)). 101 / 109
137 Data completion interpretation Split example x_i into K complete examples (x_i, 1), ..., (x_i, K), where example (x_i, k) has weight w_k^(t,i). Apply maximum likelihood estimation to the complete data: w_k^(t+1) is the total weight of the examples in cluster k, divided by n. µ_k^(t+1) is the mean x value of the (weighted) examples in cluster k. σ_k^(t+1) is the standard deviation of all the attributes of the (weighted) examples in cluster k. 102 / 109
138 Diagonal covariance matrices Assume each covariance matrix Σ_k is a diagonal matrix with diagonal entries σ_k1^2, ..., σ_kd^2. Let w_k^(t,i) = p(z_i = k | x_i, θ^(t)). Then given θ^(t), θ^(t+1) can be computed using w_k^(t+1) = Σ_i w_k^(t,i) / n, µ_k^(t+1) = Σ_i w_k^(t,i) x_i / Σ_i w_k^(t,i), (σ_kj^(t+1))^2 = Σ_i w_k^(t,i) (x_ij − µ_kj^(t+1))^2 / Σ_i w_k^(t,i). 103 / 109
139 Arbitrary covariance matrices Let w_k^(t,i) = p(z_i = k | x_i, θ^(t)). Then given θ^(t), θ^(t+1) can be computed using w_k^(t+1) = Σ_i w_k^(t,i) / n, µ_k^(t+1) = Σ_i w_k^(t,i) x_i / Σ_i w_k^(t,i), Σ_k^(t+1) = Σ_i w_k^(t,i) (x_i − µ_k^(t+1))(x_i − µ_k^(t+1))^T / Σ_i w_k^(t,i). 104 / 109
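A sketch of one EM loop for a Gaussian mixture with arbitrary covariance matrices, following the update equations above (SciPy's multivariate_normal supplies the density; the two-cluster data and the naive initialization are arbitrary):

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(11)
# Synthetic data from two well-separated Gaussian clusters.
X = np.vstack([rng.normal(0.0, 1.0, size=(100, 2)),
               rng.normal(4.0, 1.0, size=(100, 2))])
n, d, K = X.shape[0], X.shape[1], 2

# Naive initialization of theta = (w, mu, Sigma).
w = np.full(K, 1.0 / K)
mu = X[rng.choice(n, size=K, replace=False)]
Sigma = np.array([np.eye(d) for _ in range(K)])

for _ in range(50):
    # E-step: responsibilities w_k^(t,i) = p(z_i = k | x_i, theta^(t)).
    dens = np.column_stack([w[k] * multivariate_normal.pdf(X, mean=mu[k], cov=Sigma[k])
                            for k in range(K)])
    resp = dens / dens.sum(axis=1, keepdims=True)

    # M-step: the update equations for arbitrary covariance matrices.
    Nk = resp.sum(axis=0)                  # sum_i w_k^(t,i) for each k
    w = Nk / n
    mu = (resp.T @ X) / Nk[:, None]
    for k in range(K):
        diff = X - mu[k]
        Sigma[k] = (resp[:, k, None] * diff).T @ diff / Nk[k]

print(w, mu)
```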
140 What You Need to Know... The clustering problem. Hard clustering with K-means algorithm. Soft clustering with Gaussian mixture. 105 / 109
141 Density Estimation Maximum likelihood estimation. Naive Bayes. Logistic regression. 106 / 109
142 This Tutorial... Essentials for crafting basic machine learning systems. Formulate applications as machine learning problems Classification, regression, density estimation, clustering Understand and apply basic learning algorithms Least squares regression, logistic regression, support vector machines, K-means,... Theoretical understanding Position and compare the problems and algorithms in the unifying framework of statistical learning theory. 107 / 109
143 Beyond This Course Representation: dimensionality reduction, feature selection,... Algorithms: decision trees, artificial neural networks, Gaussian processes,... Meta-learning algorithms: boosting, stacking, bagging,... Learning theory: generalization performance of learning algorithms Many other exciting topics... 108 / 109
144 Further Readings 109 / 109
Kernel Methods and Support Vector Machines Oliver Schulte - CMPT 726 Bishop PRML Ch. 6 Support Vector Machines Defining Characteristics Like logistic regression, good for continuous input features, discrete
More informationBias-Variance Tradeoff
What s learning, revisited Overfitting Generative versus Discriminative Logistic Regression Machine Learning 10701/15781 Carlos Guestrin Carnegie Mellon University September 19 th, 2007 Bias-Variance Tradeoff
More informationStatistical learning. Chapter 20, Sections 1 4 1
Statistical learning Chapter 20, Sections 1 4 Chapter 20, Sections 1 4 1 Outline Bayesian learning Maximum a posteriori and maximum likelihood learning Bayes net learning ML parameter learning with complete
More informationMachine Learning 2017
Machine Learning 2017 Volker Roth Department of Mathematics & Computer Science University of Basel 21st March 2017 Volker Roth (University of Basel) Machine Learning 2017 21st March 2017 1 / 41 Section
More informationL11: Pattern recognition principles
L11: Pattern recognition principles Bayesian decision theory Statistical classifiers Dimensionality reduction Clustering This lecture is partly based on [Huang, Acero and Hon, 2001, ch. 4] Introduction
More information9 Classification. 9.1 Linear Classifiers
9 Classification This topic returns to prediction. Unlike linear regression where we were predicting a numeric value, in this case we are predicting a class: winner or loser, yes or no, rich or poor, positive
More informationConvex Optimization M2
Convex Optimization M2 Lecture 8 A. d Aspremont. Convex Optimization M2. 1/57 Applications A. d Aspremont. Convex Optimization M2. 2/57 Outline Geometrical problems Approximation problems Combinatorial
More informationClick Prediction and Preference Ranking of RSS Feeds
Click Prediction and Preference Ranking of RSS Feeds 1 Introduction December 11, 2009 Steven Wu RSS (Really Simple Syndication) is a family of data formats used to publish frequently updated works. RSS
More informationLeast Squares Regression
E0 70 Machine Learning Lecture 4 Jan 7, 03) Least Squares Regression Lecturer: Shivani Agarwal Disclaimer: These notes are a brief summary of the topics covered in the lecture. They are not a substitute
More informationLinear Models in Machine Learning
CS540 Intro to AI Linear Models in Machine Learning Lecturer: Xiaojin Zhu jerryzhu@cs.wisc.edu We briefly go over two linear models frequently used in machine learning: linear regression for, well, regression,
More informationCS 6375 Machine Learning
CS 6375 Machine Learning Nicholas Ruozzi University of Texas at Dallas Slides adapted from David Sontag and Vibhav Gogate Course Info. Instructor: Nicholas Ruozzi Office: ECSS 3.409 Office hours: Tues.
More information10-701/ Machine Learning - Midterm Exam, Fall 2010
10-701/15-781 Machine Learning - Midterm Exam, Fall 2010 Aarti Singh Carnegie Mellon University 1. Personal info: Name: Andrew account: E-mail address: 2. There should be 15 numbered pages in this exam
More informationDiscriminative Models
No.5 Discriminative Models Hui Jiang Department of Electrical Engineering and Computer Science Lassonde School of Engineering York University, Toronto, Canada Outline Generative vs. Discriminative models
More informationData Mining Techniques
Data Mining Techniques CS 6220 - Section 2 - Spring 2017 Lecture 6 Jan-Willem van de Meent (credit: Yijun Zhao, Chris Bishop, Andrew Moore, Hastie et al.) Project Project Deadlines 3 Feb: Form teams of
More informationBayesian Methods: Naïve Bayes
Bayesian Methods: aïve Bayes icholas Ruozzi University of Texas at Dallas based on the slides of Vibhav Gogate Last Time Parameter learning Learning the parameter of a simple coin flipping model Prior
More informationMIDTERM: CS 6375 INSTRUCTOR: VIBHAV GOGATE October,
MIDTERM: CS 6375 INSTRUCTOR: VIBHAV GOGATE October, 23 2013 The exam is closed book. You are allowed a one-page cheat sheet. Answer the questions in the spaces provided on the question sheets. If you run
More information