Machine Learning: A Statistics and Optimization Perspective
1 Machine Learning: A Statistics and Optimization Perspective Nan Ye Mathematical Sciences School Queensland University of Technology 1 / 109
2 What is Machine Learning? 2 / 109
3 Machine Learning Machine learning turns data into insight, predictions and/or decisions. Numerous applications in diverse areas, including natural language processing, computer vision, recommender systems, medical diagnosis. 3 / 109
4 A Much Sought-after Technology 4 / 109
5 Enabled Applications Make reminders by talking to your phone. 5 / 109
6 Tell the car where you want to go and the car takes you there. 6 / 109
7 Check your emails, see some spam in your inbox, mark it as spam, and similar spam will not show up. 7 / 109
8 Video recommendations. 8 / 109
9 Play Go against the computer. 9 / 109
10 Tutorial Objective Essentials for crafting basic machine learning systems. Formulate applications as machine learning problems Classification, regression, density estimation, clustering... Understand and apply basic learning algorithms Least squares regression, logistic regression, support vector machines, K-means,... Theoretical understanding Position and compare the problems and algorithms in a unifying statistical framework Have fun... 10 / 109
11 Outline A statistics and optimization perspective Statistical learning theory Regression Model selection Classification Clustering 11 / 109
12 Hands-on An exercise on using WEKA. Toolkits: WEKA (Java), H2O (Java), scikit-learn (Python), CRAN (R). Some technical details are left as exercises. These are tagged with (verify). 12 / 109
13 A Statistics and Optimization Perspective Illustrations Learning a binomial distribution Learning a Gaussian distribution 13 / 109
14 Learning a Binomial Distribution I pick a coin with the probability of heads being θ. I flip it 100 times for you and you see a dataset D of 70 heads and 30 tails, can you learn θ? 14 / 109
15 Learning a Binomial Distribution I pick a coin with the probability of heads being θ. I flip it 100 times for you and you see a dataset D of 70 heads and 30 tails, can you learn θ? Maximum likelihood estimation The likelihood of θ is P(D|θ) = θ^70 (1−θ)^30. Learning θ is an optimization problem: θ_ml = arg max_θ P(D|θ) = arg max_θ ln P(D|θ) = arg max_θ (70 ln θ + 30 ln(1−θ)). 14 / 109
16 θ_ml = arg max_θ (70 ln θ + 30 ln(1−θ)). Setting the derivative of the log-likelihood to 0, we have 70/θ − 30/(1−θ) = 0, so θ_ml = 70/(70+30) = 0.7. 15 / 109
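A quick numerical sanity check of this closed form, written as a minimal Python/NumPy sketch (the 70 heads / 30 tails dataset is the one from the slide; the grid search is only there to confirm the analytic answer):

```python
import numpy as np

# Dataset from the slide: 70 heads, 30 tails.
n_heads, n_tails = 70, 30

# Log-likelihood of theta: 70 ln(theta) + 30 ln(1 - theta).
def log_likelihood(theta):
    return n_heads * np.log(theta) + n_tails * np.log(1 - theta)

# Numerical maximization over a fine grid of candidate values.
grid = np.linspace(1e-6, 1 - 1e-6, 100001)
theta_grid = grid[np.argmax(log_likelihood(grid))]

# Analytic maximum likelihood estimate: 70 / (70 + 30).
theta_ml = n_heads / (n_heads + n_tails)

print(theta_grid, theta_ml)  # both close to 0.7
```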
17 Learning a Gaussian distribution I pick a Gaussian N(µ, σ^2) and give you a bunch of data D = {x_1, ..., x_n} independently drawn from it. Can you learn µ and σ? [Figure: the density f(X) plotted against X.] 16 / 109
18 Learning a Gaussian distribution I pick a Gaussian N(µ, σ^2) and give you a bunch of data D = {x_1, ..., x_n} independently drawn from it. Can you learn µ and σ? [Figure: the density f(X) plotted against X.] P(x|µ, σ) = (1/(σ√(2π))) e^{−(x−µ)^2/(2σ^2)}. 16 / 109
19 Maximum likelihood estimation ln P(D|µ, σ) = Σ_i ln( (1/(σ√(2π))) exp(−(x_i−µ)^2/(2σ^2)) ) = −n ln(σ√(2π)) − Σ_i (x_i−µ)^2/(2σ^2). Set derivative w.r.t. µ to 0: Σ_i (x_i−µ)/σ^2 = 0, so µ_ml = (1/n) Σ_{i=1}^n x_i. Set derivative w.r.t. σ to 0: −n/σ + Σ_i (x_i−µ)^2/σ^3 = 0, so σ^2_ml = (1/n) Σ_{i=1}^n (x_i−µ_ml)^2. 17 / 109
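The Gaussian case admits the same kind of check. A minimal Python/NumPy sketch (the data here are made up, drawn from N(2, 1.5^2)) that compares the closed-form estimates with the true parameters:

```python
import numpy as np

rng = np.random.default_rng(0)
mu_true, sigma_true = 2.0, 1.5
x = rng.normal(mu_true, sigma_true, size=10000)   # D = {x_1, ..., x_n}

# Maximum likelihood estimates derived above.
n = len(x)
mu_ml = x.sum() / n                                # (1/n) sum_i x_i
sigma2_ml = ((x - mu_ml) ** 2).sum() / n           # (1/n) sum_i (x_i - mu_ml)^2

print(mu_ml, np.sqrt(sigma2_ml))  # close to 2.0 and 1.5
```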
20 What You Need to Know... Learning is... Collect some data, e.g. coin flips. Choose a hypothesis class, e.g. binomial distribution. Choose a loss function, e.g. negative log-likelihood. Choose an optimization procedure, e.g. set derivative to 0. Have fun... Statistics and optimization provide powerful tools for formulating and solving machine learning problems. 18 / 109
21 Statistical Learning Theory The framework Applications in classification, regression, and density estimation Does empirical risk minimization work? 19 / 109
22 There is nothing more practical than a good theory. Kurt Lewin...at least in the problems of statistical inference. Vladimir Vapnik 20 / 109
23 Learning... H. Simon: Any process by which a system improves its performance. M. Minsky: Learning is making useful changes in our minds. R. Michalski: Learning is constructing or modifying representations of what is being experienced. L. Valiant: Learning is the process of knowledge acquisition in the absence of explicit programming. 21 / 109
24 A Probabilistic Framework Data Training examples z_1, ..., z_n are i.i.d. drawn from a fixed but unknown distribution P(Z) on Z. e.g. outcomes of coin flips. Hypothesis space H e.g. head probability θ ∈ [0, 1]. Loss function L(z, h) measures the penalty for hypothesis h on example z. e.g. log-loss L(z, θ) = −ln P(z|θ) = { −ln(θ), z = H; −ln(1−θ), z = T }. 22 / 109
25 Expected risk The expected risk of h is E(L(Z, h)). We want to find the hypothesis with minimum expected risk, arg min_{h∈H} E(L(Z, h)). Empirical risk minimization (ERM) Minimize the empirical risk R_n(h) := (1/n) Σ_i L(z_i, h) over h ∈ H. e.g. choose θ to minimize −70 ln θ − 30 ln(1−θ). 23 / 109
26 This provides a unified formulation for many machine learning problems, which differ in the data domain Z, the choice of the hypothesis space H, and the choice of loss function L. Most algorithms that we see later can be seen as special cases of ERM. 24 / 109
27 Classification predict a discrete class Digit recognition: image to {0, 1,..., 9}. Spam filter: to {spam, not spam}. 25 / 109
28 Given D = {(x_1, y_1), ..., (x_n, y_n)} ⊆ X × Y, find a classifier f that maps an input x ∈ X to a class y ∈ Y. We usually use the 0/1 loss L((x, y), h) = I(h(x) ≠ y) = { 1, h(x) ≠ y; 0, h(x) = y }. ERM chooses the classifier with minimum classification error min_{h∈H} (1/n) Σ_i I(h(x_i) ≠ y_i). 26 / 109
29 Regression predict a numerical value Stock market prediction: predict stock price using recent trading data 27 / 109
30 Given D = {(x_1, y_1), ..., (x_n, y_n)} ⊆ X × R, find a function f that maps an input x ∈ X to a value y ∈ R. We usually use the quadratic loss L((x, y), h) = (y − h(x))^2. ERM is often called the method of least squares in this case: min_{h∈H} (1/n) Σ_i (y_i − h(x_i))^2. 28 / 109
31 Density Estimation E.g. learning a binomial distribution, or a Gaussian distribution. [Figure: the density f(X) plotted against X.] We often use the log-loss L(x, h) = −ln p(x|h). ERM is MLE in this case. 29 / 109
32 Does ERM Work? Estimation error How does the empirically best hypothesis h_n = arg min_{h∈H} R_n(h) compare with the best in the hypothesis space? Specifically, how large is the estimation error R(h_n) − inf_{h∈H} R(h)? Consistency: Does R(h_n) converge to inf_{h∈H} R(h) as n → ∞? 30 / 109
33 Does ERM Work? Estimation error How does the empirically best hypothesis h_n = arg min_{h∈H} R_n(h) compare with the best in the hypothesis space? Specifically, how large is the estimation error R(h_n) − inf_{h∈H} R(h)? Consistency: Does R(h_n) converge to inf_{h∈H} R(h) as n → ∞? If H is finite, ERM is likely to pick the function with minimal expected risk when n is large, because then R_n(h) is close to R(h) for all h ∈ H. If H is infinite, we can still show that ERM is likely to choose a near-optimal hypothesis if H has finite complexity (such as VC-dimension). 30 / 109
34 Approximation error How good is the best hypothesis in H? That is, how large is the approximation error inf_{h∈H} R(h) − inf_h R(h)? 31 / 109
35 Approximation error How good is the best hypothesis in H? That is, how large is the approximation error inf_{h∈H} R(h) − inf_h R(h)? Trade-off between estimation error and approximation error: A larger hypothesis space implies smaller approximation error, but larger estimation error. A smaller hypothesis space implies larger approximation error, but smaller estimation error. 31 / 109
36 Optimization error Is the optimization algorithm computing the empirically best hypothesis exact? 32 / 109
37 Optimization error Is the optimization algorithm computing the empirically best hypothesis exact? While ERM can be efficiently implemented in many cases, there are also computationally intractable cases, and efficient approximations are sought. The performance gap between the sub-optimal hypothesis and the empirically best hypothesis is the optimization error. 32 / 109
38 What You Need to Know... Recognise machine learning problems as special cases of the general statistical learning problem. Understand that the performance of ERM depends on the approximation error, estimation error and optimization error. 33 / 109
39 Ordinary least squares Ridge regression Basis function method Regression function Nearest neighbor regression Kernel regression Classification as regression Regression 34 / 109
40 Ordinary Least Squares Find a best fitting hyperplane for (x_1, y_1), ..., (x_n, y_n) ∈ R^d × R. 35 / 109
41 OLS finds a hyperplane minimizing the sum of squared errors β_n = arg min_{β∈R^d} Σ_{i=1}^n (x_i^T β − y_i)^2. A special case of function learning using ERM The input set is X = R^d, and the output set is Y = R. The hypothesis space is the set of hyperplanes H = {x^T β : β ∈ R^d}. Quadratic loss is used, as is typical in regression. 36 / 109
42 Empirically best hyperplane. The solution to OLS is β_n = (X^T X)^{−1} X^T y, where X is the n × d matrix with x_i as the i-th row, and y = (y_1, ..., y_n)^T. The formula holds when X^T X is non-singular. When X^T X is singular, there are infinitely many possible values for β_n. They can be obtained by solving the linear system (X^T X)β = X^T y. 37 / 109
43 Proof. The empirical risk is (ignoring a factor of 1/n) R_n(β) = Σ_{i=1}^n (x_i^T β − y_i)^2 = ||Xβ − y||_2^2. Setting the gradient of R_n to 0, ∇R_n = 2X^T(Xβ − y) = 0, (verify) we have β_n = (X^T X)^{−1} X^T y. 38 / 109
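A minimal NumPy sketch of the closed form on synthetic data (the solution is computed by solving the linear system (X^T X)β = X^T y rather than forming the inverse explicitly; the data and dimensions are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 3
X = rng.normal(size=(n, d))                    # n x d design matrix
beta_true = np.array([1.0, -2.0, 0.5])
y = X @ beta_true + 0.1 * rng.normal(size=n)   # noisy linear responses

# OLS solution beta_n = (X^T X)^{-1} X^T y, via the normal equations.
beta_n = np.linalg.solve(X.T @ X, X.T @ y)
print(beta_n)  # close to beta_true
```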
44 Optimal hyperplane. The hyperplane β* = E(XX^T)^{−1} E(XY) minimizes the expected quadratic loss among all hyperplanes. 39 / 109
45 Optimal hyperplane. The hyperplane β* = E(XX^T)^{−1} E(XY) minimizes the expected quadratic loss among all hyperplanes. Proof. The expected quadratic loss of a hyperplane β is R(β) = E((β^T X − Y)^2) = E(β^T XX^T β − 2β^T XY + Y^2) = β^T E(XX^T)β − 2β^T E(XY) + E(Y^2). Setting the gradient of R to 0, we have ∇R(β) = 2 E(XX^T)β − 2 E(XY) = 0, so β* = E(XX^T)^{−1} E(XY). 39 / 109
46 Consistency. We can show that least squares linear regression is consistent, that is, R(β_n) → R(β*) in probability, by using the law of large numbers. 40 / 109
47 Least Squares as MLE Consider the class of conditional distributions {p_β(Y|X) : β ∈ R^d}, where p_β(Y|X = x) = N(Y; x^T β, σ^2) := (1/(√(2π)σ)) e^{−(Y−x^T β)^2/(2σ^2)}, with σ being a constant. The (conditional) likelihood of β is L_n(β) = p_β(y_1|x_1) ... p_β(y_n|x_n). Maximizing the likelihood L_n(β) gives the same β_n as given by the method of least squares. (verify) 41 / 109
48 Ridge Regression When collinearity is present, the matrix X^T X may be singular or close to singular, making the solution unreliable. Ridge regression We add a quadratic/ℓ2 regularizer λ||β||_2^2 to the OLS objective, where λ > 0 is a fixed constant: β_n = arg min_{β∈R^d} ( Σ_{i=1}^n (x_i^T β − y_i)^2 + λ||β||_2^2 ). Empirically optimal hyperplane β_n = (λI + X^T X)^{−1} X^T y. (verify) The matrix λI + X^T X is non-singular (verify), and thus there is always a unique solution. 42 / 109
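A sketch of the ridge solution in the same NumPy setup (here two columns are made nearly collinear so that X^T X is close to singular; λ = 1.0 is an arbitrary choice):

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, lam = 100, 5, 1.0
X = rng.normal(size=(n, d))
# Make two columns nearly collinear so that X^T X is close to singular.
X[:, 1] = X[:, 0] + 1e-6 * rng.normal(size=n)
y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=n)

# Ridge solution beta_n = (lam * I + X^T X)^{-1} X^T y.
beta_ridge = np.linalg.solve(lam * np.eye(d) + X.T @ X, X.T @ y)
print(beta_ridge)
```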
49 Regression with a Bias So far we have only considered hyperplanes of the form y = x^T β, which pass through the origin (green line). Considering hyperplanes with a bias term, that is, hyperplanes of the form y = x^T β + b, is more useful (red line). 43 / 109
50 OLS with a bias solves (b_n, β_n) = arg min_{b∈R, β∈R^d} Σ_{i=1}^n (x_i^T β + b − y_i)^2, where b is the bias. 44 / 109
51 OLS with a bias solves (b_n, β_n) = arg min_{b∈R, β∈R^d} Σ_{i=1}^n (x_i^T β + b − y_i)^2, where b is the bias. Solution. Reduce it to regression without a bias term by replacing x^T β + b with (1, x^T)(b, β^T)^T, i.e., prepend a constant feature 1 to each input. 44 / 109
52 Ridge regression with a bias solves (b_n, β_n) = arg min_{b∈R, β∈R^d} ( Σ_{i=1}^n (x_i^T β + b − y_i)^2 + λ||β||_2^2 ). 45 / 109
53 Ridge regression with a bias solves (b_n, β_n) = arg min_{b∈R, β∈R^d} ( Σ_{i=1}^n (x_i^T β + b − y_i)^2 + λ||β||_2^2 ). Solution. Reduce it to ridge regression without a bias term as follows. Let x̂_i = x_i − x̄ and ŷ_i = y_i − ȳ, where x̄ = Σ_{i=1}^n x_i/n and ȳ = Σ_{i=1}^n y_i/n; then β_n = arg min_{β∈R^d} ( Σ_{i=1}^n (x̂_i^T β − ŷ_i)^2 + λ||β||_2^2 ), b_n = ȳ − x̄^T β_n. (verify) 45 / 109
54 Basis Function Method We can use linear regression to learn complex regression functions Choose some basis functions g 1,..., g k : R d R. Transform each input x to (g 1 (x),..., g k (x)). Perform linear regression on the transformed data. 46 / 109
55 Basis Function Method We can use linear regression to learn complex regression functions Choose some basis functions g_1, ..., g_k : R^d → R. Transform each input x to (g_1(x), ..., g_k(x)). Perform linear regression on the transformed data. Examples Linear regression: use basis functions g_1, ..., g_d with g_i(x) = x_i, and g_0(x) = 1. Quadratic functions: use basis functions of the above form, together with basis functions of the form g_ij(x) = x_i x_j for all 1 ≤ i ≤ j ≤ d. 46 / 109
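A minimal sketch of the basis function method for a one-dimensional quadratic fit (basis functions g_0(x) = 1, g_1(x) = x, g_2(x) = x^2; the data are made up):

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.uniform(-3, 3, size=100)
y = 1.0 - 2.0 * x + 0.5 * x**2 + 0.2 * rng.normal(size=100)

# Transform each scalar input x into (g_0(x), g_1(x), g_2(x)) = (1, x, x^2).
Phi = np.column_stack([np.ones_like(x), x, x**2])

# Ordinary least squares on the transformed data.
beta, *_ = np.linalg.lstsq(Phi, y, rcond=None)
print(beta)  # close to (1.0, -2.0, 0.5)
```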
56 Regression Function The minimizer of the expected quadratic loss is the regression function h*(x) = E(Y|x). 47 / 109
57 Regression Function The minimizer of the expected quadratic loss is the regression function h*(x) = E(Y|x). Proof. The expected quadratic loss of a function h is E((h(X) − Y)^2) = E_X( h(X)^2 − 2h(X) E(Y|X) + E(Y^2|X) ). Hence we can set the value of h(x) independently for each x by choosing it to minimize the expression under the expectation. This leads to h*(x) = E(Y|x). 47 / 109
58 k nearest neighbor (knn) knn approximates the regression function using h_n(x) = avg(y_i | x_i ∈ N_k(x)), which is the average of the values of the set N_k(x) of the k nearest neighbors of x in the training data. 48 / 109
59 k nearest neighbor (knn) knn approximates the regression function using h_n(x) = avg(y_i | x_i ∈ N_k(x)), which is the average of the values of the set N_k(x) of the k nearest neighbors of x in the training data. Under mild conditions, as k → ∞ and n/k → ∞, h_n(x) → h*(x), for any distribution P(X, Y). (Curse of dimensionality) The number of samples required for accurate approximation is exponential in the dimension. knn is a non-parametric method, while linear regression is a parametric method. 48 / 109
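A minimal sketch of knn regression with Euclidean distance (the helper knn_predict is a name made up for this example; brute-force neighbour search on made-up data):

```python
import numpy as np

def knn_predict(x_train, y_train, x_query, k):
    """Average the y values of the k nearest training points (Euclidean distance)."""
    dists = np.linalg.norm(x_train - x_query, axis=1)   # distances to all training points
    nearest = np.argsort(dists)[:k]                     # indices of the k nearest neighbours
    return y_train[nearest].mean()

rng = np.random.default_rng(3)
x_train = rng.uniform(0, 10, size=(200, 1))
y_train = np.sin(x_train[:, 0]) + 0.1 * rng.normal(size=200)

print(knn_predict(x_train, y_train, np.array([5.0]), k=10))  # roughly sin(5)
```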
60 Kernel Regression h_n(x) = Σ_{i=1}^n K(x, x_i) y_i / Σ_{i=1}^n K(x, x_i), where K(x, x') is a function measuring the similarity between x and x', and is often called a kernel function. 49 / 109
61 Kernel Regression h_n(x) = Σ_{i=1}^n K(x, x_i) y_i / Σ_{i=1}^n K(x, x_i), where K(x, x') is a function measuring the similarity between x and x', and is often called a kernel function. Example kernel functions Gaussian kernel K_λ(x, x') = (1/λ) exp(−||x − x'||_2^2 / (2λ)). knn kernel K_k(x, x') = I( ||x − x'|| ≤ max_{x''∈N_k(x)} ||x − x''|| ). Note that this kernel is data-dependent and non-symmetric. 49 / 109
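A sketch of kernel regression with the Gaussian kernel above (the helper names and the bandwidth λ = 0.5 are arbitrary choices; data are made up):

```python
import numpy as np

def gaussian_kernel(x, xp, lam=0.5):
    # K_lambda(x, x') = (1/lambda) * exp(-||x - x'||^2 / (2 * lambda))
    return np.exp(-np.sum((x - xp) ** 2, axis=-1) / (2 * lam)) / lam

def kernel_regression(x_train, y_train, x_query, lam=0.5):
    # h_n(x) = sum_i K(x, x_i) y_i / sum_i K(x, x_i)
    w = gaussian_kernel(x_train, x_query, lam)
    return np.sum(w * y_train) / np.sum(w)

rng = np.random.default_rng(4)
x_train = rng.uniform(0, 10, size=(200, 1))
y_train = np.sin(x_train[:, 0]) + 0.1 * rng.normal(size=200)

print(kernel_regression(x_train, y_train, np.array([5.0])))  # roughly sin(5)
```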
62 Binary Classification as Regression Label one class as −1 and the other as +1. Fit a function f(x) using least squares regression. Given a test example x, predict −1 if f(x) < 0 and +1 otherwise. 50 / 109
63 What You Need to Know... Regression function. Parametric methods: Ordinary least squares, ridge regression, basis function method. Non-parametric methods: knn, kernel regression. 51 / 109
64 Model Selection 52 / 109
65 Bias-Variance Tradeoff The predicted value Ŷ at a fixed point x can be considered a random function of the training set. Let Y be the true value at x. The expected prediction error E((Y − Ŷ)^2) is a property of the model class. 53 / 109
66 Bias-Variance Tradeoff The predicted value Ŷ at a fixed point x can be considered a random function of the training set. Let Y be the true value at x. The expected prediction error E((Y − Ŷ)^2) is a property of the model class. Bias-variance decomposition Expected prediction error E((Y − Ŷ)^2) = E((Ŷ − E(Ŷ))^2) [variance] + (E(Ŷ) − E(Y))^2 [bias squared] + E((Y − E(Y))^2) [irreducible noise]. Proof. Expand the RHS and simplify. 53 / 109
67 Bias-Variance Tradeoff The predicted value Ŷ at a fixed point x can be considered a random function of the training set. Let Y be the true value at x. The expected prediction error E((Y − Ŷ)^2) is a property of the model class. Bias-variance decomposition Expected prediction error E((Y − Ŷ)^2) = E((Ŷ − E(Ŷ))^2) [variance] + (E(Ŷ) − E(Y))^2 [bias squared] + E((Y − E(Y))^2) [irreducible noise]. Proof. Expand the RHS and simplify. Bias-variance tradeoff In general, as model complexity increases (i.e., the hypothesis becomes more complex), variance tends to increase, and bias tends to decrease. 53 / 109
68 Bias-variance Tradeoff in knn Assumption Suppose Y|X ∼ N(f(X), σ^2) for some function f and some fixed σ. In addition, suppose x_1, ..., x_n are fixed. 54 / 109
69 Bias-variance Tradeoff in knn Assumption Suppose Y|X ∼ N(f(X), σ^2) for some function f and some fixed σ. In addition, suppose x_1, ..., x_n are fixed. Bias and variance At x, Ŷ = (1/k) Σ_{x_i∈N_k(x)} y_i is predicted and the true value is Y. bias = E(Ŷ) − E(Y) = (1/k) Σ_{x_i∈N_k(x)} f(x_i) − f(x), variance = E((Ŷ − E(Ŷ))^2) = σ^2/k. 54 / 109
70 bias = (1/k) Σ_{x_i∈N_k(x)} f(x_i) − f(x), variance = σ^2/k. 1/k as a complexity measure This is because with smaller 1/k, h_n(x) = avg(y_i | x_i ∈ N_k(x)) is closer to a constant, and thus model complexity is smaller. Bias-variance trade-off As 1/k increases (or as model complexity increases), bias is likely to decrease, and variance increases. 55 / 109
72 Model Selection Assume model complexity is controlled by some parameter θ. How to pick the best value among candidates θ 0,..., θ m? 57 / 109
73 Model Selection Assume model complexity is controlled by some parameter θ. How to pick the best value among candidates θ 0,..., θ m? Using a development set Split available data into a training set T and a development set D. For each θ i, train a model on T, and test it on D. Choose the parameter with best performance. A lot of data is needed, while the amount may be limited. 57 / 109
74 K-fold cross validation Split the training data into K folds. For each θ_i, train K models, with each trained on K − 1 folds and tested on the remaining fold. Choose the parameter with best average performance. Computationally more expensive than using a development set. 58 / 109
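A sketch of K-fold cross-validation for choosing the ridge parameter λ (the helper names, candidate values, and data are arbitrary; this is only a schematic of the procedure):

```python
import numpy as np

def ridge_fit(X, y, lam):
    d = X.shape[1]
    return np.linalg.solve(lam * np.eye(d) + X.T @ X, X.T @ y)

def cv_error(X, y, lam, K=5):
    """Average held-out squared error of ridge regression over K folds."""
    n = X.shape[0]
    folds = np.array_split(np.random.default_rng(0).permutation(n), K)
    errs = []
    for fold in folds:
        train = np.setdiff1d(np.arange(n), fold)       # the other K-1 folds
        beta = ridge_fit(X[train], y[train], lam)
        errs.append(np.mean((X[fold] @ beta - y[fold]) ** 2))
    return np.mean(errs)

rng = np.random.default_rng(5)
X = rng.normal(size=(100, 4))
y = X @ np.array([1.0, 0.0, -1.0, 2.0]) + 0.5 * rng.normal(size=100)

candidates = [0.01, 0.1, 1.0, 10.0]
best_lam = min(candidates, key=lambda lam: cv_error(X, y, lam))
print(best_lam)
```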
76 What You Need to Know... The expected prediction error is a function of the model complexity. The bias-variance decomposition implies that minimizing the expected prediction error requires careful tuning of the model complexity. Using a development set and cross-validation are two basic methods for model selection. 60 / 109
77 Recap A statistics and optimization perspective Learning a binomial and a Gaussian Statistical learning theory Data, hypotheses, loss function, expected risk, empirical risk... Regression Regression function, parametric regression, non-parametric regression... Model selection Bias-variance tradeoff, development set, cross-validation... Classification Clustering 61 / 109
78 Bayes optimal classifier Nearest neighbor classifier Naive Bayes classifier Logistic regression The perceptron Support vector machines Classification 62 / 109
79 Recall Classification Output a function f : X → Y where Y is a finite set. 0/1 loss L((x, y), h) = I(y ≠ h(x)) = { 1, y ≠ h(x); 0, y = h(x) }. Expected/True risk R(h) = E(L((X, Y), h)). 63 / 109
80 Bayes Optimal Classifier The expected 0/1 loss is minimized by the Bayes optimal classifier h*(x) = arg max_{y∈Y} P(y|x). 64 / 109
81 Bayes Optimal Classifier The expected 0/1 loss is minimized by the Bayes optimal classifier h*(x) = arg max_{y∈Y} P(y|x). Proof. The expected 0/1 loss of a classifier h(x) is E(L((X, Y), h)) = E_X E_{Y|X}(I(Y ≠ h(X))) = E_X P(Y ≠ h(X) | X). Hence we can set the value of h(x) independently for each x by choosing it to minimize the expression under the expectation. This leads to h*(x) = arg max_{y∈Y} P(y|x). 64 / 109
82 The Bayes optimal classifier is h*(x) = arg max_{y∈Y} P(y|x). However, P(Y|x) is unknown... Idea. Estimate P(y|x) from data. 65 / 109
83 Nearest Neighbor Classifier Approximate P(y|x) using the label distribution in {y_i | x_i ∈ N_k(x)}, where N_k(x) consists of the k nearest examples of x (with respect to some distance measure), and predict the majority label h_n(x) = majority{y_i | x_i ∈ N_k(x)}. 66 / 109
84 Nearest Neighbor Classifier Approximate P(y|x) using the label distribution in {y_i | x_i ∈ N_k(x)}, where N_k(x) consists of the k nearest examples of x (with respect to some distance measure), and predict the majority label h_n(x) = majority{y_i | x_i ∈ N_k(x)}. Under mild conditions, as k → ∞ and n/k → ∞, h_n(x) → h*(x), for any distribution P(X, Y). (Curse of dimensionality) The number of samples required for accurate approximation is exponential in the dimension. 66 / 109
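A minimal sketch of the k nearest neighbor classifier with a majority vote (Euclidean distance, brute force; the helper name and the two-blob data are made up):

```python
import numpy as np
from collections import Counter

def knn_classify(x_train, y_train, x_query, k):
    """Predict the majority label among the k nearest training examples."""
    dists = np.linalg.norm(x_train - x_query, axis=1)
    nearest = np.argsort(dists)[:k]
    return Counter(y_train[nearest].tolist()).most_common(1)[0][0]

rng = np.random.default_rng(6)
# Two Gaussian blobs as classes 0 and 1.
x0 = rng.normal(loc=0.0, size=(50, 2))
x1 = rng.normal(loc=3.0, size=(50, 2))
x_train = np.vstack([x0, x1])
y_train = np.array([0] * 50 + [1] * 50)

print(knn_classify(x_train, y_train, np.array([2.5, 2.5]), k=15))  # most likely 1
```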
85 [Figure: decision boundaries of the 1-NN classifier, the 15-NN classifier, and the Bayes optimal classifier.] 67 / 109
86 Naive Bayes Classifier (NB) Model X = X_1 × ... × X_d, where each X_i is a finite set. A model p(X, Y) satisfies the independence assumption p(x_1, ..., x_d | y) = p(x_1|y) ... p(x_d|y). 68 / 109
87 Naive Bayes Classifier (NB) Model X = X_1 × ... × X_d, where each X_i is a finite set. A model p(X, Y) satisfies the independence assumption p(x_1, ..., x_d | y) = p(x_1|y) ... p(x_d|y). Classification An example x = (x_1, ..., x_d) is classified as y = arg max_{y∈Y} p(y|x). This is equivalent to y = arg max_{y∈Y} p(y, x) = arg max_{y∈Y} p(y) p(x_1|y) ... p(x_d|y), by the independence assumption. 68 / 109
88 Learning (MLE) The maximum likelihood Naive Bayes model is p̂(X, Y) given by p̂(y) = n_y/n, p̂(x_i|y) = n_{y,x_i}/n_y, where n_y is the number of times class y appears in the training set, and n_{y,x_i} is the number of times attribute i takes value x_i when the class label is y. (verify) 69 / 109
89 Learning (MLE) The maximum likelihood Naive Bayes model is p̂(X, Y) given by p̂(y) = n_y/n, p̂(x_i|y) = n_{y,x_i}/n_y, where n_y is the number of times class y appears in the training set, and n_{y,x_i} is the number of times attribute i takes value x_i when the class label is y. (verify) Issues The independence assumption is unlikely to be satisfied. The counts n_y may be 0, making the estimates undefined. The counts may be very small, leading to unstable estimates. 69 / 109
90 Laplace correction p̂(y) = (n_y + c_0) / Σ_{y'∈Y}(n_{y'} + c_0), p̂(x_i|y) = (n_{y,x_i} + c_1) / Σ_{x'_i∈X_i}(n_{y,x'_i} + c_1), where c_0 > 0 and c_1 > 0 are user-chosen constants. Laplace correction makes NB more stable, but still relies on the strong independence assumption. 70 / 109
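A sketch of the Naive Bayes estimates with Laplace correction on a tiny made-up dataset of binary attributes (c_0 = c_1 = 1 and the data are arbitrary choices):

```python
import numpy as np

# Tiny made-up dataset: 6 examples, 3 binary attributes, 2 classes.
X = np.array([[1, 0, 1],
              [1, 1, 1],
              [0, 0, 1],
              [0, 1, 0],
              [0, 0, 0],
              [1, 0, 0]])
y = np.array([1, 1, 1, 0, 0, 0])
classes, values, c0, c1 = [0, 1], [0, 1], 1.0, 1.0

# Class priors with Laplace correction: p(y) = (n_y + c0) / sum_y' (n_y' + c0).
n_y = {c: np.sum(y == c) for c in classes}
prior = {c: (n_y[c] + c0) / (len(y) + c0 * len(classes)) for c in classes}

# Conditionals with Laplace correction: p(x_i = v | y) = (n_{y,v} + c1) / sum_v' (n_{y,v'} + c1).
cond = {c: [{v: (np.sum(X[y == c, i] == v) + c1) / (n_y[c] + c1 * len(values))
             for v in values} for i in range(X.shape[1])] for c in classes}

def nb_classify(x):
    # arg max_y p(y) * prod_i p(x_i | y)
    score = {c: prior[c] * np.prod([cond[c][i][x[i]] for i in range(len(x))]) for c in classes}
    return max(score, key=score.get)

print(nb_classify([1, 0, 1]))  # most likely class 1
```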
91 Logistic Regression (LR) Model X = R^d. Logistic regression estimates conditional distributions of the form p(y|x, θ) = exp(x^T θ_y) / Σ_{y'∈Y} exp(x^T θ_{y'}), where θ_y = (θ_{y1}, ..., θ_{yd}) ∈ R^d, and θ is the concatenation of the θ_y's. 71 / 109
92 Logistic Regression (LR) Model X = R^d. Logistic regression estimates conditional distributions of the form p(y|x, θ) = exp(x^T θ_y) / Σ_{y'∈Y} exp(x^T θ_{y'}), where θ_y = (θ_{y1}, ..., θ_{yd}) ∈ R^d, and θ is the concatenation of the θ_y's. Classification An example x is classified as y = arg max_{y∈Y} p(y|x, θ). 71 / 109
93 Learning Training is often done by maximizing the regularized log-likelihood L(θ) = log Π_{i=1}^n p(y_i|x_i, θ) − λ||θ||_2^2. That is, the parameter estimate is θ_n = arg max_θ L(θ). L(θ) is a concave function, and can be optimized using standard numerical methods (such as L-BFGS). 72 / 109
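A sketch of binary logistic regression trained by plain gradient ascent on the regularized log-likelihood (gradient ascent stands in here for a method such as L-BFGS; the step size, λ, and data are arbitrary choices):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(7)
n, d, lam, lr = 200, 2, 0.1, 0.5
X = np.vstack([rng.normal(-1.0, 1.0, size=(n // 2, d)),
               rng.normal(+1.0, 1.0, size=(n // 2, d))])
y = np.array([0] * (n // 2) + [1] * (n // 2))

theta = np.zeros(d)
for _ in range(1000):
    p = sigmoid(X @ theta)                     # p(y = 1 | x, theta)
    grad = X.T @ (y - p) - 2 * lam * theta     # gradient of log-likelihood - lam * ||theta||^2
    theta += lr * grad / n                     # gradient ascent step

pred = (sigmoid(X @ theta) >= 0.5).astype(int)
print(np.mean(pred == y))  # training accuracy, high for well-separated classes
```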
94 Comparing knn, NB and LR All approximate the Bayes-optimal classifier: Learn an approximation P̂(X, Y) to P(X, Y) (or learn an approximation P̂(Y|X) to P(Y|X)). Choose the function h_n(x) = arg max_y P̂(y|x). knn and logistic regression estimate P(Y|X), while naive Bayes estimates P(X, Y). knn is a non-parametric method, while naive Bayes and logistic regression are both parametric methods. 73 / 109
95 Application in Digit Recognition Assume each digit image is a binary image, represented as a vector with the pixel values as its elements. Applying the algorithms knn: use the Euclidean distance between the examples. NB can be applied because each example is a discrete vector. LR can be applied because each example is a continuous vector. 74 / 109
96 The Perceptron [Figure: a perceptron with inputs x_0 = 1, x_1, ..., x_d, weights w_0, w_1, ..., w_d, a summation node Σ computing x^T w, and output y = sgn(x^T w).] A perceptron maps an input x ∈ R^{d+1} to h(x) = sgn(x^T w) = { +1, x^T w > 0; 0, x^T w = 0; −1, x^T w < 0 }. Here x = (1, x_1, ..., x_d) includes a dummy variable 1. 75 / 109
97 [Figure: + and − examples separated by a linear decision boundary.] A perceptron corresponds to a linear decision boundary (i.e., the boundary between the regions for examples of the same class). 76 / 109
98 It is NP-hard to minimize the empirical 0/1 loss of a perceptron, that is, given a training set {(x_1, y_1), ..., (x_n, y_n)}, it is NP-hard to solve min_w (1/n) Σ_i I(sgn(x_i^T w) ≠ y_i). Idea: Can we use (x_i^T w − y_i)^2 as a surrogate loss for the 0/1 loss? 77 / 109
99 Least Squares May Fail Recall: Binary classification as regression Label one class as −1 and the other as +1. Fit a function f(x) using least squares regression. Given a test example x, predict −1 if f(x) < 0 and +1 otherwise. 78 / 109
100 Least Squares May Fail Recall: Binary classification as regression Label one class as −1 and the other as +1. Fit a function f(x) using least squares regression. Given a test example x, predict −1 if f(x) < 0 and +1 otherwise. Issue Least squares fitting may not find a separating hyperplane (i.e., a hyperplane which puts the positive and negative examples on different sides of it) even when there is one. 78 / 109
101 The decision boundary learned using least squares fitting (red line) wrongly classifies the negative example, while there exist separating hyperplanes (like the blue line). 79 / 109
102 Perceptron Algorithm Require: (x_1, y_1), ..., (x_n, y_n) ∈ R^{d+1} × {−1, +1}, η ∈ (0, 1]. Ensure: Weight vector w. Randomly or smartly initialize w. while there is any misclassified example do Pick a misclassified example (x_i, y_i). w ← w + η y_i x_i. 80 / 109
103 Why the update rule w ← w + η y_i x_i? w classifies (x_i, y_i) correctly if and only if y_i x_i^T w > 0. If w classifies (x_i, y_i) wrongly, then the update rule moves y_i x_i^T w towards positive, because y_i x_i^T (w + η y_i x_i) = y_i x_i^T w + η ||x_i||^2 > y_i x_i^T w. 81 / 109
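A minimal sketch of the perceptron algorithm on linearly separable data (a dummy feature 1 is prepended to each input; η = 1 and the data are made up):

```python
import numpy as np

rng = np.random.default_rng(8)
n = 100
X = np.vstack([rng.normal(-2.0, 1.0, size=(n // 2, 2)),
               rng.normal(+2.0, 1.0, size=(n // 2, 2))])
X = np.hstack([np.ones((n, 1)), X])              # prepend dummy feature x_0 = 1
y = np.array([-1] * (n // 2) + [+1] * (n // 2))

w, eta = np.zeros(3), 1.0
for _ in range(1000):                            # cap iterations in case data is not separable
    misclassified = np.where(y * (X @ w) <= 0)[0]
    if len(misclassified) == 0:
        break                                    # all training examples correctly classified
    i = misclassified[0]                         # pick a misclassified example
    w = w + eta * y[i] * X[i]                    # perceptron update

print(w, np.all(y * (X @ w) > 0))
```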
104 Perceptron convergence theorem If the training data is linearly separable (i.e., there exists some w such that y_i x_i^T w > 0 for all i), then the perceptron algorithm terminates with all training examples correctly classified. 82 / 109
105 Perceptron convergence theorem If the training data is linearly separable (i.e., there exists some w such that y_i x_i^T w > 0 for all i), then the perceptron algorithm terminates with all training examples correctly classified. Proof. Suppose w* separates the data. We can scale w* such that y_i x_i^T w* ≥ ||x_i||_2^2 for all i. If w classifies (x_i, y_i) wrongly, then it will be updated to w' = w + η y_i x_i. We show that ||w' − w*||_2^2 ≤ ||w − w*||_2^2 − ηR^2, where R = min_i ||x_i||_2. This implies that only finitely many updates are possible. The inequality can be shown as follows. ||w' − w*||_2^2 = ||w − w* + η y_i x_i||_2^2 = ||w − w*||_2^2 + 2η y_i x_i^T (w − w*) + η^2 y_i^2 ||x_i||_2^2 ≤ ||w − w*||_2^2 − 2η(y_i x_i^T w* − y_i x_i^T w) + η ||x_i||_2^2 ≤ ||w − w*||_2^2 − 2η y_i x_i^T w* + η ||x_i||_2^2 ≤ ||w − w*||_2^2 − η y_i x_i^T w* ≤ ||w − w*||_2^2 − ηR^2. 82 / 109
106 Issues When the data is separable, the hyperplane found by the perceptron algorithm depends on the initial weight and is thus arbitrary. Convergence can be very slow, especially when the gap between the positive and negative examples is small. When the data is not separable, the algorithm does not stop, but this can be difficult to detect. 83 / 109
107 Support Vector Machines (SVMs) Separable data [Figure: a separating hyperplane w^T x + w_0 = 0 with + and − examples; a point (x_i, y_i) lies at distance y_i(w^T x_i + w_0)/||w||_2 from the hyperplane.] Geometric intuition Find a separating hyperplane with maximal margin (i.e., the minimum distance from the points to it). 84 / 109
108 Support Vector Machines (SVMs) Separable data Geometric intuition Find a separating hyperplane with maximal margin (i.e., the minimum distance from the points to it). Algebraic formulation max_{M,w,w_0} M subject to y_i(w^T x_i + w_0)/||w||_2 ≥ M, i = 1, ..., n. 84 / 109
109 Support Vector Machines (SVMs) Separable data Geometric intuition Find a separating hyperplane with maximal margin (i.e., the minimum distance from the points to it). Algebraic formulation max_{M,w,w_0} M subject to y_i(w^T x_i + w_0)/||w||_2 ≥ M, i = 1, ..., n. Equivalent formulation (add the constraint M||w||_2 = 1): min_{w,w_0} (1/2)||w||_2^2 subject to y_i(w^T x_i + w_0) ≥ 1, i = 1, ..., n. 84 / 109
110 Soft-margin SVMs Non-separable data Algebraic formulation min_{w,w_0,ξ_1,...,ξ_n} (1/2)||w||_2^2 + C Σ_i ξ_i subject to y_i(w^T x_i + w_0) ≥ 1 − ξ_i, i = 1, ..., n, ξ_i ≥ 0, i = 1, ..., n. C > 0 is a user-chosen constant. Introducing ξ_i allows (x_i, y_i) to be misclassified, with a penalty of Cξ_i added to the original objective function (1/2)||w||_2^2. An SVM always has a unique solution that can be found efficiently. 85 / 109
111 SVM as minimizing regularized hinge loss Soft-margin SVMs can be equivalently written as min_{w,w_0} (1/(2C))||w||_2^2 + Σ_i max(0, 1 − y_i(w^T x_i + w_0)), where max(0, 1 − y(w^T x + w_0)) is the hinge loss L_hinge((x, y), h) = max(0, 1 − y h(x)) of the classifier h(x) = w^T x + w_0, and upper bounds the 0/1 loss L_0/1((x, y), h) = I(y ≠ sgn(h(x))). 86 / 109
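A sketch of this regularized hinge-loss view, minimized by plain subgradient descent (one simple way to train a linear SVM; the step size, C, iteration count, and data are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(9)
n, C, lr = 200, 1.0, 0.01
X = np.vstack([rng.normal(-2.0, 1.0, size=(n // 2, 2)),
               rng.normal(+2.0, 1.0, size=(n // 2, 2))])
y = np.array([-1] * (n // 2) + [+1] * (n // 2))

w, w0 = np.zeros(2), 0.0
for _ in range(2000):
    margins = y * (X @ w + w0)
    active = margins < 1                                 # examples with non-zero hinge loss
    # Subgradient of (1/(2C))||w||^2 + sum_i max(0, 1 - y_i (w^T x_i + w_0)).
    grad_w = w / C - (y[active, None] * X[active]).sum(axis=0)
    grad_w0 = -y[active].sum()
    w -= lr * grad_w
    w0 -= lr * grad_w0

pred = np.sign(X @ w + w0)
print(np.mean(pred == y))  # training accuracy
```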
112 What You Need to Know... Bayes optimal classifier. Probabilistic classifiers: NN, NB and LR. Perceptrons and SVMs. 87 / 109
113 Clustering 88 / 109
114 Clustering Clustering is unsupervised function learning which learns a function from X to [K] = {1,..., K}. The objective generally depends on some similarity/distance measure between the items, so that similar items are grouped together. 89 / 109
115 K-means Clustering Given observations x_1, ..., x_n, find a K-clustering f (a surjection from [n] to [K]) to minimize the cost Σ_{i=1}^n ||x_i − c_{f(i)}||^2, where c_k is the centroid of cluster k, that is, the average of the x_i's with f(i) = k. 90 / 109
116 K-means algorithm Randomly initialize c 1:K, and set each f (i) = 0. repeat (Assignment) Set each f (i) to be the index of the c j closest to x i. (Update) Set each c i as the centroid of cluster i given by f. until f does not change 91 / 109
117 K-means algorithm Randomly initialize c 1:K, and set each f (i) = 0. repeat (Assignment) Set each f (i) to be the index of the c j closest to x i. (Update) Set each c i as the centroid of cluster i given by f. until f does not change Initialization Forgy method: randomly choose K observations as centroids, and initialize f as in the assignment step. Random Partition: assign a random cluster to each example. Random partition is preferable. 91 / 109
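A minimal sketch of the K-means loop with random-partition initialization (the data, K, and the fallback for empty clusters are arbitrary choices; the helper name kmeans is made up):

```python
import numpy as np

def kmeans(X, K, rng, max_iter=100):
    # Random partition initialization: assign a random cluster to each example.
    f = rng.integers(K, size=len(X))
    for _ in range(max_iter):
        # Update step: set each centroid to the mean of its cluster
        # (re-seed a centroid from a random point if its cluster is empty).
        centroids = np.array([X[f == k].mean(axis=0) if np.any(f == k)
                              else X[rng.integers(len(X))] for k in range(K)])
        # Assignment step: assign each point to the closest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_f = dists.argmin(axis=1)
        if np.array_equal(new_f, f):
            break                     # assignments stopped changing
        f = new_f
    return f, centroids

rng = np.random.default_rng(10)
X = np.vstack([rng.normal(m, 0.5, size=(50, 2)) for m in (0.0, 3.0, 6.0)])
labels, centers = kmeans(X, K=3, rng=rng)
print(centers)
```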
118 Illustration [Figure: initial centroids.] 92 / 109
119 Illustration [Figure: iteration 1, assignment step.] 92 / 109
120 Illustration [Figure: iteration 1, update step.] 92 / 109
121 Illustration [Figure: iteration 2, assignment step.] 92 / 109
122 Illustration [Figure: iteration 2, update step (converged).] 92 / 109
123 K-means as Coordinate Descent K-means is a coordinate descent algorithm for the cost function C(f, c_{1:K}) = Σ_{i=1}^n ||x_i − c_{f(i)}||^2. Specifically, we have (verify) (Assignment) f ← arg min_f C(f, c_{1:K}), where f is a K-clustering. (Update) c_{1:K} ← arg min_{c_{1:K}} C(f, c_{1:K}). 93 / 109
124 Convergence The cost decreases at each iteration before termination. The cost converges to a local minimum by the monotone convergence theorem. Convergence may be very slow, taking exponential time in some cases, but such cases do not seem to arise in practice. 94 / 109
125 Convergence The cost decreases at each iteration before termination. The cost converges to a local minimum by the monotone convergence theorem. Convergence may be very slow, taking exponential time in some cases, but such cases do not seem to arise in practice. Dealing with poor local minimum Restart multiple times and pick the minimum cost clustering found. 94 / 109
126 Convergence The cost decreases at each iteration before termination. The cost converges to a local minimum by the monotone convergence theorem. Convergence may be very slow, taking exponential time in some cases, but such cases do not seem to arise in practice. Dealing with poor local minimum Restart multiple times and pick the minimum cost clustering found. K-means gives hard clusterings. Can we give probabilistic assignments to clusters? 94 / 109
127 Soft Clustering with Gaussian Mixtures Assumption Each cluster is represented as a Gaussian N(x; µ_k, Σ_k) = (1/√(det(2πΣ_k))) exp( −(1/2)(x − µ_k)^T Σ_k^{−1} (x − µ_k) ). Cluster k has weight w_k ≥ 0, with Σ_{k=1}^K w_k = 1. 95 / 109
128 Equivalently, we assume the probability of observing x and that it is in cluster z = k is p(x, z = k | θ) = w_k N(x; µ_k, Σ_k), where θ = {w_{1:K}, µ_{1:K}, Σ_{1:K}}. 96 / 109
129 Equivalently, we assume the probability of observing x and that it is in cluster z = k is p(x, z = k | θ) = w_k N(x; µ_k, Σ_k), where θ = {w_{1:K}, µ_{1:K}, Σ_{1:K}}. The distribution p(x|θ) = Σ_k w_k N(x; µ_k, Σ_k) is called a Gaussian mixture. 96 / 109
130 Computing a soft clustering p(z = k | x, θ) = p(x, z = k | θ) / Σ_{k'=1}^K p(x, z = k' | θ). Learning a Gaussian mixture Given the observations D = {x_1, ..., x_n}, we choose θ by maximizing the log-likelihood L(D|θ) = Σ_{i=1}^n ln p(x_i|θ): max_θ L(D|θ). This is solved using the EM algorithm. 97 / 109
131 The EM Algorithm Let z_i be the random variable representing the cluster from which x_i is drawn. EM starts with some initial parameter θ^(0), and repeats the following steps: Expectation step: Q(θ | θ^(t)) = Σ_i E( ln p(x_i, z_i | θ) | D, θ^(t) ). Maximization step: θ^(t+1) = arg max_θ Q(θ | θ^(t)). In words... E-step: Expectation of the log-likelihood for the complete data, w.r.t. the conditional distribution p(z_{1:n} | D, θ^(t)). M-step: Maximization of the expectation. 98 / 109
132 The EM Algorithm Let z_i be the random variable representing the cluster from which x_i is drawn. EM starts with some initial parameter θ^(0), and repeats the following steps: Expectation step: Q(θ | θ^(t)) = Σ_i E( ln p(x_i, z_i | θ) | D, θ^(t) ). Maximization step: θ^(t+1) = arg max_θ Q(θ | θ^(t)). Data completion interpretation E-step: Create complete data (x_i, z_i) with weight p(z_i | x_i, θ^(t)), for each x_i and each z_i ∈ [K]. M-step: Perform maximum likelihood estimation on the complete data set. 98 / 109
133 EM algorithm is iterative likelihood maximization The EM algorithm iteratively improves the likelihood function: L(D|θ^(t+1)) ≥ L(D|θ^(t)). 99 / 109
134 EM algorithm is iterative likelihood maximization The EM algorithm iteratively improves the likelihood function: L(D|θ^(t+1)) ≥ L(D|θ^(t)). Proof. With some algebraic manipulation, we have Q(θ^(t+1) | θ^(t)) − Q(θ^(t) | θ^(t)) = L(D|θ^(t+1)) − L(D|θ^(t)) − Σ_i KL( p(Z_i | x_i, θ^(t)) || p(Z_i | x_i, θ^(t+1)) ), where KL(q || q') is the KL-divergence Σ_x q(x) ln(q(x)/q'(x)). The result follows by noting that the LHS is non-negative by the choice of θ^(t+1) in the M-step, and the non-negativity of the KL-divergence. 99 / 109
135 Q(θ^(t+1) | θ^(t)) − Q(θ^(t) | θ^(t)) = Σ_i E( ln p(x_i, z_i | θ^(t+1)) | D, θ^(t) ) − Σ_i E( ln p(x_i, z_i | θ^(t)) | D, θ^(t) ) = Σ_i E( ln p(x_i | θ^(t+1)) + ln p(z_i | x_i, θ^(t+1)) − ln p(x_i | θ^(t)) − ln p(z_i | x_i, θ^(t)) | D, θ^(t) ) = Σ_i ln p(x_i | θ^(t+1)) − Σ_i ln p(x_i | θ^(t)) − Σ_i E( ln ( p(z_i | x_i, θ^(t)) / p(z_i | x_i, θ^(t+1)) ) | D, θ^(t) ) = L(D|θ^(t+1)) − L(D|θ^(t)) − Σ_i KL( p(Z_i | x_i, θ^(t)) || p(Z_i | x_i, θ^(t+1)) ). 100 / 109
136 Update Equations for Gaussian Mixtures Scalar covariance matrices Assume each covariance matrix Σ_k is a scalar matrix σ_k^2 I_d. Let w_k^(t,i) = p(z_i = k | x_i, θ^(t)). Then given θ^(t), θ^(t+1) can be computed using w_k^(t+1) = Σ_i w_k^(t,i) / n, µ_k^(t+1) = Σ_i w_k^(t,i) x_i / Σ_i w_k^(t,i), (σ_k^(t+1))^2 = Σ_i Σ_j w_k^(t,i) (x_ij − µ_kj^(t+1))^2 / (d Σ_i w_k^(t,i)). 101 / 109
137 Data completion interpretation Split example x_i into K complete examples (x_i, 1), ..., (x_i, K), where example (x_i, k) has weight w_k^(t,i). Apply maximum likelihood estimation to the complete data: w_k^(t+1) is the total weight of the examples in cluster k, divided by n. µ_k^(t+1) is the mean x value of the (weighted) examples in cluster k. σ_k^(t+1) is the standard deviation of all the attributes of the (weighted) examples in cluster k. 102 / 109
138 Diagonal covariance matrices Assume each covariance matrix Σ_k is a diagonal matrix with diagonal entries σ_k1^2, ..., σ_kd^2. Let w_k^(t,i) = p(z_i = k | x_i, θ^(t)). Then given θ^(t), θ^(t+1) can be computed using w_k^(t+1) = Σ_i w_k^(t,i) / n, µ_k^(t+1) = Σ_i w_k^(t,i) x_i / Σ_i w_k^(t,i), (σ_kj^(t+1))^2 = Σ_i w_k^(t,i) (x_ij − µ_kj^(t+1))^2 / Σ_i w_k^(t,i). 103 / 109
139 Arbitrary covariance matrices Let w_k^(t,i) = p(z_i = k | x_i, θ^(t)). Then given θ^(t), θ^(t+1) can be computed using w_k^(t+1) = Σ_i w_k^(t,i) / n, µ_k^(t+1) = Σ_i w_k^(t,i) x_i / Σ_i w_k^(t,i), Σ_k^(t+1) = Σ_i w_k^(t,i) (x_i − µ_k^(t+1))(x_i − µ_k^(t+1))^T / Σ_i w_k^(t,i). 104 / 109
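A sketch of one EM loop for a Gaussian mixture with arbitrary covariance matrices, following the update equations above (SciPy's multivariate_normal supplies the density; the two-cluster data and the naive initialization are arbitrary):

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(11)
# Synthetic data from two well-separated Gaussian clusters.
X = np.vstack([rng.normal(0.0, 1.0, size=(100, 2)),
               rng.normal(4.0, 1.0, size=(100, 2))])
n, d, K = X.shape[0], X.shape[1], 2

# Naive initialization of theta = (w, mu, Sigma).
w = np.full(K, 1.0 / K)
mu = X[rng.choice(n, size=K, replace=False)]
Sigma = np.array([np.eye(d) for _ in range(K)])

for _ in range(50):
    # E-step: responsibilities w_k^(t,i) = p(z_i = k | x_i, theta^(t)).
    dens = np.column_stack([w[k] * multivariate_normal.pdf(X, mean=mu[k], cov=Sigma[k])
                            for k in range(K)])
    resp = dens / dens.sum(axis=1, keepdims=True)

    # M-step: the update equations for arbitrary covariance matrices.
    Nk = resp.sum(axis=0)                  # sum_i w_k^(t,i) for each k
    w = Nk / n
    mu = (resp.T @ X) / Nk[:, None]
    for k in range(K):
        diff = X - mu[k]
        Sigma[k] = (resp[:, k, None] * diff).T @ diff / Nk[k]

print(w, mu)
```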
140 What You Need to Know... The clustering problem. Hard clustering with K-means algorithm. Soft clustering with Gaussian mixture. 105 / 109
141 Density Estimation Maximum likelihood estimation. Naive Bayes. Logistic regression. 106 / 109
142 This Tutorial... Essentials for crafting basic machine learning systems. Formulate applications as machine learning problems Classification, regression, density estimation, clustering Understand and apply basic learning algorithms Least squares regression, logistic regression, support vector machines, K-means,... Theoretical understanding Position and compare the problems and algorithms in the unifying framework of statistical learning theory. 107 / 109
143 Beyond This Course Representation: dimensionality reduction, feature selection,... Algorithms: decision trees, artificial neural networks, Gaussian processes,... Meta-learning algorithms: boosting, stacking, bagging,... Learning theory: generalization performance of learning algorithms Many other exciting topics... 108 / 109
144 Further Readings 109 / 109
Kernel Methods and Support Vector Machines Oliver Schulte - CMPT 726 Bishop PRML Ch. 6 Support Vector Machines Defining Characteristics Like logistic regression, good for continuous input features, discrete
More informationBias-Variance Tradeoff
What s learning, revisited Overfitting Generative versus Discriminative Logistic Regression Machine Learning 10701/15781 Carlos Guestrin Carnegie Mellon University September 19 th, 2007 Bias-Variance Tradeoff
More informationStatistical learning. Chapter 20, Sections 1 4 1
Statistical learning Chapter 20, Sections 1 4 Chapter 20, Sections 1 4 1 Outline Bayesian learning Maximum a posteriori and maximum likelihood learning Bayes net learning ML parameter learning with complete
More informationMachine Learning 2017
Machine Learning 2017 Volker Roth Department of Mathematics & Computer Science University of Basel 21st March 2017 Volker Roth (University of Basel) Machine Learning 2017 21st March 2017 1 / 41 Section
More informationL11: Pattern recognition principles
L11: Pattern recognition principles Bayesian decision theory Statistical classifiers Dimensionality reduction Clustering This lecture is partly based on [Huang, Acero and Hon, 2001, ch. 4] Introduction
More information9 Classification. 9.1 Linear Classifiers
9 Classification This topic returns to prediction. Unlike linear regression where we were predicting a numeric value, in this case we are predicting a class: winner or loser, yes or no, rich or poor, positive
More informationConvex Optimization M2
Convex Optimization M2 Lecture 8 A. d Aspremont. Convex Optimization M2. 1/57 Applications A. d Aspremont. Convex Optimization M2. 2/57 Outline Geometrical problems Approximation problems Combinatorial
More informationClick Prediction and Preference Ranking of RSS Feeds
Click Prediction and Preference Ranking of RSS Feeds 1 Introduction December 11, 2009 Steven Wu RSS (Really Simple Syndication) is a family of data formats used to publish frequently updated works. RSS
More informationLeast Squares Regression
E0 70 Machine Learning Lecture 4 Jan 7, 03) Least Squares Regression Lecturer: Shivani Agarwal Disclaimer: These notes are a brief summary of the topics covered in the lecture. They are not a substitute
More informationLinear Models in Machine Learning
CS540 Intro to AI Linear Models in Machine Learning Lecturer: Xiaojin Zhu jerryzhu@cs.wisc.edu We briefly go over two linear models frequently used in machine learning: linear regression for, well, regression,
More informationCS 6375 Machine Learning
CS 6375 Machine Learning Nicholas Ruozzi University of Texas at Dallas Slides adapted from David Sontag and Vibhav Gogate Course Info. Instructor: Nicholas Ruozzi Office: ECSS 3.409 Office hours: Tues.
More information10-701/ Machine Learning - Midterm Exam, Fall 2010
10-701/15-781 Machine Learning - Midterm Exam, Fall 2010 Aarti Singh Carnegie Mellon University 1. Personal info: Name: Andrew account: E-mail address: 2. There should be 15 numbered pages in this exam
More informationDiscriminative Models
No.5 Discriminative Models Hui Jiang Department of Electrical Engineering and Computer Science Lassonde School of Engineering York University, Toronto, Canada Outline Generative vs. Discriminative models
More informationData Mining Techniques
Data Mining Techniques CS 6220 - Section 2 - Spring 2017 Lecture 6 Jan-Willem van de Meent (credit: Yijun Zhao, Chris Bishop, Andrew Moore, Hastie et al.) Project Project Deadlines 3 Feb: Form teams of
More informationBayesian Methods: Naïve Bayes
Bayesian Methods: aïve Bayes icholas Ruozzi University of Texas at Dallas based on the slides of Vibhav Gogate Last Time Parameter learning Learning the parameter of a simple coin flipping model Prior
More informationMIDTERM: CS 6375 INSTRUCTOR: VIBHAV GOGATE October,
MIDTERM: CS 6375 INSTRUCTOR: VIBHAV GOGATE October, 23 2013 The exam is closed book. You are allowed a one-page cheat sheet. Answer the questions in the spaces provided on the question sheets. If you run
More information