Generalized Linear Models and Logistic Regression
- May Doyle
1 Department of Electrical and Computer Engineering University of Texas at Austin
2 Machine Learning - What do we want the machines to learn?
We want humanoid agents capable of making intelligent decisions. Decisions often need to be made based on a target outcome associated with a specified input. So, if a machine can predict the target output after seeing an input, we can easily program it (using a few simple if-else statements) to make the decision associated with that output. For example, if a robot can predict from its motor sensor inputs whether it is about to fall, it can be programmed to initiate balancing motor functions to prevent the fall.
3 Predictive Modeling
The process of learning to predict an outcome from a given input is known as Predictive Modeling. The output can be:
- A continuous value. This is called Regression. For example, predicting the force acting on the right limb of a robot given inputs from motor sensors.
- A discrete label. This is called Classification. For example, predicting whether the robot will fall or not based on the sensor inputs.
Supervised Learning is an important class of methods for predictive modeling.
4 Supervised Learning - Introduction
Given a set of input-output pairs, we want to learn a function $f$ that relates an input to the corresponding output. This is Supervised Learning. The learnt function can then be used to predict the output for unseen input values; this process is known as Prediction.
- The set of given input-output pairs is known as the Training Set. Learning of the function $f$ is supervised by the training set.
- The set of unseen input values is known as the Test Set.
- Inputs are also called features, attributes, or independent variables.
- Training and test sets are assumed to be sampled i.i.d. from an unknown joint distribution $P(x, y) = P(y \mid x)\,P(x)$.
5 Supervised Learning - Preliminaries
Let $\mathcal{X}$ be the domain of the attributes.
- For a real-valued input (input from a single sensor), $\mathcal{X} \subseteq \mathbb{R}$.
- For a vector-valued input (from multiple sensors), $\mathcal{X} \subseteq \mathbb{R}^D$.
- For a categorical attribute, $\mathcal{X} \subseteq S$, a set of possible values of the attribute. For example, the set of allowed rotations of a limb.
Let $\mathcal{Y}$ be the domain of the target output.
- For regression, $\mathcal{Y} \subseteq \mathbb{R}$, or $\mathcal{Y} \subseteq \mathbb{R}^D$ if the output is vector valued.
- For classification, $\mathcal{Y} \subseteq S$, a set of possible labels. For the robot example, $S = \{0, 1\}$, with 0 for fall and 1 for no fall.
Given a training set $\{(x_i, y_i)\}_{i=1}^{N}$ of $N$ input-output pairs, we need to learn a function $f : \mathcal{X} \to \mathcal{Y}$ such that $y = f(x)$. $f$ is the model that relates $\mathcal{X}$ to $\mathcal{Y}$, and we want to learn this model.
6 Supervised Learning - Preliminaries
The space of all possible functions $f : \mathcal{X} \to \mathcal{Y}$ is infinite. Learning would therefore require searching this infinite space for a function $f$ that satisfies the input-output pairs of the training set, which is very expensive. So:
- We restrict the search to a sub-space of functions.
- The sub-space is represented by a parameterized class of functions having a particular form, $y \approx f(x; \theta)$.
- The goal is then to learn the parameters $\theta$ that give the function that best approximates the output within this restricted class.
For example, we can look only for linear functions of the inputs, $f(x; \theta) = \theta^T x$. This is the case for linear models. Or we can search for a more complicated non-linear function for a better approximation, for example $f(x; \theta) = g(\theta^T x)$, where $g(\cdot)$ is some non-linear function.
7 Supervised Learning - Noise Model
Restricting the search to a sub-space results in an approximate estimate of the output:
$$y_i = f(x_i; \theta) + \epsilon_i$$
$\epsilon_i$ is the residue/noise that the learnt function fails to account for because of the restricted search. Since each input-output pair $(x_i, y_i)$ is sampled from $P(x, y)$, we learn a model for the expected value of the output:
$$E[y_i] = f(x_i; \theta) + E[\epsilon_i]$$
The finite size of the training set is another approximation, since the expected value is now estimated from a finite sample. The associated noise $\epsilon_i$ is also random, with the same form of distribution as $P(y \mid x_i)$ (note that we only need to shift $y_i$ by the deterministic value $f(x_i; \theta)$ to get $\epsilon_i$). Different distribution assumptions for the output, $P(y \mid x)$, give rise to different models. We will first look at simple models, least squares regression and logistic regression, and then generalize them to a class of models known as Generalized Linear Models (GLM).
8 Least Squares Regression - Gaussian Noise Model
The domain of target output values is $\mathcal{Y} \subseteq \mathbb{R}$. It is reasonable to assume that the predicted output $f(x_i; \theta)$ is corrupted by zero-mean Gaussian noise:
$$\epsilon_i \sim \mathcal{N}(0, \sigma^2)$$
The target output $y_i$ is then also normally distributed:
$$y_i \sim \mathcal{N}(f(x_i; \theta), \sigma^2)$$
The training samples are assumed to be drawn i.i.d. from the distribution $P(x, y)$:
$$P(\{y_i\} \mid \{x_i\})\,P(\{x_i\}) = \prod_{i=1}^{N} \mathcal{N}\big(y_i;\, f(x_i; \theta), \sigma^2\big)\,P(\{x_i\})$$
We can learn the parameters $\theta$ by maximizing the log-likelihood:
$$\theta^* = \arg\max_\theta \log\big(P(\{y_i\} \mid \{x_i\})\,P(\{x_i\})\big) = \arg\min_\theta \sum_{i=1}^{N} \big(y_i - f(x_i; \theta)\big)^2 + \text{constant}$$
The constant collects the terms independent of $\theta$, which do not affect the optimization for $\theta$ and can be ignored.
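The step from the Gaussian likelihood to the sum of squares can be written out explicitly. The following is a short worked derivation that uses only the noise model stated on this slide:

```latex
\begin{align*}
\log \prod_{i=1}^{N} \mathcal{N}\big(y_i;\, f(x_i;\theta), \sigma^2\big)
  &= \sum_{i=1}^{N} \log\left[ \frac{1}{\sqrt{2\pi\sigma^2}}
     \exp\left( -\frac{\big(y_i - f(x_i;\theta)\big)^2}{2\sigma^2} \right) \right] \\
  &= -\frac{N}{2}\log(2\pi\sigma^2)
     - \frac{1}{2\sigma^2} \sum_{i=1}^{N} \big(y_i - f(x_i;\theta)\big)^2
\end{align*}
```

The first term and $\log P(\{x_i\})$ do not depend on $\theta$, and $1/(2\sigma^2)$ is a positive factor, so maximizing the log-likelihood over $\theta$ is the same as minimizing the sum of squared residuals.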
9 Least Squares Regression - Gaussian Noise Model
Least squares regression arises when we assume the output is normally distributed. For the special case of linear models, we assume $f(x_i; \theta) = \theta^T x_i$:
$$\theta^* = \arg\min_\theta \sum_{i=1}^{N} \big(y_i - \theta^T x_i\big)^2$$
This is the commonly used Linear Regression, or Linear Least Squares, discovered by Gauss! A closed-form solution for $\theta$ can be obtained by taking the derivative of the objective function above and setting it to zero. The solution is the famous pseudo-inverse solution
$$\theta^* = (X^T X)^{-1} X^T y$$
where the rows of $X$ are the inputs $x_i^T$ and $y$ stacks the targets $y_i$.
What happens when $y$ is not normally distributed? We will now look at models that arise when $y$ has a non-normal distribution. We start with the case when $y$ has a Bernoulli distribution, which gives rise to the well-known Logistic Regression.
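As an illustration of the closed-form solution, here is a minimal sketch in Python (the slides themselves use MATLAB). It assumes a single feature plus a bias term, so that $X^T X$ is a 2x2 matrix that can be inverted by hand. The function name `fit_line` and the toy data are hypothetical.

```python
def fit_line(xs, ys):
    """Least-squares fit of y ~ theta0 + theta1 * x via the normal equations."""
    n = len(xs)
    sx = sum(xs)
    sy = sum(ys)
    sxx = sum(x * x for x in xs)
    sxy = sum(x * y for x, y in zip(xs, ys))
    # For rows [1, x_i]: X^T X = [[n, sx], [sx, sxx]] and X^T y = [sy, sxy].
    det = n * sxx - sx * sx
    if det == 0:
        raise ValueError("X^T X is singular: the inputs must not all be equal")
    theta0 = (sxx * sy - sx * sxy) / det
    theta1 = (n * sxy - sx * sy) / det
    return theta0, theta1


# Hypothetical noise-free data lying exactly on y = 1 + 2x.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
theta0, theta1 = fit_line(xs, ys)
print(theta0, theta1)
```

For more than a couple of features one would normally call a linear-algebra routine rather than expand the inverse by hand; the explicit 2x2 form is used here only to keep the example dependency-free.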
10 Logistic Regression - Introduction
The output is now a Bernoulli random variable with domain $\mathcal{Y} = \{0, 1\}$. The problem is really a classification problem since we have just two labels. The name "regression" is kept for consistency with the general class of models called GLMs, of which logistic regression is a special case. Remember that we model the expected value of the target, and that the noise is generally assumed to be a zero-mean random variable:
$$E[y_i] = f(x_i; \theta) = P(y_i = 1 \mid x_i; \theta)$$
The function we want to learn is thus constrained to have range $[0, 1]$. This is generally done by applying a non-linear squashing function to $\theta^T x_i$ so that the output of the linear model lies in $[0, 1]$. For logistic regression it is the sigmoid function
$$f(x_i; \theta) = \sigma(\theta^T x_i) = \frac{1}{1 + \exp(-\theta^T x_i)}$$
11 Log Odds
Modeling $E[y]$ by a sigmoid function over a linear model is equivalent to modeling the log-odds by the linear model. This is easy to prove:
$$E[y] = P(y = 1 \mid x) = \frac{1}{1 + \exp(-\theta^T x)}$$
$$\exp(-\theta^T x) = \frac{1 - P(y = 1 \mid x)}{P(y = 1 \mid x)}$$
$$\theta^T x = \log\left(\frac{P(y = 1 \mid x)}{P(y = 0 \mid x)}\right)$$
$\log\left(\frac{P(y = 1 \mid x)}{P(y = 0 \mid x)}\right)$ is the log-odds for the Bernoulli random variable.
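The equivalence can also be checked numerically. The sketch below, in Python rather than the MATLAB used in these slides, applies the sigmoid to a few values of $\theta^T x$ and then takes the log-odds of the result; the helper names are hypothetical.

```python
import math


def sigmoid(z):
    """Logistic sigmoid, the modelled P(y = 1 | x) for z = theta^T x."""
    return 1.0 / (1.0 + math.exp(-z))


def log_odds(p):
    """Log-odds log(p / (1 - p)) of a probability p in (0, 1)."""
    return math.log(p / (1.0 - p))


# The log-odds of sigmoid(z) should give back z itself.
for z in (-3.0, -0.5, 0.0, 2.0):
    p = sigmoid(z)
    print(z, p, log_odds(p))
```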
12 Logistic Sigmoid Function
Figure: Logistic Sigmoid Function $\sigma(x)$
- A smooth, monotonic, well-behaved function that squashes the real line to the range $[0, 1]$.
- Well-behaved implies continuity, twice differentiability, etc., which are important during optimization.
- For logistic regression, it is the conditional probability of label 1 given the attributes.
- In Bayes decision theory, it is the posterior distribution of the positive class. We will not cover Bayes decision theory in these slides.
- Useful property: $\sigma'(x) = \sigma(x)(1 - \sigma(x))$. We will use this when learning the parameters.
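The derivative property can be verified against a finite-difference approximation. This is a small Python sketch (not the MATLAB of the slides). The sign split inside `sigmoid` is an added precaution against overflow in `exp` and is not something the slides discuss; all function names are hypothetical.

```python
import math


def sigmoid(z):
    # Split on the sign so that exp() never receives a large positive argument.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def sigmoid_grad(z):
    """Derivative from the identity sigma'(z) = sigma(z) * (1 - sigma(z))."""
    s = sigmoid(z)
    return s * (1.0 - s)


def central_diff(f, z, h=1e-5):
    """Central finite-difference approximation of f'(z)."""
    return (f(z + h) - f(z - h)) / (2.0 * h)


for z in (-2.0, 0.0, 1.5):
    print(z, sigmoid_grad(z), central_diff(sigmoid, z))
```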
13 Linear Decision Boundary
Modeling the expected value via a sigmoid function applied to a linear model results in a linear separating hyperplane. This can easily be proved. In classification, the decision boundary is given by
$$P(y = 1 \mid x) = P(y = 0 \mid x)$$
$$P(y = 1 \mid x) = 1 - P(y = 1 \mid x)$$
$$\frac{1}{1 + \exp(-\theta^T x)} = \frac{\exp(-\theta^T x)}{1 + \exp(-\theta^T x)}$$
$$\exp(-\theta^T x) = 1 \quad\Longrightarrow\quad \theta^T x = 0$$
A linear decision boundary.
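To make the result concrete, the following Python sketch (an illustration only, not taken from the slides) checks that thresholding the sigmoid output at 0.5 gives the same decision as testing the sign of $\theta^T x$. The parameter vector and test points are hypothetical; the first component of each point is the constant 1 for the bias.

```python
import math


def linear_score(theta, x):
    """theta^T x for plain Python lists."""
    return sum(t * xj for t, xj in zip(theta, x))


def predict_proba(theta, x):
    """P(y = 1 | x) under the logistic model."""
    return 1.0 / (1.0 + math.exp(-linear_score(theta, x)))


theta = [-1.0, 2.0, 1.0]  # hypothetical parameters: bias, then two weights
points = [
    [1.0, 0.0, 0.0],  # score -1: negative side
    [1.0, 1.0, 0.0],  # score +1: positive side
    [1.0, 0.5, 0.0],  # score  0: exactly on the boundary
    [1.0, 0.0, 3.0],  # score +2: positive side
]
for x in points:
    print(x, linear_score(theta, x), predict_proba(theta, x))
```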
14 Parameter Learning
We modeled $P(y_i = 1 \mid x_i, \theta) = \sigma(\theta^T x_i)$. Likewise, $P(y_i = 0 \mid x_i, \theta) = 1 - \sigma(\theta^T x_i)$. The distribution for the Bernoulli random variable is then given by
$$P(y_i \mid x_i, \theta) = \sigma(\theta^T x_i)^{y_i}\,\big(1 - \sigma(\theta^T x_i)\big)^{(1 - y_i)}$$
With the i.i.d. assumption on the training set, the likelihood is given by
$$P(y \mid x, \theta) = \prod_{i=1}^{N} \sigma(\theta^T x_i)^{y_i}\,\big(1 - \sigma(\theta^T x_i)\big)^{(1 - y_i)}$$
The log-likelihood is given by
$$L(\theta) = \sum_{i=1}^{N} \Big[ y_i \log \sigma(\theta^T x_i) + (1 - y_i) \log\big(1 - \sigma(\theta^T x_i)\big) \Big]$$
$\theta$ can be learnt by maximizing $L(\theta)$:
$$\theta^* = \arg\max_\theta L(\theta)$$
Because of the sigmoid function, a closed-form solution does not exist, so we turn to gradient methods for optimization.
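The log-likelihood is straightforward to evaluate directly. Below is a minimal Python sketch (the slides use MATLAB) that assumes inputs already carry a leading 1 for the bias. It uses the identities $\log \sigma(z) = -\log(1 + e^{-z})$ and $\log(1 - \sigma(z)) = -\log(1 + e^{z})$, which is adequate for moderate $|z|$ as in this toy example. The data set is hypothetical.

```python
import math


def log_likelihood(theta, xs, ys):
    """Bernoulli log-likelihood L(theta) of the logistic model."""
    total = 0.0
    for x, y in zip(xs, ys):
        z = sum(t * xj for t, xj in zip(theta, x))
        # log sigma(z) = -log(1 + exp(-z)); log(1 - sigma(z)) = -log(1 + exp(z))
        total += -y * math.log1p(math.exp(-z)) - (1 - y) * math.log1p(math.exp(z))
    return total


# Hypothetical data: a bias column of ones followed by one feature.
xs = [[1.0, -2.0], [1.0, -1.0], [1.0, 1.0], [1.0, 2.0]]
ys = [0, 0, 1, 1]

print(log_likelihood([0.0, 0.0], xs, ys))  # every point has probability 1/2
print(log_likelihood([0.0, 1.0], xs, ys))  # a better fit, so a larger value
```

At $\theta = 0$ every example has probability 1/2, so the value is $-N \log 2$; any parameter vector that fits the labels better gives a larger (less negative) value.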
15 Gradient Ascent for Optimization
$L(\theta)$ is a concave function; its negative is known as the cross-entropy. The global maximum can be obtained using gradient ascent: starting from an initial point, move in the direction of steepest ascent, i.e. in the direction of the gradient. We get the following iterative update procedure:
$$\theta^{(t+1)} = \theta^{(t)} + \alpha \nabla L(\theta^{(t)})$$
The gradient is given by
$$\nabla L(\theta^{(t)}) = \sum_{i=1}^{N} \big(y_i - \sigma(\theta^{(t)T} x_i)\big)\, x_i$$
$\alpha$ is the learning rate. Selecting it by line search ensures that the likelihood increases at every iteration, but this can be time consuming. In practice, $\alpha$ is gradually decreased as the iterations progress to avoid drastic jumps in the search space. To avoid an expensive line search, other methods such as Newton's method or stochastic gradient ascent can be used.
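The update rule can be sketched in a few lines. This Python version (not the MATLAB implementation shown later in the slides) assumes a fixed step size instead of a line search or a decay schedule, and a small hypothetical data set whose classes overlap so that the maximum is finite.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def gradient(theta, xs, ys):
    """Gradient of the log-likelihood: sum_i (y_i - sigma(theta^T x_i)) x_i."""
    g = [0.0] * len(theta)
    for x, y in zip(xs, ys):
        err = y - sigmoid(sum(t * xj for t, xj in zip(theta, x)))
        for j, xj in enumerate(x):
            g[j] += err * xj
    return g


def gradient_ascent(xs, ys, alpha=0.1, iters=5000):
    """Batch gradient ascent with a fixed step size alpha (an assumption)."""
    theta = [0.0] * len(xs[0])
    for _ in range(iters):
        g = gradient(theta, xs, ys)
        theta = [t + alpha * gj for t, gj in zip(theta, g)]
    return theta


# Hypothetical data: bias column plus one feature; the labels overlap.
xs = [[1.0, 0.5], [1.0, 1.0], [1.0, 1.5], [1.0, 2.0], [1.0, 2.5], [1.0, 3.0]]
ys = [0, 0, 1, 0, 1, 1]

theta = gradient_ascent(xs, ys)
print(theta, gradient(theta, xs, ys))
```

At convergence the gradient is numerically zero, which is the first-order condition for the maximum of the concave log-likelihood.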
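16 IRLS - Iterated Re-weighted Least Squares
Another method is Newton's method, which involves the inverse of the Hessian matrix. On this slide $L(\theta)$ denotes the negative log-likelihood, so minimizing it is the same as maximizing the likelihood:
$$\theta^{(t+1)} = \theta^{(t)} - \big(\nabla^2 L(\theta^{(t)})\big)^{-1} \nabla L(\theta^{(t)})$$
The Hessian is given by
$$\nabla^2 L(\theta^{(t)}) = \sum_{i=1}^{N} \sigma(\theta^{(t)T} x_i)\big(1 - \sigma(\theta^{(t)T} x_i)\big)\, x_i x_i^T = X^T W^{(t)} X$$
$$W^{(t)} = \mathrm{diag}\Big(\sigma(\theta^{(t)T} x_i)\big(1 - \sigma(\theta^{(t)T} x_i)\big)\Big)$$
where the rows of $X$ are the inputs $x_i^T$. The update for the parameters is then given by
$$\theta^{(t+1)} = \theta^{(t)} - (X^T W^{(t)} X)^{-1} X^T \big(\sigma(X\theta^{(t)}) - y\big)$$
$$\theta^{(t+1)} = (X^T W^{(t)} X)^{-1} X^T W^{(t)} z, \qquad z = X\theta^{(t)} - (W^{(t)})^{-1}\big(\sigma(X\theta^{(t)}) - y\big)$$
- If the Hessian is positive-definite (which it is here whenever $X$ has full column rank), a unique optimum is found.
- The update is a weighted least squares problem with weights $W^{(t)}$.
- The weight matrix depends on the current estimate $\theta^{(t)}$ and needs to be re-estimated at every iteration. Hence the name IRLS.
- For large problems, the Hessian could be a large matrix with an expensive inverse computation.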
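A single Newton update can be sketched as follows. This is an illustrative Python version, not the algorithm inside MATLAB's glmfit. It assumes exactly two parameters (a bias and one weight) so that the 2x2 Hessian can be inverted by hand, and it uses a hypothetical overlapping data set.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def newton_step(theta, xs, ys):
    """One Newton update for 2-parameter logistic regression (bias + one feature)."""
    g0 = g1 = 0.0          # gradient of the negative log-likelihood, X^T (sigma - y)
    h00 = h01 = h11 = 0.0  # Hessian X^T W X with W = diag(sigma * (1 - sigma))
    for (x0, x1), y in zip(xs, ys):
        s = sigmoid(theta[0] * x0 + theta[1] * x1)
        w = s * (1.0 - s)
        g0 += (s - y) * x0
        g1 += (s - y) * x1
        h00 += w * x0 * x0
        h01 += w * x0 * x1
        h11 += w * x1 * x1
    det = h00 * h11 - h01 * h01
    # Solve H d = g with the explicit 2x2 inverse, then step theta <- theta - d.
    d0 = (h11 * g0 - h01 * g1) / det
    d1 = (h00 * g1 - h01 * g0) / det
    return [theta[0] - d0, theta[1] - d1]


# Hypothetical data: bias column plus one feature; the labels overlap.
xs = [[1.0, 0.5], [1.0, 1.0], [1.0, 1.5], [1.0, 2.0], [1.0, 2.5], [1.0, 3.0]]
ys = [0, 0, 1, 0, 1, 1]

theta = [0.0, 0.0]
for _ in range(25):
    theta = newton_step(theta, xs, ys)
print(theta)
```

On a problem this small Newton's method needs only a handful of iterations, compared with thousands of fixed-step gradient updates; the price is forming and solving with the Hessian at every step.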
17 Stochastic Gradient Ascent
The incremental counterpart of the gradient ascent method. We update the parameters as we receive more training points. The gradient is now computed only for the newly obtained point $(x_i, y_i)$, which gives the following update:
$$\theta^{(t+1)} = \theta^{(t)} + \alpha \big(y_i - \sigma(\theta^{(t)T} x_i)\big)\, x_i$$
It can be slower to reach the optimum than batch updates, where we sum the gradients over all the training points. For large-scale problems with streaming data it is a natural choice, and it is one reason for the popularity of logistic regression (and GLMs in general) for large-scale predictive modeling problems.
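A minimal Python sketch of the per-example update is shown below (the slides give no code for this method). It simulates a stream by visiting a small hypothetical data set in shuffled order, and it assumes a slowly decaying step size; the decay schedule, the seed and the function names are illustrative choices rather than anything prescribed by the slides.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def log_likelihood(theta, xs, ys):
    total = 0.0
    for x, y in zip(xs, ys):
        p = sigmoid(sum(t * xj for t, xj in zip(theta, x)))
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total


def sgd(xs, ys, alpha0=0.1, epochs=1000, seed=0):
    """Stochastic gradient ascent: one parameter update per training point."""
    rng = random.Random(seed)
    theta = [0.0] * len(xs[0])
    order = list(range(len(xs)))
    for epoch in range(epochs):
        alpha = alpha0 / (1.0 + epoch / 100.0)  # assumed decay schedule
        rng.shuffle(order)
        for i in order:
            x, y = xs[i], ys[i]
            err = y - sigmoid(sum(t * xj for t, xj in zip(theta, x)))
            theta = [t + alpha * err * xj for t, xj in zip(theta, x)]
    return theta


# Hypothetical data: bias column plus one feature; the labels overlap.
xs = [[1.0, 0.5], [1.0, 1.0], [1.0, 1.5], [1.0, 2.0], [1.0, 2.5], [1.0, 3.0]]
ys = [0, 0, 1, 0, 1, 1]

theta = sgd(xs, ys)
print(theta, log_likelihood(theta, xs, ys))
```

Each update touches a single example, so the cost per step does not grow with the size of the training set.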
18 Multi-Class Classification - SoftMax Regression
Logistic Regression can be generalized to multiple classes by SoftMax Regression. For the $K$-class case, the target is a $K$-dimensional vector $y$ such that $y_k = 1$ if the training example belongs to class $k$ and $y_k = 0$ otherwise. Following the standard noise model, we model the expected value of each $y_k$ as
$$E[y_{ik}] = P(y_{ik} = 1 \mid x_i, \theta_k) = \frac{\exp(\theta_k^T x_i)}{\sum_{k'=1}^{K} \exp(\theta_{k'}^T x_i)}$$
The likelihood for the training set can then be written as
$$P(Y \mid X, \{\theta_k\}) = \prod_{i=1}^{N} \prod_{k=1}^{K} \left( \frac{\exp(\theta_k^T x_i)}{\sum_{k'=1}^{K} \exp(\theta_{k'}^T x_i)} \right)^{y_{ik}}$$
- The likelihood can be maximized by any of the methods discussed to learn the parameters.
- In SoftMax Regression, there is an implicit Multinomial distribution assumption for the target output.
- During prediction on the test set, the learnt parameters estimate the probability of each class. Normally, the final classification is the class with the maximum estimated probability.
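The class probabilities and the final decision can be sketched as below. This is an illustrative Python snippet, not code from the slides. The parameter vectors are hypothetical, and subtracting the maximum score before exponentiating is a standard numerical precaution that does not change the probabilities.

```python
import math


def softmax_probs(thetas, x):
    """Class probabilities exp(theta_k^T x) / sum_k' exp(theta_k'^T x)."""
    scores = [sum(t * xj for t, xj in zip(theta_k, x)) for theta_k in thetas]
    m = max(scores)  # shift by the max score for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


thetas = [[0.0, 1.0], [0.5, -1.0], [0.0, 0.0]]  # hypothetical parameters, K = 3
x = [1.0, 2.0]                                  # bias term followed by one feature

probs = softmax_probs(thetas, x)
predicted = max(range(len(probs)), key=probs.__getitem__)  # class with the largest probability
print(probs, predicted)
```

With $K = 2$ and the second parameter vector fixed at zero, the first probability reduces to the logistic sigmoid of $\theta_1^T x$, which is how SoftMax Regression contains Logistic Regression as a special case.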
19 Fisher Iris Dataset
Figure: Fisher Iris Data
- One of the most widely used datasets in machine learning.
- Comes with MATLAB! Just use the command load fisheriris to load the data into the workspace.
- Three classes with 50 instances each. Setosa is linearly separable from Versicolor and Virginica.
- Independent variables: sepal length, sepal width, petal length, petal width (all in cm).
Let's try to classify setosa from the other two classes.
20
>> load fisheriris
>> X = meas(:,1:2);
>> Y(1:50) = 0; Y(51:150) = 1; Y = Y';
>> beta = glmfit(X, Y, 'binomial', 'link', 'logit');
>> scatter(X(1:50,1), X(1:50,2), '+g');
>> hold on
>> scatter(X(51:150,1), X(51:150,2), '+r');
>> x = 4:0.01:8;
>> y = (-beta(2) .* x - beta(1)) ./ beta(3);
>> plot(x, y);
Figure: Classifying fisher iris dataset using features (1,2) and (3,4)
glmfit uses IRLS for iteratively learning the model parameters.
21 The two classes are linearly separable in the four-dimensional space of the given independent variables. Here is an illustration using the first three attributes.
Figure: Classifying fisher iris dataset using features (1,2,3)
22 Train-Test Split
>> ind_test = randsample(150, 50);
>> X_test = X(ind_test, :); Y_test = Y(ind_test);
>> ind_train = setdiff(1:150, ind_test);
>> X_train = X(ind_train, :); Y_train = Y(ind_train);
>> betas = glmfit(X_train, Y_train, 'binomial', 'link', 'logit');
>> Y_hat = glmval(betas, X_test, 'logit');
Figure: Classifying fisher iris dataset - train and test sets
Only one point was misclassified. The classes, however, seem to be strongly linearly separable. Let us now look at a more challenging dataset which might not be linearly separable.
23 E. Coli Dataset
Protein localization sites for E. coli. The available classes are:
1. cp (cytoplasm)
2. im (inner membrane without signal sequence)
3. pp (periplasm)
4. imu (inner membrane, uncleavable signal sequence)
5. om (outer membrane)
6. oml (outer membrane lipoprotein)
7. iml (inner membrane lipoprotein)
8. ims (inner membrane, cleavable signal sequence)
We focus only on the two classes cp and im, as they form the bulk of the available data (more than 95%).
There are 7 predictive attributes:
1. mcg: McGeoch's method for signal sequence recognition.
2. gvh: von Heijne's method for signal sequence recognition.
3. lip: von Heijne's Signal Peptidase II consensus sequence score. Binary attribute.
4. chg: Presence of charge on N-terminus of predicted lipoproteins. Binary attribute.
5. aac: score of discriminant analysis of the amino acid content of outer membrane and periplasmic proteins.
6. alm1: score of the ALOM membrane spanning region prediction program.
7. alm2: score of ALOM program after excluding putative cleavable signal regions from the sequence.
24 Logistic Regression on Ecoli Dataset
>> data = dlmread('ecoli.txt');
>> X = data(:,1:7); Y = data(:,8);
>> betas = glmfit(X, Y, 'binomial', 'link', 'logit');
>> betas
[ ]
Some of the coefficients are very small relative to the others. This suggests that the corresponding dimensions are not informative in predicting the target class. Let us do a PCA on the data matrix and then do logistic regression.
25
>> data = dlmread('ecoli.txt');
>> X = data(:,1:7); Y = data(:,8);
>> [coeff, X1, latent] = princomp(X);
>> beta = glmfit(X1(:,1:3), Y, 'binomial', 'link', 'logit');
>> scatter3(X1(1:143,1), X1(1:143,2), X1(1:143,3), '+g');
>> hold on
>> scatter3(X1(144:220,1), X1(144:220,2), X1(144:220,3), '+r');
>> y = -0.6:0.01:0.6; x = -0.6:0.01:0.6;
>> z = (-beta(2) .* x - beta(3) .* y - beta(1)) ./ beta(4);
>> plot3(x, y, z);
>> beta = [ ]
Figure: Classifying E.coli using first three principal components
It gives a reasonable separating hyperplane even though the classes are not linearly separable.
26 Gradient Ascent Implementation
function [theta] = learn_theta(X, Y, prec, niters)
log_sigmoid = @(x) 1 ./ (1 + exp(-x));
linear_model = @(X, theta) X * theta;
% Add a column of ones to X to account for the bias term
X = [ones(length(Y), 1) X];
theta_current = rand(size(X,2), 1);
theta_new = 2 * theta_current;
t = 1;
alpha_current = 10;
eta = 0.5; % the update rule for the step size
theta_stack(t,:) = theta_current;
alpha_update_iters = 1000;
counter = 1;
% Gradient ascent iterations
while (norm(theta_new - theta_current) > prec & t < niters)
    theta_current = theta_new;
    if (t == counter * alpha_update_iters) % update only every alpha_update_iters
        alpha_current = update_alpha(alpha_current, eta);
        counter = counter + 1;
    end
    theta_new = theta_current + alpha_current .* X' * ...
        (Y - log_sigmoid(linear_model(X, theta_current)));
    t = t + 1;
    theta_stack(t,:) = theta_new;
end
theta = theta_current;
% Plot the trajectory of every parameter over the iterations
x = 1:size(theta_stack, 1);
for ii = 1:length(theta)
    plot(x, theta_stack(:,ii)'); hold on;
end
return;

% Step size selection
function [alpha_ret] = update_alpha(alpha_current, eta)
alpha_ret = eta * alpha_current;
if (alpha_ret < 0.001)
    alpha_ret = alpha_current;
end
return;
27 Gradient Ascent on Fisher Iris Dataset Figure: Parameter convergence and performance using gradient ascent
28 Gradient ascent can easily be implemented in MATLAB using matrix operations.
- Each iteration is considerably cheaper than an IRLS iteration since it avoids the expensive inverse computation.
- We do, however, need to play around with the step size $\alpha$ to get reasonable convergence of the parameters.
- The plots in the last slide were generated by starting with a high $\alpha = 0.1$ and then halving it every 1000 iterations until it reached a preset lower limit.
- We want longer steps in the beginning to quickly reach the basin of convergence, but need to reduce the step size progressively to avoid oscillations around the optimum.
- Note that the separation is not as good as with glmfit. This is because we used a simple step-size selection rule and stopped early. Both IRLS and gradient ascent are guaranteed to reach the same correct optimum when run to convergence.
- Other sophisticated (but expensive!) line-search methods can be used for step-size selection.
Now, let us see how gradient ascent behaves when the classes are not linearly separable. Let us try to separate virginica from the other two classes.
29 Figure: Parameter convergence and performance using gradient ascent in the non-linearly-separable case
This seems to be a reasonable separation. Note that the code for gradient ascent accepts X and Y as inputs. To generate these plots I just changed the class labels in Y to separate virginica from the other two classes.
30 Train-Test Split for Fisher Iris Dataset
Figure: Classifying fisher iris dataset - train and test sets
As in the case of IRLS, only one point was misclassified using the gradient ascent method.
31 Gradient Ascent for E.coli Dataset
Figure: Parameter convergence for E.coli dataset
>> Y_hat = glmval(theta, X, 'logit');
>> idx = find(Y_hat >= 0.8);
>> Y_hat(idx) = 1;
>> idx = find(Y_hat < 0.8);
>> Y_hat(idx) = 0;
glmval computes the expected value of the target by applying the inverse link function to the linear model. For logistic regression, this is the probability of the target being 1. We threshold at 0.8 to get the target class. The model correctly classifies 217 (98.64%) training instances.
32 Generalizing to other distributions
We now generalize the formulation to other possible distributions that the target output can assume. We will consider a rich class of distributions called the Exponential Family of Distributions. Many known distributions belong to this family, such as Gaussian, Bernoulli, Binomial, Gamma, Beta and many more. The class of models obtained from this family is known as Generalized Linear Models (GLMs). Let's begin with the basics of the Exponential Family of Distributions and see how many commonly known distributions are special cases.
33 Exponential Family of Distributions
A parameterized family of distributions having the form
$$p(x \mid \theta) = \exp\big(\langle t(x), \theta \rangle - \Psi(\theta)\big)\, p_0(x)$$
A few properties:
- In canonical form, $t(x) = x$.
- $\theta$ is the natural parameter of the distribution.
- $\Psi(\cdot)$ is the cumulant function associated with the distribution.
- $p_0(x)$ is the base measure, taken with respect to the underlying Lebesgue-Stieltjes measure.
We are only interested in the property that relates $E[x]$ to the natural parameter $\theta$:
$$E[x] = \nabla\Psi(\theta)$$
This property forms the basis of GLMs. $(\nabla\Psi)^{-1}(\cdot)$ has a special name, the canonical link function: the simplest function that links the expected value to the natural parameter.
Table: Important special members of the Exponential family
| Distribution | $\Psi(\theta)$ | $\nabla\Psi(\theta)$ | $(\nabla\Psi)^{-1}(t)$ |
|---|---|---|---|
| Bernoulli | $\log(1 + \exp(\theta))$ | $\frac{\exp(\theta)}{1 + \exp(\theta)}$ | $\log\left(\frac{t}{1 - t}\right)$ |
| Binomial | $N \log(1 + \exp(\theta))$ | $\frac{N \exp(\theta)}{1 + \exp(\theta)}$ | $\log\left(\frac{t}{N - t}\right)$ |
| Poisson | $\exp(\theta)$ | $\exp(\theta)$ | $\log t$ |
| Gaussian | $\frac{\theta^2}{2}$ | $\theta$ | $t$ |
| Gamma | $-\log(-\theta)$ | $-\frac{1}{\theta}$ | $-\frac{1}{t}$ |
Note that the inverse link function for the Bernoulli distribution is the logistic sigmoid function.
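As a worked instance of the general form, the Bernoulli distribution with mean $\mu$ can be rewritten in exponential-family form. The derivation below uses only the definitions on this slide; the symbol $\mu$ for the mean is introduced here for illustration.

```latex
\begin{align*}
p(x \mid \mu) &= \mu^{x} (1 - \mu)^{1 - x}, \qquad x \in \{0, 1\} \\
  &= \exp\left( x \log\frac{\mu}{1 - \mu} + \log(1 - \mu) \right) \\
  &= \exp\big( x\,\theta - \Psi(\theta) \big),
  \qquad \theta = \log\frac{\mu}{1 - \mu},
  \quad \Psi(\theta) = \log\big(1 + \exp(\theta)\big), \quad p_0(x) = 1 \\
\nabla\Psi(\theta) &= \frac{\exp(\theta)}{1 + \exp(\theta)} = \mu = E[x]
\end{align*}
```

The natural parameter is the log-odds, and the gradient of the cumulant function is the logistic sigmoid, in agreement with the Bernoulli row of the table.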
34 Generalized Linear Models (GLMs)
The target output is now distributed according to a particular member of the exponential family of distributions. We assume a linear model for the natural parameter of the distribution (note the slight change of notation: the model parameters are now $\beta$):
$$\theta = \beta^T x$$
We thus model the expected value of the target via the inverse canonical link function computed at the linear model:
$$(\nabla\Psi)^{-1}(E[y]) = \beta^T x$$
We obtain the following expression for the likelihood:
$$P(y \mid x; \beta) = \exp\big(\langle t(y), \beta^T x \rangle - \Psi(\beta^T x)\big)\, p_0(y)$$
The likelihood can then be maximized using the methods described in the previous slides to learn the model parameters $\beta$.
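For a canonical-form GLM with $t(y) = y$, differentiating the log-likelihood with respect to $\beta$ gives the gradient $\sum_i \big(y_i - \nabla\Psi(\beta^T x_i)\big)\, x_i$, the same "target minus prediction" form seen for logistic regression. The Python sketch below applies it to Poisson regression, where $\nabla\Psi(\theta) = \exp(\theta)$. It is an illustration under assumptions, not code from the slides: the count data, the fixed step size and the name `fit_poisson` are all hypothetical.

```python
import math


def fit_poisson(xs, ys, alpha=0.005, iters=8000):
    """Gradient ascent on the Poisson GLM log-likelihood (canonical log link)."""
    beta = [0.0] * len(xs[0])
    for _ in range(iters):
        grad = [0.0] * len(beta)
        for x, y in zip(xs, ys):
            # Expected value of the target: grad Psi(beta^T x) = exp(beta^T x)
            mu = math.exp(sum(b * xj for b, xj in zip(beta, x)))
            for j, xj in enumerate(x):
                grad[j] += (y - mu) * xj
        beta = [b + alpha * g for b, g in zip(beta, grad)]
    return beta


# Hypothetical count data: bias column plus one feature.
xs = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
ys = [1, 2, 4, 7]

beta = fit_poisson(xs, ys)
fitted = [math.exp(sum(b * xj for b, xj in zip(beta, x))) for x in xs]
print(beta, fitted)
```

At the maximum the gradient vanishes, so with a bias column the fitted means sum to the observed total. Swapping `math.exp` for the sigmoid recovers the logistic regression update.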
35 Generalized Linear Models - GLMs
- GLMs are an important class of prediction models covering a large class of probability distributions.
- They can model a large class of problems by using more complicated link functions that need not be the inverse of the gradient of the cumulant function. For example, in probit regression, the inverse link function is the CDF of a standard normal random variable.
- Parameters can be easily learnt by maximizing the likelihood using gradient methods.
Table: Important special members of Generalized Linear Models
| Distribution | Support | Prediction Model |
|---|---|---|
| Bernoulli | $\{0, 1\}$ | Logistic Regression |
| Poisson | $\mathbb{Z}^{+}$ | Poisson Regression |
| Gaussian | $\mathbb{R}$ | Least Squares |
| Gamma | $\mathbb{R}^{+}$ | Gamma Regression |
36 Summary
- GLMs in general, and Logistic Regression in particular, are supervised learning methods catering to a large class of distributions called the exponential family of distributions.
- Simple, scalable parameter learning via gradient methods.
- Linear separating hyperplane for logistic regression.
- More complicated models can be obtained using more complicated, non-canonical link functions.
37 References
- Pattern Recognition and Machine Learning. Christopher M. Bishop, Springer.
- Generalized Linear Models. P. McCullagh and J.A. Nelder, Chapman & Hall/CRC.
- Machine Learning. Thomas M. Mitchell, McGraw-Hill Higher Education.
- Generalized Linear Models, with Application in Engineering and Sciences. R.H. Myers, D.C. Montgomery and C.G. Vining, John Wiley & Sons.
- A probabilistic classification system for predicting the cellular localization sites of proteins. Paul Horton and Kenta Nakai, Intelligent Systems in Molecular Biology, St. Louis, USA.
- E.coli Dataset. UCI Machine Learning Repository.
Machine Learning. Lecture 3: Logistic Regression. Feng Li.
Machine Learning Lecture 3: Logistic Regression Feng Li fli@sdu.edu.cn https://funglee.github.io School of Computer Science and Technology Shandong University Fall 2016 Logistic Regression Classification
More informationMLE/MAP + Naïve Bayes
10-601 Introduction to Machine Learning Machine Learning Department School of Computer Science Carnegie Mellon University MLE/MAP + Naïve Bayes MLE / MAP Readings: Estimating Probabilities (Mitchell, 2016)
More informationLogistic Regression Review Fall 2012 Recitation. September 25, 2012 TA: Selen Uguroglu
Logistic Regression Review 10-601 Fall 2012 Recitation September 25, 2012 TA: Selen Uguroglu!1 Outline Decision Theory Logistic regression Goal Loss function Inference Gradient Descent!2 Training Data
More information> DEPARTMENT OF MATHEMATICS AND COMPUTER SCIENCE GRAVIS 2016 BASEL. Logistic Regression. Pattern Recognition 2016 Sandro Schönborn University of Basel
Logistic Regression Pattern Recognition 2016 Sandro Schönborn University of Basel Two Worlds: Probabilistic & Algorithmic We have seen two conceptual approaches to classification: data class density estimation
More informationNaïve Bayes Introduction to Machine Learning. Matt Gormley Lecture 18 Oct. 31, 2018
10-601 Introduction to Machine Learning Machine Learning Department School of Computer Science Carnegie Mellon University Naïve Bayes Matt Gormley Lecture 18 Oct. 31, 2018 1 Reminders Homework 6: PAC Learning
More informationCh 4. Linear Models for Classification
Ch 4. Linear Models for Classification Pattern Recognition and Machine Learning, C. M. Bishop, 2006. Department of Computer Science and Engineering Pohang University of Science and echnology 77 Cheongam-ro,
More informationNaïve Bayes Introduction to Machine Learning. Matt Gormley Lecture 3 September 14, Readings: Mitchell Ch Murphy Ch.
School of Computer Science 10-701 Introduction to Machine Learning aïve Bayes Readings: Mitchell Ch. 6.1 6.10 Murphy Ch. 3 Matt Gormley Lecture 3 September 14, 2016 1 Homewor 1: due 9/26/16 Project Proposal:
More informationLogistic Regression. Seungjin Choi
Logistic Regression Seungjin Choi Department of Computer Science and Engineering Pohang University of Science and Technology 77 Cheongam-ro, Nam-gu, Pohang 37673, Korea seungjin@postech.ac.kr http://mlg.postech.ac.kr/
More informationClassification CE-717: Machine Learning Sharif University of Technology. M. Soleymani Fall 2012
Classification CE-717: Machine Learning Sharif University of Technology M. Soleymani Fall 2012 Topics Discriminant functions Logistic regression Perceptron Generative models Generative vs. discriminative
More informationOutline. Supervised Learning. Hong Chang. Institute of Computing Technology, Chinese Academy of Sciences. Machine Learning Methods (Fall 2012)
Outline Hong Chang Institute of Computing Technology, Chinese Academy of Sciences Machine Learning Methods (Fall 2012) Outline Outline I 1 Linear Models for Regression Linear Regression Probabilistic Interpretation
More informationWhy Bother? Predicting the Cellular Localization Sites of Proteins Using Bayesian Model Averaging. Yetian Chen
Predicting the Cellular Localization Sites of Proteins Using Bayesian Model Averaging Yetian Chen 04-27-2010 Why Bother? 2008 Nobel Prize in Chemistry Roger Tsien Osamu Shimomura Martin Chalfie Green Fluorescent
More informationHOMEWORK #4: LOGISTIC REGRESSION
HOMEWORK #4: LOGISTIC REGRESSION Probabilistic Learning: Theory and Algorithms CS 274A, Winter 2018 Due: Friday, February 23rd, 2018, 11:55 PM Submit code and report via EEE Dropbox You should submit a
More informationUniversity of Cambridge Engineering Part IIB Module 4F10: Statistical Pattern Processing Handout 2: Multivariate Gaussians
Engineering Part IIB: Module F Statistical Pattern Processing University of Cambridge Engineering Part IIB Module F: Statistical Pattern Processing Handout : Multivariate Gaussians. Generative Model Decision
More informationMachine Learning Lecture 7
Course Outline Machine Learning Lecture 7 Fundamentals (2 weeks) Bayes Decision Theory Probability Density Estimation Statistical Learning Theory 23.05.2016 Discriminative Approaches (5 weeks) Linear Discriminant
More informationMachine Learning, Fall 2012 Homework 2
0-60 Machine Learning, Fall 202 Homework 2 Instructors: Tom Mitchell, Ziv Bar-Joseph TA in charge: Selen Uguroglu email: sugurogl@cs.cmu.edu SOLUTIONS Naive Bayes, 20 points Problem. Basic concepts, 0
More informationLast Time. Today. Bayesian Learning. The Distributions We Love. CSE 446 Gaussian Naïve Bayes & Logistic Regression
CSE 446 Gaussian Naïve Bayes & Logistic Regression Winter 22 Dan Weld Learning Gaussians Naïve Bayes Last Time Gaussians Naïve Bayes Logistic Regression Today Some slides from Carlos Guestrin, Luke Zettlemoyer
More informationLecture 5: Linear models for classification. Logistic regression. Gradient Descent. Second-order methods.
Lecture 5: Linear models for classification. Logistic regression. Gradient Descent. Second-order methods. Linear models for classification Logistic regression Gradient descent and second-order methods
More informationMLE/MAP + Naïve Bayes
10-601 Introduction to Machine Learning Machine Learning Department School of Computer Science Carnegie Mellon University MLE/MAP + Naïve Bayes Matt Gormley Lecture 19 March 20, 2018 1 Midterm Exam Reminders
More informationLogistic Regression: Online, Lazy, Kernelized, Sequential, etc.
Logistic Regression: Online, Lazy, Kernelized, Sequential, etc. Harsha Veeramachaneni Thomson Reuter Research and Development April 1, 2010 Harsha Veeramachaneni (TR R&D) Logistic Regression April 1, 2010
More informationBias-Variance Tradeoff
What s learning, revisited Overfitting Generative versus Discriminative Logistic Regression Machine Learning 10701/15781 Carlos Guestrin Carnegie Mellon University September 19 th, 2007 Bias-Variance Tradeoff
More informationCheng Soon Ong & Christian Walder. Canberra February June 2018
Cheng Soon Ong & Christian Walder Research Group and College of Engineering and Computer Science Canberra February June 2018 (Many figures from C. M. Bishop, "Pattern Recognition and ") 1of 305 Part VII
More informationHOMEWORK #4: LOGISTIC REGRESSION
HOMEWORK #4: LOGISTIC REGRESSION Probabilistic Learning: Theory and Algorithms CS 274A, Winter 2019 Due: 11am Monday, February 25th, 2019 Submit scan of plots/written responses to Gradebook; submit your
More informationECE662: Pattern Recognition and Decision Making Processes: HW TWO
ECE662: Pattern Recognition and Decision Making Processes: HW TWO Purdue University Department of Electrical and Computer Engineering West Lafayette, INDIANA, USA Abstract. In this report experiments are
More informationGaussian and Linear Discriminant Analysis; Multiclass Classification
Gaussian and Linear Discriminant Analysis; Multiclass Classification Professor Ameet Talwalkar Slide Credit: Professor Fei Sha Professor Ameet Talwalkar CS260 Machine Learning Algorithms October 13, 2015
More informationLinear Models for Classification
Linear Models for Classification Oliver Schulte - CMPT 726 Bishop PRML Ch. 4 Classification: Hand-written Digit Recognition CHINE INTELLIGENCE, VOL. 24, NO. 24, APRIL 2002 x i = t i = (0, 0, 0, 1, 0, 0,
More informationLecture 3 - Linear and Logistic Regression
3 - Linear and Logistic Regression-1 Machine Learning Course Lecture 3 - Linear and Logistic Regression Lecturer: Haim Permuter Scribe: Ziv Aharoni Throughout this lecture we talk about how to use regression
More informationLearning from Data Logistic Regression
Learning from Data Logistic Regression Copyright David Barber 2-24. Course lecturer: Amos Storkey a.storkey@ed.ac.uk Course page : http://www.anc.ed.ac.uk/ amos/lfd/ 2.9.8.7.6.5.4.3.2...2.3.4.5.6.7.8.9
More informationCPSC 340: Machine Learning and Data Mining. MLE and MAP Fall 2017
CPSC 340: Machine Learning and Data Mining MLE and MAP Fall 2017 Assignment 3: Admin 1 late day to hand in tonight, 2 late days for Wednesday. Assignment 4: Due Friday of next week. Last Time: Multi-Class
More informationComments. x > w = w > x. Clarification: this course is about getting you to be able to think as a machine learning expert
Logistic regression Comments Mini-review and feedback These are equivalent: x > w = w > x Clarification: this course is about getting you to be able to think as a machine learning expert There has to be
More informationProbabilistic modeling. The slides are closely adapted from Subhransu Maji s slides
Probabilistic modeling The slides are closely adapted from Subhransu Maji s slides Overview So far the models and algorithms you have learned about are relatively disconnected Probabilistic modeling framework
More informationMachine Learning 2017
Machine Learning 2017 Volker Roth Department of Mathematics & Computer Science University of Basel 21st March 2017 Volker Roth (University of Basel) Machine Learning 2017 21st March 2017 1 / 41 Section
More informationLecture 2 Machine Learning Review
Lecture 2 Machine Learning Review CMSC 35246: Deep Learning Shubhendu Trivedi & Risi Kondor University of Chicago March 29, 2017 Things we will look at today Formal Setup for Supervised Learning Things
More informationLinear Discrimination Functions
Laurea Magistrale in Informatica Nicola Fanizzi Dipartimento di Informatica Università degli Studi di Bari November 4, 2009 Outline Linear models Gradient descent Perceptron Minimum square error approach
Logistic Regression and Generalized Linear Models
Logistic Regression and Generalized Linear Models Sridhar Mahadevan mahadeva@cs.umass.edu University of Massachusetts Sridhar Mahadevan: CMPSCI 689 p. 1/2 Topics Generative vs. Discriminative models In
University of Cambridge Engineering Part IIB Module 4F10: Statistical Pattern Processing Handout 2: Multivariate Gaussians
University of Cambridge Engineering Part IIB Module 4F10: Statistical Pattern Processing Handout 2: Multivariate Gaussians Mark Gales mjfg@eng.cam.ac.uk Michaelmas 2 2 Engineering
Logistic Regression. Will Monroe CS 109. Lecture Notes #22 August 14, 2017
Will Monroe CS 109 Logistic Regression Lecture Notes #22 August 14, 2017 Based on a chapter by Chris Piech Logistic regression is a classification algorithm that works by trying to learn a function
Logistic Regression. Vibhav Gogate The University of Texas at Dallas. Some Slides from Carlos Guestrin, Luke Zettlemoyer and Dan Weld.
Logistic Regression Vibhav Gogate The University of Texas at Dallas Some Slides from Carlos Guestrin, Luke Zettlemoyer and Dan Weld. Generative vs. Discriminative Classifiers Want to Learn: $h: X \to Y$ X features
ECE521 week 3: 23/26 January 2017
ECE521 week 3: 23/26 January 2017 Outline Probabilistic interpretation of linear regression - Maximum likelihood estimation (MLE) - Maximum a posteriori (MAP) estimation Bias-variance trade-off Linear
Machine Learning Basics Lecture 7: Multiclass Classification. Princeton University COS 495 Instructor: Yingyu Liang
Machine Learning Basics Lecture 7: Multiclass Classification Princeton University COS 495 Instructor: Yingyu Liang Example: image classification indoor Indoor outdoor Example: image classification (multiclass)
Linear Classification
Linear Classification Lili MOU moull12@sei.pku.edu.cn http://sei.pku.edu.cn/ moull12 23 April 2015 Outline Introduction Discriminant Functions Probabilistic Generative Models Probabilistic Discriminative
The exam is closed book, closed notes except your one-page (two sides) or two-page (one side) crib sheet.
CS 189 Spring 2013 Introduction to Machine Learning Final You have 3 hours for the exam. The exam is closed book, closed notes except your one-page (two sides) or two-page (one side) crib sheet. Please
Generative v. Discriminative classifiers Intuition
Logistic Regression Machine Learning 10701/15781 Carlos Guestrin Carnegie Mellon University September 24th, 2007 Generative v. Discriminative classifiers Intuition Want to Learn: $h: X \to Y$ X features
CPSC 340: Machine Learning and Data Mining
CPSC 340: Machine Learning and Data Mining MLE and MAP Original version of these slides by Mark Schmidt, with modifications by Mike Gelbart. 1 Admin Assignment 4: Due tonight. Assignment 5: Will be released
$\sigma(a) = \int_{-\infty}^{a} \mathcal{N}(x; 0, 1^2)\,dx$. $\sigma(a) = \Phi(a) =$
Until now we have always worked with likelihoods and prior distributions that were conjugate to each other, allowing the computation of the posterior distribution to be done in closed form. Unfortunately,
Meta-Learning for Escherichia Coli Bacteria Patterns Classification
Meta-Learning for Escherichia Coli Bacteria Patterns Classification Hafida Bouziane, Belhadri Messabih, and Abdallah Chouarfia MB University, BP 1505 El M Naouer 3100 Oran Algeria e-mail: (h_bouziane,messabih,chouarfia)@univ-usto.dz
Linear and logistic regression
Linear and logistic regression Guillaume Obozinski Ecole des Ponts - ParisTech Master MVA Linear and logistic regression 1/22 Outline 1 Linear regression 2 Logistic regression 3 Fisher discriminant analysis
CMU-Q Lecture 24:
CMU-Q 15-381 Lecture 24: Supervised Learning 2 Teacher: Gianni A. Di Caro SUPERVISED LEARNING Hypotheses space Hypothesis function Labeled Given Errors Performance criteria Given a collection of input
Probabilistic classification CE-717: Machine Learning Sharif University of Technology. M. Soleymani Fall 2016
Probabilistic classification CE-717: Machine Learning Sharif University of Technology M. Soleymani Fall 2016 Topics Probabilistic approach Bayes decision theory Generative models Gaussian Bayes classifier
Machine Learning Tom M. Mitchell Machine Learning Department Carnegie Mellon University. September 20, 2012
Machine Learning 10-601 Tom M. Mitchell Machine Learning Department Carnegie Mellon University September 20, 2012 Today: Logistic regression Generative/Discriminative classifiers Readings: (see class website)
STA 4273H: Statistical Machine Learning
STA 4273H: Statistical Machine Learning Russ Salakhutdinov Department of Statistics rsalakhu@utstat.toronto.edu http://www.utstat.utoronto.ca/~rsalakhu/ Sidney Smith Hall, Room 6002 Lecture 3 Linear
Machine Learning Linear Classification. Prof. Matteo Matteucci
Machine Learning Linear Classification Prof. Matteo Matteucci Recall from the first lecture: $X \in \mathbb{R}^p$, Regression, $Y \in \mathbb{R}$, Continuous Output; $X \in \mathbb{R}^p$, $Y \in \{\Omega_0, \Omega_1, \dots, \Omega_K\}$, Classification, Discrete Output; $X \in \mathbb{R}^p$ Y (X)
Machine Learning Lecture 5
Machine Learning Lecture 5 Linear Discriminant Functions 26.10.2017 Bastian Leibe RWTH Aachen http://www.vision.rwth-aachen.de leibe@vision.rwth-aachen.de Course Outline Fundamentals Bayes Decision Theory
LOGISTIC REGRESSION Joseph M. Hilbe
LOGISTIC REGRESSION Joseph M. Hilbe Arizona State University Logistic regression is the most common method used to model binary response data. When the response is binary, it typically takes the form of
Probabilistic Graphical Models & Applications
Probabilistic Graphical Models & Applications Learning of Graphical Models Bjoern Andres and Bernt Schiele Max Planck Institute for Informatics The slides of today's lecture are authored by and shown with
Introduction to Machine Learning
Introduction to Machine Learning Logistic Regression Varun Chandola Computer Science & Engineering State University of New York at Buffalo Buffalo, NY, USA chandola@buffalo.edu Chandola@UB CSE 474/574
Parametric Models. Dr. Shuang LIANG. School of Software Engineering TongJi University Fall, 2012
Parametric Models Dr. Shuang LIANG School of Software Engineering TongJi University Fall, 2012 Today's Topics Maximum Likelihood Estimation Bayesian Density Estimation Today's Topics Maximum Likelihood
Machine Learning. Regression-Based Classification & Gaussian Discriminant Analysis. Manfred Huber
Machine Learning Regression-Based Classification & Gaussian Discriminant Analysis Manfred Huber 2015 1 Logistic Regression Linear regression provides a nice representation and an efficient solution to
Generalized Linear Models
Generalized Linear Models David Rosenberg New York University April 12, 2015 David Rosenberg (New York University) DS-GA 1003 April 12, 2015 1 / 20 Conditional Gaussian Regression Gaussian Regression Input
Logistic Regression. COMP 527 Danushka Bollegala
Logistic Regression COMP 527 Danushka Bollegala Binary Classification Given an instance x we must classify it to either positive (1) or negative (0) class We can use {1,-1} instead of {1,0} but we will
Logistic Regression. Machine Learning Fall 2018
Logistic Regression Machine Learning Fall 2018 1 Where are we? We have seen the following ideas Linear models Learning as loss minimization Bayesian learning criteria (MAP and MLE estimation) The Naïve Bayes
A Probabilistic Classification System for Predicting the Cellular Localization Sites of Proteins
From: ISMB-96 Proceedings. Copyright 1996, AAAI (www.aaai.org). All rights reserved. A Probabilistic Classification System for Predicting the Cellular Localization Sites of Proteins Paul Horton Computer
Linear Models in Machine Learning
CS540 Intro to AI Linear Models in Machine Learning Lecturer: Xiaojin Zhu jerryzhu@cs.wisc.edu We briefly go over two linear models frequently used in machine learning: linear regression for, well, regression,
Statistical Machine Learning Hilary Term 2018
Statistical Machine Learning Hilary Term 2018 Pier Francesco Palamara Department of Statistics University of Oxford Slide credits and other course material can be found at: http://www.stats.ox.ac.uk/~palamara/sml18.html
Generative v. Discriminative classifiers Intuition
Logistic Regression Machine Learning 10701/15781 Carlos Guestrin Carnegie Mellon University September 24th, 2007 Generative v. Discriminative classifiers Intuition Want to Learn: $h: X \to Y$ X features Y target
Logistic Regression Introduction to Machine Learning. Matt Gormley Lecture 8 Feb. 12, 2018
10-601 Introduction to Machine Learning Machine Learning Department School of Computer Science Carnegie Mellon University Logistic Regression Matt Gormley Lecture 8 Feb. 12, 2018 1 10-601 Introduction
Probability and Estimation. Alan Moses
Probability and Estimation Alan Moses Random variables and probability A random variable is like a variable in algebra (e.g., $y = e^x$), but where at least part of the variability is taken to be stochastic.
Logistic Regression Logistic
Case Study 1: Estimating Click Probabilities L2 Regularization for Logistic Regression Machine Learning/Statistics for Big Data CSE599C1/STAT592, University of Washington Carlos Guestrin January 10 th,
MLPR: Logistic Regression and Neural Networks
MLPR: Logistic Regression and Neural Networks Machine Learning and Pattern Recognition Amos Storkey Amos Storkey MLPR: Logistic Regression and Neural Networks 1/28 Outline 1 Logistic Regression 2 Multi-layer
Linear Methods for Prediction
Chapter 5 Linear Methods for Prediction 5.1 Introduction We now revisit the classification problem and focus on linear methods. Since our prediction Ĝ(x) will always take values in the discrete set G we
Outline. MLPR: Logistic Regression and Neural Networks Machine Learning and Pattern Recognition. Which is the correct model? Recap.
Outline MLPR: Logistic Regression and Neural Networks Machine Learning and Pattern Recognition 2 Amos Storkey Amos Storkey MLPR: Logistic Regression and Neural Networks 1/28 Recap Amos Storkey MLPR: Logistic Regression and Neural Networks 2/28 Which is the correct
Machine Learning. Lecture 4: Regularization and Bayesian Statistics. Feng Li. https://funglee.github.io
Machine Learning Lecture 4: Regularization and Bayesian Statistics Feng Li fli@sdu.edu.cn https://funglee.github.io School of Computer Science and Technology Shandong University Fall 207 Overfitting Problem
Linear & nonlinear classifiers
Linear & nonlinear classifiers Machine Learning Hamid Beigy Sharif University of Technology Fall 1394 Hamid Beigy (Sharif University of Technology) Linear & nonlinear classifiers Fall 1394 1 / 34 Table
Lecture 4: Exponential family of distributions and generalized linear model (GLM) (Draft: version 0.9.2)
Lectures on Machine Learning (Fall 2017) Hyeong In Choi Seoul National University Lecture 4: Exponential family of distributions and generalized linear model (GLM) (Draft: version 0.9.2) Topics to be covered:
Logistic Regression. William Cohen
Logistic Regression William Cohen 1 Outline Quick review classification, naïve Bayes, perceptrons new result for naïve Bayes Learning as optimization Logistic regression via gradient ascent Overfitting
Pattern Recognition and Machine Learning
Christopher M. Bishop Pattern Recognition and Machine Learning Springer Contents: Preface; Mathematical notation; Contents; 1 Introduction; 1.1 Example: Polynomial Curve Fitting; 1.2 Probability
Generative v. Discriminative classifiers Intuition
Logistic Regression (Continued) Generative v. Discriminative Decision Trees Machine Learning 10701/15781 Carlos Guestrin Carnegie Mellon University January 31st, 2007 2005-2007 Carlos Guestrin 1 Generative
Classification Logistic Regression
Announcements: Classification Logistic Regression Machine Learning CSE546 Sham Kakade University of Washington HW due on Friday. Today: Review: sub-gradients, lasso Logistic Regression October 3, 26 Sham
Gaussian discriminant analysis Naive Bayes
DM825 Introduction to Machine Learning Lecture 7 Gaussian discriminant analysis Marco Chiarandini Department of Mathematics & Computer Science University of Southern Denmark Outline 1. is 2. Multi-variate
Generalized Linear Models. Last time: Background & motivation for moving beyond linear
Generalized Linear Models Last time: Background & motivation for moving beyond linear regression - non-normal/non-linear cases, binary, categorical data Today's class: 1. Examples of count and ordered
Linear model A linear model assumes $Y \mid X \sim \mathcal{N}(\mu(X), \sigma^2 I)$, and $\mathbb{E}(Y \mid X) = \mu(X) = X^\top \beta$, 2/52
Statistics for Applications Chapter 10: Generalized Linear Models (GLMs) 1/52 Linear model A linear model assumes $Y \mid X \sim \mathcal{N}(\mu(X), \sigma^2 I)$, and $\mathbb{E}(Y \mid X) = \mu(X) = X^\top \beta$, 2/52 Components of a linear model The two
CS60021: Scalable Data Mining. Large Scale Machine Learning
J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 1 CS60021: Scalable Data Mining Large Scale Machine Learning Sourangshu Bhattacharya Example: Spam filtering Instance
Lecture 7. Logistic Regression. Luigi Freda. ALCOR Lab DIAG University of Rome La Sapienza. December 11, 2016
Lecture 7 Logistic Regression Luigi Freda ALCOR Lab DIAG University of Rome La Sapienza December 11, 2016 Luigi Freda ( La Sapienza University) Lecture 7 December 11, 2016 1 / 39 Outline 1 Intro Logistic
CSC 411: Lecture 04: Logistic Regression
CSC 411: Lecture 04: Logistic Regression Raquel Urtasun & Rich Zemel University of Toronto Sep 23, 2015 Urtasun & Zemel (UofT) CSC 411: 04-Prob Classif Sep 23, 2015 1 / 16 Today Key Concepts: Logistic
Support Vector Machines
Support Vector Machines Le Song Machine Learning I CSE 6740, Fall 2013 Naïve Bayes classifier Still use Bayes decision rule for classification $P(y \mid x) = \frac{P(x \mid y)\,P(y)}{P(x)}$ But assume $p(x \mid y = 1)$ is fully factorized
Midterm Review CS 7301: Advanced Machine Learning. Vibhav Gogate The University of Texas at Dallas
Midterm Review CS 7301: Advanced Machine Learning Vibhav Gogate The University of Texas at Dallas Supervised Learning Issues in supervised learning What makes learning hard Point Estimation: MLE vs Bayesian
Statistical Data Mining and Machine Learning Hilary Term 2016
Statistical Data Mining and Machine Learning Hilary Term 2016 Dino Sejdinovic Department of Statistics Oxford Slides and other materials available at: http://www.stats.ox.ac.uk/~sejdinov/sdmml Naïve Bayes
Kernel Logistic Regression and the Import Vector Machine
Kernel Logistic Regression and the Import Vector Machine Ji Zhu and Trevor Hastie Journal of Computational and Graphical Statistics, 2005 Presented by Mingtao Ding Duke University December 8, 2011 Mingtao
Machine Learning for Signal Processing Bayes Classification and Regression
Machine Learning for Signal Processing Bayes Classification and Regression Instructor: Bhiksha Raj 11755/18797 1 Recap: KNN A very effective and simple way of performing classification Simple model: For
Modern Methods of Statistical Learning sf2935 Lecture 5: Logistic Regression T.K
Lecture 5: Logistic Regression T.K. 10.11.2016 Overview of the Lecture Your Learning Outcomes Discriminative v.s. Generative Odds, Odds Ratio, Logit function, Logistic function Logistic regression definition
Machine Learning. 7. Logistic and Linear Regression
Sapienza University of Rome, Italy - Machine Learning (27/28) University of Rome La Sapienza Master in Artificial Intelligence and Robotics Machine Learning 7. Logistic and Linear Regression Luca Iocchi,
An Introduction to Statistical and Probabilistic Linear Models
An Introduction to Statistical and Probabilistic Linear Models Maximilian Mozes Proseminar Data Mining Fakultät für Informatik Technische Universität München June 07, 2017 Introduction In statistical learning
Classification. Chapter Introduction. 6.2 The Bayes classifier
Chapter 6 Classification 6.1 Introduction Often encountered in applications is the situation where the response variable Y takes values in a finite set of labels. For example, the response Y could encode
Mark your answers ON THE EXAM ITSELF. If you are not sure of your answer you may wish to provide a brief explanation.
CS 189 Spring 2015 Introduction to Machine Learning Midterm You have 80 minutes for the exam. The exam is closed book, closed notes except your one-page crib sheet. No calculators or electronic items.
ECE 5984: Introduction to Machine Learning
ECE 5984: Introduction to Machine Learning Topics: Classification: Logistic Regression NB & LR connections Readings: Barber 17.4 Dhruv Batra Virginia Tech Administrativia HW2 Due: Friday 3/6, 3/15, 11:55pm
Article from. Predictive Analytics and Futurism. July 2016 Issue 13
Article from Predictive Analytics and Futurism July 2016 Issue 13 Regression and Classification: A Deeper Look By Jeff Heaton Classification and regression are the two most common forms of models fitted
PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 2: PROBABILITY DISTRIBUTIONS
PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 2: PROBABILITY DISTRIBUTIONS Parametric Distributions Basic building blocks: Need to determine given Representation: or? Recall Curve Fitting Binary Variables
Final Exam, Machine Learning, Spring 2009
Name: Andrew ID: Final Exam, 10701 Machine Learning, Spring 2009 - The exam is open-book, open-notes, no electronics other than calculators. - The maximum possible score on this exam is 100. You have 3
Variations of Logistic Regression with Stochastic Gradient Descent
Variations of Logistic Regression with Stochastic Gradient Descent Panqu Wang(pawang@ucsd.edu) Phuc Xuan Nguyen(pxn002@ucsd.edu) January 26, 2012 Abstract In this paper, we extend the traditional logistic
Machine learning comes from Bayesian decision theory in statistics. There we want to minimize the expected value of the loss function.
Bayesian learning: Machine learning comes from Bayesian decision theory in statistics. There we want to minimize the expected value of the loss function. Let y be the true label and ŷ be the predicted