Data Mining 2018 Logistic Regression Text Classification
|
|
- Osborn Hunt
- 5 years ago
- Views:
Transcription
1 Data Mining 2018 Logistic Regression Text Classification Ad Feelders Universiteit Utrecht Ad Feelders ( Universiteit Utrecht ) Data Mining 1 / 50
2 Two types of approaches to classification In (probabilistic) classification we are interested in the conditional distribution P(Y x), so that, for example, when we observe X = x we can predict the class with the highest probability for that value of X. There are two basic approaches to modeling P(Y x): Generative Models (use Bayes rule): P(Y = j x) = P(x Y = j)p(y = j) P(x) Discriminative Models: model P(Y x) directly. P(x Y = j)p(y = j) = k i=1 P(x Y = i)p(y = i) Ad Feelders ( Universiteit Utrecht ) Data Mining 2 / 50
3 Generative Models Examples of generative classification methods: Naive Bayes classifier (discussed in the previous lecture) Linear/Quadratic Discriminant Analysis (not discussed)... Ad Feelders ( Universiteit Utrecht ) Data Mining 3 / 50
4 Discriminative Models Discriminative methods only model the conditional distribution of Y given X. The probability distribution of X itself is not modeled. For the binary classification problem: P(Y = 1 X ) = f (X, β) where f (X, β) is some deterministic function of X. Ad Feelders ( Universiteit Utrecht ) Data Mining 4 / 50
5 Discriminative Models Examples of discriminative classification methods: Linear probability model Logistic regression Feed-forward neural networks... Ad Feelders ( Universiteit Utrecht ) Data Mining 5 / 50
6 Discriminative Models: linear probability model Consider the linear regression model E[Y x] = β x Y {0, 1}, where But β x = m β j x j, with x 0 1. j=0 E[Y x] = 1 P(Y = 1 x) + 0 P(Y = 0 x) = P(Y = 1 x) So the model assumes that P(Y = 1 x) = β x Ad Feelders ( Universiteit Utrecht ) Data Mining 6 / 50
7 Linear response function P (Y = 1 x) 1 1 β x Ad Feelders ( Universiteit Utrecht ) Data Mining 7 / 50
8 Logistic regression Logistic response function E[Y x] = P(Y = 1 x) = eβ x 1 + e β x or (divide numerator and denominator by e β x ) P(Y = 1 x) = e β x = (1 + e β x ) 1 Ad Feelders ( Universiteit Utrecht ) Data Mining 8 / 50
9 Logistic Response Function P (Y = 1 x) β x Ad Feelders ( Universiteit Utrecht ) Data Mining 9 / 50
10 Linearization: the logit transformation { } P(Y = 1 x) ln 1 P(Y = 1 x) { } (1 + e β x ) 1 = ln 1 (1 + e β x ) 1 { } { } 1 1 = ln = ln (1 + e β x ) 1 e β x = ln e β x = β x In the second step, we divided the numerator and the denominator by (1 + e β x ) 1. The ratio is called the odds. P(Y = 1 x) P(Y = 1 x) = 1 P(Y = 1 x) P(Y = 0 x) Ad Feelders ( Universiteit Utrecht ) Data Mining 10 / 50
11 Linear Separation Assign to class 1 if P(Y = 1 x) > P(Y = 0 x), i.e. if P(Y = 1 x) P(Y = 0 x) > 1 This is true if ln { } P(Y = 1 x) > 0 P(Y = 0 x) So assign to class 1 if β x > 0, and to class 0 otherwise. Ad Feelders ( Universiteit Utrecht ) Data Mining 11 / 50
12 Decision Boundary x 2 Class 1 β 0 + β 1 x 1 + β 2 x 2 = 0 Class 0 x 1 Ad Feelders ( Universiteit Utrecht ) Data Mining 12 / 50
13 Maximum Likelihood Estimation Y = 1 if heads, Y = 0 if tails. p = P(Y = 1). One coin flip P(y) = p y (1 p) 1 y Note that P(1) = p, P(0) = 1 p as required. Sequence of n independent coin flips P(y 1, y 2,..., y n ) = n p y i (1 p) 1 y i which defines the likelihood function when viewed as a function of p. i=1 Ad Feelders ( Universiteit Utrecht ) Data Mining 13 / 50
14 Maximum Likelihood Estimation In a sequence of 10 coin flips we observe y = (1, 0, 1, 1, 0, 1, 1, 1, 1, 0). The corresponding likelihood function is P(y p) = p (1 p) p p (1 p) p p p p (1 p) = p 7 (1 p) 3 The corresponding log-likelihood function is ln P(y p) = ln(p 7 (1 p) 3 ) = 7 ln p + 3 ln(1 p) Ad Feelders ( Universiteit Utrecht ) Data Mining 14 / 50
15 Computing the maximum To determine the maximum we take the derivative and equate it to zero d ln P(y p) dp = 7 p 3 1 p = 0 which yields maximum likelihood estimate ˆp = 0.7. This is just the relative frequency of heads in the sample. Ad Feelders ( Universiteit Utrecht ) Data Mining 15 / 50
16 Log-likelihood function for y = (1, 0, 1, 1, 0, 1, 1, 1, 1, 0) loglikelihood Ad Feelders ( Universiteit Utrecht ) Data Mining mu 16 / 50
17 ML estimation for logistic regression Logistic regression is a bit like the coin tossing example, except that now the probability of success p i depends on x i : p i = P(Y = 1 x i ) = (1 + e β x i ) 1 1 p i = P(Y = 0 x i ) = (1 + e β x i ) 1 we can represent its probability distribution as follows P(y i ) = p y i i (1 p i ) 1 y i y i {0, 1}; i = 1,..., n Ad Feelders ( Universiteit Utrecht ) Data Mining 17 / 50
18 ML estimation for logistic regression Example i x i y i P(y i ) (1 + e β 0+8β 1 ) (1 + e β 0+12β 1 ) (1 + e β 0 15β 1 ) (1 + e β 0 10β 1 ) 1 Ad Feelders ( Universiteit Utrecht ) Data Mining 18 / 50
19 LR: likelihood function Since the y i observations are independent: P(y β) = Or, taking the natural log: ln P(y β) = ln = n P(y i ) = i=1 n i=1 n i=1 p y i i (1 p i ) 1 y i p y i i (1 p i ) 1 y i (1) n {y i ln p i + (1 y i ) ln(1 p i )} i=1 Ad Feelders ( Universiteit Utrecht ) Data Mining 19 / 50
20 LR: log-likelihood function For the logistic regression model we have so filling in gives l(β) = n i=1 p i = (1 + e β x i ) 1 1 p i = (1 + e β x i ) 1, { ( 1 y i ln 1 + e β x i Non-linear function of the parameters. ) ( + (1 y i ) ln e β x i No closed form solution (no nice formulas for the parameter estimates). )} Likelihood function globally concave so relatively easy optimization problem. Ad Feelders ( Universiteit Utrecht ) Data Mining 20 / 50
21 First-order conditions l(β) = n {y i ln p i + (1 y i ) ln(1 p i )} i=1 g(β j ) = l(β) β j = n i=1 Filling in (2) in equation (1) gives: y i p i p i β j + 1 y i 1 p i (1 p i) β j (1) p i β j = p i (1 p i )x ij (2) g(β j ) = n (y i p i )x ij i=1 Ad Feelders ( Universiteit Utrecht ) Data Mining 21 / 50
22 First-order conditions First order condition for maximum: g(β j ) = For the intercept β 0 we get n (y i p i )x ij = 0, j = 0,..., m. i=1 l(β) β 0 = n (y i p i ) = 0, i=1 since x i0 1. So in the optimal solution we have n y i = i=1 The expected number of cases with Y = 1 is equal to the observed number of cases with Y = 1. n i=1 p i Ad Feelders ( Universiteit Utrecht ) Data Mining 22 / 50
23 First-order conditions First order condition for maximum: Likewise, for β j we get g(β j ) = l(β) β j = So in the optimal solution we have Interpretation? n (y i p i )x ij = 0 i=1 n (y i p i )x ij = 0. i=1 n y i x ij = i=1 n p i x ij i=1 Ad Feelders ( Universiteit Utrecht ) Data Mining 23 / 50
24 Optimization by Gradient Ascent The gradient points in the direction of the steepest ascent of the function. We can perform optimization by the method of gradient ascent as follows. The new estimate of β j based on processing the i-th observation is: where β (t+1) j = β (t) j + η g(β j ) β=β (t)= β (t) j p (t) i = (1 + e β(t) x i ) 1 + η (y i p (t) i )x ij, is the current estimate of P(Y = 1 x i ), that is, using β (t). Ad Feelders ( Universiteit Utrecht ) Data Mining 24 / 50
25 Optimization by Gradient Ascent The basic gradient-ascent algorithm applied to logistic regression: 1 choose an initial value β (0) (e.g. at random); t 0 2 determine the gradient l(β) β j β=β (t) of l(β) at β (t) and update β (t+1) j = β (t) j + η n i=1 3 Repeat the previous step until (y i p (t) i )x ij, j = 0,..., m g(β j ) β=β (t)= 0 for all j = 0,..., m and check if a (local) maximum has been reached. η > 0 is the step size. Ad Feelders ( Universiteit Utrecht ) Data Mining 25 / 50
26 Fitted Response Function Substitute maximum likelihood estimates into the response function to obtain the fitted response function ˆP(Y = 1 x) = e ˆβ x 1 + e ˆβ x Ad Feelders ( Universiteit Utrecht ) Data Mining 26 / 50
27 Example: Programming Assignment Model the probability of successfully completing a programming assignment. Explanatory variable: programming experience. We find ˆβ 0 = and ˆβ 1 = , so ˆP(Y = 1 x) = 14 months of programming experience: e x 1 + e x ˆP(Y = 1 x = 14) = e (14) e (14) Ad Feelders ( Universiteit Utrecht ) Data Mining 27 / 50
28 Interpretation We have { } ˆP(Y = 1 x) ln = x, ˆP(Y = 0 x) so with every additional month of programming experience, the log odds increase with The odds are multiplied by e so with every additional month of programming experience, the odds increase with 17.5%. Note that the effect of an increase in x on the probability of success depends on the value of x: An increase from 14 to 24 months of programming experience leads to an increase of the probability of success from 0.31 to An increase from 34 to 44 months of programming experience leads to an increase of the probability of success from 0.92 to Ad Feelders ( Universiteit Utrecht ) Data Mining 28 / 50
29 Example: Programming Assignment month.exp success fitted month.exp success fitted Ad Feelders ( Universiteit Utrecht ) Data Mining 29 / 50
30 Allocation Rule Probability of the classes is equal when Solving for x we get x Allocation Rule: x 19: predict y = 1 x < 19: predict y = x = 0 If a person has 19 months or more programming experience, predict success, otherwise predict failure. Ad Feelders ( Universiteit Utrecht ) Data Mining 30 / 50
31 Programming Assignment: Confusion Matrix Cross table of observed and predicted class label: Row: observed, Column: predicted Error rate: 6/25= Default (predict majority class): 11/25=0.44 Ad Feelders ( Universiteit Utrecht ) Data Mining 31 / 50
32 How to in R > prog.logreg <- glm(success month.exp, data=prog.dat, family=binomial) > summary(prog.logreg) Coefficients: Estimate Std. Error z value Pr(> z ) (Intercept) * month.exp * --- Number of Fisher Scoring iterations: 4 > table(prog.dat$success, as.numeric(prog.logreg$fitted > 0.5)) Ad Feelders ( Universiteit Utrecht ) Data Mining 32 / 50
33 Non-binary classes in logistic regression Recall the logistic regression model assumption for binary class variable Y {0, 1}: P(Y = 1 x) = exp(β x) 1 + exp(β x) from which it follows that P(Y = 0 x) = exp(β x) since P(Y = 1 x) + P(Y = 0 x) = 1 Ad Feelders ( Universiteit Utrecht ) Data Mining 33 / 50
34 Non-binary classes in logistic regression We can generalize this model to non-binary class variable Y {0, 1,..., K 1} (where K is the number of classes) as follows P(Y = k x) = exp(β k x) K 1 j=0 exp(β j x) where we now have a weight vector β k for each class. This is called the multinomial logit model or multi-class logistic regression. Ad Feelders ( Universiteit Utrecht ) Data Mining 34 / 50
35 Multi-class logistic regression We can arrive at this model in the following steps: 1 Assume that P(Y = k x) is a function of the linear combination β k x. 2 To ensure that the probabilities are non-negative, take the exponential exp(β k x). 3 To make sure the probabilities sum to 1, we divide exp(β k x) by K 1 j=0 exp(β j x): P(Y = k x) = exp(β k x) K 1 j=0 exp(β j x) Ad Feelders ( Universiteit Utrecht ) Data Mining 35 / 50
36 Identification Restriction Note that exp((β k + d) x) K 1 j=0 exp((β j + d) x) = exp(d x) exp(βk exp(d x) x) K 1 j=0 exp(β j x) so adding a vector d to each of the vectors β j, j = 0,..., K 1 would yield the same fitted probabilities. To identify the model, we put β 0 = 0. Verify that binary logistic regression is a special case of the multinomial logit model, with K = 2, since exp(β0 x) = exp(0) = 1. Ad Feelders ( Universiteit Utrecht ) Data Mining 36 / 50
37 Interpretation We have ln { } P(Y = k x) = (β k β l ) x, P(Y = l x) which allows us to interpret β kj β lj as follows: for a unit increase in x j, the log-odds of class k versus class l is expected to change by β kj β lj units, holding all the other variables constant. Since β 0 = 0, β kj is the effect of x j on the log-odds of class k relative to class 0: for a unit increase in x j, the log-odds of class k versus class 0 is expected to change by β kj units, holding all the other variables constant. Ad Feelders ( Universiteit Utrecht ) Data Mining 37 / 50
38 Regularization If we have a large number of predictors, even a linear model estimated with maximum likelihood can be prone to overfitting. This can be controlled by punishing large (positive or negative) weights. The coefficient estimates are shrunken towards zero. Add a penalty term for the size of the coefficients to the objective function. With LASSO penalty: E(β) = l(β) + λ m β j, where E(β) is the new error function that we want to minimize, and l(β) is the negative log-likelihood function. The value of λ is usually selected by cross-validation. j=1 Ad Feelders ( Universiteit Utrecht ) Data Mining 38 / 50
39 Application of Logistic Regression to Movie Reviews # logistic regression with lasso penalty > reviews.glmnet <- cv.glmnet(as.matrix(train.dtm),labels[index.train], family="binomial",type.measure="class") > plot(reviews.glmnet) > coef(reviews.glmnet,s="lambda.1se") 309 x 1 sparse Matrix of class "dgcmatrix" 1 bad beautiful best better boring excellent fun funny. minutes perfect poor script stupid supposed terrible wonderful worst Ad Feelders ( Universiteit Utrecht ) Data Mining 39 / 50
40 Cross-Validation on lambda Misclassification Error log(lambda) Ad Feelders ( Universiteit Utrecht ) Data Mining 40 / 50
41 Application of Logistic Regression to Movie Reviews # make predictions on the test set > reviews.logreg.pred <- predict(reviews.glmnet, newx=as.matrix(test.dtm),s="lambda.1se",type="class") # show confusion matrix > table(reviews.logreg.pred,labels[-index.train]) reviews.logreg.pred # compute accuracy: about 81% correct > ( )/9000 [1] Ad Feelders ( Universiteit Utrecht ) Data Mining 41 / 50
42 Using tf-idf weights Currently, we use term frequency as feature value. In information retrieval other term weighting schemes are popular. For example: tf-idf(i, j) = tf(i, j) idf(i), where tf-idf(i, j) is the number of times term i occurs in document j, and N doc idf(i) = log {d : w i d} There are many variations on this theme. Should work well in finding relevant documents to queries, but not obvious why it should work in document classification. Ad Feelders ( Universiteit Utrecht ) Data Mining 42 / 50
43 Logistic Regression: using tf-idf weights # construct training document term matrix with tf-idf weights > train2.dtm <- DocumentTermMatrix(reviews.all[index.train], control=list(weighting=weighttfidf)) # remove sparse terms > train2.dtm <- removesparseterms(train2.dtm,0.95) # perform logistic regression with lasso penalty > reviews2.glmnet <- cv.glmnet(as.matrix(train2.dtm),labels[index.train], family="binomial",type.measure="class") # compute tf-idf scores on test set: we use the document frequency # from the training set! > train3.dtm <- as.matrix(train.dtm) # convert term frequency counts to binary indicator > train3.dtm <- matrix(as.numeric(train3.dtm > 0),nrow=16000,ncol=308) Ad Feelders ( Universiteit Utrecht ) Data Mining 43 / 50
44 Logistic Regression: using tf-idf weights # sum the columns of the training set > train3.idf <- apply(train3.dtm,2,sum) # compute idf for each term (column) > train3.idf <- log2(16000/train3.idf) # term frequencies on the test set > test3.dtm <- as.matrix(test.dtm) # compute tf-idf weights on the test set > for(i in 1:308){test3.dtm[,i] <- test3.dtm[,i]*train3.idf[i]} Ad Feelders ( Universiteit Utrecht ) Data Mining 44 / 50
45 Logistic Regression: using tf-idf weights # make predictions on the test set using lambda=lambda.1se > reviews.logreg2.pred <- predict(reviews2.glmnet, newx=test3.dtm,s="lambda.1se",type="class") # show confusion matrix > table(reviews.logreg2.pred,labels[-index.train]) reviews.logreg2.pred # compute accuracy: still about 81% correct > ( )/9000 [1] Ad Feelders ( Universiteit Utrecht ) Data Mining 45 / 50
46 Including Bigrams The bigrams in are: the spy who loved me the spy spy who who loved loved me but not for example spy loved The two words need to be next to each other. Ad Feelders ( Universiteit Utrecht ) Data Mining 46 / 50
47 Including Bigrams # extract bigrams > train.dtm2 <- DocumentTermMatrix(reviews.all[index.train], control = list(tokenize = BigramTokenizer)) # more than one million bigrams! > dim(train.dtm2) [1] > train.dtm2 <- removesparseterms(train.dtm2,0.99) # after removing sparse bigrams only 152 left > dim(train.dtm2) [1] > train.dat1 <- as.matrix(train.dtm) > train.dat2 <- as.matrix(train.dtm2) # combine unigrams and bigrams > train.dat <- cbind(train.dat1,train.dat2) > dim(train.dat) [1] Ad Feelders ( Universiteit Utrecht ) Data Mining 47 / 50
48 Including Bigrams # fit regularized logistic regression model # use cross-validation to evaluate different lambda values > reviews3.glmnet <- cv.glmnet(train.dat,labels[index.train], family="binomial",type.measure="class") # show coefficient estimates for lambda-1se # (only a selection of the bigram coefficients is shown here) > coef(reviews3.glmnet,s="lambda.1se") bad movie black white cant believe character development dont want end movie even though felt like first time good film great movie highly recommend just plain make sense much better must see one best one worst read book really good worst movie Ad Feelders ( Universiteit Utrecht ) Data Mining 48 / 50
49 Cross-Validation on lambda Misclassification Error log(lambda) Ad Feelders ( Universiteit Utrecht ) Data Mining 49 / 50
50 Including Bigrams # create document term matrix for the test data, # using the training dictionary > test3.dtm <- DocumentTermMatrix(reviews.all[-index.train], list(dictionary=dimnames(train.dat)[[2]])) # convert to ordinary matrix > test3.dat <- as.matrix(test3.dtm) # get columns in the same order as on the training set > test3.dat <- test3.dat[,dimnames(train.dat)[[2]]] # make predictions using lambda.1se > reviews.logreg3.pred <- predict(reviews3.glmnet,newx=test3.dat, s="lambda.1se",type="class") > table(reviews.logreg3.pred,labels[-index.train]) reviews.logreg3.pred # accuracy does not improve > ( )/9000 [1] Ad Feelders ( Universiteit Utrecht ) Data Mining 50 / 50
ECLT 5810 Linear Regression and Logistic Regression for Classification. Prof. Wai Lam
ECLT 5810 Linear Regression and Logistic Regression for Classification Prof. Wai Lam Linear Regression Models Least Squares Input vectors is an attribute / feature / predictor (independent variable) The
More informationLinear Regression Models P8111
Linear Regression Models P8111 Lecture 25 Jeff Goldsmith April 26, 2016 1 of 37 Today s Lecture Logistic regression / GLMs Model framework Interpretation Estimation 2 of 37 Linear regression Course started
More informationLogistic Regression & Neural Networks
Logistic Regression & Neural Networks CMSC 723 / LING 723 / INST 725 Marine Carpuat Slides credit: Graham Neubig, Jacob Eisenstein Logistic Regression Perceptron & Probabilities What if we want a probability
More informationClassification. Chapter Introduction. 6.2 The Bayes classifier
Chapter 6 Classification 6.1 Introduction Often encountered in applications is the situation where the response variable Y takes values in a finite set of labels. For example, the response Y could encode
More informationLogistic Regression. Will Monroe CS 109. Lecture Notes #22 August 14, 2017
1 Will Monroe CS 109 Logistic Regression Lecture Notes #22 August 14, 2017 Based on a chapter by Chris Piech Logistic regression is a classification algorithm1 that works by trying to learn a function
More informationMachine Learning Linear Classification. Prof. Matteo Matteucci
Machine Learning Linear Classification Prof. Matteo Matteucci Recall from the first lecture 2 X R p Regression Y R Continuous Output X R p Y {Ω 0, Ω 1,, Ω K } Classification Discrete Output X R p Y (X)
More informationSupport Vector Machines
Support Vector Machines Le Song Machine Learning I CSE 6740, Fall 2013 Naïve Bayes classifier Still use Bayes decision rule for classification P y x = P x y P y P x But assume p x y = 1 is fully factorized
More informationComments. x > w = w > x. Clarification: this course is about getting you to be able to think as a machine learning expert
Logistic regression Comments Mini-review and feedback These are equivalent: x > w = w > x Clarification: this course is about getting you to be able to think as a machine learning expert There has to be
More informationProbabilistic classification CE-717: Machine Learning Sharif University of Technology. M. Soleymani Fall 2016
Probabilistic classification CE-717: Machine Learning Sharif University of Technology M. Soleymani Fall 2016 Topics Probabilistic approach Bayes decision theory Generative models Gaussian Bayes classifier
More informationLogistic Regression Review Fall 2012 Recitation. September 25, 2012 TA: Selen Uguroglu
Logistic Regression Review 10-601 Fall 2012 Recitation September 25, 2012 TA: Selen Uguroglu!1 Outline Decision Theory Logistic regression Goal Loss function Inference Gradient Descent!2 Training Data
More informationLogistic Regression. Advanced Methods for Data Analysis (36-402/36-608) Spring 2014
Logistic Regression Advanced Methods for Data Analysis (36-402/36-608 Spring 204 Classification. Introduction to classification Classification, like regression, is a predictive task, but one in which the
More informationECLT 5810 Linear Regression and Logistic Regression for Classification. Prof. Wai Lam
ECLT 5810 Linear Regression and Logistic Regression for Classification Prof. Wai Lam Linear Regression Models Least Squares Input vectors is an attribute / feature / predictor (independent variable) The
More informationData-analysis and Retrieval Ordinal Classification
Data-analysis and Retrieval Ordinal Classification Ad Feelders Universiteit Utrecht Data-analysis and Retrieval 1 / 30 Strongly disagree Ordinal Classification 1 2 3 4 5 0% (0) 10.5% (2) 21.1% (4) 42.1%
More informationProbabilistic modeling. The slides are closely adapted from Subhransu Maji s slides
Probabilistic modeling The slides are closely adapted from Subhransu Maji s slides Overview So far the models and algorithms you have learned about are relatively disconnected Probabilistic modeling framework
More informationNeural networks (not in book)
(not in book) Another approach to classification is neural networks. were developed in the 1980s as a way to model how learning occurs in the brain. There was therefore wide interest in neural networks
More informationCLUe Training An Introduction to Machine Learning in R with an example from handwritten digit recognition
CLUe Training An Introduction to Machine Learning in R with an example from handwritten digit recognition Ad Feelders Universiteit Utrecht Department of Information and Computing Sciences Algorithmic Data
More informationGenerative v. Discriminative classifiers Intuition
Logistic Regression Machine Learning 070/578 Carlos Guestrin Carnegie Mellon University September 24 th, 2007 Generative v. Discriminative classifiers Intuition Want to Learn: h:x a Y X features Y target
More informationNaïve Bayes classification
Naïve Bayes classification 1 Probability theory Random variable: a variable whose possible values are numerical outcomes of a random phenomenon. Examples: A person s height, the outcome of a coin toss
More informationStochastic Gradient Descent
Stochastic Gradient Descent Machine Learning CSE546 Carlos Guestrin University of Washington October 9, 2013 1 Logistic Regression Logistic function (or Sigmoid): Learn P(Y X) directly Assume a particular
More informationFeature Engineering, Model Evaluations
Feature Engineering, Model Evaluations Giri Iyengar Cornell University gi43@cornell.edu Feb 5, 2018 Giri Iyengar (Cornell Tech) Feature Engineering Feb 5, 2018 1 / 35 Overview 1 ETL 2 Feature Engineering
More informationMaximum Likelihood, Logistic Regression, and Stochastic Gradient Training
Maximum Likelihood, Logistic Regression, and Stochastic Gradient Training Charles Elkan elkan@cs.ucsd.edu January 17, 2013 1 Principle of maximum likelihood Consider a family of probability distributions
More informationMachine Learning: Chenhao Tan University of Colorado Boulder LECTURE 5
Machine Learning: Chenhao Tan University of Colorado Boulder LECTURE 5 Slides adapted from Jordan Boyd-Graber, Tom Mitchell, Ziv Bar-Joseph Machine Learning: Chenhao Tan Boulder 1 of 27 Quiz question For
More informationBehavioral Data Mining. Lecture 7 Linear and Logistic Regression
Behavioral Data Mining Lecture 7 Linear and Logistic Regression Outline Linear Regression Regularization Logistic Regression Stochastic Gradient Fast Stochastic Methods Performance tips Linear Regression
More informationBANA 7046 Data Mining I Lecture 4. Logistic Regression and Classications 1
BANA 7046 Data Mining I Lecture 4. Logistic Regression and Classications 1 Shaobo Li University of Cincinnati 1 Partially based on Hastie, et al. (2009) ESL, and James, et al. (2013) ISLR Data Mining I
More informationMachine Learning Basics Lecture 7: Multiclass Classification. Princeton University COS 495 Instructor: Yingyu Liang
Machine Learning Basics Lecture 7: Multiclass Classification Princeton University COS 495 Instructor: Yingyu Liang Example: image classification indoor Indoor outdoor Example: image classification (multiclass)
More informationNaïve Bayes classification. p ij 11/15/16. Probability theory. Probability theory. Probability theory. X P (X = x i )=1 i. Marginal Probability
Probability theory Naïve Bayes classification Random variable: a variable whose possible values are numerical outcomes of a random phenomenon. s: A person s height, the outcome of a coin toss Distinguish
More informationLogistic Regression. Seungjin Choi
Logistic Regression Seungjin Choi Department of Computer Science and Engineering Pohang University of Science and Technology 77 Cheongam-ro, Nam-gu, Pohang 37673, Korea seungjin@postech.ac.kr http://mlg.postech.ac.kr/
More informationGenerative Models for Discrete Data
Generative Models for Discrete Data ddebarr@uw.edu 2016-04-21 Agenda Bayesian Concept Learning Beta-Binomial Model Dirichlet-Multinomial Model Naïve Bayes Classifiers Bayesian Concept Learning Numbers
More informationIntroduction to Machine Learning
1, DATA11002 Introduction to Machine Learning Lecturer: Teemu Roos TAs: Ville Hyvönen and Janne Leppä-aho Department of Computer Science University of Helsinki (based in part on material by Patrik Hoyer
More informationLinear Methods for Prediction
Chapter 5 Linear Methods for Prediction 5.1 Introduction We now revisit the classification problem and focus on linear methods. Since our prediction Ĝ(x) will always take values in the discrete set G we
More informationThe Naïve Bayes Classifier. Machine Learning Fall 2017
The Naïve Bayes Classifier Machine Learning Fall 2017 1 Today s lecture The naïve Bayes Classifier Learning the naïve Bayes Classifier Practical concerns 2 Today s lecture The naïve Bayes Classifier Learning
More informationIntroduction to Machine Learning
1, DATA11002 Introduction to Machine Learning Lecturer: Antti Ukkonen TAs: Saska Dönges and Janne Leppä-aho Department of Computer Science University of Helsinki (based in part on material by Patrik Hoyer,
More informationBayesian Learning (II)
Universität Potsdam Institut für Informatik Lehrstuhl Maschinelles Lernen Bayesian Learning (II) Niels Landwehr Overview Probabilities, expected values, variance Basic concepts of Bayesian learning MAP
More informationLecture 15: Logistic Regression
Lecture 15: Logistic Regression William Webber (william@williamwebber.com) COMP90042, 2014, Semester 1, Lecture 15 What we ll learn in this lecture Model-based regression and classification Logistic regression
More informationClassification Logistic Regression
Announcements: Classification Logistic Regression Machine Learning CSE546 Sham Kakade University of Washington HW due on Friday. Today: Review: sub-gradients,lasso Logistic Regression October 3, 26 Sham
More informationLogistic Regression. Sargur N. Srihari. University at Buffalo, State University of New York USA
Logistic Regression Sargur N. University at Buffalo, State University of New York USA Topics in Linear Classification using Probabilistic Discriminative Models Generative vs Discriminative 1. Fixed basis
More informationGenerative v. Discriminative classifiers Intuition
Logistic Regression Machine Learning 10701/15781 Carlos Guestrin Carnegie Mellon University September 24 th, 2007 1 Generative v. Discriminative classifiers Intuition Want to Learn: h:x a Y X features
More informationNaïve Bayes Classifiers and Logistic Regression. Doug Downey Northwestern EECS 349 Winter 2014
Naïve Bayes Classifiers and Logistic Regression Doug Downey Northwestern EECS 349 Winter 2014 Naïve Bayes Classifiers Combines all ideas we ve covered Conditional Independence Bayes Rule Statistical Estimation
More informationFinal Overview. Introduction to ML. Marek Petrik 4/25/2017
Final Overview Introduction to ML Marek Petrik 4/25/2017 This Course: Introduction to Machine Learning Build a foundation for practice and research in ML Basic machine learning concepts: max likelihood,
More informationECE 5984: Introduction to Machine Learning
ECE 5984: Introduction to Machine Learning Topics: Classification: Logistic Regression NB & LR connections Readings: Barber 17.4 Dhruv Batra Virginia Tech Administrativia HW2 Due: Friday 3/6, 3/15, 11:55pm
More informationBehavioral Data Mining. Lecture 3 Naïve Bayes Classifier and Generalized Linear Models
Behavioral Data Mining Lecture 3 Naïve Bayes Classifier and Generalized Linear Models Outline Naïve Bayes Classifier Regularization in Linear Regression Generalized Linear Models Assignment Tips: Matrix
More informationLDA, QDA, Naive Bayes
LDA, QDA, Naive Bayes Generative Classification Models Marek Petrik 2/16/2017 Last Class Logistic Regression Maximum Likelihood Principle Logistic Regression Predict probability of a class: p(x) Example:
More informationLinear classifiers: Logistic regression
Linear classifiers: Logistic regression STAT/CSE 416: Machine Learning Emily Fox University of Washington April 19, 2018 How confident is your prediction? The sushi & everything else were awesome! The
More informationContents Lecture 4. Lecture 4 Linear Discriminant Analysis. Summary of Lecture 3 (II/II) Summary of Lecture 3 (I/II)
Contents Lecture Lecture Linear Discriminant Analysis Fredrik Lindsten Division of Systems and Control Department of Information Technology Uppsala University Email: fredriklindsten@ituuse Summary of lecture
More informationProblem #1 #2 #3 #4 #5 #6 Total Points /6 /8 /14 /10 /8 /10 /56
STAT 391 - Spring Quarter 2017 - Midterm 1 - April 27, 2017 Name: Student ID Number: Problem #1 #2 #3 #4 #5 #6 Total Points /6 /8 /14 /10 /8 /10 /56 Directions. Read directions carefully and show all your
More informationLogistic Regression: Online, Lazy, Kernelized, Sequential, etc.
Logistic Regression: Online, Lazy, Kernelized, Sequential, etc. Harsha Veeramachaneni Thomson Reuter Research and Development April 1, 2010 Harsha Veeramachaneni (TR R&D) Logistic Regression April 1, 2010
More informationLecture #11: Classification & Logistic Regression
Lecture #11: Classification & Logistic Regression CS 109A, STAT 121A, AC 209A: Data Science Weiwei Pan, Pavlos Protopapas, Kevin Rader Fall 2016 Harvard University 1 Announcements Midterm: will be graded
More informationLast Time. Today. Bayesian Learning. The Distributions We Love. CSE 446 Gaussian Naïve Bayes & Logistic Regression
CSE 446 Gaussian Naïve Bayes & Logistic Regression Winter 22 Dan Weld Learning Gaussians Naïve Bayes Last Time Gaussians Naïve Bayes Logistic Regression Today Some slides from Carlos Guestrin, Luke Zettlemoyer
More informationClassification CE-717: Machine Learning Sharif University of Technology. M. Soleymani Fall 2012
Classification CE-717: Machine Learning Sharif University of Technology M. Soleymani Fall 2012 Topics Discriminant functions Logistic regression Perceptron Generative models Generative vs. discriminative
More informationLogistic Regression. Machine Learning Fall 2018
Logistic Regression Machine Learning Fall 2018 1 Where are e? We have seen the folloing ideas Linear models Learning as loss minimization Bayesian learning criteria (MAP and MLE estimation) The Naïve Bayes
More informationLecture 4 Discriminant Analysis, k-nearest Neighbors
Lecture 4 Discriminant Analysis, k-nearest Neighbors Fredrik Lindsten Division of Systems and Control Department of Information Technology Uppsala University. Email: fredrik.lindsten@it.uu.se fredrik.lindsten@it.uu.se
More informationAd Placement Strategies
Case Study : Estimating Click Probabilities Intro Logistic Regression Gradient Descent + SGD AdaGrad Machine Learning for Big Data CSE547/STAT548, University of Washington Emily Fox January 7 th, 04 Ad
More informationUncertainty in prediction. Can we usually expect to get a perfect classifier, if we have enough training data?
Logistic regression Uncertainty in prediction Can we usually expect to get a perfect classifier, if we have enough training data? Uncertainty in prediction Can we usually expect to get a perfect classifier,
More informationMachine Learning, Fall 2012 Homework 2
0-60 Machine Learning, Fall 202 Homework 2 Instructors: Tom Mitchell, Ziv Bar-Joseph TA in charge: Selen Uguroglu email: sugurogl@cs.cmu.edu SOLUTIONS Naive Bayes, 20 points Problem. Basic concepts, 0
More informationMIDTERM SOLUTIONS: FALL 2012 CS 6375 INSTRUCTOR: VIBHAV GOGATE
MIDTERM SOLUTIONS: FALL 2012 CS 6375 INSTRUCTOR: VIBHAV GOGATE March 28, 2012 The exam is closed book. You are allowed a double sided one page cheat sheet. Answer the questions in the spaces provided on
More informationLinear Models for Classification: Discriminative Learning (Perceptron, SVMs, MaxEnt)
Linear Models for Classification: Discriminative Learning (Perceptron, SVMs, MaxEnt) Nathan Schneider (some slides borrowed from Chris Dyer) ENLP 12 February 2018 23 Outline Words, probabilities Features,
More informationLinear classifiers: Overfitting and regularization
Linear classifiers: Overfitting and regularization Emily Fox University of Washington January 25, 2017 Logistic regression recap 1 . Thus far, we focused on decision boundaries Score(x i ) = w 0 h 0 (x
More informationOutline. Supervised Learning. Hong Chang. Institute of Computing Technology, Chinese Academy of Sciences. Machine Learning Methods (Fall 2012)
Outline Hong Chang Institute of Computing Technology, Chinese Academy of Sciences Machine Learning Methods (Fall 2012) Outline Outline I 1 Linear Models for Regression Linear Regression Probabilistic Interpretation
More informationLogistic regression for conditional probability estimation
Logistic regression for conditional probability estimation Instructor: Taylor Berg-Kirkpatrick Slides: Sanjoy Dasgupta Course website: http://cseweb.ucsd.edu/classes/wi19/cse151-b/ Uncertainty in prediction
More informationMidterm Review CS 6375: Machine Learning. Vibhav Gogate The University of Texas at Dallas
Midterm Review CS 6375: Machine Learning Vibhav Gogate The University of Texas at Dallas Machine Learning Supervised Learning Unsupervised Learning Reinforcement Learning Parametric Y Continuous Non-parametric
More informationMidterm Review CS 7301: Advanced Machine Learning. Vibhav Gogate The University of Texas at Dallas
Midterm Review CS 7301: Advanced Machine Learning Vibhav Gogate The University of Texas at Dallas Supervised Learning Issues in supervised learning What makes learning hard Point Estimation: MLE vs Bayesian
More information22s:152 Applied Linear Regression. Example: Study on lead levels in children. Ch. 14 (sec. 1) and Ch. 15 (sec. 1 & 4): Logistic Regression
22s:52 Applied Linear Regression Ch. 4 (sec. and Ch. 5 (sec. & 4: Logistic Regression Logistic Regression When the response variable is a binary variable, such as 0 or live or die fail or succeed then
More informationMachine Learning Tom M. Mitchell Machine Learning Department Carnegie Mellon University. September 20, 2012
Machine Learning 10-601 Tom M. Mitchell Machine Learning Department Carnegie Mellon University September 20, 2012 Today: Logistic regression Generative/Discriminative classifiers Readings: (see class website)
More informationIntroduction to Logistic Regression
Introduction to Logistic Regression Guy Lebanon Binary Classification Binary classification is the most basic task in machine learning, and yet the most frequent. Binary classifiers often serve as the
More informationApplied Natural Language Processing
Applied Natural Language Processing Info 256 Lecture 5: Text classification (Feb 5, 2019) David Bamman, UC Berkeley Data Classification A mapping h from input data x (drawn from instance space X) to a
More informationSparse vectors recap. ANLP Lecture 22 Lexical Semantics with Dense Vectors. Before density, another approach to normalisation.
ANLP Lecture 22 Lexical Semantics with Dense Vectors Henry S. Thompson Based on slides by Jurafsky & Martin, some via Dorota Glowacka 5 November 2018 Previous lectures: Sparse vectors recap How to represent
More informationANLP Lecture 22 Lexical Semantics with Dense Vectors
ANLP Lecture 22 Lexical Semantics with Dense Vectors Henry S. Thompson Based on slides by Jurafsky & Martin, some via Dorota Glowacka 5 November 2018 Henry S. Thompson ANLP Lecture 22 5 November 2018 Previous
More informationCS6220: DATA MINING TECHNIQUES
CS6220: DATA MINING TECHNIQUES Matrix Data: Prediction Instructor: Yizhou Sun yzsun@ccs.neu.edu September 21, 2015 Announcements TA Monisha s office hour has changed to Thursdays 10-12pm, 462WVH (the same
More informationNaïve Bayes Introduction to Machine Learning. Matt Gormley Lecture 18 Oct. 31, 2018
10-601 Introduction to Machine Learning Machine Learning Department School of Computer Science Carnegie Mellon University Naïve Bayes Matt Gormley Lecture 18 Oct. 31, 2018 1 Reminders Homework 6: PAC Learning
More informationExpectation maximization tutorial
Expectation maximization tutorial Octavian Ganea November 18, 2016 1/1 Today Expectation - maximization algorithm Topic modelling 2/1 ML & MAP Observed data: X = {x 1, x 2... x N } 3/1 ML & MAP Observed
More informationLecture 5: LDA and Logistic Regression
Lecture 5: and Logistic Regression Hao Helen Zhang Hao Helen Zhang Lecture 5: and Logistic Regression 1 / 39 Outline Linear Classification Methods Two Popular Linear Models for Classification Linear Discriminant
More informationMark your answers ON THE EXAM ITSELF. If you are not sure of your answer you may wish to provide a brief explanation.
CS 189 Spring 2015 Introduction to Machine Learning Midterm You have 80 minutes for the exam. The exam is closed book, closed notes except your one-page crib sheet. No calculators or electronic items.
More informationLinear Methods for Prediction
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike License. Your use of this material constitutes acceptance of that license and the conditions of use of materials on this
More informationGenerative v. Discriminative classifiers Intuition
Logistic Regression (Continued) Generative v. Discriminative Decision rees Machine Learning 10701/15781 Carlos Guestrin Carnegie Mellon University January 31 st, 2007 2005-2007 Carlos Guestrin 1 Generative
More informationMLE/MAP + Naïve Bayes
10-601 Introduction to Machine Learning Machine Learning Department School of Computer Science Carnegie Mellon University MLE/MAP + Naïve Bayes MLE / MAP Readings: Estimating Probabilities (Mitchell, 2016)
More informationPattern Recognition 2018 Support Vector Machines
Pattern Recognition 2018 Support Vector Machines Ad Feelders Universiteit Utrecht Ad Feelders ( Universiteit Utrecht ) Pattern Recognition 1 / 48 Support Vector Machines Ad Feelders ( Universiteit Utrecht
More informationMachine Learning Lecture 7
Course Outline Machine Learning Lecture 7 Fundamentals (2 weeks) Bayes Decision Theory Probability Density Estimation Statistical Learning Theory 23.05.2016 Discriminative Approaches (5 weeks) Linear Discriminant
More informationSparse regression. Optimization-Based Data Analysis. Carlos Fernandez-Granda
Sparse regression Optimization-Based Data Analysis http://www.cims.nyu.edu/~cfgranda/pages/obda_spring16 Carlos Fernandez-Granda 3/28/2016 Regression Least-squares regression Example: Global warming Logistic
More informationGeneralized logit models for nominal multinomial responses. Local odds ratios
Generalized logit models for nominal multinomial responses Categorical Data Analysis, Summer 2015 1/17 Local odds ratios Y 1 2 3 4 1 π 11 π 12 π 13 π 14 π 1+ X 2 π 21 π 22 π 23 π 24 π 2+ 3 π 31 π 32 π
More informationIntroduction to Machine Learning Midterm Exam Solutions
10-701 Introduction to Machine Learning Midterm Exam Solutions Instructors: Eric Xing, Ziv Bar-Joseph 17 November, 2015 There are 11 questions, for a total of 100 points. This exam is open book, open notes,
More informationLogistic regression. 11 Nov Logistic regression (EPFL) Applied Statistics 11 Nov / 20
Logistic regression 11 Nov 2010 Logistic regression (EPFL) Applied Statistics 11 Nov 2010 1 / 20 Modeling overview Want to capture important features of the relationship between a (set of) variable(s)
More informationLogistic Regression. COMP 527 Danushka Bollegala
Logistic Regression COMP 527 Danushka Bollegala Binary Classification Given an instance x we must classify it to either positive (1) or negative (0) class We can use {1,-1} instead of {1,0} but we will
More informationMachine Learning Lecture 5
Machine Learning Lecture 5 Linear Discriminant Functions 26.10.2017 Bastian Leibe RWTH Aachen http://www.vision.rwth-aachen.de leibe@vision.rwth-aachen.de Course Outline Fundamentals Bayes Decision Theory
More informationLogistic Regression. Vibhav Gogate The University of Texas at Dallas. Some Slides from Carlos Guestrin, Luke Zettlemoyer and Dan Weld.
Logistic Regression Vibhav Gogate The University of Texas at Dallas Some Slides from Carlos Guestrin, Luke Zettlemoyer and Dan Weld. Generative vs. Discriminative Classifiers Want to Learn: h:x Y X features
More informationCase Study 1: Estimating Click Probabilities. Kakade Announcements: Project Proposals: due this Friday!
Case Study 1: Estimating Click Probabilities Intro Logistic Regression Gradient Descent + SGD Machine Learning for Big Data CSE547/STAT548, University of Washington Sham Kakade April 4, 017 1 Announcements:
More informationLecture 10: Introduction to Logistic Regression
Lecture 10: Introduction to Logistic Regression Ani Manichaikul amanicha@jhsph.edu 2 May 2007 Logistic Regression Regression for a response variable that follows a binomial distribution Recall the binomial
More informationIntroduction to Machine Learning
Introduction to Machine Learning Logistic Regression Varun Chandola Computer Science & Engineering State University of New York at Buffalo Buffalo, NY, USA chandola@buffalo.edu Chandola@UB CSE 474/574
More informationQualifier: CS 6375 Machine Learning Spring 2015
Qualifier: CS 6375 Machine Learning Spring 2015 The exam is closed book. You are allowed to use two double-sided cheat sheets and a calculator. If you run out of room for an answer, use an additional sheet
More informationSTA 450/4000 S: January
STA 450/4000 S: January 6 005 Notes Friday tutorial on R programming reminder office hours on - F; -4 R The book Modern Applied Statistics with S by Venables and Ripley is very useful. Make sure you have
More informationGaussian and Linear Discriminant Analysis; Multiclass Classification
Gaussian and Linear Discriminant Analysis; Multiclass Classification Professor Ameet Talwalkar Slide Credit: Professor Fei Sha Professor Ameet Talwalkar CS260 Machine Learning Algorithms October 13, 2015
More informationLinear Regression With Special Variables
Linear Regression With Special Variables Junhui Qian December 21, 2014 Outline Standardized Scores Quadratic Terms Interaction Terms Binary Explanatory Variables Binary Choice Models Standardized Scores:
More informationBe able to define the following terms and answer basic questions about them:
CS440/ECE448 Section Q Fall 2017 Final Review Be able to define the following terms and answer basic questions about them: Probability o Random variables, axioms of probability o Joint, marginal, conditional
More informationEXAM IN STATISTICAL MACHINE LEARNING STATISTISK MASKININLÄRNING
EXAM IN STATISTICAL MACHINE LEARNING STATISTISK MASKININLÄRNING DATE AND TIME: August 30, 2018, 14.00 19.00 RESPONSIBLE TEACHER: Niklas Wahlström NUMBER OF PROBLEMS: 5 AIDING MATERIAL: Calculator, mathematical
More informationStatistical Machine Learning Theory. From Multi-class Classification to Structured Output Prediction. Hisashi Kashima.
http://goo.gl/jv7vj9 Course website KYOTO UNIVERSITY Statistical Machine Learning Theory From Multi-class Classification to Structured Output Prediction Hisashi Kashima kashima@i.kyoto-u.ac.jp DEPARTMENT
More information10-701/ Machine Learning - Midterm Exam, Fall 2010
10-701/15-781 Machine Learning - Midterm Exam, Fall 2010 Aarti Singh Carnegie Mellon University 1. Personal info: Name: Andrew account: E-mail address: 2. There should be 15 numbered pages in this exam
More information> DEPARTMENT OF MATHEMATICS AND COMPUTER SCIENCE GRAVIS 2016 BASEL. Logistic Regression. Pattern Recognition 2016 Sandro Schönborn University of Basel
Logistic Regression Pattern Recognition 2016 Sandro Schönborn University of Basel Two Worlds: Probabilistic & Algorithmic We have seen two conceptual approaches to classification: data class density estimation
More informationEXAM IN STATISTICAL MACHINE LEARNING STATISTISK MASKININLÄRNING
EXAM IN STATISTICAL MACHINE LEARNING STATISTISK MASKININLÄRNING DATE AND TIME: June 9, 2018, 09.00 14.00 RESPONSIBLE TEACHER: Andreas Svensson NUMBER OF PROBLEMS: 5 AIDING MATERIAL: Calculator, mathematical
More informationLecture 8: Classification
1/26 Lecture 8: Classification Måns Eriksson Department of Mathematics, Uppsala University eriksson@math.uu.se Multivariate Methods 19/5 2010 Classification: introductory examples Goal: Classify an observation
More informationRegression.
Regression www.biostat.wisc.edu/~dpage/cs760/ Goals for the lecture you should understand the following concepts linear regression RMSE, MAE, and R-square logistic regression convex functions and sets
More information