Behavioral Data Mining. Lecture 7 Linear and Logistic Regression


Outline: Linear Regression, Regularization, Logistic Regression, Stochastic Gradient, Fast Stochastic Methods, Performance tips.

Linear Regression. Predict a response variable Y from a vector of inputs X^T = (X_1, X_2, ..., X_p) using a linear model:

Ŷ = β̂_0 + Σ_{j=1}^{p} X_j β̂_j

β̂_0 is called the intercept or bias of the model. If we add a constant component X_0 = 1 to every X, the formula simplifies to:

Ŷ = Σ_{j=0}^{p} X_j β̂_j = X^T β̂

Least Squares. The residual sum of squares (RSS) for N points is given by:

RSS(β) = Σ_{i=1}^{N} (y_i − x_i^T β)^2

or in matrix form (boldface for matrices/vectors, uppercase for matrices):

RSS(β) = (y − Xβ)^T (y − Xβ)

This is a quadratic function of β, so its global minimum can be found by differentiating w.r.t. β:

X^T (y − Xβ) = 0

so the solution satisfies the normal equations X^T X β = X^T y.
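As a concrete illustration, here is a minimal NumPy sketch that fits the model by solving the normal equations directly (the variable names and synthetic data are my own, not from the lecture):

```python
import numpy as np

# Synthetic data: N points, p features, plus an intercept column of ones.
rng = np.random.default_rng(0)
N, p = 100, 3
X = np.hstack([np.ones((N, 1)), rng.normal(size=(N, p))])
true_beta = np.array([2.0, 1.0, -0.5, 0.25])
y = X @ true_beta + 0.1 * rng.normal(size=N)

# Solve the normal equations X^T X beta = X^T y.
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Residual sum of squares at the fitted coefficients.
rss = np.sum((y - X @ beta_hat) ** 2)
print(beta_hat, rss)
```

In practice a QR- or SVD-based solver such as np.linalg.lstsq is usually preferred, since forming X^T X explicitly can be poorly conditioned.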

Solving Regression Problems. The normal equations have the unique solution

β̂ = (X^T X)^{-1} X^T y

provided X^T X is non-singular. This is often true when N > p (more points than dimensions), but there is no guarantee. On the other hand, if p > N the matrix is guaranteed to be singular, and we need to constrain the solution somehow.

Example. [Figure: Y plotted against a single input X.]

Example in 2D

Probabilistic Formulation. We take X ∈ R^p to be a random variable, and Y another real-valued random variable dependent on X. These variables have a joint distribution Pr(X, Y). In general, we make a prediction of Y using some function (so far a linear function) f(X). To choose the best model f we define a loss function L(Y, f(X)) and try to minimize the expected loss over the joint distribution Pr(X, Y). For squared-error loss, the expected prediction error is

EPE(f) = E(Y − f(X))^2

and this is the quantity we want to minimize.

Minimizing Expected Loss. The minimizer of the expected squared error is

f(x) = E(Y | X = x)

This is called the regression function. If we assume that f has the linear form f(x) = x^T β, then the optimal solution is similar to the previous sample solution:

β̂ = [E(X X^T)]^{-1} E(X Y)

Replacing the expected values with the empirical sums we computed before gives the same result.

Other Loss Functions. The L1 loss is L(Y, f(X)) = |Y − f(X)|, with expected loss E|Y − f(X)|. Its minimizer is the conditional median, f(x) = median(Y | X = x), which is often a more robust estimator. The L1 loss has a discontinuous derivative and no simple closed-form solution. On the other hand, it can be formulated as a linear program. It is also practical to solve it with stochastic gradient methods, which are less sensitive to gradient discontinuities.

Outline: Linear Regression, Regularization, Logistic Regression, Stochastic Gradient, Fast Stochastic Methods, Performance tips.

Regularizers: Ridge and Lasso. In practice X^T X may be singular, so the ordinary least-squares solution

β̂ = (X^T X)^{-1} X^T y

does not exist. This is likely, for example, with dependent features (trigrams and bigrams). The system still has solutions, in fact a linear space of them. We can make the solution unique by imposing additional conditions, e.g. ridge regression: penalize the L2 norm of β. The solution is

β̂ = (X^T X + λI)^{-1} X^T y

and the matrix in parentheses is generically non-singular.
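A minimal NumPy sketch of the ridge solution (the regularization weight lam and the synthetic data are illustrative; for simplicity the intercept is penalized along with everything else):

```python
import numpy as np

def ridge_fit(X, y, lam=1.0):
    """Ridge regression: solve (X^T X + lam * I) beta = X^T y."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

# Rank-deficient design: the third column duplicates the second, so X^T X is
# singular and ordinary least squares has no unique solution.
rng = np.random.default_rng(1)
X = rng.normal(size=(50, 2))
X = np.hstack([X, X[:, [1]]])          # dependent feature
y = X[:, 0] - 2 * X[:, 1] + 0.1 * rng.normal(size=50)

beta_ridge = ridge_fit(X, y, lam=0.1)  # unique, finite coefficients
print(beta_ridge)
```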

Regularizers. Regularization can be viewed as placing a probability distribution on the coefficients β (independent of the data). Penalizing the L2 norm of β is equivalent to using a mean-zero Normal prior Pr(β), and the regression problem can then be formulated as a Bayesian (MAP) estimation problem with Pr(β) as the prior. The effect is to make the estimate conservative: smaller values of β are favored, and more evidence is needed to produce large coefficients. The regularizer weight λ can be chosen by observing the empirical distribution of β after solving many regression problems, or simply by optimizing it during cross-validation.

Lasso. Lasso regression penalizes the L1 norm of β. It typically generates a sparse model, since the L1 penalty drives many coefficients to zero. The function to minimize is

(y − Xβ)^T (y − Xβ) + λ ‖β‖_1

and its (sub)gradient is

−2 X^T (y − Xβ) + λ sign(β)

An efficient solution is Least-Angle Regression (LARS) due to Hastie (see reading). Another option is stochastic gradient.
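The lecture points to LARS or stochastic gradient as solvers; as a quick way to see the sparsity effect, here is a sketch using scikit-learn's coordinate-descent Lasso (an outside library, not the lecture's method; note that it scales the squared-error term by 1/(2N)):

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(2)
N, p = 200, 50
X = rng.normal(size=(N, p))
true_beta = np.zeros(p)
true_beta[:5] = [3.0, -2.0, 1.5, 1.0, -0.5]   # only 5 informative features
y = X @ true_beta + 0.1 * rng.normal(size=N)

model = Lasso(alpha=0.1).fit(X, y)
print("non-zero coefficients:", np.sum(model.coef_ != 0))  # typically close to 5
```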

Coefficient Tests. Regularization helps avoid over-fitting, but many coefficients will still carry noise values that differ across samples from the same population. To remove coefficients that are not almost surely non-zero, we can add a significance test. The statistic is

z_j = β̂_j / (σ̂ √v_j)

where v_j is the j-th diagonal element of (X^T X)^{-1}, and σ̂^2 is an estimate of the error variance (the mean squared difference between Y and Xβ̂).

Significant Coefficients. The z value has a t-distribution under the null hypothesis (β_j = 0), but in most cases (due to many degrees of freedom) it is well approximated by a normal distribution. Using the normal CDF, we can test whether a given coefficient is significantly different from zero, at, say, a significance level of 0.05. The weakness of this approach is that regression coefficients depend on each other, and the dependence may change as other features are added or removed. E.g. highly dependent features share their influence.
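A small sketch of this test, assuming an ordinary least-squares fit as above (scipy is used for the normal CDF; the variance estimate uses the usual N − p denominator):

```python
import numpy as np
from scipy.stats import norm

def coefficient_z_tests(X, y):
    """OLS fit plus per-coefficient z statistics and two-sided p-values."""
    N, p = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)
    beta = XtX_inv @ X.T @ y
    resid = y - X @ beta
    sigma2 = resid @ resid / (N - p)          # error-variance estimate
    se = np.sqrt(sigma2 * np.diag(XtX_inv))   # standard error of each beta_j
    z = beta / se
    pvals = 2 * norm.sf(np.abs(z))            # normal approximation to the t test
    return beta, z, pvals

# Coefficients with pvals < 0.05 would be kept as "significant".
```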

Stepwise Selection. We can select features in a greedy fashion, at each step adding the feature that most improves the error. By maintaining a QR decomposition of X^T X and pivoting on the next best feature, stepwise selection is asymptotically as efficient as the standard solution method.
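A naive forward-selection sketch of the idea (it refits from scratch at each step rather than updating a QR factorization, so it is not the efficient version described above):

```python
import numpy as np

def forward_stepwise(X, y, k):
    """Greedily pick k features, each time adding the one that most reduces RSS."""
    selected, remaining = [], list(range(X.shape[1]))
    for _ in range(k):
        best_j, best_rss = None, np.inf
        for j in remaining:
            cols = selected + [j]
            beta, *_ = np.linalg.lstsq(X[:, cols], y, rcond=None)
            rss = np.sum((y - X[:, cols] @ beta) ** 2)
            if rss < best_rss:
                best_j, best_rss = j, rss
        selected.append(best_j)
        remaining.remove(best_j)
    return selected
```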

Regression and n-grams. Unlike naïve Bayes, regression does not suffer from bias when features are dependent. On the other hand, regression has insufficient degrees of freedom to model dependencies, e.g. 'page' and 'turn(er)' vs. 'page turner'. A unigram model distributes the phrase's influence between the two words, increasing the error when each occurs alone. The strongest dependencies between words occur over short distances, so adding n-grams allows the strongest dependent combinations to receive an independent weight.

An Example. Dataset: reviews from amazon.com, about 1M book reviews, including the text and a numerical score from 1 to 5. Each review (X) is represented as a bag-of-words (or n-grams) with counts, and each X is normalized by dividing by its sum. The responses Y are the numerical scores. 10-fold cross-validation is used for training/testing. Final cross-validation RMSE = 0.84.

Strongest negative and strongest positive features (the two lists from the slide, combined): 'productive', 'not disappointed', 'covering', 'be disappointed', 'two stars', 'five stars', 'not recommend', 'no nonsense', 'calling', '5 stars', 'online', 'points out', 'at best', 'no better', 'track', 'only problem', 'travels', 'comprehend', 'ashamed', 'worthwhile', 'negative reviews', 'not only', 'the negative', 'a try', 'stern', 'not agree', 'parody', 'of print', 'evaluate', 'adulthood', 'web', 'even better', 'exposes', 'literary', 'covers', 'instead of', 'secret', 'not put', 'foster', 'be it', 'not buy', 'i wasnt', 'frederick', 'to avoid'.

Outline: Linear Regression, Regularization, Logistic Regression, Stochastic Gradient, Fast Stochastic Methods, Performance tips.

Logistic Regression. Goal: use linear classifier functions and derive smooth [0, 1] probabilities from them. Or: extend naïve Bayes to model dependence between features, as in linear regression. The following probability is derived from X:

p = 1 / (1 + exp(−Σ_{j=1}^{n} X_j β_j))

and it satisfies

logit(p) = ln(p / (1 − p)) = Σ_{j=1}^{n} X_j β_j
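A small NumPy transcription of these two functions (the names are mine):

```python
import numpy as np

def predict_prob(X, beta):
    """p = 1 / (1 + exp(-X beta)) for each row of X."""
    return 1.0 / (1.0 + np.exp(-(X @ beta)))

def logit(p):
    """logit(p) = ln(p / (1 - p)), the inverse of the sigmoid above."""
    return np.log(p / (1.0 - p))

# Sanity check: logit(predict_prob(X, beta)) recovers X @ beta.
```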

Logistic Regression. Logistic regression generalizes naïve Bayes in the following sense: for independent features X, the maximum-likelihood logistic coefficients are the naïve Bayes classifier coefficients, and the probability computed by the logistic function is the same as the naïve Bayes posterior probability.

Logistic Regression. The log likelihood for a model on N data vectors is:

l(β) = Σ_{i=1}^{N} [ y_i β^T x_i − log(1 + exp(β^T x_i)) ]

where we assume every x has a zeroth component x_0 = 1. The derivative is

dl/dβ = Σ_{i=1}^{N} [ x_i y_i − x_i exp(β^T x_i) / (1 + exp(β^T x_i)) ] = Σ_{i=1}^{N} x_i (y_i − p_i)

This is a non-linear optimization problem, but the log likelihood is concave (its negative is convex). It can be solved with Newton's method if the model dimension is small. Otherwise, we can use stochastic gradient.
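A direct NumPy transcription of these two formulas (a sketch; np.logaddexp handles the log(1 + exp) term stably):

```python
import numpy as np

def log_likelihood(beta, X, y):
    """l(beta) = sum_i [ y_i * beta^T x_i - log(1 + exp(beta^T x_i)) ]."""
    s = X @ beta
    return np.sum(y * s - np.logaddexp(0.0, s))

def gradient(beta, X, y):
    """dl/dbeta = sum_i x_i * (y_i - p_i), with p_i the logistic probability."""
    p = 1.0 / (1.0 + np.exp(-(X @ beta)))
    return X.T @ (y - p)
```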

Regularization. Just as with linear regression, logistic regression can easily overfit the data unless the model is regularized. Once again we can do this with either L1 or L2 penalties on the coefficients. The penalized gradients are then:

L2: dl/dβ = Σ_{i=1}^{N} x_i (y_i − p_i) − 2λβ

L1: dl/dβ = Σ_{i=1}^{N} x_i (y_i − p_i) − λ sign(β)

As with linear regression, the L1 penalty tends to drive weak coefficients to zero. This helps both to reduce model variance and to limit model size (and improve evaluation speed).
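The corresponding penalized gradients, continuing the sketch above (the signs assume we are maximizing the penalized log likelihood):

```python
import numpy as np

def gradient_l2(beta, X, y, lam):
    """Gradient of l(beta) - lam * ||beta||^2."""
    p = 1.0 / (1.0 + np.exp(-(X @ beta)))
    return X.T @ (y - p) - 2.0 * lam * beta

def gradient_l1(beta, X, y, lam):
    """Subgradient of l(beta) - lam * ||beta||_1."""
    p = 1.0 / (1.0 + np.exp(-(X @ beta)))
    return X.T @ (y - p) - lam * np.sign(beta)
```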

Outline: Linear Regression, Regularization, Logistic Regression, Stochastic Gradient, Fast Stochastic Methods, Performance tips.

Stochastic Gradient. With behavioral data, many learning algorithms rely on gradient descent where the gradient is expressible as a sum of contributions from each user. E.g. the gradient for logistic regression is:

dl/dβ = Σ_{i=1}^{N} x_i (y_i − p_i)

where the sum is taken over N users. In classical gradient descent, we would take a step directly in the gradient direction dl/dβ.

Gradient Descent. [Figure: gradient descent on a one-dimensional quadratic, e.g. y = x^2/2 with gradient dy/dx = x.] For a well-behaved (locally quadratic) minimum in one dimension, gradient descent is an efficient optimization procedure. But things are normally much more difficult in high dimensions: quadratic minima can be highly eccentric.

Gradient Descent http://www.gatsby.ucl.ac.uk/teaching/courses/ml2-2008/graddescent.pdf

Newton's Method. The Hessian matrix

H_ij = d^2 l / (dβ_i dβ_j)

together with the local gradient dl/dβ exactly characterizes a quadratic minimum. If these derivatives can be computed, the Newton step is

β_{n+1} = β_n − H^{-1} (dl/dβ)

which converges quadratically (quadratically decreasing error) to the minimum.
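For logistic regression the Hessian has the standard closed form −X^T W X with W = diag(p_i (1 − p_i)) (this form is not stated on the slide but is standard, e.g. in Hastie et al.), so a Newton step can be sketched as:

```python
import numpy as np

def newton_step(beta, X, y):
    """One Newton step for maximizing the logistic log likelihood.

    Gradient g = X^T (y - p); Hessian H = -X^T W X with W = diag(p * (1 - p)),
    so the update beta - H^{-1} g simplifies to beta + (X^T W X)^{-1} g.
    """
    p = 1.0 / (1.0 + np.exp(-(X @ beta)))
    g = X.T @ (y - p)
    w = p * (1.0 - p)                      # diagonal of W as a vector
    H_pos = (X * w[:, None]).T @ X         # X^T W X (positive definite)
    return beta + np.linalg.solve(H_pos, g)

# Iterate newton_step from beta = 0 for a handful of steps until g is small.
```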

Newton's method http://www.gatsby.ucl.ac.uk/teaching/courses/ml2-2008/graddescent.pdf

Newton's Method. If the model has p coefficients, the Hessian has p^2 entries. It will often be impractical to compute all of these, because of either time or memory limits. But a typical quadratic minimum has a few strong dimensions and many weak ones; finding these and optimizing along them can be fast, and is the basis of many more efficient optimization algorithms.

Stochastic Gradient. It's safer to take many small steps along the gradient than a few large steps. Stochastic gradient uses the approximate gradient from one or a few samples to incrementally update the model. Going back to logistic regression, we compute

dl/dβ ≈ Σ_{i ∈ block} x_i (y_i − p_i)

where the sum is taken over a small block of samples, and add a multiple of this gradient to β at each step. These small steps avoid overshooting and oscillation near eccentric minima.
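A compact mini-batch SGD sketch for logistic regression along these lines (the step size and block size here are illustrative values, not the lecture's settings):

```python
import numpy as np

def sgd_logistic(X, y, n_passes=5, block=1000, step=0.1, seed=0):
    """Mini-batch stochastic gradient ascent on the logistic log likelihood."""
    rng = np.random.default_rng(seed)
    N, p = X.shape
    beta = np.zeros(p)
    for _ in range(n_passes):
        order = rng.permutation(N)
        for start in range(0, N, block):
            idx = order[start:start + block]
            Xb, yb = X[idx], y[idx]
            pb = 1.0 / (1.0 + np.exp(-(Xb @ beta)))
            beta += step * (Xb.T @ (yb - pb)) / len(idx)   # averaged block gradient
            # For L2 regularization, also subtract step * 2 * lam * beta here.
    return beta
```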

Convergence Rate. Stochastic gradient is often orders of magnitude faster than traditional gradient methods, but may still be slow for high-dimensional models. E.g. linear regression on 1M Amazon reviews, with model size = 20,000-30,000 and stochastic gradient block size = 1000. [Figure: RMSE vs. number of complete passes over the dataset (~1M local updates).]

Outline: Linear Regression, Regularization, Logistic Regression, Stochastic Gradient, Fast Stochastic Methods, Performance tips.

Improved Stochastic Gradient. Several methods based on incremental estimation of the natural gradient have appeared recently. They infer second-order information from the gradient steps and use it to approximate Newton steps.

Outline: Linear Regression, Regularization, Logistic Regression, Stochastic Gradient, Fast Stochastic Methods, Performance tips.

Sparse Matrices. The term counts X can be stored as an N x M sparse matrix, where N is the number of terms and M the number of docs. Sparse matrices store only the non-zero values. The common formats are:
Coordinate: store all tuples (row, col, value), usually sorted in lexicographic (row, col) or (col, row) order.
CSC, Compressed Sparse Column: rows, values, compress(cols).
CSR, Compressed Sparse Row: cols, values, compress(rows).
CSC especially is a widely used format in scientific computing, e.g. for sparse BLAS and Matlab. CSC supports efficient slicing by columns, which is useful for permutations, blocking, and cross-validation by docids.

CSC Format. Elements are sorted in (col, row) order, i.e. column-major order. The value and row arrays are the same as in coordinate format. Instead of storing every element's column, CSC stores a vector of M+1 column pointers (colptr) giving the start and end+1 of the elements in each column. [Figure: the value and row arrays, one entry per non-zero, alongside the colptr array of length M+1.]
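A small illustration using scipy.sparse (an outside library, used here just to show the three CSC arrays):

```python
import numpy as np
from scipy.sparse import csc_matrix

# A tiny 3x4 term-by-document count matrix.
dense = np.array([[1, 0, 0, 2],
                  [0, 3, 0, 0],
                  [4, 0, 5, 0]])
A = csc_matrix(dense)

print(A.data)     # non-zero values in column-major order: [1 4 3 5 2]
print(A.indices)  # row index of each value:               [0 2 1 2 0]
print(A.indptr)   # M+1 column pointers:                   [0 2 3 4 5]

# Efficient column slicing, e.g. the documents in one cross-validation fold:
fold = A[:, [0, 2]]
```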

CSC Format. Because of memory grain and caching (next time), it is much faster to traverse the columns of a CSC sparse matrix than the rows for matrix operations. Additional techniques (autotuned blocking) may be used inside sparse matrix-vector multiply operators to improve performance; you don't need to implement this, but you should use it when available.

Other Tips. Use binary files + compression; this speeds up reads, which should take less than 10 secs for 0.5 GB. Use blocking for datasets that won't fit in main memory: seek to the start address of a block of data in the binary file, with no need to explicitly split the file. The data is available in binary form, already tokenized, with a dictionary, if you need it.
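A sketch of the seek-based blocking idea, assuming an uncompressed flat binary file of float32 values; the file name and layout are made up for illustration, not the course's actual data format:

```python
import numpy as np

def read_block(path, block_index, block_rows, row_len, dtype=np.float32):
    """Read one block of rows from a flat binary file without splitting the file."""
    item = np.dtype(dtype).itemsize
    offset = block_index * block_rows * row_len * item   # start address of the block
    with open(path, "rb") as f:
        f.seek(offset)
        buf = np.fromfile(f, dtype=dtype, count=block_rows * row_len)
    return buf.reshape(-1, row_len)

# e.g. read_block("reviews.bin", block_index=3, block_rows=100_000, row_len=50)
```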

Summary: Linear Regression, Regularization, Logistic Regression, Stochastic Gradient, Fast Stochastic Methods, Performance tips.