Regression www.biostat.wisc.edu/~dpage/cs760/

Goals for the lecture. You should understand the following concepts: linear regression; RMSE, MAE, and R-square; logistic regression; convex functions and sets; ridge regression (L2 penalty); the lasso (L1 penalty, the least absolute shrinkage and selection operator); the lasso by a proximal method (ISTA); the lasso by coordinate descent.

Using Linear Algebra. As we go to more variables, the notation gets more complex, so we use matrix representation and operations. Assume all features are standardized (standard normal) and that an additional constant-1 feature is included, as we did with ANNs. Given data matrix X with label vector Y, find the vector of coefficients β that minimizes $\|X\beta - Y\|_2^2$.
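
A minimal NumPy sketch of this objective on toy data (the array names and data here are illustrative, not from the lecture); np.linalg.lstsq returns the β minimizing $\|X\beta - Y\|_2^2$:

```python
import numpy as np

# Toy data: 100 examples, 3 standardized features plus a constant-1 column.
rng = np.random.default_rng(0)
X = np.hstack([rng.normal(size=(100, 3)), np.ones((100, 1))])
true_beta = np.array([2.0, -1.0, 0.5, 3.0])
Y = X @ true_beta + rng.normal(scale=0.1, size=100)

# Ordinary least squares: the beta minimizing ||X beta - Y||_2^2.
beta, *_ = np.linalg.lstsq(X, Y, rcond=None)
print(beta)  # close to true_beta
```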

Evaluation Metrics for Numeric Prediction. Root mean squared error (RMSE), mean absolute error (MAE, the average absolute error), and R-square (R-squared). Historically all were computed on the training data, possibly with an adjustment afterward, but really they should be cross-validated.
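
A short sketch of the two error metrics in NumPy (the function names are my own):

```python
import numpy as np

def rmse(y_true, y_pred):
    # Root mean squared error.
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

def mae(y_true, y_pred):
    # Mean absolute error.
    return np.mean(np.abs(y_true - y_pred))
```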

R-square(d). Formulation 1: $R^2 = 1 - \frac{\sum_i (y_i - h(\vec{x}_i))^2}{\sum_i (y_i - \bar{y})^2}$. Formulation 2: the square of the Pearson correlation coefficient r; recall that for x, y: $r = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_i (x_i - \bar{x})^2}\,\sqrt{\sum_i (y_i - \bar{y})^2}}$.

Some Observations. An R-square of 0 means you have no model; an R-square of 1 implies a perfect model (loosely, one that explains all of the variation). The two formulations agree when computed on the training set. They do not agree under cross-validation, in general, because the mean of the training set differs from the mean of each fold. You should do CV and use the first formulation, but note that it can then be negative!
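
A sketch of the two formulations (function names are my own); on held-out data the first can go negative, while the squared correlation stays in [0, 1]:

```python
import numpy as np

def r2_formulation1(y_true, y_pred):
    # 1 - (residual sum of squares) / (total sum of squares); can be negative on held-out data.
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return 1.0 - ss_res / ss_tot

def r2_formulation2(y_true, y_pred):
    # Square of the Pearson correlation coefficient; always in [0, 1].
    r = np.corrcoef(y_true, y_pred)[0, 1]
    return r ** 2
```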

Logistic Regression: Motivation. Linear regression was used to fit a linear model to the feature space. The requirement arose to do classification: make the number of interest the probability that a feature will take a particular value given the other features, P(Y = 1 | X). So we extend linear regression for classification.

Theory of Logistic Regression. [Figure: plot of y versus x.]

Likelihood function for logistic regression. Naïve Bayes classification makes use of conditional information and computes the full joint distribution. Logistic regression instead computes a conditional probability: it operates on the conditional likelihood function, or the conditional log likelihood. The conditional likelihood is the probability of the observed Y values conditioned on their corresponding X values. Unlike linear regression, there is no closed-form solution, so we use standard gradient ascent to find the weight vector w that maximizes the conditional likelihood.
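
A minimal sketch of that gradient ascent in NumPy (the function names, learning rate, and iteration count here are illustrative choices, not the lecture's):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_regression_gradient_ascent(X, y, lr=0.1, n_iters=1000):
    """Gradient ascent on the conditional log likelihood sum_i log P(y_i | x_i, w)."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(n_iters):
        p = sigmoid(X @ w)       # predicted P(Y = 1 | X) for each example
        grad = X.T @ (y - p)     # gradient of the conditional log likelihood
        w += lr * grad / n       # take a step up the gradient
    return w
```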

Background on Optimization (Bubeck, 2015). Thanks to Irene Giacomelli. Definition 1.1 (Convex sets and convex functions). A set $X \subseteq \mathbb{R}^n$ is said to be convex if it contains all of its segments, that is, for all $(x, y, \gamma) \in X \times X \times [0, 1]$, $(1 - \gamma)x + \gamma y \in X$. A function $f : X \to \mathbb{R}$ is said to be convex if it always lies below its chords, that is, for all $(x, y, \gamma) \in X \times X \times [0, 1]$, $f((1 - \gamma)x + \gamma y) \le (1 - \gamma)f(x) + \gamma f(y)$. We are interested in algorithms that take as input a convex set X and a convex function f and output an approximate minimum of f over X. We write the problem of finding the minimum of f over X compactly as min f(x) s.t. $x \in X$.

(Non-)Convex Function or Set

Epigraph (Bubeck, 2015). For $f : X \to \mathbb{R}$, $\mathrm{epi}(f) = \{(x, t) \in X \times \mathbb{R} : t \ge f(x)\}$. It is obvious that a function is convex if and only if its epigraph is a convex set.

Show that for all real a < b and 0 ≤ c ≤ 1, $f(ca + (1 - c)b) \le c\,f(a) + (1 - c)f(b)$ for the following: f(x) = |x| and f(x) = x². Not so for f(x) = x³. In general x could be a vector $\vec{x}$. For gradient descent, we also want f(x) to be continuously differentiable; for |x| we need proximal methods, subgradient methods, or coordinate descent.
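
A worked check of the x² case, plus a counterexample for x³ (the specific numbers in the counterexample are my own choice):
\[
c\,a^2 + (1 - c)b^2 - \bigl(ca + (1 - c)b\bigr)^2 = c(1 - c)(a - b)^2 \ge 0,
\]
so $f(ca + (1 - c)b) \le c\,f(a) + (1 - c)f(b)$ holds for $f(x) = x^2$. For $f(x) = x^3$, take $a = -2$, $b = 0$, $c = \tfrac{1}{2}$: then $f(ca + (1 - c)b) = f(-1) = -1$, while $c\,f(a) + (1 - c)f(b) = \tfrac{1}{2}(-8) + \tfrac{1}{2}(0) = -4$, and $-1 \le -4$ fails, so $x^3$ is not convex on all of $\mathbb{R}$.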

Logistic Regression Algorithm. [Figure; label: error in estimate.]

Regularized Regression. Regression (linear or logistic) is prone to overfitting, especially when there are a large number of features or when fit with high-order polynomial features. Regularization helps combat overfitting by yielding a simpler model. It is used when we want less variation in the different weights, or smaller weights overall, or only a few non-zero weights (and thus fewer features considered). Regularization is accomplished by adding a penalty term to the target function being optimized. There are two types: L2 and L1 regularization.

L2 regularization

L2 regularization in linear regression. Called ridge regression. It still has a closed-form solution, so even though the objective is continuously differentiable and convex, we don't need gradient descent: $\beta = (X^T X + \lambda I)^{-1} X^T Y$.
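
A minimal NumPy sketch of that closed form (function name and the default λ are my own; in practice the constant-1 feature's coefficient is often left unpenalized):

```python
import numpy as np

def ridge_closed_form(X, Y, lam=1.0):
    # beta = (X^T X + lambda I)^{-1} X^T Y, solved without forming an explicit inverse.
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ Y)
```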

L1 regularization

Obtained by taking the Lagrangian. Even for linear regression there is no closed-form solution. Ordinary gradient ascent also does not work, because the L1 penalty has no derivative at zero. The fastest methods now are FISTA and (faster still) coordinate descent.

Proximal Methods. f(x) = g(x) + h(x), where g is convex and differentiable, and h is convex and decomposable but not differentiable. Example: g is the squared error and h is the lasso penalty, a sum of absolute-value terms, one per coefficient (so one per feature). Find β to minimize $\underbrace{\|X\beta - Y\|_2^2}_{g} + \underbrace{\lambda \|\beta\|_1}_{h}$.

Proximal Operator: Soft-Thresholding. For all i,
\[
[S_\lambda(x)]_i =
\begin{cases}
x_i - \lambda & \text{if } x_i > \lambda \\
0 & \text{if } -\lambda \le x_i \le \lambda \\
x_i + \lambda & \text{if } x_i < -\lambda
\end{cases}
\]
We typically apply this to the coefficient vector β.
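
A one-line NumPy sketch of this operator (the name soft_threshold is my own):

```python
import numpy as np

def soft_threshold(x, lam):
    # Elementwise: shrink x toward 0 by lam, and set entries inside [-lam, lam] to 0.
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)
```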

Iterative Shrinkage-Thresholding Algorithm (ISTA). Initialize β; let η be the learning rate. Repeat until convergence: make a gradient step of $\beta \leftarrow S_\lambda(\beta - \eta X^T(X\beta - y))$.
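
A sketch of the ISTA loop, reusing the soft_threshold function above (here the threshold is scaled as η·λ, the usual proximal step for the objective $\|X\beta - Y\|_2^2 + \lambda\|\beta\|_1$; the default step size and iteration count are my own choices):

```python
import numpy as np

def ista(X, y, lam, eta=None, n_iters=500):
    """Lasso via ISTA: minimize ||X beta - y||_2^2 + lam * ||beta||_1 (sketch)."""
    n, d = X.shape
    if eta is None:
        # Step size 1/L, where L = 2 * sigma_max(X)^2 bounds the curvature of the smooth part.
        eta = 1.0 / (2.0 * np.linalg.norm(X, 2) ** 2)
    beta = np.zeros(d)
    for _ in range(n_iters):
        grad = 2.0 * X.T @ (X @ beta - y)                     # gradient of the smooth part g
        beta = soft_threshold(beta - eta * grad, eta * lam)   # proximal step for lam * ||.||_1
    return beta
```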

Coordinate Descent. The fastest current method for lasso-penalized linear or logistic regression. Simple idea: adjust one feature at a time, and special-case it near 0 where the gradient is not defined (where the absolute value's effect changes). Features can be taken in a cycle in any order, or the next feature can be picked at random (analogous to Gibbs sampling). To special-case it near 0, just apply soft-thresholding everywhere.

Coordinate Descent Algorithm. Initialize the coefficients. Cycle over features until convergence. For each example i and feature j, compute the partial residual $r_{ij} = y_i - \sum_{k \ne j} x_{ik}\beta_k$. Compute the least-squares coefficient of feature j against these residuals (as we did in OLS regression): $\beta_j^* = \frac{1}{N}\sum_{i=1}^{N} x_{ij}\, r_{ij}$. Update β_j by soft-thresholding, where for any term T, $T_+$ denotes $\max(0, T)$: $\beta_j \leftarrow S_\lambda(\beta_j^*) = \mathrm{sign}(\beta_j^*)\,(|\beta_j^*| - \lambda)_+$.
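
A sketch of this cycle in NumPy, reusing soft_threshold from above and assuming standardized features as on the "Using Linear Algebra" slide (the function name and the fixed number of sweeps in place of a convergence test are my own simplifications):

```python
import numpy as np

def lasso_coordinate_descent(X, y, lam, n_sweeps=100):
    """Cyclic coordinate descent for the lasso (sketch; assumes standardized columns of X)."""
    n, d = X.shape
    beta = np.zeros(d)
    for _ in range(n_sweeps):
        for j in range(d):
            # Partial residual: error with feature j's current contribution added back in.
            r_j = y - X @ beta + X[:, j] * beta[j]
            # Least-squares coefficient of feature j against the partial residuals.
            beta_star = (X[:, j] @ r_j) / n
            # Soft-threshold to get the updated coefficient.
            beta[j] = soft_threshold(beta_star, lam)
    return beta
```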

Comments on penalized regression. L2-penalized regression is also called ridge regression. The L1 and L2 penalties can be combined: the elastic net. L1-penalized regression is an especially active area of research: group lasso, fused lasso, and others.

Comments on logistic regression. Logistic regression is a linear classifier. In a naïve Bayes classifier, features are independent given the class, an assumption on P(X | Y). In logistic regression the functional form is P(Y | X); there is no assumption on P(X | Y). Logistic regression is optimized using the conditional likelihood: there is no closed-form solution, but the objective is concave, so gradient ascent reaches the global optimum.

More comments on regularization. Linear and logistic regression are prone to overfitting. Regularization helps combat overfitting by adding a penalty term to the target function being optimized. L1 regularization is often preferred since it produces sparse models: it can drive certain coefficients (weights) to zero, in effect performing feature selection. L2 regularization drives toward smaller, simpler weight vectors but cannot perform feature selection the way L1 regularization can. There are few uses of OLS these days; e.g., warfarin dosing (NEJM 2009) used just 30 carefully hand-selected features.