Machine Learning. Linear Models. Fabio Vandin October 10, 2017

Linear Predictors and Affine Functions

Consider $X = \mathbb{R}^d$. Affine functions:
$$L_d = \{h_{w,b} : w \in \mathbb{R}^d,\ b \in \mathbb{R}\}$$
where
$$h_{w,b}(x) = \langle w, x \rangle + b = \sum_{i=1}^{d} w_i x_i + b.$$
Note: each member of $L_d$ is a function $x \mapsto \langle w, x \rangle + b$; $b$ is the bias term.

Linear Models

Hypothesis class $H$: $\phi \circ L_d$, where $\phi : \mathbb{R} \to Y$. Each $h \in H$ is a function $h : \mathbb{R}^d \to Y$; $\phi$ depends on the learning problem.

Examples:
- binary classification, $Y = \{-1, 1\}$: $\phi(z) = \mathrm{sign}(z)$
- regression, $Y = \mathbb{R}$: $\phi(z) = z$

Equivalent Notation

Given $x \in X$, $w \in \mathbb{R}^d$, $b \in \mathbb{R}$, define:
$$w' = (b, w_1, w_2, \ldots, w_d) \in \mathbb{R}^{d+1}, \qquad x' = (1, x_1, x_2, \ldots, x_d) \in \mathbb{R}^{d+1}.$$
Then:
$$h_{w,b}(x) = \langle w, x \rangle + b = \langle w', x' \rangle \qquad (1)$$
We will consider the bias term as part of $w$ and assume $x = (1, x_1, x_2, \ldots, x_d)$ when needed, with $h_w(x) = \langle w, x \rangle$.
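As a quick sanity check of this equivalence, here is a minimal NumPy sketch (the values of w, b, and x are illustrative) showing that prepending a 1 to x and the bias to w reproduces the affine prediction:

```python
import numpy as np

# illustrative 3-dimensional example
w = np.array([0.5, -1.2, 2.0])
b = 0.7
x = np.array([1.5, 0.0, -2.0])

# affine predictor: <w, x> + b
h_affine = np.dot(w, x) + b

# homogeneous form: bias folded into the weight vector
w_prime = np.concatenate(([b], w))
x_prime = np.concatenate(([1.0], x))
h_homogeneous = np.dot(w_prime, x_prime)

assert np.isclose(h_affine, h_homogeneous)
```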

Linear Classification

$X = \mathbb{R}^d$, $Y = \{-1, 1\}$, 0-1 loss. Hypothesis class = halfspaces:
$$HS_d = \mathrm{sign} \circ L_d = \{x \mapsto \mathrm{sign}(h_{w,b}(x)) : h_{w,b} \in L_d\} \qquad (2)$$
Example: $X = \mathbb{R}^2$.

Finding a Good Hypothesis

Linear classification with hypothesis set $H$ = halfspaces. How do we find a good hypothesis? Good = ERM rule.

Perceptron Algorithm (Rosenblatt, 1958)

Note: if $y_i \langle w, x_i \rangle > 0$ for all $i = 1, \ldots, m$, then all points are classified correctly by the model $w$.

Perceptron

Input: training set $(x_1, y_1), \ldots, (x_m, y_m)$

initialize $w^{(1)} = (0, \ldots, 0)$;
for $t = 1, 2, \ldots$ do
    if $\exists i$ s.t. $y_i \langle w^{(t)}, x_i \rangle \le 0$ then $w^{(t+1)} \leftarrow w^{(t)} + y_i x_i$;
    else return $w^{(t)}$;

Interpretation of the update. Note that:
$$y_i \langle w^{(t+1)}, x_i \rangle = y_i \langle w^{(t)} + y_i x_i, x_i \rangle = y_i \langle w^{(t)}, x_i \rangle + \|x_i\|^2,$$
so the update guides $w$ to be "more correct" on $(x_i, y_i)$.

Termination? Depends on the realizability assumption!
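The pseudocode above can be turned into a short NumPy sketch as follows (a minimal illustration, not the course's reference implementation; the max_iter cap is an added safeguard for non-separable data, which a later slide handles with the pocket algorithm):

```python
import numpy as np

def perceptron(X, y, max_iter=10_000):
    """X: m x d array (include a leading column of 1s if a bias is wanted),
    y: array of m labels in {-1, +1}. Returns the weight vector w."""
    m, d = X.shape
    w = np.zeros(d)                                   # w^(1) = (0, ..., 0)
    for _ in range(max_iter):
        mistakes = np.nonzero(y * (X @ w) <= 0)[0]    # indices with y_i <w, x_i> <= 0
        if mistakes.size == 0:
            return w                                  # all points classified correctly
        i = mistakes[0]
        w = w + y[i] * X[i]                           # w^(t+1) = w^(t) + y_i x_i
    return w                                          # reached only if not separated within max_iter steps
```

For example, with X = np.array([[1, 2], [2, 3], [-1, -2], [-2, -1]]) and y = np.array([1, 1, -1, -1]) the loop terminates with a w satisfying $y_i \langle w, x_i \rangle > 0$ for all $i$.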

Perceptron with Linearly Separable Data

Realizability assumption for halfspaces = linearly separable data.

Proposition. Assume that $(x_1, y_1), \ldots, (x_m, y_m)$ is separable, and let
$$B = \min\{\|w\| : \forall i = 1, \ldots, m,\ y_i \langle w, x_i \rangle \ge 1\}, \qquad R = \max_i \|x_i\|.$$
Then the Perceptron algorithm stops after at most $(RB)^2$ iterations, and when it stops it holds that
$$\forall i \in \{1, \ldots, m\}: y_i \langle w^{(t)}, x_i \rangle > 0.$$

Perceptron: Notes

- simple to implement
- for separable data, convergence is guaranteed
- convergence time depends on B, which can be exponential in d... an ILP approach may be better for finding the ERM solution in such cases
- potentially multiple solutions; which one is picked depends on the starting values
- non-separable data? run for some time and keep the best solution found up to that point (pocket algorithm)

Linear Regression

$X = \mathbb{R}^d$, $Y = \mathbb{R}$. Hypothesis class:
$$H_{reg} = L_d = \{x \mapsto h_{w,b}(x) : h_{w,b} \in L_d\} \qquad (3)$$
Note: each $h \in H_{reg}$ is a function $h : \mathbb{R}^d \to \mathbb{R}$.

Commonly used loss function: squared loss
$$\ell(h, (x, y)) \stackrel{\mathrm{def}}{=} (h(x) - y)^2.$$
Empirical risk function (training error): Mean Squared Error
$$L_S(h) = \frac{1}{m} \sum_{i=1}^{m} (h(x_i) - y_i)^2.$$

Least Squares

How do we find a good hypothesis? Least Squares algorithm.

Best hypothesis:
$$\arg\min_{w} L_S(h_w) = \arg\min_{w} \frac{1}{m} \sum_{i=1}^{m} (h_w(x_i) - y_i)^2.$$
Equivalent formulation: the $w$ minimizing the Residual Sum of Squares (RSS), i.e.
$$\arg\min_{w} \sum_{i=1}^{m} (h_w(x_i) - y_i)^2.$$

RSS: Matrix Form

Let
$$X = \begin{pmatrix} x_1 \\ x_2 \\ \vdots \\ x_m \end{pmatrix}, \qquad y = \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_m \end{pmatrix},$$
where $X$ is the design matrix (one row per data point). We have that the RSS is
$$\sum_{i=1}^{m} (h_w(x_i) - y_i)^2 = (y - Xw)^T (y - Xw).$$

We want to find the $w$ that minimizes the RSS (= objective function):
$$w^* = \arg\min_{w} \mathrm{RSS}(w) = \arg\min_{w} (y - Xw)^T (y - Xw).$$
How? Compute the gradient $\frac{\partial\, \mathrm{RSS}(w)}{\partial w}$ of the objective function w.r.t. $w$ and compare it to 0:
$$\frac{\partial\, \mathrm{RSS}(w)}{\partial w} = -2X^T(y - Xw).$$
Then we need to find $w$ such that $-2X^T(y - Xw) = 0$.

$-2X^T(y - Xw) = 0$ is equivalent to
$$X^T X w = X^T y.$$
If $X^T X$ is invertible, the solution to the ERM problem is:
$$w = (X^T X)^{-1} X^T y.$$
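As an illustration of this closed-form solution, here is a minimal NumPy sketch on synthetic data (the data are made up for the example; np.linalg.solve is used on the normal equations rather than forming the inverse explicitly, which is the numerically preferable route):

```python
import numpy as np

rng = np.random.default_rng(0)

# synthetic data: m points in d dimensions, with a leading column of 1s for the bias
m, d = 100, 3
X = np.hstack([np.ones((m, 1)), rng.normal(size=(m, d))])
w_true = np.array([0.5, 1.0, -2.0, 3.0])
y = X @ w_true + 0.1 * rng.normal(size=m)

# ERM / least squares: solve the normal equations X^T X w = X^T y
w_hat = np.linalg.solve(X.T @ X, X.T @ y)

# equivalently (and robust to a non-invertible X^T X), use lstsq
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
assert np.allclose(w_hat, w_lstsq)
```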

Complexity Considerations

We need to compute $(X^T X)^{-1} X^T y$.

Algorithm:
1. compute $X^T X$: product of a $(d+1) \times m$ matrix and an $m \times (d+1)$ matrix
2. compute $(X^T X)^{-1}$: inversion of a $(d+1) \times (d+1)$ matrix
3. compute $(X^T X)^{-1} X^T$: product of a $(d+1) \times (d+1)$ matrix and a $(d+1) \times m$ matrix
4. compute $(X^T X)^{-1} X^T y$: product of a $(d+1) \times m$ matrix and an $m \times 1$ vector

Most expensive operation? The inversion, done for a $(d+1) \times (d+1)$ matrix.

$X^T X$ Not Invertible?

How do we get $w$ such that $X^T X w = X^T y$ if $X^T X$ is not invertible?

Let $A = X^T X$ and let $A^+$ be the generalized inverse of $A$, i.e.:
$$A A^+ A = A.$$
Proposition. If $A = X^T X$ is not invertible, then $\hat{w} = A^+ X^T y$ is a solution to $X^T X w = X^T y$.

Proof: [UML] 9.2.1

Generalized Inverse of A

Note: $A = X^T X$ is symmetric, so it has an eigenvalue decomposition
$$A = V D V^T$$
with
- $D$: diagonal matrix (entries = eigenvalues of $A$)
- $V$: orthonormal matrix ($V^T V = I_{d \times d}$)

Define $D^+$ as the diagonal matrix such that:
$$D^+_{i,i} = \begin{cases} 0 & \text{if } D_{i,i} = 0 \\ \frac{1}{D_{i,i}} & \text{otherwise} \end{cases}$$

Let $A^+ = V D^+ V^T$. Then
$$A A^+ A = V D V^T V D^+ V^T V D V^T = V D D^+ D V^T = V D V^T = A,$$
so $A^+$ is the generalized inverse of $A$.
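A small NumPy sketch of this construction on a deliberately rank-deficient design matrix (the data, tolerance, and variable names are illustrative; np.linalg.eigh is used because $A = X^T X$ is symmetric):

```python
import numpy as np

rng = np.random.default_rng(1)

# build a rank-deficient X so that A = X^T X is not invertible
X = rng.normal(size=(10, 4))
X[:, 3] = X[:, 0] + X[:, 1]          # last column is a linear combination of the others
A = X.T @ X

# eigenvalue decomposition A = V D V^T (eigh: for symmetric matrices)
eigvals, V = np.linalg.eigh(A)

# D^+: invert nonzero eigenvalues, keep zeros (up to a numerical tolerance)
tol = 1e-10
d_plus = np.array([0.0 if abs(lam) < tol else 1.0 / lam for lam in eigvals])
A_plus = V @ np.diag(d_plus) @ V.T

# check the defining property A A^+ A = A
assert np.allclose(A @ A_plus @ A, A)

# w_hat = A^+ X^T y solves X^T X w = X^T y
y = rng.normal(size=10)
w_hat = A_plus @ (X.T @ y)
assert np.allclose(A @ w_hat, X.T @ y)
```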

Logistic Regression

Learn a function $h$ from $\mathbb{R}^d$ to $[0, 1]$. What can this be used for? Classification!

Example: binary classification ($Y = \{-1, 1\}$): $h(x)$ = probability that the label of $x$ is 1.

For simplicity of presentation, we consider binary classification with $Y = \{-1, 1\}$, but similar considerations apply to multiclass classification.

Logistic Regression: Model

Hypothesis class $H$: $\phi_{sig} \circ L_d$, where $\phi_{sig} : \mathbb{R} \to [0, 1]$ is a sigmoid function.

Sigmoid function = S-shaped function. For logistic regression, the sigmoid $\phi_{sig}$ used is the logistic function:
$$\phi_{sig}(z) = \frac{1}{1 + e^{-z}}.$$

Therefore
$$H_{sig} = \phi_{sig} \circ L_d = \{x \mapsto \phi_{sig}(\langle w, x \rangle) : w \in \mathbb{R}^d\}.$$
Main difference with binary classification with halfspaces: when $\langle w, x \rangle \approx 0$,
- the halfspace prediction is deterministically $1$ or $-1$;
- $\phi_{sig}(\langle w, x \rangle) \approx 1/2$, i.e., there is uncertainty in the predicted label.

Loss Function

Need to define how bad it is to predict $h_w(x) \in [0, 1]$ given that the true label is $y = \pm 1$.

Desiderata:
- $h_w(x)$ large if $y = 1$
- $1 - h_w(x)$ large if $y = -1$

Note that
$$1 - h_w(x) = 1 - \frac{1}{1 + e^{-\langle w, x \rangle}} = \frac{e^{-\langle w, x \rangle}}{1 + e^{-\langle w, x \rangle}} = \frac{1}{1 + e^{\langle w, x \rangle}}.$$

Then a reasonable loss function increases monotonically with
$$\frac{1}{1 + e^{y \langle w, x \rangle}},$$
i.e., increases monotonically with $1 + e^{-y \langle w, x \rangle}$.

Loss function for logistic regression:
$$\ell(h_w, (x, y)) = \log\left(1 + e^{-y \langle w, x \rangle}\right).$$

Therefore, given a training set $S = ((x_1, y_1), \ldots, (x_m, y_m))$, the ERM problem for logistic regression is:
$$\arg\min_{w \in \mathbb{R}^d} \frac{1}{m} \sum_{i=1}^{m} \log\left(1 + e^{-y_i \langle w, x_i \rangle}\right).$$
Notes:
- the logistic loss function is a convex function
- the ERM problem can be solved efficiently
- the definition may look a bit arbitrary: actually, the ERM formulation is the same as the one arising from Maximum Likelihood Estimation
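Since the objective is convex, plain gradient descent is one simple way to solve this ERM problem. Below is a minimal sketch (the learning rate, iteration count, and synthetic data are illustrative assumptions, not part of the slides):

```python
import numpy as np

def logistic_erm_gd(X, y, lr=0.1, n_iters=1000):
    """Minimize (1/m) sum_i log(1 + exp(-y_i <w, x_i>)) by gradient descent.
    X: m x d array, y: array of m labels in {-1, +1}."""
    m, d = X.shape
    w = np.zeros(d)
    for _ in range(n_iters):
        margins = y * (X @ w)                            # y_i <w, x_i>
        # gradient of the empirical logistic loss w.r.t. w
        grad = -(X.T @ (y / (1.0 + np.exp(margins)))) / m
        w -= lr * grad
    return w

# illustrative usage on synthetic data
rng = np.random.default_rng(2)
X = rng.normal(size=(200, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = np.sign(X @ w_true + 0.1 * rng.normal(size=200))
w_hat = logistic_erm_gd(X, y)
train_error = np.mean(np.sign(X @ w_hat) != y)
```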

Maximum Likelihood Estimation (MLE) [UML, 24.1]

MLE is a statistical approach for finding the parameters that maximize the joint probability of a given dataset, assuming a specific parametric probability function.

Note: MLE essentially assumes a generative model for the data.

General approach:
1. given a training set $S = ((x_1, y_1), \ldots, (x_m, y_m))$, assume each $(x_i, y_i)$ is drawn i.i.d. from some probability distribution with parameters $\theta$
2. consider $P[S \mid \theta]$ (likelihood of the data given the parameters)
3. log likelihood: $L(S; \theta) = \log(P[S \mid \theta])$
4. maximum likelihood estimator: $\hat{\theta} = \arg\max_{\theta} L(S; \theta)$

Logistic Regression and MLE

Assuming $x_1, \ldots, x_m$ are fixed, the probability that $x_i$ has label $y_i = 1$ is
$$h_w(x_i) = \frac{1}{1 + e^{-\langle w, x_i \rangle}},$$
while the probability that $x_i$ has label $y_i = -1$ is
$$1 - h_w(x_i) = \frac{1}{1 + e^{\langle w, x_i \rangle}}.$$
Then the likelihood of the training set $S$ is:
$$\prod_{i=1}^{m} \frac{1}{1 + e^{-y_i \langle w, x_i \rangle}}.$$

Therefore the log likelihood is:
$$-\sum_{i=1}^{m} \log\left(1 + e^{-y_i \langle w, x_i \rangle}\right).$$
And note that the maximum likelihood estimator for $w$ is:
$$\arg\max_{w} \left(-\sum_{i=1}^{m} \log\left(1 + e^{-y_i \langle w, x_i \rangle}\right)\right) = \arg\min_{w} \sum_{i=1}^{m} \log\left(1 + e^{-y_i \langle w, x_i \rangle}\right).$$
The MLE solution is equivalent to the ERM solution!

Bibliography

[UML] S. Shalev-Shwartz and S. Ben-David, Understanding Machine Learning: From Theory to Algorithms. Chapter 9 (excluding Section 9.1.1).