
Machine Learning. Linear Models.
Fabio Vandin
October 10, 2017

Linear Predictors and Affine Functions

Consider $X = \mathbb{R}^d$. Affine functions:
$$L_d = \{h_{w,b} : w \in \mathbb{R}^d,\ b \in \mathbb{R}\}, \quad \text{where}\quad h_{w,b}(x) = \langle w, x \rangle + b = \left(\sum_{i=1}^{d} w_i x_i\right) + b.$$
Note: each member of $L_d$ is a function $x \mapsto \langle w, x \rangle + b$; $b$ is the bias.

Linear Models

Hypothesis class $H$: $\phi \circ L_d$, where $\phi : \mathbb{R} \to Y$, so each $h \in H$ is $h : \mathbb{R}^d \to Y$. The function $\phi$ depends on the learning problem. Examples:
- binary classification, $Y = \{-1, +1\}$: $\phi(z) = \mathrm{sign}(z)$
- regression, $Y = \mathbb{R}$: $\phi(z) = z$
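
As a concrete illustration, here is a minimal Python sketch of an affine predictor from $L_d$ composed with $\phi$ for the two cases above (the parameter values and variable names are illustrative, not from the slides):

```python
import numpy as np

def h_affine(w, b, x):
    """A member of L_d: h_{w,b}(x) = <w, x> + b."""
    return np.dot(w, x) + b

w, b = np.array([2.0, -1.0]), 0.5   # illustrative parameters (d = 2)
x = np.array([1.0, 3.0])

z = h_affine(w, b, x)               # 2*1 + (-1)*3 + 0.5 = -0.5
print(np.sign(z))                   # classification: phi(z) = sign(z) -> -1.0
print(z)                            # regression: phi(z) = z -> -0.5
```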

Equivalent Notation

Given $x \in X$, $w \in \mathbb{R}^d$, $b \in \mathbb{R}$, define
$$w' = (b, w_1, w_2, \dots, w_d) \in \mathbb{R}^{d+1}, \qquad x' = (1, x_1, x_2, \dots, x_d) \in \mathbb{R}^{d+1}.$$
Then $h_{w,b}(x) = \langle w, x \rangle + b = \langle w', x' \rangle$. We will consider the bias term as part of $w$ and assume $x = (1, x_1, x_2, \dots, x_d)$ when needed, with $h_w(x) = \langle w, x \rangle$.
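
A quick numerical sanity check of this rewriting (illustrative values, not from the slides):

```python
import numpy as np

# <w, x> + b equals <w', x'> with w' = (b, w_1, ..., w_d) and x' = (1, x_1, ..., x_d).
w, b = np.array([2.0, -1.0]), 0.5
x = np.array([1.0, 3.0])

w_aug = np.concatenate(([b], w))     # w' = (b, w_1, ..., w_d)
x_aug = np.concatenate(([1.0], x))   # x' = (1, x_1, ..., x_d)

assert np.isclose(np.dot(w, x) + b, np.dot(w_aug, x_aug))
```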

Linear Classification

$X = \mathbb{R}^d$, $Y = \{-1, +1\}$, 0-1 loss. Hypothesis class = halfspaces:
$$HS_d = \mathrm{sign} \circ L_d = \{x \mapsto \mathrm{sign}(h_{w,b}(x)) : h_{w,b} \in L_d\}.$$
Example: $X = \mathbb{R}^2$ (figure omitted).

Finding a Good Hypothesis

Linear classification with hypothesis set $H$ = halfspaces. How do we find a good hypothesis? Good = ERM rule. The Perceptron Algorithm (Rosenblatt, 1958). Note: if $y_i \langle w, x_i \rangle > 0$ for all $i = 1, \dots, m$, then all points are classified correctly by the model $w$.

Perceptron

Input: training set $(x_1, y_1), \dots, (x_m, y_m)$

    initialize w^(1) = (0, ..., 0);
    for t = 1, 2, ... do
        if exists i s.t. y_i <w^(t), x_i> <= 0 then
            w^(t+1) <- w^(t) + y_i x_i;
        else
            return w^(t);

Interpretation of the update. Note that
$$y_i \langle w^{(t+1)}, x_i \rangle = y_i \langle w^{(t)} + y_i x_i, x_i \rangle = y_i \langle w^{(t)}, x_i \rangle + \|x_i\|^2,$$
so the update guides $w$ to be more correct on $(x_i, y_i)$. Termination? Depends on the realizability assumption!
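
A direct Python translation of the pseudocode above (a sketch; the toy data and the max_iter safeguard are mine, not from the slides):

```python
import numpy as np

def perceptron(X, y, max_iter=1000):
    """Perceptron as in the pseudocode above. X is m x (d+1) with a leading
    column of ones (bias absorbed into w); y has entries in {-1, +1}.
    The max_iter cap is an extra safeguard, not part of the slides."""
    w = np.zeros(X.shape[1])                 # w^(1) = (0, ..., 0)
    for _ in range(max_iter):
        mistakes = np.nonzero(y * (X @ w) <= 0)[0]
        if mistakes.size == 0:               # no i with y_i <w^(t), x_i> <= 0
            return w
        i = mistakes[0]
        w = w + y[i] * X[i]                  # w^(t+1) = w^(t) + y_i x_i
    return w

# Tiny linearly separable example (illustrative data):
X = np.array([[1.0,  2.0,  1.0],
              [1.0, -1.0, -2.0],
              [1.0,  3.0,  0.5],
              [1.0, -2.0, -1.0]])
y = np.array([1, -1, 1, -1])
w = perceptron(X, y)
print(np.all(y * (X @ w) > 0))               # True: all points classified correctly
```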

Perceptron with Linearly Separable Data

Realizability assumption for halfspaces = linearly separable data.

Proposition. Assume that $(x_1, y_1), \dots, (x_m, y_m)$ is linearly separable, and let
$$B = \min\{\|w\| : y_i \langle w, x_i \rangle \ge 1,\ i = 1, \dots, m\}, \qquad R = \max_i \|x_i\|.$$
Then the Perceptron algorithm stops after at most $(RB)^2$ iterations, and when it stops it holds that $\forall i \in \{1, \dots, m\}: y_i \langle w^{(t)}, x_i \rangle > 0$.

Perceptron: Notes

- simple to implement
- for separable data, convergence is guaranteed
- convergence time depends on $B$, which can be exponential in $d$... an LP approach may be better for finding an ERM solution in such cases
- potentially multiple solutions; which one is picked depends on the starting values
- non-separable data? run for some time and keep the best solution found up to that point (pocket algorithm; a sketch follows below)
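
A rough sketch of the pocket idea mentioned in the last note (the specific details, such as how the mistaken example is chosen and the update budget, are my assumptions, not from the slides):

```python
import numpy as np

def pocket_perceptron(X, y, n_updates=1000):
    """Run Perceptron updates for a fixed budget and keep ("pocket") the weight
    vector with the fewest training mistakes seen so far."""
    w = np.zeros(X.shape[1])
    best_w = w.copy()
    best_errors = np.sum(y * (X @ w) <= 0)
    for _ in range(n_updates):
        mistakes = np.nonzero(y * (X @ w) <= 0)[0]
        if mistakes.size == 0:
            return w                          # separable: current w is already perfect
        i = mistakes[0]
        w = w + y[i] * X[i]
        errors = np.sum(y * (X @ w) <= 0)
        if errors < best_errors:              # keep the best solution found so far
            best_w, best_errors = w.copy(), errors
    return best_w
```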

Linear Regression

$X = \mathbb{R}^d$, $Y = \mathbb{R}$. Hypothesis class:
$$H_{reg} = L_d = \{x \mapsto \langle w, x \rangle + b : w \in \mathbb{R}^d,\ b \in \mathbb{R}\}.$$
Note: each $h \in H_{reg}$ is $h : \mathbb{R}^d \to \mathbb{R}$. Commonly used loss function: squared loss
$$\ell(h, (x, y)) \stackrel{\mathrm{def}}{=} (h(x) - y)^2.$$
Empirical risk function (training error): Mean Squared Error
$$L_S(h) = \frac{1}{m} \sum_{i=1}^{m} (h(x_i) - y_i)^2.$$
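
A minimal sketch of the training error $L_S(h_w)$ for a linear predictor $h_w(x) = \langle w, x \rangle$ (bias already folded into $w$; the function name is mine):

```python
import numpy as np

def mse(w, X, y):
    """Mean Squared Error: (1/m) * sum_i (<w, x_i> - y_i)^2,
    with the examples x_i stacked as the rows of X."""
    residuals = X @ w - y
    return np.mean(residuals ** 2)
```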

Linear Regression: Example ($d = 1$; figure omitted).

Least Squares

How do we find an ERM hypothesis? Least Squares algorithm. Best hypothesis:
$$\arg\min_w L_S(h_w) = \arg\min_w \frac{1}{m} \sum_{i=1}^{m} (\langle w, x_i \rangle - y_i)^2.$$
Equivalent formulation: find the $w$ minimizing the Residual Sum of Squares (RSS), i.e.,
$$\arg\min_w \sum_{i=1}^{m} (\langle w, x_i \rangle - y_i)^2.$$

RSS: Matrix Form

Let
$$X = \begin{pmatrix} x_1 \\ x_2 \\ \vdots \\ x_m \end{pmatrix} \quad \text{(the design matrix)}, \qquad y = \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_m \end{pmatrix}.$$
Then the RSS is
$$\sum_{i=1}^{m} (\langle w, x_i \rangle - y_i)^2 = (y - Xw)^T (y - Xw).$$
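
A quick numerical check (synthetic data, not from the slides) that the matrix form matches the sum over the examples:

```python
import numpy as np

rng = np.random.default_rng(0)
m, d = 5, 3
X = rng.normal(size=(m, d))                     # design matrix, one row per example
y = rng.normal(size=m)
w = rng.normal(size=d)

rss_sum = sum((X[i] @ w - y[i]) ** 2 for i in range(m))
rss_matrix = (y - X @ w) @ (y - X @ w)          # (y - Xw)^T (y - Xw)
assert np.isclose(rss_sum, rss_matrix)
```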

We want to find the $w$ that minimizes the RSS (= objective function):
$$\arg\min_w RSS(w) = \arg\min_w (y - Xw)^T (y - Xw).$$
How? Compute the gradient of the objective function w.r.t. $w$ and compare it to 0:
$$\frac{\partial RSS(w)}{\partial w} = -2X^T(y - Xw).$$
Then we need to find $w$ such that $-2X^T(y - Xw) = 0$.

$-2X^T(y - Xw) = 0$ is equivalent to
$$X^T X w = X^T y.$$
If $X^T X$ is invertible, the solution to the ERM problem is
$$w = (X^T X)^{-1} X^T y.$$
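
A small sketch on synthetic data (the data, the seed, and the use of np.linalg.solve instead of an explicit inverse are my choices, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(1)
m, d = 100, 4
X = np.hstack([np.ones((m, 1)), rng.normal(size=(m, d - 1))])  # leading 1s for the bias
w_true = rng.normal(size=d)
y = X @ w_true + 0.1 * rng.normal(size=m)       # noisy linear targets

# Solve the normal equations X^T X w = X^T y directly; numerically this is
# preferable to forming (X^T X)^{-1} explicitly.
w_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(np.round(w_hat - w_true, 3))              # estimation error is small
```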

Complexity Considerations

We need to compute $(X^T X)^{-1} X^T y$. Algorithm:
1. compute $X^T X$: product of a $(d+1) \times m$ matrix and an $m \times (d+1)$ matrix
2. compute $(X^T X)^{-1}$: inversion of a $(d+1) \times (d+1)$ matrix
3. compute $(X^T X)^{-1} X^T$: product of a $(d+1) \times (d+1)$ matrix and a $(d+1) \times m$ matrix
4. compute $(X^T X)^{-1} X^T y$: product of a $(d+1) \times m$ matrix and an $m \times 1$ vector

Most expensive operation? The inversion, done on a $(d+1) \times (d+1)$ matrix.

$X^T X$ Not Invertible?

How do we get $w$ such that $X^T X w = X^T y$ if $X^T X$ is not invertible? Let $A = X^T X$ and let $A^+$ be the generalized inverse of $A$, i.e., $A A^+ A = A$.

Proposition. If $A = X^T X$ is not invertible, then $\hat{w} = A^+ X^T y$ is a solution to $X^T X w = X^T y$. Proof: [UML] 9.2.1.

Generalized Inverse of A

Note that $A = X^T X$ is symmetric. Eigenvalue decomposition of $A$: $A = V D V^T$, with $D$ a diagonal matrix (entries = eigenvalues of $A$) and $V$ an orthonormal matrix ($V^T V = I_{d \times d}$). Define the diagonal matrix $D^+$ by
$$D^+_{i,i} = \begin{cases} 0 & \text{if } D_{i,i} = 0 \\ \dfrac{1}{D_{i,i}} & \text{otherwise.} \end{cases}$$

Let $A^+ = V D^+ V^T$. Then
$$A A^+ A = V D V^T \, V D^+ V^T \, V D V^T = V D D^+ D V^T = V D V^T = A,$$
so $A^+$ is a generalized inverse of $A$.
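
A numerical sketch of this construction on synthetic rank-deficient data (the data and the tolerance choice are mine, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(10, 3))
X = np.hstack([X, X[:, :1]])                  # duplicate a column, so X^T X is singular
y = rng.normal(size=10)
A = X.T @ X

eigvals, V = np.linalg.eigh(A)                # A is symmetric: A = V D V^T
tol = 1e-10 * np.abs(eigvals).max()           # numerical threshold for "zero" eigenvalues
d_plus = np.array([0.0 if abs(lam) <= tol else 1.0 / lam for lam in eigvals])
A_plus = V @ np.diag(d_plus) @ V.T            # A^+ = V D^+ V^T

assert np.allclose(A @ A_plus @ A, A)         # the generalized inverse property
w_hat = A_plus @ (X.T @ y)                    # w-hat = A^+ X^T y ...
assert np.allclose(A @ w_hat, X.T @ y)        # ... solves X^T X w = X^T y
```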

Logistic Regression

Learn a function $h$ from $\mathbb{R}^d$ to $[0, 1]$. What can this be used for? Classification! Example: binary classification ($Y = \{-1, +1\}$): $h(x)$ = probability that the label of $x$ is $1$. For simplicity of presentation, we consider binary classification with $Y = \{-1, +1\}$, but similar considerations apply to multiclass classification.
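
As a preview, the hypothesis typically used here composes a linear function with the sigmoid; this is stated as an assumption, since the slides have not defined it at this point:

```python
import numpy as np

def sigmoid(z):
    """Logistic (sigmoid) function, mapping R to (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def h_logistic(w, x):
    """Assumed logistic hypothesis: h_w(x) = 1 / (1 + exp(-<w, x>)),
    read as the probability that the label of x is +1."""
    return sigmoid(np.dot(w, x))

w = np.array([2.0, -1.0])                     # illustrative weights
print(h_logistic(w, np.array([2.0, 0.0])))    # ~0.98: label +1 is likely
print(h_logistic(w, np.array([-2.0, 0.0])))   # ~0.02: label -1 is likely
```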