
ECS171: Machine Learning Lecture 3: Linear Models I (LFD 3.2, 3.3) Cho-Jui Hsieh UC Davis Jan 17, 2018

Linear Regression (LFD 3.2)

Regression. Classification: customer record → yes/no. Regression: predicting a credit limit, customer record → dollar amount. Linear regression: h(x) = Σ_{i=0}^{d} w_i x_i = w^T x.

The data set. Dataset (historical decisions by credit officers): (x_1, y_1), (x_2, y_2), ..., (x_N, y_N), with y_n ∈ R the credit limit for customer x_n. Linear regression: find a function h(x) = w^T x to approximate y (= f(x)). Use the square error (h(x) − f(x))^2; the in-sample error is E_in(h) = (1/N) Σ_{n=1}^{N} (h(x_n) − y_n)^2.

Illustration. Linear regression: find a linear function with small residuals.

Matrix form of E_in:
E_in(w) = (1/N) Σ_{n=1}^{N} (x_n^T w − y_n)^2
= (1/N) || (x_1^T w − y_1, x_2^T w − y_2, ..., x_N^T w − y_N) ||^2
= (1/N) || X w − y ||^2,
where X = (x_1^T; x_2^T; ...; x_N^T) is the N × (d+1) data matrix and y = (y_1, y_2, ..., y_N) is the N × 1 label vector.
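
As a quick sanity check, the sum form and the matrix form above compute the same number; a minimal NumPy sketch on synthetic data (not from the lecture):

import numpy as np

rng = np.random.default_rng(0)
N, d = 100, 5
X = np.hstack([np.ones((N, 1)), rng.normal(size=(N, d))])  # N x (d+1), first column is the constant feature x_0 = 1
y = rng.normal(size=N)
w = rng.normal(size=d + 1)

E_in_sum = np.mean((X @ w - y) ** 2)              # (1/N) Σ_n (x_n^T w − y_n)^2
E_in_matrix = np.linalg.norm(X @ w - y) ** 2 / N  # (1/N) ||Xw − y||^2
assert np.isclose(E_in_sum, E_in_matrix)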

Minimize E_in: min_w E_in(w) = (1/N) ||X w − y||^2. E_in is continuous, differentiable, and convex. Necessary condition for an optimal w*: the gradient vanishes, i.e. ∇E_in(w*) = ( ∂E_in/∂w_0 (w*), ..., ∂E_in/∂w_d (w*) )^T = (0, ..., 0)^T.

Minimizing E_in: E_in(w) = (1/N) ||X w − y||^2 = (1/N) (w^T X^T X w − 2 w^T X^T y + y^T y), so ∇E_in(w) = (2/N) (X^T X w − X^T y). Setting ∇E_in(w*) = 0 gives X^T X w* = X^T y, hence w* = (X^T X)^{-1} X^T y = X† y, where X† = (X^T X)^{-1} X^T is the pseudo-inverse of X.
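
A quick NumPy check (synthetic data, not from the lecture) that solving the normal equations X^T X w = X^T y gives the same w as the pseudo-inverse when X^T X is invertible:

import numpy as np

rng = np.random.default_rng(0)
X = np.hstack([np.ones((100, 1)), rng.normal(size=(100, 4))])
y = rng.normal(size=100)

w_normal_eq = np.linalg.solve(X.T @ X, X.T @ y)  # solve X^T X w = X^T y
w_pinv = np.linalg.pinv(X) @ y                   # w = X† y
assert np.allclose(w_normal_eq, w_pinv)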

More on linear regression solutions. Case I: X^T X is invertible. Unique solution; typically when N > d + 1. Case II: X^T X is non-invertible. Many solutions; take the minimal-norm solution w = X† y (with X† defined in a different way); typically when d + 1 > N.

Linear regression algorithm: 1. Given the data matrix X and the label vector y. 2. Compute w = X† y. This w minimizes E_in(w). We will show that E_out is also small, using a VC-dimension bound.
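
A minimal NumPy sketch of this algorithm (synthetic data; np.linalg.pinv returns the Moore-Penrose pseudo-inverse, so it also yields the minimal-norm solution in the non-invertible case):

import numpy as np

def linear_regression(X, y):
    # w = X† y minimizes E_in(w) = (1/N) ||Xw − y||^2
    return np.linalg.pinv(X) @ y

# Usage on synthetic data (X already contains the constant feature x_0 = 1):
rng = np.random.default_rng(0)
X = np.hstack([np.ones((50, 1)), rng.normal(size=(50, 3))])
y = X @ np.array([1.0, 2.0, -1.0, 0.5]) + 0.1 * rng.normal(size=50)
w = linear_regression(X, y)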

Logistic Regression (LFD 3.3)

Soft binary classification. Example: the heart attack problem. Soft binary classification: f(x) = P(+1 | x) ∈ [0, 1].

Soft binary classification. Same data as hard binary classification, but a different target function.

Linear models. Linear scoring function: s = w^T x.

Logistic hypothesis. Logistic function: θ(s) = e^s / (1 + e^s) = 1 / (1 + e^{−s}), which converts the score into an estimated probability. Logistic regression hypothesis: h(x) = θ(w^T x).
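
A small sketch (not from the lecture) of the logistic function in NumPy, using the second of the two equivalent forms above:

import numpy as np

def theta(s):
    # θ(s) = e^s / (1 + e^s) = 1 / (1 + e^{−s})
    return 1.0 / (1.0 + np.exp(-s))

print(theta(0.0))  # 0.5: a score of 0 maps to estimated probability 1/2
print(theta(3.0))  # ≈ 0.95: a large positive score maps to a probability close to 1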

Genuine probability. Data (x, y) with binary y is generated by a noisy process: P(y | x) = f(x) for y = +1, and P(y | x) = 1 − f(x) for y = −1. The target f : R^d → [0, 1] is a probability. Goal: learn g(x) = θ(w^T x) ≈ f(x). How do we measure the error?

Error measure: likelihood. Likelihood: how likely is it to get y based on h? Given h, P(y | x) = h(x) for y = +1 and P(y | x) = 1 − h(x) for y = −1. If h ≈ f, the likelihood should be large!

Error measure: likelihood. Likelihood of D = (x_1, y_1), ..., (x_N, y_N): Π_{n=1}^{N} P(y_n | x_n). Given h, P(y | x) = h(x) for y = +1 and 1 − h(x) for y = −1. Substituting h(x) = θ(w^T x) and 1 − h(x) = 1 − θ(w^T x) = θ(−w^T x) gives P(y | x) = θ(y w^T x), so the likelihood is Π_{n=1}^{N} P(y_n | x_n) = Π_{n=1}^{N} θ(y_n w^T x_n).

Maximizing the likelihood. Find w to maximize the likelihood!
max_w Π_{n=1}^{N} θ(y_n w^T x_n)
⇔ max_w log( Π_{n=1}^{N} θ(y_n w^T x_n) )
⇔ min_w − log( Π_{n=1}^{N} θ(y_n w^T x_n) )
⇔ min_w − Σ_{n=1}^{N} log θ(y_n w^T x_n)
⇔ min_w Σ_{n=1}^{N} log( 1 / θ(y_n w^T x_n) )
⇔ min_w Σ_{n=1}^{N} log( 1 + e^{−y_n w^T x_n} )

Empirical risk minimization. Most linear ML algorithms follow min_w (1/N) Σ_{n=1}^{N} loss(w^T x_n, y_n). Linear regression: loss(w^T x_n, y_n) = (w^T x_n − y_n)^2. Logistic regression: loss(w^T x_n, y_n) = log(1 + e^{−y_n w^T x_n}).
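
A minimal NumPy sketch (not from the lecture) of this empirical risk minimization template with the two losses above:

import numpy as np

def linear_regression_loss(score, y):
    # loss(w^T x_n, y_n) = (w^T x_n − y_n)^2
    return (score - y) ** 2

def logistic_regression_loss(score, y):
    # loss(w^T x_n, y_n) = log(1 + e^{−y_n w^T x_n}); np.logaddexp(0, z) = log(1 + e^z) avoids overflow
    return np.logaddexp(0.0, -y * score)

def empirical_risk(loss, w, X, y):
    # E_in(w) = (1/N) Σ_n loss(w^T x_n, y_n)
    return np.mean(loss(X @ w, y))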

Gradient descent and SGD

Optimization. Goal: find the minimizer of a function, min_w f(w). For now we assume f is twice differentiable. A machine learning algorithm then amounts to finding the hypothesis that minimizes E_in.

Convex vs. nonconvex. Convex function: ∇f(x) = 0 implies a global minimum. A function is convex if ∇²f(x) is positive semidefinite everywhere. Examples: linear regression, logistic regression, ... Non-convex function: ∇f(x) = 0 may be a global minimum, a local minimum, or a saddle point; most algorithms only converge to a point with gradient = 0. Example: neural networks, ...

Gradient descent. Gradient descent: repeatedly do w_{t+1} ← w_t − α ∇f(w_t), where α > 0 is the step size. Step size too large: diverge; too small: slow convergence.

Why gradient descent? Reason I: the negative gradient is the steepest direction for decreasing the objective function locally. Reason II: a successive-approximation view. At each iteration, form an approximation of f(·) around w_t: f(w_t + d) ≈ g(d) := f(w_t) + ∇f(w_t)^T d + (1/(2α)) ||d||^2. Update the solution by w_{t+1} ← w_t + d*, where d* = arg min_d g(d). Setting ∇g(d*) = 0: ∇f(w_t) + (1/α) d* = 0, so d* = −α ∇f(w_t). This d* decreases f(·) if the step size α is sufficiently small.
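
A minimal sketch (not from the lecture) of this update rule as a generic gradient descent loop; grad_f, alpha, and num_iters are assumed inputs:

import numpy as np

def gradient_descent(grad_f, w0, alpha=0.1, num_iters=1000):
    w = w0.copy()
    for _ in range(num_iters):
        w = w - alpha * grad_f(w)  # w_{t+1} = w_t − α ∇f(w_t)
    return w

# Usage: minimize f(w) = ||w − 1||^2, whose gradient is 2(w − 1); the minimizer is the all-ones vector.
w_star = gradient_descent(lambda w: 2 * (w - np.ones(3)), w0=np.zeros(3))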

Illustration of gradient descent

Illustration of gradient descent. Form a quadratic approximation: f(w_t + d) ≈ g(d) = f(w_t) + ∇f(w_t)^T d + (1/(2α)) ||d||^2.

Illustration of gradient descent. Minimize g(d): ∇g(d*) = 0, i.e. ∇f(w_t) + (1/α) d* = 0, so d* = −α ∇f(w_t).

Illustration of gradient descent. Update w: w_{t+1} = w_t + d* = w_t − α ∇f(w_t).

Convergence. Let L be a constant such that ∇²f(x) ⪯ L·I for all x. Theorem: gradient descent converges if α < 1/L. In practice we do not know L, so the step size needs to be tuned when running gradient descent.

Applying to logistic regression. Gradient descent for logistic regression:
Initialize the weights w_0.
For t = 1, 2, ...:
  Compute the gradient ∇E_in = −(1/N) Σ_{n=1}^{N} y_n x_n / (1 + e^{y_n w^T x_n}).
  Update the weights: w ← w − η ∇E_in.
Return the final weights w.
When to stop? A fixed number of iterations, or stop when ||∇E_in|| < ε.
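
A minimal NumPy sketch of this loop (assuming labels y_n ∈ {+1, −1}, X with the constant feature included, and hand-tuned η and iteration budget):

import numpy as np

def logistic_regression_gd(X, y, eta=0.1, num_iters=1000, eps=1e-6):
    w = np.zeros(X.shape[1])                        # initialize the weights
    for _ in range(num_iters):
        # ∇E_in = −(1/N) Σ_n y_n x_n / (1 + e^{y_n w^T x_n})
        grad = -(y / (1.0 + np.exp(y * (X @ w)))) @ X / X.shape[0]
        w = w - eta * grad                          # w ← w − η ∇E_in
        if np.linalg.norm(grad) < eps:              # stop when ||∇E_in|| < ε
            break
    return w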

Conclusions. Linear regression: the target is a real number; square loss; E_in can be minimized with a closed-form solution. Logistic regression: a classification model based on a probabilistic assumption; trained with gradient descent. Next class: SGD. Questions?