Machine Learning 4771


Machine Learning 4771 Instructor: Tony Jebara

Topic 3: Additive Models and Linear Regression. Sinusoids and Radial Basis Functions. Classification. Logistic Regression. Gradient Descent.

Polynomial Basis Functions. To fit a P-th order polynomial function to multivariate data, concatenate columns of all monomials up to power P. E.g., for 2-dimensional data and a 2nd order polynomial (quadratic), the data matrix

$$X = \begin{bmatrix} x_1(1) & x_1(2) \\ \vdots & \vdots \\ x_i(1) & x_i(2) \\ \vdots & \vdots \\ x_N(1) & x_N(2) \end{bmatrix} \quad\text{becomes}\quad \begin{bmatrix} 1 & x_1(1) & x_1(2) & x_1(1)x_1(1) & x_1(1)x_1(2) & x_1(2)x_1(2) \\ \vdots & & & & & \vdots \\ 1 & x_i(1) & x_i(2) & x_i(1)x_i(1) & x_i(1)x_i(2) & x_i(2)x_i(2) \\ \vdots & & & & & \vdots \\ 1 & x_N(1) & x_N(2) & x_N(1)x_N(1) & x_N(1)x_N(2) & x_N(2)x_N(2) \end{bmatrix}$$
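A minimal sketch of this quadratic expansion for 2-dimensional data (the helper name and the use of NumPy are my own, not from the slides):

```python
import numpy as np

def quadratic_expand(X):
    """Map an N x 2 data matrix to the monomials [1, x(1), x(2), x(1)^2, x(1)x(2), x(2)^2]."""
    x1, x2 = X[:, 0], X[:, 1]
    return np.column_stack([np.ones(len(X)), x1, x2, x1 * x1, x1 * x2, x2 * x2])

X = np.array([[1.0, 2.0], [3.0, 4.0]])
print(quadratic_expand(X))   # each row has 6 columns: all monomials up to power 2
```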

Sinusoidal Basis Functions. More generally, we don't just have to deal with polynomials; use any set of basis functions:

$$f(x;\theta) = \theta_0 + \sum_{p=1}^{P} \theta_p \, \phi_p(x)$$

These are generally called Additive Models: regression adds linear combinations of the basis functions. For example, the Fourier (sinusoidal) basis $\phi_k(x_i) = \sin(k x_i)$, $\phi_{k+1}(x_i) = \cos(k x_i)$. Note, the functions don't have to be a basis per se; usually we use a subset. [Figure: a fit built as a weighted sum $\theta_1\phi_1 + \theta_2\phi_2 + \theta_3\phi_3$ of sinusoids.]

Radial Basis Functions. These can act as prototypes of the data itself:

$$f(x;\theta) = \sum_{k=1}^{N} \theta_k \exp\!\left(-\frac{1}{2\sigma^2}\left\|x - x_k\right\|^2\right)$$

The parameter $\sigma$ is the standard deviation ($\sigma^2$ plays the role of a covariance); it controls how wide the bumps are. What happens if it is too big or too small? Also works in multiple dimensions. Called RBF for short.

Radial Basis Functions. Each training point leads to a bump function:

$$f(x;\theta) = \sum_{k=1}^{N} \theta_k \exp\!\left(-\frac{1}{2\sigma^2}\left\|x - x_k\right\|^2\right)$$

Reuse the solution from linear regression, $\theta^* = (X^T X)^{-1} X^T y$, but view the data instead as $X$, a big matrix of size N x N with entries

$$X_{ij} = \exp\!\left(-\frac{1}{2\sigma^2}\left\|x_i - x_j\right\|^2\right)$$

For RBFs, $X$ is square and symmetric, so the solution is just

$$\nabla_\theta R = 0 \;\Rightarrow\; X^T X \theta = X^T y \;\Rightarrow\; X\theta = y \;\Rightarrow\; \theta^* = X^{-1} y$$
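A sketch of the RBF construction and the $\theta^* = X^{-1}y$ fit, assuming the square RBF matrix is invertible (the function names are illustrative, not from the slides):

```python
import numpy as np

def rbf_matrix(X, centers, sigma):
    """Matrix with entries exp(-||x_i - x_k||^2 / (2 sigma^2))."""
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def fit_rbf(X, y, sigma):
    """One bump per training point; the N x N matrix is square and symmetric, so theta = X^{-1} y."""
    Phi = rbf_matrix(X, X, sigma)
    return np.linalg.solve(Phi, y)      # assumes Phi is invertible

def predict_rbf(Xnew, Xtrain, theta, sigma):
    return rbf_matrix(Xnew, Xtrain, sigma) @ theta
```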

Evaluating Our Learned Function. We minimized empirical risk to get θ*. How well does f(x;θ*) perform on future data? It should generalize and have low True Risk:

$$R_{true}(\theta) = \int P(x,y)\, L\big(y, f(x;\theta)\big)\, dx\, dy$$

We can't compute the true risk; instead use the Testing Empirical Risk. We randomly split the data into training and testing portions $\{(x_1,y_1),\ldots,(x_N,y_N)\}$ and $\{(x_{N+1},y_{N+1}),\ldots,(x_{N+M},y_{N+M})\}$. Find θ* with the training data and evaluate it with the testing data:

$$R_{train}(\theta) = \frac{1}{N}\sum_{i=1}^{N} L\big(y_i, f(x_i;\theta)\big) \qquad R_{test}(\theta) = \frac{1}{M}\sum_{i=N+1}^{N+M} L\big(y_i, f(x_i;\theta)\big)$$
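A sketch of the random train/test split and the two empirical risks with squared loss (the fit/predict callables, split fraction, and seed are assumptions for illustration):

```python
import numpy as np

def squared_loss(y, f):
    return (y - f) ** 2

def train_test_risks(X, y, fit, predict, train_frac=0.7, seed=0):
    """Randomly split the data, fit on the training part, and report R_train and R_test."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    n_train = int(train_frac * len(X))
    tr, te = idx[:n_train], idx[n_train:]
    theta = fit(X[tr], y[tr])
    r_train = squared_loss(y[tr], predict(X[tr], theta)).mean()
    r_test = squared_loss(y[te], predict(X[te], theta)).mean()
    return r_train, r_test
```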

Crossvalidation. Try fitting with different sigma radial basis function widths and select the sigma which gives the lowest $R_{test}(\theta^*)$. [Figure: loss vs. $1/\sigma$, showing $R_{test}(\theta^*)$ and $R_{train}(\theta^*)$, with underfitting on the left, overfitting on the right, and the best sigma in between.] Think of sigma as a measure of the simplicity of the model: thinner RBFs are more flexible and complex.
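A sketch of selecting σ by test risk, reusing the hypothetical fit_rbf / predict_rbf helpers sketched above (the grid of σ values is an arbitrary choice):

```python
import numpy as np

def select_sigma(X_train, y_train, X_test, y_test, sigmas=(0.1, 0.3, 1.0, 3.0, 10.0)):
    """Fit an RBF model for each candidate sigma and keep the one with the lowest test risk."""
    best_sigma, best_risk = None, np.inf
    for sigma in sigmas:
        theta = fit_rbf(X_train, y_train, sigma)          # from the earlier RBF sketch
        f_test = predict_rbf(X_test, X_train, theta, sigma)
        risk = ((y_test - f_test) ** 2).mean()
        if risk < best_risk:
            best_sigma, best_risk = sigma, risk
    return best_sigma, best_risk
```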

Regularized Risk Minimization. Empirical Risk Minimization gave overfitting and underfitting. We want to add a penalty for using too many theta values. This gives us the Regularized Risk:

$$R_{regularized}(\theta) = R_{empirical}(\theta) + \text{Penalty}(\theta) = \frac{1}{N}\sum_{i=1}^{N} L\big(y_i, f(x_i;\theta)\big) + \frac{\lambda}{N}\left\|\theta\right\|^2$$

Solution for the Regularized Risk with Least Squares Loss:

$$\nabla_\theta R_{regularized} = 0 \;\Rightarrow\; -\frac{2}{N} X^T\big(y - X\theta\big) + \frac{2\lambda}{N}\theta = 0 \;\Rightarrow\; \theta^* = \left(X^T X + \lambda I\right)^{-1} X^T y$$
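A sketch of the closed-form regularized least squares solution $\theta^* = (X^T X + \lambda I)^{-1} X^T y$ (the function name is illustrative):

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form minimizer of the regularized risk with squared loss."""
    D = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(D), X.T @ y)
```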

Regularized Risk Minimization. Have D=16 features (or P=15 throughout). Try minimizing $R_{regularized}(\theta)$ to get θ* with different λ. Note that λ=0 gives back Empirical Risk Minimization.

Crossvalidation. Try fitting with different lambda regularization levels and select the lambda which gives the lowest $R_{test}(\theta^*)$. [Figure: risk vs. $1/\lambda$, showing $R_{test}(\theta^*)$ and $R_{train}(\theta^*)$, with underfitting on the left, overfitting on the right, and the best lambda in between.] Lambda measures the simplicity of the model: models with low lambda are more flexible.
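The same selection procedure over λ, as a sketch reusing the hypothetical ridge_fit above (the λ grid is an arbitrary choice):

```python
import numpy as np

def select_lambda(X_train, y_train, X_test, y_test, lams=(0.0, 0.01, 0.1, 1.0, 10.0)):
    """Pick the lambda whose model has the lowest test risk; lam = 0 is plain ERM."""
    best_lam, best_risk = None, np.inf
    for lam in lams:
        theta = ridge_fit(X_train, y_train, lam)     # from the ridge sketch above
        risk = ((y_test - X_test @ theta) ** 2).mean()
        if risk < best_risk:
            best_lam, best_risk = lam, risk
    return best_lam, best_risk
```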

From Regression To Classification. Classification is another important learning problem.

Regression: $x \in \mathbb{R}^D$, $y \in \mathbb{R}^1$, $X = \{(x_1,y_1),(x_2,y_2),\ldots,(x_N,y_N)\}$
Classification: $x \in \mathbb{R}^D$, $y \in \{0,1\}$, $X = \{(x_1,y_1),(x_2,y_2),\ldots,(x_N,y_N)\}$

E.g. given x = [tumor size, tumor density], predict y in {benign, malignant}. Should we solve this as a least squares regression problem? [Figure: scatter plot of the two classes, shown as x's and O's.]

Classification vs. Regression. a) Classification needs binary answers like {0,1}. b) Least squares is an unfair measure of risk here: e.g., why penalize a correct but large positive y answer? Why penalize a correct but large negative y answer? Example: it is not good to use the regression output for a decision (f(x) > 0.5 gives Class 1, f(x) < 0.5 gives Class 0). If f(x) = -3.8 and the correct class is 0, squared error still penalizes it. [Figure: f(x) plotted against the data with levels at 0.0, 0.5, and 1.0.] We pay a hefty squared error loss here even if we got the correct classification result; the thick solid line model makes two mistakes while the dashed model is perfect.

Classification vs. Regression. We will consider the following four steps to improve from naïve regression to get better classification learning: 1) fix the functions f(x) to give binary output (logistic neuron); 2) fix our definition of the Risk we will minimize so that we get good classification accuracy (logistic loss); and later on, 3) make an even better fix on f(x) to binarize (perceptron); 4) make an even better risk (perceptron loss).

Logistic Neuron (McCulloch-Pitts). To output binary values, use a squashing function g(·).

Linear neuron: $f(x;\theta) = \theta^T x$
Logistic neuron: $f(x;\theta) = g(\theta^T x)$ with $g(z) = \big(1+\exp(-z)\big)^{-1}$

This squashing g is called the sigmoid or logistic function.
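A minimal sketch of the logistic squashing and the resulting logistic neuron (the names are illustrative):

```python
import numpy as np

def sigmoid(z):
    """Logistic squashing g(z) = 1 / (1 + exp(-z)); maps any real z into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def logistic_neuron(x, theta):
    """Logistic neuron: squash the linear neuron's output theta^T x."""
    return sigmoid(theta @ x)
```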

Logistic Regression. Given a classification problem with binary outputs, $X = \{(x_1,y_1),(x_2,y_2),\ldots,(x_N,y_N)\}$, $x \in \mathbb{R}^D$, $y \in \{0,1\}$. Use the function $f(x;\theta) = \big(1+\exp(-\theta^T x)\big)^{-1}$ and output 1 if f(x) > 0.5 and 0 otherwise.

Short hand for Linear Functions. What happened to adding the intercept?

$$f(x;\theta) = \theta^T x + \theta_0 = \begin{bmatrix}\theta(1)\\ \theta(2)\\ \vdots\\ \theta(D)\end{bmatrix}^T \begin{bmatrix}x(1)\\ x(2)\\ \vdots\\ x(D)\end{bmatrix} + \theta_0 = \begin{bmatrix}\theta_0\\ \theta(1)\\ \theta(2)\\ \vdots\\ \theta(D)\end{bmatrix}^T \begin{bmatrix}1\\ x(1)\\ x(2)\\ \vdots\\ x(D)\end{bmatrix} = \theta^T x$$

i.e., the intercept is absorbed by prepending a 1 to x and θ_0 to θ.
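A one-line sketch of this shorthand: prepend a constant-1 column to the data so θ_0 becomes the first entry of θ (the helper name is mine):

```python
import numpy as np

def add_intercept(X):
    """Prepend a column of ones so the intercept theta_0 rides along as theta[0]."""
    return np.column_stack([np.ones(len(X)), X])

X = np.array([[2.0, 3.0], [4.0, 5.0]])
print(add_intercept(X))   # [[1. 2. 3.], [1. 4. 5.]]
```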

Logistic Regression. Given a classification problem with binary outputs, $X = \{(x_1,y_1),(x_2,y_2),\ldots,(x_N,y_N)\}$, $x \in \mathbb{R}^D$, $y \in \{0,1\}$. Fix #1: use $f(x;\theta) = \big(1+\exp(-\theta^T x)\big)^{-1}$ and output 1 if f(x) > 0.5 and 0 otherwise.

Logistic Regression. Given a classification problem with binary outputs, $X = \{(x_1,y_1),(x_2,y_2),\ldots,(x_N,y_N)\}$, $x \in \mathbb{R}^D$, $y \in \{0,1\}$. Fix #1: use $f(x;\theta) = \big(1+\exp(-\theta^T x)\big)^{-1}$ and output 1 if f(x) > 0.5 and 0 otherwise. Fix #2: instead of squared loss, use the Logistic Loss:

$$L_{log}\big(y, f(x;\theta)\big) = (y-1)\log\big(1 - f(x;\theta)\big) - y \log\big(f(x;\theta)\big)$$

This method is called Logistic Regression. But Empirical Risk Minimization has no closed-form solution:

$$R_{emp}(\theta) = \frac{1}{N}\sum_{i=1}^{N}\Big[(y_i-1)\log\big(1 - f(x_i;\theta)\big) - y_i\log\big(f(x_i;\theta)\big)\Big]$$
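A sketch of the logistic loss and the resulting empirical risk (the clipping epsilon is my own numerical-safety addition, not part of the slide's formula):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_loss(y, f, eps=1e-12):
    """L_log = (y - 1) log(1 - f) - y log(f), averaged over the data."""
    f = np.clip(f, eps, 1.0 - eps)   # keep log() finite; an implementation detail
    return np.mean((y - 1.0) * np.log(1.0 - f) - y * np.log(f))

def empirical_risk(theta, X, y):
    """R_emp(theta) for logistic regression with f(x;theta) = g(theta^T x)."""
    return logistic_loss(y, sigmoid(X @ theta))
```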

Logistic Regression. With the logistic squashing function, minimizing R(θ) is harder:

$$R_{emp}(\theta) = \frac{1}{N}\sum_{i=1}^{N}\Big[(y_i-1)\log\big(1 - f(x_i;\theta)\big) - y_i\log\big(f(x_i;\theta)\big)\Big]$$

$$\nabla_\theta R = \frac{1}{N}\sum_{i=1}^{N}\left[\frac{1-y_i}{1-f(x_i;\theta)} - \frac{y_i}{f(x_i;\theta)}\right]\nabla_\theta f(x_i;\theta) = 0 \;\;???$$

We can't minimize the risk and find the best theta analytically. Let's try finding the best theta numerically. Use the following to compute the gradient: $f(x;\theta) = \big(1+\exp(-\theta^T x)\big)^{-1} = g(\theta^T x)$, where g(·) is the logistic squashing function with $g(z) = \big(1+\exp(-z)\big)^{-1}$ and $g'(z) = g(z)\big(1-g(z)\big)$.

Gradient Descent. Useful when we can't get the minimum solution in closed form. The gradient points in the direction of fastest increase, so take a step in the opposite direction. Gradient Descent Algorithm: choose a scalar step size η and tolerance ε; initialize $\theta_0$ to a small random vector; set $\theta_1 = \theta_0 - \eta\,\nabla_\theta R_{emp}(\theta_0)$, t = 1; while $\|\theta_t - \theta_{t-1}\| \geq \epsilon$ { $\theta_{t+1} = \theta_t - \eta\,\nabla_\theta R_{emp}(\theta_t)$, t = t + 1 }. For appropriate η, this will converge to a local minimum.
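A sketch of this loop applied to the logistic regression risk (the step size, tolerance, and iteration cap are illustrative choices, not values from the slides); using $g'(z) = g(z)(1-g(z))$, the gradient of $R_{emp}$ simplifies to $\frac{1}{N}X^T(f - y)$:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def grad_risk(theta, X, y):
    """Gradient of the logistic empirical risk; with g'(z) = g(z)(1 - g(z)) it
    simplifies to (1/N) X^T (f - y)."""
    f = sigmoid(X @ theta)
    return X.T @ (f - y) / len(y)

def gradient_descent(X, y, eta=0.1, eps=1e-6, seed=0, max_iter=100_000):
    """Fixed-step gradient descent with a tolerance on the size of the update."""
    rng = np.random.default_rng(seed)
    theta = 0.01 * rng.standard_normal(X.shape[1])      # small random start
    for _ in range(max_iter):
        theta_new = theta - eta * grad_risk(theta, X, y)
        if np.linalg.norm(theta_new - theta) < eps:     # ||theta_t - theta_{t-1}|| < eps
            return theta_new
        theta = theta_new
    return theta
```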

Logistic Regression. Logistic regression gives better classification performance. Its empirical risk is

$$R_{emp}(\theta) = \frac{1}{N}\sum_{i=1}^{N}\Big[(y_i-1)\log\big(1 - f(x_i;\theta)\big) - y_i\log\big(f(x_i;\theta)\big)\Big]$$

This R(θ) is convex, so gradient descent always converges to the same solution. Make predictions using $f(x;\theta) = \big(1+\exp(-\theta^T x)\big)^{-1}$: output 1 if f > 0.5 and 0 otherwise.