CSCI567 Machine Learning (Fall 2014)


CSCI567 Machine Learning (Fall 2014). Drs. Sha & Liu, {feisha,yanliu.cs}@usc.edu. October 2, 2014.

Outline
1 Review of last lecture: linear regression
2 Nonlinear basis functions
3 Basic ideas to overcome overfitting

Review of last lecture (linear regression): estimating model parameters. Design matrix and target vector:
X = [x_1^T; x_2^T; …; x_N^T] ∈ R^{N×D},  X̃ = (1, X) ∈ R^{N×(D+1)},  y = (y_1, y_2, …, y_N)^T.
Residual sum of squares in matrix form:
RSS(w̃) = w̃^T X̃^T X̃ w̃ − 2 (X̃^T y)^T w̃ + const.

Optimal solution (review of last lecture, linear regression). Normal equation: take the derivative with respect to w̃ and set it to zero,
∂RSS(w̃)/∂w̃ ∝ X̃^T X̃ w̃ − X̃^T y = 0.
This leads to the least-mean-squares (LMS) solution
w̃_LMS = (X̃^T X̃)^{−1} X̃^T y.
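A minimal NumPy sketch of this closed-form solution on synthetic data (the data, variable names, and noise level below are illustrative, not from the slides):

import numpy as np

rng = np.random.default_rng(0)
N, D = 100, 3
X = rng.normal(size=(N, D))
X_aug = np.hstack([np.ones((N, 1)), X])            # the augmented design matrix: prepend a column of ones
true_w = np.array([0.5, -2.0, 1.0, 3.0])           # bias w_0 followed by D weights
y = X_aug @ true_w + 0.1 * rng.normal(size=N)      # targets with a little Gaussian noise

# Normal-equation / LMS solution; np.linalg.solve avoids forming an explicit inverse.
w_lms = np.linalg.solve(X_aug.T @ X_aug, X_aug.T @ y)
print(w_lms)                                       # close to true_w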

Practical concerns (review of last lecture, linear regression). The bottleneck in computing the LMS solution w̃ = (X̃^T X̃)^{−1} X̃^T y is inverting the matrix X̃^T X̃ ∈ R^{(D+1)×(D+1)}. Scalable methods: batch gradient descent and stochastic gradient descent.

Stochastic gradient descent (review of last lecture, linear regression). Widrow-Hoff rule: update the parameters using one example at a time.
Initialize w̃ to w̃^(0) (anything reasonable is fine); set t = 0; choose η > 0. Loop until convergence:
1 Randomly choose a training sample x_t.
2 Compute its contribution to the gradient (ignoring the constant factor): g_t = (x̃_t^T w̃^(t) − y_t) x̃_t.
3 Update the parameters: w̃^(t+1) = w̃^(t) − η g_t.
4 t ← t + 1.
Scalable to large problems and often very effective.
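A minimal NumPy sketch of this update rule; the fixed number of passes, the step size, and the random permutation per epoch are illustrative choices rather than part of the slide:

import numpy as np

def sgd_linear_regression(X_aug, y, eta=0.01, n_epochs=50, seed=0):
    # X_aug: N x (D+1) design matrix whose first column is all ones.
    rng = np.random.default_rng(seed)
    N, d = X_aug.shape
    w = np.zeros(d)                      # w^(0): anything reasonable is fine
    for _ in range(n_epochs):            # fixed number of passes instead of a convergence test
        for i in rng.permutation(N):     # visit training samples in random order
            x_i = X_aug[i]
            g = (x_i @ w - y[i]) * x_i   # per-example gradient (constant factor ignored)
            w -= eta * g                 # Widrow-Hoff update
    return w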

Regularized least squares / ridge regression (review of last lecture, linear regression). When X̃^T X̃ is not invertible, use
w̃ = (X̃^T X̃ + λI)^{−1} X̃^T y.
This is equivalent to adding an extra regularization term to RSS(w̃):
w̃^T X̃^T X̃ w̃ − 2 (X̃^T y)^T w̃ + λ ‖w̃‖₂².
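A corresponding closed-form sketch (the value of λ, and whether to regularize the bias term, are choices left open here):

import numpy as np

def ridge_fit(X_aug, y, lam=1.0):
    # Closed-form ridge solution: (X^T X + lam * I)^{-1} X^T y.
    d = X_aug.shape[1]
    A = X_aug.T @ X_aug + lam * np.eye(d)   # adding lam * I makes the system invertible for lam > 0
    return np.linalg.solve(A, X_aug.T @ y)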

Things you should know about linear regression (review of last lecture). Linear regression is a linear combination of features: f: x → y with f(x) = w_0 + Σ_d w_d x_d = w_0 + w^T x. If we minimize the residual sum of squares as the learning objective, we get a closed-form solution for the parameters; you can either invert a matrix to obtain them, or use gradient-based methods (batch or stochastic gradient descent). Probabilistic interpretation: maximum likelihood, if the residual is assumed to be Gaussian distributed. Ridge regression: regularizing the residual sum of squares.

Outline
1 Review of last lecture
2 Nonlinear basis functions
3 Basic ideas to overcome overfitting

Nonlinear basis functions. What if the data is not linearly separable, or does not fit a line? [Figure: example of nonlinear classification in two dimensions.] [Figure: example of nonlinear regression, target t versus input x.]

Nonlinear basis for classification. Transform the input/feature: φ(x): x ∈ R² → z = x₁x₂ ∈ R. The transformed training data is linearly separable! [Figure: the original two-dimensional data and the transformed one-dimensional feature z.]

Another example. How to transform the input/feature? φ(x): x ∈ R² → z = (x₁², x₁x₂, x₂²)^T ∈ R³. [Figure: two-dimensional data arranged in rings around the origin.] The transformed training data is linearly separable. Intuition: with w = (1, 0, 1)^T, w^T z = x₁² + x₂², i.e., the squared distance to the origin! [Figure: the transformed training data.]
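A tiny check of that intuition (the specific point below is only an example):

import numpy as np

def phi(x):
    # maps (x1, x2) to (x1^2, x1*x2, x2^2)
    x1, x2 = x
    return np.array([x1**2, x1 * x2, x2**2])

w = np.array([1.0, 0.0, 1.0])
x = np.array([3.0, -4.0])
print(w @ phi(x))   # 25.0 = 3^2 + (-4)^2, the squared distance from x to the origin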

General nonlinear basis functions. We can use a nonlinear mapping φ(x): x ∈ R^D → z ∈ R^M, where M is the dimensionality of the new feature/input z (or φ(x)). Note that M can be greater than, less than, or equal to D. With the new features, we can apply our learning techniques to minimize the errors on the transformed training data: linear methods, where the prediction is based on w^T φ(x), or other methods such as nearest neighbors, decision trees, etc.

Regression with nonlinear basis. Residual sum of squares:
Σ_n [w^T φ(x_n) − y_n]²,
where w ∈ R^M has the same dimensionality as the transformed features φ(x). The LMS solution can be formulated with the new design matrix
Φ = [φ(x_1)^T; φ(x_2)^T; …; φ(x_N)^T] ∈ R^{N×M},  w_LMS = (Φ^T Φ)^{−1} Φ^T y.
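A short sketch of this recipe with a polynomial basis, fitting noisy samples from a sine curve (the data generation and the choice M = 3 are illustrative):

import numpy as np

def poly_design_matrix(x, M):
    # Rows are phi(x_n)^T = (1, x_n, x_n^2, ..., x_n^M) for scalar inputs x_n.
    return np.vander(x, N=M + 1, increasing=True)

def fit_lms(Phi, y):
    # w_LMS = (Phi^T Phi)^{-1} Phi^T y, computed with a linear solver instead of an explicit inverse.
    return np.linalg.solve(Phi.T @ Phi, Phi.T @ y)

rng = np.random.default_rng(1)
x = np.sort(rng.uniform(0, 1, size=10))
y = np.sin(2 * np.pi * x) + 0.1 * rng.normal(size=10)
w = fit_lms(poly_design_matrix(x, M=3), y)
print(w)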

Example with regression: polynomial basis functions. φ(x) = (1, x, x², …, x^M)^T, so f(x) = w_0 + Σ_{m=1}^{M} w_m x^m. Fitting samples from a sine function with M = 0 or M = 1 underfits: f(x) is too simple. [Figure: fits for M = 0 and M = 1, target t versus input x.]

Adding more high-order basis functions: M = 3, then M = 9 (overfitting). [Figure: fits for M = 3 and M = 9, target t versus input x.] Being too adaptive leads to better results on the training data, but not so great results on data that has not been seen!

Overfitting. The parameters of the higher-order polynomials are very large:
         M = 0    M = 1    M = 3      M = 9
w_0      0.19     0.82     0.31       0.35
w_1               -1.27    7.99       232.37
w_2                        -25.43     -5321.83
w_3                        17.37      48568.31
w_4                                   -231639.30
w_5                                   640042.26
w_6                                   -1061800.52
w_7                                   1042400.18
w_8                                   -557682.99
w_9                                   125201.43

Overfitting can be quite disastrous. Fitting the housing-price data with M = 3: note that the predicted price would go to zero (or even negative) for large enough houses! This is called poor generalization/overfitting.

Detecting overfitting. Plot model complexity versus the objective function: as the model becomes more complex, performance on the training data keeps improving, while performance on the test data improves at first and then deteriorates. [Figure: E_RMS versus M for the training and test sets.]
Horizontal axis: a measure of model complexity; in this example, the maximum order of the polynomial basis functions.
Vertical axis:
1 For regression, RSS or RMS error (the square root of RSS).
2 For classification, the classification error rate or the cross-entropy error function.
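A sketch of how such a curve can be produced for the polynomial example (the data sizes, noise level, and the use of np.linalg.lstsq are illustrative choices):

import numpy as np

def rms_error(w, Phi, y):
    # Root-mean-square error of the predictions Phi @ w against the targets y.
    return np.sqrt(np.mean((Phi @ w - y) ** 2))

rng = np.random.default_rng(2)
def make_data(n):
    x = np.sort(rng.uniform(0, 1, size=n))
    return x, np.sin(2 * np.pi * x) + 0.2 * rng.normal(size=n)

x_train, y_train = make_data(10)
x_test, y_test = make_data(100)

for M in range(10):
    Phi_train = np.vander(x_train, N=M + 1, increasing=True)
    Phi_test = np.vander(x_test, N=M + 1, increasing=True)
    w, *_ = np.linalg.lstsq(Phi_train, y_train, rcond=None)   # least-squares fit of order M
    print(M, rms_error(w, Phi_train, y_train), rms_error(w, Phi_test, y_test))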

Outline
1 Review of last lecture
2 Nonlinear basis functions
3 Basic ideas to overcome overfitting: use more training data; regularization methods

Use more training data to prevent overfitting: the more, the merrier. [Figure: the M = 9 polynomial fit to the original data, then with N = 15 and N = 100 training points; target t versus input x.] What if we do not have a lot of data?

Regularization methods. Intuition: for a linear model for regression, w^T x, what do we mean by the model being simple? Assumption:
p(w_d) = (1/(√(2π) σ)) e^{−w_d² / (2σ²)}.
Namely, a priori we believe each w_d is around zero, resulting in a simple model for prediction. Note that this line of thinking regards w as a random variable, and we will use the observed data D to update our prior belief about w.

Example: fitting data with polynomials. Our regression model is y = Σ_{m=0}^{M} w_m x^m. Thus, smaller w_m effectively suppress the high-order terms, giving a lower effective order of polynomial and potentially preventing overfitting.

Setup for regularized linear regression. Linear regression: y = w^T x + η, where η is Gaussian noise distributed according to N(0, σ²). Prior distribution on the parameters:
p(w) = Π_{d=1}^{M} (1/(√(2π) σ)) e^{−w_d² / (2σ²)}.
Note that all the dimensions share the same variance σ².
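A sketch of where this setup leads (not spelled out on this slide): combining the Gaussian likelihood of the data with this Gaussian prior and maximizing the log-posterior recovers ridge regression. Writing the prior variance as σ_0² to distinguish it from the noise variance σ² (a notational assumption here, since the slide uses a single symbol):

\log p(w \mid \mathcal{D})
  = \log p(\mathcal{D} \mid w) + \log p(w) + \mathrm{const}
  = -\frac{1}{2\sigma^2} \sum_{n} \bigl(w^{\mathsf T} x_n - y_n\bigr)^2
    - \frac{1}{2\sigma_0^2} \sum_{d} w_d^2 + \mathrm{const}

\hat{w}_{\mathrm{MAP}}
  = \arg\min_{w} \sum_{n} \bigl(w^{\mathsf T} x_n - y_n\bigr)^2 + \lambda \lVert w \rVert_2^2,
  \qquad \lambda = \frac{\sigma^2}{\sigma_0^2}

which is exactly the regularized residual sum of squares from the ridge-regression slide.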