Today. Calculus. Linear Regression. Lagrange Multipliers


Today: Calculus, Lagrange Multipliers, Linear Regression. 1

Optimization with constraints. What if we want to constrain the parameters of the model, e.g., require that the mean is less than 10? We want to find the best likelihood subject to a constraint. There are two functions: an objective function to maximize, and a constraint that must be satisfied. 2

Lagrange Multipliers. Find the maxima of f(x, y) subject to a constraint: f(x, y) = x + 2y, subject to x^2 + y^2 = 1. 3

General form. Maximize f(x, y) subject to g(x, y) = c. Introduce a new variable λ and find the maxima of Λ(x, y, λ) = f(x, y) + λ(g(x, y) - c). 4

Example. Maximize f(x, y) = x + 2y subject to x^2 + y^2 = 1. Introduce a new variable and find the maxima of Λ(x, y, λ) = x + 2y + λ(x^2 + y^2 - 1). 5

Example. ∂Λ(x, y, λ)/∂x = 1 + 2λx = 0; ∂Λ(x, y, λ)/∂y = 2 + 2λy = 0; ∂Λ(x, y, λ)/∂λ = x^2 + y^2 - 1 = 0. We now have 3 equations with 3 unknowns. 6

Example. Eliminate lambda: from 1 + 2λx = 0 and 2 + 2λy = 0 we get 2λ = -1/x = -2/y, so y = 2x. Substitute and solve: x^2 + y^2 = 1, x^2 + (2x)^2 = 1, 5x^2 = 1, x = ±1/√5, y = ±2/√5. 7
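To make the worked example concrete, here is a minimal Python sketch (standard library only, values taken from the derivation above) that evaluates f at the two stationary points and confirms which one is the constrained maximum:

```python
import math

def f(x, y):
    return x + 2 * y

# Stationary points on the circle x^2 + y^2 = 1 found via Lagrange multipliers.
candidates = [(1 / math.sqrt(5), 2 / math.sqrt(5)),
              (-1 / math.sqrt(5), -2 / math.sqrt(5))]

for x, y in candidates:
    print(f"f({x:+.3f}, {y:+.3f}) = {f(x, y):+.3f}; "
          f"x^2 + y^2 = {x**2 + y**2:.3f}")
# The positive root is the constrained maximum: f = sqrt(5), about 2.236.
```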

Basics of Linear Regression. Regression is a supervised technique. In one dimension: identify y : R → R. In D dimensions: identify y : R^D → R. Given training data {x_0, x_1, ..., x_N} and targets {t_0, t_1, ..., t_N}. 8

Graphical Example of Regression [figures, slides 9-11: targets t plotted against inputs x].

Definition. In linear regression, we assume that the model that generates the data involves only a linear combination of the input variables: y(x, w) = w_0 + w_1 x_1 + ... + w_D x_D = w_0 + Σ_{j=1}^{D} w_j x_j, where w is a vector of weights that defines the parameters of the model. 12

Evaluation. How can we evaluate the performance of a regression solution? Error functions (or loss functions): Squared error: E(t_i, y(x_i, w)) = (1/2)(t_i - y(x_i, w))^2. Linear error: E(t_i, y(x_i, w)) = |t_i - y(x_i, w)|. 13
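A small sketch of the two loss functions above (plain Python; the example values are arbitrary):

```python
def squared_error(t, y):
    # E(t, y) = 1/2 * (t - y)^2
    return 0.5 * (t - y) ** 2

def linear_error(t, y):
    # E(t, y) = |t - y|
    return abs(t - y)

print(squared_error(3.0, 2.5))  # 0.125
print(linear_error(3.0, 2.5))   # 0.5
```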

Regression Error 14

Empirical Risk. Empirical risk is the measure of the loss on the data: R_emp = (1/N) Σ_{i=1}^{N} E(t_i, y(x_i, w)) = (1/N) Σ_{i=1}^{N} (1/2)(t_i - y(x_i, w))^2. By minimizing the risk on the training data, we optimize the fit with respect to the loss function: ∇_w R = 0. 15

Model Likelihood and Empirical Risk. Two related but distinct ways to look at a model: 1. Model likelihood: what is the likelihood that the model generated the observed data? 2. Empirical risk: how much error does the model have on the training data? 16

Model Likelihood. p(t | x, w, β) = N(t; y(x, w), β^{-1}), where β = 1/σ^2. Assuming independently and identically distributed (iid) data: p(t | x, w, β) = Π_i N(t_i; y(x_i, w), β^{-1}). 17

Understanding Model Likelihood. Substituting the equation of a Gaussian: p(t | x, w, β) = Π_i √(β/2π) exp(-(β/2)(y(x_i, w) - t_i)^2). Applying a log function lets the products dissolve into sums: ln p(t | x, w, β) = -(β/2) Σ_i (y(x_i, w) - t_i)^2 + (N/2) ln β - (N/2) ln 2π. 18
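A hedged numpy sketch of the log-likelihood expression above; the data, weights, and precision β are made up for illustration, and the model is the 1-D line y(x, w) = w_0 + w_1 x:

```python
import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0])
t = np.array([0.1, 1.1, 1.9, 3.2])
w0, w1, beta = 0.0, 1.0, 4.0       # beta = 1 / sigma^2 (assumed values)
y = w0 + w1 * x                    # model predictions
N = len(t)

# ln p(t | x, w, beta) = -(beta/2) sum_i (y_i - t_i)^2 + (N/2) ln beta - (N/2) ln 2*pi
log_lik = (-0.5 * beta * np.sum((y - t) ** 2)
           + 0.5 * N * np.log(beta)
           - 0.5 * N * np.log(2 * np.pi))
print(log_lik)
```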

Understanding Model Likelihood. Log likelihood: ln p(t | x, w, β) = -(β/2) Σ_i (y(x_i, w) - t_i)^2 + (N/2) ln β - (N/2) ln 2π. Optimize the weights (Maximum Likelihood Estimation): ∇_w ln p(t | x, w, β) = -∇_w (β/2) Σ_i (y(x_i, w) - t_i)^2 = 0. Compare with the empirical risk under the squared loss function: R_emp = (1/N) Σ_{i=1}^{N} (1/2)(t_i - y(x_i, w))^2; maximizing the log likelihood minimizes the same quantity. 19

Maximizing Log Likelihood (1-D). Find the optimal settings of w = [w_0 w_1]^T: R(w) = (1/2N) Σ_i (t_i - w_1 x_i - w_0)^2, and ∇_w R = [∂R/∂w_0, ∂R/∂w_1]^T = [0, 0]^T. 20

Maximizing Log Likelihood. R(w) = (1/2N) Σ_i (t_i - w_1 x_i - w_0)^2. Partial derivative: ∂R/∂w_0 = (1/N) Σ_i (t_i - w_1 x_i - w_0)(-1) = 0. Set to zero: (1/N) Σ_i (t_i - w_1 x_i - w_0) = 0. Separate the sum to isolate w_0: w_0 = (1/N) Σ_i (t_i - w_1 x_i) = (1/N) Σ_i t_i - w_1 (1/N) Σ_i x_i. 21

Maximizing Log Likelihood. R(w) = (1/2N) Σ_i (t_i - w_1 x_i - w_0)^2. Partial derivative: ∂R/∂w_1 = (1/N) Σ_i (t_i - w_1 x_i - w_0)(-x_i) = 0. Set to zero: (1/N) Σ_i (t_i x_i - w_1 x_i^2 - w_0 x_i) = 0. Separate the sum to isolate w_1: w_1 (1/N) Σ_i x_i^2 = (1/N) Σ_i t_i x_i - w_0 (1/N) Σ_i x_i. 22

Maximizing Log Likelihood. From the previous partials: w_1 (1/N) Σ_i x_i^2 = (1/N) Σ_i t_i x_i - w_0 (1/N) Σ_i x_i, and w_0 = (1/N) Σ_i t_i - w_1 (1/N) Σ_i x_i. Substitute the second into the first and isolate w_1: w_1 = [(1/N) Σ_i t_i x_i - ((1/N) Σ_i t_i)((1/N) Σ_i x_i)] / [(1/N) Σ_i x_i^2 - ((1/N) Σ_i x_i)^2]. 23
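A minimal numpy sketch of the 1-D closed-form solution just derived (the data points are illustrative):

```python
import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
t = np.array([0.2, 0.9, 2.1, 2.9, 4.2])

# w1 = (mean(t*x) - mean(t)*mean(x)) / (mean(x^2) - mean(x)^2)
w1 = ((np.mean(t * x) - np.mean(t) * np.mean(x))
      / (np.mean(x ** 2) - np.mean(x) ** 2))
# w0 = mean(t) - w1 * mean(x)
w0 = np.mean(t) - w1 * np.mean(x)
print(w0, w1)   # close to 0 and 1 for this toy data
```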

Maximizing Log Likelihood. Clean and easy: w_0 = (1/N) Σ_i t_i - w_1 (1/N) Σ_i x_i, w_1 = [(1/N) Σ_i t_i x_i - ((1/N) Σ_i t_i)((1/N) Σ_i x_i)] / [(1/N) Σ_i x_i^2 - ((1/N) Σ_i x_i)^2]. Or not. Apply some linear algebra instead. 24

Likelihood using linear algebra. Represent the linear regression function in terms of vectors: y = w_0 + w_1 x_1 + w_2 x_2 + ... + w_D x_D, with x = [1 x_1 x_2 ... x_D]^T and w = [w_0 w_1 w_2 ... w_D]^T, so that y = x^T w. 25

Likelihood using linear algebra. Stack the vectors x^T into a matrix of data points, X: R_emp(w) = (1/2N) Σ_i (t_i - w_1 x_i - w_0)^2 = (1/2N) ||t - Xw||_2^2, where t = [t_0 t_1 ...]^T and each row of X is [1 x_i]. Stacking the data into a matrix and using the norm operation handles the sum. 26

Likelihood in multiple dimensions. This representation of the risk has no inherent dimensionality: R_emp(w) = (1/2N) ||t - Xw||_2^2, and we set ∇_w R_emp(w) = ∇_w (1/2N) ||t - Xw||_2^2 = 0. 27

Maximum Likelihood Estimation redux.
∇_w R_emp(w) = 0
∇_w (1/2N) ||t - Xw||_2^2 = 0  (decompose the norm)
∇_w (1/2N) (t - Xw)^T (t - Xw) = 0
∇_w (1/2N) (t^T t - t^T Xw - w^T X^T t + w^T X^T X w) = 0  (FOIL, linear algebra style)
(1/2N) (-X^T t - X^T t + 2 X^T X w) = 0  (differentiate)
(1/2N) (-2 X^T t + 2 X^T X w) = 0  (combine terms)
X^T X w = X^T t  (isolate w)
w = (X^T X)^{-1} X^T t. 28
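A numpy sketch of the normal-equations solution w = (X^T X)^{-1} X^T t on toy data; np.linalg.solve is used rather than an explicit matrix inverse for numerical stability:

```python
import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
t = np.array([0.2, 0.9, 2.1, 2.9, 4.2])

# Design matrix: a column of ones (bias w0) next to the inputs.
X = np.column_stack([np.ones_like(x), x])

# Solve X^T X w = X^T t instead of forming (X^T X)^{-1} explicitly.
w = np.linalg.solve(X.T @ X, X.T @ t)
print(w)        # [w0, w1] -- matches the 1-D closed form above
print(X @ w)    # fitted values
```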

Extension to polynomial regression 29

Extension to polynomial regression. Compare y = c_0 + c_1 x_1 + c_2 x_2 with y = c_0 + c_1 x + c_2 x^2. Polynomial regression is the same as linear regression in D dimensions. 30

Generate new features. Standard polynomial with coefficients w: y(x, w) = Σ_{d=1}^{D} w_d x^d + w_0. Risk = (1/2) ||t - Xw||^2, where t = [t_0 t_1 ... t_{n-1}]^T, w = [w_0 w_1 ... w_p]^T, and X is the matrix whose i-th row is [1 x_i x_i^2 ... x_i^p]. 31

Generate new features. Feature trick: to fit a D-dimensional polynomial, create a D-element vector from x_i: x_i = [x_i^0 x_i^1 ... x_i^P]^T. Then apply standard linear regression in D dimensions. 32
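A sketch of the feature trick for a polynomial fit (the degree p = 3 and the toy data are arbitrary choices); the same normal-equations solve is reused on the expanded features:

```python
import numpy as np

def poly_features(x, p):
    # Map each scalar x_i to [x_i^0, x_i^1, ..., x_i^p].
    return np.vander(x, N=p + 1, increasing=True)

rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 10)
t = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(10)  # noisy toy targets

X = poly_features(x, p=3)                # 10 x 4 design matrix
w = np.linalg.solve(X.T @ X, X.T @ t)    # ordinary least squares on the new features
print(w)
```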

How is this still linear regression? The regression is linear in the parameters, despite projecting x_i from one dimension to D dimensions. Now we fit a plane (or hyperplane) to a representation of x_i in a higher-dimensional feature space. This generalizes to any set of functions φ_i : R → R: x_i = [φ_0(x_i) φ_1(x_i) ... φ_P(x_i)]^T. 33

Basis functions as feature extraction. These functions φ_i : R → R are called basis functions; they define the bases of the feature space and allow a linear decomposition to be fit to any type of function of the data points. Common choices: polynomials, Gaussians, sigmoids, wave functions (sine, etc.). 34
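As an example of a non-polynomial basis, here is a hedged sketch using Gaussian basis functions φ_j(x) = exp(-(x - μ_j)^2 / (2 s^2)); the centers μ_j and width s are assumptions for illustration, not values from the slides:

```python
import numpy as np

def gaussian_basis(x, centers, s=0.2):
    # Column j holds phi_j(x) = exp(-(x - mu_j)^2 / (2 s^2)).
    return np.exp(-(x[:, None] - centers[None, :]) ** 2 / (2 * s ** 2))

rng = np.random.default_rng(1)
x = np.linspace(0.0, 1.0, 25)
t = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(25)

centers = np.linspace(0.0, 1.0, 9)
Phi = np.column_stack([np.ones_like(x), gaussian_basis(x, centers)])
w = np.linalg.solve(Phi.T @ Phi, Phi.T @ t)
print(w.shape)   # (10,): bias weight plus nine Gaussian basis weights
```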

Training data vs. Testing Data. Evaluating the performance of a model on its training data is meaningless: with enough parameters, a model can simply memorize (encode) every training point. To evaluate performance, the data is divided into training and testing (or evaluation) sets. Training data is used to learn the model parameters; testing data is used to evaluate performance. 35
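A minimal sketch of a train/test split for estimating generalization error (the 80/20 split and toy data are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(2)
x = np.linspace(0.0, 1.0, 50)
t = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(50)

idx = rng.permutation(50)
train, test = idx[:40], idx[40:]      # 80% training, 20% testing

X = np.column_stack([np.ones_like(x), x])
w = np.linalg.solve(X[train].T @ X[train], X[train].T @ t[train])

train_err = np.mean((t[train] - X[train] @ w) ** 2)
test_err = np.mean((t[test] - X[test] @ w) ** 2)
print(train_err, test_err)   # only the test error estimates generalization
```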

Overfitting 36

Overfitting 37

Overfitting performance 38

Definition of overfitting. Overfitting is when the model describes the noise rather than the signal. How can you tell the difference between overfitting and a bad model? 39

Possible detection of overfitting. Stability: an appropriately fit model is stable under different samples of the training data, while an overfit model generates inconsistent performance. Performance: a good model has low test error; a bad model has high test error. 40

What is the optimal model size? The best model size is the one that generalizes best to unseen data; approximate this by the testing error. One way to optimize parameters is to minimize testing error, but this uses the testing data as tuning or development data and sacrifices training data in favor of parameter optimization. Can we do this without explicit evaluation data? 41

Context for linear regression. Simple approach. Efficient learning. Extensible. Regularization provides robust models. 42

Break Coffee. Stretch. 43

Linear Regression. Identify the best parameters w for the regression function y = w_0 + Σ_{i=1}^{D} w_i x_i: w = (X^T X)^{-1} X^T t. 44

Overfitting. Recall: overfitting happens when a model captures idiosyncrasies of the data rather than generalities. It is often caused by too many parameters relative to the amount of training data; e.g., an order-N polynomial can intersect any N+1 data points. 45

Dealing with Overfitting. Use more data. Use a tuning set. Regularization. Be a Bayesian. 46

Regularization. In a linear regression model, overfitting is characterized by large weights.
         M=0      M=1      M=3          M=9
w0       0.19     0.82     0.31         0.35
w1               -1.27     7.99         232.37
w2                        -25.43       -5321.83
w3                         17.37        48568.31
w4                                     -231639.30
w5                                      640042.26
w6                                     -1061800.52
w7                                      1042400.18
w8                                     -557682.99
w9                                      125201.43
47

Penalize large weights. Introduce a penalty term in the loss function. Unregularized: E(w) = (1/2) Σ_n (t_n - y(x_n, w))^2. Regularized regression (L2-regularization or ridge regression): E(w) = (1/2) Σ_n (t_n - y(x_n, w))^2 + (λ/2) ||w||^2. 48

Regularization Derivation.
∇_w E(w) = 0
∇_w [(1/2) Σ_i (y(x_i, w) - t_i)^2 + (λ/2) ||w||^2] = 0
∇_w [(1/2) ||t - Xw||^2 + (λ/2) ||w||^2] = 0
∇_w [(1/2) (t - Xw)^T (t - Xw) + (λ/2) w^T w] = 0. 49

∇_w [(1/2) (t - Xw)^T (t - Xw) + (λ/2) w^T w] = 0
-X^T t + X^T X w + λ w = 0
-X^T t + X^T X w + λ I w = 0
-X^T t + (X^T X + λ I) w = 0
(X^T X + λ I) w = X^T t
w = (X^T X + λ I)^{-1} X^T t. 50
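A numpy sketch of the ridge solution w = (X^T X + λI)^{-1} X^T t derived above; the λ value and the degree-9 polynomial features echo the table on slide 47 but are otherwise arbitrary (note this sketch penalizes the bias weight as well, which many treatments avoid):

```python
import numpy as np

def ridge_fit(X, t, lam):
    # Closed-form ridge regression: solve (X^T X + lam * I) w = X^T t.
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ t)

rng = np.random.default_rng(3)
x = np.linspace(0.0, 1.0, 10)
t = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(10)
X = np.vander(x, N=10, increasing=True)          # degree-9 polynomial features

w_mle = np.linalg.lstsq(X, t, rcond=None)[0]     # unregularized fit
w_ridge = ridge_fit(X, t, lam=1e-3)
print(np.abs(w_mle).max(), np.abs(w_ridge).max())  # ridge weights are far smaller
```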

Regularization in Practice 51

Regularization Results 52

More regularization. The penalty term defines the style of regularization:
L2-regularization: E(w) = (1/2) Σ_n (t_n - y(x_n, w))^2 + (λ/2) ||w||^2
L1-regularization: E(w) = (1/2) Σ_n (t_n - y(x_n, w))^2 + λ ||w||_1
L0-regularization: E(w) = (1/2) Σ_n (t_n - y(x_n, w))^2 + λ Σ_n δ(w_n ≠ 0)
The L0-norm yields the optimal subset of features. 53
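A small sketch computing the three penalty terms for an example weight vector (the weights and λ are arbitrary):

```python
import numpy as np

w = np.array([0.0, 3.2, -1.5, 0.0, 0.4])
lam = 0.1

l2_penalty = 0.5 * lam * np.sum(w ** 2)    # ridge: (lambda/2) ||w||^2
l1_penalty = lam * np.sum(np.abs(w))       # lasso: lambda ||w||_1
l0_penalty = lam * np.count_nonzero(w)     # subset selection: lambda * #nonzero

print(l2_penalty, l1_penalty, l0_penalty)
```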

Curse of dimensionality. Increasing the dimensionality of the features increases the data requirements exponentially. For example, if a single feature can be accurately approximated with 100 data points, optimizing the joint over two features requires 100 × 100 data points. Models should be small relative to the amount of available data. Dimensionality reduction techniques and feature selection can help: L0-regularization is explicit feature selection, while L1- and L2-regularization approximate feature selection. 54

Bayesians v. Frequentists. What is a probability? Frequentists: a probability is the likelihood that an event will happen, approximated by the ratio of the number of observed events to the number of total events. Assessment is vital to selecting a model. Point estimates are absolutely fine. Bayesians: a probability is a degree of believability of a proposition. Bayesians require that probabilities be prior beliefs conditioned on data. The Bayesian approach is optimal, given a good model, a good prior, and a good loss function. Don't worry so much about assessment. If you are ever making a point estimate, you've made a mistake. The only valid probabilities are posteriors based on evidence given some prior. 55

Bayesian Linear Regression. The previous MLE derivation of linear regression uses point estimates for the weight vector w. Bayesians say: hold it right there. Use a prior distribution over w to estimate the parameters: p(w | α) = N(w; 0, α^{-1} I) = (α/2π)^{(M+1)/2} exp(-(α/2) w^T w). α is a hyperparameter over w: the precision, or inverse variance, of the distribution. Now optimize the posterior: p(w | x, t, α, β) ∝ p(t | x, w, β) p(w | α). 56

Optimize the Bayesian posterior. p(w | x, t, α, β) ∝ p(t | x, w, β) p(w | α). As usual, it is easier to optimize after a log transform: ln p(t | x, w, β) + ln p(w | α). Since p(t | x, w, β) = Π_n √(β/2π) exp(-(β/2)(t_n - y(x_n, w))^2), we have ln p(t | x, w, β) = (N/2) ln β - (N/2) ln 2π - (β/2) Σ_n (t_n - y(x_n, w))^2. 57

Optimize the Bayesian posterior. p(w | x, t, α, β) ∝ p(t | x, w, β) p(w | α). Similarly, since p(w | α) = N(w; 0, α^{-1} I) = (α/2π)^{(M+1)/2} exp(-(α/2) w^T w), we have ln p(w | α) = ((M+1)/2) ln α - ((M+1)/2) ln 2π - (α/2) w^T w. 58

Optimize the Bayesian posterior. ln p(t | x, w, β) + ln p(w | α), with ln p(t | x, w, β) = (N/2) ln β - (N/2) ln 2π - (β/2) Σ_n (t_n - y(x_n, w))^2 and ln p(w | α) = ((M+1)/2) ln α - ((M+1)/2) ln 2π - (α/2) w^T w. Ignoring terms that do not depend on w: ln p(t | x, w, β) + ln p(w | α) ∝ -(β/2) Σ_n (t_n - y(x_n, w))^2 - (α/2) w^T w. Maximizing this is an IDENTICAL formulation to L2-regularization, with λ = α/β. 59
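A hedged numpy sketch of the equivalence just noted: the MAP estimate under the Gaussian prior is the ridge solution with λ = α/β (the α, β values and toy data are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(4)
x = np.linspace(0.0, 1.0, 20)
t = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(20)
X = np.vander(x, N=6, increasing=True)     # degree-5 polynomial features

alpha, beta = 0.5, 25.0                    # prior precision, noise precision
lam = alpha / beta

# MAP estimate = ridge estimate: (X^T X + (alpha/beta) I) w = X^T t
w_map = np.linalg.solve(X.T @ X + lam * np.eye(6), X.T @ t)
print(w_map)
```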

Context. Overfitting is bad. Bayesians vs. Frequentists: is one better? Machine learning uses techniques from both camps. 60

Next Time. Logistic Regression. Read Chapters 4.1 and 4.3. 61