Overview: Probabilistic Interpretation of Linear Regression; Maximum Likelihood Estimation; Bayesian Estimation; MAP Estimation

Probabilistic Interpretation: Linear Regression

Assume the output y is generated by

Y = \theta^T x + M = h_\theta(x) + M,

where h_\theta(x) = \theta^T x and M is Gaussian noise with mean 0 and variance 1. As usual, we use capitals for random variables. So the training data d is

\{(x^{(1)}, h_\theta(x^{(1)}) + M^{(1)}),\ (x^{(2)}, h_\theta(x^{(2)}) + M^{(2)}),\ \dots,\ (x^{(m)}, h_\theta(x^{(m)}) + M^{(m)})\},

where M^{(1)}, M^{(2)}, \dots, M^{(m)} are independent random variables, each Gaussian with mean 0 and variance 1.

Example

Training data generated using Matlab code:

x = [0:0.1:10];
y = x + randn(1, length(x));
plot(x, y, '+')

In this example θ has just one element (m = 1) with value θ_1 = 1. In practice we don't know θ_1 in advance, and given the training data we want to estimate its value.

[Figure: scatter plot of the training data; Y = x + M, M ~ N(0,1), θ_1 = 1.]

Probabilistic Interpretation: Linear Regression

A Gaussian RV Z with mean µ and variance σ² has pdf

f_Z(z) = \frac{1}{\sigma\sqrt{2\pi}} e^{-(z-\mu)^2 / (2\sigma^2)}.

So we are assuming:

f_M(m) = \frac{1}{\sqrt{2\pi}} e^{-m^2/2},  \qquad  f_Y(y) = \frac{1}{\sqrt{2\pi}} e^{-(y - h_\theta(x))^2/2}.

The likelihood f_{D|\Theta}(d|\theta) of the training data d is therefore:

f_{D|\Theta}(d|\theta) = \prod_{i=1}^m \frac{1}{\sqrt{2\pi}} e^{-(y^{(i)} - h_\theta(x^{(i)}))^2/2} = \left(\frac{1}{\sqrt{2\pi}}\right)^m e^{-\sum_{i=1}^m (y^{(i)} - h_\theta(x^{(i)}))^2 / 2}

Taking logs:

\log f_{D|\Theta}(d|\theta) = m \log\frac{1}{\sqrt{2\pi}} - \sum_{i=1}^m \frac{(y^{(i)} - h_\theta(x^{(i)}))^2}{2}

Parameter Estimation

Recall Bayes Rule for PDFs:

f_{\Theta|D}(\theta|d) = \frac{f_{D|\Theta}(d|\theta)\, f_\Theta(\theta)}{f_D(d)}

i.e. posterior ∝ likelihood × prior.

The maximum likelihood estimate of θ maximises the likelihood. Equivalently, it maximises the log-likelihood (why?). We can also drop the fixed scaling factor \frac{1}{\sqrt{2\pi}} (why?). In the linear regression case we select θ to maximise

-\sum_{i=1}^m (y^{(i)} - h_\theta(x^{(i)}))^2

i.e. to minimise

\sum_{i=1}^m (y^{(i)} - h_\theta(x^{(i)}))^2
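As a quick check of this equivalence, here is a minimal Matlab sketch (reusing synthetic data like that in the earlier example, and a hypothetical grid of candidate θ values): the θ that maximises the log-likelihood is the same θ that minimises the sum of squared errors.

x = 0:0.1:10;
y = x + randn(1, length(x));                   % data generated with theta_1 = 1
thetas = 0.5:0.01:1.5;                         % candidate values of theta_1
loglik = zeros(size(thetas));
sse    = zeros(size(thetas));
for k = 1:length(thetas)
    r         = y - thetas(k)*x;               % residuals y - h_theta(x)
    loglik(k) = length(x)*log(1/sqrt(2*pi)) - sum(r.^2)/2;
    sse(k)    = sum(r.^2);
end
[~, i1] = max(loglik);                         % argmax of log-likelihood
[~, i2] = min(sse);                            % argmin of squared error
disp([thetas(i1) thetas(i2)])                  % the two agree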

Probabilistic Interpretation: Who Cares?

Since probability is about reasoning under uncertainty, it would be very odd indeed if our machine learning algorithms did not make good sense from a probability/statistics point of view. Casting an ML approach within a statistical framework clarifies the assumptions that have been made (perhaps implicitly). E.g. in linear regression:
- Noise is additive: Y = θ^T x + M
- Noise on each observation is independent and identically distributed
- Noise is Gaussian: it is this which leads directly to the use of a square loss (y - h_θ(x))². Changing the noise model would lead to a different loss function.

Closed-form solution

It turns out we can find the least-squares solution in closed form (not just by using gradient descent). Let's work through an example with a linear model and only one input (m = 1). Select θ to maximise

\log L(\theta) = -\frac{1}{2}\sum_{j=1}^n (y_j - \theta x_j)^2

Compute the derivative with respect to θ:

\frac{d \log L}{d\theta} = \sum_{j=1}^n (y_j - \theta x_j) x_j = \sum_{j=1}^n y_j x_j - \theta \sum_{j=1}^n x_j^2

Set the derivative equal to 0 and solve for θ:

\sum_{j=1}^n y_j x_j - \theta \sum_{j=1}^n x_j^2 = 0 \quad\Rightarrow\quad \theta = \frac{\sum_{j=1}^n y_j x_j}{\sum_{j=1}^n x_j^2}

We can extend this to multiple inputs, but we won't do it in this course.
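A minimal Matlab sketch of this closed-form estimate, applied to synthetic data generated as in the earlier example (so the true value is θ_1 = 1):

x = 0:0.1:10;
y = x + randn(1, length(x));            % true theta_1 = 1
theta_hat = sum(y .* x) / sum(x.^2);    % theta = sum_j y_j x_j / sum_j x_j^2
disp(theta_hat)                         % close to 1 for this data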

Fitting nonlinear curves: choosing features

Example hypothesis: h_θ(x) = θ_0 + θ_1 x².

Define a feature vector z with z = x². This new z can be computed given the input x, so it's known. Using this new feature vector our hypothesis can be rewritten as h_θ(z) = θ_0 + θ_1 z, so we can directly apply all the ideas we've just discussed.

[Figures: Y = x² + M, M ~ N(0, 100), plotted against x (left) and against z = x² (right).]
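A minimal Matlab sketch of the feature-mapping idea, using synthetic data drawn from Y = x² + M with M ~ N(0, 100) as in the figures (the sample size and grid of x values are illustrative choices):

x = linspace(-5, 5, 101);
y = x.^2 + 10*randn(size(x));          % Y = x^2 + M, M ~ N(0, 100)
z = x.^2;                              % new feature z, computable from x
Z = [ones(length(z), 1) z(:)];         % design matrix for h_theta(z) = theta_0 + theta_1*z
theta = Z \ y(:);                      % ordinary least squares in the new feature
disp(theta')                           % roughly [0 1]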

Regularisation & Model Selection

The advertising example again. Thin out the data by taking every 10th point. Try a few different hypotheses:
- h_θ(x) = θ_0 + θ_1 x
- h_θ(x) = θ_0 + θ_1 x + θ_2 x²
- h_θ(x) = θ_0 + θ_1 x + θ_2 x² + … + θ_6 x^6
- h_θ(x) = θ_0 + θ_1 x + θ_2 x² + … + θ_10 x^10

[Figures: Sales vs TV Advertising for the thinned data, with linear, quadratic, 6th-degree and 10th-degree fits.]

As we add more parameters, we start to fit the noise in the training data; this is called overfitting. But if we use too few parameters then we will get a poor fit, underfitting. How do we strike the right balance between these? This is an example of the bias-variance trade-off.
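The advertising data itself is not reproduced here, but the effect can be sketched in Matlab with a hypothetical stand-in dataset: as the polynomial degree grows, the training error keeps falling even though the extra flexibility is only fitting noise.

x = linspace(0, 300, 20);                    % thinned-out inputs (stand-in data)
y = 5 + 0.05*x + 3*randn(size(x));           % noisy, roughly linear "sales"
for d = [1 2 6 10]
    [p, ~, mu] = polyfit(x, y, d);           % centred/scaled degree-d least-squares fit
    err = mean((polyval(p, x, [], mu) - y).^2);   % training error
    fprintf('degree %2d: training MSE = %.2f\n', d, err);
end
% Training error keeps falling as the degree grows, but the high-degree
% fits are tracking the noise (overfitting) rather than the trend.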

Regularisation & Model Selection

More data can help, e.g. when we don't thin out the data:

[Figures: Sales vs TV Advertising for the full dataset with linear, quadratic, 6th-degree and 10th-degree fits, shown over the training range and over a wider range of TV Advertising values.]

But even with more data our hypothesis still doesn't generalise well, i.e. it doesn't predict well for data outside the training set.

Bayesian Estimation

Estimate the posterior f_{\Theta|D}(\theta|d), rather than the likelihood f_{D|\Theta}(d|\theta): a distribution rather than just a single value.

Example: Y = \sum_{i=1}^m \Theta_i x_i + M, with M ~ N(0, 1) and \Theta_i ~ N(0, \lambda).

Bayes theorem:

f_{\Theta|D}(\theta|d) = \frac{f_{D|\Theta}(d|\theta)\, f_\Theta(\theta)}{f_D(d)}

We already know that the likelihood is f_{D|\Theta}(d|\theta) \propto \exp\left(-\sum_{j=1}^n \left(y^{(j)} - \sum_{i=1}^m \theta_i x_i^{(j)}\right)^2 / 2\right). From our model we have the prior f_{\Theta_i}(\theta) \propto \exp(-\theta^2 / (2\lambda)). f_D(d) is a normalising constant (so that the area under the PDF f_{\Theta|D}(\theta|d) is 1). Combining these using Bayes Rule gives:

f_{\Theta|D}(\theta|d) \propto \exp\left(-\sum_{j=1}^n \left(y^{(j)} - \sum_{i=1}^m \theta_i x_i^{(j)}\right)^2 / 2\right) \exp\left(-\sum_{i=1}^m \theta_i^2 / (2\lambda)\right)
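A minimal Matlab sketch of what "a distribution rather than a single value" means for the single-parameter case Y = θx + M: evaluate the unnormalised posterior over a grid of candidate θ values (the value of λ and the grid are illustrative choices).

x = 0:0.1:10;
y = x + randn(1, length(x));
lambda = 1/3;                                  % illustrative prior variance
thetas = 0.8:0.001:1.2;                        % grid of candidate theta values
logpost = -arrayfun(@(t) sum((y - t*x).^2), thetas)/2 - thetas.^2/(2*lambda);
post = exp(logpost - max(logpost));            % unnormalised posterior, rescaled for plotting
plot(thetas, post); xlabel('\theta'); ylabel('unnormalised posterior');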

MAP Estimation

Maximum a posteriori (MAP) estimation¹: select the θ that maximises the posterior f_{\Theta|D}(\theta|d) (back to a single value rather than a distribution).

[Figures: two sketches of f_{\Theta|D}(\theta|d) against θ, the first with a single peak and the MAP estimate marked, the second with more than one peak.]

Runs into trouble if the distribution has more than one peak.

¹ Challenge: what does this Latin mean?

MAP Estimation For Linear Regression: Ridge Regression

Taking logs (the same value of θ maximises f_{\Theta|D}(\theta|d) and \log f_{\Theta|D}(\theta|d), why?), select θ to maximise

-\frac{1}{m}\sum_{i=1}^m (h_\theta(x^{(i)}) - y^{(i)})^2 - \frac{1}{\lambda}\sum_{j=1}^n \theta_j^2

i.e. to minimise

\frac{1}{m}\sum_{i=1}^m (h_\theta(x^{(i)}) - y^{(i)})^2 + \frac{1}{\lambda}\sum_{j=1}^n \theta_j^2

This is called ridge regression. When λ → 0, we are saying that we are certain that θ = 0. When λ is large we are saying that we know little about the value of θ prior to making the observations; the penalised estimate is then close to the non-penalised estimate.
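Setting the gradient of the penalised objective on this slide to zero gives a closed form, θ = (XᵀX + (m/λ)I)⁻¹ Xᵀy. A minimal Matlab sketch for the single-input data from the earlier example (λ here is an illustrative choice):

x = 0:0.1:10;
y = x + randn(1, length(x));
X = x(:);  Y = y(:);  m = length(Y);
lambda = 0.05;
theta_ridge = (X'*X + (m/lambda)*eye(size(X,2))) \ (X'*Y);   % ridge / MAP estimate
theta_ls    = (X'*X) \ (X'*Y);                               % maximum-likelihood estimate
disp([theta_ridge theta_ls])                                 % ridge estimate is pulled towards 0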

MAP Estimation

[Figure: MAP estimate and maximum-likelihood estimate of θ as λ varies from 0 to 0.1.]

MAP vs Maximum Likelihood Estimation

The difference between MAP and ML really kicks in when we only have a small number of observations, yet still need to make a prediction. Our prior beliefs are then especially important.

[Figure: a handful of observations together with the ML estimate, the MAP estimate (λ = 1/3) and the true line Y = X.]
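A minimal Matlab sketch of this effect, using the posterior from the Bayesian Estimation slide (noise variance 1, prior Θ ~ N(0, λ)): maximising it for a single input gives θ_MAP = Σ_j x_j y_j / (Σ_j x_j² + 1/λ), compared with θ_ML = Σ_j x_j y_j / Σ_j x_j². The data points below are a hypothetical small sample.

x = [0.1 0.3 0.5];                               % hypothetical small training set
y = x + randn(size(x));                          % true theta = 1, noise variance 1
lambda = 1/3;
theta_ml  = sum(x.*y) / sum(x.^2);               % maximises the likelihood
theta_map = sum(x.*y) / (sum(x.^2) + 1/lambda);  % maximises the posterior
disp([theta_ml theta_map])                       % MAP is pulled towards the prior mean 0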

MAP vs Maximum Likelihood Estimation

But as the number N of observations grows, the impact of the prior on the posterior tends to decline.

[Figure: MAP estimate (λ = 1/100) and ML estimate of θ against the number of observations, from 20 to 100.]