STA414/2104 Statistical Methods for Machine Learning II


STA414/2104 Statistical Methods for Machine Learning II Murat A. Erdogdu & David Duvenaud Department of Computer Science Department of Statistical Sciences Lecture 3 Slide credits: Russ Salakhutdinov

Announcements Homework 1 v0 is released on Jan 19! Due on Jan 29 at 10 pm. You can turn in the HW in class, at office hours, etc.

Last Time Maximum likelihood estimation Exponential families Important distributions Geometry of multivariate distributions Gaussian mixtures

Parametric Distributions We want to model the probability distribution of a random variable x given a finite set of observations x_1, ..., x_N. We need to determine the model parameters given the observed data. We also assume that the data points are i.i.d. We will focus on maximum likelihood estimation. Remember the curve fitting example.


Linear Basis Function Models Remember, the simplest linear model for regression: y(x, w) = w_0 + w_1 x_1 + ... + w_d x_d, where x = (x_1, ..., x_d)^T is a d-dimensional input vector (covariates). Key property: it is a linear function of the parameters. However, it is also a linear function of the input variables. Instead consider: y(x, w) = w_0 + Σ_j w_j φ_j(x), where the φ_j(x) are known as basis functions. Typically φ_0(x) = 1, so that w_0 acts as a bias (or intercept). In the simplest case, we use linear basis functions: φ_j(x) = x_j. Using nonlinear basis functions allows y(x, w) to be a nonlinear function of the input space.

Linear Basis Function Models Polynomial basis functions: φ_j(x) = x^j. These basis functions are global: small changes in x affect all basis functions. Gaussian basis functions: φ_j(x) = exp(-(x - μ_j)^2 / (2 s^2)). These basis functions are local: small changes in x only affect nearby basis functions. μ_j and s control location and scale (width).
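To make the basis-expansion idea concrete, here is a minimal NumPy sketch (not from the slides) that builds design matrices for polynomial and Gaussian basis functions; the helper names and the particular choice of centres and width are assumptions made for illustration.

```python
import numpy as np

def polynomial_design(x, M):
    """Design matrix with global polynomial basis functions phi_j(x) = x**j, j = 0..M-1."""
    return np.vander(x, N=M, increasing=True)

def gaussian_design(x, centres, s):
    """Design matrix with a constant bias column phi_0(x) = 1 followed by local
    Gaussian bumps phi_j(x) = exp(-(x - mu_j)^2 / (2 s^2))."""
    bumps = np.exp(-((x[:, None] - centres[None, :]) ** 2) / (2.0 * s ** 2))
    return np.column_stack([np.ones_like(x), bumps])

x = np.linspace(-1.0, 1.0, 50)
Phi_poly = polynomial_design(x, M=4)                          # shape (50, 4)
Phi_gauss = gaussian_design(x, np.linspace(-1, 1, 9), s=0.2)  # shape (50, 10)
```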

Linear Basis Function Models Sigmoidal basis functions: φ_j(x) = σ((x - μ_j)/s), where σ(a) = 1/(1 + exp(-a)). These basis functions are local: small changes in x only affect nearby basis functions. μ_j and s control location and scale. Decision boundaries will be linear in the feature space, but they correspond to nonlinear boundaries in the original input space x. Classes that are linearly separable in the feature space need not be linearly separable in the original input space.

Maximum Likelihood Estimation As before, assume the observations arise from a deterministic function with additive Gaussian noise: t = y(x, w) + ε, where ε ~ N(0, β^-1), which we can write as p(t | x, w, β) = N(t | y(x, w), β^-1). Given observed inputs X = {x_1, ..., x_N} and corresponding target values t_1, ..., t_N, under the independence assumption we can write down the likelihood function: p(t | X, w, β) = Π_n N(t_n | w^T φ(x_n), β^-1), where y(x, w) = w^T φ(x).

Maximum Likelihood Estimation Taking the logarithm, we obtain: ln p(t | w, β) = (N/2) ln β - (N/2) ln(2π) - β E_D(w), where E_D(w) = (1/2) Σ_n (t_n - w^T φ(x_n))^2 is the sum-of-squares error function. Differentiating with respect to w and setting the gradient to zero yields: Σ_n (t_n - w^T φ(x_n)) φ(x_n)^T = 0.

Maximum Likelihood Estimation Solving for w, we get: w_ML = (Φ^T Φ)^-1 Φ^T t = Φ† t, where Φ† = (Φ^T Φ)^-1 Φ^T is the Moore-Penrose pseudo-inverse and Φ is the N × M design matrix with entries Φ_nj = φ_j(x_n).
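As a sketch of this closed-form ML solution (assuming a design matrix Phi and target vector t as in the earlier example), the pseudo-inverse can be computed directly with NumPy:

```python
import numpy as np

def fit_ml(Phi, t):
    """Maximum likelihood weights w_ML = (Phi^T Phi)^{-1} Phi^T t via the
    Moore-Penrose pseudo-inverse; pinv also handles rank-deficient Phi."""
    return np.linalg.pinv(Phi) @ t

# Example usage (hypothetical names from the earlier sketch):
# w_ml = fit_ml(Phi_gauss, t)
```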

Geometry of Least Squares Consider an N-dimensional space, so that t = (t_1, ..., t_N)^T is a vector in that space. Each basis function, evaluated at the N data points, can also be represented as a vector in the same space. If M is less than N, then the M basis-function vectors span a linear subspace S of dimensionality M. Define y = Φ w_ML. The sum-of-squares error is equal to the squared Euclidean distance between y and t (up to a factor of 1/2). The least-squares solution corresponds to the orthogonal projection of t onto the subspace S.
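A quick numerical check of this geometric picture (a toy example with an assumed random design matrix, not from the slides): the residual t - y of the least-squares fit is orthogonal to every column of Φ.

```python
import numpy as np

rng = np.random.default_rng(0)
N, M = 20, 5
Phi = rng.normal(size=(N, M))   # N data points, M basis functions
t = rng.normal(size=N)          # target vector

w_ml, *_ = np.linalg.lstsq(Phi, t, rcond=None)  # least-squares solution
y = Phi @ w_ml                                  # orthogonal projection of t onto span(Phi)

# The residual is (numerically) orthogonal to every basis vector.
print(np.allclose(Phi.T @ (t - y), 0.0, atol=1e-8))
```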

Regularized Least Squares Let us consider the following error function: E_D(w) + λ E_W(w), i.e. a data term plus a regularization term, where λ is called the regularization parameter (or penalty level). Using the sum-of-squares error function with a quadratic penalization term, we obtain: (1/2) Σ_n (t_n - w^T φ(x_n))^2 + (λ/2) w^T w, which is minimized by setting: w = (λI + Φ^T Φ)^-1 Φ^T t (ridge regression). The solution adds a positive constant to the diagonal of Φ^T Φ. This makes the problem nonsingular, even if Φ^T Φ is not of full rank (e.g. when the number of training examples is less than the number of basis functions).
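A minimal sketch of the ridge solution in NumPy (again assuming Phi and t as in the earlier examples; the function name is an illustrative choice):

```python
import numpy as np

def fit_ridge(Phi, t, lam):
    """Ridge regression weights w = (lam * I + Phi^T Phi)^{-1} Phi^T t.
    The lam * I term added to the diagonal keeps the system nonsingular
    even when Phi^T Phi is rank deficient."""
    M = Phi.shape[1]
    return np.linalg.solve(lam * np.eye(M) + Phi.T @ Phi, Phi.T @ t)
```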

Equivalent formulations We can write the problem (data term + regularization term) equivalently as a constrained problem: minimize the data term subject to a constraint on the size of the weights, where the constraint level depends on λ but is different from it. For example, we can write ridge regression as: minimize (1/2) Σ_n (t_n - w^T φ(x_n))^2 subject to w^T w ≤ η.

Effect of Regularization The blue curves are the contours of the error function. The red curve is the feasible set for the parameters. The regularizer shrinks the model parameters toward zero.

Other Regularizers Using a more general regularizer, we get: (1/2) Σ_n (t_n - w^T φ(x_n))^2 + (λ/2) Σ_j |w_j|^q. The case q = 1 is the lasso; q = 2 is the quadratic (ridge) regularizer.

The Lasso Penalize the absolute values of the weights: (1/2) Σ_n (t_n - w^T φ(x_n))^2 + λ Σ_j |w_j|. For λ sufficiently large, some of the coefficients w_j will be driven to exactly zero, leading to a sparse model. The above formulation is equivalent to minimizing the sum-of-squares error subject to the constraint Σ_j |w_j| ≤ η. The two formulations are related through Lagrange multipliers. The lasso problem is convex and can be solved efficiently.
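The slides do not specify a particular solver; one common and simple choice is proximal gradient descent (ISTA), sketched below in NumPy. The step size and iteration count are assumptions made for the example.

```python
import numpy as np

def soft_threshold(z, thresh):
    """Elementwise soft-thresholding: the proximal operator of thresh * ||.||_1."""
    return np.sign(z) * np.maximum(np.abs(z) - thresh, 0.0)

def lasso_ista(Phi, t, lam, n_iters=1000):
    """Minimize 0.5 * ||t - Phi w||^2 + lam * ||w||_1 by proximal gradient (ISTA)."""
    w = np.zeros(Phi.shape[1])
    step = 1.0 / np.linalg.norm(Phi, ord=2) ** 2   # 1 / Lipschitz constant of the data term
    for _ in range(n_iters):
        grad = Phi.T @ (Phi @ w - t)               # gradient of the quadratic data term
        w = soft_threshold(w - step * grad, step * lam)
    return w
```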

Lasso vs. Quadratic Penalty The lasso tends to generate sparser solutions than a quadratic regularizer (the two are often referred to as L1 and L2 regularizers).

Sparsity Induced by Regularizers The figure compares the constraint regions for the lasso (q = 1) and the quadratic regularizer (q = 2), illustrating why the lasso tends to produce sparse solutions.

Posterior Distribution The posterior distribution for the model parameters can be found by combining the prior with the likelihood for the parameters given the data. This is accomplished using Bayes' rule: p(w | D) = p(D | w) p(w) / p(D), i.e. the posterior probability of the weight vector w given the training data D is proportional to the probability of the observed data given w times the prior probability of w. The marginal likelihood (normalizing constant) is p(D) = ∫ p(D | w) p(w) dw. This integral can be high-dimensional and is often difficult to compute.

The Rules of Probability Sum rule: p(X) = Σ_Y p(X, Y). Product rule: p(X, Y) = p(Y | X) p(X).

Predictive Distribution We can also state Bayes' rule in words: posterior ∝ likelihood × prior. We can make predictions for a new data point x*, given the training dataset, by integrating over the posterior distribution: p(t* | x*, D) = ∫ p(t* | x*, w) p(w | D) dw, which is sometimes called the predictive distribution. Note that computing the predictive distribution requires knowledge of the posterior distribution p(w | D) = p(D | w) p(w) / p(D), where the normalizing constant p(D) is usually intractable.

MLE vs MAP In maximum likelihood estimation (MLE), we maximize the likelihood p(D | w). Note that the roles of D and w are sometimes swapped when denoting the likelihood: the standard notation for the likelihood is L(w | D), not p(D | w), even though the two are equal as functions of w. In maximum a posteriori (MAP) estimation, we maximize the posterior p(w | D) ∝ p(D | w) p(w).

Modeling Challenges The first challenge is specifying a suitable model and suitable prior distributions. This can be challenging, particularly when dealing with the high-dimensional problems we see in machine learning. - A suitable model should admit all the possibilities that are thought to be at all likely. - A suitable prior should avoid giving zero or very small probabilities to possible events, but should also avoid spreading the probability too thinly over all possibilities. We may need to properly model dependencies between parameters in order to avoid having a prior that is too spread out. One strategy is to introduce latent variables into the model and hyperparameters into the prior. Both are ways of modeling dependencies in a tractable way.

Computational Challenges The other big challenge is computing the posterior distribution. There are several main approaches: Analytical integration: if we use conjugate priors, the posterior distribution can be computed analytically. This only works for simple models and is usually too much to hope for. Gaussian approximation: approximate the posterior distribution with a Gaussian. This works well when there is a lot of data compared to the model complexity (since the posterior is then close to Gaussian). Monte Carlo integration: once we have samples from the posterior distribution, we can do many things. The dominant current approach is Markov chain Monte Carlo (MCMC): simulate a Markov chain that converges to the posterior distribution. It can be applied to a wide variety of problems. Variational approximation: a cleverer way to approximate the posterior. It is often much faster than MCMC, but usually not as general.

Bayesian Linear Regression Given observed inputs X = {x_1, ..., x_N} and corresponding target values t = (t_1, ..., t_N)^T, we can write down the likelihood function: p(t | X, w, β) = Π_n N(t_n | w^T φ(x_n), β^-1), where the φ(x) represent our basis functions. The corresponding conjugate prior is given by a Gaussian distribution: p(w) = N(w | m_0, S_0). As both the likelihood and the prior are Gaussian, the posterior distribution will also be Gaussian. If the posterior distributions p(θ | x) are in the same family as the prior distribution p(θ), the prior and posterior are called conjugate distributions, and the prior is called a conjugate prior for the likelihood.

Bayesian Linear Regression Combining the prior with the likelihood term, the posterior (with a bit of manipulation) takes the following Gaussian form: p(w | t) = N(w | m_N, S_N), where m_N = S_N (S_0^-1 m_0 + β Φ^T t) and S_N^-1 = S_0^-1 + β Φ^T Φ. The posterior mean can be expressed in terms of the least-squares estimator and the prior mean: as we increase the prior precision (decrease the prior variance), we place greater weight on the prior mean relative to the data.

Bayesian Linear Regression Consider a zero-mean isotropic Gaussian prior, governed by a single precision parameter α: p(w | α) = N(w | 0, α^-1 I) (isotropic means a diagonal covariance with equal entries), for which the posterior is Gaussian with m_N = β S_N Φ^T t and S_N^-1 = α I + β Φ^T Φ. If we consider an infinitely broad prior, α → 0, the mean m_N of the posterior distribution reduces to the maximum likelihood value w_ML. The log of the posterior distribution is given by the sum of the log-likelihood and the log of the prior: ln p(w | t) = -(β/2) Σ_n (t_n - w^T φ(x_n))^2 - (α/2) w^T w + const. Maximizing this posterior with respect to w is equivalent to minimizing the sum-of-squares error function with a quadratic regularization term (ridge regression).
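A minimal sketch of the posterior computation under this isotropic prior (assumed precisions alpha and beta; Phi and t as in the earlier examples):

```python
import numpy as np

def posterior(Phi, t, alpha, beta):
    """Posterior N(w | m_N, S_N) for the zero-mean isotropic prior N(w | 0, alpha^{-1} I):
    S_N^{-1} = alpha * I + beta * Phi^T Phi,   m_N = beta * S_N @ Phi^T @ t."""
    M = Phi.shape[1]
    S_N = np.linalg.inv(alpha * np.eye(M) + beta * Phi.T @ Phi)
    m_N = beta * S_N @ Phi.T @ t
    return m_N, S_N
```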

Bayesian Linear Regression Consider a linear model of the form y(x, w) = w_0 + w_1 x. The training data are generated from a straight-line target function by first choosing x_n uniformly from [-1, 1], evaluating the target function, and adding a small amount of Gaussian noise. Goal: recover the values of the true parameters from such data. 0 data points are observed: the figure shows the prior over w and samples drawn from it in data space.

Bayesian Linear Regression 1 data point is observed: the figure shows the likelihood of that point, the updated posterior, and samples from the posterior in data space.

Bayesian Linear Regression The figure shows how the posterior and the corresponding data-space samples evolve as 0, 1, 2, and 20 data points are observed: the posterior becomes progressively more concentrated as more data arrive.
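The exact parameter values used on the slides are not reproduced in the transcription; the sketch below mimics the experiment with assumed true parameters, noise level, and precisions, and prints the posterior mean after 1, 2, and 20 observations.

```python
import numpy as np

rng = np.random.default_rng(1)
a0, a1, noise_std = -0.3, 0.5, 0.2        # assumed true line and noise level
alpha, beta = 2.0, 1.0 / noise_std ** 2   # assumed prior and noise precisions

x = rng.uniform(-1.0, 1.0, size=20)
t = a0 + a1 * x + rng.normal(0.0, noise_std, size=20)

for n in (1, 2, 20):                      # posterior after 1, 2 and 20 observations
    Phi = np.column_stack([np.ones(n), x[:n]])            # basis functions (1, x)
    S_N = np.linalg.inv(alpha * np.eye(2) + beta * Phi.T @ Phi)
    m_N = beta * S_N @ Phi.T @ t[:n]
    print(n, m_N)                         # the mean drifts toward (a0, a1) as n grows
```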

Predictive Distribution We can make predictions for a new input vector x by integrating over the posterior distribution: p(t | x, t, α, β) = ∫ p(t | x, w, β) p(w | t, α, β) dw = N(t | m_N^T φ(x), σ_N^2(x)), where σ_N^2(x) = 1/β + φ(x)^T S_N φ(x). The first term is the noise in the target values; the second term is the uncertainty associated with the parameter values. In the limit as N → ∞, the second term goes to zero, and the variance of the predictive distribution arises only from the additive noise.
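A sketch of the predictive mean and variance, given the posterior (m_N, S_N) from the earlier sketch and the feature vector phi_x of a new input (names are illustrative):

```python
import numpy as np

def predictive(phi_x, m_N, S_N, beta):
    """Predictive distribution N(t | mean, var) at a new input with features phi_x:
    mean = m_N^T phi(x),  var = 1/beta + phi(x)^T S_N phi(x)
    (observation noise plus parameter uncertainty)."""
    mean = m_N @ phi_x
    var = 1.0 / beta + phi_x @ S_N @ phi_x
    return mean, var
```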

Predictive Distribution: Bayes vs. ML The figure compares the predictive distribution based on maximum likelihood estimates with the Bayesian predictive distribution.

Predictive Distribution Sinusoidal dataset, 9 Gaussian basis functions. The figures show the predictive distribution and samples from the posterior over functions.


Gamma-Gaussian Conjugate Prior So far we have assumed that the noise precision parameter β is known. If both w and β are treated as unknown, we can introduce a conjugate prior distribution given by the Gaussian-Gamma distribution: p(w, β) = N(w | m_0, β^-1 S_0) Gam(β | a_0, b_0), where the Gamma distribution is given by: Gam(β | a, b) = (1/Γ(a)) b^a β^(a-1) exp(-b β). The posterior distribution then takes the same functional form as the prior.

Equivalent Kernel The predictive mean can be written as: y(x, m_N) = m_N^T φ(x) = β φ(x)^T S_N Φ^T t = Σ_n k(x, x_n) t_n, where k(x, x') = β φ(x)^T S_N φ(x') is known as the equivalent kernel (or smoother matrix). Think of k(x, x_n) as an inverse distance between x and x_n. The mean of the predictive distribution at a point x can thus be written as a linear combination of the training-set target values. In the above case the kernel is positive semidefinite, since S_N is always positive definite.

Equivalent Kernel The weight of t_n depends on the distance between x and x_n; nearby x_n carry more weight. The figure shows the equivalent kernel for Gaussian basis functions. The kernel as a covariance function: cov[y(x), y(x')] = φ(x)^T S_N φ(x') = β^-1 k(x, x'). We can avoid the explicit use of basis functions and define the kernel function directly, leading to Gaussian Processes.
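A sketch of the equivalent-kernel (smoother) matrix for a set of test inputs, assuming design matrices Phi_star (test) and Phi (training) built with the same basis functions, and S_N from the posterior sketch above:

```python
import numpy as np

def equivalent_kernel(Phi_star, Phi, S_N, beta):
    """Smoother matrix K with entries K[i, n] = beta * phi(x*_i)^T S_N phi(x_n).
    The predictive mean at x*_i is the weighted sum over training targets,
    i.e. Phi_star @ m_N == K @ t."""
    return beta * Phi_star @ S_N @ Phi.T
```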

Other Kernels Examples of equivalent kernels k(x, x') for x = 0, plotted as a function of x', for polynomial and sigmoidal basis functions. Note that these are localized functions of x', even though the corresponding basis functions are not local.

Limitations Using M basis functions along each dimension of a D-dimensional input space requires M^D basis functions: the curse of dimensionality. The data vectors typically lie close to a nonlinear low-dimensional manifold, whose intrinsic dimensionality is smaller than that of the input space.