STA 414/2104: Machine Learning

STA 414/2104: Machine Learning. Russ Salakhutdinov, Department of Computer Science, Department of Statistics. rsalakhu@cs.toronto.edu, http://www.cs.toronto.edu/~rsalakhu/. Lecture 2.

Linear Least Squares. From last class: minimize the sum of the squares of the errors between the predictions for each data point x_n and the corresponding real-valued targets t_n. Loss function: sum-of-squared error function E(w) = (1/2) ∑_{n=1}^N (w^T x_n − t_n)^2. (Figure source: Wikipedia.)

Linear Least Squares. If X^T X is nonsingular, then the unique solution is given by w* = (X^T X)^{−1} X^T t, where w* is the vector of optimal weights, t is the vector of target values, and the design matrix X has one input vector per row. (Figure source: Wikipedia.) At an arbitrary input x, the prediction is y(x) = w*^T x. The entire model is characterized by d+1 parameters w*.
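As a concrete illustration of the closed-form solution above, here is a minimal NumPy sketch (the synthetic data and variable names are illustrative, not from the slides):

```python
import numpy as np

# Synthetic regression problem (illustrative only).
rng = np.random.default_rng(0)
N, d = 50, 3
X = np.column_stack([np.ones(N), rng.normal(size=(N, d))])  # design matrix: one input per row, plus a bias column
w_true = np.array([1.0, 2.0, -1.0, 0.5])
t = X @ w_true + 0.1 * rng.normal(size=N)                   # real-valued targets with Gaussian noise

# Closed-form least-squares solution: w* = (X^T X)^{-1} X^T t
w_star = np.linalg.solve(X.T @ X, X.T @ t)

# Prediction at an arbitrary input
x_new = np.array([1.0, 0.2, -0.3, 0.1])
print("w* =", w_star, " prediction =", x_new @ w_star)
```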

Example: Polynomial Curve Fitting. Consider observing a training set consisting of N 1-dimensional observations x = (x_1, ..., x_N)^T, together with corresponding real-valued targets t = (t_1, ..., t_N)^T. Goal: fit the data using a polynomial function of the form y(x, w) = w_0 + w_1 x + w_2 x^2 + ... + w_M x^M = ∑_{j=0}^M w_j x^j. Note: the polynomial function is a nonlinear function of x, but it is a linear function of the coefficients w! Linear Models.

Example: Polynomial Curve Fitting. As in the least squares example, we can minimize the sum of the squares of the errors between the predictions for each data point x_n and the corresponding target values t_n. Loss function: sum-of-squared error function E(w) = (1/2) ∑_{n=1}^N (y(x_n, w) − t_n)^2. Similar to linear least squares, minimizing the sum-of-squared error function has a unique solution w*.
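A short sketch of polynomial curve fitting (my own illustrative example, using a noisy sinusoid as the ground truth): the polynomial design matrix makes the problem linear in w, so the same closed-form solution applies.

```python
import numpy as np

rng = np.random.default_rng(1)
N, M = 10, 3                                             # N data points, polynomial order M
x = np.sort(rng.uniform(0, 1, N))
t = np.sin(2 * np.pi * x) + 0.2 * rng.normal(size=N)     # noisy real-valued targets (illustrative)

# Design matrix with columns 1, x, x^2, ..., x^M: y(x, w) is linear in the coefficients w.
Phi = np.vander(x, M + 1, increasing=True)

# Minimizing the sum-of-squared error function has a unique solution w*.
w_star, *_ = np.linalg.lstsq(Phi, t, rcond=None)
print("fitted coefficients w*:", w_star)
```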

Probabilistic Perspective. So far we saw that polynomial curve fitting can be expressed in terms of error minimization. We now view it from a probabilistic perspective. Suppose that our model arose from a statistical model t = y(x, w) + ε, where ε is a random error having a Gaussian distribution with zero mean and is independent of x. Thus we have p(t | x, w, β) = N(t | y(x, w), β^{−1}), where β is a precision parameter, corresponding to the inverse variance. I will use probability distribution and probability density interchangeably; it should be obvious from the context.

Maximum Likelihood. If the data are assumed to be independently and identically distributed (i.i.d. assumption), the likelihood function takes the form p(t | x, w, β) = ∏_{n=1}^N N(t_n | y(x_n, w), β^{−1}). It is often convenient to maximize the log of the likelihood function: ln p(t | x, w, β) = −(β/2) ∑_{n=1}^N (y(x_n, w) − t_n)^2 + (N/2) ln β − (N/2) ln(2π). Maximizing the log-likelihood with respect to w (under the assumption of Gaussian noise) is equivalent to minimizing the sum-of-squared error function. Determine β by maximizing the log-likelihood w.r.t. β: this gives 1/β_ML = (1/N) ∑_{n=1}^N (y(x_n, w_ML) − t_n)^2.
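The two claims above are easy to check numerically. The sketch below (illustrative data and names of my own) fits w by least squares, which coincides with the Gaussian ML solution, and then estimates the noise precision from the mean squared residual.

```python
import numpy as np

rng = np.random.default_rng(2)
N = 200
x = rng.uniform(0, 1, N)
Phi = np.column_stack([np.ones(N), x])                    # simple linear model y(x, w) = w0 + w1*x
w_true, beta_true = np.array([0.5, -1.5]), 25.0
t = Phi @ w_true + rng.normal(scale=beta_true ** -0.5, size=N)

# Under Gaussian noise, the ML estimate of w is the least-squares solution.
w_ml, *_ = np.linalg.lstsq(Phi, t, rcond=None)

# ML estimate of the precision: 1/beta_ML = mean squared residual.
beta_ml = 1.0 / np.mean((Phi @ w_ml - t) ** 2)
print("w_ML =", w_ml, " beta_ML ~", round(beta_ml, 1), "(true beta =", beta_true, ")")
```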

Predictive Distribution. Once we have determined the parameters w_ML and β_ML, we can make predictions for new values of x: p(t | x, w_ML, β_ML) = N(t | y(x, w_ML), β_ML^{−1}). Later we will consider Bayesian linear regression.

Bernoulli Distribution. Consider a single binary random variable x ∈ {0, 1}. For example, x can describe the outcome of flipping a coin: heads = 1, tails = 0. The probability of x = 1 will be denoted by the parameter µ, so that p(x = 1 | µ) = µ. The probability distribution, known as the Bernoulli distribution, can be written as Bern(x | µ) = µ^x (1 − µ)^{1−x}.

Parameter Estimation. Suppose we observed a dataset D = {x_1, ..., x_N}. We can construct the likelihood function, which is a function of µ: p(D | µ) = ∏_{n=1}^N µ^{x_n} (1 − µ)^{1−x_n}. Equivalently, we can maximize the log of the likelihood function: ln p(D | µ) = ∑_{n=1}^N [x_n ln µ + (1 − x_n) ln(1 − µ)]. Note that the likelihood function depends on the N observations x_n only through the sum ∑_n x_n, the sufficient statistic.

Parameter Estimation. Suppose we observed a dataset D = {x_1, ..., x_N}. Setting the derivative of the log-likelihood function w.r.t. µ to zero, we obtain µ_ML = (1/N) ∑_{n=1}^N x_n = m/N, where m is the number of heads.
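A minimal sketch of this estimator on simulated coin flips (the true bias 0.7 is my own illustrative choice):

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.binomial(1, 0.7, size=1000)   # simulated coin flips, heads = 1, true mu = 0.7

m = x.sum()                           # sufficient statistic: number of heads
mu_ml = m / len(x)                    # mu_ML = m / N
print("mu_ML =", mu_ml)
```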

Binomial Distribution. We can also work out the distribution of the number m of observations of x = 1 (e.g. the number of heads). The probability of observing m heads given N coin flips and a parameter µ is given by Bin(m | N, µ) = (N choose m) µ^m (1 − µ)^{N−m}. The mean and variance can be easily derived as E[m] = Nµ and var[m] = Nµ(1 − µ).

Example. Histogram plot of the Binomial distribution as a function of m for N = 10 and µ = 0.25.
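The values behind such a plot, together with a check of the mean and variance formulas, can be computed with SciPy (a sketch assuming scipy is available; it is not part of the slides):

```python
import numpy as np
from scipy.stats import binom

N, mu = 10, 0.25
m = np.arange(N + 1)
pmf = binom.pmf(m, N, mu)             # probabilities plotted in the histogram

print("Bin(m | N=10, mu=0.25):", np.round(pmf, 3))
print("mean =", binom.mean(N, mu), " (N*mu =", N * mu, ")")
print("var  =", binom.var(N, mu), " (N*mu*(1-mu) =", N * mu * (1 - mu), ")")
```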

Beta Distribution. We can define a distribution over µ ∈ [0, 1] (e.g. it can be used as a prior over the parameter µ of the Bernoulli distribution): Beta(µ | a, b) = [Γ(a + b) / (Γ(a) Γ(b))] µ^{a−1} (1 − µ)^{b−1}, where the gamma function is defined as Γ(x) = ∫_0^∞ u^{x−1} e^{−u} du, and the ratio of gamma functions ensures that the Beta distribution is normalized.

Beta Distribution

Multinomial Variables. Consider a random variable that can take on one of K possible mutually exclusive states (e.g. the roll of a die). We will use the so-called 1-of-K encoding scheme. If a random variable can take on K = 6 states, and a particular observation of the variable corresponds to the state x_3 = 1, then x will be represented as x = (0, 0, 1, 0, 0, 0)^T. 1-of-K coding scheme: if we denote the probability of x_k = 1 by the parameter µ_k, then the distribution over x is defined as p(x | µ) = ∏_{k=1}^K µ_k^{x_k}.

Multinomial Variables. The multinomial distribution can be viewed as a generalization of the Bernoulli distribution to more than two outcomes. It is easy to see that the distribution is normalized: ∑_x p(x | µ) = ∑_{k=1}^K µ_k = 1, and E[x | µ] = µ.

Maximum Likelihood Estimation. Suppose we observed a dataset D = {x_1, ..., x_N}. We can construct the likelihood function, which is a function of µ: p(D | µ) = ∏_{n=1}^N ∏_{k=1}^K µ_k^{x_{nk}} = ∏_{k=1}^K µ_k^{m_k}. Note that the likelihood function depends on the N data points only through the following K quantities: m_k = ∑_{n=1}^N x_{nk}, which represent the number of observations of x_k = 1. These are called the sufficient statistics for this distribution.

Maximum Likelihood Estimation. To find a maximum likelihood solution for µ, we need to maximize the log-likelihood taking into account the constraint that ∑_k µ_k = 1. Forming the Lagrangian ∑_k m_k ln µ_k + λ(∑_k µ_k − 1) and setting its derivatives to zero gives µ_k^{ML} = m_k / N, which is the fraction of observations for which x_k = 1.
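A quick check of this constrained ML solution on simulated 1-of-K data (the die probabilities below are an illustrative choice of mine):

```python
import numpy as np

rng = np.random.default_rng(4)
K, N = 6, 10_000
mu_true = np.array([0.10, 0.10, 0.20, 0.20, 0.20, 0.20])

# N observations in 1-of-K encoding: each row has a single 1.
X = rng.multinomial(1, mu_true, size=N)

m = X.sum(axis=0)        # sufficient statistics m_k
mu_ml = m / N            # ML solution: fraction of observations with x_k = 1
print("mu_ML =", mu_ml)
```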

Multinomial Distribution. We can construct the joint distribution of the quantities {m_1, m_2, ..., m_K} given the parameters µ and the total number N of observations: Mult(m_1, ..., m_K | µ, N) = [N! / (m_1! m_2! ... m_K!)] ∏_{k=1}^K µ_k^{m_k}. The normalization coefficient is the number of ways of partitioning N objects into K groups of size m_1, m_2, ..., m_K. Note that ∑_{k=1}^K m_k = N.

Dirichlet Distribution. Consider a distribution over the µ_k, subject to the constraints 0 ≤ µ_k ≤ 1 and ∑_k µ_k = 1. The Dirichlet distribution is defined as Dir(µ | α) = [Γ(α_0) / (Γ(α_1) ... Γ(α_K))] ∏_{k=1}^K µ_k^{α_k − 1}, where α_1, ..., α_K are the parameters of the distribution, α_0 = ∑_k α_k, and Γ(x) is the gamma function. The Dirichlet distribution is confined to a simplex as a consequence of the constraints.

Dirichlet Distribution. Plots of the Dirichlet distribution over three variables.

Univariate Gaussian Distribution. In the case of a single variable x, the Gaussian distribution takes the form N(x | µ, σ^2) = (1 / √(2πσ^2)) exp(−(x − µ)^2 / (2σ^2)), which is governed by two parameters: µ (mean) and σ^2 (variance). The Gaussian distribution satisfies N(x | µ, σ^2) > 0 and ∫ N(x | µ, σ^2) dx = 1.

Multivariate Gaussian Distribution. For a D-dimensional vector x, the Gaussian distribution takes the form N(x | µ, Σ) = (1 / ((2π)^{D/2} |Σ|^{1/2})) exp(−(1/2)(x − µ)^T Σ^{−1} (x − µ)), which is governed by two parameters: µ is a D-dimensional mean vector, and Σ is a D by D covariance matrix; |Σ| denotes the determinant of Σ. Note that the covariance matrix is a symmetric positive definite matrix.

Central Limit Theorem. The distribution of the sum of N i.i.d. random variables becomes increasingly Gaussian as N grows. Consider N variables, each of which has a uniform distribution over the interval [0, 1]. Let us look at the distribution over the mean (x_1 + ... + x_N)/N: as N increases, the distribution tends towards a Gaussian distribution.
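A small simulation of this effect (an illustrative sketch; the sample sizes are my own choices): the mean of N uniform variables concentrates around 1/2 with standard deviation √(1/(12N)), matching the Gaussian it approaches.

```python
import numpy as np

rng = np.random.default_rng(5)
for N in (1, 2, 10):
    # Mean of N uniform [0, 1] variables, repeated many times.
    means = rng.uniform(0, 1, size=(100_000, N)).mean(axis=1)
    print(f"N={N:2d}: sample std = {means.std():.3f}, "
          f"Gaussian prediction sqrt(1/(12N)) = {np.sqrt(1 / (12 * N)):.3f}")
```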

Geometry of the Gaussian Distribution. For a D-dimensional vector x, the Gaussian distribution takes the form N(x | µ, Σ). Let us analyze the functional dependence of the Gaussian on x through the quadratic form Δ^2 = (x − µ)^T Σ^{−1} (x − µ). Here Δ is known as the Mahalanobis distance. The Gaussian distribution will be constant on surfaces in x-space for which Δ^2 is constant.

Geometry of the Gaussian Distribution. For a D-dimensional vector x, the Gaussian distribution takes the form N(x | µ, Σ). Consider the eigenvalue equation for the covariance matrix: Σ u_i = λ_i u_i. The covariance can be expressed in terms of its eigenvectors: Σ = ∑_{i=1}^D λ_i u_i u_i^T. The inverse of the covariance: Σ^{−1} = ∑_{i=1}^D (1/λ_i) u_i u_i^T.

Geometry of the Gaussian Distribution. For a D-dimensional vector x, the Gaussian distribution takes the form N(x | µ, Σ). Remember: Δ^2 = (x − µ)^T Σ^{−1} (x − µ). Hence, defining y_i = u_i^T (x − µ), we get Δ^2 = ∑_{i=1}^D y_i^2 / λ_i. We can interpret {y_i} as a new coordinate system, defined by the orthonormal vectors u_i, that is shifted and rotated.
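The eigendecomposition view is easy to verify numerically. The sketch below (with an illustrative 2-D covariance of my own) computes the Mahalanobis distance directly and in the rotated coordinates y_i; the two agree.

```python
import numpy as np

# Illustrative 2-D Gaussian parameters.
mu = np.array([1.0, -1.0])
Sigma = np.array([[2.0, 0.8],
                  [0.8, 1.0]])

# Eigenvalue equation: Sigma u_i = lambda_i u_i
lam, U = np.linalg.eigh(Sigma)

x = np.array([2.0, 0.5])
y = U.T @ (x - mu)                    # shifted and rotated coordinates y_i = u_i^T (x - mu)

delta2_direct = (x - mu) @ np.linalg.inv(Sigma) @ (x - mu)
delta2_eigen = np.sum(y**2 / lam)     # sum_i y_i^2 / lambda_i
print(delta2_direct, delta2_eigen)    # identical up to rounding
```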

Geometry of the Gaussian Distribution. Red curve: surface of constant probability density. The axes are defined by the eigenvectors u_i of the covariance matrix, with corresponding eigenvalues λ_i.

Moments of the Gaussian Distribution. The expectation of x under the Gaussian distribution: E[x] = µ. Substituting z = x − µ, the term in z in the factor (z + µ) vanishes by symmetry.

Moments of the Gaussian Distribution. The second-order moments of the Gaussian distribution: E[x x^T] = µ µ^T + Σ. The covariance is given by cov[x] = E[(x − E[x])(x − E[x])^T] = Σ. Because the parameter matrix Σ governs the covariance of x under the Gaussian distribution, it is called the covariance matrix.

Moments of the Gaussian Distribution. Contours of constant probability density: (a) covariance matrix of general form; (b) diagonal, axis-aligned covariance matrix; (c) spherical (proportional to the identity) covariance matrix.

Partitioned Gaussian Distribution. Consider a D-dimensional Gaussian distribution N(x | µ, Σ). Let us partition x into two disjoint subsets x_a and x_b, with corresponding partitions µ = (µ_a, µ_b) and covariance blocks Σ_aa, Σ_ab, Σ_ba, Σ_bb. In many situations, it will be more convenient to work with the precision matrix (the inverse of the covariance matrix) Λ = Σ^{−1}, with blocks Λ_aa, Λ_ab, Λ_ba, Λ_bb. Note that Λ_aa is not given by the inverse of Σ_aa.

Conditional Distribution. It turns out that the conditional distribution is also a Gaussian distribution: p(x_a | x_b) = N(x_a | µ_{a|b}, Λ_aa^{−1}), with µ_{a|b} = µ_a − Λ_aa^{−1} Λ_ab (x_b − µ_b). The covariance does not depend on x_b; the mean is a linear function of x_b.

Marginal Distribution. It turns out that the marginal distribution is also a Gaussian distribution: p(x_a) = N(x_a | µ_a, Σ_aa). For a marginal distribution, the mean and covariance are most simply expressed in terms of the partitioned covariance matrix.
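For a numerical sanity check, the sketch below (an illustrative 3-D example of my own) computes the marginal of x_a and the conditional p(x_a | x_b) using the equivalent covariance-form expressions µ_{a|b} = µ_a + Σ_ab Σ_bb^{−1}(x_b − µ_b) and Σ_{a|b} = Σ_aa − Σ_ab Σ_bb^{−1} Σ_ba, which match the precision-form result above.

```python
import numpy as np

# Illustrative partitioned 3-D Gaussian: x = (x_a, x_b) with dim(x_a) = 1, dim(x_b) = 2.
mu = np.array([0.0, 1.0, -1.0])
Sigma = np.array([[2.0, 0.6, 0.3],
                  [0.6, 1.5, 0.2],
                  [0.3, 0.2, 1.0]])
a, b = [0], [1, 2]
mu_a, mu_b = mu[a], mu[b]
S_aa, S_ab, S_bb = Sigma[np.ix_(a, a)], Sigma[np.ix_(a, b)], Sigma[np.ix_(b, b)]

# Marginal: p(x_a) = N(x_a | mu_a, Sigma_aa)
print("marginal mean/cov:", mu_a, S_aa)

# Conditional: p(x_a | x_b), covariance form.
x_b = np.array([0.5, -0.2])
mu_cond = mu_a + S_ab @ np.linalg.solve(S_bb, x_b - mu_b)
S_cond = S_aa - S_ab @ np.linalg.solve(S_bb, S_ab.T)
print("conditional mean/cov:", mu_cond, S_cond)
```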

Conditional and Marginal Distributions

Maximum Likelihood Estimation. Suppose we observed i.i.d. data X = {x_1, ..., x_N}. We can construct the log-likelihood function, which is a function of µ and Σ: ln p(X | µ, Σ) = −(ND/2) ln(2π) − (N/2) ln|Σ| − (1/2) ∑_{n=1}^N (x_n − µ)^T Σ^{−1} (x_n − µ). Note that the likelihood function depends on the N data points only through the following sums: ∑_n x_n and ∑_n x_n x_n^T (the sufficient statistics).

Maximum Likelihood Estimation. To find a maximum likelihood estimate of the mean, we set the derivative of the log-likelihood function to zero and solve to obtain µ_ML = (1/N) ∑_{n=1}^N x_n. Similarly, we can find the ML estimate of Σ: Σ_ML = (1/N) ∑_{n=1}^N (x_n − µ_ML)(x_n − µ_ML)^T.

Maximum Likelihood Estimation. Evaluating the expectation of the ML estimates under the true distribution, we obtain E[µ_ML] = µ (unbiased estimate) and E[Σ_ML] = ((N − 1)/N) Σ (biased estimate). Note that the maximum likelihood estimate of Σ is biased. We can correct the bias by defining a different estimator: Σ̃ = (1/(N − 1)) ∑_{n=1}^N (x_n − µ_ML)(x_n − µ_ML)^T.
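The bias is easy to see in simulation. The sketch below (illustrative parameters of my own) averages the two covariance estimators over many small samples; dividing by N undershoots the true covariance by roughly the factor (N − 1)/N, while dividing by N − 1 does not.

```python
import numpy as np

rng = np.random.default_rng(6)
D, N, trials = 2, 5, 20_000
Sigma_true = np.array([[1.0, 0.3],
                       [0.3, 0.5]])
L = np.linalg.cholesky(Sigma_true)

biased = np.zeros((D, D))
unbiased = np.zeros((D, D))
for _ in range(trials):
    X = rng.normal(size=(N, D)) @ L.T          # N samples from N(0, Sigma_true)
    Xc = X - X.mean(axis=0)                    # center with mu_ML
    biased += Xc.T @ Xc / N                    # Sigma_ML: divide by N
    unbiased += Xc.T @ Xc / (N - 1)            # corrected estimator: divide by N - 1

print("average Sigma_ML:\n", biased / trials)      # roughly (N-1)/N * Sigma_true
print("average corrected estimator:\n", unbiased / trials)
```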

Sequential Estimation. Sequential estimation allows data points to be processed one at a time and then discarded. Important for on-line applications. Let us consider the contribution of the N-th data point x_N: µ_ML^{(N)} = µ_ML^{(N−1)} + (1/N)(x_N − µ_ML^{(N−1)}), i.e. old estimate plus correction weight times the correction given x_N.
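A minimal sketch of this update rule (illustrative data of my own); the running estimate matches the batch mean.

```python
import numpy as np

rng = np.random.default_rng(7)
data = rng.normal(loc=3.0, scale=1.0, size=10_000)

mu = 0.0
for n, x_n in enumerate(data, start=1):
    mu += (x_n - mu) / n        # old estimate + (1/N) * (x_N - old estimate)

print("sequential estimate:", mu, " batch estimate:", data.mean())
```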

Student's t-Distribution. Consider Student's t-distribution St(x | µ, λ, ν), obtained by integrating out the precision of a Gaussian under a Gamma prior; it can be viewed as an infinite mixture of Gaussians. The parameter λ is sometimes called the precision parameter, and ν is the degrees of freedom.

Student's t-Distribution. Setting ν = 1 recovers the Cauchy distribution. The limit ν → ∞ corresponds to a Gaussian distribution.

Student's t-Distribution. Robustness to outliers: Gaussian vs. t-distribution.

Student's t-Distribution. The multivariate extension of the t-distribution: St(x | µ, Λ, ν), where Λ is a D × D precision matrix. Properties: E[x] = µ (for ν > 1), cov[x] = (ν/(ν − 2)) Λ^{−1} (for ν > 2), and mode[x] = µ.

Mixture of Gaussians. When modeling real-world data, the Gaussian assumption may not be appropriate. Consider the following example: the Old Faithful dataset, modeled with a single Gaussian vs. a mixture of two Gaussians.

Mixture of Gaussians. We can combine simple models into a complex model by defining a superposition of K Gaussian densities of the form p(x) = ∑_{k=1}^K π_k N(x | µ_k, Σ_k), where N(x | µ_k, Σ_k) is a component and π_k is a mixing coefficient (the figure shows K = 3). Note that each Gaussian component has its own mean µ_k and covariance Σ_k. The parameters π_k are called mixing coefficients. More generally, mixture models can comprise linear combinations of other distributions.

Mixture of Gaussians. Illustration of a mixture of 3 Gaussians in a 2-dimensional space: (a) contours of constant density of each of the mixture components, along with the mixing coefficients; (b) contours of the marginal probability density; (c) a surface plot of the distribution p(x).

Maximum Likelihood Estimation. Given a dataset D, we can determine the model parameters µ_k, Σ_k, π_k by maximizing the log-likelihood function ln p(D | π, µ, Σ) = ∑_{n=1}^N ln ∑_{k=1}^K π_k N(x_n | µ_k, Σ_k). Log of a sum: no closed-form solution. Solution: use standard iterative numeric optimization methods or the Expectation Maximization algorithm.
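In practice one rarely writes EM by hand; a library implementation suffices. A sketch assuming scikit-learn is available (the two-cluster synthetic data stands in for the Old Faithful example and is my own):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Synthetic 2-D data from two clusters (illustrative stand-in for Old Faithful).
rng = np.random.default_rng(8)
X = np.vstack([rng.normal([0.0, 0.0], 0.5, size=(200, 2)),
               rng.normal([3.0, 3.0], 0.7, size=(300, 2))])

# EM fit of a K = 2 Gaussian mixture (no closed-form ML solution exists).
gmm = GaussianMixture(n_components=2, random_state=0).fit(X)
print("mixing coefficients:", gmm.weights_)
print("means:\n", gmm.means_)
```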

The Exponential Family. The exponential family of distributions over x is defined to be the set of distributions of the form p(x | η) = h(x) g(η) exp(η^T u(x)), where η is the vector of natural parameters and u(x) is the vector of sufficient statistics. The function g(η) can be interpreted as the coefficient that ensures that the distribution p(x | η) is normalized: g(η) ∫ h(x) exp(η^T u(x)) dx = 1.

Bernoulli Distribution. The Bernoulli distribution is a member of the exponential family: p(x | µ) = Bern(x | µ) = µ^x (1 − µ)^{1−x} = exp(x ln µ + (1 − x) ln(1 − µ)). Comparing with the general form of the exponential family p(x | η) = h(x) g(η) exp(η^T u(x)), we see that η = ln(µ / (1 − µ)), and so µ = σ(η) = 1 / (1 + exp(−η)), the logistic sigmoid.

Bernoulli Distribution. The Bernoulli distribution is a member of the exponential family. It can therefore be written as p(x | η) = σ(−η) exp(η x), where u(x) = x, h(x) = 1, and g(η) = σ(−η).

Multinomial Distribution. The multinomial distribution is a member of the exponential family: p(x | µ) = ∏_{k=1}^M µ_k^{x_k} = exp(∑_k x_k ln µ_k) = exp(η^T x), where η_k = ln µ_k and u(x) = x. NOTE: the parameters η_k are not independent, since the corresponding µ_k must satisfy ∑_{k=1}^M µ_k = 1. In some cases it will be convenient to remove the constraint by expressing the distribution over M − 1 parameters.

Multinomial Distribution. The multinomial distribution is a member of the exponential family. Let η_k = ln(µ_k / µ_M) for k = 1, ..., M − 1. This leads to µ_k = exp(η_k) / (1 + ∑_{j=1}^{M−1} exp(η_j)), the softmax function. Here the parameters η_k are independent. Note that 0 ≤ µ_k ≤ 1 and ∑_k µ_k = 1.
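As a small illustration of these link functions (a sketch of my own; the softmax below is the standard K-parameter version rather than the (M − 1)-parameter form on the slide), mapping natural parameters back to valid probabilities:

```python
import numpy as np

def sigmoid(eta):
    """Logistic sigmoid: maps the Bernoulli natural parameter eta back to mu."""
    return 1.0 / (1.0 + np.exp(-eta))

def softmax(eta):
    """Softmax over a vector of natural parameters (shifted for numerical stability)."""
    z = np.exp(eta - np.max(eta))
    return z / z.sum()

mu = 0.3
eta = np.log(mu / (1 - mu))        # Bernoulli natural parameter
print(sigmoid(eta))                # recovers mu = 0.3

eta_vec = np.array([0.5, -1.0, 2.0])
p = softmax(eta_vec)
print(p, p.sum())                  # a valid probability vector
```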

Multinomial Distribution. The multinomial distribution is a member of the exponential family. It can therefore be written as p(x | η) = (1 + ∑_{k=1}^{M−1} exp(η_k))^{−1} exp(η^T x), where u(x) = x, h(x) = 1, and g(η) = (1 + ∑_{k=1}^{M−1} exp(η_k))^{−1}.

Gaussian Distribution. The Gaussian distribution can be written as a member of the exponential family: N(x | µ, σ^2) = h(x) g(η) exp(η^T u(x)), where η = (µ/σ^2, −1/(2σ^2))^T, u(x) = (x, x^2)^T, h(x) = (2π)^{−1/2}, and g(η) = (−2η_2)^{1/2} exp(η_1^2 / (4η_2)).

ML for the Exponential Family. Remember the exponential family: p(x | η) = h(x) g(η) exp(η^T u(x)). From the definition of the normalizer g(η): g(η) ∫ h(x) exp(η^T u(x)) dx = 1. We can take a derivative w.r.t. η: ∇g(η) ∫ h(x) exp(η^T u(x)) dx + g(η) ∫ h(x) exp(η^T u(x)) u(x) dx = 0. Thus −∇ ln g(η) = E[u(x)].

ML for the Exponential Family. Remember the exponential family: p(x | η) = h(x) g(η) exp(η^T u(x)). We can take a derivative w.r.t. η: −∇ ln g(η) = E[u(x)]. Note that the covariance of u(x) can be expressed in terms of the second derivatives of g(η), and similarly for the higher moments.

ML for the Exponential Family. Suppose we observed i.i.d. data X = {x_1, ..., x_N}. We can construct the log-likelihood function, which is a function of the natural parameter η: ln p(X | η) = ∑_n ln h(x_n) + N ln g(η) + η^T ∑_n u(x_n). Setting the gradient w.r.t. η to zero, we therefore have −∇ ln g(η_ML) = (1/N) ∑_{n=1}^N u(x_n), which involves the data only through the sufficient statistic ∑_n u(x_n).