Machine Learning Basics: Maximum Likelihood Estimation

Sargur N. Srihari
srihari@cedar.buffalo.edu
This is part of lecture slides on Deep Learning: http://www.cedar.buffalo.edu/~srihari/cse676

Topics
1. Learning Algorithms
2. Capacity, Overfitting and Underfitting
3. Hyperparameters and Validation Sets
4. Estimators, Bias and Variance
5. Maximum Likelihood Estimation
6. Bayesian Statistics
7. Supervised Learning Algorithms
8. Unsupervised Learning Algorithms
9. Stochastic Gradient Descent
10. Building a Machine Learning Algorithm
11. Challenges Motivating Deep Learning

Topics in Maximum Likelihood
0. The maximum likelihood principle
   - Maximizing likelihood is minimizing KL divergence
1. Conditional log-likelihood and MSE
   - Minimizing negative log-likelihood is equivalent to minimizing MSE in linear regression with Gaussian noise
2. Properties of maximum likelihood
   - No consistent estimator has lower MSE than the maximum likelihood estimator

How to obtain good estimators?
- We have seen definitions of common estimators (e.g., the sample mean) and their properties, such as bias and variance
- Where do these estimators come from?
- Rather than guessing some function and then determining its bias and variance, it is better to have a principle from which we can derive specific functions that are good estimators
- The most common such principle is the maximum likelihood principle

Maximum Likelihood Principle
- Consider a set of m examples X = {x^(1), x^(2), ..., x^(m)} drawn independently from the true but unknown data-generating distribution p_data(x)
- Let p_model(x; θ) be a parametric family of distributions over the same space, indexed by θ
  - i.e., p_model(x; θ) maps any configuration of x to a real number estimating the true probability p_data(x)
- The maximum likelihood estimator for θ is
  θ_ML = argmax_θ p_model(X; θ) = argmax_θ Π_{i=1}^m p_model(x^(i); θ)
- This product over many probabilities is inconvenient, e.g., it is prone to numerical underflow

Alternative form of max likelihood
- An equivalent optimization problem is obtained by taking the logarithm of the likelihood:
  θ_ML = argmax_θ Σ_{i=1}^m log p_model(x^(i); θ)
- Since dividing by m does not change the problem, this maximization can be written as an expectation with respect to the empirical distribution p̂_data defined by the training data:
  θ_ML = argmax_θ E_{x~p̂_data} [log p_model(x; θ)]
- One way to interpret maximum likelihood estimation is to view it as minimizing the dissimilarity between the empirical distribution p̂_data defined by the training set and the model distribution p_model(x), as seen next
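As a small illustration of why the log form is preferred, the following sketch (an addition, not part of the slides; it assumes NumPy and a synthetic Bernoulli dataset) grid-searches the Bernoulli parameter θ by maximum likelihood. The raw product of m per-example probabilities underflows to 0 for large m, while the sum of log-probabilities stays well behaved and is maximized near the sample mean.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.binomial(1, 0.3, size=2000)        # m = 2000 Bernoulli(0.3) samples
thetas = np.linspace(0.01, 0.99, 99)       # candidate parameter values

# Raw likelihood: product of m per-example probabilities -> underflows to 0
likelihood = np.array([np.prod(np.where(x == 1, t, 1 - t)) for t in thetas])
print(likelihood.max())                    # 0.0 : useless for choosing theta

# Log-likelihood: sum of logs, numerically stable, same argmax
log_likelihood = np.array([np.sum(np.log(np.where(x == 1, t, 1 - t)))
                           for t in thetas])
theta_ml = thetas[np.argmax(log_likelihood)]
print(theta_ml, x.mean())                  # both close to the true value 0.3
```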

Maximizing likelihood is minimizing KL divergence
- Maximum likelihood is equivalent to minimizing the KL divergence between the empirical distribution p̂_data and the model distribution p_model(x)
- The KL divergence is
  D_KL(p̂_data || p_model) = E_{x~p̂_data} [log p̂_data(x) - log p_model(x)]
- The first term is a function of the data-generating process only, not of the model
- Thus we only need to minimize -E_{x~p̂_data} [log p_model(x)], i.e., the cross entropy between the distribution of the training set and the probability distribution defined by the model
- The cross entropy between distributions p and q is defined as
  H(p, q) = E_p[-log q] = H(p) + D_KL(p || q)
- For discrete distributions, H(p, q) = -Σ_x p(x) log q(x)
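To make the "differ only by a constant" point concrete, this small sketch (an addition, assuming NumPy and two hand-picked discrete distributions) computes H(p), H(p, q) and D_KL(p || q) and checks that H(p, q) = H(p) + D_KL(p || q); since H(p) does not depend on q, minimizing cross entropy over the model q also minimizes the KL divergence.

```python
import numpy as np

p = np.array([0.5, 0.3, 0.2])      # "empirical" distribution (fixed)
q = np.array([0.4, 0.4, 0.2])      # model distribution (what we would tune)

entropy_p     = -np.sum(p * np.log(p))        # H(p), independent of q
cross_entropy = -np.sum(p * np.log(q))        # H(p, q)
kl_divergence =  np.sum(p * np.log(p / q))    # D_KL(p || q)

# H(p, q) = H(p) + D_KL(p || q): the two objectives differ by a constant in q,
# so they share the same minimizer.
assert np.isclose(cross_entropy, entropy_p + kl_divergence)
print(entropy_p, cross_entropy, kl_divergence)
```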

Summary of Max Likelihood
- It is an attempt to make the model distribution p_model match the empirical distribution p̂_data
  - Ideally we would like to match the unknown p_data
- The optimal θ is the same whether we maximize the likelihood or minimize the KL divergence
- In software both are phrased as minimization
  - Maximum likelihood becomes minimization of the negative log-likelihood (NLL), or equivalently minimization of cross entropy
- KL divergence has a minimal value of 0, whereas the negative log-likelihood can be negative, with no lower bound

Conditional Log-likelihood and MSE
- The maximum likelihood estimator can be readily generalized to the parameters of an input-output relationship
- The goal is prediction: estimating the conditional probability P(y | x; θ), which forms the basis of most supervised learning
- If X represents all our inputs and Y represents all our targets, then the conditional maximum likelihood estimator is
  θ_ML = argmax_θ P(Y | X; θ)
- If the examples are i.i.d., then
  θ_ML = argmax_θ Σ_{i=1}^m log P(y^(i) | x^(i); θ)
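As one concrete instance of a conditional log-likelihood (an illustrative addition, not from the slides), the sketch below writes the negative conditional log-likelihood of a logistic-regression model, P(y = 1 | x; w) = sigmoid(wᵀx), as a sum of per-example terms; this is the quantity a supervised learner would minimize over w.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def conditional_nll(w, X, y):
    """Negative conditional log-likelihood -sum_i log P(y_i | x_i; w)
    for a logistic-regression model P(y=1 | x; w) = sigmoid(w^T x)."""
    p = sigmoid(X @ w)                     # P(y=1 | x) for each example
    eps = 1e-12                            # guard against log(0)
    return -np.sum(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))

# Tiny synthetic dataset: 4 examples, 2 features
X = np.array([[1.0, 2.0], [2.0, 1.0], [-1.0, -2.0], [-2.0, -1.0]])
y = np.array([1, 1, 0, 0])
print(conditional_nll(np.array([0.5, 0.5]), X, y))   # NLL at one candidate w
```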

Linear Regression as Maximum Likelihood
- Basic linear regression takes an input x and produces an output ŷ; the mapping from x to ŷ is chosen to minimize MSE
- Revisit linear regression as maximum likelihood: think of the model as producing a conditional distribution p(y | x)
- To derive the same algorithm, define
  p(y | x) = N(y; ŷ(x, w), σ²)
  where the function ŷ(x, w) predicts the mean of the Gaussian
- Since the samples are i.i.d., the conditional log-likelihood is
  Σ_{i=1}^m log p(y^(i) | x^(i); θ) = -m log σ - (m/2) log(2π) - Σ_{i=1}^m (ŷ^(i) - y^(i))² / (2σ²)
- Only the last term depends on w, and it is proportional to the training MSE; thus maximizing the log-likelihood is the same as minimizing MSE
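The equivalence can be checked numerically. The sketch below (an addition using a synthetic dataset; it assumes only NumPy and a known noise level σ) minimizes the Gaussian negative log-likelihood over w by gradient descent and compares the result to the closed-form least-squares solution; the weights agree because the σ-dependent terms do not affect the argmax over w.

```python
import numpy as np

rng = np.random.default_rng(0)
m = 200
X = np.column_stack([np.ones(m), rng.normal(size=m)])   # bias column + one feature
w_true = np.array([1.0, -2.0])
sigma = 0.5                                              # noise std, assumed known
y = X @ w_true + rng.normal(scale=sigma, size=m)

def neg_log_likelihood(w):
    """Gaussian NLL: m*log(sigma) + (m/2)*log(2*pi) + sum (y - Xw)^2 / (2*sigma^2)."""
    resid = y - X @ w
    return (m * np.log(sigma) + 0.5 * m * np.log(2 * np.pi)
            + np.sum(resid ** 2) / (2 * sigma ** 2))

# Gradient descent on the NLL; its gradient wrt w is -X^T (y - Xw) / sigma^2
w = np.zeros(2)
lr = 1e-3
for _ in range(5000):
    w -= lr * (-(X.T @ (y - X @ w)) / sigma ** 2)

# Closed-form least-squares solution (the MSE minimizer)
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
print(neg_log_likelihood(np.zeros(2)), neg_log_likelihood(w))  # NLL decreases
print(w, w_lstsq)                                              # nearly identical weights
```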

Properties of Maximum Likelihood
- Main appeal of the maximum likelihood estimator: it is the best estimator asymptotically, in terms of its rate of convergence as m → ∞
- Under some conditions, it has the consistency property: as m → ∞, it converges to the true parameter value
- Conditions for consistency:
  - p_data must lie within the model family p_model(·; θ)
  - p_data must correspond to exactly one value of θ
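A quick empirical look at consistency (an illustrative addition; it assumes a Gaussian model, for which the MLE of the mean is the sample mean): as the number of samples m grows, the estimate converges to the true parameter.

```python
import numpy as np

rng = np.random.default_rng(0)
mu_true = 2.0

# MLE of a Gaussian mean is the sample mean; watch it converge as m grows
for m in [10, 100, 1_000, 10_000, 100_000]:
    x = rng.normal(loc=mu_true, scale=1.0, size=m)
    mu_ml = x.mean()
    print(m, mu_ml, abs(mu_ml - mu_true))   # error shrinks roughly as 1/sqrt(m)
```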

Statistical Efficiency
- Several estimators, based on inductive principles other than maximum likelihood, can also be consistent
- Statistical efficiency: a more efficient estimator has lower generalization error for a fixed number of samples, or equivalently needs fewer examples to obtain a fixed generalization error
- This requires measuring closeness to the true parameter: the MSE between the estimated and true parameter values
  - This parameter MSE decreases as m increases
- Using the Cramér-Rao bound, it can be shown that no consistent estimator has lower MSE than the maximum likelihood estimator
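To illustrate statistical efficiency (an addition; the mean-vs-median comparison is a standard textbook example, not from these slides), the sketch below estimates a Gaussian mean with two consistent estimators, the sample mean (the MLE) and the sample median, and compares their parameter MSE over many trials at a fixed m; the MLE attains the lower MSE.

```python
import numpy as np

rng = np.random.default_rng(0)
mu_true, m, trials = 0.0, 100, 20_000

samples = rng.normal(loc=mu_true, scale=1.0, size=(trials, m))
mean_est   = samples.mean(axis=1)            # MLE of the Gaussian mean
median_est = np.median(samples, axis=1)      # a different consistent estimator

mse_mean   = np.mean((mean_est   - mu_true) ** 2)   # roughly 1/m
mse_median = np.mean((median_est - mu_true) ** 2)   # roughly (pi/2)/m, larger
print(mse_mean, mse_median)
```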