
Machine Learning - HT 2016
3. Maximum Likelihood

Varun Kanade
University of Oxford
January 27, 2016

Outline

Probabilistic framework
- Formulate linear regression in the language of probability
- Introduce the maximum likelihood estimate
- Relation to the least squares estimate

Basics of probability
- Univariate and multivariate normal distributions
- Laplace distribution

Likelihood, entropy and its relation to learning

Univariate Gaussian (Normal) Distribution

The univariate normal distribution is defined by the following density function:

p(x) = \frac{1}{\sqrt{2\pi}\,\sigma} e^{-\frac{(x-\mu)^2}{2\sigma^2}}, \qquad X \sim N(\mu, \sigma^2)

Here µ is the mean and σ² is the variance.
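As a quick numerical check, here is a minimal NumPy sketch of this density; the function name gaussian_pdf and the test points are illustrative, not from the slides.

```python
import numpy as np

def gaussian_pdf(x, mu=0.0, sigma=1.0):
    """Density of N(mu, sigma^2) evaluated at x."""
    return np.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (np.sqrt(2 * np.pi) * sigma)

# Density of N(1, 4) at a few points; it peaks at x = mu = 1.
print(gaussian_pdf(np.array([-1.0, 1.0, 3.0]), mu=1.0, sigma=2.0))
```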

Sampling from a Gaussian Distribution

To sample from X ~ N(µ, σ²): set Y = \frac{X - \mu}{\sigma} and sample from Y ~ N(0, 1).

Cumulative distribution function:

\Phi(x) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{x} e^{-t^2/2}\, dt
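A small sketch of sampling by shifting and scaling a standard normal, assuming NumPy's default_rng for the draws and SciPy's norm.cdf for Φ (the particular µ, σ are made up):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
mu, sigma = 2.0, 3.0

# Draw Y ~ N(0, 1), then shift and scale: X = mu + sigma * Y ~ N(mu, sigma^2).
y = rng.standard_normal(100_000)
x = mu + sigma * y
print(x.mean(), x.std())   # close to mu = 2 and sigma = 3

# The CDF Phi has no elementary closed form; scipy exposes it as norm.cdf.
print(norm.cdf(0.0))       # Phi(0) = 0.5
```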

Covariance and Correlation

For random variables X and Y, the covariance measures how the random variables vary jointly:

cov(X, Y) = E[(X - E[X])(Y - E[Y])]

Covariance depends on the scale of the random variables. The (Pearson) correlation coefficient normalizes the covariance to give a value between -1 and +1:

corr(X, Y) = \frac{cov(X, Y)}{\sigma_X \sigma_Y}, \quad \text{where } \sigma_X^2 = E[(X - E[X])^2] \text{ and } \sigma_Y^2 = E[(Y - E[Y])^2].
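A short NumPy sketch estimating covariance and correlation from samples; the toy data generation below is my own, chosen so that the true correlation is 0.8.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.standard_normal(10_000)
y = 0.8 * x + 0.6 * rng.standard_normal(10_000)   # y is positively correlated with x

cov_xy = np.mean((x - x.mean()) * (y - y.mean()))
corr_xy = cov_xy / (x.std() * y.std())

print(cov_xy, corr_xy)                            # corr_xy is close to 0.8
print(np.corrcoef(x, y)[0, 1])                    # NumPy's built-in estimate agrees
```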

Multivariate Gaussian Distribution

Suppose x is an n-dimensional random vector. The covariance matrix consists of all pairwise covariances:

cov(x) = E\left[(x - E[x])(x - E[x])^T\right] =
\begin{bmatrix}
var(X_1) & cov(X_1, X_2) & \cdots & cov(X_1, X_n) \\
cov(X_2, X_1) & var(X_2) & \cdots & cov(X_2, X_n) \\
\vdots & \vdots & \ddots & \vdots \\
cov(X_n, X_1) & cov(X_n, X_2) & \cdots & var(X_n)
\end{bmatrix}

If µ = E[x] and Σ = cov[x], the multivariate normal is defined by the density

N(x; \mu, \Sigma) = \frac{1}{(2\pi)^{n/2} |\Sigma|^{1/2}} \exp\left(-\frac{1}{2} (x - \mu)^T \Sigma^{-1} (x - \mu)\right)
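A minimal sketch evaluating this density with NumPy; the helper name mvn_pdf and the example µ and Σ are illustrative only.

```python
import numpy as np

def mvn_pdf(x, mu, Sigma):
    """Density of N(mu, Sigma) at x, where x and mu are length-n vectors."""
    n = len(mu)
    diff = x - mu
    quad = diff @ np.linalg.solve(Sigma, diff)    # (x - mu)^T Sigma^{-1} (x - mu)
    norm_const = (2 * np.pi) ** (n / 2) * np.sqrt(np.linalg.det(Sigma))
    return np.exp(-0.5 * quad) / norm_const

mu = np.array([0.0, 0.0])
Sigma = np.array([[1.0, 0.5],
                  [0.5, 2.0]])
print(mvn_pdf(np.array([0.5, -0.5]), mu, Sigma))
```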

Bivariate Gaussian Distribution

Suppose X_1 ~ N(µ_1, σ_1²) and X_2 ~ N(µ_2, σ_2²). What is the joint probability distribution p(x_1, x_2)?

Maximum Likelihood Estimation

Suppose you are given three independent samples: x_1 = 1, x_2 = 2.7, x_3 = 3. You know that the data were generated from N(0, 1) or N(4, 1). Let θ represent the parameters of the distribution. Then the probability of observing the data with parameter θ is called the likelihood:

p(x_1, x_2, x_3 | \theta) = p(x_1 | \theta)\, p(x_2 | \theta)\, p(x_3 | \theta)

We have to choose between θ = 0 and θ = 4. Which one?

Maximum Likelihood Estimation (MLE): Pick θ that maximizes the likelihood.
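A small numeric check of this example, using SciPy's norm.pdf to evaluate the likelihood under each candidate mean:

```python
import numpy as np
from scipy.stats import norm

x = np.array([1.0, 2.7, 3.0])                     # the three independent samples

# Likelihood of the data under N(theta, 1) for each candidate mean.
for theta in (0.0, 4.0):
    likelihood = np.prod(norm.pdf(x, loc=theta, scale=1.0))
    print(theta, likelihood)

# theta = 4 gives the larger likelihood, so among these two candidates the MLE picks theta = 4.
```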

Linear Regression

Recall our linear regression model: y = x^T w + noise.

Model y (conditioned on x) as a random variable. Given x and the model parameter w:

E[y | x, w] = x^T w

We can be more specific in choosing our model for y. Let us assume that given x and w, y is Gaussian with mean x^T w and variance σ²:

y \sim N(x^T w, \sigma^2) = x^T w + N(0, \sigma^2)

Likelihood of Linear Regression

Suppose we observe data (x_i, y_i)_{i=1}^m. What is the likelihood of observing the data for model parameters w, σ?

Likelihood of Linear Regression

Suppose we observe data (x_i, y_i)_{i=1}^m. What is the likelihood of observing the data for model parameters w, σ?

p(y_1, \ldots, y_m \mid x_1, \ldots, x_m, w, \sigma) = \prod_{i=1}^m p(y_i \mid x_i, w, \sigma)

Recall that y_i \sim x_i^T w + N(0, \sigma^2). So

p(y_1, \ldots, y_m \mid x_1, \ldots, x_m, w, \sigma) = \prod_{i=1}^m \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{(y_i - x_i^T w)^2}{2\sigma^2}} = \left(\frac{1}{2\pi\sigma^2}\right)^{m/2} e^{-\frac{1}{2\sigma^2} \sum_{i=1}^m (y_i - x_i^T w)^2}

We want to find parameters w and σ that maximize the likelihood.

Likelihood of Linear Regression

It is simpler to look at the log-likelihood. Taking logs,

LL(y_1, \ldots, y_m \mid x_1, \ldots, x_m, w, \sigma) = -\frac{m}{2} \log(2\pi\sigma^2) - \frac{1}{2\sigma^2} \sum_{i=1}^m (x_i^T w - y_i)^2

In matrix notation,

LL(y \mid X, w, \sigma) = -\frac{m}{2} \log(2\pi\sigma^2) - \frac{1}{2\sigma^2} (Xw - y)^T (Xw - y)

How do we find the w that maximizes the likelihood?
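A direct transcription of the matrix form of the log-likelihood into NumPy; the function and variable names are my own.

```python
import numpy as np

def log_likelihood(y, X, w, sigma):
    """Gaussian log-likelihood of linear regression in matrix form:
    LL = -(m/2) log(2 pi sigma^2) - (Xw - y)^T (Xw - y) / (2 sigma^2)."""
    m = len(y)
    resid = X @ w - y
    return -0.5 * m * np.log(2 * np.pi * sigma ** 2) - (resid @ resid) / (2 * sigma ** 2)
```

Maximizing this over w for a fixed σ only involves the residual term, which is what the next slide connects to least squares.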

Maximum Likelihood and Least Squares

Let us in fact look at the negative log-likelihood (which is more like a loss):

NLL(y \mid X, w, \sigma) = \frac{1}{2\sigma^2} (Xw - y)^T (Xw - y) + \frac{m}{2} \log(2\pi\sigma^2)

And recall the squared loss objective:

L(w) = (Xw - y)^T (Xw - y)

Minimizing the NLL over w is the same as minimizing the squared loss, so w_ML is the least squares solution. We can also find the MLE for σ. As an exercise, show that the MLE of σ is

\sigma_{ML}^2 = \frac{1}{m} (X w_{ML} - y)^T (X w_{ML} - y)
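A minimal sketch on made-up data of the point that w_ML is the least squares solution and σ²_ML is the mean squared residual:

```python
import numpy as np

rng = np.random.default_rng(2)
m, n = 200, 3
X = rng.standard_normal((m, n))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + 0.3 * rng.standard_normal(m)     # Gaussian noise with sigma = 0.3

# w_ML is the least squares solution; sigma^2_ML is the mean squared residual.
w_ml, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = X @ w_ml - y
sigma2_ml = (resid @ resid) / m

print(w_ml)                                       # close to w_true
print(sigma2_ml)                                  # close to 0.3 ** 2 = 0.09
```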

Making Predictions

Given training data D = (x_i, y_i)_{i=1}^m, we can use the MLE estimators to make predictions on new points and also give confidence intervals:

y_{new} \mid x_{new}, D \sim N(x_{new}^T w_{ML},\ \sigma_{ML}^2)

Outliers and Laplace Distribution

With outliers, least squares (and hence MLE with a Gaussian model) can be quite bad. Instead, we can model the noise (or uncertainty) in y as a Laplace distribution:

p(y \mid x, w, b) = \frac{1}{2b} \exp\left(-\frac{|y - x^T w|}{b}\right)

[Figure: plot of the Laplace density.]
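Under Laplace noise the negative log-likelihood is, up to constants, the sum of absolute residuals, so the MLE over w is a least-absolute-deviations fit. A rough sketch comparing it with least squares on toy data with outliers; the toy data, the Nelder-Mead optimizer, and the warm start are my own choices, not from the slides.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(3)
m = 100
X = np.column_stack([np.ones(m), rng.uniform(-3, 3, size=m)])
y = X @ np.array([1.0, 2.0]) + rng.laplace(scale=0.5, size=m)
y[:5] += 20.0                                     # a few gross outliers

w_ls, *_ = np.linalg.lstsq(X, y, rcond=None)      # Gaussian MLE (least squares)

# Laplace MLE: minimize the sum of absolute residuals (least absolute deviations),
# here with a derivative-free optimizer warm-started at the least squares fit.
w_lad = minimize(lambda w: np.abs(X @ w - y).sum(), x0=w_ls, method="Nelder-Mead").x

print("least squares (Gaussian MLE):", w_ls)      # pulled toward the outliers
print("LAD (Laplace MLE)           :", w_lad)     # roughly [1, 2]
```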

Lookahead: Binary Classification

A Bernoulli random variable X takes values in {0, 1}. We parametrize it using θ ∈ [0, 1]:

p(1 \mid \theta) = \theta, \qquad p(0 \mid \theta) = 1 - \theta

More succinctly, we can write

p(x \mid \theta) = \theta^x (1 - \theta)^{1 - x}

For classification, we will design models with parameter w that, given input x, produce a value f(x; w) ∈ [0, 1]. Then we can model the (binary) class labels as:

y \sim Bernoulli(f(x; w))
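A toy sketch of this setup; the logistic sigmoid used for f(x; w) below is just one convenient map into [0, 1], not something this slide specifies.

```python
import numpy as np

def bernoulli_pmf(x, theta):
    """p(x | theta) = theta^x * (1 - theta)^(1 - x) for x in {0, 1}."""
    return theta ** x * (1 - theta) ** (1 - x)

def f(x, w):
    """One possible choice of a map into [0, 1]: the logistic sigmoid of x^T w."""
    return 1.0 / (1.0 + np.exp(-x @ w))

rng = np.random.default_rng(4)
x, w = np.array([1.0, -0.5]), np.array([2.0, 1.0])
p1 = f(x, w)                       # probability that the label is 1
y = rng.binomial(1, p1)            # y ~ Bernoulli(f(x; w))
print(p1, y, bernoulli_pmf(y, p1))
```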

Entropy

In information theory, entropy H is a measure of the uncertainty associated with a random variable:

H(X) = -\sum_x p(x) \log(p(x))

In the case of Bernoulli variables (with parameter θ) we get:

H(X) = -\theta \log(\theta) - (1 - \theta) \log(1 - \theta)

[Figure: entropy of a Bernoulli(θ) random variable as a function of θ.]
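A small sketch of the Bernoulli entropy formula; the clipping to avoid log(0) is an implementation detail I added.

```python
import numpy as np

def bernoulli_entropy(theta):
    """H = -theta log(theta) - (1 - theta) log(1 - theta), in nats."""
    theta = np.clip(theta, 1e-12, 1 - 1e-12)      # avoid log(0) at theta = 0 or 1
    return -theta * np.log(theta) - (1 - theta) * np.log(1 - theta)

print(bernoulli_entropy(0.5))    # maximal uncertainty: log 2 ~ 0.693
print(bernoulli_entropy(0.01))   # nearly deterministic: close to 0
```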

Maximum Likelihood and KL-Divergence

Suppose we get data x_1, ..., x_m from some unknown distribution q. We attempt to find parameters θ of a family of distributions that best explain the data:

\hat{\theta} = \mathrm{argmax}_\theta \prod_{i=1}^m p(x_i \mid \theta)
= \mathrm{argmax}_\theta \sum_{i=1}^m \log p(x_i \mid \theta)
= \mathrm{argmax}_\theta \frac{1}{m} \sum_{i=1}^m \log p(x_i \mid \theta) - \frac{1}{m} \sum_{i=1}^m \log q(x_i)   (the second term does not depend on θ)
= \mathrm{argmin}_\theta \frac{1}{m} \sum_{i=1}^m \log\left(\frac{q(x_i)}{p(x_i \mid \theta)}\right)
\approx \mathrm{argmin}_\theta \int q(x) \log\left(\frac{q(x)}{p(x \mid \theta)}\right) dx = \mathrm{argmin}_\theta\, KL(q \,\|\, p)

Kullback-Leibler Divergence

The KL-divergence is like a distance between distributions:

KL(q \,\|\, p) = \int q(x) \log\left(\frac{q(x)}{p(x)}\right) dx

(with the integral replaced by a sum over x_i for discrete distributions)

KL(q \,\|\, q) = 0
KL(q \,\|\, p) \geq 0 for all distributions p
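A minimal sketch of the discrete analogue, KL(q‖p) = Σ_i q(x_i) log(q(x_i)/p(x_i)), for probability vectors; the example distributions are made up.

```python
import numpy as np

def kl_divergence(q, p):
    """KL(q || p) = sum_i q_i * log(q_i / p_i) for discrete probability vectors."""
    q, p = np.asarray(q, dtype=float), np.asarray(p, dtype=float)
    mask = q > 0                                  # terms with q_i = 0 contribute 0
    return np.sum(q[mask] * np.log(q[mask] / p[mask]))

q = np.array([0.5, 0.3, 0.2])
print(kl_divergence(q, q))                        # 0
print(kl_divergence(q, [0.2, 0.3, 0.5]))          # > 0
print(kl_divergence([0.2, 0.3, 0.5], q))          # > 0 and different: KL is not symmetric
```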