COMS 4771, Lecture 1: 1. Course overview. 2. Maximum likelihood estimation (review of some statistics).


Administrivia

This course (http://www.satyenkale.com/coms4771/)

Topics
1. Supervised learning: core issues of statistical machine learning; algorithmic, statistical, and analytical tools.
2. Some topics in unsupervised learning: common statistical models; frameworks for developing new models and algorithms.

Coursework
1. Four homework assignments (theory & programming): 24%
2. Kaggle-style prediction competition: 26%
3. Two in-class exams (3/3, 4/28): 25% each
4. No late assignments accepted, no make-up exams

Prerequisites

Mathematical prerequisites: basic algorithms and data structures; linear algebra (e.g., vector spaces, orthogonality, spectral decomposition); multivariate calculus (e.g., limits, Taylor expansion, gradients); probability/statistics (e.g., random variables, expectation, LLN, MLE).

Computational prerequisites: you should have regular access to, and be able to program in, MATLAB. MATLAB is available for download for SEAS students: http://portal.seas.columbia.edu/matlab/

Resources

Course staff (http://www.satyenkale.com/coms4771/)
Instructor: Satyen Kale
Teaching assistants: Gaurang Sadekar, Patanjali Vakhulabharanam, Robert Ying, Xiyue Yu
Office hours, online forum (Piazza): see course website. Office hour attendance is highly recommended.

Materials
Lecture slides: posted on the course website.
Textbooks: readings from The Elements of Statistical Learning [ESL], A Course in Machine Learning [CML], and Understanding Machine Learning [UML] (available free online; see course website).

Machine Learning overview

Algorithmic problems

Minimum spanning tree. Input: a graph G. Output: a minimum spanning tree of G. The input/output relationship is well specified for all inputs.

Bird species recognition. Input: an image of a bird. Output: the species name of the bird (e.g., indigo bunting). The input/output relationship is difficult to specify.

Machine learning: use examples of input/output pairs to learn the mapping.

Machine learning in context

Perspective of intelligent systems. Goal: a robust system with intelligent behavior. Often a hard-coded solution is too complex, not robust, or sub-optimal. How do we learn from past experiences to perform well in the future?

Perspective of algorithmic statistics. Goal: statistical analysis of large, complex data sets. Past: 100 data points of two variables, with data collection and statistical analysis done by hand/eye. Now: several million data points and variables, collected by high-throughput automatic processes. How can we automate statistical analysis for modern applications?

Supervised learning

Abstract problem
Data: labeled examples $(x_1, y_1), (x_2, y_2), \dots, (x_n, y_n) \in \mathcal{X} \times \mathcal{Y}$ from some population, where $\mathcal{X}$ is the input (feature) space and $\mathcal{Y}$ is the output (label, response) space.
Underlying assumption: there is a relatively simple function $f \colon \mathcal{X} \to \mathcal{Y}$ such that $f(x) \approx y$ for most $(x, y)$ in the population.
Learning task: using the labeled examples, construct $\hat{f}$ such that $\hat{f} \approx f$.
At test time: use $\hat{f}$ to predict $y$ for new (and previously unseen) $x$'s.
[Diagram: past labeled examples feed a learning algorithm, which outputs a learned predictor; the learned predictor maps a new (unlabeled) example to a predicted label.]
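To make the interface concrete, here is a minimal sketch (not from the slides, and in Python rather than the course's MATLAB) using a simple 1-nearest-neighbor rule as a stand-in learning algorithm; the function names and toy data are ours, chosen only to illustrate the learn-from-labeled-pairs, predict-on-new-inputs shape of the problem:

```python
# A minimal sketch of the supervised-learning interface (not from the slides):
# fit a predictor f_hat from labeled examples, then predict labels for new inputs.
# A 1-nearest-neighbor rule stands in for the learning algorithm; any other
# learner exposes the same "learn from (x_i, y_i) pairs, predict on new x" shape.
import numpy as np

def fit_1nn(X_train, y_train):
    """Return a predictor f_hat built from labeled examples."""
    X_train = np.asarray(X_train, dtype=float)
    y_train = np.asarray(y_train)

    def f_hat(x_new):
        # Predict the label of the closest training example.
        dists = np.linalg.norm(X_train - np.asarray(x_new, dtype=float), axis=1)
        return y_train[np.argmin(dists)]

    return f_hat

# Toy usage: two features, binary labels.
X = [[0.0, 0.0], [0.1, 0.2], [1.0, 1.1], [0.9, 1.0]]
y = [0, 0, 1, 1]
f_hat = fit_1nn(X, y)
print(f_hat([0.95, 1.05]))  # expected: 1
```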

Supervised learning: examples

Spam filtering: $\mathcal{X}$ = e-mail messages; $\mathcal{Y}$ = {spam, not spam}.
Optical character recognition: $\mathcal{X}$ = 32×32 pixel images; $\mathcal{Y}$ = {A, B, ..., Z, a, b, ..., z, 0, 1, ..., 9}.
Online dating: $\mathcal{X}$ = pairs of user profiles; $\mathcal{Y}$ = [0, 1].
Machine translation: $\mathcal{X}$ = sequences of English words; $\mathcal{Y}$ = sequences of French words.

Unsupervised learning

Abstract problem
Data: (unlabeled) examples $x_1, x_2, \dots, x_n \in \mathcal{X}$ from some population.
Underlying assumption: there is some interesting structure in the population to be discovered.
Learning task: using the unlabeled examples, find the interesting structure.
Uses: visualization/interpretation, pre-processing data for downstream learning, ...

Examples
Discover sub-communities of individuals in a social network.
Explain the variability of market price movements using a few latent factors.
Learn a useful representation of the data that improves supervised learning.

What else is there in machine learning?

Advanced issues: structured output spaces, distributed learning, incomplete data, causal inference, privacy, ...
Other models of learning: semi-supervised learning, active learning, online learning, reinforcement learning, ...
Major application areas: natural language processing, speech recognition, computer vision, computational biology, information retrieval, ...
Modes of study: mathematical analysis, cross-domain evaluations, end-to-end application study, ...

Maximum likelihood estimation

Example: binary prediction

On a sequence of 10 days, observe the weather: 1, 0, 1, 1, 0, 0, 1, 0, 1, 0 (1 = rain, 0 = no rain).
Goal: understand the distribution of the weather.
Modeling assumption: it rains on each day with a fixed probability $p \in [0, 1]$.
Bernoulli distribution with parameter $p$: $X \sim \mathrm{Bern}(p)$ means $X$ is a $\{0, 1\}$-valued random variable with $\Pr[X = 1] = p$.
How do we estimate the probability of rain? Which Bernoulli distribution models the data best? (a) Bern(0.1), (b) Bern(0.5), (c) Bern(0.8), (d) Bern(1).
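As a quick numerical check (a Python sketch, not part of the slides), one can evaluate the likelihood of the observed sequence under each of the four candidate distributions; the helper name bernoulli_likelihood is ours:

```python
# Answer the multiple-choice question numerically: compute the probability of
# the observed 10-day sequence under each candidate Bernoulli parameter.
data = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]

def bernoulli_likelihood(p, xs):
    """Probability of the sequence xs under i.i.d. Bern(p)."""
    like = 1.0
    for x in xs:
        like *= p if x == 1 else (1.0 - p)
    return like

for p in [0.1, 0.5, 0.8, 1.0]:
    print(f"p = {p}: likelihood = {bernoulli_likelihood(p, data):.6g}")
# Bern(0.5) gives the largest likelihood (about 9.8e-4); Bern(1) gives 0,
# because the sequence contains days with no rain.
```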

Maximum likelihood estimation for Bernoulli

Data: 1, 0, 1, 1, 0, 0, 1, 0, 1, 0.
If the data come from $\mathrm{Bern}(p)$, the likelihood of the data is $p^5 (1-p)^5$.
The maximum likelihood estimate for $p$ is then $\arg\max_{p \in [0,1]} p^5 (1-p)^5$.
To compute the maximum, take the derivative and set it to 0: the derivative is $5p^4(1-p)^5 - 5p^5(1-p)^4 = 5p^4(1-p)^4(1-2p)$, which vanishes on $(0,1)$ only at $p = 0.5$, so the maximum is at $p = 0.5$.
Generalize: for data $x_1, x_2, \dots, x_n \in \{0, 1\}$, the MLE of the Bernoulli parameter $p$ is
$$p_{\mathrm{ML}} := \frac{1}{n} \sum_{i=1}^n x_i.$$
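A minimal sketch (not from the slides, in Python rather than the course's MATLAB) of the closed-form Bernoulli MLE, cross-checked against a brute-force grid search over the likelihood:

```python
# The Bernoulli MLE is the sample mean of the 0/1 observations.
import numpy as np

data = np.array([1, 0, 1, 1, 0, 0, 1, 0, 1, 0])
p_ml = data.mean()          # (1/n) * sum of x_i
print(p_ml)                 # 0.5

# Cross-check against a grid search over the likelihood p^k (1-p)^(n-k),
# where k is the number of rainy days.
grid = np.linspace(0.0, 1.0, 10001)
k, n = data.sum(), data.size
likelihood = grid**k * (1.0 - grid)**(n - k)
print(grid[np.argmax(likelihood)])  # also 0.5 (up to grid resolution)
```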

General setup

Data: $x_1, x_2, \dots, x_n \in \mathcal{X}$.
A model $\mathcal{P} = \{P_\theta : \theta \in \Theta\}$ is a set of probability distributions over $\mathcal{X}$ indexed by a parameter space $\Theta$.
Example: the Bernoulli model. The parameter space is $\Theta = [0, 1]$, and for any $\theta \in \Theta$, $P_\theta = \mathrm{Bern}(\theta)$.
We also often deal with models where each $P_\theta$ has a density function $p(\,\cdot\,; \theta) \colon \mathcal{X} \to \mathbb{R}_+$.
Example: the Gaussian density in one dimension ($\mathcal{X} = \mathbb{R}$), with $\Theta = \mathbb{R} \times \mathbb{R}_{++}$ and parameter $\theta = (\mu, \sigma)$:
$$p(x; \mu, \sigma) := \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$$
[Figure: plot of the one-dimensional Gaussian density.]
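For concreteness, here is a small sketch (not from the slides) of the one-dimensional Gaussian density written directly from the formula above; the function name is ours:

```python
# The one-dimensional Gaussian density p(x; mu, sigma), as a concrete example
# of what one member of a parametric family {P_theta} looks like in code.
import math

def gaussian_density(x, mu, sigma):
    """Density of N(mu, sigma^2) at x; sigma must be > 0."""
    coeff = 1.0 / math.sqrt(2.0 * math.pi * sigma**2)
    return coeff * math.exp(-((x - mu) ** 2) / (2.0 * sigma**2))

print(gaussian_density(0.0, 0.0, 1.0))  # ~0.3989, the standard normal at 0
```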

Maximum likelihood estimation

Setting
Given: data $x_1, x_2, \dots, x_n \in \mathcal{X}$ and a parametric model $\mathcal{P} = \{P_\theta : \theta \in \Theta\}$.
The i.i.d. assumption: assume $x_1, x_2, \dots, x_n$ are independent and identically distributed, i.e., drawn independently from the same probability distribution.
Likelihood of $\theta$ given the data (under the i.i.d. assumption): $\prod_{i=1}^n p(x_i; \theta)$, the probability mass (or density) of the data under $P_\theta$.

Maximum likelihood estimator
The maximum likelihood estimator (MLE) for the model $\mathcal{P}$ is
$$\theta_{\mathrm{ML}} := \arg\max_{\theta \in \Theta} \prod_{i=1}^n p(x_i; \theta),$$
i.e., the parameter $\theta \in \Theta$ whose likelihood is highest given the data.
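A minimal sketch (not from the slides) making the definition concrete: the likelihood is just a product of per-observation probability masses (or densities), so any model that supplies $p(x; \theta)$ can be plugged in. The helper names below are ours:

```python
# Likelihood of theta given i.i.d. data: the product of p(x_i; theta).
import math

def likelihood(theta, xs, p):
    """Product of p(x; theta) over the observations xs."""
    return math.prod(p(x, theta) for x in xs)

def bernoulli_mass(x, p):
    """Probability mass of a single {0,1}-valued observation under Bern(p)."""
    return p if x == 1 else 1.0 - p

data = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]
print(likelihood(0.5, data, bernoulli_mass))  # 0.5**10 ≈ 9.77e-4
```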

Logarithm trick

Recall: logarithms turn products into sums:
$$\log\left(\prod_{i=1}^n f_i\right) = \sum_{i=1}^n \log(f_i)$$
Logarithms and maxima: the logarithm is monotonically increasing on $\mathbb{R}_{++}$. Consequence: applying the logarithm does not change the location of a maximum or minimum:
$$\max_y \log(g(y)) \ne \max_y g(y) \qquad \text{(the value changes, in general)}$$
$$\arg\max_y \log(g(y)) = \arg\max_y g(y) \qquad \text{(the location does not change)}$$
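A quick numerical illustration (not from the slides): on a grid of Bernoulli parameters, the likelihood and the log-likelihood of the observed weather sequence peak at the same location even though their values differ:

```python
# The logarithm trick in action: same argmax, different max value.
import numpy as np

data = np.array([1, 0, 1, 1, 0, 0, 1, 0, 1, 0])
k, n = data.sum(), data.size

grid = np.linspace(1e-6, 1 - 1e-6, 100001)        # avoid log(0) at the endpoints
likelihood = grid**k * (1 - grid)**(n - k)
log_likelihood = k * np.log(grid) + (n - k) * np.log(1 - grid)

print(grid[np.argmax(likelihood)])      # ~0.5
print(grid[np.argmax(log_likelihood)])  # same location, ~0.5
print(likelihood.max(), log_likelihood.max())  # different values: ~9.77e-4 vs ~-6.93
```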

MLE: maximality criterion

Likelihood and the logarithm trick:
$$\theta_{\mathrm{ML}} = \arg\max_{\theta \in \Theta} \prod_{i=1}^n p(x_i; \theta) = \arg\max_{\theta \in \Theta} \log\left(\prod_{i=1}^n p(x_i; \theta)\right) = \arg\max_{\theta \in \Theta} \sum_{i=1}^n \log p(x_i; \theta)$$
Maximality criterion: assuming $\theta$ is unconstrained, the log-likelihood maximizer must satisfy
$$0 = \sum_{i=1}^n \nabla_\theta \log p(x_i; \theta).$$
For some models, one can analytically find the unique solution $\theta \in \Theta$.
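As a sketch of the criterion (not from the slides): for the Bernoulli model with $k$ ones among $n$ observations, the log-likelihood is $k \log p + (n-k)\log(1-p)$, its derivative is $k/p - (n-k)/(1-p)$, and setting this to zero gives $p = k/n$. A quick numerical check, with variable names of our own:

```python
# Check that the derivative of the Bernoulli log-likelihood vanishes at p = k/n.
data = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]
k = sum(data)
n = len(data)

def dlogL(p):
    # d/dp [ k*log(p) + (n-k)*log(1-p) ]
    return k / p - (n - k) / (1.0 - p)

p_hat = k / n
print(p_hat)          # 0.5
print(dlogL(p_hat))   # 0.0: the maximality criterion is satisfied
print(dlogL(0.3), dlogL(0.7))  # positive below p_hat, negative above it
```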

Gaussian distribution

Gaussian density in one dimension ($\mathcal{X} = \mathbb{R}$):
$$p(x; \mu, \sigma) := \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$$
where $\mu$ is the mean, $\sigma^2$ is the variance, and $\frac{x-\mu}{\sigma}$ is the deviation from the mean in units of $\sigma$.
[Figure: plot of the one-dimensional Gaussian density.]

Gaussian density in $d$ dimensions ($\mathcal{X} = \mathbb{R}^d$): in the density function, the quadratic function
$$\frac{(x-\mu)^2}{2\sigma^2} = \frac{1}{2}(x-\mu)(\sigma^2)^{-1}(x-\mu)$$
is replaced by a multivariate quadratic form:
$$p(x; \mu, \Sigma) := \frac{1}{\sqrt{(2\pi)^d \det(\Sigma)}} \exp\!\left(-\frac{1}{2}(x-\mu)^\top \Sigma^{-1} (x-\mu)\right)$$

Gaussian MLE

Model: multivariate Gaussians with fixed covariance. The model $\mathcal{P}$ is the set of all Gaussian densities on $\mathbb{R}^d$ with a fixed covariance matrix $\Sigma$:
$$\mathcal{P} = \{g(\,\cdot\,; \mu, \Sigma) : \mu \in \mathbb{R}^d\},$$
where $g$ is the Gaussian density function. The parameter space is $\Theta = \mathbb{R}^d$.
MLE equation: solve the following equation (from the maximality criterion) for $\mu$:
$$\sum_{i=1}^n \nabla_\mu \log g(x_i; \mu, \Sigma) = 0.$$

Gaussian MLE (continued)

$$0 = \sum_{i=1}^n \nabla_\mu \left[\log\left(\frac{1}{\sqrt{(2\pi)^d \det(\Sigma)}} \exp\!\left(-\frac{1}{2}(x_i-\mu)^\top \Sigma^{-1}(x_i-\mu)\right)\right)\right]$$
$$= \sum_{i=1}^n \nabla_\mu \left[\log\frac{1}{\sqrt{(2\pi)^d \det(\Sigma)}} - \frac{1}{2}(x_i-\mu)^\top \Sigma^{-1}(x_i-\mu)\right]$$
$$= \sum_{i=1}^n \nabla_\mu \left[-\frac{1}{2}(x_i-\mu)^\top \Sigma^{-1}(x_i-\mu)\right] = \sum_{i=1}^n \Sigma^{-1}(x_i-\mu)$$
Multiplying both sides by $\Sigma$ gives
$$0 = \sum_{i=1}^n (x_i - \mu) \implies \mu = \frac{1}{n}\sum_{i=1}^n x_i.$$
Conclusion: the maximum likelihood estimator of the Gaussian mean parameter is
$$\mu_{\mathrm{ML}} := \frac{1}{n}\sum_{i=1}^n x_i.$$
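A sketch (not from the slides, with synthetic data of our own) checking numerically that the sample mean maximizes the Gaussian log-likelihood when the covariance is held fixed:

```python
# The MLE of the Gaussian mean is the sample mean; compare log-likelihoods at
# the sample mean and at a perturbed candidate (covariance fixed at identity).
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(loc=[1.0, -2.0], scale=1.0, size=(500, 2))   # synthetic data
Sigma = np.eye(2)                                           # fixed covariance

mu_ml = X.mean(axis=0)                                      # (1/n) * sum_i x_i

def gaussian_log_likelihood(X, mu, Sigma):
    d = X.shape[1]
    diffs = X - mu
    quad = np.einsum("ij,jk,ik->i", diffs, np.linalg.inv(Sigma), diffs)
    log_norm = 0.5 * (d * np.log(2 * np.pi) + np.log(np.linalg.det(Sigma)))
    return np.sum(-0.5 * quad - log_norm)

print(mu_ml)  # close to the true mean [1, -2]
print(gaussian_log_likelihood(X, mu_ml, Sigma)
      > gaussian_log_likelihood(X, mu_ml + 0.1, Sigma))  # True
```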

Example: Gaussian with unknown covariance

Model: multivariate Gaussians. The model $\mathcal{P}$ is now
$$\mathcal{P} = \left\{g(\,\cdot\,; \mu, \Sigma) : \mu \in \mathbb{R}^d,\ \Sigma \in \mathbb{S}^{d \times d}_{++}\right\},$$
where $\mathbb{S}^{d \times d}_{++}$ is the set of symmetric positive definite $d \times d$ matrices. The parameter space is $\Theta = \mathbb{R}^d \times \mathbb{S}^{d \times d}_{++}$.
ML approach: since we have just seen that the ML estimator of $\mu$ does not depend on $\Sigma$, we can compute $\mu_{\mathrm{ML}}$ first. We then estimate $\Sigma$ using the criterion
$$\sum_{i=1}^n \nabla_\Sigma \log g(x_i; \mu_{\mathrm{ML}}, \Sigma) = 0.$$
Solution: the ML estimator of $\Sigma$ is
$$\Sigma_{\mathrm{ML}} := \frac{1}{n}\sum_{i=1}^n (x_i - \mu_{\mathrm{ML}})(x_i - \mu_{\mathrm{ML}})^\top.$$
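A sketch (not from the slides, again with synthetic data) of both closed-form estimators, compared against NumPy's covariance routine (np.cov defaults to the unbiased $1/(n-1)$ scaling, so bias=True is needed to match the $1/n$ MLE):

```python
# Closed-form Gaussian MLEs: sample mean and (1/n)-scaled sample covariance.
import numpy as np

rng = np.random.default_rng(1)
true_mu = np.array([0.0, 3.0])
true_Sigma = np.array([[2.0, 0.5],
                       [0.5, 1.0]])
X = rng.multivariate_normal(true_mu, true_Sigma, size=2000)

n = X.shape[0]
mu_ml = X.mean(axis=0)
centered = X - mu_ml
Sigma_ml = (centered.T @ centered) / n    # (1/n) * sum_i (x_i - mu)(x_i - mu)^T

print(mu_ml)                              # close to [0, 3]
print(Sigma_ml)                           # close to true_Sigma
print(np.allclose(Sigma_ml, np.cov(X, rowvar=False, bias=True)))  # True
```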

Recap

Maximum likelihood estimation is a principled way to find the best distribution in a model: choose the distribution that most likely generated the data.
The MLEs for the Bernoulli and Gaussian models can be computed in closed form.