MLE and Bayesian Estimation. Jie Tang Department of Computer Science & Technology Tsinghua University 2012



Linear Regression. As the first step, we need to decide how we are going to represent the function f. One example: polynomial curve fitting. Notation:

$f(x, \mathbf{w}) = w_0 + w_1 x + \dots + w_M x^M = \sum_{j=0}^{M} w_j x^j = \mathbf{w}^T \mathbf{x}$

Now, given a training set, how do we pick, or learn, the parameters w?
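To make the representation concrete, here is a minimal Python sketch (not from the original slides; the helper names and the example weights are illustrative assumptions):

import numpy as np

def poly_features(x, M):
    # Map a scalar input x to the feature vector (1, x, x^2, ..., x^M).
    return np.array([x ** j for j in range(M + 1)])

def f(x, w):
    # f(x, w) = sum_j w_j * x^j = w^T phi(x)
    return w @ poly_features(x, len(w) - 1)

w = np.array([0.5, -1.0, 0.0, 2.0])   # an arbitrary cubic, M = 3
print(f(1.5, w))                      # 0.5 - 1.0*1.5 + 2.0*1.5**3 = 5.75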

Least Squares Loss Function

$L(\mathbf{w}) = \frac{1}{2} \sum_{i=1}^{N} \big(f(x_i, \mathbf{w}) - y_i\big)^2$

Learning by Gradient Descent. How to choose w in order to minimize L(w)? The general idea is to start with some initial guess for w, and then repeatedly change w to make L(w) smaller, until we converge to a value of w that minimizes L(w). Gradient descent is a natural search algorithm that updates w in the direction of steepest decrease of L:

$w_j \leftarrow w_j - \alpha \frac{\partial L(\mathbf{w})}{\partial w_j}$

where $\alpha$ is the learning rate. Now let us calculate the partial derivative. First consider one training instance (x, y), so the sum in L can be ignored:

$\frac{\partial L(\mathbf{w})}{\partial w_j} = \big(f(x, \mathbf{w}) - y\big)\,\frac{\partial}{\partial w_j}\Big(\sum_{j'=0}^{M} w_{j'} x_{j'} - y\Big) = \big(f(x, \mathbf{w}) - y\big)\, x_j$

so the update for one example is

$w_j \leftarrow w_j - \alpha \big(f(x, \mathbf{w}) - y\big)\, x_j$

Intuition: the update is proportional to the error term (f(x, w) - y). Thus for training examples whose prediction is already close to the actual value y, there is little need to change the parameters; in contrast, for examples with a large error, a larger change to the parameters will be made. This rule is called the LMS (least mean squares) update rule, aka the Widrow-Hoff learning rule.
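A minimal sketch of this single-example update, reusing the poly_features and f helpers above (alpha is the learning rate; the function name is mine, not from the slides):

def lms_update(w, x, y, alpha):
    # Widrow-Hoff / LMS rule for one instance (x, y):
    # w_j <- w_j - alpha * (f(x, w) - y) * x_j, applied to all j at once.
    phi = poly_features(x, len(w) - 1)
    error = w @ phi - y               # the error term f(x, w) - y
    return w - alpha * error * phi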

Learning by Gradient Descent (cont.) Now consider a training data set rather than only one example. There are two ways:

Batch gradient descent: scan through the entire training set before taking a single update step.
Repeat until convergence { $w_j \leftarrow w_j - \alpha \sum_{i=1}^{N} \big(f(x_i, \mathbf{w}) - y_i\big)\, x_{ij}$ (for every j) }

Stochastic gradient descent: start making progress right away, and continue to make progress with each example it looks at. Much faster than batch gradient descent. The parameters will keep oscillating around the minimum of L, but in practice most of the values near the minimum will be reasonably good approximations to the true minimum.
Repeat until convergence { for each training example $(x_i, y_i)$ { $w_j \leftarrow w_j - \alpha \big(f(x_i, \mathbf{w}) - y_i\big)\, x_{ij}$ (for every j) } }

Particularly when the training set is large, stochastic gradient descent is often preferred.
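A rough sketch contrasting the two loops, again under the assumptions above (fixed iteration counts stand in for a real convergence test):

def batch_gd(X, Y, w, alpha, n_iters=1000):
    # Each step uses the gradient summed over the entire training set.
    for _ in range(n_iters):
        grad = sum((f(x, w) - y) * poly_features(x, len(w) - 1) for x, y in zip(X, Y))
        w = w - alpha * grad
    return w

def stochastic_gd(X, Y, w, alpha, n_epochs=100):
    # One update per example; w oscillates near the minimum but gets close quickly.
    for _ in range(n_epochs):
        for x, y in zip(X, Y):
            w = lms_update(w, x, y, alpha)
    return w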

0th Order Polynomial (* a number of the following slides are from Bishop's slides)

1st Order Polynomial

3rd Order Polynomial

9th Order Polynomial

Over-fitting. Root-Mean-Square (RMS) Error:

$E_{RMS} = \sqrt{2 L(\mathbf{w}^*) / N}$
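In code this is just (a small sketch under the same assumptions as above):

def rms_error(X, Y, w):
    # E_RMS = sqrt(2 L(w) / N); dividing by N makes different data set sizes comparable.
    L = 0.5 * sum((f(x, w) - y) ** 2 for x, y in zip(X, Y))
    return np.sqrt(2 * L / len(X))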

Polynomial Coefficients

Data Set Size: 9th Order Polynomial (fits for two different training-set sizes)

Regularization. Penalize large coefficient values:

$L(\mathbf{w}) = \frac{1}{2} \sum_{i=1}^{N} \big(f(x_i, \mathbf{w}) - y_i\big)^2 + \frac{\lambda}{2} \|\mathbf{w}\|^2$
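A sketch of the penalized loss and its gradient (lam stands for λ; whether to penalize w_0 is a modeling choice, here it is penalized as in the formula above):

def regularized_loss(X, Y, w, lam):
    data_term = 0.5 * sum((f(x, w) - y) ** 2 for x, y in zip(X, Y))
    return data_term + 0.5 * lam * (w @ w)

def regularized_grad(X, Y, w, lam):
    # Same gradient as before plus lam * w, which shrinks large coefficients.
    grad = sum((f(x, w) - y) * poly_features(x, len(w) - 1) for x, y in zip(X, Y))
    return grad + lam * w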

Regularization (fitted curves for different regularization strengths)

Regularization: $E_{RMS}$ vs. $\ln \lambda$

Polynomial Coefficients

Let us go back to the simple case. Linear function:

$f(\mathbf{x}, \mathbf{w}) = w_0 + w_1 x_1 + \dots + w_p x_p = \sum_{j=0}^{p} w_j x_j = \mathbf{w}^T \mathbf{x}$ (with $x_0 = 1$)

And the loss function:

$L(\mathbf{w}) = \frac{1}{2} \sum_{i=1}^{N} \big(f(\mathbf{x}_i, \mathbf{w}) - y_i\big)^2$

Should all examples be treated equally? How about if we assign different weights to different examples?

Locally Weighted Linear Regression. Fit w to minimize

$L(\mathbf{w}) = \frac{1}{2} \sum_{i=1}^{N} \alpha_i \big(f(\mathbf{x}_i, \mathbf{w}) - y_i\big)^2$

where $\alpha_i$ is a non-negative weight for $\mathbf{x}_i$:

$\alpha_i = \exp\Big(-\frac{(\mathbf{x}_i - \mathbf{x}_q)^2}{2\tau^2}\Big)$

and $\mathbf{x}_q$ is the example whose y we want to predict. With this, we have our first example of a non-parametric algorithm: the number of parameters is not fixed and may grow with the size of the training set.
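One way to implement this for a query point x_q is to form the weights and solve the weighted least-squares problem in closed form; this is a sketch under that assumption (the closed-form solve is not prescribed by the slide, and tau is the bandwidth):

def locally_weighted_predict(X, Y, x_q, tau, M=3):
    # Weight each training example by its closeness to the query point x_q.
    Phi = np.array([poly_features(x, M) for x in X])              # N x (M+1) design matrix
    a = np.exp(-((np.asarray(X) - x_q) ** 2) / (2 * tau ** 2))    # weights alpha_i
    A = np.diag(a)
    # Minimize sum_i alpha_i (phi(x_i)^T w - y_i)^2 via the weighted normal equations.
    w = np.linalg.solve(Phi.T @ A @ Phi, Phi.T @ A @ np.asarray(Y))
    return w @ poly_features(x_q, M)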

Probabilistic Interpretation. Let us assume that each example has an error term $\varepsilon_i$ to represent unmodeled effects:

$y_i = \mathbf{w}^T \mathbf{x}_i + \varepsilon_i$

Also assume that the $\varepsilon_i$ are IID Gaussian:

$p(\varepsilon_i) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\Big(-\frac{\varepsilon_i^2}{2\sigma^2}\Big)$

Then we have

$p(y_i \mid \mathbf{x}_i; \mathbf{w}) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\Big(-\frac{(y_i - \mathbf{w}^T \mathbf{x}_i)^2}{2\sigma^2}\Big)$

Given data $X = \{\mathbf{x}_i\}$ and their corresponding $Y$, the likelihood is

$L(\mathbf{w}) = p(Y \mid X; \mathbf{w}) = \prod_{i=1}^{N} p(y_i \mid \mathbf{x}_i; \mathbf{w}) = \prod_{i=1}^{N} \frac{1}{\sqrt{2\pi}\,\sigma} \exp\Big(-\frac{(y_i - \mathbf{w}^T \mathbf{x}_i)^2}{2\sigma^2}\Big)$

Maximum (log-)likelihood. Log likelihood:

$\ell(\mathbf{w}) = \log p(Y \mid X; \mathbf{w}) = \log \prod_{i=1}^{N} \frac{1}{\sqrt{2\pi}\,\sigma} \exp\Big(-\frac{(y_i - \mathbf{w}^T \mathbf{x}_i)^2}{2\sigma^2}\Big) = N \log \frac{1}{\sqrt{2\pi}\,\sigma} - \frac{1}{2\sigma^2} \sum_{i=1}^{N} (y_i - \mathbf{w}^T \mathbf{x}_i)^2$

Maximizing this is the same as minimizing least squares.

* Under these probabilistic assumptions, minimizing the least-squares regression loss corresponds to maximum likelihood estimation of w.

Probability Distributions

The Rules of Probability. Sum rule: $p(X) = \sum_{Y} p(X, Y)$. Product rule: $p(X, Y) = p(Y \mid X)\, p(X)$.

Probability Densities

The Gaussian Distribution: $\mathcal{N}(x \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\Big(-\frac{(x - \mu)^2}{2\sigma^2}\Big)$

Gaussian Mean and Variance: $E[x] = \mu$, $\operatorname{var}[x] = \sigma^2$

The Multivariate Gaussian: $\mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}, \Sigma) = \frac{1}{(2\pi)^{D/2} |\Sigma|^{1/2}} \exp\Big(-\frac{1}{2} (\mathbf{x} - \boldsymbol{\mu})^T \Sigma^{-1} (\mathbf{x} - \boldsymbol{\mu})\Big)$

Examples: plots of bivariate Gaussians with covariance matrices such as $\Sigma = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}$ and $\Sigma = \begin{pmatrix} 0.6 & 0 \\ 0 & 0.6 \end{pmatrix}$

Gaussian Parameter Estimation. Likelihood function for i.i.d. data $\mathbf{x} = (x_1, \dots, x_N)^T$:

$p(\mathbf{x} \mid \mu, \sigma^2) = \prod_{n=1}^{N} \mathcal{N}(x_n \mid \mu, \sigma^2)$

Maximum (Log) Likelihood:

$\mu_{ML} = \frac{1}{N} \sum_{n=1}^{N} x_n, \qquad \sigma^2_{ML} = \frac{1}{N} \sum_{n=1}^{N} (x_n - \mu_{ML})^2$

Properties of $\mu_{ML}$ and $\sigma^2_{ML}$: the ML estimate of the mean is unbiased, but the ML estimate of the variance is biased, $E[\sigma^2_{ML}] = \frac{N-1}{N}\sigma^2$.

Binary Variables (1). Coin flipping: heads = 1, tails = 0, with $p(x = 1 \mid \mu) = \mu$. Bernoulli distribution:

$\operatorname{Bern}(x \mid \mu) = \mu^x (1 - \mu)^{1 - x}$

Binary Variables (2). N coin flips: Binomial distribution for the number of heads m:

$\operatorname{Bin}(m \mid N, \mu) = \binom{N}{m} \mu^m (1 - \mu)^{N - m}$

Binomial Distribution

Parameter Estimation (1). ML for Bernoulli. Given a data set $D = \{x_1, \dots, x_N\}$ with m heads (1) and N - m tails (0), maximizing $p(D \mid \mu) = \prod_{n} \mu^{x_n} (1 - \mu)^{1 - x_n}$ gives $\mu_{ML} = m / N$.

Parameter Estimation (2). Example: if every observed toss in D lands heads, then $\mu_{ML} = 1$ and the prediction is that all future tosses will land heads up. Overfitting to D!
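A tiny sketch of the problem (the data set of three heads is my own illustrative choice):

def bernoulli_mle(D):
    # mu_ML = m / N, the fraction of observed tosses that landed heads.
    return sum(D) / len(D)

D = [1, 1, 1]                  # three tosses, all heads
print(bernoulli_mle(D))        # 1.0, i.e. predict every future toss is heads: overfitting to D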

Beta Distribution. Distribution over $\mu \in [0, 1]$:

$\operatorname{Beta}(\mu \mid a, b) = \frac{\Gamma(a + b)}{\Gamma(a)\Gamma(b)}\, \mu^{a-1} (1 - \mu)^{b-1}$

Bayesian Bernoulli. The Beta distribution provides the conjugate prior for the Bernoulli distribution.

Beta Distribution

Prior × Likelihood = Posterior: $\operatorname{Beta}(\mu \mid a, b) \times \mu^{m} (1 - \mu)^{l} \;\propto\; \operatorname{Beta}(\mu \mid a + m, b + l)$, where m is the number of heads and l the number of tails.

Properties of the Posterior. As the size of the data set, N, increases, the posterior becomes more sharply peaked and its mean approaches the maximum likelihood estimate.

Prediction under the Posterior. What is the probability that the next coin toss will land heads up?

$p(x = 1 \mid D) = \int_0^1 p(x = 1 \mid \mu)\, p(\mu \mid D)\, d\mu = E[\mu \mid D] = \frac{m + a}{m + a + l + b}$
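A sketch of the Beta-Bernoulli computation (the hyperparameters a = b = 2 and the toy data are assumptions for illustration): with prior Beta(mu | a, b) and m heads, l tails observed, the posterior is Beta(mu | a + m, b + l) and the predictive probability of heads is its mean.

def beta_bernoulli_posterior(a, b, D):
    # Conjugate update: the counts of heads (m) and tails (l) simply add to the Beta parameters.
    m = sum(D)
    l = len(D) - m
    return a + m, b + l

def predict_heads(a_post, b_post):
    # p(x = 1 | D) = E[mu | D] = a_post / (a_post + b_post)
    return a_post / (a_post + b_post)

a_post, b_post = beta_bernoulli_posterior(a=2, b=2, D=[1, 1, 1])
print(predict_heads(a_post, b_post))   # 5/7 ~ 0.71, instead of the MLE's overconfident 1.0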

Multinomial Variables. 1-of-K coding scheme: $\mathbf{x}$ is a K-dimensional vector with exactly one element $x_k = 1$ and all others 0, with

$p(\mathbf{x} \mid \boldsymbol{\mu}) = \prod_{k=1}^{K} \mu_k^{x_k}$, where $\mu_k \ge 0$ and $\sum_k \mu_k = 1$.

ML Parameter Estimation. Given a data set $D = \{\mathbf{x}_1, \dots, \mathbf{x}_N\}$, maximize the log likelihood subject to the constraint $\sum_k \mu_k = 1$; to ensure the constraint holds, use a Lagrange multiplier $\lambda$. The result is $\mu_k^{ML} = m_k / N$, where $m_k$ is the number of observations of category k; the derivation is sketched below.
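The standard derivation behind this slide, reconstructed rather than copied from the original, is:

\begin{aligned}
\log p(D \mid \boldsymbol{\mu}) &= \sum_{n=1}^{N} \sum_{k=1}^{K} x_{nk} \log \mu_k = \sum_{k=1}^{K} m_k \log \mu_k, \qquad m_k = \sum_{n} x_{nk} \\
\frac{\partial}{\partial \mu_k}\Big( \sum_{k} m_k \log \mu_k + \lambda \big(\textstyle\sum_{k} \mu_k - 1\big) \Big) &= \frac{m_k}{\mu_k} + \lambda = 0 \;\Rightarrow\; \mu_k = -\frac{m_k}{\lambda} \\
\sum_{k} \mu_k = 1 \;\Rightarrow\; \lambda = -N \;&\Rightarrow\; \mu_k^{ML} = \frac{m_k}{N}
\end{aligned}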

The Multinomial Distribution

The Dirichlet Distribution. Conjugate prior for the multinomial distribution.
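A sketch paralleling the Beta-Bernoulli case (the concrete prior and counts are illustrative assumptions): with prior Dir(mu | alpha) and observed counts m_1, ..., m_K, the posterior is Dir(mu | alpha_1 + m_1, ..., alpha_K + m_K).

import numpy as np

def dirichlet_posterior(alpha, counts):
    # Conjugate update: the observed category counts simply add to the Dirichlet parameters.
    return np.asarray(alpha, dtype=float) + np.asarray(counts, dtype=float)

def predictive_probs(alpha_post):
    # p(x_k = 1 | D) = E[mu_k | D] = alpha_post_k / sum_j alpha_post_j
    return alpha_post / alpha_post.sum()

alpha_post = dirichlet_posterior(alpha=[1, 1, 1], counts=[5, 0, 2])
print(predictive_probs(alpha_post))    # [0.6, 0.1, 0.3]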

Bayesian Multinomial (1)

Bayesian Multinomial (2)

Thanks! Jie Tang, DCST. http://keg.cs.tsinghua.edu.cn/jietang/ http://arnetminer.org Email: jietang@tsinghua.edu.cn