CSE 455/555 Spring 2013
Homework 7: Parametric Techniques
Jason J. Corso, Computer Science and Engineering, SUNY at Buffalo
jcorso@buffalo.edu
Solutions by Yingbo Zhou

This assignment does not need to be submitted and will not be graded, but students are advised to work through the problems to ensure they understand the material. You are both allowed and encouraged to work in groups on this and other homework assignments in this class. These are challenging topics, and working together will both make them easier to decipher and help you ensure that you truly understand them.

1. Maximum Likelihood of a Binary/Multinomial Variable

Suppose we have a single binary variable x ∈ {0, 1}, where x = 1 denotes heads and x = 0 denotes tails of the outcome of flipping a coin. We make no assumption about the fairness of the coin; instead, we let the probability of x = 1 be given by a parameter µ, so that p(x = 1 | µ) = µ, where 0 ≤ µ ≤ 1.

(a) Write down the probability distribution of x.

Since p(x = 0 | µ) = 1 − µ, we have

    p(x | µ) = µ^x (1 − µ)^(1−x),

which is known as the Bernoulli distribution.

(b) Show that this is a proper probability distribution, i.e. that the probabilities sum to 1. What are the expectation and variance of this distribution?

    Σ_{x ∈ {0,1}} p(x | µ) = p(x = 0 | µ) + p(x = 1 | µ) = 1 − µ + µ = 1

    E(x) = Σ_{x ∈ {0,1}} x p(x | µ) = 0 · p(x = 0 | µ) + 1 · p(x = 1 | µ) = µ

    var(x) = E(x²) − E(x)² = Σ_{x ∈ {0,1}} x² p(x | µ) − µ² = 0² · p(x = 0 | µ) + 1² · p(x = 1 | µ) − µ² = µ − µ²

(c) Now suppose we have a set of observed values X = {x_1, x_2, ..., x_N} of x. Write the likelihood function and estimate the maximum likelihood parameter µ_ML.

    p(X | µ) = Π_{i=1}^N p(x_i | µ) = Π_{i=1}^N µ^{x_i} (1 − µ)^{1−x_i}

The log-likelihood can be written as

    l(X | µ) = ln p(X | µ) = Σ_{i=1}^N { x_i ln µ + (1 − x_i) ln(1 − µ) }.

Taking the derivative with respect to µ and setting it to zero gives

    Σ_{i=1}^N { x_i / µ − (1 − x_i) / (1 − µ) } = 0
    (1 − µ) Σ_i x_i = µ Σ_i (1 − x_i)
    Σ_i x_i − µ Σ_i x_i = µN − µ Σ_i x_i

    µ_ML = (1/N) Σ_{i=1}^N x_i

If we observe m heads, then µ_ML = m/N.
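As a quick numerical check of this result: the ML estimate is simply the sample mean of the binary observations, i.e. the fraction m/N of heads. A minimal sketch (the "true" parameter, seed, and sample size below are made-up illustration values, not part of the assignment):

```python
import numpy as np

# Minimal sketch: the Bernoulli ML estimate is the sample mean of the flips.
# The "true" parameter and sample size are made up for illustration.
rng = np.random.default_rng(0)
true_mu = 0.3
X = rng.binomial(1, true_mu, size=1000)   # N = 1000 simulated flips (1 = heads)

mu_ml = X.mean()                          # mu_ML = (1/N) sum_i x_i = m / N
print(f"m = {X.sum()}, N = {X.size}, mu_ML = {mu_ml:.3f}")  # close to 0.3
```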

(d) Now suppose we are rolling a K-sided die; in other words, we have data D = {x_1, x_2, ..., x_N} which can take on K values. Assume the generation of each value k ∈ {1, ..., K} is governed by a parameter θ_k ≥ 0, so that p(x = k | θ_k) = θ_k and Σ_k θ_k = 1. Write down the likelihood function.

First we introduce a convenient representation called the 1-of-K scheme, in which the variable is represented by a K-dimensional binary vector exactly one element of which takes the value one. This is exactly the case for this question, since each draw can take on only one particular value. Therefore, similar to the Bernoulli form, the distribution is

    p(x | θ) = Π_{k=1}^K θ_k^{x_k}

where Σ_{k=1}^K x_k = 1, θ_k ≥ 0 and Σ_{k=1}^K θ_k = 1. Now we can write the likelihood function as

    p(D | θ) = Π_{n=1}^N Π_{k=1}^K θ_k^{x_{nk}} = Π_{k=1}^K θ_k^{Σ_n x_{nk}} = Π_{k=1}^K θ_k^{N_k}

where N_k = Σ_n x_{nk} is the number of data points that take on value k, and the log-likelihood is

    l(D | θ) = Σ_{n=1}^N Σ_{k=1}^K x_{nk} ln θ_k.

(e) Write the maximum likelihood solution for θ_k^{ML}.

We have to maximize the likelihood subject to the constraint Σ_k θ_k = 1, so we introduce a Lagrange multiplier; the new objective function is

    F(θ, λ) = Σ_{n=1}^N Σ_{k=1}^K x_{nk} ln θ_k + λ ( Σ_{k=1}^K θ_k − 1 ).

Taking the derivative with respect to θ_j and setting it to zero gives

    ∂F/∂θ_j = Σ_n x_{nj} / θ_j + λ = N_j / θ_j + λ = 0,  so  θ_j = −N_j / λ.

Substituting this back into the constraint Σ_k θ_k = 1 gives Σ_k θ_k = −Σ_k N_k / λ = −N / λ = 1, hence λ = −N, and therefore

    θ_k^{ML} = N_k / N.
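The closed form θ_k^{ML} = N_k / N says the estimate is just the empirical frequency of each face, which is easy to verify numerically. A minimal sketch (the face probabilities and number of rolls are illustrative assumptions):

```python
import numpy as np

# Minimal sketch: the ML estimate for a K-sided die is theta_k = N_k / N,
# the empirical frequency of each face. Probabilities below are made up.
rng = np.random.default_rng(1)
true_theta = np.array([0.1, 0.2, 0.3, 0.4])               # K = 4 faces
D = rng.choice(len(true_theta), size=5000, p=true_theta)  # simulated rolls

N_k = np.bincount(D, minlength=len(true_theta))           # counts per face
theta_ml = N_k / N_k.sum()                                # theta_k^ML = N_k / N
print(theta_ml)                                           # roughly [0.1 0.2 0.3 0.4]
```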

2. Naive Bayes

In naive Bayes, we assume that the presence of a particular feature of a class is unrelated to the presence of any other feature. For example, a fruit may be considered to be a watermelon if it is green, round, and more than 10 pounds; in naive Bayes we treat these three features as independent of one another. Let our features x_i, i ∈ [1, d], be binary valued, i.e. x_i ∈ {0, 1}, and let the input feature vector be x = [x_1 x_2 ... x_d]^T. For each training sample, the target value y ∈ {0, 1} is also a binary-valued variable. Then our model is parameterized by φ_{i|y=0} = p(x_i = 1 | y = 0), φ_{i|y=1} = p(x_i = 1 | y = 1), and φ_y = p(y = 1), with

    p(y) = (φ_y)^y (1 − φ_y)^{1−y}

    p(x | y = 0) = Π_{i=1}^d p(x_i | y = 0) = Π_{i=1}^d (φ_{i|y=0})^{x_i} (1 − φ_{i|y=0})^{1−x_i}

    p(x | y = 1) = Π_{i=1}^d p(x_i | y = 1) = Π_{i=1}^d (φ_{i|y=1})^{x_i} (1 − φ_{i|y=1})^{1−x_i}

(a) Write down the joint log-likelihood function l(θ) = log Π_{n=1}^N p(x^{(n)}, y^{(n)}; θ) in terms of the model parameters given above. Here x^{(n)} denotes the nth data point, and θ represents all the parameters, i.e. {φ_y, φ_{i|y=0}, φ_{i|y=1}, i = 1, ..., d}.

    l(θ) = log Π_{n=1}^N p(x^{(n)}, y^{(n)}; θ)
         = log Π_{n=1}^N p(x^{(n)} | y^{(n)}; θ) p(y^{(n)}; θ)
         = Σ_{n=1}^N ( log p(y^{(n)}; θ) + Σ_{i=1}^d log p(x_i^{(n)} | y^{(n)}; θ) )
         = Σ_{n=1}^N { y^{(n)} log φ_y + (1 − y^{(n)}) log(1 − φ_y) + Σ_{i=1}^d [ x_i^{(n)} log φ_{i|y^{(n)}} + (1 − x_i^{(n)}) log(1 − φ_{i|y^{(n)}}) ] }

(b) Estimate the parameters using maximum likelihood, i.e. find solutions for the parameters φ_y, φ_{i|y=0} and φ_{i|y=1}. Taking derivatives with respect to these three parameters, we get:

    φ_y = (1/N) Σ_{n=1}^N y^{(n)}

    φ_{i|y=0} = Σ_{n=1}^N (1 − y^{(n)}) x_i^{(n)} / Σ_{n=1}^N (1 − y^{(n)})

    φ_{i|y=1} = Σ_{n=1}^N y^{(n)} x_i^{(n)} / Σ_{n=1}^N y^{(n)}
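In code, these three estimates amount to the empirical class frequency and the per-class feature frequencies. A minimal sketch on synthetic binary data (the data-generating probabilities, shapes, and variable names are illustrative assumptions, not from the assignment):

```python
import numpy as np

# Minimal sketch: the naive Bayes ML estimates are empirical frequencies.
# X is an (N, d) binary feature matrix, y an (N,) binary label vector;
# both are synthetic and made up for illustration.
rng = np.random.default_rng(2)
N, d = 500, 3
y = rng.binomial(1, 0.4, size=N)
X = rng.binomial(1, np.where(y[:, None] == 1, 0.8, 0.2), size=(N, d))

phi_y = y.mean()                 # phi_y       = p(y = 1)
phi_y0 = X[y == 0].mean(axis=0)  # phi_{i|y=0} = p(x_i = 1 | y = 0), shape (d,)
phi_y1 = X[y == 1].mean(axis=0)  # phi_{i|y=1} = p(x_i = 1 | y = 1), shape (d,)
print(phi_y, phi_y0, phi_y1)
```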

(c) When a new sample point x arrives, we make a prediction based on the most likely class under our model. Show that the hypothesis returned by naive Bayes is linear, i.e. if p(y = 0 | x) and p(y = 1 | x) are the class probabilities returned by our model, show that there exists some α such that p(y = 1 | x) ≥ p(y = 0 | x) if and only if α^T x̂ ≥ 0, where α = [α_0 α_1 ... α_d]^T and x̂ = [1 x_1 x_2 ... x_d]^T.

    p(y = 1 | x) ≥ p(y = 0 | x)
    ⟺ p(x | y = 1) p(y = 1) ≥ p(x | y = 0) p(y = 0)
    ⟺ φ_y Π_{i=1}^d (φ_{i|y=1})^{x_i} (1 − φ_{i|y=1})^{1−x_i} ≥ (1 − φ_y) Π_{i=1}^d (φ_{i|y=0})^{x_i} (1 − φ_{i|y=0})^{1−x_i}
    ⟺ log φ_y + Σ_{i=1}^d x_i log φ_{i|y=1} + Σ_{i=1}^d (1 − x_i) log(1 − φ_{i|y=1}) ≥ log(1 − φ_y) + Σ_{i=1}^d x_i log φ_{i|y=0} + Σ_{i=1}^d (1 − x_i) log(1 − φ_{i|y=0})
    ⟺ Σ_{i=1}^d x_i log [ φ_{i|y=1} (1 − φ_{i|y=0}) / ( φ_{i|y=0} (1 − φ_{i|y=1}) ) ] + log [ φ_y / (1 − φ_y) ] + Σ_{i=1}^d log [ (1 − φ_{i|y=1}) / (1 − φ_{i|y=0}) ] ≥ 0
    ⟺ α^T x̂ ≥ 0

where

    α_0 = log [ φ_y / (1 − φ_y) ] + Σ_{i=1}^d log [ (1 − φ_{i|y=1}) / (1 − φ_{i|y=0}) ]

    α_i = log [ φ_{i|y=1} (1 − φ_{i|y=0}) / ( φ_{i|y=0} (1 − φ_{i|y=1}) ) ],  i = 1, ..., d.
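The algebra above can be sanity-checked numerically: the linear score α_0 + Σ_i α_i x_i should equal the difference of the two class log-joint probabilities for every binary input. A minimal sketch (the parameter values below are arbitrary made-up numbers):

```python
import numpy as np

# Minimal sketch: the naive Bayes decision p(y=1|x) >= p(y=0|x) reduces to the
# linear rule alpha^T [1, x] >= 0. Parameter values below are made up.
phi_y = 0.4
phi_y0 = np.array([0.2, 0.5, 0.1])  # p(x_i = 1 | y = 0)
phi_y1 = np.array([0.7, 0.6, 0.3])  # p(x_i = 1 | y = 1)

alpha = np.log(phi_y1 * (1 - phi_y0)) - np.log(phi_y0 * (1 - phi_y1))
alpha_0 = np.log(phi_y / (1 - phi_y)) + np.log((1 - phi_y1) / (1 - phi_y0)).sum()

def log_joint_diff(x):
    # log p(x, y=1) - log p(x, y=0), computed directly from the model
    lp1 = np.log(phi_y) + np.sum(x * np.log(phi_y1) + (1 - x) * np.log(1 - phi_y1))
    lp0 = np.log(1 - phi_y) + np.sum(x * np.log(phi_y0) + (1 - x) * np.log(1 - phi_y0))
    return lp1 - lp0

for bits in np.ndindex(2, 2, 2):            # every binary feature vector
    x = np.array(bits)
    assert np.isclose(alpha_0 + alpha @ x, log_joint_diff(x))
print("linear score matches the log-joint difference for all inputs")
```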

3. Gaussian Distribution

Please familiarize yourself with the maximum likelihood estimation for the Gaussian distribution in the class notes and answer the following questions.

(a) What is the joint probability distribution p(X; µ, σ²) of the samples?

    p(X; µ, σ²) = Π_{i=1}^N N(x_i | µ, σ²)

(b) What is the maximum likelihood (ML) estimation of the parameters, if both µ and σ² are unknown?

Please refer to the lecture notes.

(c) Show that the ML estimation of the variance is biased, i.e. show that

    E(σ²_ML) = ((N − 1)/N) σ².

Hint: you can use the fact that the expectation of a random variable from a Gaussian distribution is µ, i.e. E(x_i) = µ, and var(x_i) = σ² = E(x_i²) − E(x_i)².

    E(σ²_ML) = E( (1/N) Σ_i (x_i − µ_ML)² )
    = E( (1/N) Σ_i { x_i² + µ_ML² − 2 x_i µ_ML } )
    = E( (1/N) Σ_i { x_i² + ( (1/N) Σ_j x_j )² − 2 x_i (1/N) Σ_j x_j } )
    = E( (1/N) Σ_i { x_i² + (1/N²) Σ_j x_j² + (1/N²) Σ_{j≠k} x_j x_k − (2/N) x_i² − (2/N) Σ_{j≠i} x_i x_j } )
    = (1/N) Σ_i { E(x_i²) + (1/N²) Σ_j E(x_j²) + (1/N²) Σ_{j≠k} E(x_j x_k) − (2/N) E(x_i²) − (2/N) Σ_{j≠i} E(x_i x_j) }
    = (1/N) Σ_i { (σ² + µ²) + (σ² + µ²)/N + ((N − 1)/N) µ² − 2(σ² + µ²)/N − 2((N − 1)/N) µ² }
    = (1/N) Σ_i { σ² + µ² + σ²/N + µ²/N + µ² − µ²/N − 2σ²/N − 2µ²/N − 2µ² + 2µ²/N }
    = (1/N) Σ_i ( σ² − σ²/N )
    = σ² − (1/N) σ²
    = ((N − 1)/N) σ²

(d) Please write down the objective function of the maximum a posteriori (MAP) estimation of the parameters, if we assume that only µ is unknown and that it follows a Gaussian distribution with mean µ_0 and variance σ_0².

We know that p(µ) = N(µ | µ_0, σ_0²), so the posterior probability of the parameter µ is

    p(µ | X) ∝ p(X | µ) p(µ) = { Π_{i=1}^N N(x_i | µ, σ²) } N(µ | µ_0, σ_0²).

(e) Please write down the Bayesian formulation of this problem, if all other assumptions stay the same as in question (d).

    p(x | X) = ∫ p(x | θ) p(θ | X) dθ

where x is a new data point for which we want to make a prediction and θ is the set of parameters of the model; in this case θ = {µ}, since we assume µ is the only unknown parameter.
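The bias result in (c) can also be checked empirically by averaging σ²_ML over many independent samples of size N. A minimal Monte Carlo sketch (the sample size, true variance, and number of trials are arbitrary illustration choices):

```python
import numpy as np

# Minimal sketch: Monte Carlo check that E(sigma^2_ML) ~ ((N - 1)/N) * sigma^2.
# N, sigma^2, and the number of trials are arbitrary illustration values.
rng = np.random.default_rng(4)
N, sigma2, trials = 10, 4.0, 200_000

samples = rng.normal(0.0, np.sqrt(sigma2), size=(trials, N))
mu_ml = samples.mean(axis=1, keepdims=True)        # per-trial mu_ML
var_ml = np.mean((samples - mu_ml) ** 2, axis=1)   # per-trial sigma^2_ML (uses 1/N)

print(var_ml.mean())          # empirical E(sigma^2_ML), close to 3.6
print((N - 1) / N * sigma2)   # theoretical ((N - 1)/N) * sigma^2 = 3.6
```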