Maximum A Posteriori (MAP) CS 109 Lecture 22 May 16th, 2016



Previously in CS109

Game of Estimators: Maximum Likelihood. Non-spoiler: this didn't happen

Side Plot: $\arg\max_\theta f(\theta) = \arg\max_\theta \log f(\theta)$. Mother of optimizations?

Reviving an Old Story Line: The Multinomial Distribution Mult(p₁, ..., p_k)
$$p(x_1, \ldots, x_k) = \frac{n!}{x_1! \cdots x_k!}\, p_1^{x_1} \cdots p_k^{x_k}$$

Machine Learning So Far

Maximum Likelihood of Data. Consider n I.I.D. random variables X₁, X₂, ..., X_n, where each X_i is a sample from the density function f(X_i | θ).
$$L(\theta) = \prod_{i=1}^n f(X_i \mid \theta)$$
$$LL(\theta) = \log L(\theta) = \log \prod_{i=1}^n f(X_i \mid \theta) = \sum_{i=1}^n \log f(X_i \mid \theta)$$
$$\hat{\theta}_{MLE} = \arg\max_\theta LL(\theta)$$

MLE to Linear Regression. How do you fit this line? Assume Y = θX + Z, where Z ~ N(0, σ²). Calculating the MLE of θ gives
$$\hat{\theta} = \arg\min_\theta \sum_{i=1}^m (Y_i - \theta X_i)^2$$
This is an algorithm called linear regression. Learn more about it later.
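(Not from the slides: a minimal Python sketch with made-up data. For the no-intercept model above, the argmin has the closed form Σᵢ XᵢYᵢ / Σᵢ Xᵢ².)

import numpy as np

# Sketch: the MLE under Y = theta*X + Z, Z ~ N(0, sigma^2) is the
# least-squares fit; the data below is simulated for illustration.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=100)
Y = 2.5 * X + rng.normal(0, 1.0, size=100)   # true theta = 2.5 (assumed)

theta_hat = np.sum(X * Y) / np.sum(X ** 2)   # argmin of sum (Y_i - theta*X_i)^2
print("theta_MLE ~", theta_hat)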

Watch it Online

Episode 22: The Song of The Last Estimator

The Song of the Last Estimator

Something rotten in the world of MLE

Foreshadowing...

Need a Volunteer. So good to see you again!

Two Envelopes. I have two envelopes and will allow you to have one. One contains $X, the other contains $2X. Select an envelope. Open it! Now, would you like to switch for the other envelope? To help you decide, compute E[$ in other envelope]. Let Y = $ in the envelope you selected:
$$E[\$ \text{ in other envelope}] = \frac{1}{2} \cdot \frac{Y}{2} + \frac{1}{2} \cdot 2Y = \frac{5}{4} Y$$
Before opening the envelope, you think either envelope is equally good. So, what happened by opening the envelope? And does it really make sense to switch?

Thinking Deeper About Two Envelopes. The two envelopes problem set-up: two envelopes, one contains $X, the other contains $2X. You select an envelope and open it. Let Y = $ in the envelope you selected; let Z = $ in the other envelope.
$$E[Z \mid Y] = \frac{1}{2} \cdot \frac{Y}{2} + \frac{1}{2} \cdot 2Y = \frac{5}{4} Y$$
E[Z | Y] above assumes all values of X (where 0 < X < ∞) are equally likely. Note: there are infinitely many values of X, so this is not a true probability distribution over X (it doesn't integrate to 1).

All Values are Equally Likely? [Plot: p(x) over X from 0 to 100, with point masses at the infinitely many powers of two]

Subjectivity of Probability. Belief about contents of envelopes: since the implied distribution over X is not a true probability distribution, what is our distribution over X? Frequentist: play the game infinitely many times and see how often different values come up. Problem: I only allow you to play the game once. Bayesian probability: have a prior belief of the distribution for X (or anything, for that matter). A prior belief is a subjective probability; by extension, all probabilities are subjective. This allows us to answer questions when we have no/limited data, e.g., the probability a coin you've never flipped lands on heads. As we get more data, the prior belief is swamped by the data.

Subjectivity of Probability [Plot: a subjective prior density p(x) over X from 0 to 100]

The Envelope, Please. Bayesian: have a prior distribution over X, P(X). Let Y = $ in the envelope you selected; let Z = $ in the other envelope. Open your envelope to determine Y. If Y > E[Z | Y], keep your envelope; otherwise switch. No inconsistency! Opening the envelope provides data to compute P(X | Y) and thereby compute E[Z | Y]. Of course, there's the issue of how you determined your prior distribution over X. Bayesian: it doesn't matter how you determined your prior, but you must have one (whatever it is). Imagine if the envelope you opened contained $20.01: then X = $10.005 is impossible (no half cents exist), so the other envelope must hold $40.02.
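(A small simulation sketch of this resolution, assuming purely for illustration an Exponential prior with mean 20 over X; with a proper prior, E[Z | Y] is no longer uniformly (5/4)Y.)

import numpy as np

rng = np.random.default_rng(0)
n_trials = 1_000_000
X = rng.exponential(scale=20.0, size=n_trials)   # assumed prior over X
picked_small = rng.random(n_trials) < 0.5        # which envelope we opened
Y = np.where(picked_small, X, 2 * X)             # $ in opened envelope
Z = np.where(picked_small, 2 * X, X)             # $ in the other envelope

# Condition on (approximately) Y = y and compare with (5/4)*Y:
# switching helps for small Y, hurts for large Y under this prior.
for y in (10.0, 60.0):
    mask = np.abs(Y - y) < 0.5
    print(f"E[Z | Y ~ {y}] ~ {Z[mask].mean():.1f}  vs (5/4)*Y = {1.25 * y}")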

The Dreaded Half Cent

Envelope Summary: Probabilities are beliefs. Incorporating prior beliefs is useful.

Especially for one-shot learning

One-Shot Learning. Single training example: [image]. Test set: [images].

Priors for Parameter Estimation?

Flash Back: Bayes' Theorem (θ = model parameters, D = data):
$$P(\theta \mid D) = \frac{P(D \mid \theta)\, P(\theta)}{P(D)}$$
Posterior ∝ Likelihood × Prior. Likelihood: you've seen this before (in the context of MLE); it is the probability of the data given the probability model (parameter θ). Prior: before seeing any data, what is the belief about the model, i.e., what is the distribution over parameters θ. Posterior: after seeing data, what is the belief about the model. After data D is observed, we have a posterior distribution p(θ | D) over parameters θ conditioned on the data. Use this to predict new data.

Computing P(θ | D). Bayes' Theorem (θ = model parameters, D = data):
$$P(\theta \mid D) = \frac{P(D \mid \theta)\, P(\theta)}{P(D)}$$
We have the prior P(θ) and can compute P(D | θ). But how do we calculate P(D)? Complicated answer:
$$P(D) = \int P(D \mid \theta)\, P(\theta)\, d\theta$$
Easy answer: it does not depend on θ, so ignore it. It is just a constant that forces P(θ | D) to integrate to 1.
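(A minimal sketch of this idea: maximize P(D | θ)P(θ) on a grid and never compute P(D); the coin-flip counts below are assumed for illustration.)

import numpy as np

theta = np.linspace(1e-4, 1 - 1e-4, 10_001)   # candidate coin biases
n_heads, n_tails = 58, 42                     # assumed data

log_prior = np.zeros_like(theta)              # uniform prior (an assumption)
log_lik = n_heads * np.log(theta) + n_tails * np.log(1 - theta)
log_post_unnorm = log_lik + log_prior         # posterior up to the constant P(D)

print("posterior mode:", theta[np.argmax(log_post_unnorm)])   # ~0.58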

Most important slide of today

Maximum A Posteriori. Recall the Maximum Likelihood Estimator (MLE) of θ:
$$\hat{\theta}_{MLE} = \arg\max_\theta \prod_{i=1}^n f(X_i \mid \theta)$$
The Maximum A Posteriori (MAP) estimator of θ:
$$\hat{\theta}_{MAP} = \arg\max_\theta f(\theta \mid X_1, \ldots, X_n) = \arg\max_\theta \frac{f(X_1, \ldots, X_n \mid \theta)\, g(\theta)}{h(X_1, \ldots, X_n)} = \arg\max_\theta g(\theta) \prod_{i=1}^n f(X_i \mid \theta)$$
where g(θ) is the prior distribution of θ and the denominator h(X₁, ..., X_n) does not depend on θ. As before, it can often be more convenient to use the log:
$$\hat{\theta}_{MAP} = \arg\max_\theta \left( \log(g(\theta)) + \sum_{i=1}^n \log(f(X_i \mid \theta)) \right)$$
The MAP estimate is the mode of the posterior distribution.

Maximum A Posteriori. Choose the value of θ (the estimated parameter) that maximizes the log prior plus the sum of the log likelihoods:
$$\hat{\theta}_{MAP} = \arg\max_\theta \left( \log(g(\theta)) + \sum_{i=1}^n \log(f(X_i \mid \theta)) \right)$$
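(A numerical sketch of exactly this maximization, assuming a Beta(3, 8) prior and made-up coin flips; the closed form for comparison is (n + a − 1)/(n + m + a + b − 2).)

import numpy as np
from scipy.optimize import minimize_scalar
from scipy.stats import beta

a, b = 3.0, 8.0                                     # assumed prior
flips = np.array([1, 1, 0, 1, 0, 1, 1, 0, 1, 1])   # made-up data: 7 heads, 3 tails

def neg_log_posterior(t):
    # Negative of log g(theta) + sum_i log f(x_i | theta)
    log_prior = beta.logpdf(t, a, b)
    log_lik = np.sum(flips * np.log(t) + (1 - flips) * np.log(1 - t))
    return -(log_prior + log_lik)

res = minimize_scalar(neg_log_posterior, bounds=(1e-6, 1 - 1e-6), method="bounded")
print("theta_MAP ~", res.x)                         # 9/19 ~ 0.474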

Gotta get that intuition

P(θ | D) for Beta and Bernoulli. Prior: θ ~ Beta(a, b); D = {n heads, m tails}. Estimate p:
$$f(\theta \mid D) = \frac{P(D \mid \theta)\, f(\theta)}{P(D)} = \frac{C_1\, \theta^n (1-\theta)^m \cdot C_2\, \theta^{a-1} (1-\theta)^{b-1}}{P(D)} = C_3\, \theta^{n+a-1} (1-\theta)^{m+b-1}$$
By definition, f(θ | D) is Beta(a + n, b + m).
$$\hat{\theta}_{MAP} = \arg\max_\theta f(\theta \mid D) = \arg\max_\theta \big[ (n + a - 1)\log\theta + (m + b - 1)\log(1-\theta) \big]$$
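(Since the posterior is Beta(a + n, b + m), its mode gives the MAP estimate in closed form; a tiny sketch with assumed counts:)

def beta_bernoulli_map(n_heads, m_tails, a, b):
    # Mode of the Beta(a + n, b + m) posterior
    return (n_heads + a - 1) / (n_heads + m_tails + a + b - 2)

print(beta_bernoulli_map(7, 3, a=3, b=8))   # 9/19 ~ 0.474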

Hyperparameters. [Graphical model: fixed hyperparameters a, b point to p, which points to X₁, X₂, ..., X_n.] Hyperparameters a and b are fixed. Prior: p ~ Beta(a, b). Data distribution: X_i ~ Bern(p). MAP will estimate the most likely value of p for this model.

Where do Ya Get Them P(θ)? θ is the probability a coin turns up heads. Model θ with 2 different priors: P₁(θ) is Beta(3, 8) (blue); P₂(θ) is Beta(7, 4) (red). They look pretty different! Now flip 100 coins; get 58 heads and 42 tails. What do the posteriors look like?

It's Like Having Twins. argmax returns the mode. As long as we collect enough data, the posteriors will converge to the true value!
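(A quick sketch of this convergence, applying the closed-form posterior mode to both priors from the previous slide:)

# 58 heads, 42 tails observed; posterior mode = (a + n - 1) / (a + b + n + m - 2)
for a, b in [(3, 8), (7, 4)]:
    a_post, b_post = a + 58, b + 42
    mode = (a_post - 1) / (a_post + b_post - 2)
    print(f"Beta({a},{b}) prior -> Beta({a_post},{b_post}) posterior, mode {mode:.3f}")
# modes 0.550 and 0.587: very different priors, nearly identical posteriors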

Conjugate Distributions Without Tears. Just for review: have a coin with unknown probability θ of heads. Our prior (subjective) belief is that θ ~ Beta(a, b). Now flip the coin k = n + m times, getting n heads and m tails. Posterior density: (θ | n heads, m tails) ~ Beta(a + n, b + m). Beta is conjugate for the Bernoulli, Binomial, Geometric, and Negative Binomial. a and b are called hyperparameters: it is as if we saw (a + b − 2) imaginary trials, of those (a − 1) successes. For a coin you never flipped before, use Beta(x, x) to denote you think the coin is likely to be fair; how strongly you feel the coin is fair is a function of x.

Mo' Beta

Gonna Need Priors

Parameter          Distribution for Parameter
Bernoulli p        Beta
Binomial p         Beta
Poisson λ          Gamma
Exponential λ      Gamma
Multinomial p_i    Dirichlet
Normal µ           Normal
Normal σ²          Inverse Gamma

Don't need to know Inverse Gamma. But it will know you

Multinomial is Multiple Times the Fun. The Dirichlet(a₁, a₂, ..., a_m) distribution is conjugate for the Multinomial. Dirichlet generalizes Beta in the same way Multinomial generalizes Bernoulli:
$$f(X_1 = x_1, X_2 = x_2, \ldots, X_m = x_m) = K \prod_{i=1}^m x_i^{a_i - 1}$$
Intuitive understanding of hyperparameters: it is as if we saw $\left(\sum_{i=1}^m a_i\right) - m$ imaginary trials, with (a_i − 1) of outcome i. Updating to get the posterior distribution: after observing n₁ + n₂ + ... + n_m new trials, with n_i of outcome i, the posterior distribution is Dirichlet(a₁ + n₁, a₂ + n₂, ..., a_m + n_m).
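(A minimal sketch of the Dirichlet update, with assumed hyperparameters and counts:)

import numpy as np

a = np.array([2.0, 2.0, 2.0])        # assumed Dirichlet hyperparameters
n = np.array([5, 1, 4])              # assumed observed counts per outcome
a_post = a + n                       # posterior is Dirichlet(a_i + n_i)
map_p = (a_post - 1) / (a_post.sum() - len(a_post))   # mode of the Dirichlet
print(a_post, map_p)                 # [7. 3. 6.]  [6/13, 2/13, 5/13]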

Best Short Film in the Dirichlet Category. And now a cool animation of Dirichlet(a, a, a). This is actually the log density (but you get the idea). Thanks Wikipedia!

Example: Estimating Die Parameters

Your Happy Laplace. Recall the example of 6-sided die rolls: X ~ Multinomial(p₁, p₂, p₃, p₄, p₅, p₆). Roll n = 12 times. Result: 3 ones, 2 twos, 0 threes, 3 fours, 1 five, 3 sixes. MLE: p₁ = 3/12, p₂ = 2/12, p₃ = 0/12, p₄ = 3/12, p₅ = 1/12, p₆ = 3/12. A Dirichlet prior allows us to pretend we saw each outcome k times before. MAP estimate:
$$p_i = \frac{X_i + k}{n + mk}$$
Laplace's law of succession: the idea above with k = 1. Laplace estimate:
$$p_i = \frac{X_i + 1}{n + m}$$
Laplace: p₁ = 4/18, p₂ = 3/18, p₃ = 1/18, p₄ = 4/18, p₅ = 2/18, p₆ = 4/18. We no longer have 0 probability of rolling a three!
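(A small sketch reproducing the slide's numbers; Laplace's estimate is the MAP under a Dirichlet prior with every a_i = 2, i.e., k = 1 pretend sighting of each face:)

import numpy as np

counts = np.array([3, 2, 0, 3, 1, 3])                  # the 12 rolls above
mle = counts / counts.sum()                            # zero probability for a three
laplace = (counts + 1) / (counts.sum() + len(counts))  # (X_i + 1) / (n + m)
print("MLE:    ", mle)       # [3, 2, 0, 3, 1, 3] / 12
print("Laplace:", laplace)   # [4, 3, 1, 4, 2, 4] / 18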

Good Times with Gamma. The Gamma(k, θ) distribution is conjugate for the Poisson. (It is also conjugate for the Exponential, but we won't delve into that.) Intuitive understanding of hyperparameters: it is as if we saw k total imaginary events during θ prior time periods. Updating to get the posterior distribution: after observing n events during the next t time periods, the posterior distribution is Gamma(k + n, θ + t). Example: Gamma(10, 5) means we saw 10 events in 5 time periods, like observing at rate λ = 2. Now see 11 events in the next 2 time periods → Gamma(21, 7), equivalent to an updated rate λ = 3.
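(A tiny sketch of this update, using the slide's numbers:)

def gamma_poisson_update(k, theta, n_events, t_periods):
    # Posterior hyperparameters after observing n_events in t_periods
    return k + n_events, theta + t_periods

k, theta = gamma_poisson_update(10, 5, n_events=11, t_periods=2)
print(k, theta, "implied rate:", k / theta)   # 21 7 implied rate: 3.0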

Is Peer Grading Accurate Enough? Peer grading on Coursera HCI: 31,067 peer grades for 3,607 students. Tuned Models of Peer Assessment. C. Piech, J. Huang, A. Ng, D. Koller

Is Peer Grading Accurate Enough? 1. Defined random variables for: true grade (s_i) for assignment i; observed score (z_i^j) for assignment i from grader j; bias (b_j) for each grader j; variance (r_j) for each grader j. 2. Designed a probabilistic model that defined the distributions for all random variables:
$$z_i^j \sim N\big(\mu = s_i + b_j,\ \sigma = \sqrt{r_j}\big)$$
$$s_i \sim N(\mu_0, \gamma_0), \qquad b_j \sim N(0, \eta_0), \qquad r_j \sim \text{InvGamma}(\alpha_0, \beta_0)$$
(each subscript-0 symbol is a fixed hyperparameter). Tuned Models of Peer Assessment. C. Piech, J. Huang, A. Ng, D. Koller

Is Peer Grading Accurate Enough? 1. Defined random variables for: true grade (s_i) for assignment i; observed score (z_i^j) for assignment i; bias (b_j) for each grader j; variance (r_j) for each grader j. 2. Designed a probabilistic model that defined the distributions for all random variables. 3. Found variable assignments using MAP estimation given the observed data. Tuned Models of Peer Assessment. C. Piech, J. Huang, A. Ng, D. Koller

The last estimator has risen

Next time: Machine Learning algorithms

It's Normal to Be Normal. The Normal(µ₀, σ₀²) distribution is conjugate for the Normal (with unknown µ, known σ²). Intuitive understanding of hyperparameters: a priori, believe the true µ is distributed ~ N(µ₀, σ₀²). Updating to get the posterior distribution: after observing n data points, the posterior distribution for µ is
$$N\left( \frac{\dfrac{\mu_0}{\sigma_0^2} + \dfrac{\sum_{i=1}^n x_i}{\sigma^2}}{\dfrac{1}{\sigma_0^2} + \dfrac{n}{\sigma^2}},\ \left( \frac{1}{\sigma_0^2} + \frac{n}{\sigma^2} \right)^{-1} \right)$$
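(A minimal sketch of this posterior update; the data below is simulated purely for illustration:)

import numpy as np

def normal_posterior(x, mu0, sigma0_sq, sigma_sq):
    # Posterior (mean, variance) of mu given data x, with known sigma^2
    n = len(x)
    precision = 1 / sigma0_sq + n / sigma_sq
    mean = (mu0 / sigma0_sq + np.sum(x) / sigma_sq) / precision
    return mean, 1 / precision

x = np.random.default_rng(1).normal(loc=3.0, scale=2.0, size=50)   # assumed data
print(normal_posterior(x, mu0=0.0, sigma0_sq=10.0, sigma_sq=4.0))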