Elementary manipulations of probabilities


Elementary manipulations of probabilities
Set probability of a multi-valued r.v.: P({X = Odd}) = P(1) + P(3) + P(5) = 1/6 + 1/6 + 1/6 = 1/2. More generally, $P(X \in \{x_i, \ldots, x_j\}) = \sum_{k=i}^{j} P(X = x_k)$.
Multivariate distribution. Joint probability: P(X = true ∧ Y = true). Marginal probability: $P(X = x_i) = \sum_{j} P(X = x_i, Y = y_j)$.

Joint probability
A joint probability distribution for a set of RVs gives the probability of every atomic event (sample point). P(Flu, DrinkBeer) = a 2 x 2 matrix of values (normalized, i.e., sums to 1):

                  DrinkBeer = t    DrinkBeer = f
    Flu = t           0.005            0.02
    Flu = f           0.195            0.78

P(Flu, DrinkBeer, Headache) = ? Every question about a domain can be answered by the joint distribution, as we will see later.
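As a minimal sketch (the dictionary layout and variable names below are illustrative choices; the four probabilities are the table values above), the joint can be stored keyed by assignments and marginals obtained by summing out the other variable:

```python
# Joint distribution P(Flu, DrinkBeer), keyed by (flu, drink_beer); values from the table above.
joint = {
    (True, True): 0.005,
    (True, False): 0.02,
    (False, True): 0.195,
    (False, False): 0.78,
}

# Marginals: sum out the variable we do not care about.
p_flu = sum(p for (flu, _), p in joint.items() if flu)      # P(Flu = true)       -> 0.025
p_beer = sum(p for (_, beer), p in joint.items() if beer)   # P(DrinkBeer = true) -> 0.2
print(p_flu, p_beer)
```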

Conditional probability
P(X|Y) = the fraction of worlds in which X is true that also have Y true.
H = "having a headache", F = "coming down with flu". P(H) = 1/10, P(F) = 1/40, P(H|F) = 1/2 = the fraction of flu-inflicted worlds in which you have a headache.
Definition: P(X|Y) = P(X ∧ Y) / P(Y). Corollary (the chain rule): P(X ∧ Y) = P(X|Y) P(Y).
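A quick numeric check of the definition and the chain rule with the slide's numbers (a sketch; the Python variable names are just for illustration):

```python
p_h = 1 / 10          # P(Headache)
p_f = 1 / 40          # P(Flu)
p_h_given_f = 1 / 2   # P(Headache | Flu)

# Chain rule: P(H and F) = P(H | F) * P(F)
p_h_and_f = p_h_given_f * p_f

# Definition of conditional probability recovers P(H | F) from the joint:
assert abs(p_h_and_f / p_f - p_h_given_f) < 1e-12
print(p_h_and_f)  # 0.0125
```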

MLE
Objective function: $\ell(\theta; D) = \log P(D \mid \theta) = \log\big(\theta^{n_h}(1-\theta)^{n_t}\big) = n_h \log\theta + n_t \log(1-\theta)$.
We need to maximize this w.r.t. $\theta$. Take the derivative w.r.t. $\theta$ and set it to zero: $\frac{\partial \ell}{\partial \theta} = \frac{n_h}{\theta} - \frac{n_t}{1-\theta} = 0 \;\Rightarrow\; \hat{\theta}_{MLE} = \frac{n_h}{n}$, or $\hat{\theta}_{MLE} = \frac{1}{n}\sum_i x_i$.
Sufficient statistics (frequency as sample mean): the counts $n_h$ and $n_t$, where $n_h = \sum_i x_i$, are sufficient statistics of the data D.
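A minimal sketch of the closed-form estimator, using a made-up sequence of coin tosses (not the slide's data):

```python
data = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1]   # 1 = head, 0 = tail (illustrative)

n = len(data)
n_h = sum(data)       # sufficient statistic: number of heads
theta_mle = n_h / n   # theta_MLE = n_h / n, the sample mean of the 0/1 outcomes
print(theta_mle)      # 0.6
```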

The Bayes rule
What we have just done leads to the following general expression: $P(Y \mid X) = \frac{P(X \mid Y)\,P(Y)}{P(X)}$. This is Bayes' rule.

More general forms of Bayes rule
$P(Y \mid X) = \frac{P(X \mid Y)\,P(Y)}{P(X \mid Y)\,P(Y) + P(X \mid \neg Y)\,P(\neg Y)}$
With background evidence Z: $P(Y \mid X, Z) = \frac{P(X \mid Y, Z)\,P(Y \mid Z)}{P(X \mid Z)}$, e.g., P(Flu | Headache, DrinkBeer).
For a multi-valued Y: $P(Y = y_i \mid X) = \frac{P(X \mid Y = y_i)\,P(Y = y_i)}{\sum_{j=1}^{S} P(X \mid Y = y_j)\,P(Y = y_j)}$.

Probabilistic inference
H = "having a headache", F = "coming down with flu". P(H) = 1/10, P(F) = 1/40, P(H|F) = 1/2.
One day you wake up with a headache. You come up with the following reasoning: "Since 50% of flus are associated with headaches, I must have a 50-50 chance of coming down with flu."
Is this reasoning correct?

Probabilistic inference
H = "having a headache", F = "coming down with flu". P(H) = 1/10, P(F) = 1/40, P(H|F) = 1/2.
The problem: P(F|H) = ?
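Plugging the slide's numbers into Bayes' rule gives P(F|H) = P(H|F) P(F) / P(H) = (1/2)(1/40)/(1/10) = 1/8, so the 50-50 reasoning above is wrong. A one-line check:

```python
p_h, p_f, p_h_given_f = 1 / 10, 1 / 40, 1 / 2

p_f_given_h = p_h_given_f * p_f / p_h   # Bayes' rule
print(p_f_given_h)                      # 0.125, far from 0.5
```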

Prior distribution
Suppose that our propositions about the possible worlds have a "causal flow", e.g., Virus → Flu → Headache ← DrinkBeer.
Prior or unconditional probabilities of propositions, e.g., P(Flu = true) = 0.025 and P(DrinkBeer = true) = 0.2, correspond to belief prior to the arrival of any (new) evidence.
A probability distribution gives values for all possible assignments, e.g., P(DrinkBeer) = [0.01, 0.09, 0.1, 0.8] (normalized, i.e., sums to 1).

Posterior (conditional) probability
Conditional or posterior (see later) probabilities, e.g., P(Flu | Headache) = 0.178: valid given that headache is all I know, NOT "if headache, then 17.8% chance of flu".
Representation of conditional distributions: P(Flu | Headache) = a 2-element vector of 2-element vectors.
If we know more, e.g., DrinkBeer is also given, then we have P(Flu | Headache, DrinkBeer) = 0.070. This effect is known as "explaining away"!
P(Flu | Headache, Flu) = 1. Note: the less or more certain belief remains valid after more evidence arrives, but it is not always useful.
New evidence may be irrelevant, allowing simplification, e.g., P(Flu | Headache, SteelersWin) = P(Flu | Headache). This kind of inference, sanctioned by domain knowledge, is crucial.

Inference by enumeration
Start with a joint distribution. Building a joint distribution of M = 3 variables: make a truth table listing all combinations of values of your variables (if there are M Boolean variables then the table will have 2^M rows), and for each combination of values, say how probable it is (normalized, i.e., sums to 1).

    Flu  DrinkBeer  Headache   Prob
     0       0         0       0.4
     0       0         1       0.
     0       1         0       0.7
     0       1         1       0.
     1       0         0       0.05
     1       0         1       0.05
     1       1         0       0.05
     1       1         1       0.05

Inference with the joint
Once you have the joint distribution (JD), you can ask for the probability of any event E consistent with your query by summing the matching atomic events: $P(E) = \sum_{\text{rows } i \text{ consistent with } E} P(\text{row}_i)$ (using the joint table above).

Inference with the joint: compute marginals
P(Flu ∧ Headache) = the sum of the entries of the joint table above in which Flu = 1 and Headache = 1.

Inference with the joint: compute marginals
P(Headache) = the sum of the entries of the joint table above in which Headache = 1.

Inference with the joint: compute conditionals
$P(E_1 \mid E_2) = \frac{P(E_1 \wedge E_2)}{P(E_2)} = \frac{\sum_{\text{rows consistent with } E_1 \wedge E_2} P(\text{row})}{\sum_{\text{rows consistent with } E_2} P(\text{row})}$.

Inference with the joint: compute conditionals
P(Flu | Headache) = P(Flu ∧ Headache) / P(Headache), with both terms read off the joint table above.
General idea: compute the distribution of the query variable by fixing the evidence variables and summing over the hidden variables.

Summary: inference by enumeration
Let X be all the variables. Typically, we want the posterior joint distribution of the query variables Y given specific values e for the evidence variables E. Let the hidden variables be H = X − Y − E.
Then the required summation of joint entries is done by summing out the hidden variables: $P(Y \mid E = e) = \alpha\, P(Y, E = e) = \alpha \sum_{h} P(Y, E = e, H = h)$.
The terms in the summation are joint entries, because Y, E, and H together exhaust the set of random variables.
Obvious problems: worst-case time complexity $O(d^n)$, where d is the largest arity; space complexity $O(d^n)$ to store the joint distribution; how to find the numbers for the $O(d^n)$ entries???
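A small sketch of inference by enumeration over a three-variable Boolean joint. The table values here are hypothetical (the slide's own entries are not fully legible above); `query` fixes the evidence, sums over everything else, and normalizes:

```python
from itertools import product

VARS = ("Flu", "DrinkBeer", "Headache")
# Hypothetical joint table: one probability per truth-table row, summing to 1.
probs = [0.60, 0.10, 0.05, 0.05, 0.05, 0.05, 0.04, 0.06]
joint = dict(zip(product([False, True], repeat=3), probs))

def query(target, evidence):
    """Return P(target | evidence) by enumeration: fix evidence, sum out hidden vars, normalize."""
    def matches(assignment, constraints):
        return all(assignment[VARS.index(v)] == val for v, val in constraints.items())
    numerator = sum(p for a, p in joint.items() if matches(a, {**evidence, **target}))
    denominator = sum(p for a, p in joint.items() if matches(a, evidence))
    return numerator / denominator

print(query({"Flu": True}, {"Headache": True}))   # P(Flu | Headache) under the toy numbers
```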

Conditional independence
Write out the full joint distribution using the chain rule:
P(Headache, Flu, Virus, DrinkBeer)
  = P(Headache | Flu, Virus, DrinkBeer) P(Flu, Virus, DrinkBeer)
  = P(Headache | Flu, Virus, DrinkBeer) P(Flu | Virus, DrinkBeer) P(Virus | DrinkBeer) P(DrinkBeer)
Assume independence and conditional independence:
  = P(Headache | Flu, DrinkBeer) P(Flu | Virus) P(Virus) P(DrinkBeer)
I.e., how many independent parameters?
In most cases, the use of conditional independence reduces the size of the representation of the joint distribution from exponential in n to linear in n. Conditional independence is our most basic and robust form of knowledge about uncertain environments.

Rules of independence --- by examples
P(Virus | DrinkBeer) = P(Virus) iff Virus is independent of DrinkBeer.
P(Flu | Virus, DrinkBeer) = P(Flu | Virus) iff Flu is independent of DrinkBeer, given Virus.
P(Headache | Flu, Virus, DrinkBeer) = P(Headache | Flu, DrinkBeer) iff Headache is independent of Virus, given Flu and DrinkBeer.

Marginal and conditional independence
Recall that for events E (i.e., X = x) and F (say, Y = y), the conditional probability of E given F, written as P(E|F), is P(E and F) / P(F) = the probability that both E and F are true, given that F is true.
E and F are statistically independent if P(E) = P(E|F), i.e., the probability that E is true doesn't depend on whether F is true; or equivalently, P(E and F) = P(E) P(F).
E and F are conditionally independent given H if P(E | F, H) = P(E | H), or equivalently, P(F | E, H) = P(F | H).

Why knowledge of independence is useful
Lower complexity (time, space, search) when working with tables such as the joint above.
Motivates efficient inference for all kinds of queries. Stay tuned!!
Structured knowledge about the domain: easy to learn (both from an expert and from data), easy to grow.

Where do probability distributions come from?
Idea One: humans, domain experts.
Idea Two: simpler probability facts and some algebra, e.g., combining marginals and conditionals via the product rule to fill in a joint table like the one above.
Idea Three: learn them from data! A good chunk of this course is essentially about various ways of learning various forms of them!

Density estimation
A density estimator learns a mapping from a set of attributes to a probability.
Often known as parameter estimation if the distribution form is specified (binomial, Gaussian, ...).
Important issues:
  Nature of the data (iid, correlated, ...)
  Objective function (MLE, MAP, ...)
  Algorithm (simple algebra, gradient methods, EM, ...)
  Evaluation scheme (likelihood on test data, predictability, consistency, ...)

Parameter learning from iid data
Goal: estimate distribution parameters θ from a dataset of independent, identically distributed (iid), fully observed training cases D = {x_1, ..., x_N}.
Maximum likelihood estimation (MLE):
1. One of the most common estimators.
2. With the iid and full-observability assumptions, write L as the likelihood of the data: $L(\theta) = P(x_1, x_2, \ldots, x_N; \theta) = \prod_{i=1}^{N} P(x_i; \theta)$.
3. Pick the setting of parameters most likely to have generated the data we saw: $\theta^{*} = \arg\max_{\theta} L(\theta) = \arg\max_{\theta} \log L(\theta)$.

Example 1: Bernoulli model
Data: we observed N iid coin tosses: D = {1, 0, 1, 1, 0}.
Representation: binary r.v. x ∈ {0, 1}.
Model: P(x) = 1 − θ for x = 0 and θ for x = 1, i.e., $P(x) = \theta^{x}(1-\theta)^{1-x}$.
How to write the likelihood of a single observation x_i? $P(x_i) = \theta^{x_i}(1-\theta)^{1-x_i}$.
The likelihood of the dataset D = {x_1, ..., x_N}: $P(x_1, \ldots, x_N \mid \theta) = \prod_{i=1}^{N} P(x_i \mid \theta) = \theta^{\sum_i x_i}\,(1-\theta)^{\sum_i (1 - x_i)} = \theta^{\#\text{heads}}\,(1-\theta)^{\#\text{tails}}$.
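A small sketch evaluating this likelihood for the tosses shown above; among the sampled values it is largest at the MLE $n_h / N = 0.6$:

```python
D = [1, 0, 1, 1, 0]   # coin tosses from the example above (1 = head, 0 = tail)
n_h = sum(D)
n_t = len(D) - n_h

def likelihood(theta):
    return theta ** n_h * (1 - theta) ** n_t   # L(theta) = theta^{n_h} (1 - theta)^{n_t}

for theta in (0.2, 0.4, 0.6, 0.8):
    print(theta, likelihood(theta))            # largest value at theta = 0.6 = n_h / N
```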

MLE for discrete joint distributions
More generally, it is easy to show that $\hat{P}(\text{event } i) = \dfrac{\#\,\text{records in which event } i \text{ is true}}{\text{total number of records}}$, i.e., each entry of a joint table like the one above is estimated by its relative frequency in the data.
This is an important but sometimes not so effective learning algorithm!
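A tiny sketch of this counting estimator over some made-up records (the record values are illustrative, not from the slide):

```python
from collections import Counter

# Each record is an assignment to (Flu, Headache); illustrative data only.
records = [(False, False), (False, True), (True, True), (False, False), (False, False)]

counts = Counter(records)
p_hat = {event: c / len(records) for event, c in counts.items()}   # relative frequencies
print(p_hat[(True, True)])   # #records with Flu = Headache = true / total = 0.2
```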

Example 2: univariate normal
Data: we observed N iid real samples, e.g., D = {−0.1, 0, 1, −5.2, 2, 3}.
Model: $p(x) = (2\pi\sigma^2)^{-1/2} \exp\{-(x-\mu)^2 / 2\sigma^2\}$.
Log likelihood: $\ell(\mu, \sigma; D) = \log P(D \mid \mu, \sigma) = -\frac{N}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^{N}(x_i - \mu)^2$.
MLE: take derivatives and set to zero: $\frac{\partial \ell}{\partial \mu} = \frac{1}{\sigma^2}\sum_i (x_i - \mu) = 0$ and $\frac{\partial \ell}{\partial \sigma^2} = -\frac{N}{2\sigma^2} + \frac{1}{2\sigma^4}\sum_i (x_i - \mu)^2 = 0$, which give $\mu_{MLE} = \frac{1}{N}\sum_i x_i$ and $\sigma^2_{MLE} = \frac{1}{N}\sum_i (x_i - \mu_{MLE})^2$.
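A sketch of the closed-form estimates on a sample like the one above (note the divide-by-N, not N−1, in the variance, which is what the MLE derivation gives):

```python
import numpy as np

D = np.array([-0.1, 0.0, 1.0, -5.2, 2.0, 3.0])   # illustrative iid sample

mu_mle = D.mean()                          # (1/N) * sum_i x_i
sigma2_mle = ((D - mu_mle) ** 2).mean()    # (1/N) * sum_i (x_i - mu)^2
print(mu_mle, sigma2_mle)
```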

Overfitting
Recall that for the Bernoulli distribution, we have $\hat{\theta}^{ML}_{head} = \frac{n_{head}}{n_{head} + n_{tail}}$.
What if we tossed too few times, so that we saw zero heads? We would have $\hat{\theta}^{ML}_{head} = 0$, and we would predict that the probability of seeing a head next is zero!!!
The rescue: $\hat{\theta}'_{head} = \frac{n_{head} + n'}{n_{head} + n' + n_{tail} + n'}$, where n' is known as the pseudo- (imaginary) count.
But can we make this more formal?
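A minimal sketch of the pseudo-count rescue; with zero observed heads it predicts a small but nonzero head probability instead of exactly zero:

```python
def smoothed_head_probability(n_head, n_tail, n_prime=1):
    # (n_head + n') / (n_head + n' + n_tail + n'), with n' imaginary counts per outcome
    return (n_head + n_prime) / (n_head + n_prime + n_tail + n_prime)

print(smoothed_head_probability(0, 3))   # 0.2, instead of the MLE's 0.0
```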

The Bayesian theory
The Bayesian theory: e.g., for data D and model M, P(M|D) = P(D|M) P(M) / P(D).
The posterior equals the likelihood times the prior, up to a (normalizing) constant.
This allows us to capture uncertainty about the model in a principled way.

Hierarchical Bayesian models
θ are the parameters for the likelihood p(x | θ); α are the parameters for the prior p(θ | α).
We can have hyper-hyper-parameters, etc. We stop when the choice of hyper-parameters makes no difference to the marginal likelihood; typically we make the top-level hyperparameters constants.
Where do we get the prior? Intelligent guesses; empirical Bayes (Type-II maximum likelihood), i.e., computing point estimates of α: $\hat{\alpha}_{MLE} = \arg\max_{\alpha} p(D \mid \alpha)$.

Bayesian estimation for Bernoulli
Beta distribution (the prior): $P(\theta; \alpha, \beta) = \frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha)\Gamma(\beta)}\,\theta^{\alpha-1}(1-\theta)^{\beta-1} = B(\alpha, \beta)\,\theta^{\alpha-1}(1-\theta)^{\beta-1}$.
Posterior distribution of θ: $p(\theta \mid x_1, \ldots, x_N) = \frac{p(x_1, \ldots, x_N \mid \theta)\, p(\theta)}{p(x_1, \ldots, x_N)} \propto \theta^{n_h}(1-\theta)^{n_t} \times \theta^{\alpha-1}(1-\theta)^{\beta-1} = \theta^{n_h + \alpha - 1}(1-\theta)^{n_t + \beta - 1}$.
Notice the isomorphism of the posterior to the prior; such a prior is called a conjugate prior.

Bayesian estimation for Bernoulli, cont'd
Posterior distribution of θ: $p(\theta \mid x_1, \ldots, x_N) \propto \theta^{n_h + \alpha - 1}(1-\theta)^{n_t + \beta - 1}$.
Maximum a posteriori (MAP) estimation: $\theta_{MAP} = \arg\max_{\theta} \log p(\theta \mid x_1, \ldots, x_N)$.
Posterior mean estimation: $\theta_{Bayes} = \int \theta\, p(\theta \mid D)\, d\theta = C \int \theta \cdot \theta^{n_h + \alpha - 1}(1-\theta)^{n_t + \beta - 1}\, d\theta = \frac{n_h + \alpha}{N + \alpha + \beta}$.
Prior strength: A = α + β. A can be interpreted as the size of an imaginary data set from which we obtain the pseudo-counts; the Beta parameters can be understood as pseudo-counts.
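A short sketch of the conjugate update and the two point estimates; the hyperparameters and counts below are illustrative, and the MAP formula is the standard Beta posterior mode:

```python
alpha, beta = 2.0, 2.0     # Beta prior hyperparameters (illustrative)
n_h, n_t = 3, 1            # observed heads / tails (illustrative)

# Conjugacy: Beta(alpha, beta) prior + Bernoulli data -> Beta(alpha + n_h, beta + n_t) posterior.
a_post, b_post = alpha + n_h, beta + n_t

theta_map = (a_post - 1) / (a_post + b_post - 2)   # posterior mode (MAP)
theta_mean = a_post / (a_post + b_post)            # posterior mean: (n_h + alpha) / (N + alpha + beta)
print(theta_map, theta_mean)
```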

Effect of prior strength
Suppose we have a uniform-style prior (α = β = A'/2), and we observe n_h = 2, n_t = 8.
Weak prior, A' = 2. Posterior prediction: $p(x = h \mid n_h = 2, n_t = 8, \alpha = \beta = 1) = \frac{2 + 1}{10 + 2} = 0.25$.
Strong prior, A' = 20. Posterior prediction: $p(x = h \mid n_h = 2, n_t = 8, \alpha = \beta = 10) = \frac{2 + 10}{10 + 20} = 0.40$.
However, if we have enough data, it washes away the prior. E.g., with n_h = 200, n_t = 800, the estimates under the weak and strong priors are $\frac{200 + 1}{1000 + 2}$ and $\frac{200 + 10}{1000 + 20}$, respectively, both of which are close to 0.2.
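The two predictions above, and the large-data case, in a few lines (the posterior-predictive formula is the posterior mean from the previous slide):

```python
def posterior_predictive(n_h, n_t, alpha, beta):
    # P(next toss = head | data) = (n_h + alpha) / (n_h + n_t + alpha + beta)
    return (n_h + alpha) / (n_h + n_t + alpha + beta)

print(posterior_predictive(2, 8, 1, 1))        # weak prior,   A' = 2  -> 0.25
print(posterior_predictive(2, 8, 10, 10))      # strong prior, A' = 20 -> 0.4
print(posterior_predictive(200, 800, 1, 1))    # ~0.2006: enough data washes the prior away
print(posterior_predictive(200, 800, 10, 10))  # ~0.2059
```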

Bayesian estimation for the normal distribution
Normal prior: $P(\mu) = (2\pi\tau^2)^{-1/2} \exp\{-(\mu - \mu_0)^2 / 2\tau^2\}$.
Joint probability: $P(x, \mu) = (2\pi\sigma^2)^{-N/2} \exp\Big\{-\frac{1}{2\sigma^2}\sum_{i=1}^{N}(x_i - \mu)^2\Big\} \times (2\pi\tau^2)^{-1/2} \exp\{-(\mu - \mu_0)^2 / 2\tau^2\}$.
Posterior: $P(\mu \mid x) = (2\pi\tilde{\sigma}^2)^{-1/2} \exp\{-(\mu - \tilde{\mu})^2 / 2\tilde{\sigma}^2\}$, where $\tilde{\mu} = \frac{N/\sigma^2}{N/\sigma^2 + 1/\tau^2}\,\bar{x} + \frac{1/\tau^2}{N/\sigma^2 + 1/\tau^2}\,\mu_0$ and $\tilde{\sigma}^2 = \big(\frac{N}{\sigma^2} + \frac{1}{\tau^2}\big)^{-1}$, with sample mean $\bar{x} = \frac{1}{N}\sum_{i=1}^{N} x_i$.
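A sketch of this posterior update for the mean with known variance; the data, sigma^2, and prior parameters below are illustrative, not from the slide:

```python
import numpy as np

def gaussian_mean_posterior(data, sigma2, mu0, tau2):
    """Posterior over mu for a N(mu, sigma2) likelihood (known sigma2) and N(mu0, tau2) prior."""
    n = len(data)
    xbar = float(np.mean(data))
    post_var = 1.0 / (n / sigma2 + 1.0 / tau2)               # sigma_tilde^2
    post_mean = post_var * (n * xbar / sigma2 + mu0 / tau2)   # mu_tilde
    return post_mean, post_var

print(gaussian_mean_posterior([1.1, 0.7, 1.4, 0.9, 1.2], sigma2=1.0, mu0=0.0, tau2=10.0))
```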