Some Probability and Statistics

Some Probability and Statistics David M. Blei COS424 Princeton University February 13, 2012

Card problem There are three cards: Red/Red, Red/Black, Black/Black. I go through the following process: close my eyes and pick a card, pick a side at random, and show you that side. Suppose I show you red. What's the probability the other side is red too?

Random variable Probability is about random variables. A random variable is any probabilistic outcome. For example: the flip of a coin; the height of someone chosen randomly from a population. We'll see that it's sometimes useful to think of quantities that are not strictly probabilistic as random variables: the temperature on 11/12/2013; the temperature on 03/04/1905; the number of times "streetlight" appears in a document.

Random variable Random variables take on values in a sample space. They can be discrete or continuous. Coin flip: {H, T}. Height: positive real values (0, ∞). Temperature: real values (−∞, ∞). Number of words in a document: positive integers {1, 2, ...}. We call the values atoms. Denote the random variable with a capital letter; denote a realization of the random variable with a lower case letter. E.g., X is a coin flip, x is the value (H or T) of that coin flip.

Discrete distribution A discrete distribution assigns a probability to every atom in the sample space. For example, if X is an (unfair) coin, then P(X = H) = 0.7 and P(X = T) = 0.3. The probabilities over the entire space must sum to one: Σ_x P(X = x) = 1. Probabilities of disjunctions are sums over part of the space. E.g., the probability that a die is bigger than 3: P(D > 3) = P(D = 4) + P(D = 5) + P(D = 6).
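The die example is easy to check in a few lines of Python (a sketch; representing the distribution as a dictionary over atoms is just one convenient choice):

```python
# A fair six-sided die as a discrete distribution: every atom gets 1/6.
die = {face: 1 / 6 for face in range(1, 7)}

# The probabilities over the entire sample space sum to one.
total = sum(die.values())

# The probability of an event (a disjunction of atoms) is a sum over
# part of the space: P(D > 3) = P(D = 4) + P(D = 5) + P(D = 6).
p_gt_3 = sum(p for face, p in die.items() if face > 3)
```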

A useful picture [Figure: a box of atoms, with regions labeled x and ~x.] An atom is a point in the box. An event is a subset of atoms (e.g., d > 3). The probability of an event is the sum of the probabilities of its atoms.

Joint distribution Typically, we consider collections of random variables. The joint distribution is a distribution over the configuration of all the random variables in the ensemble. For example, imagine flipping 4 coins. The joint distribution is over the space of all possible outcomes of the four coins. P(HHHH) = 0.0625 P(HHHT) = 0.0625 P(HHTH) = 0.0625... You can think of it as a single random variable with 16 values.

Visualizing a joint distribution [Figure: the box of atoms, with regions labeled x and ~x.]

Conditional distribution A conditional distribution is the distribution of a random variable given some evidence. P(X = x | Y = y) is the probability that X = x when Y = y. For example, P(I listen to Steely Dan) = 0.5, P(I listen to Steely Dan | Toni is home) = 0.1, P(I listen to Steely Dan | Toni is not home) = 0.7. P(X = x | Y = y) is a different distribution for each value of y: Σ_x P(X = x | Y = y) = 1, but Σ_y P(X = x | Y = y) ≠ 1 (necessarily).

Definition of conditional probability [Venn diagram with regions (x, y), (x, ~y), (~x, y), (~x, ~y).] Conditional probability is defined as P(X = x | Y = y) = P(X = x, Y = y) / P(Y = y), which holds when P(Y = y) > 0. In the Venn diagram, this is the relative probability of X = x in the space where Y = y.

Returning to the card problem Now we can solve the card problem. Let X_1 be the random side of the random card I chose, and let X_2 be the other side of that card. Compute P(X_2 = red | X_1 = red) = P(X_1 = R, X_2 = R) / P(X_1 = R). The numerator is 1/3: only one card has two red sides. The denominator is 1/2: three of the six possible sides are red.
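The answer of 2/3 can also be checked by simulation; here is a minimal Monte Carlo sketch of the card process described above:

```python
import random

# Three cards with sides (R,R), (R,B), (B,B); pick a card and a side
# uniformly at random, as in the card problem.
cards = [("R", "R"), ("R", "B"), ("B", "B")]

random.seed(0)
shown_red = 0
other_red = 0
for _ in range(100_000):
    card = random.choice(cards)
    side = random.randrange(2)      # the side I show you
    if card[side] == "R":           # condition on seeing red
        shown_red += 1
        if card[1 - side] == "R":
            other_red += 1

# Estimate of P(X_2 = red | X_1 = red); should be close to 2/3.
estimate = other_red / shown_red
```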

The chain rule The definition of conditional probability lets us derive the chain rule, which lets us define the joint distribution as a product of conditionals: P(X, Y) = (P(X, Y) / P(Y)) P(Y) = P(X | Y) P(Y). For example, let Y be a disease and X be a symptom. We may know P(X | Y) and P(Y) from data. Use the chain rule to obtain the probability of having the disease and the symptom. In general, for any set of N variables, P(X_1, ..., X_N) = ∏_{n=1}^N P(X_n | X_1, ..., X_{n−1}).

Marginalization Given a collection of random variables, we are often only interested in a subset of them. For example, compute P(X) from a joint distribution P(X, Y, Z). We can do this with marginalization: P(X) = Σ_y Σ_z P(X, y, z). This is derived from the chain rule: Σ_y Σ_z P(X, y, z) = Σ_y Σ_z P(X) P(y, z | X) = P(X) Σ_y Σ_z P(y, z | X) = P(X). Note: now we can compute the probability that Toni is home.
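Marginalization is just summing rows of a joint table. A small sketch (the weather table and its numbers are made up for illustration):

```python
# A made-up joint distribution P(X, Y) over weather and temperature.
joint = {
    ("sun", "warm"): 0.4,
    ("sun", "cold"): 0.1,
    ("rain", "warm"): 0.1,
    ("rain", "cold"): 0.4,
}

# Marginalize out Y: P(X = x) = sum_y P(X = x, Y = y).
p_x = {}
for (x, y), p in joint.items():
    p_x[x] = p_x.get(x, 0.0) + p
```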

Bayes rule From the chain rule and marginalization, we obtain Bayes rule: P(Y | X) = P(X | Y) P(Y) / Σ_y P(X | Y = y) P(Y = y). Again, let Y be a disease and X be a symptom. From P(X | Y) and P(Y), we can compute the (useful) quantity P(Y | X). Bayes rule is important in Bayesian statistics, where Y is a parameter that controls the distribution of X.
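The disease/symptom computation can be sketched directly; the specific probabilities below are made up for illustration, only the structure of the calculation follows the slide:

```python
# Bayes rule: P(Y | X) = P(X | Y) P(Y) / sum_y P(X | Y = y) P(Y = y).
p_disease = 0.01                  # P(Y = disease), assumed prior
p_symptom_given_disease = 0.9     # P(X = symptom | Y = disease)
p_symptom_given_healthy = 0.1     # P(X = symptom | Y = healthy)

# The denominator: marginal probability of the symptom.
p_symptom = (p_symptom_given_disease * p_disease
             + p_symptom_given_healthy * (1 - p_disease))

# The (useful) quantity P(Y = disease | X = symptom).
p_disease_given_symptom = p_symptom_given_disease * p_disease / p_symptom
```

Note that even with a reliable symptom, the posterior stays small here because the prior P(Y = disease) is small.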

Independence Random variables X and Y are independent if knowing about X tells us nothing about Y: P(Y | X) = P(Y). This means that their joint distribution factorizes: X ⊥ Y ⟺ P(X, Y) = P(X) P(Y). Why? The chain rule: P(X, Y) = P(X) P(Y | X) = P(X) P(Y).

Independence examples Examples of independent random variables: Flipping a coin once / flipping the same coin a second time You use an electric toothbrush / blue is your favorite color Examples of not independent random variables: Registered as a Republican / voted for Bush in the last election The color of the sky / The time of day

Are these independent? Two twenty-sided dice. Rolling three dice and computing (D_1 + D_2, D_2 + D_3). # enrolled students and the temperature outside today. # attending students and the temperature outside today.

Two coins Suppose we have two coins, one fair and one biased: P(C_1 = H) = 0.5 and P(C_2 = H) = 0.7. We choose one of the coins at random, Z ∈ {1, 2}, flip C_Z twice, and record the outcome (X, Y). Question: are X and Y independent? What if we knew which coin Z was flipped?

Conditional independence X and Y are conditionally independent given Z if P(Y | X, Z = z) = P(Y | Z = z) for all possible values of z. Again, this implies a factorization: X ⊥ Y | Z ⟺ P(X, Y | Z = z) = P(X | Z = z) P(Y | Z = z), for all possible values of z.
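The two-coin question can be answered empirically. A minimal simulation (coin biases taken from the slide; everything else is an illustrative sketch): marginally X and Y are dependent, because X carries information about which coin Z was chosen, but given Z they are independent.

```python
import random

random.seed(1)
p_heads = {1: 0.5, 2: 0.7}   # coin 1 fair, coin 2 biased

samples = []
for _ in range(200_000):
    z = random.choice([1, 2])              # pick a coin at random
    x = random.random() < p_heads[z]       # first flip
    y = random.random() < p_heads[z]       # second flip of the same coin
    samples.append((z, x, y))

# Marginally: P(Y = H | X = H) > P(Y = H), so X and Y are dependent.
p_y = sum(y for _, _, y in samples) / len(samples)
x_h = [s for s in samples if s[1]]
p_y_given_x = sum(y for _, _, y in x_h) / len(x_h)

# Given Z = 2: P(Y = H | X = H, Z = 2) ~ P(Y = H | Z = 2), up to noise.
z2 = [s for s in samples if s[0] == 2]
p_y_given_z2 = sum(y for _, _, y in z2) / len(z2)
z2_x = [s for s in z2 if s[1]]
p_y_given_x_z2 = sum(y for _, _, y in z2_x) / len(z2_x)
```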

Continuous random variables We've only used discrete random variables so far (e.g., dice). Random variables can be continuous. Then we need a density p(x), which integrates to one: e.g., if x ∈ ℝ then ∫ p(x) dx = 1. Probabilities are integrals over smaller intervals, e.g., P(X ∈ (−2.4, 6.5)) = ∫_{−2.4}^{6.5} p(x) dx. Notice when we use P, p, X, and x.

The Gaussian distribution The Gaussian (or Normal) is a continuous distribution: p(x | µ, σ) = (1/√(2πσ²)) exp(−(x − µ)²/(2σ²)). The density of a point x is proportional to the negative exponentiated half distance to µ scaled by σ². µ is called the mean; σ² is called the variance.

Gaussian density [Figure: the density p(x) of N(1.2, 1).] The mean µ controls the location of the bump. The variance σ² controls the spread of the bump.

Notation For discrete RVs, p denotes the probability mass function, which is the same as the distribution on atoms. (I.e., we can use P and p interchangeably for atoms.) For continuous RVs, p is the density, and they are not interchangeable. This is an unpleasant detail. Ask when you are confused.

Expectation Consider a function of a random variable, f(X). (Notice: f(X) is also a random variable.) The expectation is a weighted average of f, where the weighting is determined by p(x): E[f(X)] = Σ_x p(x) f(x). In the continuous case, the expectation is an integral: E[f(X)] = ∫ p(x) f(x) dx.
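As a concrete sketch of the discrete case, here is the expectation of f(D) = D² for a fair die (the choice of f is arbitrary, just for illustration):

```python
# E[f(D)] = sum_d p(d) f(d) for a fair die and f(d) = d**2.
die = {face: 1 / 6 for face in range(1, 7)}
e_f = sum(p * face ** 2 for face, p in die.items())
# Analytically: (1 + 4 + 9 + 16 + 25 + 36) / 6 = 91/6.
```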

Conditional expectation The conditional expectation is defined similarly: E[f(X) | Y = y] = Σ_x p(x | y) f(x). Question: what is E[f(X) | Y = y]? What is E[f(X) | Y]? E[f(X) | Y = y] is a scalar. E[f(X) | Y] is a (function of a) random variable.

Iterated expectation Let's take the expectation of E[f(X) | Y]: E[E[f(X) | Y]] = Σ_y p(y) E[f(X) | Y = y] = Σ_y p(y) Σ_x p(x | y) f(x) = Σ_y Σ_x p(x, y) f(x) = Σ_y Σ_x p(x) p(y | x) f(x) = Σ_x p(x) f(x) Σ_y p(y | x) = Σ_x p(x) f(x) = E[f(X)].

Flips to the first heads We flip a coin with probability π of heads until we see a heads. What is the expected waiting time for a heads? E[N] = 1·π + 2(1 − π)π + 3(1 − π)²π + ... = Σ_{n=1}^∞ n (1 − π)^{n−1} π.

Let's use iterated expectation. E[N] = E[E[N | X_1]] = π E[N | X_1 = H] + (1 − π) E[N | X_1 = T] = π · 1 + (1 − π)(E[N] + 1) = π + (1 − π) + (1 − π) E[N] = 1 + (1 − π) E[N]. Solving for E[N] gives E[N] = 1/π.
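The answer E[N] = 1/π is easy to sanity-check by simulation; a sketch with an arbitrary choice of π = 0.25, so the expected waiting time is 4:

```python
import random

random.seed(2)
pi = 0.25   # probability of heads (arbitrary choice for the check)

def flips_to_first_heads():
    # Flip until the first heads; return how many flips it took.
    n = 1
    while random.random() >= pi:
        n += 1
    return n

trials = 100_000
average = sum(flips_to_first_heads() for _ in range(trials)) / trials
# average should be close to 1/pi = 4.
```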

Probability models Probability distributions are used as models of data that we observe: pretend that data is drawn from an unknown distribution, then infer the properties of that distribution from the data. For example: the bias of a coin; the average height of a student; the chance that someone will vote for H. Clinton; the chance that someone from Vermont will vote for H. Clinton; the proportion of gold in a mountain; the number of bacteria in our body; the evolutionary rate at which genes mutate. We will see many models in this class.

Independent and identically distributed random variables Independent and identically distributed (IID) variables are: (1) independent, (2) identically distributed. If we repeatedly flip the same coin N times and record the outcomes, then X_1, ..., X_N are IID. The IID assumption can be useful in data analysis.

What is a parameter? Parameters are values that index a distribution. A coin flip is a Bernoulli. Its parameter is the probability of heads: p(x | π) = π^{1[x=H]} (1 − π)^{1[x=T]}, where 1[·] is called an indicator function; it is 1 when its argument is true and 0 otherwise. Changing π leads to different Bernoulli distributions. A Gaussian has two parameters, the mean and variance: p(x | µ, σ) = (1/√(2πσ²)) exp(−(x − µ)²/(2σ²)). A multinomial distribution has a vector parameter, on the simplex.

The likelihood function Again, suppose we flip a coin N times and record the outcomes. Further suppose that we think that the probability of heads is π. (This is distinct from whatever the probability of heads really is.) Given π, the probability of an observed sequence is p(x_1, ..., x_N | π) = ∏_{n=1}^N π^{1[x_n=H]} (1 − π)^{1[x_n=T]}.

The log likelihood As a function of π, the probability of a set of observations is called the likelihood function: p(x_1, ..., x_N | π) = ∏_{n=1}^N π^{1[x_n=H]} (1 − π)^{1[x_n=T]}. Taking logs, this is the log likelihood function: L(π) = Σ_{n=1}^N 1[x_n = H] log π + 1[x_n = T] log(1 − π).

Bernoulli log likelihood [Figure: the log likelihood as a function of π.] We observe HHTHTHHTHHTHHTH. The value of π that maximizes the log likelihood is 2/3.

The maximum likelihood estimate The maximum likelihood estimate is the value of the parameter that maximizes the log likelihood (equivalently, the likelihood). In the Bernoulli example, it is the proportion of heads: π̂ = (1/N) Σ_{n=1}^N 1[x_n = H]. In a sense, this is the value that best explains our observations.
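For the sequence on the earlier slide, the Bernoulli MLE is just a one-line count:

```python
# MLE for the Bernoulli: the proportion of heads in the observations.
flips = "HHTHTHHTHHTHHTH"   # the observed sequence from the slide
pi_hat = sum(x == "H" for x in flips) / len(flips)
# 10 heads out of 15 flips, so pi_hat = 2/3.
```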

Why is the MLE good? The MLE is consistent. Flip a coin N times with true bias π. Estimate the parameter from x_1, ..., x_N with the MLE π̂. Then, lim_{N→∞} π̂ = π. This is a good thing. It lets us sleep at night.

5000 coin flips 1 1 0 1 1 1 1 0 0 1 0 0 1 1 1 0 0 0 0 0 1 0 0 1 0 0 1 1 1 0 1 0 1 0 0 0 0 1 0 1 0 1 1 1 1 0 0 0 1 1 0 1 1 1 1 1 1 0 1 1 0 1 0 1 1 1 0 0 0 1 1 1 1 1 0 1 1 1 1 1 1 0 1 1 0 1 1 1 1 0 0 1 1 1 0 0 1 0 1 0 0 0 0 1 0 0 1 1 1 0 1 1 1 0 1 1 1 0 1 1 0 0 0 1 1 1 1 1 1 0 1 1 1 0 0 1 1 1 1 1 0 0 0 0 1 0 1 1 1 1 1 1 1 0 0 1 1 1 0 0 0 1 1 1 0 1 0 0 0 1 1 0 1 0 1 1 1 0 1 1 0 0 1 1 1 1 1 1 0 1 0 0 1 0 0 1 0 0 1 1 0 0 1 0 0 0 1 1 1 0 1 1 0 0 0 1 1 0 1 1 1 0 1 1 0 0 1 0 1 0 1 1 1 1 1 0 0 0 0 1 1 0 0 1 0 1 0 0 0 1 1 0 1 1 0 0 1 1 0 1 1 0 1 1 1 0 1 1 0 1 1 0 1 1 0 0 1 1 0 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 0 0 0 1 0 0 1 1 1 0 0 1 1 0 0 0 0 0 0 0 0 1 0 1 1 1 0 0 1 1 0 1 1 1 0 1 1 1 1 0 1 1 0 0 0 0 1 1 1 0 1 0 1 0 1 0 1 1 0 1 0 0 1 1 1 0 0 1 1 1 0 1 0 1 0 1 1 0 1 1 1 1 1 0 0 0 1 1 1 1 1 1 0 1 1 1 0 0 0 1 0 1 0 0 1 1 1 1 1 0 1 0 1 0 0 1 1 0 1 1 1 1 0 0 0 0 1 1 1 1 0 1 0 0 1 0 0 1 0 0 1 0 1 1 1 1 1 1 0 0 1 1 0 1 1 1 0 0 1 0 0 1 0 1 1 1 1 0 0 1 1 1 1 1 0 1 1 0 0 1 0 1 1 0 1 1 1 1 0 0 0 0 1 0 0 1 1 1 1 0 0 0 0 1 0 0 1 1 1 1 1 0 1 1 0 1 1 1 1 1 0 1 1 0 1 0 1 1 1 0 0 1 0 1 1 1 1 1 1 1 1 0 1 0 1 1 0 0 1 0 1 1 1 0 0 0 1 1 1 0 1 1 1 0 0 0 1 1 1...

Consistency of the MLE example [Figure: the MLE of the bias plotted against the number of flips, from 0 to 5000.]
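A consistency experiment like the one plotted above can be reproduced in a few lines; the true bias of 0.7 below is an assumption for illustration (the slide does not state the value used):

```python
import random

random.seed(3)
true_pi = 0.7   # assumed true bias of the coin

# Flip 5000 times, tracking the running MLE (proportion of heads).
heads = 0
estimates = []
for n in range(1, 5001):
    heads += random.random() < true_pi
    estimates.append(heads / n)

# Early estimates are noisy; by N = 5000 the MLE is near the truth.
final_error = abs(estimates[-1] - true_pi)
```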

Gaussian log likelihood Suppose we observe continuous x_1, ..., x_N. We choose to model them with a Gaussian: p(x_1, ..., x_N | µ, σ²) = ∏_{n=1}^N (1/√(2πσ²)) exp(−(x_n − µ)²/(2σ²)). The log likelihood is L(µ, σ) = −(N/2) log(2πσ²) − (1/(2σ²)) Σ_{n=1}^N (x_n − µ)².

Gaussian MLE The MLE of the mean is the sample mean: µ̂ = (1/N) Σ_{n=1}^N x_n. The MLE of the variance is the sample variance: σ̂² = (1/N) Σ_{n=1}^N (x_n − µ̂)². E.g., approval ratings of the presidents from 1945 to 1975.
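The two Gaussian MLEs are direct to compute; a sketch on a small made-up sample (note the variance MLE divides by N, not N − 1):

```python
# A small made-up data set.
data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
n = len(data)

# MLE of the mean: the sample mean.
mu_hat = sum(data) / n

# MLE of the variance: the (1/N) sample variance around mu_hat.
sigma2_hat = sum((x - mu_hat) ** 2 for x in data) / n
```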

Gaussian analysis of approval ratings [Figure: the fitted Gaussian density over approval ratings from 0 to 120.] Q: What's wrong with this analysis?

Model pitfalls What's wrong with this analysis? It assigns positive probability to numbers < 0 and > 100. It ignores the sequential nature of the data. It assumes that approval ratings are IID! "All models are wrong. Some models are useful." (Box)

Graphical models A graphical model represents a joint distribution of random variables. (Also called a "Bayesian network.") Nodes are RVs; edges denote possible dependence. Shaded nodes are observed; unshaded nodes are hidden. GMs with shaded nodes represent posterior distributions. Connects computing about models to graph structure (COS513). Connects independence assumptions to graph structure (COS513). Here we'll use it as a schematic for factored joint distributions.

Some of the models we'll learn about: Naive Bayes classification; linear regression and logistic regression; generalized linear models; hidden variables, mixture models, and the EM algorithm; factor analysis / principal component analysis; sequential models; Bayesian models.