PROBABILITY THEORY REVIEW
CMPUT 466/551, Martha White, Fall 2017

REMINDERS
- Assignment 1 is due on September 28
- Thought questions 1 are due on September 21 (Chapters 1-4, about 40 pages)
- If you are printing, don't print all the notes yet
- Office hours: Martha, 3-5 p.m. on Tuesday (ATH 3-05); Labs: W 5-8 p.m. and F 2-5 p.m.
- Start thinking about datasets for your mini-project
- I do not expect you to know formulas, like PDFs
- I will use a combination of slides and writing on the board (where you should take notes)

BACKGROUND FOR COURSE
- Need to know calculus, mostly derivatives; we will almost never integrate. I will teach you multivariate calculus
- Need to know linear algebra; I assume you know about vectors, matrices and dot products. I will teach you more advanced topics, like singular value decompositions
- Need to have learned a bit about probability
- Concerning the final: it will test only fundamentals; it is not intended to be difficult in order to finely separate students

PROBABILITY THEORY IS THE SCIENCE OF PREDICTIONS*
The goal of science is to discover theories that can be used to predict how natural processes evolve, or to explain natural phenomena, based on observed phenomena. The goal of probability theory is to provide the foundation to build theories (= models) that can be used to reason about the outcomes of events, future or past, based on observations: prediction of the unknown, which may depend on what is observed and whose nature is probabilistic.
*Quote from Csaba Szepesvari, https://eclass.srv.ualberta.ca/pluginfile.php/1136251/mod_resource/content/1/lecturenotes_probabilities.pdf

(MEASURABLE) SPACE OF OUTCOMES AND EVENTS
F has been changed to E in the notes: an event space. Note: the terminology "sigma field" sounds technical, but it just means this event space.

Intuitively, WHY IS THIS THE DEFINITION?
1. A collection of outcomes is an event (e.g., either a 1 or a 6 was rolled)
2. If we can measure two events separately, then their union should also be a measurable event
3. If we can measure an event, then we should be able to measure that the event did not occur (the complement)

AXIOMS OF PROBABILITY
1. (unit measure) $P(\Omega) = 1$
2. ($\sigma$-additivity) Any countable sequence of disjoint events $A_1, A_2, \ldots \in \mathcal{F}$ satisfies $P\left(\bigcup_{i=1}^{\infty} A_i\right) = \sum_{i=1}^{\infty} P(A_i)$
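As a quick worked check (my own example, not from the slides), take a fair die with $P(\{\omega\}) = 1/6$. Additivity over the disjoint events $\{1\}$ and $\{6\}$ gives

$$P(\{1, 6\}) = P(\{1\}) + P(\{6\}) = \tfrac{1}{6} + \tfrac{1}{6} = \tfrac{1}{3}, \qquad P(\Omega) = \sum_{\omega=1}^{6} P(\{\omega\}) = 6 \cdot \tfrac{1}{6} = 1,$$

consistent with the unit-measure axiom.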

A FEW COMMENTS ON TERMINOLOGY
A few new terms, including countable and closure; only a small amount of terminology is used, and you can google these terms and learn them on your own (there is a notation sheet in the notes).
- Countable: integers, finite sets like {0.1, 2.0, 3.6}, ...
- Uncountable: real numbers, intervals, ...
We interchangeably use (though it's somewhat loose): discrete and countable; continuous and uncountable.

SAMPLE SPACES
e.g., $\mathcal{F} = \{\emptyset, \{1, 2\}, \{3, 4, 5, 6\}, \{1, 2, 3, 4, 5, 6\}\}$
e.g., $\mathcal{F} = \{\emptyset, [0, 0.5], (0.5, 1.0], [0, 1]\}$

FINDING PROBABILITY DISTRIBUTIONS

PROBABILITY MASS FUNCTIONS

ARBITRARY PMFS
e.g., PMF for a fair die (table of values): $\Omega = \{1, 2, 3, 4, 5, 6\}$, $p(\omega) = 1/6 \;\; \forall \omega \in \Omega$

EXERCISE: HOW ARE PMFS USEFUL AS A MODEL?
Recall we wanted to model commute times. We could use a probability table for minutes: count the number of times t = 1, 2, 3, ... occurs and then normalize the probabilities by the number of samples. Pick the t with the largest p(t).
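A minimal sketch of this counting-and-normalizing idea (my own illustration; the data values are made up):

```python
import numpy as np

# Hypothetical recorded commute times in minutes (made-up data)
times = np.array([12, 15, 12, 14, 15, 15, 13, 12, 15, 14])

# Build a probability table: count each value, normalize by # samples
values, counts = np.unique(times, return_counts=True)
pmf = counts / counts.sum()          # estimated p(t) for each observed t

# Pick the t with the largest estimated probability
t_best = values[np.argmax(pmf)]
print(dict(zip(values, pmf)), t_best)   # here p(15) = 0.4, so t_best = 15
```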

USEFUL PMFS
e.g., the Poisson: amount of mail received in a day, or the number of calls received by a call center in an hour

EXERCISE: CAN WE USE A POISSON FOR COMMUTE TIMES?
We used a probability table (histogram) for minutes: count the number of times t = 1, 2, 3, ... occurs and then normalize the probabilities by the number of samples. Can we use a Poisson instead?

PROBABILITY DENSITY FUNCTIONS

PMFS VS. PDFS
Example: stopping time of a car, in the interval [3, 15] seconds.
- What is the probability of seeing a stopping time of exactly 3.141596? (How much mass sits at that single point, out of all of [3, 15]? For a continuous variable, zero.)
- It is more reasonable to ask the probability of stopping between 3 and 3.5 seconds.


USEFUL PDFS

EXERCISE: MODELING COMMUTE TIMES
Poisson: $p(t) = \frac{\lambda^t e^{-\lambda}}{t!}$
Gamma: $p(t) = \frac{t^{k-1} e^{-t/\theta}}{\theta^k \Gamma(k)}$
Gaussian: $p(t) = \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{(t-\mu)^2}{2\sigma^2}}$
Which might you choose?
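To make the choice concrete, here is a sketch (my own, with arbitrary illustrative parameters rather than fitted values) evaluating the three candidates at a range of commute times:

```python
import numpy as np
from scipy.stats import poisson, gamma, norm

t = np.arange(1, 31)                  # commute times in minutes

# Arbitrary illustrative parameters, not fitted to any data
p_pois = poisson.pmf(t, mu=12)                # lambda = 12
p_gamma = gamma.pdf(t, a=9, scale=12/9)       # k = 9, theta chosen so mean = 12
p_gauss = norm.pdf(t, loc=12, scale=3)        # mu = 12, sigma = 3

# The Poisson only assigns mass to integers; the Gamma and Gaussian are
# densities, and the Gaussian puts (a little) density on negative times --
# one reason a Gamma can be a natural choice for a nonnegative quantity.
print(p_pois[:5], p_gamma[:5], p_gauss[:5])
```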

RANDOM VARIABLES
Person 1: Age 35, Likes sports: Yes, Height 1.85m, Smokes: No, Weight 75kg, Marital status: Single, IQ 104, Occupation: Musician
Person 2: Age 26, Likes sports: Yes, Height 1.75m, Smokes: No, Weight 79kg, Marital status: Divorced, IQ 103, Occupation: Athlete
Musician is a random variable M (a function); A is the new event space. We can ask P(M = 0) and P(M = 1).

WE INSTINCTIVELY CREATE THIS TRANSFORMATION
Assume $\Omega$ is a set of people. Compute the probability that a randomly selected person $\omega \in \Omega$ has a cold. Define the event $A = \{\omega \in \Omega : \text{Disease}(\omega) = \text{cold}\}$. Disease is our new random variable, giving $P(\text{Disease} = \text{cold})$.
- Disease is a function that maps the outcome space $\Omega$ to the new outcome space {cold, not cold}
- Disease is a function, which is neither a variable nor random
- BUT, the term "random variable" is still a good one, since we treat Disease as a variable and assume it can take on different values (randomly, according to some distribution)

RANDOM VARIABLE: FORMAL DEFINITION
Example: $X : \Omega \to [0, \infty)$, where $\Omega$ is a set of (measured) people in a population with associated measurements such as height and weight.
$X(\omega) = \text{height}$
$A = \text{interval} = [5'1'', 5'2'']$
$P(X \in A) = P(5'1'' \le X \le 5'2'') = P(\{\omega : X(\omega) \in A\})$

5 MINUTE BREAK AND EXERCISE
- Let X be a random variable that corresponds to the ratio of hard-to-easy problems on an assignment. Assume it takes values in {0.1, 0.25, 0.7}. Is this discrete or continuous? Does it have a PMF or PDF? Further, where could the variability come from? i.e., why is this a random variable?
- Let X be the stopping time of a car, taking values in $[3, 5] \cup [7, 9]$. Is this discrete or continuous?
- Think of an example of a discrete random variable (RV) and a continuous RV.
- We provided several named PMFs. Why do we use these explicit functional forms? Why not just tables of values, which are more flexible?

WHAT IF WE HAVE MORE THAN TWO VARIABLES
So far, we have considered scalar random variables. The axioms of probability are defined abstractly, so they apply to vector random variables:
$\Omega = \mathbb{R}^2$, e.g., $\omega = [-0.5, 10]$
$\Omega = [0, 1] \times [2, 5]$, e.g., $\omega = [0.2, 3.5]$
But, when defining probabilities, we will want to consider how the variables interact.

TWO DISCRETE RANDOM VARIABLES
Random variables X and Y with outcome spaces $\mathcal{X}$ and $\mathcal{Y}$; the joint PMF is $p(x, y) = P(X = x, Y = y)$, with $\sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} p(x, y) = 1$.
Example: X = {young, old} (coded 0/1) and Y = {no arthritis, arthritis} (coded 0/1), with joint probabilities:

         Y = 0    Y = 1
X = 0    1/2      1/100
X = 1    1/10     39/100

SOME QUESTIONS WE MIGHT ASK NOW THAT WE HAVE TWO RANDOM VARIABLES
Using the same table of joint probabilities for X = {young, old} and Y = {no arthritis, arthritis}:
- Are these two variables related? Or do they change completely independently of each other?
- Given this joint distribution, can we determine just the distribution over arthritis, i.e., P(Y = 1)? (Marginal distribution)
- If we knew something about one of the variables, say that the person is young, do we now know the distribution over Y? (Conditional distribution)

EXAMPLE: MARGINAL AND CONDITIONAL DISTRIBUTION
Using the same table: P(Y = 1) = P(Y = 1, X = 0) + P(Y = 1, X = 1) = 1/100 + 39/100 = 40/100
What is P(Y = 0)?
P(X = 1) = 1/10 + 39/100 = 49/100
P(Y = 1 | X = 0) = ? Is it 1/100, where the table tells us P(Y = 1, X = 0)?

CONDITIONAL DISTRIBUTIONS
$$p(y \mid x) = \frac{p(x, y)}{p(x)}$$
More generally, for a set A,
$$P(Y \in A \mid X = x) = \begin{cases} \sum_{y \in A} p(y \mid x) & Y \text{ discrete} \\ \int_A p(y \mid x) \, dy & Y \text{ continuous} \end{cases}$$
Rearranging, $p(x, y) = p(y \mid x) \, p(x) = p(x \mid y) \, p(y)$ is called the product rule.

EXERCISE: CONDITIONAL DISTRIBUTION
Using the same table of joint probabilities and $p(y \mid x) = \frac{p(x, y)}{p(x)}$:
- P(Y = 1 | X = 0) = ?
- What is P(Y = 0 | X = 0)?
- Should P(Y = 1 | X = 0) + P(Y = 0 | X = 0) = 1?
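A small computational sketch of these questions, using the numbers from the slide's table (the code itself is my own illustration):

```python
import numpy as np

# Joint p(x, y): rows are X (0 = young, 1 = old),
# columns are Y (0 = no arthritis, 1 = arthritis)
joint = np.array([[1/2,  1/100],
                  [1/10, 39/100]])

p_x = joint.sum(axis=1)          # marginal p(x): [0.51, 0.49]
p_y = joint.sum(axis=0)          # marginal p(y): [0.60, 0.40]

# Conditional p(y | x) = p(x, y) / p(x), one row per value of x
p_y_given_x = joint / p_x[:, None]
print(p_y[1])                    # P(Y = 1) = 0.40
print(p_y_given_x[0])            # [P(Y=0|X=0), P(Y=1|X=0)]; each row sums to 1
```

Note that P(Y = 1 | X = 0) = (1/100) / (51/100) ≈ 0.02, not 1/100: the joint entry must be renormalized by the marginal.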

JOINT DISTRIBUTIONS FOR MANY VARIABLES
In general, we can consider a d-dimensional random variable $X = (X_1, X_2, \ldots, X_d)$ with vector-valued outcomes $x = (x_1, x_2, \ldots, x_d)$, such that each $x_i$ is chosen from some $\mathcal{X}_i$. Then, for the discrete case, any function $p : \mathcal{X}_1 \times \mathcal{X}_2 \times \cdots \times \mathcal{X}_d \to [0, 1]$ is called a multidimensional probability mass function if
$$\sum_{x_1 \in \mathcal{X}_1} \sum_{x_2 \in \mathcal{X}_2} \cdots \sum_{x_d \in \mathcal{X}_d} p(x_1, x_2, \ldots, x_d) = 1,$$
or, for the continuous case, $p : \mathcal{X}_1 \times \mathcal{X}_2 \times \cdots \times \mathcal{X}_d \to [0, \infty)$ is a multidimensional probability density function if
$$\int_{\mathcal{X}_1} \int_{\mathcal{X}_2} \cdots \int_{\mathcal{X}_d} p(x_1, x_2, \ldots, x_d) \, dx_1 dx_2 \ldots dx_d = 1.$$

MARGINAL DISTRIBUTIONS
A marginal distribution is defined for a subset of $X = (X_1, X_2, \ldots, X_d)$ by summing or integrating over the remaining variables. For the discrete case, the marginal distribution $p(x_i)$ is defined as
$$p(x_i) = \sum_{x_1 \in \mathcal{X}_1} \cdots \sum_{x_{i-1} \in \mathcal{X}_{i-1}} \sum_{x_{i+1} \in \mathcal{X}_{i+1}} \cdots \sum_{x_d \in \mathcal{X}_d} p(x_1, \ldots, x_{i-1}, x_i, x_{i+1}, \ldots, x_d),$$
where the variable $x_i$ is fixed to some value and we sum over all possible values of the other variables. Similarly, for the continuous case, the marginal distribution $p(x_i)$ is defined as
$$p(x_i) = \int_{\mathcal{X}_1} \cdots \int_{\mathcal{X}_{i-1}} \int_{\mathcal{X}_{i+1}} \cdots \int_{\mathcal{X}_d} p(x_1, \ldots, x_{i-1}, x_i, x_{i+1}, \ldots, x_d) \, dx_1 \ldots dx_{i-1} dx_{i+1} \ldots dx_d.$$
Notice that we use p to define the joint density, but then we overload this notation for the marginals.
Natural question: Why do you use p for $p(x_i)$ and for $p(x_1, \ldots, x_d)$? They have different domains, so they can't be the same function!

DROPPING SUBSCRIPTS
Instead of: $p_{Y \mid X}(y \mid x) = \frac{p_{X,Y}(x, y)}{p_X(x)}$
We will write: $p(y \mid x) = \frac{p(x, y)}{p(x)}$

ANOTHER EXAMPLE FOR CONDITIONAL DISTRIBUTIONS
- Let X be a Bernoulli random variable (i.e., 0 or 1, with probability alpha)
- Let Y be a random variable in {10, 11, ..., 1000}
- p(y | X = 0) and p(y | X = 1) are different distributions
- e.g., two types of books: fiction (X = 0) and non-fiction (X = 1), where Y corresponds to the number of pages. The distribution over the number of pages is different for fiction and non-fiction books (e.g., different averages)

EXAMPLE CONTINUED
Two types of books: fiction (X = 0) and non-fiction (X = 1); Y corresponds to the number of pages.
p(y | X = 0) = p(X = 0, y) / p(X = 0)
- p(X = 0, y) = probability that a book is fiction and has y pages (imagine randomly sampling a book)
- p(X = 0) = probability that a book is fiction
- If most books are non-fiction, p(X = 0, y) is small even if y is a likely number of pages for a fiction book
- Dividing by p(X = 0) accounts for the fact that the joint probability is small when p(X = 0) is small

ANOTHER EXAMPLE
Two types of books: fiction (X = 0) and non-fiction (X = 1). Let Y be a random variable over the reals, corresponding to the amount of money made. p(y | X = 0) and p(y | X = 1) are different distributions; e.g., even if both p(y | X = 0) and p(y | X = 1) are Gaussian, they likely have different means and variances.

WHAT DO WE KNOW ABOUT P(Y)?
We know p(y | x) and we know the marginal p(x). Correspondingly, we know p(x, y) = p(y | x) p(x), from the conditional probability definition p(y | x) = p(x, y) / p(x). What is the marginal p(y)?
$$p(y) = \sum_x p(x, y) = \sum_x p(y \mid x) \, p(x) = p(y \mid X = 0) \, p(X = 0) + p(y \mid X = 1) \, p(X = 1)$$
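For instance, with made-up numbers for the book example (the probabilities below are assumptions for illustration only):

```python
# Law of total probability: p(y) = sum_x p(y | x) p(x)
p_x = [0.3, 0.7]                 # assumed: 30% fiction, 70% non-fiction
p_y_given_x0 = 0.05              # assumed P(Y = 200 pages | fiction)
p_y_given_x1 = 0.02              # assumed P(Y = 200 pages | non-fiction)

p_y = p_y_given_x0 * p_x[0] + p_y_given_x1 * p_x[1]
print(p_y)                       # 0.05*0.3 + 0.02*0.7 = 0.029
```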

SEPT 12: PROBABILITY REVIEW CONTINUED
Machine learning topic overview* (*from Yaser Abu-Mostafa, https://work.caltech.edu/library/)

REMINDERS
- Assignment 1 (September 28): small typo fixes (bold X, and the range of lambda changed to [0, ∞) rather than (0, ∞)). "Express in terms of givens a, b, c" does not mean you have to use all of a, b and c, but that the final expression should include (some subset of) these given values
- Office hours and labs start this week. Martha: 3-5 p.m. on Tuesday (ATH 3-05); Labs: W 5-8 p.m. and F 2-5 p.m.

CHAIN RULE
Two-variable example: $p(x, y) = p(x \mid y) \, p(y) = p(y \mid x) \, p(x)$

HOW DO WE GET BAYES' RULE?
Recall the chain rule: $p(x, y) = p(x \mid y) \, p(y) = p(y \mid x) \, p(x)$
Bayes' rule: $$p(y \mid x) = \frac{p(x \mid y) \, p(y)}{p(x)}$$
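To spell out the step: equate the two factorizations from the chain rule and divide both sides by $p(x)$, assuming $p(x) > 0$:
$$p(y \mid x) \, p(x) = p(x \mid y) \, p(y) \;\implies\; p(y \mid x) = \frac{p(x \mid y) \, p(y)}{p(x)}.$$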

EXERCISE: CONDITIONAL PROBABILITIES
Using conditional probabilities, we can incorporate other external information (features). Let y be the commute time and x the day of the year, with an array of conditional probability values p(y | x) for y = 1, 2, ... and x = 1, 2, ..., 365.
- What are some issues with this choice for x?
- What other x could we use feasibly?

EXERCISE: ADDING IN AUXILIARY INFORMATION
The Gamma distribution for commute times, $p(t) = \frac{t^{k-1} e^{-t/\theta}}{\theta^k \Gamma(k)}$, extrapolates between recorded times in minutes. We can incorporate external information (features) by modeling theta as a function of the features, e.g., $\theta = \sum_{i=1}^d w_i x_i$.
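One way to picture this (a sketch; the features, weights, and shape parameter below are all hypothetical, not the course's actual model):

```python
import numpy as np
from scipy.stats import gamma

# Assumed feature vector x (e.g., day-of-week indicator, weather flag)
# and assumed weights w -- both hypothetical
x = np.array([1.0, 0.0, 1.0])
w = np.array([1.5, 0.3, 0.8])

theta = w @ x                     # theta = sum_i w_i x_i, as on the slide
k = 9                             # assumed fixed shape parameter
print(gamma.pdf(12.0, a=k, scale=theta))   # density of a 12-minute commute
```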

INDEPENDENCE OF RANDOM VARIABLES
Independence: $p(x, y) = p(x) \, p(y)$
Conditional independence (given Z): $p(x, y \mid z) = p(x \mid z) \, p(y \mid z)$

CONDITIONAL INDEPENDENCE EXAMPLES (EXAMPLE 7 IN THE NOTES)
Imagine you have a biased coin (it does not flip 50% heads and 50% tails, but is skewed towards one).
- Let Z = bias of the coin (say the outcomes are 0.3, 0.5, 0.8 with associated probabilities 0.7, 0.2, 0.1). What other outcome space could we consider? What kinds of distributions?
- Let X and Y be consecutive flips of the coin
- Are X and Y independent?
- Are X and Y conditionally independent, given Z?
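A simulation sketch of this exercise (my own, reusing the slide's numbers). Marginally, the first flip carries information about the bias and hence about the second flip, so X and Y are dependent; given Z, they are independent:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# Z = bias of the coin: outcomes 0.3, 0.5, 0.8 with probabilities 0.7, 0.2, 0.1
z = rng.choice([0.3, 0.5, 0.8], size=n, p=[0.7, 0.2, 0.1])
x = rng.random(n) < z              # first flip (True = heads)
y = rng.random(n) < z              # second flip of the *same* coin

# Marginally dependent: P(Y=1 | X=1) != P(Y=1)
print(y.mean(), y[x].mean())       # ~0.39 vs ~0.45 (not equal)

# Conditionally independent given Z: P(Y=1 | X=1, Z=0.3) == P(Y=1 | Z=0.3)
sel = z == 0.3
print(y[sel].mean(), y[sel & x].mean())    # both ~0.30
```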

EXPECTED VALUE (MEAN)
$$E[X] = \begin{cases} \sum_{x \in \mathcal{X}} x \, p(x) & X \text{ discrete} \\ \int_{\mathcal{X}} x \, p(x) \, dx & X \text{ continuous} \end{cases}$$

EXPECTATIONS WITH FUNCTIONS
For $f : \mathcal{X} \to \mathbb{R}$,
$$E[f(X)] = \begin{cases} \sum_{x \in \mathcal{X}} f(x) \, p(x) & X \text{ discrete} \\ \int_{\mathcal{X}} f(x) \, p(x) \, dx & X \text{ continuous} \end{cases}$$

EXPECTED VALUE FOR MULTIVARIATE
$E[X]$ is defined the same way, where each instance x is a vector and p is a function on these vectors; e.g., in two discrete dimensions, $E[X] = \sum_{x_1} \sum_{x_2} \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} p(x_1, x_2)$.

COVARIANCE
The covariance function is defined as
$$\text{Cov}[X, Y] = E[(X - E[X])(Y - E[Y])] = E[XY] - E[X]E[Y],$$
with $\text{Cov}[X, X] = V[X]$ being the variance of the random variable X. The correlation is the covariance normalized by the product of the standard deviations:
$$\text{Corr}[X, Y] = \frac{\text{Cov}[X, Y]}{\sqrt{V[X] \, V[Y]}}.$$

COVARIANCE FOR MORE THAN TWO DIMENSIONS
For $X = [X_1, \ldots, X_d]$, the entries of the covariance matrix are $\Sigma_{ij} = \text{Cov}[X_i, X_j] = E[(X_i - E[X_i])(X_j - E[X_j])]$, and the matrix can be written as
$$\Sigma = \text{Cov}[X, X] = E[(X - E[X])(X - E[X])^\top] = E[XX^\top] - E[X]E[X]^\top \in \mathbb{R}^{d \times d}.$$
The diagonal elements of a covariance matrix are the individual variances.

COVARIANCE FOR MORE THAN TWO DIMENSIONS (CONTINUED)
For $X = [X_1, \ldots, X_d]$, $\Sigma = \text{Cov}[X, X] = E[(X - E[X])(X - E[X])^\top] = E[XX^\top] - E[X]E[X]^\top$. This uses outer products: for $x, y \in \mathbb{R}^d$,
Dot product: $x^\top y = \sum_{i=1}^d x_i y_i$
Outer product: $xy^\top = \begin{bmatrix} x_1 y_1 & x_1 y_2 & \ldots & x_1 y_d \\ x_2 y_1 & x_2 y_2 & \ldots & x_2 y_d \\ \vdots & & & \vdots \\ x_d y_1 & x_d y_2 & \ldots & x_d y_d \end{bmatrix} \in \mathbb{R}^{d \times d}$
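In numpy, the two products look like this (a small illustration of my own, with arbitrary vectors and random sample data):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, 5.0, 6.0])

print(x @ y)            # dot product: scalar 32.0
print(np.outer(x, y))   # outer product: 3x3 matrix with entries x_i * y_j

# The covariance matrix is an expected outer product; estimate it from
# samples as the average of (x - mean)(x - mean)^T
X = np.random.default_rng(1).normal(size=(1000, 3))
d = X - X.mean(axis=0)
Sigma = (d.T @ d) / len(X)          # matches np.cov(X.T, bias=True)
```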

SOME USEFUL PROPERTIES
1. $E[cX] = c \, E[X]$
2. $E[X + Y] = E[X] + E[Y]$
3. $V[c] = 0$, i.e., the variance of a constant is zero
4. $V[X] \succeq 0$ (i.e., it is positive semi-definite), where for $d = 1$, $V[X] \ge 0$; note that $V[X]$ is shorthand for $\text{Cov}[X, X]$
5. $V[cX] = c^2 \, V[X]$
6. $\text{Cov}[X, Y] = E[(X - E[X])(Y - E[Y])^\top] = E[XY^\top] - E[X]E[Y]^\top$
7. $V[X + Y] = V[X] + V[Y] + 2\,\text{Cov}[X, Y]$
In addition, if X and Y are independent random variables, it holds that $E[XY] = E[X]E[Y]$, so $\text{Cov}[X, Y] = 0$ and $V[X + Y] = V[X] + V[Y]$.

MULTIDIMENSIONAL PMF
Now record both commute time and number of red lights: $\Omega = \{4, \ldots, 14\} \times \{1, 2, 3, 4, 5\}$. The PMF is a normalized 2-d table (histogram) of occurrences.
[Figure: 2-d histogram of occurrences, with commute time (4-14) on one axis and red lights (1-5) on the other.]
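A sketch of building such a normalized 2-d table (the recorded pairs below are made up for illustration):

```python
import numpy as np

# Hypothetical recorded (commute time, red lights) pairs
data = np.array([[8, 2], [9, 3], [8, 2], [12, 4], [9, 2], [8, 3]])

times = np.arange(4, 15)       # outcome space {4, ..., 14}
lights = np.arange(1, 6)       # outcome space {1, ..., 5}

counts = np.zeros((len(times), len(lights)))
for t, r in data:
    counts[t - 4, r - 1] += 1  # tally each observed (time, lights) pair

pmf = counts / counts.sum()    # normalized 2-d table: entries sum to 1
print(pmf[8 - 4, 2 - 1])       # estimated p(time = 8, lights = 2) = 2/6
```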

MULTIDIMENSIONAL GAUSSIAN
For $x \in \mathbb{R}^d$ with mean $\mu \in \mathbb{R}^d$ and covariance $\Sigma \in \mathbb{R}^{d \times d}$:
$$p(x) = \frac{1}{(2\pi)^{d/2} |\Sigma|^{1/2}} \exp\left(-\frac{1}{2}(x - \mu)^\top \Sigma^{-1} (x - \mu)\right)$$

MIXTURES OF DISTRIBUTIONS

MIXTURES OF GAUSSIANS

EXAMPLE: SAMPLE AVERAGE IS AN UNBIASED ESTIMATOR
We obtain instances $x_1, \ldots, x_n$. What can we say about the sample average? This sample is random, so we consider i.i.d. random variables $X_1, \ldots, X_n$; this reflects that we could have seen a different set of instances $x_i$.
$$E\left[\frac{1}{n}\sum_{i=1}^n X_i\right] = \frac{1}{n}\sum_{i=1}^n E[X_i] = \frac{1}{n}\sum_{i=1}^n \mu = \mu$$
For any one sample $x_1, \ldots, x_n$, it is unlikely that $\frac{1}{n}\sum_{i=1}^n x_i = \mu$.
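A quick simulation of this point (my own sketch): across many repeated samples, the average of the sample averages approaches $\mu$, even though any single sample average typically misses it.

```python
import numpy as np

rng = np.random.default_rng(0)
mu, n, repeats = 5.0, 10, 100_000

# Draw many independent samples of size n and compute each sample average
samples = rng.normal(loc=mu, scale=2.0, size=(repeats, n))
averages = samples.mean(axis=1)

print(averages[0])        # one sample average: almost surely != 5.0
print(averages.mean())    # mean over repeats: ~5.0, reflecting E[avg] = mu
```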