Refresher on Discrete Probability


STAT 27725/CMSC 25400: Machine Learning
Shubhendu Trivedi, University of Chicago, October 2015

Background

Things you should have seen before:
1. Events, event spaces
2. Probability as limit of frequency
3. Compound events
4. Joint and conditional probability
5. Random variables
6. Expectation, variance and covariance
7. Independence and conditional independence
8. Estimation

This refresher WILL revise these topics.

Three types of Probability

Frequency of repeated trials: if an experiment is repeated infinitely many times, 0 ≤ p(A) ≤ 1 is the fraction of times that the outcome will be A. Typical example: the number of times that a coin comes up heads. Frequentist probability.

Degree of belief: a quantity obeying the same laws as the above, describing how likely we think a (possibly deterministic) event is. Typical example: the probability that the Earth will warm by more than 5 °F by 2100. Bayesian probability.

Subjective probability: "I'm 110% sure that I'll go out to dinner with you tonight."

Mixing these three notions is a source of a lot of trouble. We will start with the frequentist interpretation and then discuss the Bayesian one.

Why do we need Probability in Machine Learning?

1. To analyze, understand and predict the performance of learning algorithms (Vapnik–Chervonenkis theory, the PAC model, etc.)
2. To build flexible and intuitive probabilistic models.

Basic Notions

Sample space

Random Experiment: an experiment whose outcome cannot be determined in advance, but is nonetheless subject to analysis.
1. Tossing a coin
2. Selecting a group of 100 people and observing the number of left-handers

There are three main ingredients in the model of a random experiment. We can't predict the outcome of a random experiment with certainty, but we can specify the set of possible outcomes.

Sample Space: the sample space Ω of a random experiment is the set of all possible outcomes of the experiment.
1. {H, T}
2. {0, 1, ..., 100}

Events

We are often not interested in a single outcome, but in whether or not one of a group of outcomes occurs. Such subsets of the sample space are called events. Events are sets, so we can apply the usual set operations to them:
1. A ∪ B: the event that A or B (or both) occur
2. A ∩ B: the event that A and B both occur
3. A^c: the event that A does not occur
4. A ⊆ B: event A implies event B
5. A ∩ B = ∅: disjoint events

Axioms of Probability

The third ingredient in the model of a random experiment is the specification of the probability of events. The probability of an event A, denoted by P(A), is defined such that P(A) satisfies the following axioms:
1. P(A) ≥ 0
2. P(Ω) = 1
3. For any sequence A_1, A_2, ... of disjoint events: P(∪_i A_i) = Σ_i P(A_i)

Kolmogorov showed that these three axioms lead to the rules of probability theory. de Finetti, Cox and Carnap have also provided compelling reasons for these axioms.

Some Consequences

Probability of the empty set: P(∅) = 0
Monotonicity: if A ⊆ B then P(A) ≤ P(B)
Numeric bound: 0 ≤ P(A) ≤ 1 for all A ⊆ Ω
Addition law: P(A ∪ B) = P(A) + P(B) − P(A ∩ B)
Complement: P(A^c) = P(Ω \ A) = 1 − P(A)

The axioms of probability are the only system with this property: if you gamble using them, you can't be unfairly exploited by an opponent using some other system (de Finetti, 1931).
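These consequences can be checked exactly on a small sample space. A minimal sketch (the two-dice experiment here is an illustrative choice, not an example from the slides):

```python
from fractions import Fraction

# Sample space: all ordered outcomes of rolling two fair dice.
omega = [(i, j) for i in range(1, 7) for j in range(1, 7)]

def prob(event):
    """Equally likely principle: P(A) = |A| / |Omega|."""
    return Fraction(len(event), len(omega))

A = {w for w in omega if w[0] + w[1] == 7}   # sum is 7
B = {w for w in omega if w[0] == 6}          # first die shows 6

# Addition law: P(A ∪ B) = P(A) + P(B) - P(A ∩ B)
lhs = prob(A | B)
rhs = prob(A) + prob(B) - prob(A & B)
print(lhs, rhs)  # both 11/36
```

Exact `Fraction` arithmetic avoids any floating-point doubt about whether the two sides really agree.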

Discrete Sample Spaces

For now, we focus on the case where the sample space is countable: Ω = {ω_1, ω_2, ..., ω_n}. The probability P on a discrete sample space can be specified by first specifying the probability p_i of each elementary event ω_i and then defining

P(A) = Σ_{i: ω_i ∈ A} p_i,  for any A ⊆ Ω.

Discrete Sample Spaces

In many applications, each elementary event is equally likely; the probability of an elementary event is then 1 divided by the total number of elements in Ω.

Equally likely principle: if Ω has a finite number of outcomes, and all are equally likely, then the probability of each event A is defined as P(A) = |A| / |Ω|.

Finding P(A) then reduces to counting. What is the probability of getting a full house in poker?

13 · C(4,3) · 12 · C(4,2) / C(52,5) ≈ 0.0014
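The full-house count above (choose a rank for the triple, 3 of its 4 suits, a different rank for the pair, 2 of its 4 suits) can be evaluated directly:

```python
from math import comb

# Full house: rank for the triple, 3 of its 4 suits,
# a different rank for the pair, 2 of its 4 suits.
full_houses = 13 * comb(4, 3) * 12 * comb(4, 2)
hands = comb(52, 5)            # all equally likely 5-card hands
p = full_houses / hands
print(full_houses, hands, round(p, 4))  # 3744 2598960 0.0014
```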

Counting

Counting is not easy! Fortunately, many counting problems can be cast into the framework of drawing balls from an urn: ordered or not ordered, with replacement or without replacement.

Choosing k of n distinguishable objects

                with replacement      without replacement
ordered         n^k                   n(n−1)···(n−k+1)
not ordered     C(n+k−1, k)           C(n, k)

The unordered, without-replacement count C(n, k) usually goes in the denominator.
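Each cell of the table can be verified by brute-force enumeration for small n and k; a sketch using the standard library (the choice n = 5, k = 3 is arbitrary):

```python
from itertools import combinations, combinations_with_replacement, permutations, product
from math import comb, perm

n, k = 5, 3
items = range(n)

# ordered, with replacement: n^k
assert len(list(product(items, repeat=k))) == n ** k
# ordered, without replacement: n(n-1)...(n-k+1)
assert len(list(permutations(items, k))) == perm(n, k)
# not ordered, without replacement: C(n, k)
assert len(list(combinations(items, k))) == comb(n, k)
# not ordered, with replacement: C(n+k-1, k)
assert len(list(combinations_with_replacement(items, k))) == comb(n + k - 1, k)
```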

Indistinguishable Objects

If we choose k balls from an urn with n_1 red balls and n_2 green balls, what is the probability of getting a particular sequence of x red balls and k − x green ones? What is the probability of any such sequence? How many ways can this happen? (This goes in the numerator.)

                with replacement              without replacement
ordered         n_1^x · n_2^(k−x)             n_1···(n_1−x+1) · n_2···(n_2−(k−x)+1)
not ordered     C(k, x) · n_1^x · n_2^(k−x)   C(n_1, x) · C(n_2, k−x)

Joint and conditional probability

Joint: P(A, B) = P(A ∩ B)
Conditional: P(A | B) = P(A ∩ B) / P(B)

AI is all about conditional probabilities.

Conditional Probability

P(A | B) = fraction of worlds in which B is true that also have A true.

H = have a headache, F = have flu. P(H) = 1/10, P(F) = 1/40, P(H | F) = 1/2.

Headaches are rare and flu is rarer, but if you are coming down with flu, there is a 50-50 chance you'll have a headache.

Conditional Probability

P(H | F): fraction of flu-afflicted worlds in which you have a headache.

P(H | F) = (number of worlds with flu and headache) / (number of worlds with flu)
         = (area of H ∩ F region) / (area of F region)
         = P(H ∩ F) / P(F)

Conditional probability: P(A | B) = P(A ∩ B) / P(B)
Corollary, the chain rule: P(A ∩ B) = P(A | B) P(B)

Probabilistic Inference

H = have a headache, F = have flu. P(H) = 1/10, P(F) = 1/40, P(H | F) = 1/2.

Suppose you wake up one day with a headache and think: "50% of flus are associated with headaches, so I must have a 50-50 chance of coming down with flu." Is this reasoning good?
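The numbers given above pin down the answer exactly via the chain rule; a quick check with exact arithmetic:

```python
from fractions import Fraction

P_H = Fraction(1, 10)         # P(headache)
P_F = Fraction(1, 40)         # P(flu)
P_H_given_F = Fraction(1, 2)  # P(headache | flu)

# P(F | H) = P(H | F) P(F) / P(H)
P_F_given_H = P_H_given_F * P_F / P_H
print(P_F_given_H)  # 1/8, not 1/2
```

So the 50-50 reasoning overstates the chance of flu by a factor of four.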

Bayes' Rule: relates P(A | B) to P(B | A).

Sensitivity and Specificity

              TRUE       FALSE
predict +     true +     false +
predict −     false −    true −

Sensitivity = P(+ | disease);  FNR = P(− | disease) = 1 − sensitivity
Specificity = P(− | healthy);  FPR = P(+ | healthy) = 1 − specificity

Mammography

Sensitivity of screening mammogram: P(+ | cancer) ≈ 90%
Specificity of screening mammogram: P(− | no cancer) ≈ 91%
Probability that a woman age 40 has breast cancer: ≈ 1%

If a previously unscreened 40-year-old woman's mammogram is positive, what is the probability that she has breast cancer?

P(cancer | +) = P(cancer, +) / P(+) = P(+ | cancer) P(cancer) / P(+)
             ≈ (0.01 · 0.9) / (0.01 · 0.9 + 0.99 · 0.09)
             = 0.009 / (0.009 + 0.089) ≈ 0.09, i.e. about 9%.

Message: P(A | B) ≠ P(B | A).
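The same computation without the slide's rounding of 1 − 0.91 to 0.09:

```python
sens = 0.90   # P(+ | cancer)
spec = 0.91   # P(- | no cancer)
prior = 0.01  # P(cancer)

# Law of total probability for the denominator, then Bayes' rule
evidence = sens * prior + (1 - spec) * (1 - prior)  # P(+)
posterior = sens * prior / evidence                 # P(cancer | +)
print(round(posterior, 3))  # 0.092
```

Even a fairly accurate test yields a posterior of only about 9% because the prior is so small.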

Bayes' rule

P(B | A) = P(A | B) P(B) / P(A)

(Bayes, Thomas (1763). "An Essay towards solving a Problem in the Doctrine of Chances." Philosophical Transactions of the Royal Society of London.) Rev. Thomas Bayes (1701–1761).

Prosecutor's fallacy: Sally Clark

Sally Clark (1964–2007). Two of her children died with no explanation. Sir Roy Meadow testified that the chance of this happening due to SIDS is (1/8500)^2 ≈ (73 × 10^6)^(−1). Sally Clark was found guilty and imprisoned. The verdict was later overturned and Meadow was struck off the medical register.

Fallacy: P(SIDS | 2 deaths) ≠ P(SIDS, 2 deaths), and P(guilty | evidence) = 1 − P(not guilty | evidence) ≠ 1 − P(evidence | not guilty).

Independence

Two events A and B are independent, denoted A ⊥ B, if P(A, B) = P(A) P(B). Then

P(A | B) = P(A, B) / P(B) = P(A) P(B) / P(B) = P(A)

P(A^c | B) = (P(B) − P(A, B)) / P(B) = P(B)(1 − P(A)) / P(B) = P(A^c)
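Independence can be checked from the definition by enumeration. As an illustrative example (my choice, not one from the slides), on two fair dice the events "first die is even" and "sum is 7" turn out to be independent:

```python
from fractions import Fraction

omega = [(i, j) for i in range(1, 7) for j in range(1, 7)]

def prob(event):
    return Fraction(len(event), len(omega))

A = {w for w in omega if w[0] % 2 == 0}      # first die even
B = {w for w in omega if w[0] + w[1] == 7}   # sum is 7

# Definition of independence: P(A, B) = P(A) P(B)
print(prob(A & B), prob(A) * prob(B))  # both 1/12
```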

Independence

A collection of events {A_i} is mutually independent if for every finite subset {i_1, i_2, ..., i_n}:

P(∩_{j=1}^n A_{i_j}) = Π_{j=1}^n P(A_{i_j})

If A is independent of B and C, that does not necessarily mean that it is independent of (B, C).

Conditional independence

A is conditionally independent of B given C, denoted A ⊥ B | C, if P(A, B | C) = P(A | C) P(B | C). A ⊥ B | C does not imply and is not implied by A ⊥ B.

Common cause

p(x_A, x_B, x_C) = p(x_C) p(x_A | x_C) p(x_B | x_C)

X_A ⊥̸ X_B, but X_A ⊥ X_B | X_C. Example: smoking is a common cause of both lung cancer and yellow teeth.

Explaining away

p(x_A, x_B, x_C) = p(x_A) p(x_B) p(x_C | x_A, x_B)

X_A ⊥ X_B, but X_A ⊥̸ X_B | X_C. Example: burglary → alarm ← earthquake.

Even if two variables are independent, they can become dependent when we observe an effect that they can both influence.

Bayesian Networks

Simple case: POS tagging. We want to predict an output vector y = (y_0, y_1, ..., y_T) of random variables given an observed feature vector x (Hidden Markov Model).

Random Variables

A random variable is a function X : Ω → R. Example: the sum of two fair dice.

The set of all possible values a random variable X can take is called its range. Discrete random variables can only take isolated values (the probability of a random variable taking a particular value reduces to counting).

Discrete Distributions

Assume X is a discrete random variable. We would like to specify probabilities of events {X = x}. If we can specify the probabilities involving X, we say that we have specified the probability distribution of X. For a countable set of values x_1, x_2, ..., x_n, we have P(X = x_i) > 0 for i = 1, 2, ..., n and Σ_i P(X = x_i) = 1.

We can then define the probability mass function (pmf) f of X by f(x) = P(X = x). Sometimes written f_X.

Discrete Distributions

Example: toss a die and let X be its face value. X is discrete with range {1, 2, 3, 4, 5, 6}. The pmf is uniform: f(x) = 1/6 for every x in the range.

Another example: toss two dice and let X be the largest face value. The pmf is f(x) = (2x − 1)/36 for x ∈ {1, ..., 6}.
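The pmf of the largest face value can be obtained by enumerating the 36 equally likely outcomes:

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# pmf of X = max of two fair dice, by enumeration over all 36 outcomes
counts = Counter(max(i, j) for i, j in product(range(1, 7), repeat=2))
pmf = {x: Fraction(c, 36) for x, c in counts.items()}
print(pmf)  # f(x) = (2x - 1)/36 for x = 1..6
```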

Expectation

Assume X is a discrete random variable with pmf f. The expectation of X, E[X], is defined by

E[X] = Σ_x x P(X = x) = Σ_x x f(x)

Sometimes written µ_X. It is a weighted average of the values that X can take (another interpretation is as a center of mass). Example: the expected outcome of a toss of a fair die is 7/2.

Expectation

If X is a random variable, then a function of X, such as X², is also a random variable. The following statement is easy to prove:

Theorem. If X is discrete with pmf f, then for any real-valued function g, E[g(X)] = Σ_x g(x) f(x).

Example: E[X²], when X is the outcome of the toss of a fair die, is 91/6.
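Both examples follow directly from the definitions; a sketch in exact arithmetic:

```python
from fractions import Fraction

# Fair die: f(x) = 1/6 for x = 1..6
f = {x: Fraction(1, 6) for x in range(1, 7)}

E_X = sum(x * p for x, p in f.items())       # E[X] = sum_x x f(x)
E_X2 = sum(x**2 * p for x, p in f.items())   # E[g(X)] with g(x) = x^2
print(E_X, E_X2)  # 7/2 91/6
```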

Linearity of Expectation

A consequence of the theorem above is that expectation is linear, i.e. it has the following two properties for a, b ∈ R and functions g, h:

1. E(aX + b) = a E[X] + b
(Proof: suppose X has pmf f. Then E(aX + b) = Σ_x (ax + b) f(x) = a Σ_x x f(x) + b Σ_x f(x) = a E[X] + b.)

2. E(g(X) + h(X)) = E[g(X)] + E[h(X)]
(Proof: E(g(X) + h(X)) = Σ_x (g(x) + h(x)) f(x) = Σ_x g(x) f(x) + Σ_x h(x) f(x) = E[g(X)] + E[h(X)].)

Variance

The variance of a random variable X, denoted Var(X), is defined as Var(X) = E(X − E[X])². It is a measure of dispersion. The following two properties follow easily from the definitions of expectation and variance:

1. Var(X) = E[X²] − (E[X])²
(Proof: write E[X] = µ. Expanding, Var(X) = E(X − µ)² = E(X² − 2µX + µ²). Using linearity of expectation yields E[X²] − µ².)

2. Var(aX + b) = a² Var(X)
(Proof: Var(aX + b) = E(aX + b − (aµ + b))² = E(a²(X − µ)²) = a² Var(X).)
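Both properties can be verified numerically on the fair-die example (the constants a = 3, b = 5 are an arbitrary choice for illustration):

```python
from fractions import Fraction

# Fair die
f = {x: Fraction(1, 6) for x in range(1, 7)}
mu = sum(x * p for x, p in f.items())                    # E[X] = 7/2

var_def = sum((x - mu) ** 2 * p for x, p in f.items())   # E(X - EX)^2
var_alt = sum(x**2 * p for x, p in f.items()) - mu**2    # E[X^2] - (E[X])^2
print(var_def, var_alt)  # both 35/12

# Var(aX + b) = a^2 Var(X)
a, b = 3, 5
var_scaled = sum((a * x + b - (a * mu + b)) ** 2 * p for x, p in f.items())
print(var_scaled == a**2 * var_def)  # True
```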

Joint Distributions

Let X_1, ..., X_n be discrete random variables. The function f defined by f(x_1, ..., x_n) = P(X_1 = x_1, ..., X_n = x_n) is called the joint probability mass function of X_1, ..., X_n.

X_1, ..., X_n are independent if and only if P(X_1 = x_1, ..., X_n = x_n) = P(X_1 = x_1) ··· P(X_n = x_n) for all x_1, x_2, ..., x_n.

If X_1, ..., X_n are independent, then E(X_1 X_2 ··· X_n) = EX_1 · EX_2 ··· EX_n. (Also: if X and Y are independent, then Var(X + Y) = Var(X) + Var(Y).)

Covariance: The covariance of two random variables X and Y is defined as the number Cov(X, Y) = E(X - EX)(Y - EY). It is a measure of the amount of linear dependency between the variables. If X and Y are independent, the covariance is zero (the converse does not hold in general).
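The covariance definition can be computed directly from a joint pmf. A minimal sketch, assuming Python; both joint pmfs below are small made-up examples.

```python
# Cov(X, Y) = E(X - EX)(Y - EY) computed from a joint pmf.
def cov(joint):
    """joint: dict mapping (x, y) -> P(X = x, Y = y)."""
    ex = sum(x * p for (x, y), p in joint.items())
    ey = sum(y * p for (x, y), p in joint.items())
    return sum((x - ex) * (y - ey) * p for (x, y), p in joint.items())

# Independent X, Y (joint pmf = product of marginals): covariance is zero.
px = {0: 0.4, 1: 0.6}
py = {0: 0.7, 2: 0.3}
indep = {(x, y): px[x] * py[y] for x in px for y in py}
print(cov(indep))  # zero, up to float rounding

# A fully dependent pair (Y = X): covariance equals Var(X) = p(1-p) = 0.24.
dep = {(0, 0): 0.4, (1, 1): 0.6}
print(cov(dep))
```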

Some Important Discrete Distributions

Bernoulli Distribution: Coin Tossing

We say X has a Bernoulli distribution with success probability p if X can only take the values 0 and 1, with probabilities P(X = 1) = p = 1 - P(X = 0).

Expectation: EX = 0 · P(X = 0) + 1 · P(X = 1) = p

Variance: Var(X) = EX² - (EX)² = EX - (EX)² = p(1 - p)
(here EX² = EX because X² = X when X only takes the values 0 and 1)
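These closed forms are easy to confirm from the two-point pmf. A minimal check, with p = 0.3 as an arbitrary example value:

```python
# Exact mean and variance of Bernoulli(p) from its pmf,
# checked against EX = p and Var(X) = p(1 - p).
p = 0.3  # example value
pmf = {0: 1 - p, 1: p}
ex = sum(x * q for x, q in pmf.items())
ex2 = sum(x ** 2 * q for x, q in pmf.items())  # equals ex, since x^2 = x on {0, 1}
var = ex2 - ex ** 2
assert abs(ex - p) < 1e-12
assert abs(var - p * (1 - p)) < 1e-12
print(ex, var)
```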

Binomial Distribution

Consider a sequence of n coin tosses, and suppose X counts the total number of heads. If the probability of heads is p, then we say X has a binomial distribution with parameters n and p and write X ∼ Bin(n, p).

The pmf is f(x) = P(X = x) = (n choose x) p^x (1 - p)^(n-x), for x = 0, 1, ..., n.

Expectation: EX = np. One could evaluate the sum directly, but that is messy; use linearity of expectation instead (X can be viewed as a sum X = X_1 + X_2 + ··· + X_n of n independent Bernoulli random variables).

Variance: Var(X) = np(1 - p) (shown in a similar way to the expectation, using the independence of the X_i).
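Rather than deriving the moments by summation, one can check them numerically against the pmf. A sketch assuming Python; n = 10 and p = 0.3 are example values.

```python
from math import comb

# Binomial pmf f(x) = C(n, x) p^x (1-p)^(n-x): check that it sums to 1
# and that its mean and variance match np and np(1 - p).
n, p = 10, 0.3  # example values

def binom_pmf(x):
    return comb(n, x) * p ** x * (1 - p) ** (n - x)

total = sum(binom_pmf(x) for x in range(n + 1))
mean = sum(x * binom_pmf(x) for x in range(n + 1))
var = sum(x ** 2 * binom_pmf(x) for x in range(n + 1)) - mean ** 2
print(total, mean, var)  # ~1.0, ~np = 3.0, ~np(1-p) = 2.1
```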


Geometric Distribution

Again look at coin tosses, but count a different thing: the number of tosses up to and including the first head.

P(X = x) = (1 - p)^(x-1) p, for x = 1, 2, 3, ...

X is said to have a geometric distribution with parameter p, written X ∼ G(p).

Expectation: EX = 1/p

Variance: Var(X) = (1 - p)/p²
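The infinite sums behind EX and Var(X) can be approximated by truncating at a large cutoff, where the remaining tail mass is negligible. A sketch assuming Python; p = 0.25 and the cutoff N are example values.

```python
# Geometric pmf P(X = x) = (1-p)^(x-1) p, x = 1, 2, ...; approximate the
# infinite sums for EX and Var(X) and compare with 1/p and (1-p)/p^2.
p = 0.25     # example value
N = 10_000   # truncation point; the tail mass beyond N is negligible here

pmf = [(1 - p) ** (x - 1) * p for x in range(1, N + 1)]
mean = sum(x * q for x, q in zip(range(1, N + 1), pmf))
var = sum(x ** 2 * q for x, q in zip(range(1, N + 1), pmf)) - mean ** 2
print(mean, var)  # ~4.0 = 1/p, ~12.0 = (1-p)/p^2
```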

Poisson Distribution

A random variable X has a Poisson distribution with parameter λ > 0 if

    P(X = x) = (λ^x / x!) e^(-λ), x = 0, 1, 2, ...

We write X ∼ Poi(λ). The Poisson distribution can be seen as a limiting distribution of Bin(n, λ/n) as n → ∞.
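The limiting relationship can be illustrated by comparing the Bin(n, λ/n) pmf with the Poisson pmf at a few points for a large n. A sketch assuming Python; λ = 2.0 and n = 10,000 are example values.

```python
from math import comb, exp, factorial

# Compare the Poisson(lam) pmf with the Bin(n, lam/n) pmf: as n grows,
# the binomial probabilities approach the Poisson ones.
lam = 2.0  # example value

def poisson_pmf(x):
    return lam ** x / factorial(x) * exp(-lam)

def binom_pmf(x, n):
    q = lam / n
    return comb(n, x) * q ** x * (1 - q) ** (n - x)

for x in range(5):
    print(x, poisson_pmf(x), binom_pmf(x, 10_000))
```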

Law of Large Numbers

To discuss the law of large numbers, we will first prove Chebyshev's inequality.

Theorem (Chebyshev's Inequality). Let X be a discrete random variable with EX = µ, and let ε > 0 be any positive real number. Then

    P(|X - µ| ≥ ε) ≤ Var(X)/ε²

This basically states that the probability of a deviation from the mean of more than k standard deviations is at most 1/k² (take ε = kσ).

Law of Large Numbers

Proof. Let f(x) denote the pmf of X. The probability that X differs from µ by at least ε is

    P(|X - µ| ≥ ε) = Σ_{|x-µ| ≥ ε} f(x)

We know that Var(X) = Σ_x (x - µ)² f(x), and this is at least as large as Σ_{|x-µ| ≥ ε} (x - µ)² f(x), since all the summands are nonnegative and we have only restricted the range of summation.

But this last sum is at least

    Σ_{|x-µ| ≥ ε} ε² f(x) = ε² Σ_{|x-µ| ≥ ε} f(x) = ε² P(|X - µ| ≥ ε)

So Var(X) ≥ ε² P(|X - µ| ≥ ε), i.e. P(|X - µ| ≥ ε) ≤ Var(X)/ε².
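The inequality can be checked exactly against a pmf by computing the tail probability on the left and the bound on the right for several ε. A minimal sketch; the pmf below is a made-up example.

```python
# Chebyshev's inequality checked exactly for a small made-up pmf:
# P(|X - mu| >= eps) <= Var(X)/eps^2 for several values of eps.
pmf = {-2: 0.1, 0: 0.3, 1: 0.4, 5: 0.2}
mu = sum(x * p for x, p in pmf.items())
var = sum((x - mu) ** 2 * p for x, p in pmf.items())

for eps in (1.0, 2.0, 3.0, 4.0):
    tail = sum(p for x, p in pmf.items() if abs(x - mu) >= eps)
    assert tail <= var / eps ** 2 + 1e-12  # the bound holds
    print(eps, tail, var / eps ** 2)
```

Note that the bound is often loose: for small ε the right-hand side can even exceed 1, in which case it says nothing.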

Law of Large Numbers (Weak Form)

Theorem (Law of Large Numbers). Let X_1, X_2, ..., X_n be an independent trials process with finite expected value µ = EX_j and finite variance σ² = Var(X_j). Let S_n = X_1 + X_2 + ··· + X_n. Then for any ε > 0,

    P(|S_n/n - µ| ≥ ε) → 0 as n → ∞

and equivalently

    P(|S_n/n - µ| < ε) → 1 as n → ∞

The sample average converges in probability towards the expected value.

Proof. Since X_1, X_2, ..., X_n are independent and have the same distribution, we have Var(S_n) = nσ² and Var(S_n/n) = σ²/n. We also know that E(S_n/n) = µ. By Chebyshev's inequality, for any ε > 0,

    P(|S_n/n - µ| ≥ ε) ≤ σ²/(nε²)

Thus for fixed ε, letting n → ∞ drives the right-hand side to 0, which proves the statement.
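The theorem is easy to see in simulation: the sample average of Bernoulli(p) trials settles near µ = p as n grows. A sketch assuming Python; p and the sample sizes are example values, and the seed is fixed only for reproducibility.

```python
import random

# Weak law of large numbers in simulation: the running average S_n/n of
# Bernoulli(p) trials should settle near mu = p as n grows.
random.seed(0)
p = 0.5  # example value

def sample_average(n):
    return sum(1 if random.random() < p else 0 for _ in range(n)) / n

for n in (10, 1_000, 100_000):
    print(n, sample_average(n))
```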

Roadmap

Today: Discrete Probability
Next time: Continuous Probability