Discrete Mathematics and Probability Theory Fall 2016 Walrand Probability: An Overview


Probability is a fascinating theory. It provides a precise, clean, and useful model of uncertainty. The successes of Probability Theory in Computer Science are remarkable: data science, machine learning, artificial intelligence, voice and image recognition, and communication theory are all based on that theory. The objective of these notes is to introduce the key ideas of Probability Theory on simple examples. Hopefully, this overview will help you see the forest as you explore its different trees in the course.

1 Pick a Marble

Setup. Imagine a bag with 100 marbles that are identical, except for their color. Among those, 10 are blue, 20 are red, 30 are green, and 40 are white. You shake the bag and pick a marble without looking.

Probability. You will certainly agree that the odds that you picked a green marble are 30 out of 100. Similarly, the odds that you picked a blue marble are 10 out of 100. We say that the probability that the marble is green is 30/100 = 0.3. We write Pr[green] = 0.3.

Interpretation. What does this mean precisely? Well, this is not really that obvious. Two interpretations are useful. The first interpretation is a subjective willingness to bet on the outcome. Imagine the following game of chance: you bet some amount and you get $100.00 if the marble is green. How much are you willing to bet? I would be willing to bet $30.00. The second interpretation is frequentist. It says that if you were to repeat this experiment (shaking the bag with 100 marbles and picking a marble without looking), you would pick a green marble about 30% of the time. Note that this is an interpretation at this point, not a theorem.

Additivity. Consider the event "the marble is blue or green." The odds of that event are 40/100. We write Pr[blue or green] = 0.4. Note that Pr[blue or green] = Pr[blue] + Pr[green]. This is not surprising, since the number of marbles that are blue or green is the number of blue marbles plus the number of green marbles.
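The frequentist interpretation can be checked with a short simulation. The following Python sketch (an illustration, not part of the original notes) draws marbles from the bag described above, compares the empirical frequency of green with Pr[green] = 0.3, and verifies additivity from the counts.

```python
import random

# The bag: 10 blue, 20 red, 30 green, 40 white marbles.
bag = ["blue"] * 10 + ["red"] * 20 + ["green"] * 30 + ["white"] * 40

random.seed(0)
n_trials = 100_000
greens = sum(random.choice(bag) == "green" for _ in range(n_trials))
freq_green = greens / n_trials          # close to Pr[green] = 0.3

# Additivity: Pr[blue or green] = Pr[blue] + Pr[green] = 0.1 + 0.3.
pr = {c: bag.count(c) / len(bag) for c in ("blue", "red", "green", "white")}
pr_blue_or_green = pr["blue"] + pr["green"]
```

With 100,000 draws the empirical frequency is typically within a percent of 0.3, which is exactly what the frequentist reading of Pr[green] = 0.3 predicts.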
We say that probability is additive.

Conditional Probability. Assume that the marble you picked is blue or green. What are the odds that it is blue or red? Well, since you picked one of the 40 marbles that are either blue or green, that marble is blue or red only if it is one of the 10 blue marbles. Since 10 out of the 40 blue-or-green marbles are blue, the odds that you picked a blue or red marble, given that you picked a blue or green marble, are 10/40. We say that the conditional probability of "blue or red" given "blue or green" is 10/40. We write Pr[blue or red | blue or green] = 10/40.
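Conditioning by counting can be written in a few lines of Python (a sketch, not part of the original notes): restrict attention to the conditioning event and renormalize.

```python
# Marble counts; Pr[A | B] = (number of marbles in A and B) / (number in B).
counts = {"blue": 10, "red": 20, "green": 30, "white": 40}

def pr(colors):
    """Probability that the picked marble's color lies in `colors`."""
    return sum(counts[c] for c in colors) / sum(counts.values())

# Pr[blue or red | blue or green] = Pr[blue] / Pr[blue or green] = 10/40.
p_cond = pr({"blue"}) / pr({"blue", "green"})
```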

Note that

Pr[blue or red | blue or green] = Pr[(blue or red) and (blue or green)] / Pr[blue or green] = Pr[blue] / Pr[blue or green].

Bayes' Rule. Assume that we paint a black dot on half of the blue and half of the red marbles, and also on 20% of the green and 20% of the white marbles. You pick a marble at random and are told that the marble has a black dot. What are the odds that the marble is red? To answer this question, we note that there are 5 + 10 + 6 + 8 = 29 marbles with a black dot, out of which 10 are red. Thus, the answer is 10/29. This calculation is an example of Bayes' Rule. The idea is that one specified Pr[black dot | blue] = 0.5, and similarly for the other colors. One also knows Pr[blue] = 0.1, and similarly for the other colors. The calculation determines Pr[red | black dot], which in a sense is the reverse of the specification. A similar calculation determines the likelihood of a disease (e.g., flu) given a symptom (e.g., fever). Here, the symptom is the black dot and the disease is the color of the marble.

Random Variable. Say that you get $8.00 if you pick a blue marble, $5.00 if it is red, $2.00 if it is green, and $2.00 if it is white. The amount you get is then a function of the color of the marble you picked. This function is fixed. Let us call the function X(·). Thus, X(blue) = 8 and X(white) = 2, and so on. We call X a random variable. Thus, we say that a random variable is a real-valued function of the outcome of a random experiment. Here, the random experiment is choosing a marble. The outcome is the color of the marble. We have specified all the possible outcomes: blue, red, green, white. Also, we know the probability of each outcome. For instance, Pr[blue] = 0.1. The set of outcomes and their probabilities specifies the random experiment. The function X assigns a real number to each outcome. Note that the values assigned to different outcomes do not have to be different. Here, X(green) = X(white) = 2.

Distribution. Assume that we are interested only in how much you get, not in the details of the experiment that produces that gain.
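The same computation can be done in code. This Python sketch (illustrative, not from the notes) applies Bayes' Rule to the black-dot example: it combines the prior Pr[color] with the likelihood Pr[black dot | color] to recover Pr[red | black dot] = 10/29.

```python
# Specification: prior Pr[color] and likelihood Pr[black dot | color].
pr_color = {"blue": 0.10, "red": 0.20, "green": 0.30, "white": 0.40}
pr_dot_given = {"blue": 0.5, "red": 0.5, "green": 0.2, "white": 0.2}

# Total probability of a black dot: 5 + 10 + 6 + 8 marbles out of 100.
pr_dot = sum(pr_dot_given[c] * pr_color[c] for c in pr_color)

# Bayes' Rule reverses the specification: Pr[red | black dot] = 10/29.
pr_red_given_dot = pr_dot_given["red"] * pr_color["red"] / pr_dot
```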
In that case, we can describe X by saying that X = 8 with probability 0.1 (the probability that you pick a blue marble), X = 5 with probability 0.2, and X = 2 with probability 0.7 (the probability that you pick a green or white marble). Thus, the possible values of X are 8, 5, 2 and their probabilities are 0.1, 0.2, 0.7, respectively. These values and their probabilities are called the distribution of the random variable X.

Expectation. Imagine that you repeat the experiment (shake, pick, collect X) a very large number N of times. The frequentist interpretation suggests that the fraction of the times that you collect 8 is 0.1, that you collect 5 is 0.2, and that you collect 2 is 0.7. Thus, you collect 8 about 0.1N times, 5 about 0.2N times, and 2 about 0.7N times. Hence, the total amount you collect over the N experiments is about

8 × 0.1N + 5 × 0.2N + 2 × 0.7N = (8 × 0.1 + 5 × 0.2 + 2 × 0.7)N.

Accordingly, the average amount you collect per experiment is 8 × 0.1 + 5 × 0.2 + 2 × 0.7. We call this value the expectation of X and we write it as E[X]. We also call E[X] the mean value or the expected value of X. Thus, E[X] = 8 × 0.1 + 5 × 0.2 + 2 × 0.7 = 3.2. That is, E[X] is the sum of the values of X multiplied by their probabilities.
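The expectation calculation is a one-liner once the distribution is written down; here is a minimal Python check (not part of the notes).

```python
# Distribution of X: value -> probability.
dist = {8: 0.1, 5: 0.2, 2: 0.7}

# E[X] is the sum of the values weighted by their probabilities.
mean = sum(x * p for x, p in dist.items())   # 0.8 + 1.0 + 1.4 = 3.2
```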

Function. Would you rather play the game (pick a marble and get X) or get $3.20 without playing the game? The answer depends on a key factor that economists call the utility that you have for money. To make the situation a bit more dramatic, say that you can either get $1.00 or play a game in which you get $100.00 with probability 0.01 and $0.00 otherwise. What do you prefer? Many people tend to choose to play the game. In fact, many people play the California Lottery, where the odds of winning $100M are much less than 10^{-8}. Let h(x) be the utility that you have for $x. Say that (this is a silly example, but it will illustrate a point) h(8) = 10 and h(5) = h(3.2) = h(2) = 0. For instance, for $8.00 you can buy a ticket to go see the latest Pokemon movie you crave, and you cannot do anything of comparable value with less than $8.00. Then we find that, after playing the marble game, h(X) = 10 with probability 0.1 and h(X) = 0 with probability 0.9. Hence, E[h(X)] = 10 × 0.1 + 0 × 0.9 = 1. On the other hand, if you do not play the game and get $3.20, then h(3.2) = 0. Thus, you would rather play the game. Similarly, people play the lottery because winning would change their life, presumably for the better, whereas losing $1.00 does not affect their life.

Note that we calculated E[h(X)] by finding the distribution (recall that this means the set of possible values and their probabilities) of h(X). We could have calculated E[h(X)] directly from the distribution of X:

E[h(X)] = h(8)Pr[X = 8] + h(5)Pr[X = 5] + h(2)Pr[X = 2] = 10 × 0.1 + 0 × 0.2 + 0 × 0.7.

This is a simple observation, but it is convenient. In a similar way, we could have computed E[h(X)] by looking at the outcomes of the marble-picking game:

E[h(X)] = h(X(blue))Pr[blue] + h(X(red))Pr[red] + h(X(green))Pr[green] + h(X(white))Pr[white] = h(8) × 0.1 + h(5) × 0.2 + h(2) × 0.3 + h(2) × 0.4.

Indeed, these three different ways of calculating E[h(X)] correspond to different ways of summing the possible ways of getting the values of h(X): summing over the values of h(X), over the values of X, or over the outcomes.
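The three ways of computing E[h(X)] can be checked directly. The following Python sketch (illustrative) uses the silly utility from the text and confirms that all three sums give the same value, 1.

```python
# Utility h, payoff X, and outcome probabilities from the example.
h = {8: 10, 5: 0, 2: 0}
X = {"blue": 8, "red": 5, "green": 2, "white": 2}
pr = {"blue": 0.1, "red": 0.2, "green": 0.3, "white": 0.4}

# 1. Over the distribution of h(X): 10 w.p. 0.1, 0 w.p. 0.9.
e1 = 10 * 0.1 + 0 * 0.9
# 2. Over the distribution of X.
dist_X = {8: 0.1, 5: 0.2, 2: 0.7}
e2 = sum(h[x] * p for x, p in dist_X.items())
# 3. Over the outcomes of the marble-picking experiment.
e3 = sum(h[X[o]] * pr[o] for o in pr)
```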
Variance. We saw that one can describe a random variable X by its distribution. A summary of that distribution is the mean value E[X]. However, our discussion of utility shows that this description is a bit crude and may not suffice to decide whether to play a game of chance. For instance, the expected gain from playing the lottery is negative, yet you would not play a game in which you are certain to lose. The mean value does not say anything about the uncertainty of X, i.e., its variability. Here, by variability we mean that if we play the game many times, we observe a variety of values of X. The variance is a one-number summary of variability. The variance of X is defined by

var[X] = E[(X − E[X])^2].

The intuition is that if X is almost always close to E[X], then the variance is small; otherwise, it is large. In our marble example, E[X] = 3.2. Since X = 8, 5, or 2 with probability 0.1, 0.2, 0.7, respectively, we see that

var[X] = E[(X − E[X])^2] = (8 − 3.2)^2 Pr[X = 8] + (5 − 3.2)^2 Pr[X = 5] + (2 − 3.2)^2 Pr[X = 2] = (8 − 3.2)^2 × 0.1 + (5 − 3.2)^2 × 0.2 + (2 − 3.2)^2 × 0.7 = 23.04 × 0.1 + 3.24 × 0.2 + 1.44 × 0.7 = 3.96.

The square root of the variance is called the standard deviation, and we denote it by σ_X. Here, σ_X = √3.96 ≈ 2.
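A quick numerical check of the variance computation (Python, illustrative, not part of the notes):

```python
import math

# var[X] = E[(X - E[X])^2] for the marble payoff.
dist = {8: 0.1, 5: 0.2, 2: 0.7}
mean = sum(x * p for x, p in dist.items())                 # 3.2
var = sum((x - mean) ** 2 * p for x, p in dist.items())    # 3.96
std = math.sqrt(var)                                       # about 1.99
```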

Figure 1: Linear regression of Y over X (brown)
Figure 2: Quadratic regression of Y over X (purple)

Linear Regression. Consider once again our bag of marbles. Define another random variable Y by Y(blue) = 1, Y(red) = 1, Y(green) = 3, and Y(white) = 4. Thus, each outcome (i.e., color) is assigned two numbers: X and Y. In another context, each person is associated with a height and a weight. Say that you want to guess the weight of a person from his/her height. How do you do it? Here, we want to guess Y from the value of X, and a picture helps. Figure 1 shows the values of X and Y associated with the four possible outcomes. For instance, the blue outcome is associated with X(blue) = 8 and Y(blue) = 1. The figure also shows the probabilities of the different outcomes. We want a simple formula that provides a guess of Y based on X. In fact, we want a formula of the form Ŷ = a + bX. Here, Ŷ is our guess for Y based on the value of X, and a and b are constants. This formula corresponds to the line shown in the figure. We choose a and b so that the guess Ŷ tends to be close to Y. This means that the line should be close to the actual points (X, Y) in the figure. Thus, Ŷ − Y should be small. We make this precise by requiring that E[(Ŷ − Y)^2] be as small as possible. That is, we choose a and b to minimize E[(Ŷ − Y)^2] = E[(a + bX − Y)^2]. We explain in the lectures that the best choice of a and b is such that

Ŷ = E[Y] + (cov(X, Y) / var[X]) (X − E[X]),

where cov(X, Y) = E[XY] − E[X]E[Y].

Quadratic Regression. In the previous section, we estimated Y by using a linear function a + bX of X, as shown in Figure 1. Figure 2 suggests that a quadratic estimate c + dX + eX^2 is better than a linear estimate, i.e., that it is closer to the pairs (X, Y). In the lectures, we explain how to find the best values of c, d, e.
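The regression formula can be evaluated numerically for the marble data. The Python sketch below (illustrative; it takes the Y values as printed in the text) computes cov(X, Y) and var[X] and forms the line Ŷ = a + bX.

```python
# Outcome probabilities and the two random variables X and Y.
pr = {"blue": 0.1, "red": 0.2, "green": 0.3, "white": 0.4}
X = {"blue": 8, "red": 5, "green": 2, "white": 2}
Y = {"blue": 1, "red": 1, "green": 3, "white": 4}

EX = sum(X[o] * pr[o] for o in pr)                  # E[X] = 3.2
EY = sum(Y[o] * pr[o] for o in pr)                  # E[Y]
EXY = sum(X[o] * Y[o] * pr[o] for o in pr)          # E[XY]
varX = sum((X[o] - EX) ** 2 * pr[o] for o in pr)    # var[X] = 3.96
cov = EXY - EX * EY                                 # cov(X, Y)

b = cov / varX        # slope of the best line
a = EY - b * EX       # intercept, so Yhat = EY + (cov/varX)(X - EX)
```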

Figure 3: Conditional expectation of Y given X (green)

Conditional Expectation. What if we could choose any function of X, instead of being limited to linear or quadratic functions? Figure 3 shows the best possible function g(X) of X for estimating Y. We explain in the lectures how to calculate that function, called the conditional expectation of Y given X.

2 Flip Coins

So far, we looked at one or two random variables. In this section, we explore many random variables.

Setup. You have a coin. When you flip it, there are two possible outcomes: heads (H) and tails (T). Let p = Pr[H], so that Pr[T] = 1 − p. For instance, the coin could be biased with p = 0.6, so that heads is more likely than tails.

Independence. Say that you flip the coin twice. There are four possible outcomes for this experiment: HH, HT, TH, and TT. Here, HT means that the first flip produces H and the second T, and similarly for the other outcomes. If we recall the definition of conditional probability, we have

Pr[first flip yields H | second flip yields H] = Pr[(first flip yields H) and (second flip yields H)] / Pr[second flip yields H] = Pr[HH] / p.

In the last step, we used the fact that the probability that the second flip yields H is p. Now, it is reasonable to assume that the likelihood that the first flip yields H does not depend on the fact that the second flip yields H, and that this likelihood is then p. Hence, we are led to the conclusion that p = Pr[HH]/p, so that Pr[HH] = p^2. This assumption is called the independence of the coin flips. A similar reasoning leads to the conclusion that Pr[HT] = p(1 − p), Pr[TH] = (1 − p)p, and Pr[TT] = (1 − p)^2.

Let X = 1 when the first flip is H and X = 0 when it is T. Also, let Y = 1 when the second flip is H and Y = 0 when it is T. Then we see that Pr[X = 1] = Pr[Y = 1] = p and Pr[X = 1, Y = 1] = Pr[X = 1]Pr[Y = 1]. Also, Pr[X = 1, Y = 0] = Pr[X = 1]Pr[Y = 0]. More generally, Pr[X = a, Y = b] = Pr[X = a]Pr[Y = b] for all a, b. Two random variables with that property are said to be independent.
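Independence can also be tested empirically. The following Python sketch (illustrative) simulates pairs of flips of a biased coin with p = 0.6 and checks that the empirical frequency of HH is close to p^2 = 0.36.

```python
import random

random.seed(1)
p = 0.6
n = 200_000

# Each trial: two independent flips; count the HH outcomes.
hh = sum((random.random() < p) and (random.random() < p) for _ in range(n))
freq_hh = hh / n    # close to p**2 = 0.36
```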

Variance of a Sum. Let X and Y be independent random variables. We show in the lectures that var[X + Y] = var[X] + var[Y]. More generally, if X_1, ..., X_n are random variables such that any two of them are independent, then

var[X_1 + ··· + X_n] = var[X_1] + ··· + var[X_n].

Moreover, we will see that var[aX] = a^2 var[X] for any random variable X and any constant a. Consequently, we see that

var[(X_1 + ··· + X_n)/n] = (var[X_1] + ··· + var[X_n]) / n^2.

In particular, if var[X_m] = σ^2 for m = 1, ..., n, we have

var[(X_1 + ··· + X_n)/n] = (var[X_1] + ··· + var[X_n]) / n^2 = nσ^2 / n^2 = σ^2 / n.

Chebyshev's Inequality. Flip a coin n times and let X_m = 1 if flip m yields H and X_m = 0 otherwise. Then

var[X_m] = E[(X_m − E[X_m])^2] = E[(X_m − p)^2] = (1 − p)^2 Pr[X_m = 1] + (0 − p)^2 Pr[X_m = 0] = (1 − p)^2 p + p^2 (1 − p) = p(1 − p).

Accordingly, in view of the previous section,

var[(X_1 + ··· + X_n)/n] = p(1 − p)/n.

Thus, when n is large, the variance of A_n := (X_1 + ··· + X_n)/n is very small. This suggests that the random variable A_n tends to be very close to its mean value, which happens to be p. Thus, we expect the fraction of heads A_n in n coin flips to be close to p. To make this idea precise, Chebyshev developed an inequality which says that

Pr[|X − E[X]| > ε] ≤ var[X] / ε^2.

We prove this inequality in the lectures. Thus, the likelihood that a random variable X differs from its mean E[X] by at least ε is small if var[X] is small. If we apply this inequality to A_n, we find that

Pr[|A_n − p| ≥ ε] ≤ p(1 − p) / (nε^2).

Note that p(1 − p) ≤ 1/4 for any value of p. Consequently, we see that

Pr[|A_n − p| ≥ ε] ≤ 1 / (4nε^2).

Confidence Interval. Say that you do not know the value of p = Pr[H]. To estimate it, you flip the coin n times and note the fraction A_n of heads. The last inequality holds. Let us choose ε so that the right-hand side of the inequality is 0.05 = 1/20. That is, we choose ε so that 4nε^2 = 20, i.e., ε^2 = 5/n, or ε = √(5/n) ≈ 2.25/√n. Hence, the previous inequality with that value of ε implies that

Pr[|A_n − p| ≥ 2.25/√n] ≤ 0.05, so that Pr[|A_n − p| ≤ 2.25/√n] ≥ 1 − 0.05 = 95%.

Now, since |A_n − p| ≤ δ if and only if p ∈ [A_n − δ, A_n + δ], we conclude that

Pr[p ∈ [A_n − 2.25/√n, A_n + 2.25/√n]] ≥ 95%.

For instance, say that n = 10^4 and A_n = 0.31. We then conclude that

Pr[p ∈ [0.31 − 2.25/100, 0.31 + 2.25/100]] ≥ 95%, so that Pr[p ∈ [0.2875, 0.3325]] ≥ 95%.

We say that [0.2875, 0.3325] is a 95%-confidence interval for p. As you can see, the width of the confidence interval decreases like 1/√n. This example is the basis for the estimates in public opinion surveys.

Time until the first H. We flip the coin until we get the first H. How many times do we need to flip the coin, on average? Let β be that average number of flips. That number of flips is 1 if the first flip is H, which occurs with probability p. If the first flip is T, which occurs with probability 1 − p, then the process starts afresh and we need to flip the coin β more times, on average. Thus,

β = p × 1 + (1 − p)(1 + β).

Solving, we find β = 1/p.

Time until two consecutive Hs. We flip the coin until we get two consecutive Hs. How many times do we need to flip the coin, on average? Let β be that average number of flips. Let also β(H) be the average number of additional flips until two consecutive Hs, given that the last flip was H. Then we claim that

β = p(1 + β(H)) + (1 − p)(1 + β),
β(H) = p × 1 + (1 − p)(1 + β).

The first identity can be seen by noting that if the first flip is H, then after that first flip one needs β(H) additional flips, on average, since the last flip was H. However, if the first flip is T, then after the first flip one needs β additional flips, on average. The second identity can be justified similarly. Solving, one finds β = 1/p + 1/p^2.
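The two recursions can be sanity-checked by simulation. The Python sketch below (illustrative, with p = 0.5) estimates both average waiting times; the answers should be close to 1/p = 2 and 1/p + 1/p^2 = 6.

```python
import random

random.seed(3)
p = 0.5
trials = 100_000

def flips_until_first_h():
    """Number of flips until the first H."""
    k = 0
    while True:
        k += 1
        if random.random() < p:
            return k

def flips_until_two_hs():
    """Number of flips until two consecutive Hs."""
    k, run = 0, 0
    while run < 2:
        k += 1
        run = run + 1 if random.random() < p else 0
    return k

avg1 = sum(flips_until_first_h() for _ in range(trials)) / trials  # near 1/p
avg2 = sum(flips_until_two_hs() for _ in range(trials)) / trials   # near 1/p + 1/p**2
```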