The Random Variable for Probabilities Chris Piech CS109, Stanford University

Size: px

Start display at page:

Download "The Random Variable for Probabilities Chris Piech CS109, Stanford University"

Diana Hood
5 years ago
Views:

1 The Random Variable for Probabilities Chris Piech CS109, Stanford University

2 Assignment Grades Frequency Frequency Frequency Frequency We have 2055 assignment distributions from grade scope

3 Flip a Coin With Unknown Probability Demo

4 Four Prototypical Trajectories Today we are going to learn something unintuitive, beautiful and useful

5 Four Prototypical Trajectories Review

6 Conditional Events Recall that for events E and F: P( EF) P ( E F) = where P( F) > P( F) E F 0

7 Discrete Conditional Distributions

8 Continuous Conditional Distributions Let X and Y be continuous random variables Conditional PDF of X given Y (where f Y (y) > 0): P (X = x Y = y) = P (X = x, Y = y) P (Y = y) f X Y (x, y)@x = f X,Y (x, y)@x@y f X Y (y)@y f X Y (x, y)@x = f X,Y (x, y)@x f X Y (y) f X Y (x, y) = f X,Y (x, y) f X Y (y)

9 Piech, CS106A, Stanford University Conditioning with a continuous random variable feels weird at first. But then it gets good. Its like biking with a helmet

10 Continuous Conditional Distributions Let X be continuous random variable Let E be an event: P (X = x, E) P (E X = x) = P (X = x) P (X = x E)P (E) = P (X = x) = f X(x E)P (E)@x f X (x)@x = f X(x E)P (E) f X (x)

11 Piech, CS106A, Stanford University Anomaly Detection Let X be a measure of time to answer a question Let E be the event that the user is a human: P (X = x, E) P (E X = x) = P (X = x) P (X = x E)P (E) = P (X = x) = f X(x E)P (E)@x f X (x)@x = f X(x E)P (E) f X (x)

12 Piech, CS106A, Stanford University Anomaly Detection Let X be a measure of time to answer a question Let E be the event that the user is a human What if you don t know unconditional?: Normal pdf P (E X = x) = f X(x E)P (E) f X (x) P (E X = x) P (E C X = x) =

13 Biometric Keystroke Piech, CS106A, Stanford University

14 Let X be a continuous random variable Let N be a discrete random variable Conditional PDF of X given N: Conditional PMF of N given X: If X and N are independent, then: ) ( ) ( ) ( ) ( n p x f x n p n x f N X X N N X = ) ( ) ( ) ( ) ( x f n p n x f x n p X N N X X N = ) ( ) ( x f n x f X N X = ) ( ) ( n p x n p N X N = Mixing Discrete and Continuous

15 Four Prototypical Trajectories End Review

17 Piech, CS106A, Stanford University We are going to think of probabilities as random variables!!!

18 Flip a Coin With Unknown Probability Flip a coin (n + m) times, comes up with n heads We don t know probability X that coin comes up heads Frequentist Bayesian X = lim n+m!1 n n + m n n + m f X N (x n) = P (N = n X = x)f X (x) P (N = n) X is a single value X is a random variable

19 Flip a Coin With Unknown Probability Flip a coin (n + m) times, comes up with n heads We don t know probability X that coin comes up heads Our belief before flipping coins is that: X ~ Uni(0, 1) Let N = number of heads Given X = x, coin flips independent: (N X) ~Bin(n +m, x) f X N (x n) = P (N = n X = x)f X(x) P (N = n) Bayesian posterior probability distribution Bayesian prior probability distribution

20 Flip a Coin With Unknown Probability Flip a coin (n + m) times, comes up with n heads We don t know probability X that coin comes up heads Our belief before flipping coins is that: X ~ Uni(0, 1) Let N = number of heads Given X = x, coin flips independent: (N X) ~Bin(n +m, x) f X N (x n) = P (N = n X = x)f X(x) P (N = n) Binomial = = n+m n x n (1 x) m P (N = n) n+m n P (N = n) xn (1 x) m Z = 1 c xn (1 x) m where c = 1 Z 1 0 x n (1 x) m dx

21 Beta Random Variable f X is a Beta Random Variable: X ~ Beta(a, b) B( ( x) = 0 Probability Density Function (PDF): (where a, b > 0) 1 x, ) a x a b 1 (1 ) b 1 0 < x < 1 otherwise where B 1 a 1 b 1 ( a, b) = x (1 x) dx 0 Symmetric when a = b a E[ X ] = Var X ) a + b ( = 2 ( a + b) ab ( a + b + 1)

22 Meta Beta Used to represent a distributed belief of a probability

23 Piech, CS106A, Stanford University Beta is a distribution for probabilities

24 Back to flipping coins Flip a coin (n + m) times, comes up with n heads We don t know probability X that coin comes up heads Our belief before flipping coins is that: X ~ Uni(0, 1) Let N = number of heads Given X = x, coin flips independent: (N X) ~Bin(n +m, x) f X N (x n) = P (N = n X = x)f X(x) P (N = n) = = n+m n x n (1 x) m P (N = n) n+m n P (N = n) xn (1 x) m Z = 1 c xn (1 x) m where c = Z 1 0 x n (1 x) m dx

25 Flip a coin (n + m) times, comes up with n heads Conditional density of X given N = n f X Note: 1 n m n m N ( x n) = x (1 x) where c = x (1 x) dx c Recall Beta distribution: B( f ( x) = 0 Dude, Where s My Beta? 0 < x < 1, so f ( x n) = 1 x, ) a x a b 1 (1 ) b 1 0 < x < 1 otherwise Hey, that looks more familiar now... X (N = n, n + m trials) ~ Beta(n + 1, m + 1) X N B otherwise 1 a 1 b 1 ( a, b) = x (1 x) dx 0

26 Understanding Beta X (N = n, m + n trials) ~ Beta(n + 1, m + 1) X ~ Uni(0, 1) Check this out, boss: o Beta(1, 1) =? f 1 B( a, b) a 1 b 1 0 ( x) = x (1 x) = x (1 x 1 = 1 1= 1 where 0 < x < 1 1dx 0 1 B( a, b) ) 0 o Beta(1, 1) = Uni(0, 1) So, X ~ Beta(1, 1)

27 If the Prior was a Beta If our belief about X (that random variable for probability) was beta f X (x) = 1 B(a, b) xa 1 (1 x) b 1 What is our belief about X after observing N heads? f X N (x, n) =???

28 If the Prior was a Beta f X N (x, n) = P (N = n X = x)f X(x) P (N = n) = n+m n x n (1 x) m f X (x) P (N = n) n+m n x n (1 x) m 1 B(a,b) = xa 1 (1 x) b 1 P (N = n) n + m = K 1 x n (1 x) m 1 n B(a, b) xa 1 (1 x) b 1 = K 2 x n (1 x) m 1 B(a, b) xa 1 (1 x) b 1 = K 2 x n (1 x) m x a 1 (1 x) b 1 = K 2 x n+a 1 (1 x) m+b 1 X N Beta(n + a, m + b)

29 Understanding Beta If Prior distribution of X (before seeing flips) is Beta Then Posterior distribution of X (after flips) is Beta Beta is a conjugate distribution for Beta Prior and posterior parametric forms are the same! Practically, conjugate means easy update: o Add number of heads and tails seen to Beta parameters

30 Further Understanding Beta Can set X ~ Beta(a, b) as prior to reflect how biased you think coin is apriori This is a subjective probability! Then observe n + m trials, where n of trials are heads Update to get posterior probability X (n heads in n + m trials) ~ Beta(a + n, b + m) Sometimes call a and b the equivalent sample size Prior probability for X based on seeing (a + b 2) imaginary trials, where (a 1) of them were heads. Beta(1, 1) ~ Uni(0, 1) à we haven t seen any imaginary trials, so apriori know nothing about coin

31 Check out Demo!

32 Four Prototypical Trajectories Damn

33 Four Prototypical Trajectories Next level?

34 Assignment Grades Frequency Frequency Frequency Frequency We have 2055 assignment distributions from gradescope

35 Binomial Distributions Geometric Exponential Poisson Neg Binomial Normal Beta Beta Uniform

36 Four Prototypical Trajectories Grades must be bounded

37 Four Prototypical Trajectories Normal: No

38 Four Prototypical Trajectories Poisson: No

39 Four Prototypical Trajectories Exponential: No

40 Four Prototypical Trajectories Beta: Yes

41 Assignment Grades Demo Assignment id = 1613 Frequency

42 Assignment Grades Demo Assignment id = 1613 Frequency X Beta(a =8.28,b=3.16)

43 Assignment Grades a = 4.03, b = 3.00 a = 8.28, b = Frequency Frequency Frequency Frequency a = 5.06, b = 3.79 a = 3.46, b = 0.35 We have 2055 assignment distributions from grade scope

44 Beta is a Better Fit Unpublished results. Based on Gradescope data

45 Beta is a Better Fit For All Class Sizes Unpublished results. Based on Gradescope data

46 Binomial Interpretation Each student has the same probability of getting each point. Generate grades by flipping a coin 100 times for each student. The resulting distribution is binomial. - Binomial

47 Normal Interpretation What the Binomial said, but approximated. - Normal

48 Beta Interpretation Each student s ability is represented as a probability. The distribution of probabilities is a Beta distribution. Each student has a different probability of getting points, and that probability is sampled from a Beta distribution. - Beta This is Chris Piech s opinion. It is open for debate

49 Assignment Grades a = 4.03, b = 3.00 a = 8.28, b = Frequency Frequency Frequency Frequency a = 5.06, b = 3.79 a = 3.46, b = 0.35 These are the distribution of student point probabilitities

50 Assignment Grades Demo What is the semantics of E[X]? Frequency X Beta(a =8.28,b=3.16)

51 Assignment Grades What is the probability that a student is bellow the mean? X Beta(a =8.28,b=3.16) E[X] = a a + b = P (X <0.7238) = F X (0.7238) Wait what? Chris are you holding out on me? stats.beta.cdf(x, alpha, beta) P (X <E[X]) = 0.46

52 Four Prototypical Trajectories As far as I know, this is an unpublished result

53 Implications Will be combined with Item Response Theory which models how assignment difficulty and student ability combine to give point probabilities. Suggests a way to calculate final grades as a probabilistic most likely estimate of ability. Machine learning on education data will be more accurate. Analysis of mixture distributions can be fixed.

54 Four Prototypical Trajectories Will you use this on us?

55 Four Prototypical Trajectories Not yet J

56 Four Prototypical Trajectories Beta: The probability density for probabilities

57 Piech, CS106A, Stanford University Any parameter for a parameterized random variable can be thought of as a random variable.

58 Course Mean Ε[CS109] This is actual midpoint of course (Just wanted you to know)

The Random Variable for Probabilities Chris Piech CS109, Stanford University

The Random Variable for Probabilities Chris Piech CS109, Stanford University Assignment Grades 10 20 30 40 50 60 70 80 90 100 10 20 30 40 50 60 70 80 90 100 Frequency Frequency 10 20 30 40 50 60 70 80