Probability Densities in Data Mining

Probabilit Densities in Data Mining Note to other teachers and users of these slides. Andrew would be delighted if ou found this source material useful in giving our own lectures. Feel free to use these slides verbatim, or to modif them to fit our own needs. PowerPoint originals are available. If ou make use of a significant ortion of these slides in our own lecture, lease include this message, or the following link to the source reositor of Andrew s tutorials: htt://www.cs.cmu.edu/~awm/tutorials. Comments and corrections gratefull received. Andrew W. Moore Associate Professor School of Comuter Science Carnegie Mellon Universit www.cs.cmu.edu/~awm awm@cs.cmu.edu 41-68-7599 Coright 001, Andrew W. Moore Aug 7, 001 Probabilit Densities in Data Mining Wh we should care Notation and Fundamentals of continuous PDFs Multivariate continuous PDFs Combining continuous and discrete random variables Coright 001, Andrew W. Moore Probabilit Densities: Slide

Wh we should care Real Numbers occur in at least 50% of database records Can t alwas quantize them So need to understand how to describe where the come from A great wa of saing what s a reasonable range of values A great wa of saing how multile attributes should reasonabl co-occur Coright 001, Andrew W. Moore Probabilit Densities: Slide 3 Wh we should care Can immediatel get us Baes Classifiers that are sensible with real-valued data You ll need to intimatel understand PDFs in order to do kernel methods, clustering with Miture Models, analsis of variance, time series and man other things Will introduce us to linear and non-linear regression Coright 001, Andrew W. Moore Probabilit Densities: Slide 4

A PDF of American Ages in 000 Coright 001, Andrew W. Moore Probabilit Densities: Slide 5 A PDF of American Ages in 000 Let X be a continuous random variable. If is a Probabilit Densit Function for X then a < X b P d b a 50 < Age 50 P 30 age dage 0.36 age 30 Coright 001, Andrew W. Moore Probabilit Densities: Slide 6

Proerties of PDFs a < X b P d b a That means lim h 0 h P < X h h + P X Coright 001, Andrew W. Moore Probabilit Densities: Slide 7 Proerties of PDFs a < X b P d P b a X Therefore Therefore d 1 : 0 Coright 001, Andrew W. Moore Probabilit Densities: Slide 8

Talking to our stomach What s the gut-feel meaning of? If then 5.31 0.06 and 5.9 0.03 when a value X is samled from the distribution, ou are times as likel to find that X is ver close to 5.31 than that X is ver close to 5.9. Coright 001, Andrew W. Moore Probabilit Densities: Slide 9 Talking to our stomach What s the gut-feel meaning of? If then 5.31 a 0.06 and 5.9 b 0.03 when a value X is samled from the distribution, ou are times as likel to find that X is ver close to 5.31 a than that X is ver close to 5.9. b Coright 001, Andrew W. Moore Probabilit Densities: Slide 10

Talking to our stomach What s the gut-feel meaning of? If then 5.31 a 0.03 z and 5.9 b 0.06 z when a value X is samled from the distribution, ou are times as likel to find that X is ver close to 5.31 a than that X is ver close to 5.9. b Coright 001, Andrew W. Moore Probabilit Densities: Slide 11 Talking to our stomach What s the gut-feel meaning of? If then 5.31 a 0.03 αz and 5.9 b 0.06 z when a value X is samled from the distribution, ou are α times as likel to find that X is ver close to 5.31 a than that X is ver close to 5.9. b Coright 001, Andrew W. Moore Probabilit Densities: Slide 1

Talking to our stomach What s the gut-feel meaning of? If then a b α when a value X is samled from the distribution, ou are α times as likel to find that X is ver close to 5.31 a than that X is ver close to 5.9. b Coright 001, Andrew W. Moore Probabilit Densities: Slide 13 Talking to our stomach What s the gut-feel meaning of? If a b then P a h < lim h 0 P b h < α X X < a + h < b + h α Coright 001, Andrew W. Moore Probabilit Densities: Slide 14

Yet another wa to view a PDF A recie for samling a random age. 1. Generate a random dot from the rectangle surrounding the PDF curve. Call the dot age,d. If d < age sto and return age 3. Else tr again: go to Ste 1. Coright 001, Andrew W. Moore Probabilit Densities: Slide 15 Test our understanding True or False: : 1 : P X 0 Coright 001, Andrew W. Moore Probabilit Densities: Slide 16

Eectations E[X] the eected value of random variable X the average value we d see if we took a ver large number of random samles of X d Coright 001, Andrew W. Moore Probabilit Densities: Slide 17 Eectations E[age]35.897 E[X] the eected value of random variable X the average value we d see if we took a ver large number of random samles of X Coright 001, Andrew W. Moore Probabilit Densities: Slide 18 d the first moment of the shae formed b the aes and the blue curve the best value to choose if ou must guess an unknown erson s age and ou ll be fined the square of our error

Eectation of a function E[age ] 1786.64 E[age] 188.6 µe[fx] the eected value of f where is drawn from X s distribution. the average value we d see if we took a ver large number of random samles of fx µ f d Note that in general: E[ f ] f E[ X ] Coright 001, Andrew W. Moore Probabilit Densities: Slide 19 Variance σ Var[X] the eected squared difference between and E[X] Var [age] 498.0 σ µ d amount ou d eect to lose if ou must guess an unknown erson s age and ou ll be fined the square of our error, and assuming ou la otimall Coright 001, Andrew W. Moore Probabilit Densities: Slide 0

Standard Deviation σ Var[X] the eected squared difference between and E[X] Var [age] σ.3 498.0 σ µ d amount ou d eect to lose if ou must guess an unknown erson s age and ou ll be fined the square of our error, and assuming ou la otimall σ Standard Deviation tical deviation of X from its mean σ Var[ X ] Coright 001, Andrew W. Moore Probabilit Densities: Slide 1 In dimensions, robabilit densit of random variables X,Y at location, Coright 001, Andrew W. Moore Probabilit Densities: Slide

In dimensions Let X,Y be a air of continuous random variables, and let R be some region of X,Y sace P X, Y R, R, dd Coright 001, Andrew W. Moore Probabilit Densities: Slide 3 In dimensions Let X,Y be a air of continuous random variables, and let R be some region of X,Y sace P X, Y R, R, dd P 0<mg<30 and 500<weight<3000 area under the -d surface within the red rectangle Coright 001, Andrew W. Moore Probabilit Densities: Slide 4

In dimensions Let X,Y be a air of continuous random variables, and let R be some region of X,Y sace P X, Y R, R, dd P [mg-5/10] + [weight-3300/1500] < 1 area under the -d surface within the red oval Coright 001, Andrew W. Moore Probabilit Densities: Slide 5 In dimensions Let X,Y be a air of continuous random variables, and let R be some region of X,Y sace P X, Y R, R, dd Take the secial case of region R everwhere. Remember that with robabilit 1, X,Y will be drawn from somewhere. So.., dd 1 Coright 001, Andrew W. Moore Probabilit Densities: Slide 6

In dimensions Let X,Y be a air of continuous random variables, and let R be some region of X,Y sace P X, Y R, R, dd, lim h 0 h P < X h + h h < Y h + Coright 001, Andrew W. Moore Probabilit Densities: Slide 7 In m dimensions Let X 1,X, X m be an n-tule of continuous random variables, and let R be some region of R m P X, X,..., X, 1 1 m R...,..., m R, 1,..., m d m,,... d, d 1 Coright 001, Andrew W. Moore Probabilit Densities: Slide 8

Indeendence X Y iff, :, If X and Y are indeendent then knowing the value of X does not hel redict the value of Y mg,weight NOT indeendent Coright 001, Andrew W. Moore Probabilit Densities: Slide 9 Indeendence X Y iff, :, If X and Y are indeendent then knowing the value of X does not hel redict the value of Y the contours sa that acceleration and weight are indeendent Coright 001, Andrew W. Moore Probabilit Densities: Slide 30

Multivariate Eectation µ X E [ X] d E[mg,weight] 4.5,600 The centroid of the cloud Coright 001, Andrew W. Moore Probabilit Densities: Slide 31 Multivariate Eectation E [ f X] f d Coright 001, Andrew W. Moore Probabilit Densities: Slide 3

Test our understanding Question : When if ever does E [ X + Y] E[ X ] + E[ Y]? All the time? Onl when X and Y are indeendent? It can fail even if X and Y are indeendent? Coright 001, Andrew W. Moore Probabilit Densities: Slide 33 Bivariate Eectation E [ f, ] f,, dd if f, then E[ f X, Y ], dd if f, then E[ f X, Y], dd if f, + then E[ f X, Y ] +, dd E [ X + Y] E[ X ] + E[ Y] Coright 001, Andrew W. Moore Probabilit Densities: Slide 34

σ Bivariate Covariance Cov[ X, Y ] E[ X µ Y µ ] σ σ σ σ Cov[ X, X ] Var[ X ] E[ X µ Cov[ Y, Y] Var[ Y ] E[ Y µ ] ] Coright 001, Andrew W. Moore Probabilit Densities: Slide 35 σ Bivariate Covariance Cov[ X, Y ] E[ X µ Y µ ] σ σ σ Cov[ X] σ Cov[ X, X ] Var[ X ] E[ X µ Cov[ Y, Y] Var[ Y ] E[ Y µ X Write X, then Y ] σ σ T E[ X µ X µ ] Σ σ σ Coright 001, Andrew W. Moore Probabilit Densities: Slide 36 ]

Covariance Intuition σ weight 700 E[mg,weight] 4.5,600 σ weight 700 σ mg 8 σ mg 8 Coright 001, Andrew W. Moore Probabilit Densities: Slide 37 Covariance Intuition Princial Eigenvector of Σ σ weight 700 E[mg,weight] 4.5,600 σ weight 700 σ mg 8 σ mg 8 Coright 001, Andrew W. Moore Probabilit Densities: Slide 38

Cov[ X] Covariance Fun Facts σ σ T E[ X µ X µ ] Σ σ σ True or False: If σ 0 then X and Y are indeendent True or False: If X and Y are indeendent then σ 0 True or False: If σ σ σ then X and Y are deterministicall related True or False: If X and Y are deterministicall related then σ σ σ How could ou rove or disrove these? Coright 001, Andrew W. Moore Probabilit Densities: Slide 39 General Covariance Let X X 1,X, X k be a vector of k continuous random variables T Cov [ X] E [ X µ X µ ] Σ Σ ij Cov[ X, X ] σ i j i j S is a k k smmetric non-negative definite matri If all distributions are linearl indeendent it is ositive definite If the distributions are linearl deendent it has determinant zero Coright 001, Andrew W. Moore Probabilit Densities: Slide 40

Question : When if Test our understanding ever doesvar [ X + Y] Var[ X ] + Var[ Y]? All the time? Onl when X and Y are indeendent? It can fail even if X and Y are indeendent? Coright 001, Andrew W. Moore Probabilit Densities: Slide 41 Marginal Distributions, d Coright 001, Andrew W. Moore Probabilit Densities: Slide 4

mg weight 4600 Conditional Distributions mg weight 300 mg weight 000.d.f. of X when Y Coright 001, Andrew W. Moore Probabilit Densities: Slide 43 mg weight 4600 Conditional Distributions, Wh?.d.f. of X when Y Coright 001, Andrew W. Moore Probabilit Densities: Slide 44

Coright 001, Andrew W. Moore Probabilit Densities: Slide 45 Indeendence Revisited It s eas to rove that these statements are equivalent, :, iff Y X :, :,, :, Coright 001, Andrew W. Moore Probabilit Densities: Slide 46 More useful stuff Baes Rule These can all be roved from definitions on revious slides 1 d,, z z z

Coright 001, Andrew W. Moore Probabilit Densities: Slide 47 Miing discrete and continuous variables h v A h X h P v A + <, lim 0 h 1, 1 n A v d v A Baes Rule Baes Rule A P A P A A P A A P Coright 001, Andrew W. Moore Probabilit Densities: Slide 48 Miing discrete and continuous variables PEduYears,Wealth

Miing discrete and continuous variables PEduYears,Wealth PWealth EduYears Coright 001, Andrew W. Moore Probabilit Densities: Slide 49 Miing discrete and continuous variables PEduYears,Wealth PEduYears Wealth PWealth EduYears Renormalized Aes Coright 001, Andrew W. Moore Probabilit Densities: Slide 50

What ou should know You should be able to la with discrete, continuous and mied joint distributions You should be ha with the difference between and PA You should be intimate with eectations of continuous and discrete random variables You should smile when ou meet a covariance matri Indeendence and its consequences should be second nature Coright 001, Andrew W. Moore Probabilit Densities: Slide 51 Discussion Are PDFs the onl sensible wa to handle analsis of real-valued variables? Wh is covariance an imortant concet? Suose X and Y are indeendent real-valued random variables distributed between 0 and 1: What is [minx,y]? What is E[minX,Y]? Prove that E[X] is the value u that minimizes E[X-u ] What is the value u that minimizes E[ X-u ]? Coright 001, Andrew W. Moore Probabilit Densities: Slide 5