Richard Socher, Henning Peters
Elements of Statistical Learning I


1 Problem (10P)

Show that if X is a random variable, then

    E[X] = arg min_b E[(X - b)^2]

Thus a good prediction for X is E[X] if the squared difference is used as the metric. The following rules are used in the proof:

1. E[X + Y] = E[X] + E[Y]
2. E[cX] = c E[X]
3. E[X] and E[X^2] are constants
4. the expected value of a constant is the constant itself

Proof.

    arg min_b E[(X - b)^2] = arg min_b E[X^2 - 2bX + b^2]
                           = arg min_b ( E[X^2] - E[2bX] + E[b^2] )
                           = arg min_b ( E[X^2] - 2b E[X] + b^2 )

In order to find the b for which the expression inside the arg min is minimal, we can view it as a function of b:

    f(b) = b^2 - 2b E[X] + E[X^2]

This function is obviously a polynomial of second degree in b with a positive leading coefficient, hence its only extremum is a global minimum. The derivative of this function is:

    f'(b) = 2b - 2 E[X]

To find the minimum, we set it to 0:

    0 = 2b - 2 E[X]

which gives:

    b = E[X]

Thus we have shown that the function is minimal at b = E[X], i.e. E[X] = arg min_b E[(X - b)^2]. Intuitively this also seems correct: if you predict the outcome of a random variable X with E[X], the squared error should be minimal. A quick numerical sanity check follows below.
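The following R sketch (our addition, not part of the original hand-in; the variable names are ours) draws samples from an arbitrary skewed distribution and evaluates the mean squared error over a grid of candidate predictions b. The minimizer lands at the sample mean, as the proof predicts.

    set.seed(42)
    x <- rexp(1e5, rate = 0.5)        # arbitrary skewed distribution with E[X] = 2
    b.grid <- seq(0, 4, by = 0.01)    # candidate predictions b
    mse <- sapply(b.grid, function(b) mean((x - b)^2))
    b.grid[which.min(mse)]            # close to 2 = E[X]
    mean(x)                           # the sample mean, i.e. the predicted minimizer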

2 Problem (20P)

Consider N data points uniformly distributed in a p-dimensional unit ball centered at the origin. Suppose we consider a nearest-neighbor estimate at the origin. Show that the median distance from the origin to the closest data point is given by the expression

    d(p, N) = (1 - (1/2)^(1/N))^(1/p)

(Exercise 2.3 in Hastie, Tibshirani, Friedman). What does this mean for the k-nearest-neighbor algorithm?

Hint: Consider that the volume of a p-dimensional sphere with radius r is given by V(r, p) = G(p) r^p, with G(p) a dimension-dependent constant. The probability that a point falls into a sphere of radius r is proportional to the sphere's volume, since the points are uniformly distributed.

The proof is separated into two parts. In the first part the equation is proven for one dimension; this mainly helped us to grasp the concept. The second part then only uses the hint and applies the same idea to p dimensions.

Proof. To calculate the median distance from the origin to the closest data point, N random variable samples (rvs) are needed. Based on these, the median of the point nearest to the origin can be defined. The rvs are independent of each other and uniformly distributed:

    X_1, ..., X_N ~ Unif([0, 1])

For each of them the cumulative distribution function (cdf) shows that the probability for a point to be smaller than 0 is 0; the probability is then equal for all points in [0, 1], so the cdf increases linearly up to 1 and stays at 1. Formally:

    F_X(x) = P(X ≤ x) = { 0  for x < 0
                          x  for 0 ≤ x ≤ 1
                          1  for x > 1 }

We need this cdf because we want the probability that a point lies in a certain interval [0, r] with r ≤ 1. Also P(X > x) = 1 - x, because the cdf is linear. Next, the random variable R = min_i {X_i} is defined. It denotes the smallest value among all samples X_i. Using F_X we can simplify the cdf of R:

    F_R(r) = P(R ≤ r) = P(min_i {X_i} ≤ r)
           = 1 - P(min_i {X_i} > r)
           = 1 - P(X_1 > r, ..., X_N > r)
           = 1 - prod_{i=1}^{N} P(X_i > r)      (independence)
           = 1 - prod_{i=1}^{N} (1 - r)
           = 1 - (1 - r)^N

The cdf of R gives the probability that the shortest distance from the origin over all rvs X_i is less than or equal to its argument r. Now we can use the fact that the median is the 0.5 quantile of the cdf, so F_R(r_0.5) = 0.5. Hence:

    1 - (1 - r_0.5)^N = 0.5
        1 - r_0.5     = 0.5^(1/N)
            r_0.5     = 1 - 0.5^(1/N) = d(1, N)

Thus, the expression is proven for the 1-dimensional case.
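A brief simulation (our addition) confirms the 1-d result: the empirical median of min_i X_i over many repetitions matches 1 - 0.5^(1/N).

    set.seed(1)
    N <- 10
    closest <- replicate(1e5, min(runif(N)))   # nearest of N uniform points, 1-d case
    median(closest)                            # empirical median
    1 - 0.5^(1 / N)                            # d(1, N), approximately 0.067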

For the p-dimensional case we use the hint: the probability that a point falls into a sphere of radius r is proportional to the sphere's volume, since the points are uniformly distributed. Since we are talking about the unit ball, its radius is 1 and both balls are centered at the origin. Thus, with S_p(r) being the p-dimensional sphere with radius r, the cdf is:

    P(X ∈ S_p(r)) = G(p) r^p / (G(p) 1^p) = r^p

If an element is inside a ball, its distance from the origin is smaller than the radius of this ball. With the Euclidean norm ||x|| of any p-dimensional vector x, it is true that:

    P(X ∈ S_p(r)) = P(||X|| ≤ r) = r^p

Similarly to the 1-dimensional case we now define another random variable on top of that:

    R = min_i {||X_i||}

We apply the same idea as in 1-d:

    F_R(r) = P(R ≤ r) = P(min_i {||X_i||} ≤ r)
           = 1 - P(min_i {||X_i||} > r)
           = 1 - prod_{i=1}^{N} (1 - P(||X_i|| ≤ r))
           = 1 - (1 - r^p)^N

Again we set this cdf to 0.5, because we are looking for the median:

    1 - (1 - r_0.5^p)^N = 0.5
    r_0.5 = (1 - 0.5^(1/N))^(1/p) = d(p, N)

This means that as p → ∞, all points are extremely close to the border of the sphere and far away from the origin. Defining nearest-neighbor areas for a specific class becomes harder, since one must extrapolate outside of the domain of the given neighboring points, rather than interpolate in the area between them. Furthermore the sampling density decreases, which is why the same number of training samples cannot cover a higher dimension as well as it could a lower one. The short sketch below makes the effect concrete.
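Evaluating d(p, N) in R (our addition) shows the median nearest-neighbor distance creeping toward the boundary as p grows:

    d <- function(p, N) (1 - 0.5^(1 / N))^(1 / p)
    d(1, 500)     # ~0.0014: the nearest of 500 points is essentially at the origin
    d(10, 500)    # ~0.52: in 10 dimensions it is more than halfway to the boundary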

3 Problem - R (20P)

3.1 Q&A

Males in the region of Western Cape, South Africa, are at high risk of heart disease. Measurements of systolic blood pressure (sbp), cumulative tobacco consumption (tobacco), low-density lipoprotein cholesterol (ldl), adiposity (adiposity), family history of heart disease (famhist), type-A behavior (typea), obesity (obesity), current alcohol consumption (alcohol) and age at onset (age) are available for a number of male patients, together with corresponding information on whether or not they suffer from coronary heart disease (chd). The goal is to analyze this data in order to understand how the above characteristics influence the presence or absence of heart disease, so as to make possible the prediction of the illness in new, unseen patients. Requirements:

1. Which are the input variables? sbp, tobacco, ldl, adiposity, famhist, typea, obesity, alcohol, age
2. Which is the target variable? chd
3. Which is the (training) data set? The 462 observations of the ten variables.
4. Which are the samples? Each line, denoted by a number between 1 and 462 (with 262 missing).
5. Is this a classification or a regression problem? Classification.
6. Use R to visualize the data. See 3.2 and 3.3 below.

Results from f)

1. How many samples are there? 462
2. For each feature, specify whether it is qualitative or quantitative. With the exception of famhist and chd, which are both qualitative, all other features are quantitative. See the output of str(ahd).
3. How many patients have coronary heart disease and how many don't? 160 have chd, 302 don't.
4. How many subjects of age 20 are there? 6

The commands behind these counts are sketched right below; the full session is in 3.2.
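As a compact recap (our addition; it assumes the ahd data frame has been loaded as in 3.2), the counts above can be reproduced with:

    nrow(ahd)            # 462 samples
    table(ahd$chd)       # 0: 302, 1: 160
    sum(ahd$age == 20)   # 6 subjects of age 20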

3.2 R - I/O

> load("ahd.rdata")
> str(ahd)
'data.frame': 462 obs. of 10 variables:
 $ sbp      : int 160 144 118 170 134 132 142 114 114 132 ...
 $ tobacco  : num 12 0.01 0.08 7.5 13.6 6.2 4.05 4.08 0 0 ...
 $ ldl      : num 5.73 4.41 3.48 6.41 3.5 6.47 3.38 4.59 3.83 5.8 ...
 $ adiposity: num 23.1 28.6 32.3 38.0 27.8 ...
 $ famhist  : Factor w/ 2 levels "Absent","Present": 2 1 2 2 2 2 1 2 2 2 ...
 $ typea    : int 49 55 52 51 60 62 59 62 49 69 ...
 $ obesity  : num 25.3 28.9 29.1 32.0 26.0 ...
 $ alcohol  : num 97.20 2.06 3.81 24.26 57.34 ...
 $ age      : int 52 63 46 58 49 45 38 58 29 53 ...
 $ chd      : int 1 1 0 1 1 0 0 1 0 1 ...
> names(ahd)
 [1] "sbp" "tobacco" "ldl" "adiposity" "famhist" "typea" "obesity" "alcohol" "age" "chd"
> dim(ahd)
[1] 462 10
> class(ahd$alcohol)
[1] "numeric"
> summary(ahd)
      sbp           tobacco            ldl           adiposity        famhist
 Min.   :101.0   Min.   : 0.0000   Min.   : 0.980   Min.   : 6.74   Absent :270
 1st Qu.:124.0   1st Qu.: 0.0525   1st Qu.: 3.283   1st Qu.:19.77   Present:192
 Median :134.0   Median : 2.0000   Median : 4.340   Median :26.11
 Mean   :138.3   Mean   : 3.6356   Mean   : 4.740   Mean   :25.41
 3rd Qu.:148.0   3rd Qu.: 5.5000   3rd Qu.: 5.790   3rd Qu.:31.23
 Max.   :218.0   Max.   :31.2000   Max.   :15.330   Max.   :42.49
     typea          obesity         alcohol            age             chd
 Min.   :13.0   Min.   :14.70   Min.   :  0.00   Min.   :15.00   Min.   :0.0000
 1st Qu.:47.0   1st Qu.:22.98   1st Qu.:  0.51   1st Qu.:31.00   1st Qu.:0.0000
 Median :53.0   Median :25.81   Median :  7.51   Median :45.00   Median :0.0000
 Mean   :53.1   Mean   :26.04   Mean   : 17.04   Mean   :42.82   Mean   :0.3463
 3rd Qu.:60.0   3rd Qu.:28.50   3rd Qu.: 23.89   3rd Qu.:55.00   3rd Qu.:1.0000
 Max.   :78.0   Max.   :46.58   Max.   :147.19   Max.   :64.00   Max.   :1.0000
> table(ahd$chd)
  0   1
302 160
> table(ahd$age)
15 16 17 18 19 20 21 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38
 3 20 17  8  2  6  3  2  6  4  5 11  7  7  7  9 11  9  6  1  3  6 13
39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64
10 12 10 13  8  8 14 11  3 13 14  8  8 10 14 10 16  9  8 17 16 15 16 12  8 13

3.3 Visualization and Results

Figure 1 shows that obesity and adiposity are somehow correlated. However, no real conclusions about the resulting chd can be drawn from it. Figure 2 shows that age and chd are not correlated. Figure 3 is a little more complex; the command that creates it is given after Figure 2 below.

Figure 1: pairs(ahd) - scatterplot of all pairs
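A note on the grouped scatterplots that follow (our addition): scatterplot() and scatterplot.matrix() are not base R functions; as far as we can tell they come from the car package, which must be loaded first:

    library(car)   # assumed source of scatterplot() and scatterplot.matrix()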

Figure 2: plot(ahd$age, ahd$chd) - scatterplot of age vs. chd

scatterplot(obesity ~ adiposity | famhist, reg.line=lm, smooth=FALSE, labels=FALSE,
            boxplots='xy', span=0.5, by.groups=TRUE, data=ahd)
# to print it directly to a file:
dev.print(png, filename="adiposity-obesity.png", width=500, height=500)

It shows that a family history of chd does not influence the relation between obesity and adiposity. Furthermore it demonstrates that the higher the adiposity, the higher the obesity. The box-and-whisker plots next to the axes show that the medians of both adiposity and obesity are slightly above 25.

Figure 3: scatterplot(obesity ~ adiposity | famhist, ...) - scatterplot of obesity vs. adiposity, with two groups that indicate famhist

Figure 4 is the most interesting. It is created by

scatterplot.matrix(~chd+ldl+obesity+sbp+tobacco+typea, reg.line=lm, smooth=FALSE,
                   span=0.5, diagonal='density', data=ahd)

It shows the density of each variable on its diagonal. We can extract some information:

- There are more samples without chd.
- obesity, typea and ldl seem to have a Gaussian distribution.
- The higher tobacco use and ldl are, the more likely the patient is to have chd.
- The same holds for the other features, but more weakly.

Figure 4: scatterplot.matrix(~chd+ldl+obesity+sbp+tobacco+typea, ...) - scatterplot matrix