SDMML HT 2016 - MSc Problem Sheet 4

1. The receiver operating characteristic (ROC) curve plots the sensitivity against the specificity of a binary classifier as the threshold for discrimination is varied. Let the data space be R, and denote the class-conditional densities with g_0(x) and g_1(x) for x ∈ R, for the two classes 0 and 1. Consider a classifier that classifies x as class 1 if x > c, where the threshold c varies from −∞ to +∞.
(a) Give expressions for the population versions of the specificity and sensitivity of this classifier.
(b) Show that the AUC (area under the ROC curve) corresponds to the probability that X_1 > X_0, where data items X_1 and X_0 are independent and come from classes 1 and 0 respectively.

2. 1-NN risk in binary classification. Let {(X_i, Y_i)}_{i=1}^n be a training dataset, where X_i ∈ R^p and Y_i ∈ {0, 1}. We denote by g_k(x) the conditional density of X given Y = k, assume that g_k(x) > 0 for all x ∈ R^p, and write the class probabilities as π_k = P(Y = k). We further denote q_1(x) = P(Y = 1 | X = x).
(a) Consider the Bayes classifier minimizing the risk w.r.t. the 0/1 loss 1{f(x) ≠ Y}: f_Bayes(x) = argmax_{k ∈ {0,1}} π_k g_k(x). Write the conditional expected loss P[f(X) ≠ Y | X = x] at a given test point X = x in terms of q_1(x). [The resulting expression should depend only on q_1(x).]
(b) The 1-nearest neighbour (1-NN) classifier assigns to a test data point x the label of the closest training point, i.e. f_1NN(x) = y_(1)(x), the class of the nearest neighbour in the training set. Given some test point X = x and its nearest neighbour X_(1) = x_(1), what is the conditional expected loss P[f_1NN(X) ≠ Y | X = x, X_(1) = x_(1)] of the 1-NN classifier in terms of q_1(x) and q_1(x_(1))?
(c) As the number of training examples goes to infinity, i.e. n → ∞, assume that the training data fills the space such that q_1(x_(1)) → q_1(x) for all x. Give the limit as n → ∞ of P[f_1NN(X) ≠ Y | X = x]. If we denote R_Bayes = P[Y ≠ f_Bayes(X)] and R_1NN = P[Y ≠ f_1NN(X)], show that for sufficiently large n
R_Bayes ≤ R_1NN ≤ 2 R_Bayes (1 − R_Bayes).

3. Recall the definition of a one-hidden-layer neural network for binary classification from the lectures.
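The identity in question 1(b) can be checked numerically before proving it. A minimal sketch in Python (the course practicals use R; this is a stand-alone illustration with made-up Gaussian class-conditionals g_0 = N(0,1) and g_1 = N(1,1)): the trapezoidal area under the empirical ROC of the rule "class 1 iff x > c" coincides with the fraction of independent pairs with X_1 > X_0.

```python
import random

random.seed(0)
n0, n1 = 300, 300
# Made-up class-conditionals: g0 = N(0,1) for class 0, g1 = N(1,1) for class 1
neg = [random.gauss(0.0, 1.0) for _ in range(n0)]
pos = [random.gauss(1.0, 1.0) for _ in range(n1)]

def empirical_auc(neg, pos):
    """Trapezoidal area under the empirical ROC of 'classify as 1 iff x > c'."""
    pts = [(1.0, 1.0)]  # c = -inf: false positive rate 1, true positive rate 1
    for c in sorted(set(neg) | set(pos)):
        fpr = sum(x > c for x in neg) / len(neg)  # 1 - specificity
        tpr = sum(x > c for x in pos) / len(pos)  # sensitivity
        pts.append((fpr, tpr))
    # integrate the staircase as FPR decreases from 1 down to 0
    return sum((x1 - x2) * (y1 + y2) / 2.0
               for (x1, y1), (x2, y2) in zip(pts, pts[1:]))

auc = empirical_auc(neg, pos)
# Empirical P(X1 > X0) over all n0 * n1 independent pairs
p_pairs = sum(a > b for a in pos for b in neg) / (n0 * n1)
print(abs(auc - p_pairs))  # agrees up to floating-point rounding
```

For continuous samples (no ties) this agreement is exact: each step past a class-0 value contributes (1/n0) times the fraction of class-1 values above it, which is the Mann-Whitney statistic.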
The objective function is the L2-regularized log loss:
J = − Σ_{i=1}^n [ y_i log ŷ_i + (1 − y_i) log(1 − ŷ_i) ] + λ ( Σ_{j,l} (w^h_{jl})² + Σ_l (w^o_l)² )
and the network definition is:
ŷ_i = s( b^o + Σ_{l=1}^m w^o_l h_{il} ),   h_{il} = s( b^h_l + Σ_{j=1}^p w^h_{jl} x_{ij} ),
with transfer function s(a) = 1/(1 + e^{−a}).
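Before deriving the gradients in part (a) by hand, it can help to see the model concretely. A minimal pure-Python sketch (not the course's R; the sizes, data, and parameter values are made up), assuming the penalty is λ times the sum of squared weights so that each penalty term contributes a factor 2λ to its derivative; the analytic derivative w.r.t. one output weight is checked against a central finite difference:

```python
import math

def s(a):
    """Logistic transfer function s(a) = 1 / (1 + e^(-a))."""
    return 1.0 / (1.0 + math.exp(-a))

def forward(x, Wh, bh, wo, bo):
    """Hidden units h_l = s(bh_l + sum_j Wh[l][j] x_j); output yhat = s(bo + sum_l wo_l h_l)."""
    h = [s(bh[l] + sum(Wh[l][j] * x[j] for j in range(len(x))))
         for l in range(len(Wh))]
    yhat = s(bo + sum(w * hl for w, hl in zip(wo, h)))
    return h, yhat

def J(data, Wh, bh, wo, bo, lam):
    """Log loss plus lam * (sum of squared weights); biases unpenalized."""
    nll = 0.0
    for x, y in data:
        _, yhat = forward(x, Wh, bh, wo, bo)
        nll -= y * math.log(yhat) + (1 - y) * math.log(1 - yhat)
    return nll + lam * (sum(w * w for row in Wh for w in row)
                        + sum(w * w for w in wo))

# Made-up tiny problem: p = 2 inputs, m = 2 hidden units, two data points.
data = [([0.5, -1.0], 1), ([-0.2, 0.3], 0)]
Wh, bh = [[0.1, -0.4], [0.3, 0.2]], [0.0, 0.1]
wo, bo, lam = [0.7, -0.5], 0.05, 0.01

# Analytic derivative w.r.t. the first output weight:
#   dJ/dwo_0 = 2*lam*wo_0 + sum_i (yhat_i - y_i) * h_i0
grad0 = 2 * lam * wo[0]
for x, y in data:
    h, yhat = forward(x, Wh, bh, wo, bo)
    grad0 += (yhat - y) * h[0]

# Central finite-difference check of the same derivative
eps = 1e-6
Jp = J(data, Wh, bh, [wo[0] + eps, wo[1]], bo, lam)
Jm = J(data, Wh, bh, [wo[0] - eps, wo[1]], bo, lam)
num = (Jp - Jm) / (2 * eps)
print(abs(grad0 - num))  # small (order 1e-10)
```

The same finite-difference check works for the hidden-layer weights, where the extra chain-rule factor h_{il}(1 − h_{il}) comes from differentiating the logistic s.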
(a) Verify that the derivatives needed for gradient descent are:
∂J/∂w^o_l = 2λ w^o_l + Σ_{i=1}^n (ŷ_i − y_i) h_{il},
∂J/∂w^h_{jl} = 2λ w^h_{jl} + Σ_{i=1}^n (ŷ_i − y_i) w^o_l h_{il} (1 − h_{il}) x_{ij}.
(b) Suppose instead that you have a neural network for binary classification with L hidden layers, each hidden layer having m neurons with logistic transfer function. Give the parameterization for each layer, and derive the backpropagation algorithm to compute the derivatives of the objective with respect to the parameters. For simplicity, you can ignore the bias terms.

4. In this question you will investigate fitting neural networks using the nnet library in R. We will train a neural network to classify handwritten digits 0-9. Download the files usps trainx.data, usps trainy.data, usps testx.data, usps testy.data from http://www.stats.ox.ac.uk/~sejdinovic/sdmml/data/. Each handwritten digit is 16 × 16 in size, so that data vectors are p = 256 dimensional and each entry (pixel) takes integer values 0-255. There are 1000 digits (100 digits of each class) in each of the training set and the test set. You can view the digits with
image(matrix(as.matrix(trainx[500,]),16,16),col=grey(seq(0,1,length=256)))
trainy[500,]
Download the R script nnetusps.r from the course webpage. The script trains a 1-hidden-layer neural network with S = 10 hidden units for T = 10 iterations, reports the training and test errors, runs it for another 10 iterations, and reports the new training and test errors. To make the computations quicker, the script down-samples the training set to 100 cases, by using only one out of every 10 training cases. You will find the documentation for the nnet library useful: http://cran.r-project.org/web/packages/nnet/nnet.pdf.
(a) Edit the script to report the training and test error after every iteration of training the network. Use networks of size S = 10 and up to T = 100 iterations. Plot the training and test errors as functions of the number of iterations. Discuss the results and the figure.
(b) Edit the script to vary the size of the network, reporting the training and test errors for network sizes S = 1, 2, 3, 4, 5, 10, 20, 40. Use T = 5 iterations. Plot these as a function of the network size.
Discuss the results and the figure.

5. Consider a binary classification problem with Y = {1, 2}. We are at a node t in a decision tree and would like to split it based on the Gini impurity. Consider a categorical attribute A with L levels, i.e., x_A ∈ {a_1, a_2, ..., a_L}. For a generic example (X, Y) reaching node t, denote:
p_k = P(Y = k), k = 1, 2,
q_l = P(X_A = a_l), l = 1, ..., L,
p_{k|l} = P(Y = k | X_A = a_l), k = 1, 2, and l = 1, ..., L.
Thus, the population Gini impurity is given by 2 p_1 p_2. Further, assume N = n examples
{(X_i, Y_i)}_{i=1}^n have reached the node t, and denote
N_k = #{i : Y_i = k}, k = 1, 2,
N_l = #{i : X_{i,A} = a_l}, l = 1, ..., L,
N_{k|l} = #{i : Y_i = k and X_{i,A} = a_l}, k = 1, 2, and l = 1, ..., L.
(a) Assuming the data vectors reaching node t are independent, explain why (N_l)_{l=1}^L given N = n, N_k given N = n, and N_{k|l} given N_l = n_l have respectively multinomial, binomial and binomial distributions, with parameters q_l, p_k and p_{k|l}.
(b) If we split using attribute A (and are not using dummy variables) we will have an L-way split, and the resulting impurity change will be
ΔGini = 2 p_1 p_2 − Σ_{l=1}^L q_l · 2 p_{1|l} p_{2|l}.
The parameters p_k, q_l and p_{k|l} are unknown, however. The impurity change estimate ΔGini-hat is thus computed using the plug-in estimates p̂_k = N_k/N, q̂_l = N_l/N and p̂_{k|l} = N_{k|l}/N_l respectively. Calculate the expected estimated impurity change E[ΔGini-hat | N = n] between node t and its L child-nodes, conditioned on N = n data vectors reaching node t.
(c) Suppose the attribute levels are actually uninformative about the class label, so that p_{k|l} = p_k. Show that, conditioned on N = n, the expected estimated Gini impurity change is then equal to 2 p_1 p_2 (L − 1)/n.
(d) Is this attribute selection criterion biased in favour of attributes with more levels?

6. Download the wine dataset from https://archive.ics.uci.edu/ml/machine-learning-databases/wine/wine.data and load it using read.table("wine.data",sep=","). A description of the dataset is given at https://archive.ics.uci.edu/ml/datasets/wine.
(a) Make a biplot using the scale=0 option, and then use the xlabs=as.numeric(td$Type) option in biplot to label points by their $Type. The output should look like:
[Figure: biplot of the wine data with scale=0, points labelled by Type, with the projections of the variable axes V2-V14 overlaid; horizontal axis Comp.1, vertical axis Comp.2.]
(b) Now train a classification tree using rpart, and relate the decision rule discovered there to the projections of the original variable axes displayed in the biplot. Give the plots of the tree as well as of the cross-validation results in the rpart object, using plotcp.
(c) Now produce a Random Forest fit, calculating the out-of-bag estimation error, and compare with the tree analysis. You could start like:
library(randomForest)
rf <- randomForest(td[,2:14],td[,1],importance=TRUE)
print(rf)
Use tuneRF to find an optimal value of mtry, the number of attribute candidates at each split. Use varImpPlot to determine what are the most important variables.

(Optional) 7. A mixture of experts is an ensemble model in which a number of experts compete to predict a label. Consider a regression problem with dataset {(x_i, y_i)}_{i=1}^n and y_i ∈ R. We have E experts, each associated with a parametrized regression function f_j(x; θ_j), for j = 1, ..., E (for example, each expert could be a neural network).
(a) A simple mixture of experts model uses as objective function
J(π, σ², (θ_j)_{j=1}^E) = Σ_{i=1}^n log Σ_{j=1}^E π_j exp( −(f_j(x_i; θ_j) − y_i)² / (2σ²) ),
where π = (π_1, ..., π_E) are the mixing proportions and σ² is a parameter. Relate the objective function to the log-likelihood of a mixture model where each component is a conditional distribution of Y given X = x.
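The relation asked for in 7(a) can also be seen numerically. A short Python sketch (not part of the sheet; the two fixed "experts", mixing proportions, noise variance, and data below are invented for illustration): with Gaussian components N(y | f_j(x), σ²), the mixture log-likelihood differs from the objective J only by the additive constant −(n/2) log(2πσ²), which does not depend on π, σ-free parameters θ_j, so maximizing one maximizes the other for fixed σ².

```python
import math

# Made-up setup: E = 2 fixed experts on scalar x, and three (x, y) pairs.
experts = [lambda x: 2.0 * x, lambda x: -x + 1.0]   # f_j(x; theta_j)
pi = [0.3, 0.7]                                     # mixing proportions
sigma2 = 0.5                                        # sigma^2
data = [(0.0, 0.8), (1.0, 1.6), (-1.0, 2.3)]

def J(data, experts, pi, sigma2):
    """Objective: sum_i log sum_j pi_j exp(-(f_j(x_i) - y_i)^2 / (2 sigma^2))."""
    return sum(math.log(sum(p * math.exp(-(f(x) - y) ** 2 / (2 * sigma2))
                            for p, f in zip(pi, experts)))
               for x, y in data)

def loglik(data, experts, pi, sigma2):
    """Log-likelihood of the mixture with components N(y | f_j(x), sigma^2)."""
    norm = 1.0 / math.sqrt(2 * math.pi * sigma2)
    return sum(math.log(sum(p * norm * math.exp(-(f(x) - y) ** 2 / (2 * sigma2))
                            for p, f in zip(pi, experts)))
               for x, y in data)

n = len(data)
gap = J(data, experts, pi, sigma2) - loglik(data, experts, pi, sigma2)
print(gap, (n / 2) * math.log(2 * math.pi * sigma2))  # equal
```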
(b) Differentiate the objective function with respect to θ_j. Introduce a latent variable z_i, indicating which expert is responsible for predicting y_i, and interpret ∂J/∂θ_j in the context of the corresponding EM algorithm. In this context, one needs to use the generalized EM algorithm, where in the M-step gradient descent is used to update the expert parameters θ_j.
(c) A mixture of experts allows each expert to specialize in predicting the response in a certain part of the data space, with the overall model having better predictions than any one of the experts. However, to encourage this specialization, it is useful for the mixing proportions also to depend on the data vectors x, i.e. to model π_j(x; φ) as a function of x with parameters φ. The idea is that this gating network controls where each expert specializes. To ensure Σ_{j=1}^E π_j(x; φ) = 1, we can use the softmax nonlinearity:
π_j(x; φ) = exp(h_j(x; φ_j)) / Σ_{l=1}^E exp(h_l(x; φ_l)),
where the h_j(x; φ_j) are parameterized functions for the gating network. The previous generalized EM algorithm extends to this scenario easily. Describe what changes have to be made, and derive a gradient descent learning update for φ_j.
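For part (c), the key M-step computation is the gradient of the expected complete-data term Σ_i Σ_j r_ij log π_j(x_i) with respect to the gating parameters. A sketch under the simplifying assumption of linear gating functions h_j(x; φ_j) = φ_jᵀx (an illustration only; the sheet leaves the h_j general, and the responsibilities r_ij below are fixed made-up numbers standing in for the E-step output): for the softmax, the gradient w.r.t. φ_j is Σ_i (r_ij − π_j(x_i)) x_i, which the code verifies by central finite differences.

```python
import math

def softmax(hs):
    """Numerically stable softmax over a list of gating scores."""
    m = max(hs)
    exps = [math.exp(h - m) for h in hs]
    z = sum(exps)
    return [e / z for e in exps]

def Q(phi, xs, R):
    """Expected complete-data term sum_i sum_j r_ij log pi_j(x_i), linear gating h_j = phi_j . x."""
    total = 0.0
    for x, r in zip(xs, R):
        pis = softmax([sum(pj * xk for pj, xk in zip(p, x)) for p in phi])
        total += sum(rj * math.log(pj) for rj, pj in zip(r, pis))
    return total

# Made-up example: E = 2 experts, p = 2 inputs, three points with fixed responsibilities.
phi = [[0.2, -0.1], [-0.3, 0.4]]            # gating parameters phi_j
xs = [[1.0, 0.5], [-0.5, 1.0], [0.3, -0.7]]
R = [[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]]    # responsibilities r_ij, summing to 1 per i

# Analytic gradient w.r.t. phi_1 (index 0): sum_i (r_i1 - pi_1(x_i)) x_i
grad = [0.0, 0.0]
for x, r in zip(xs, R):
    pis = softmax([sum(pj * xk for pj, xk in zip(p, x)) for p in phi])
    for k in range(2):
        grad[k] += (r[0] - pis[0]) * x[k]

# Central finite-difference check on the first coordinate of phi_1
eps = 1e-6
phi_p = [[phi[0][0] + eps, phi[0][1]], phi[1]]
phi_m = [[phi[0][0] - eps, phi[0][1]], phi[1]]
num = (Q(phi_p, xs, R) - Q(phi_m, xs, R)) / (2 * eps)
print(abs(grad[0] - num))  # small
```

The (r_ij − π_j(x_i)) factor is the usual softmax score: the gating network is pushed to raise π_j(x_i) wherever expert j takes more responsibility than it is currently assigned.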