Some technical details on confidence intervals for LIFT measures in data mining


Wenxin Jiang and Yu Zhao

June 2, 2014

Abstract

A LIFT measure, such as the response rate, lift, or the percentage of captured response, is a fundamental measure of effectiveness for a scoring rule obtained from data mining, which is estimated from a set of validation data. The LIFT measures are related to the ROC (Receiver Operator Characteristic), but there exist some important differences. In this paper, we study how to construct confidence intervals of the LIFT measures. We point out the difficulty of this task and explain how simple binomial confidence intervals can have incorrect coverage probabilities, due to omitting variation from the sample percentile of the scoring rule. We derive the asymptotic distribution using some advanced empirical process theory and the functional delta method in Appendix B. The additional variation is shown to be related to a conditional mean response, which can be estimated by a local averaging of the responses over the scores from the validation data. Alternatively, a subsampling method is shown to provide a valid confidence interval, without needing to estimate the conditional mean response. A simple nonparametric bootstrap confidence interval can also be used. Numerical experiments are conducted to compare these different methods regarding the coverage probabilities and the lengths of the resulting confidence intervals.

Keywords: bootstrap, confidence interval, empirical process, functional delta method, LIFT, local average, %response, ROC (Receiver Operator Characteristic), subsampling, validation data

[Footnote: Technical Report 14-02, Department of Statistics, Northwestern University. Wenxin Jiang is Professor in the Department of Statistics, Northwestern University, Evanston, IL (wjiang@northwestern.edu); and Yu Zhao is Statistician at Amazon (yuzhaonwu@gmail.com).]

1 Introduction

In data mining, predictive models can be used to detect the likely responders to marketing campaigns. The capability of a predictive model to capture the responders can be evaluated by a number of measures on the validation data. For example, SAS Enterprise Miner, a popular data mining software package, can present various kinds of LIFT charts. "LIFT" is a general name that includes lift, %response, and %captured.response, among others, for describing the effectiveness of a predictive model in identifying responders, and also for comparing different predictive models (SAS Institute, 2003). For example, a predictive model (say, logistic regression or a neural network) is used to rank the subjects in the validation data according to a score, which can be related to an estimated probability of responding to a marketing campaign. A measure such as the %response is then computed in SAS from a fixed percentage, say $100\delta\%$, of the validation data ($\delta \in (0, 1)$), where the predictive model scores above a corresponding cut-off point $c$ estimated from the validation data. Figure 1 presents such a performance measure, lift (for a logistic regression model), plotted against the 10 decile values $\delta \in \{0.1, 0.2, 0.3, \ldots, 1.0\}$, on a validation data set explained in more detail in Section 6.

Despite their decade-long popularity in data mining practice, the LIFT measures have not been studied very thoroughly from a statistical perspective. For example, a recent literature review on "confidence interval of lift" provided little relevant work, and SAS Enterprise Miner still does not include the option of confidence intervals for the LIFT measures in a typical output such as Figure 1. The current paper attempts to fill this void and enable users to add valid confidence intervals to Figure 1. We will illustrate this with a real data application in Section 6.

The problem of finding a confidence interval, for example for %response, may look deceptively simple, since the parameter can be estimated by a sample proportion, which would suggest a binomial confidence interval. However, we will show that the proper solution is much more complicated. Binomial confidence intervals do not account for all the inherent variation, since the sample proportion is not computed over statistically independent subjects: these subjects all have model scores above a common cut-off point estimated from the entire validation sample. We need to deal with the extra variation of the cut-off point $c$ of the top $100\delta\%$ model scores, as estimated from the validation sample. This turns out to be a mathematically challenging task, since a parameter such as %response will be shown to be a discontinuous function of the estimated cutoff point. We will use some advanced empirical process theory and the functional delta method in Appendix B to derive the asymptotic distribution properly.

The phenomenon of improper coverage of the binomial confidence intervals was first reported in Rosset et al. (2001), which is the only reference directly related to our paper. Their paper studied the behavior of binomial confidence intervals for a LIFT measure (%captured.response), and found empirically that they are overly conservative and too loose compared to the bootstrap confidence intervals. Since the emphasis of their paper is different (on the effect of sampling balance on the measures), their empirical finding was not very noticeable and was placed in a very short subsubsection (Rosset et al. 2001, Section 3.2.3 and Table 2). Our paper will provide a theoretical explanation of the empirical phenomenon reported in Rosset et al. (2001). We focus on how to derive a correct asymptotic theory to account for the additional sources of variation in the confidence intervals.

Although little previous work was done on the asymptotic distributions and the confidence intervals of the LIFT measures, there have been extensive studies on a related performance measure, the ROC (Receiver Operator Characteristic), which is commonly used in medical diagnosis (see, e.g., Ma and Hall 1993; Hsieh and Turnbull 1996; Hall et al. 2005ab; Horváth et al. 2008; Su et al. 2009). Our theoretical work is most closely related to the theoretical works of Hsieh and Turnbull (1996) and Hall et al. (2004) on ROC. The relations and differences, as well as some other related works, are described in more detail in a separate subsection (Section 2.3) later.

The following is an outline of this paper. In Section 2, we first define various LIFT measures used in SAS and point out how these LIFT measures are related to each other, and how they are estimated in data mining practice using a validation data set. We then discuss the relation and difference between the LIFT measures and ROC measures and discuss some related works in Section 2.3. Then in Section 3, we describe the asymptotic distributions of the validation sample estimates of the LIFT measures. The asymptotic variances are shown to be related to a mean response conditional on the score at the cutoff point.
At this point, two different approaches are considered to derive confidence intervals (in Section 4): one uses a local averaging method to estimate the conditional mean parameter, the other uses a subsampling method to bypass the step of local averaging. Simulations are conducted in Section 5 to compare the coverage probabilities and the lengths of the confidence intervals obtained from the binomial method, the local averaging method, and the subsampling method. A real data application is described in Section 6. Appendix A describes how to extend our theory to compute confidence intervals that are simultaneously valid for all the deciles. Technical proofs and more general distributional results based on empirical processes are included in Appendix B. The current paper focuses on large sample asymptotics, which is certainly very appropriate in the current era of big data. For smaller sample sizes, some adjustment of the variance estimation can lead to further improvements. These details are included in Section 8, which also includes a nonparametric bootstrap method in the numerical comparisons.

2 Lift measures

2.1 Notation and definitions

Before proceeding, we introduce some mathematical notation. Let $Y \in \{0, 1\}$ be the random response variable, which is 1 if a subject responds to the marketing campaign and 0 otherwise. Let $X$ be a random input vector, which can include demographic variables or household status variables that can be used to predict the response. Let $S = S(X)$ be a scalar function of $X$ used to score the subjects. In the ideal case this score $S$ would be the same as the response probability $P(Y = 1 \mid X)$, but in practice an estimated version (from a training data set based on logistic regression, neural networks, or a decision tree, for example) of the response probability (or a monotone transformation of it) is often used instead. In this paper we will not investigate how $S$ is obtained, but only consider how it performs once the scoring rule $S(X)$ is given. Therefore, it is sometimes simpler to treat $S$ directly as a random variable of interest and consider the joint distribution of $(Y, S)$ instead of $(Y, X)$.

Consider a marketing campaign that aims to contact the top $100\delta\%$ ($\delta \in (0, 1)$) of the subjects according to the score $S$. Correspondingly, we define an action indicator $A = I(S > c)$ whose expectation is $EA = \delta$. Here an individual is contacted by the marketing campaign if and only if its action value is $A = 1$. The parameter $c$ is a cutoff for the score $S$, which is the $100(1 - \delta)\%$ percentile of $S$. The performance of the scoring method $S$ can be described by the following LIFT measures:

(i) $\pi \equiv$ %response $= P(Y = 1 \mid A = 1) = E(YA)/EA = E(YA)/\delta$, which measures the percentage of responders among all the contacted people;

(ii) $\kappa \equiv$ %captured.response $= P(A = 1 \mid Y = 1) = E(YA)/EY$, which measures the percentage of contacted people among all people who would respond;

(iii) lift $= P(Y = 1 \mid A = 1)/P(Y = 1) = E(YA)/(EY \cdot EA)$, which is the ratio of the %response achieved by action $A$ and the baseline %response $P(Y = 1)$;

(iv) expected.profit $= E\,g(Y, A)$, according to some pay-off function $g(\cdot, \cdot)$.

For a fixed $\delta$, all these measures can be shown to be monotone functions of %response and can be regarded as equivalent. So, for example, for any two actions $A_1$ and $A_2$, lift$(A_1) >$ lift$(A_2)$ if and only if %response$(A_1) >$ %response$(A_2)$, since lift $=$ %response$/EY$. Likewise, %captured.response $= (\delta/EY)\,$%response, and

$$\text{expected.profit} = \delta\,[(g(1,1) + g(0,0) - g(1,0) - g(0,1))\,\pi + g(0,1) - g(0,0)] + [(g(1,0) - g(0,0))\,EY + g(0,0)]$$

are also monotone in %response. These relations immediately imply Proposition 1.
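The relations among the LIFT measures can be checked numerically. The following is a minimal Python sketch, not the authors' code: the function name `lift_measures` and the toy model $P(Y = 1 \mid S = s) = s$ are assumptions made only for illustration, and scores are assumed to have no ties.

```python
import random


def lift_measures(y, s, delta):
    """Empirical LIFT measures when the top 100*delta% of scores are contacted.

    y: list of 0/1 responses, s: list of scores (assumed to have no ties).
    """
    m = len(y)
    k = round(m * delta)                       # number of contacted subjects
    order = sorted(range(m), key=lambda i: s[i], reverse=True)
    e_ya = sum(y[i] for i in order[:k]) / m    # empirical E(YA)
    e_y = sum(y) / m                           # empirical EY
    e_a = k / m                                # empirical EA (equals delta here)
    return {
        "response": e_ya / e_a,                # (i)   E(YA)/EA
        "captured": e_ya / e_y,                # (ii)  E(YA)/EY
        "lift": e_ya / (e_y * e_a),            # (iii) E(YA)/(EY*EA)
        "EY": e_y,
        "EA": e_a,
    }


# Toy data (hypothetical model): S ~ Unif(0, 1) and P(Y = 1 | S = s) = s.
rng = random.Random(0)
m, delta = 2000, 0.2
s = [rng.random() for _ in range(m)]
y = [1 if rng.random() < si else 0 for si in s]
r = lift_measures(y, s, delta)

# lift = %response / EY and %captured.response = (delta / EY) * %response
print(r["lift"], r["response"] / r["EY"])
print(r["captured"], r["EA"] / r["EY"] * r["response"])
```

Each printed pair should agree up to floating-point rounding, which is the sample analogue of the monotone relations stated above.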

Proposition 1 For two actions $A_1$ and $A_2$ with the same percentage of contacted population $EA_1 = EA_2 = \delta > 0$, we have

$$\frac{\%\text{response}(A_1)}{\%\text{response}(A_2)} = \frac{\%\text{captured.response}(A_1)}{\%\text{captured.response}(A_2)} = \frac{\text{lift}(A_1)}{\text{lift}(A_2)},$$

whenever these ratios are finite.

2.2 Estimated LIFT measures from the validation sample

The quantities defined above are population quantities related to an unknown target population. In practice, all these quantities need to be estimated on a sample (validation data). We will replace all $P$ by $\tilde{P}$ and $E$ by $\tilde{E}$, where the tilde denotes the empirical version of the corresponding measure for the validation sample. Therefore, for example, $\tilde{E} g(Y, A) = m^{-1} \sum_{i=1}^{m} g(Y_i, A_i)$ for a validation sample $(Y_i, S_i)_{i=1}^{m}$ of size $m$, whose elements are assumed to be iid (independent and identically distributed) copies of $(Y, S)$. In particular, the cut-off parameter $c$ is estimated by $\tilde{c}$, the $100(1 - \delta)\%$ sample percentile. The corresponding estimated action rule is $\tilde{A} = I[S > \tilde{c}]$ with sample probability $\tilde{E}(\tilde{A}) = \delta$ (see Footnote 1). Then $\pi =$ %response is estimated by

$$\tilde{\pi} = \tilde{E}(Y\tilde{A})/\tilde{E}\tilde{A}, \qquad (1)$$

where $\tilde{E}(\tilde{A}) = \delta$ and $\tilde{E}(Y\tilde{A}) = m^{-1} \sum_{i=1}^{m} Y_i I(S_i > \tilde{c})$.

[Footnote 1: The two sides of the equation $\tilde{E}(\tilde{A}) = \delta$ can actually be slightly different when $m\delta$ is not an integer. However, we will ignore this difference in the notation for simplicity, since the size of the difference is at most $1/m$ and does not change the asymptotics in the leading order $O_p(1/\sqrt{m})$.]

2.3 Relation to the literature on ROC

The LIFT measures are intimately related to the ROC, but there exist some important differences. Using a notation similar to Hsieh and Turnbull (1996) and Hall et al. (2004), the ROC measures the True Positive rate $1 - F_1(s) = P(S > s \mid Y = 1)$ for a given False Positive rate $p = 1 - F_0(s) = P(S > s \mid Y = 0)$. Therefore the ROC function can be expressed as ROC $= 1 - F_1 \circ F_0^{-1}(1 - p)$. This is similar to a LIFT measure, %captured.response $= P(S > c \mid Y = 1)$, which can be expressed as %captured.response $= 1 - F_1 \circ F^{-1}(1 - \delta)$, where $\delta = 1 - F(c) = P(S > c)$ is the percentage of contacted population. In data mining, the latter approach is more natural, since one would rather examine the performance against $\delta$ (the percentage of contacted population, limited by a budget in, say, a marketing campaign) than control the false positive rate $p$.

Our estimates of the LIFT measures in Section 2.2 are based on the empirical distribution from the validation sample, similar to Hsieh and Turnbull (1996) in the ROC literature. Although the kernel-smoothed estimate of Hall et al. (2004) may also be used in principle, we use the simplest empirical distribution estimate here, since this is currently the prevalent practice in data mining software. There are many methods to construct confidence intervals and simultaneous confidence bands for ROC measures; see the literature review and extensive numerical comparison of their performance by Macskassy et al. (2005a,b). For nonparametric estimates of the ROC measures, the asymptotic distribution theory is given by, e.g., Hsieh and Turnbull (1996), and pointwise confidence intervals are provided by, e.g., Hall et al. (2004).

Our situation has some important differences from that of Hsieh and Turnbull (1996) and Hall et al. (2004). Their nonparametric estimates of the two distributions $F_0$ and $F_1$ are assumed to be independent, since they correspond to mutually nonoverlapping populations with $Y = 0$ and $Y = 1$, respectively. This is used in their derivation of a first order approximation as a sum of two independent components. On the other hand, our estimates of $F$ and $F_1$ are not independent, since $F$ is the distribution of the combined population. In addition, $Y$ is generally regarded as random in data mining, and some other LIFT measures, such as the %response, are related not only to the two distributions $F_1$ and $F$ (as is the %captured.response), but also to the mixing proportion $P(Y = 1)$.

In this paper, we focus on asymptotic distributions for estimated LIFT measures with a given contacted proportion $\delta$, and provide methods for constructing pointwise confidence intervals (similar to those of Hall et al. 2004). However, we also outline in Appendix A how to derive confidence intervals that are simultaneously valid for all decile values of $\delta$. In addition, a joint asymptotic distribution for the entire LIFT curve estimate (similar to Hsieh and Turnbull 1996 in the context of ROC) is provided in Appendix B.

3 Asymptotic distribution for the estimated LIFT measures

Below we write $\theta$ for any of the four LIFT measures and $\tilde{\theta}$ for its validation sample estimate. We consider how to construct a pointwise confidence interval for $\theta$ based on $\tilde{\theta}$. We show that for large $m$, $\tilde{\theta}$ converges in distribution to a normal distribution $N(\theta, \mathrm{var}(\tilde{\theta}))$. The computation of the asymptotic variance $\mathrm{var}(\tilde{\theta})$, however, is not as straightforward as it looks. For example, although the estimated %response $\tilde{\pi} = \tilde{E}(Y\tilde{A})/\tilde{E}\tilde{A}$ can be regarded as a sample proportion of individuals with $Y = 1$ out of all individuals being contacted (with $\tilde{A} = 1$), it is not proper to use a binomial distribution to compute its variance, since $\tilde{A}$ depends on an estimated cutoff $\tilde{c}$, which is the $(1 - \delta)$th empirical quantile of the scoring rule $S$ and depends on all $m$ random individuals. The terms being averaged in $\tilde{E}(\cdot)$ are therefore no longer independent. Moreover, the estimator $\tilde{\pi} = m^{-1} \sum_{i=1}^{m} Y_i I(S_i > \tilde{c})/\delta \equiv \tilde{G}(\tilde{c})/\delta$ depends on $\tilde{c}$ in a discontinuous way, so $\tilde{G}'(c)$ does not exist, and we cannot use the usual delta method to derive the asymptotic distribution based on a first order Taylor expansion in $\tilde{c}$. We will use the functional delta method (as described in, e.g., Section 3.9 of van der Vaart and Wellner 1996) to solve this problem. The key observation is that although $\tilde{\pi}$ is not continuous in $\tilde{c}$, it is differentiable (in Hadamard's sense) as a functional of the empirical process $\tilde{G}$ and the empirical quantile process $\tilde{c}$, at the limiting points of these empirical quantities, so a functional delta method yields the asymptotic distribution of $\tilde{\pi}$ from the joint distribution of $\tilde{G}$ and $\tilde{c}$. The derivation of the following results is given in Appendix B.

Proposition 2 Assume that $P(Y = 1) > 0$, and that the conditional probability densities $p(s \mid Y)$ exist and are positive and continuously differentiable in a neighborhood of $S = c$, for $Y = 1$ and $Y = 0$. Then, for any of the four LIFT measures $\theta$, as $m \to \infty$, the sample LIFT measure $\tilde{\theta}$ converges in distribution to a normal distribution $N(\theta, \mathrm{var}(\tilde{\theta}))$, where

$$\mathrm{var}(\tilde{\theta}) = m^{-1}\,\mathrm{var}(H) = m^{-1} E(H - EH)^2, \qquad H = (Y - \Lambda)(aA + b),$$

$A = I(S > c)$, $S$ is a score, $c$ is the cutoff such that $\delta = E\,I(S > c)$, $\delta \in (0, 1)$, and $\Lambda$ denotes the conditional mean $\Lambda = E(Y \mid S = c)$. The parameters $(a, b)$ take different values for the four LIFT measures:

(i) For $\theta =$ %response, use $(a, b) = (\delta^{-1}, 0)$;

(ii) For $\theta =$ %captured.response, use $(a, b) = ((EY)^{-1}, -(EY)^{-1}\kappa)$, where $\kappa =$ %captured.response;

(iii) For $\theta =$ lift, use $(a, b) = ((\delta\,EY)^{-1}, -(\delta\,EY)^{-1}\kappa)$;

(iv) For $\theta =$ expected.profit $= E\,g(Y, A)$, use $(a, b) = (g(1,1) + g(0,0) - g(1,0) - g(0,1),\; g(1,0) - g(0,0))$.

Remark 1 In practice, the deciles used are $\delta \in \{0.1, 0.2, 0.3, \ldots, 0.9, 1.0\}$, and the last point $\delta = 1.0$ is not included in the domain $(0, 1)$ of the Proposition. However, the derived asymptotic variance formula is still correct for $\delta = 1.0$. This is because when $\delta = 1$, all people are contacted, and the LIFT parameters have very simple forms: e.g., %response $= P(Y = 1)$, and lift $=$ %captured.response $= 1$ always. The variances of their sample estimates are trivial and can be checked individually to satisfy the formulas of Proposition 2.

Corollary 1 Assume the conditions of Proposition 2. For the estimated %response $\tilde{\pi} = \tilde{E}(Y\tilde{A})/\tilde{E}\tilde{A}$, the asymptotic variance is

$$\mathrm{var}(\tilde{\pi}) = (m\delta)^{-1}\pi(1 - \pi)\,[1 + (1 - \delta)(\pi - \Lambda)^2/(\pi(1 - \pi))]. \qquad (2)$$

For the estimated %captured.response $\tilde{\kappa} = \tilde{E}(Y\tilde{A})/\tilde{E}Y$, the asymptotic variance is

$$\mathrm{var}(\tilde{\kappa}) = (m\pi_0)^{-1}\kappa(1 - \kappa)\,[1 - 2\Lambda + \Lambda^2(1 - \delta)/(\pi(1 - \kappa))]. \qquad (3)$$

For the estimated lift $\tilde{\kappa}/\delta$, the asymptotic variance is $\mathrm{var}(\tilde{\kappa})/\delta^2$. Here $\pi_0 = EY$, $\pi = E(Y I(S > c))/E\,I(S > c)$ is the %response, $\tilde{\pi}$ is its estimator based on (1), $\kappa = E(Y I(S > c))/EY$ is the %captured.response, $\tilde{\kappa}$ is its estimator, $S$ is a score, $c$ is the cutoff such that $\delta = E\,I(S > c)$, $\delta \in (0, 1)$, and $\Lambda = E(Y \mid S = c)$.

Remark 2 (Theoretical behaviors of the binomial variances.) In the formula for the asymptotic variance of the estimated %response, the term $(m\delta)^{-1}\pi(1 - \pi) \equiv \mathrm{var}_B(\tilde{\pi})$ is the variance of a binomial proportion based on $\mathrm{Bin}(m\delta, \pi)$, where $m\delta$ is the total number of contacted people. In the formula for the asymptotic variance of the estimated %captured.response, the term $(m\pi_0)^{-1}\kappa(1 - \kappa) \equiv \mathrm{var}_B(\tilde{\kappa})$ is the variance of a binomial proportion based on $\mathrm{Bin}(\sum_{i=1}^{m} Y_i, \kappa)$, where $m\pi_0$ can be estimated by the total number of responders. (See Rosset et al. 2001, equations (8) and (10).)

Now consider the ratio $\mathrm{var}(\tilde{\kappa})/\mathrm{var}_B(\tilde{\kappa}) = [1 - 2\Lambda + \Lambda^2(1 - \delta)/(\pi(1 - \kappa))]$ (which is also the variance ratio for the lift estimate). The negative term suggests that the correct variance may be smaller than the binomial variance, leading to more accurate (shorter) confidence intervals. Note that $\mathrm{var}(\tilde{\kappa})/\mathrm{var}_B(\tilde{\kappa}) = 1 - \pi + O(\delta)$ (see Footnote 2). So the ratio may become nearly 0, and the binomial variance may become much too large, for small $\delta$ corresponding to a high %response $\pi$ (i.e., when a small percentage of highly likely responders are contacted). In this situation, our theoretical result can explain the empirical findings reported in Rosset et al. (2001, Table 2), that the binomial confidence intervals are loose at small $\delta$ values (such as 10%, 5%, 3% and 1%).

Regarding the ratio $\mathrm{var}(\tilde{\pi})/\mathrm{var}_B(\tilde{\pi}) = [1 + (1 - \delta)(\pi - \Lambda)^2/(\pi(1 - \pi))]$, note that our result implies $\mathrm{var}(\tilde{\pi}) \ge \mathrm{var}_B(\tilde{\pi})$, and the gap depends on the difference $\pi - \Lambda = E(Y \mid S > c) - E(Y \mid S = c)$, which is small if $E(Y \mid S = s)$ varies slowly for $s \ge c$. However, when the mean function $E(Y \mid S = s)$ changes steeply, the binomial variance can be too small, leading to a lower-than-nominal coverage probability for the resulting confidence interval. We will provide two examples later (in Section 5.1), where the variance ratios can be computed analytically, to show these two types of biases in the coverage probabilities of the binomial confidence intervals. They will be named the Case I example (for the too-long intervals for lift or %captured.response) and the Case II example (for the too-short intervals for %response), respectively.

[Footnote 2: This is established by expanding the ratio for small $\delta$ and noticing that for smooth $\Lambda(\delta)$, we have $\Lambda - \pi = O(\delta)$ for small $\delta$, and that $\kappa = \delta\pi/\pi_0$.]
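As a consistency check between Proposition 2 and Corollary 1, the %response case can be worked out by hand. The sketch below is not taken from Appendix B; it only uses $Y^2 = Y$, $A^2 = A$, $EA = \delta$ and $E(YA) = \delta\pi$, with $H = (Y - \Lambda)A/\delta$ from case (i):

```latex
\begin{align*}
EH &= \delta^{-1}\{E(YA) - \Lambda\,EA\} = \pi - \Lambda, \\
EH^2 &= \delta^{-2} E\{(Y - 2\Lambda Y + \Lambda^2)A\}
      = \delta^{-1}(\pi - 2\Lambda\pi + \Lambda^2)
      = \delta^{-1}\{\pi(1-\pi) + (\pi - \Lambda)^2\}, \\
\mathrm{var}(H) &= EH^2 - (EH)^2
      = \delta^{-1}\{\pi(1-\pi) + (1-\delta)(\pi - \Lambda)^2\}, \\
m^{-1}\mathrm{var}(H) &= (m\delta)^{-1}\pi(1-\pi)
      \left[1 + \frac{(1-\delta)(\pi - \Lambda)^2}{\pi(1-\pi)}\right].
\end{align*}
```

The last line is formula (2). The same moment calculation with $H = (Y - \Lambda)(A - \kappa)/\pi_0$, together with $\pi_0\kappa = \delta\pi$, gives formula (3).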

4 Confidence intervals

4.1 Local estimation method

We now consider how to apply Proposition 2 to construct valid asymptotic confidence intervals. Suppose we have a consistent variance estimate $\widetilde{\mathrm{var}}(\tilde{\theta})$ based on the validation sample, such that $\widetilde{\mathrm{var}}(\tilde{\theta})/\mathrm{var}(\tilde{\theta}) \to 1$ in probability as $m \to \infty$. Then by Slutsky's theorem, as $m \to \infty$, Proposition 2 implies that $(\tilde{\theta} - \theta)/\sqrt{\widetilde{\mathrm{var}}(\tilde{\theta})} \to N(0, 1)$ in distribution, which implies that $P\big(\theta \in \tilde{\theta} \pm z_\alpha \sqrt{\widetilde{\mathrm{var}}(\tilde{\theta})}\big) \to 1 - \alpha$ for any $\alpha \in (0, 1)$, where $z_\alpha$ satisfies $\Phi(z_\alpha) = 1 - \alpha/2$ for the standard normal cumulative distribution function $\Phi$. Therefore an asymptotic $100(1 - \alpha)\%$ confidence interval for $\theta$ is $\tilde{\theta} \pm z_\alpha \sqrt{\widetilde{\mathrm{var}}(\tilde{\theta})}$.

We now consider the problem of consistently estimating $\mathrm{var}(\tilde{\theta})$ by $\widetilde{\mathrm{var}}(\tilde{\theta})$ based on the validation sample. In the variance formula, $\mathrm{var}(H)$ can be estimated by the validation sample variance $\widetilde{\mathrm{var}}(H) = \tilde{E}(\tilde{H} - \tilde{E}\tilde{H})^2$, where $\tilde{H}$ is $H$ with the unknown parameters $(EY, \kappa, c, \Lambda)$ replaced by their consistent estimates $(\tilde{E}Y, \tilde{\kappa}, \tilde{c}, \tilde{\Lambda})$ based on the validation sample. The parameter $\delta$ is known (such as 20% to be contacted by a marketing campaign).

Now we consider estimation of $\Lambda = E(Y \mid S = c)$. In general, $\Lambda$ needs to be estimated from a local regression of $Y$ on $S$ based on the validation data around $S = c$, where $c$ is estimated by $\tilde{c}$. For example, a local average estimate $\tilde{\Lambda} = \tilde{E}\,Y I(S \in \tilde{c} \pm h)/\tilde{E}\,I(S \in \tilde{c} \pm h)$ is well known to be consistent when $h$ decreases with $m$ such that $h \to 0$ and $mh \to \infty$.

4.2 Subsampling method

In the previous subsection, we described a method to construct a confidence interval for a LIFT parameter $\theta$. The asymptotic variance involves a conditional mean response parameter $\Lambda$, which can be estimated by a local averaging of the responses against the scores. An alternative method we consider here is to bypass the problem of estimating $\Lambda$ by using a subsample estimate of the asymptotic variance, similar to Ibragimov and Müller (2010). Let $\tilde{\theta}_j$, $j = 1, \ldots, q$, be $q$ estimated LIFT measures based on $q\,(> 1)$ independent subsamples of sizes $m_j$, $j = 1, \ldots, q$. The total sample size is $m = \sum_{j=1}^{q} m_j$ and we consider the asymptotics when $\lim_{m \to \infty} m_j/m = 1/q$ (which holds almost surely, for example, when observations are randomly assigned to the $q$ groups with equal probability). Let $\bar{\theta}$ and $s^2$ be, respectively, the sample mean and sample variance of $\{\tilde{\theta}_j, j = 1, \ldots, q\}$. Then we have the following proposition.

Proposition 3 Assume the regularity condition of Proposition 2, and assume that $\lim_{m \to \infty} m_j/m = 1/q$ for all $j = 1, \ldots, q$. Then, as $m \to \infty$, $\sqrt{q}(\bar{\theta} - \theta)/s$ converges in distribution to $t_{q-1}$ (the t-distribution with $q - 1$ degrees of freedom), and for any $\alpha \in (0, 1)$, $P(\theta \in \bar{\theta} \pm t_{q-1,\alpha}\, s/\sqrt{q}) \to 1 - \alpha$, where $t_{q-1,\alpha}$ is the $(1 - \alpha/2)$th quantile of the $t_{q-1}$ distribution (see Footnote 3).

Proof: Applying the asymptotic normality result of Proposition 2, and noting that the subsamples are independent under this method, we have (*): $\sqrt{m/q}\,(\tilde{\theta}_1 - \theta, \ldots, \tilde{\theta}_q - \theta)^T$ converges in distribution to $N(0, \mathrm{diag}(\sigma^2, \ldots, \sigma^2))$, where $\sigma^2 = \mathrm{var}(H)$. (Note that here the individual sample sizes $m_j$ have all been replaced by $m/q$, which is justified by Slutsky's theorem.) The result (*) satisfies the basic assumption (4) made in Ibragimov and Müller (2010). By the continuous mapping theorem, we derive from (*) that $\sqrt{q}(\bar{\theta} - \theta)/s$ converges in distribution to $t_{q-1}$. Then the coverage probability result holds. Q.E.D.

[Footnote 3: This suggests that we can construct a $100(1 - \alpha)\%$ asymptotic confidence interval for $\theta$ by $\bar{\theta} \pm t_{q-1,\alpha}\, s/\sqrt{q}$. Alternatively, one can center the interval at the original whole sample estimate $\tilde{\theta}$ to obtain $\tilde{\theta} \pm t_{q-1,\alpha}\, s/\sqrt{q}$, as described in Section 8.]

5 A simulation study

5.1 Two analytic examples

In the remarks after Corollary 1, we described the behavior of the binomial confidence intervals, based on a theoretical study of the ratios of the asymptotic variances. In this subsection, we use the notation of Corollary 1 and present two examples where the asymptotic variances can be computed analytically, in order to give a sense of the size of the numerical differences that can be observed. The simulation study later will be based on smoothed versions of the two models presented here.

Case I (gradual case): Let $S = X$ with $X \sim \mathrm{Unif}(0, 1)$ and $E(Y \mid X) = X$. We consider contacting the top $\delta = 10\%$ of the $X$-scores. Then the threshold value is $c = 0.9$, since $P(X > c) = 10\%$. The other useful parameters include $\pi_0 = EY = 0.5$, $\Lambda = E(Y \mid X = c) = 0.9$, $\pi = P(Y = 1 \mid X > c) = 0.95$, and $\kappa = P(X > c \mid Y = 1) = \delta\pi/\pi_0 = 0.19$. This is a case with a small $\pi - \Lambda = 0.05$. The ratio $\mathrm{var}(\tilde{\pi})/\mathrm{var}_B(\tilde{\pi}) = [1 + (1 - \delta)(\pi - \Lambda)^2/(\pi(1 - \pi))] \approx 1.047$ is very close to 1. So there should not be much difference in the coverage probability and the length when one uses the binomial confidence interval for %response. However, the ratio $\mathrm{var}(\tilde{\kappa})/\mathrm{var}_B(\tilde{\kappa}) = [1 - 2\Lambda + \Lambda^2(1 - \delta)/(\pi(1 - \kappa))] \approx 0.147$ is very small. The binomial confidence interval for the lift or for the %captured.response will be much longer than needed, by a factor of $1/\sqrt{\mathrm{var}(\tilde{\kappa})/\mathrm{var}_B(\tilde{\kappa})} \approx 2.6$. The coverage probability will be overly conservative.

Case II (steep case): The structure is similar to Case I except that we let $E(Y \mid X) = \max\{\min\{3(X - 1/3), 1\}, 0\}$, which is a three-piece linear model that is 0 for $X < 1/3$ and 1 for $X > 2/3$. We consider contacting half of the people, $\delta = 0.5$. Then $c = 0.5$ since $P(X > 0.5) = 0.5$. The other useful parameters include $\pi_0 = EY = 0.5$, $\Lambda = E(Y \mid X = c) = 0.5$, and $\pi = P(Y = 1 \mid X > c) = 11/12 \approx 0.917 = P(X > c \mid Y = 1) = \kappa$. This is a case with a large $\pi - \Lambda = 5/12 \approx 0.417$, caused by a steep change in the middle part of the $X$ domain. The variance ratio $\mathrm{var}(\tilde{\pi})/\mathrm{var}_B(\tilde{\pi}) = [1 + (1 - \delta)(\pi - \Lambda)^2/(\pi(1 - \pi))] \approx 2.136$ is much larger than 1. So the binomial confidence interval for the %response is too short and will undercover. (In this case the binomial confidence interval for the lift or %captured.response will also be too short, since $\mathrm{var}(\tilde{\kappa})/\mathrm{var}_B(\tilde{\kappa}) = [1 - 2\Lambda + \Lambda^2(1 - \delta)/(\pi(1 - \kappa))] \approx 1.636 > 1$.) When the nominal confidence level is 0.95, the asymptotic coverage probability of the binomial confidence interval (of the form $\tilde{\pi} \pm 1.96\sqrt{\mathrm{var}_B(\tilde{\pi})}$) is only $\Phi(1.96/\sqrt{2.136}) - \Phi(-1.96/\sqrt{2.136}) \approx 0.82$.

Although these two cases are only theoretical predictions based on piecewise linear response curves, we will verify in the later simulations that these predictions are qualitatively correct even with more realistic smooth response curves following logistic regression models.
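The arithmetic in the two analytic cases can be reproduced directly from formulas (2) and (3). The following Python sketch is an illustration only; the function names `variance_ratios` and `binomial_coverage` are hypothetical, and the coverage expression assumes the estimate is exactly normal with the correct asymptotic variance.

```python
from math import sqrt
from statistics import NormalDist


def variance_ratios(delta, pi, lam, pi0):
    """Return (kappa, var/var_B for %response, var/var_B for %captured.response)."""
    kappa = delta * pi / pi0
    ratio_pi = 1 + (1 - delta) * (pi - lam) ** 2 / (pi * (1 - pi))
    ratio_kappa = 1 - 2 * lam + lam ** 2 * (1 - delta) / (pi * (1 - kappa))
    return kappa, ratio_pi, ratio_kappa


def binomial_coverage(ratio, level=0.95):
    """Asymptotic coverage of a binomial CI when the true variance is ratio * var_B."""
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)
    return 2 * NormalDist().cdf(z / sqrt(ratio)) - 1


# Case I: delta = 0.1, pi = 0.95, Lambda = 0.9, pi0 = 0.5
case1 = variance_ratios(0.1, 0.95, 0.9, 0.5)
# Case II: delta = 0.5, pi = 11/12, Lambda = 0.5, pi0 = 0.5
case2 = variance_ratios(0.5, 11 / 12, 0.5, 0.5)

print("Case I :", case1, "length ratio", 1 / sqrt(case1[2]))
print("Case II:", case2, "binomial coverage", binomial_coverage(case2[1]))
```

A ratio below 1 means the binomial interval is longer than necessary by the factor $1/\sqrt{\text{ratio}}$; a ratio above 1 means it undercovers, by the amount `binomial_coverage` reports.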

5.2 Simulations

We let the score $S = X \sim \mathrm{Unif}[0, 1]$ in the simulations. We use two different logistic regression models to generate simulated data. These two cases will be called the gradual case and the steep case, respectively. In the first (gradual) case, the true response probability (logit model) is $\mu(x) = E(Y \mid X = x) = 1/(1 + e^{[\ldots]})$, the top 10% of the population according to $S$ is contacted, and $\mu$ varies little (from [...] to 0.934) in the contacted region $S \in (0.9, 1.0]$. In the second (steep) case, the true response probability (logit model) is $\mu(x) = E(Y \mid X = x) = 1/(1 + e^{[\ldots]})$, the top 50% of the population according to $S$ is contacted, and $\mu$ varies a lot (from [...] to 1.000) in the contacted region $S \in (0.5, 1.0]$. These are similar to the two cases discussed in Section 5.1 but are now smoothed.

We compare the performance of three confidence intervals (CIs): the binomial CI, the local estimation CI (as explained in Section 4.1), and the subsampling CI (as explained in Section 4.2). For the subsampling method, we use $q = 10$ subsamples in the simulation. For the local estimation method, we need to estimate the conditional mean $\Lambda = E(Y \mid S = c)$. We use a local average estimator $\tilde{\Lambda}$, which is the sample average of $Y$ for $S \in [\tilde{c} - h, \tilde{c} + h]$. We use $h = m^{-1/3}$ in the simulations (which leads to the optimal convergence rate for estimating $\Lambda$). So, for example, when $m = 1000$, we use $h_{\text{optimal}} = 0.1$.

In Tables 1 to 6, we compare the empirical coverage probabilities with the nominal ones for a thousand CIs obtained from the three methods (binomial, local estimation, and subsampling), and also the average widths of the CIs, as well as the algorithm complexities in terms of the total execution times (see Footnote 4). Each of these thousand CIs is derived from a common sample size $m$. We report the results for two choices of the sample size: $m = 1000$ and $m = 10000$, which are quite typical sample sizes for the real data sets used in data mining. (Behaviors at smaller sample sizes, and with varying tuning parameters $h$ and $q$, are summarized in Section 8.)

[Footnote 4: The hardware information of the computer used to derive the results is as follows, for reference of timing. CPU: Intel® Core™ i5-3210M CPU 2.50GHz; RAM: 8.00GB; OS: Windows® 7 64 bits; pseudo-random number generator: R; programming language and major software component: R.]

We first look at the performance of the binomial CI. For %response, our theoretical predictions in Section 5.1 turn out to be very close to the actual simulation results. In Case 1 (the gradual case), we predicted that the coverage probabilities would be close to the nominal ones. The simulation results (based on 1000 CIs) are 93.3% (nominal: 95%) and 91.9% (nominal: 90%). In Case 2 (the steep case), we predicted that the CI would severely undercover. The simulation results are 81.2% (nominal: 95%) and 74.0% (nominal: 90%).

For lift and %captured.response, since they only differ by a factor $\delta$, the CI coverage probabilities must be exactly the same in simulations, while the widths of their CIs must differ exactly by a factor of $\delta$. In the gradual case (where the binomial CI performs well for the %response), we notice that the binomial CIs for both lift and %captured.response tend to be overly conservative and too loose (similar to the findings reported in Rosset et al. 2001, Section 3.2.3). This can be seen in the simulation results for Case 1, where for lift and %captured.response all the coverage probabilities (based on 1000 CIs) are 100%. On the other hand, for Case 2, where the binomial CI undercovers for %response, it also undercovers for lift and %captured.response. The simulation coverage probabilities are 85.5% (nominal: 95%) and 77.1% (nominal: 90%). In summary, the results above verify our theoretical predictions.

Regarding the sample sizes ($m$) needed for the asymptotic CIs to have satisfactory coverage probabilities (close to the nominal ones), both the local estimation method and the subsampling method perform quite well already when $m = 1000$, and they do even better when $m = 10000$. (In comparison, increasing $m$ does not improve the performance of the binomial method. In other words, the binomial CIs can either overcover or undercover even with $m = 10000$.) The local estimation method tends to produce slightly shorter CIs than the subsampling method, but they both work well in terms of the coverage probabilities.

Table 1: CIs for %response, case 1, $\mu(x) = E(Y \mid X = x) = 1/(1 + e^{[\ldots]})$
Table 2: CIs for %response, case 2, $\mu(x) = E(Y \mid X = x) = 1/(1 + e^{[\ldots]})$
Table 3: CIs for lift, case 1, $\mu(x) = E(Y \mid X = x) = 1/(1 + e^{[\ldots]})$
Table 4: CIs for lift, case 2, $\mu(x) = E(Y \mid X = x) = 1/(1 + e^{[\ldots]})$
Table 5: CIs for %captured.response, case 1, $\mu(x) = E(Y \mid X = x) = 1/(1 + e^{[\ldots]})$
Table 6: CIs for %captured.response, case 2, $\mu(x) = E(Y \mid X = x) = 1/(1 + e^{[\ldots]})$

Tables 1 to 6 share one layout. Columns: methods Binomial, Local, Subsample at sample size n = 1000, then the same three methods at n = 10000 (n in the tables is the validation sample size, denoted $m$ in the text). Rows: coverage (95% CI), width (95% CI), coverage (90% CI), width (90% CI), time (sec). The numerical entries [...] are not recoverable here.
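The three intervals compared in the tables can be sketched for %response in a few lines of Python. This is an illustrative sketch under stated assumptions, not the authors' R code: the response curve `mu` is a hypothetical steep logistic curve (the coefficients used in the paper are not available here), all function names are made up for the example, the t quantile is hard-coded for $q = 10$ and a 95% level, and the local-estimation variance plugs $\tilde{\pi}$ and $\tilde{\Lambda}$ into formula (2).

```python
import math
import random
from statistics import NormalDist, mean, stdev

Q = 10                            # number of subsamples, as in the simulations
T_Q = 2.262                       # 0.975 quantile of t with Q - 1 = 9 d.o.f.
Z = NormalDist().inv_cdf(0.975)   # 0.975 standard normal quantile


def mu(x):
    """Hypothetical steep response curve (an assumption, not the paper's model)."""
    return 1.0 / (1.0 + math.exp(-20.0 * (x - 0.5)))


def pct_response(y, s, delta):
    """Sample %response over the top 100*delta% of scores, plus the cutoff score."""
    k = max(1, round(len(y) * delta))
    order = sorted(range(len(y)), key=lambda i: s[i], reverse=True)
    return sum(y[i] for i in order[:k]) / k, s[order[k - 1]]


def binomial_ci(y, s, delta):
    pi, _ = pct_response(y, s, delta)
    half = Z * math.sqrt(pi * (1 - pi) / (len(y) * delta))
    return pi - half, pi + half


def local_ci(y, s, delta):
    m = len(y)
    pi, c = pct_response(y, s, delta)
    h = m ** (-1.0 / 3.0)                          # bandwidth h = m^(-1/3)
    near = [yi for yi, si in zip(y, s) if abs(si - c) <= h]
    lam = sum(near) / len(near)                    # local average of Y near the cutoff
    var = (pi * (1 - pi) + (1 - delta) * (pi - lam) ** 2) / (m * delta)
    half = Z * math.sqrt(var)
    return pi - half, pi + half


def subsample_ci(y, s, delta, rng):
    idx = list(range(len(y)))
    rng.shuffle(idx)                               # random assignment to Q groups
    ests = []
    for j in range(Q):
        part = idx[j::Q]
        ests.append(pct_response([y[i] for i in part], [s[i] for i in part], delta)[0])
    half = T_Q * stdev(ests) / math.sqrt(Q)
    return mean(ests) - half, mean(ests) + half


def coverage(m=1000, delta=0.5, reps=200, seed=1):
    """Empirical coverage of the three nominal 95% CIs for %response."""
    rng = random.Random(seed)
    grid = 10000                                   # midpoint rule for the true %response
    true_pi = sum(mu(1 - delta + delta * (g + 0.5) / grid) for g in range(grid)) / grid
    hits = {"binomial": 0, "local": 0, "subsample": 0}
    for _ in range(reps):
        s = [rng.random() for _ in range(m)]
        y = [1 if rng.random() < mu(x) else 0 for x in s]
        cis = {"binomial": binomial_ci(y, s, delta),
               "local": local_ci(y, s, delta),
               "subsample": subsample_ci(y, s, delta, rng)}
        for name, (lo, hi) in cis.items():
            hits[name] += lo <= true_pi <= hi
    return {name: n / reps for name, n in hits.items()}


if __name__ == "__main__":
    print(coverage())
```

Because the plug-in variance in `local_ci` adds a nonnegative term to the binomial variance, the local interval for %response is never shorter than the binomial one. For lift or %captured.response the correction can go the other way, as Remark 2 explains.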

Figure 1: Lift estimates (lift curve) without confidence intervals. (The solid horizontal line represents the baseline lift, which is equal to the lift estimate when the fraction contacted is 1.0.)

6 Real data application

Figure 2: Lift estimates and their 95% CIs (lift curve, 95% local CIs, 95% binomial CIs). (The solid horizontal line with height 1 represents the lift with campaign action completely random.)

To illustrate the application of the methodology, in this section we consider the Orange Juice Data (Stine, Foster and Waterman 1998). The data contain purchases where the customer purchased either Citrus Hill (Y = 1) or Minute Maid (Y = 0) orange juice. A number of characteristics (X) of the customer and product are recorded, such as price difference and customer brand loyalty. The total number of observations is 1070, of which we randomly set aside half as the training data. We use logistic regression with the default option in SPSS to obtain an estimate of the probability $P(Y = 1 \mid X)$, which will be used as the scoring rule S(X). The rule S(X) is applied to the validation sample with m = 535, and the

sample lift values are computed and plotted in Figure 2, according to the deciles {0.1, 0.2, 0.3, ..., 0.9, 1.0}. In this figure, the pointwise asymptotic 95% confidence intervals are presented as little squares (connected by dashed lines). They are computed according to the local estimation method with bandwidth $h = m^{-1/3}$. For comparison, we have also presented the binomial confidence intervals as little diamonds (connected by solid lines). We notice that the local estimation method can lead to much more accurate confidence intervals, especially at small fractions contacted. At the 0.1 decile, the width of the confidence interval is reduced to 36% of the width from the binomial method. Therefore, we have demonstrated that our method can be used to provide valid confidence intervals for the LIFT chart function of standard data mining software, which will be a very useful addition for statisticians.

7 Discussions

Despite the popularity of data mining practices, we find very little previous work on statistical inference for the LIFT measures, which are commonly used in data mining. In this paper, we have provided an asymptotic distribution theory for the validation sample estimates of the LIFT measures. We have also discussed several methods for constructing valid asymptotic confidence intervals for some common LIFT measures, including %response, lift, and %captured.response.

The current work focuses on a single confidence interval with one given scoring rule applied to a fixed percentage of the targeted population. However, after adjusting for multiplicity as outlined in Appendix A, our asymptotic distribution results may also be applied to study simultaneous confidence intervals, with several different percentiles of the contacted population, as is commonly plotted in the LIFT charts

in data mining practices. This is illustrated in a real data example in Zhao (2014, Ph.D. thesis). It will also be of interest to study the statistical comparison of several scoring rules, say, one obtained from logistic regression, one from neural networks, and one from a decision tree. We will leave this as future work.

8 Further technical details

More extensive simulation studies and additional technical details are contained in this section, the contents of which are summarized below. We have studied the sensitivity to the choice of the tuning parameters (h for local averaging and q for subsampling), and we have studied the performance of the confidence intervals for smaller sample sizes such as m = 200, 600, …. We have also studied the coverage property for all decile values from 0.1 to 1.0 and considered small sample corrections to improve the performance at very small and very large deciles. In addition, we have studied the performance of a simple nonparametric bootstrap confidence interval.

We found that the bootstrap method is not significantly better than the proposed methods and can be slower by orders of magnitude. We found that the proposed methods are robust under different choices of the tuning parameters, and have good coverage performance for smaller sample sizes, at all decile values, after using the plus four-type corrections, which are similar to the ones used in common freshman textbooks (e.g., Moore 2010, p.508).

We have studied the nominal 95% confidence intervals in all these results. Also, we have only reported results on lift and on %response. (We omitted %captured.response since it is proportional to the lift by a nonrandom constant.) Here is a detailed summary.

1. (Bootstrap). As a referee suggested, we study the percentile bootstrap confidence

interval, which is a simple nonparametric bootstrap method based on the 2.5% and 97.5% bootstrap quantiles. We use 1000 bootstrap repetitions, which is a typical number of bootstrap repetitions, as is used in Section 2.2 of Macskassy et al. A theoretical note: We believe that the bootstrap method is asymptotically valid, since our results in Appendix B can be used to establish the Hadamard differentiability of the LIFT estimators, and we can use the idea of van der Vaart (2000, Example 23.11) to prove the validity of the bootstrap method. However, we will describe later that the bootstrap method is much slower and does not perform better than the proposed methods. Below, we will first point out that the bootstrap method (as well as all the other methods) can work very poorly in some situations, and that a plus four correction (explained below) can significantly improve its performance.

2. (Plus four correction). We notice that the asymptotic confidence intervals can work very poorly in some finite sample situations (especially in Case II simulations). Plus four corrections can significantly improve the finite sample performance. For example, for the bootstrap method, the percentiles were initially obtained by directly resampling the original estimates, without the plus four correction. We then noticed that they perform poorly (having very low coverage probabilities) for small or moderate m, e.g., m = 500, at some deciles in Case II. After some experiments, we found the reason. In Case II, $P(A = 1 \mid Y = 1)$ is nearly 1 at high deciles (e.g., 0.9), and $P(Y = 1 \mid A = 1)$ is nearly 1 at small deciles (e.g., 0.1). Over resampling, the sample proportions are almost always one, which will often lead to zero-width confidence intervals, missing the true

values by a little amount. Similarly, in Case II at these deciles, confidence intervals without plus four do not work well for the other three methods either. E.g., for the binomial method, a binomial variance estimate such as $\hat\pi(1 - \hat\pi)/50$ is zero when the sample proportion $\hat\pi$ is, say, 50/50, without the plus four correction. This often happens for a large success probability very close to one (such as π = …), which will then be missed by the zero-width confidence interval [1, 1]. Even though the interval misses only by a small amount, this will almost always happen and will lead to a very low coverage probability.

As a remedy, one can use the plus four-type corrections that are commonly used in freshman textbooks (e.g., Moore 2010, p.508), by adding 4 observations with 2 Yeses and 2 Nos. The $\hat\pi$ in the sample variance estimate will then be replaced by 52/54, and the resulting confidence interval, even though it changes very little, will have a nonzero width that is enough to cover the true value. The plus four approach can sometimes be overly conservative, but it is better to overcover than to undercover, and the increase in the width of the confidence interval is often very small anyway. The plus four correction does not affect the asymptotics, and any differences will go away in the large sample limit.

For the local estimation method, in the variance formulas we can use the plus four estimates for all proportion parameters, including Λ (when estimated from the observations falling within the bandwidth).

Now we consider the plus four correction for the bootstrap, i.e., bootstrapping the plus four estimates. A random plus four method is needed here, since the fixed plus four estimates (say, (50+2) out of (50+4)) would remain the same over bootstrap resampling if the resampled original estimates were all the same (say, 50 out of 50) to begin with. The random method adds 4 observations with

Bin(4, 0.5) Yeses and 4 − Bin(4, 0.5) Nos to the sample proportion estimates, independently over all the bootstrap repetitions.

Similarly, a fixed plus four method would not work for subsampling. We considered a random plus four method that adds a total of 2 Yeses and 2 Nos, randomly assigned to the q subsample proportion estimates, when estimating the %response and the %captured.response. Alternatively, we considered an approximate plus four-type correction to the variance estimate. We found that both methods work very similarly, with the approximate plus four method working slightly better. All the results reported in this section for subsampling are based on this approximate plus four method. This method involves increasing the original variance estimate (the sample variance of the q subsample estimates, without plus four correction, divided by q) by a small correction $2/n^2$, where n is the sample size used in the original (whole sample) proportion estimate (which is the total number of responses for %captured.response, or the total number of contacted people for %response). This correction does not affect the asymptotics, since the asymptotic variance is of order 1/n. It is based on an analogy with the following approximation to the plus four correction for the binomial method.

A theoretical note: Notice that for binomial data with y Yeses and n − y Nos, the effect of the plus four correction on the binomial variance estimate is an increase of up to $2/n^2$. Let $v_4 = \frac{y+2}{n+4}\left(1 - \frac{y+2}{n+4}\right)/(n+4)$ and $v = (y/n)(1 - y/n)/n$. Then $v_4 = v\,[n^3/(n+4)^3] + (2/n^2)\,[(n+2)n^2/(n+4)^3]$, where the factors in the square brackets are less than 1 and are very close to 1 for large n.

The plus four corrections lead to significant improvement for all four methods in Case II. See Figure 3 (for Case I) and Figure 4 (for Case II) for a

comparison of the four methods: binomial, local estimation, subsampling and bootstrap, with and without plus four corrections. We tried sample sizes m = 100, 300, 500, 1000 and reported the results at m = 500. The results at the other sample sizes are all similar.

3. (Performance Comparisons). As seen in Figure 3 (for Case I) and Figure 4 (for Case II), all four methods (binomial, local estimation, subsampling and bootstrap) work well in Case I, with or without the plus four correction. In Case II, all four methods work very poorly without the plus four correction, for %response at small deciles and for %captured.response at large deciles. All four methods work much better with plus four corrections. The proposed methods work very well in both Case I and Case II, compared to the binomial method and the bootstrap method. The subsampling confidence intervals are wider and more conservative than the local estimation confidence intervals, probably due to the use of the t distribution instead of the z distribution. In terms of the coverage probabilities, there are no serious undercoverages in either method, and any overcoverages do not lead to significant widening of the confidence intervals. These hold for all decile values. This is in contrast to the binomial method, which can also overcover for lift at small deciles, but with a much more severe widening of the confidence intervals.

4. (Time Comparison). In terms of the time needed to compute the confidence intervals, the binomial method is the fastest, followed by the subsampling method and then the local estimation method, and the bootstrap method is the slowest. Over various runs, we observe that typical ratios of the computational time needed are about 1 : (1.2 to 1.6) : (1.5 to 1.7) : (30 to 100).

5. (Center of the confidence intervals). Whenever possible, we use the unmodified original sample proportion estimates as the center of the confidence intervals, since sample proportions are most commonly used to estimate the LIFT measures (e.g., in standard data mining software). This means that we do not use the plus four correction on the center of the (binomial, local, or subsampling) confidence intervals, but only use the correction for improving the variance estimates. This also means that for the subsampling method, we do not use the average of the subsample estimates as the center of the confidence interval, but rather the original whole sample estimates, which are used in standard data mining software, are easier to compute, and can perform better (see below). [The subsample estimates are only used to estimate the variance (by the sample variance of the q subsample estimates, divided by q), with a plus four correction.]

A theoretical note: We believe that the whole sample estimate $\hat\theta$ and the average of the subsample estimates $\bar\theta$ are asymptotically equivalent. A heuristic argument is that they have the same influence function and the same Hadamard differential with respect to the underlying subsample empirical distribution functions. However, in results not presented here, we found that the t confidence intervals centered at the whole sample estimates $\hat\theta$ perform better than the intervals centered at $\bar\theta$. This may be because the latter has a larger asymptotic bias (of order O(1/(m/q))) than that of the former (of order O(1/m)). (See, e.g., Haas 2006.) However, we expect that the difference will disappear asymptotically. For large sample sizes such as those used in the earlier simulations in Section 5, the confidence intervals centered at the average subsample estimates also perform very well.

6. (Effect of tuning parameters under various sample sizes). In Figure 5 (for Case I) and Figure 6 (for Case II), we consider the effect of using different bandwidth parameters h for the local estimation method. We consider $h = 0.5\,m^{-1/3}$, $m^{-1/3}$, $1.5\,m^{-1/3}$, for m = 200, 600, …. In Figure 7 (for Case I) and Figure 8 (for Case II), we consider the effect of using different numbers of subsamples q for the subsampling method. We consider q = 5, 10, 20, for m = 200, 600, …. In all these simulations, we found that the results change very little. The coverage probabilities are not significantly affected by different choices of h or q. In simulation results not presented here, we also found that the widths of the confidence intervals are not significantly affected by different choices of h or of q. However, the widths of the subsampling confidence intervals vary somewhat more than the widths of the local estimation confidence intervals, and they tend to be wider and more conservative. In terms of the coverage probabilities, there are no serious undercoverages for either method, for all these choices of tuning parameters and sample sizes.

Appendix A: Simultaneous confidence intervals

Consider any LIFT measure θ, such as θ = %response = $E(Y \mid S > c)$. It depends on the scoring rule S, as well as on the cutoff c, which is determined by the percentage contacted, P(S > c). SAS Enterprise Miner plots the LIFT charts to display multiple θ's simultaneously. These θ's can include, for example, the %response (or lift) at all the deciles {0.1, 0.2, 0.3, ..., 1.0}. Let us label the corresponding performance measures as $\theta_k$, k = 1, ..., p. The previous confidence intervals are of the form $\hat\theta_k \pm z_\alpha s_k$ for the performance measure $\theta_k$, k = 1, ..., p, where $s_k = \sqrt{\widehat{\mathrm{var}}(\hat\theta_k)}$. These confidence intervals only have individually correct asymptotic coverage probabilities, i.e., $P(\theta_k \in \hat\theta_k \pm z_\alpha s_k) \to 1 - \alpha$, for each k = 1, ..., p. However,

Figure 3: Comparison of four CI methods (bootstrap, binomial, local regression, subsampling), without and with plus 4 correction, for Case I (m = 500). Panels: coverage for lift, mean width for lift, coverage for %response, and mean width for %response, each shown without and with plus 4. Vertical axes: percentage coverage or mean width.
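The plus four correction compared in these panels can be checked numerically. The sketch below, written in Python purely for illustration, evaluates the binomial variance estimate with and without the correction and verifies the decomposition $v_4 = v\,[n/(n+4)]^3 + (2/n^2)\,[(n+2)n^2/(n+4)^3]$ from the theoretical note in Section 8. The function names are hypothetical and are not taken from the authors' software.

```python
def binomial_variance(y, n):
    """Plug-in variance estimate p * (1 - p) / n for y Yeses out of n trials."""
    p = y / n
    return p * (1 - p) / n


def plus_four_variance(y, n):
    """The same estimate after adding 2 Yeses and 2 Nos (plus four)."""
    return binomial_variance(y + 2, n + 4)


def decomposition(y, n):
    """Right-hand side of v4 = v * [n/(n+4)]^3 + (2/n^2) * [(n+2) n^2 / (n+4)^3]."""
    v = binomial_variance(y, n)
    shrink = (n / (n + 4)) ** 3               # first bracketed factor, below 1
    bump = (n + 2) * n**2 / (n + 4) ** 3      # second bracketed factor, below 1
    return v * shrink + (2 / n**2) * bump


if __name__ == "__main__":
    y, n = 50, 50                      # the 50-out-of-50 example from the text
    print(binomial_variance(y, n))     # zero variance, hence the interval [1, 1]
    print(plus_four_variance(y, n))    # positive, based on 52/54
    print(decomposition(y, n))         # agrees with the line above
```

Because both bracketed factors are below 1, the correction adds at most $2/n^2$ to the variance estimate, which is negligible next to an asymptotic variance of order 1/n.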

Figure 4: Comparison of four CI methods (bootstrap, binomial, local regression, subsampling), without and with plus 4 correction, for Case II (m = 500). Panels: coverage for lift, mean width for lift, coverage for %response, and mean width for %response, each shown without and with plus 4. Vertical axes: percentage coverage or mean width.
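The zero-width problem behind the poor Case II coverage, and the effect of the random plus four correction on the percentile bootstrap, can be illustrated with a small Python sketch. It follows the description in Section 8 (2.5% and 97.5% bootstrap quantiles; Bin(4, 0.5) Yeses and the remaining Nos added independently in each repetition), but several details are assumptions made here: resampling (score, response) pairs and re-ranking within each resample, the simple order-statistic rule for the quantiles, the toy data, and all function and argument names.

```python
import random


def pct_response_counts(pairs, frac):
    """Return (hits, k): responders among the top `frac` of scores, and k contacted."""
    k = max(1, round(frac * len(pairs)))
    ranked = sorted(pairs, key=lambda pair: pair[0], reverse=True)
    return sum(y for _, y in ranked[:k]), k


def percentile_bootstrap_ci(pairs, frac, reps=1000, plus_four=True, seed=0):
    """Percentile bootstrap CI (2.5% and 97.5% quantiles) for %response."""
    rng = random.Random(seed)
    m = len(pairs)
    estimates = []
    for _ in range(reps):
        # Resample (score, response) pairs with replacement.
        resample = [pairs[rng.randrange(m)] for _ in range(m)]
        hits, k = pct_response_counts(resample, frac)
        if plus_four:
            # Random plus four: B ~ Bin(4, 0.5) Yeses and 4 - B Nos,
            # drawn afresh for every bootstrap repetition.
            extra_yes = sum(rng.random() < 0.5 for _ in range(4))
            estimates.append((hits + extra_yes) / (k + 4))
        else:
            estimates.append(hits / k)
    estimates.sort()
    lower = estimates[int(0.025 * reps)]
    upper = estimates[min(reps - 1, int(0.975 * reps))]
    return lower, upper


if __name__ == "__main__":
    # Steep toy example: every subject in the upper half of the scores responds.
    pairs = [(i / 200, 1 if i >= 100 else 0) for i in range(200)]
    print(percentile_bootstrap_ci(pairs, frac=0.1, plus_four=False))
    print(percentile_bootstrap_ci(pairs, frac=0.1, plus_four=True))
```

In this toy example the uncorrected bootstrap estimates are all equal to one, so the interval collapses to a single point, while the random plus four version has a positive width.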
