MODEL CHANGE DETECTION WITH APPLICATION TO MACHINE LEARNING. University of Illinois at Urbana-Champaign


Yuheng Bu, Jiaxun Lu, Venugopal V. Veeravalli

University of Illinois at Urbana-Champaign; Tsinghua University

arXiv preprint [stat.ML], Nov 2018

ABSTRACT

Model change detection is studied, in which there are two sets of samples that are independently and identically distributed (i.i.d.) according to a pre-change probabilistic model with parameter $\theta$, and a post-change model with parameter $\theta'$, respectively. The goal is to detect whether the change in the model is significant, i.e., whether the difference between the pre-change parameter and the post-change parameter, $\|\theta - \theta'\|_2$, is larger than a pre-determined threshold $\rho$. The problem is considered in a Neyman-Pearson setting, where the goal is to maximize the probability of detection under a false alarm constraint. Since the generalized likelihood ratio test (GLRT) is difficult to compute in this problem, we construct an empirical difference test (EDT), which approximates the GLRT and has low computational complexity. Moreover, we provide an approximation method to set the threshold of the EDT to meet the false alarm constraint. Experiments with linear regression and logistic regression are conducted to validate the proposed algorithms.

Index Terms: Model change detection, generalized likelihood ratio test, Neyman-Pearson setting.

1. INTRODUCTION

We study the model change detection problem, where two sets of samples are independently and identically distributed (i.i.d.) according to a pre-change probabilistic model with parameter $\theta$, and a post-change probabilistic model with parameter $\theta'$, respectively. The goal is to determine whether the change in the model is significant or not. We formulate the problem in a Neyman-Pearson setting, and adopt the $\ell_2$ distance between the parameters to measure the change between the models. More specifically, our goal is to construct a test to detect whether $\|\theta - \theta'\|_2$ is larger than a pre-determined threshold $\rho$, while satisfying a false alarm constraint.

This problem is motivated in part by the recent works on active and adaptive sequential learning [1-3], where the machine learning models learned in previous time-steps are used adaptively to improve the accuracy and data-efficiency in the next time-step. A key step in applying these adaptive sequential learning methods is the detection of an abrupt or large model change, since adapting to a previous model that is significantly different from the current one could degrade performance. A specific application in this context is the detection of a shift of user preferences in personalized recommendation systems [4, 5]. In addition, we believe that our model change detection formulation can be applied in transfer learning [6] to determine whether two machine learning tasks are transferable.

We note that our model change detection problem is different from the quickest change detection problem studied in [7, 8]. There, a linear regression model changes at an unknown point in time, and the goal is to detect the change as soon as possible with streaming data. We are instead interested in detecting whether the change in the model is significant, given sets of samples from the pre- and post-change models.

A standard method for solving a composite hypothesis testing problem such as the model change detection problem under consideration is the generalized likelihood ratio test (GLRT). However, the maximum likelihood estimates of $\theta$ and $\theta'$ required in the GLRT are difficult to compute under the constraint $\|\theta - \theta'\|_2 \le \rho$ in this case. Our first contribution is to propose an empirical difference test (EDT), which approximates the GLRT and has low computational complexity. Moreover, we provide an approximation method to set the threshold in the proposed EDT, which ensures a bound on the worst-case false alarm probability. We validate our results using experiments involving linear regression and logistic regression.

The work of Y. Bu and V. V. Veeravalli was supported by the Army Research Laboratory under Cooperative Agreement W9NF[…] through the University of Illinois at Urbana-Champaign.

2. PROBLEM MODEL

Throughout this paper, we use lower case letters to denote scalars and vectors, and upper case letters to denote random variables and matrices. We use $\lambda_{\max}(A)$ and $\lambda_{\min}(A)$ to denote the largest and the smallest eigenvalues of a matrix $A$, respectively, and $\mathrm{Tr}(A)$ to denote the trace of a square matrix $A$. All logarithms are natural.

We consider the model change detection problem in the following setting. We are given two datasets $S = \{z_1, \dots, z_n\}$ and $S' = \{z'_1, \dots, z'_{n'}\}$ with samples drawn from some instance space $\mathcal{Z}$. In addition, we are given a parameterized family of distribution models $\mathcal{M} = \{p(z|\theta),\ \theta \in \mathbb{R}^d\}$. We assume that there exist two unknown parameters $\theta, \theta' \in \mathbb{R}^d$ such that the datasets $S$ and $S'$ are independently generated from the following pre-change and post-change models, respectively:

$$Z_i \sim p(z_i|\theta),\ z_i \in S, \quad \text{and} \quad Z'_j \sim p(z'_j|\theta'),\ z'_j \in S'. \tag{1}$$

Our goal is to construct a computationally efficient test to decide between the following two hypotheses:

$$H_0: (\theta, \theta') \in \chi_0 \triangleq \{(\theta, \theta') : \|\theta - \theta'\|_2 \le \rho\}, \qquad H_1: (\theta, \theta') \in \chi_1 \triangleq \{(\theta, \theta') : \|\theta - \theta'\|_2 > \rho\}, \tag{2}$$

where $\rho$ is a constant determined by the specific application. Let $\delta : \mathcal{Z}^{n} \times \mathcal{Z}^{n'} \to \{0, 1\}$ denote the decision rule for the model change detection problem. Then the probabilities of false alarm and correct detection can be written as

$$P_F(\delta, \theta, \theta') \triangleq P_{(\theta, \theta')}\{\delta(S, S') = 1\}, \quad (\theta, \theta') \in \chi_0, \tag{3}$$

$$P_D(\delta, \theta, \theta') \triangleq P_{(\theta, \theta')}\{\delta(S, S') = 1\}, \quad (\theta, \theta') \in \chi_1, \tag{4}$$

where $P_{(\theta, \theta')}$ denotes the probability measure for the data conditioned on the model parameters $(\theta, \theta')$. Note that in (2), both the null hypothesis and the alternative hypothesis are composite. We study the detection problem in the Neyman-Pearson setting:

$$\max_{\delta}\ P_D(\delta, \theta, \theta'),\ \forall (\theta, \theta') \in \chi_1 \qquad \text{s.t.}\quad P_F(\delta, \theta, \theta') \le \alpha,\ \forall (\theta, \theta') \in \chi_0. \tag{5}$$

As seen in (5), our goal is to construct a test that maximizes the detection probability for all $(\theta, \theta') \in \chi_1$, and satisfies the false alarm constraint for all $(\theta, \theta') \in \chi_0$. The solution to (5), if it exists, is said to be a uniformly most powerful (UMP) test.

Since $z_i$ and $z'_i$ are drawn i.i.d. from $p(z_i|\theta)$ and $p(z'_i|\theta')$, respectively, we can use

$$L(\theta) \triangleq -\sum_{i=1}^{n} \log p(z_i|\theta), \qquad L'(\theta') \triangleq -\sum_{i=1}^{n'} \log p(z'_i|\theta') \tag{6}$$

to denote the negative log-likelihood functions of the pre-change dataset $S$ and the post-change dataset $S'$, respectively. Then, the maximum likelihood estimates (MLEs) of $\theta$ and $\theta'$ can be written as

$$\hat{\theta}_{\mathrm{ML}} \triangleq \arg\min_{\theta} L(\theta), \qquad \hat{\theta}'_{\mathrm{ML}} \triangleq \arg\min_{\theta'} L'(\theta'). \tag{7}$$

In addition, we denote the Hessian matrices of $L(\theta)$ and $L'(\theta')$ as $H(\theta) \triangleq \nabla^2 L(\theta)$ and $H'(\theta') \triangleq \nabla^2 L'(\theta')$.

3. EMPIRICAL DIFFERENCE TEST

3.1. Generalized Likelihood Ratio Test

In general, a UMP solution to the composite hypothesis testing problem in (5) may not exist, and may be difficult to find even if it exists. An alternative approach is to apply the GLRT.
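Both the GLRT and the test proposed later are built from the negative log-likelihoods and maximum likelihood estimates defined in Section 2. As a concrete illustration, the sketch below instantiates them for a Gaussian location model $p(z|\theta) = \mathcal{N}(\theta, \sigma^2 I)$. This model, and all function names, are assumptions chosen for simplicity; it is not one of the models used in the paper's experiments.

```python
import math

def neg_log_likelihood(theta, samples, sigma=1.0):
    """L(theta) = -sum_i log p(z_i | theta) for z_i ~ N(theta, sigma^2 I)."""
    d = len(theta)
    const = 0.5 * d * math.log(2.0 * math.pi * sigma ** 2)
    total = 0.0
    for z in samples:
        sq = sum((zi - ti) ** 2 for zi, ti in zip(z, theta))
        total += sq / (2.0 * sigma ** 2) + const
    return total

def mle(samples):
    """Minimizer of L(theta) for this model: the coordinate-wise sample mean."""
    n = len(samples)
    d = len(samples[0])
    return [sum(z[k] for z in samples) / n for k in range(d)]

def hessian_diagonal(n, d, sigma=1.0):
    """H(theta) = (n / sigma^2) I for this model; only the diagonal is returned."""
    return [n / sigma ** 2] * d

def in_null_set(theta, theta_prime, rho):
    """True if (theta, theta') lies in chi_0, i.e. ||theta - theta'||_2 <= rho."""
    return math.dist(theta, theta_prime) <= rho

# Hypothetical pre-change and post-change datasets in R^2.
S = [[0.9, 2.1], [1.1, 1.9]]
S_prime = [[3.0, 2.0], [3.2, 2.2]]
theta_ml = mle(S)
theta_ml_prime = mle(S_prime)
```

For this model the Hessian is constant and positive definite, so the strong convexity and boundedness conditions used later hold trivially. Calling `in_null_set(theta_ml, theta_ml_prime, rho)` indicates on which side of the boundary the unconstrained MLE pair falls.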
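The generalized log-likelihood ratio (GLR) is given by

$$L_G(S, S') \triangleq \log \frac{\max_{(\theta, \theta') \in \chi_1} \prod_{i=1}^{n} p(z_i|\theta) \prod_{j=1}^{n'} p(z'_j|\theta')}{\max_{(\theta, \theta') \in \chi_0} \prod_{i=1}^{n} p(z_i|\theta) \prod_{j=1}^{n'} p(z'_j|\theta')}. \tag{8}$$

If $L_G(S, S')$ does not have point masses under either $H_0$ or $H_1$, the GLRT has the following structure:

$$\delta_{GL}(S, S') = \begin{cases} 1, & \text{if } L_G(S, S') \ge \tau, \\ 0, & \text{if } L_G(S, S') < \tau, \end{cases} \tag{9}$$

where $\tau$ is the threshold for the GLR statistic determined by the false alarm constraint $\alpha$. For conciseness, we define

$$(\hat{\theta}_1, \hat{\theta}'_1) \triangleq \arg\min_{(\theta, \theta') \in \chi_1} L(\theta) + L'(\theta'), \qquad (\hat{\theta}_0, \hat{\theta}'_0) \triangleq \arg\min_{(\theta, \theta') \in \chi_0} L(\theta) + L'(\theta'). \tag{10}$$

Then, the generalized log-likelihood ratio can be written as

$$L_G(S, S') = L(\hat{\theta}_0) + L'(\hat{\theta}'_0) - L(\hat{\theta}_1) - L'(\hat{\theta}'_1). \tag{11}$$

The main difficulty in applying the GLRT is that the minimizers $(\hat{\theta}_1, \hat{\theta}'_1)$ and $(\hat{\theta}_0, \hat{\theta}'_0)$ in (10) are hard to compute. In the following subsection, we propose an empirical difference test which approximates the GLRT and has reduced computational complexity.

3.2. Empirical Difference Test

We need the following conditions to proceed with our analysis and establish the asymptotic normality of the MLEs [9].

Assumption 1 (Regularity conditions for MLE).
1. Smoothness: $L(\theta)$ and $L'(\theta)$ have first, second and third derivatives for all $\theta$.
2. Strong convexity: For all $\theta$, $H(\theta)$ and $H'(\theta)$ are positive definite and invertible.
3. Boundedness: For all $\theta$, the largest eigenvalues of $H(\theta)$ and $H'(\theta)$ are upper bounded by $\lambda_M$.

We note that the MLEs $(\hat{\theta}_{\mathrm{ML}}, \hat{\theta}'_{\mathrm{ML}})$ belong to either $\chi_0$ or $\chi_1$. If $(\hat{\theta}_{\mathrm{ML}}, \hat{\theta}'_{\mathrm{ML}}) \in \chi_1$, i.e., $(\hat{\theta}_1, \hat{\theta}'_1) = (\hat{\theta}_{\mathrm{ML}}, \hat{\theta}'_{\mathrm{ML}})$, we have

$$L_G(S, S') = L(\hat{\theta}_0) - L(\hat{\theta}_{\mathrm{ML}}) + L'(\hat{\theta}'_0) - L'(\hat{\theta}'_{\mathrm{ML}}) > 0.$$

In addition, the worst-case false alarm probability of the GLRT is given by $\max_{(\theta, \theta') \in \chi_0} P_{(\theta, \theta')}\{L_G(S, S') \ge \tau\}$, which we wish to upper bound by $\alpha$. Note that $L_G(S, S') > 0$ when $(\hat{\theta}_{\mathrm{ML}}, \hat{\theta}'_{\mathrm{ML}}) \in \chi_1$ holds. In the following, we focus on the case where $\alpha < \max_{(\theta, \theta') \in \chi_0} P_{(\theta, \theta')}\{L_G(S, S') > 0\}$, i.e., a relatively small false alarm constraint $\alpha$. Thus, we just need to study the false alarm probability of the GLRT when $(\hat{\theta}_{\mathrm{ML}}, \hat{\theta}'_{\mathrm{ML}}) \in \chi_1$ and $\tau > 0$.

Given $(\hat{\theta}_{\mathrm{ML}}, \hat{\theta}'_{\mathrm{ML}}) \in \chi_1$, it is difficult to solve for $(\hat{\theta}_0, \hat{\theta}'_0)$ in (10) exactly. However, we can construct an upper bound for the GLR by approximating $(\hat{\theta}_0, \hat{\theta}'_0)$ using a linear combination of $(\hat{\theta}_{\mathrm{ML}}, \hat{\theta}'_{\mathrm{ML}})$. Let $\Delta\hat{\theta} \triangleq \hat{\theta}'_{\mathrm{ML}} - \hat{\theta}_{\mathrm{ML}}$. Then $\|\Delta\hat{\theta}\|_2 > \rho$, and we define

$$\tilde{\theta}_0 = \hat{\theta}_{\mathrm{ML}} + \mu \frac{\Delta\hat{\theta}}{\|\Delta\hat{\theta}\|_2}, \qquad \tilde{\theta}'_0 = \hat{\theta}_{\mathrm{ML}} + (\mu + \rho) \frac{\Delta\hat{\theta}}{\|\Delta\hat{\theta}\|_2}, \tag{12}$$

where $\mu \in [0, \|\Delta\hat{\theta}\|_2 - \rho]$ denotes the distance between $\tilde{\theta}_0$ and $\hat{\theta}_{\mathrm{ML}}$. It can be verified that $(\tilde{\theta}_0, \tilde{\theta}'_0) \in \chi_0$, since $\|\tilde{\theta}'_0 - \tilde{\theta}_0\|_2 = \rho$. Then, the GLR in (11) can be upper bounded as

$$\begin{aligned}
L_G(S, S') &= L(\hat{\theta}_0) + L'(\hat{\theta}'_0) - L(\hat{\theta}_{\mathrm{ML}}) - L'(\hat{\theta}'_{\mathrm{ML}}) \\
&\le L(\tilde{\theta}_0) + L'(\tilde{\theta}'_0) - L(\hat{\theta}_{\mathrm{ML}}) - L'(\hat{\theta}'_{\mathrm{ML}}) \\
&\overset{(a)}{=} \tfrac{1}{2} (\hat{\theta}_{\mathrm{ML}} - \tilde{\theta}_0)^\top H(\bar{\theta}) (\hat{\theta}_{\mathrm{ML}} - \tilde{\theta}_0) + \tfrac{1}{2} (\hat{\theta}'_{\mathrm{ML}} - \tilde{\theta}'_0)^\top H'(\bar{\theta}') (\hat{\theta}'_{\mathrm{ML}} - \tilde{\theta}'_0) \\
&= \frac{\mu^2}{2} \frac{\Delta\hat{\theta}^\top H(\bar{\theta}) \Delta\hat{\theta}}{\|\Delta\hat{\theta}\|_2^2} + \frac{(\|\Delta\hat{\theta}\|_2 - (\mu + \rho))^2}{2} \frac{\Delta\hat{\theta}^\top H'(\bar{\theta}') \Delta\hat{\theta}}{\|\Delta\hat{\theta}\|_2^2} \\
&\overset{(b)}{\le} \frac{\mu^2}{2} \lambda_{\max}(H(\bar{\theta})) + \frac{(\|\Delta\hat{\theta}\|_2 - (\mu + \rho))^2}{2} \lambda_{\max}(H'(\bar{\theta}')),
\end{aligned} \tag{13}$$

where (a) follows from Taylor's theorem (the first-order terms vanish because the gradients are zero at the unconstrained MLEs), with $\bar{\theta}$ and $\bar{\theta}'$ denoting the parameters in the corresponding remainders; and (b) follows from the fact that $H(\bar{\theta})$ and $H'(\bar{\theta}')$ are positive definite and $\Delta\hat{\theta}/\|\Delta\hat{\theta}\|_2$ is a unit vector. Note that $\lambda_{\max}(H(\bar{\theta}))$ and $\lambda_{\max}(H'(\bar{\theta}'))$ are bounded by $\lambda_M$ in Assumption 1. Hence,

$$\begin{aligned}
P_F(\delta_{GL}) &= P_{(\theta, \theta')}\{L_G(S, S') \ge \tau\} \\
&\le P_{(\theta, \theta')}\left\{\frac{\mu^2}{2} \lambda_M + \frac{(\|\Delta\hat{\theta}\|_2 - (\mu + \rho))^2}{2} \lambda_M \ge \tau\right\} \\
&= P_{(\theta, \theta')}\{\|\Delta\hat{\theta}\|_2 \ge \eta\},
\end{aligned} \tag{14}$$

for $(\theta, \theta') \in \chi_0$. That is, the false alarm probability of the GLRT can be upper bounded by the probability that the empirical difference $\|\Delta\hat{\theta}\|_2$ is larger than another threshold $\eta$. Note that the threshold $\eta$ can be set by requiring $P_{(\theta, \theta')}\{\|\Delta\hat{\theta}\|_2 \ge \eta\} \le \alpha$ for all $(\theta, \theta') \in \chi_0$, which is independent of the unknown quantities $\mu$ and $\lambda_M$. Thus, we propose the following empirical difference test to approximate the GLRT:

$$\delta_{ED} = \begin{cases} 1, & \text{if } \|\Delta\hat{\theta}\|_2 \ge \eta, \\ 0, & \text{if } \|\Delta\hat{\theta}\|_2 < \eta. \end{cases} \tag{15}$$

The benefits of using $\delta_{ED}$ are two-fold: 1) Instead of constructing the more complicated GLR statistic, the EDT only requires the computation of the empirical difference $\Delta\hat{\theta}$ between the MLEs, which is more tractable in practice. 2) The distribution of the empirical difference $\Delta\hat{\theta}$ is asymptotically Gaussian, which facilitates the setting of the threshold $\eta$ to meet the false alarm constraint $\alpha$.

4. APPROXIMATION FOR SETTING TEST THRESHOLD

In this section, we provide a method based on a $\chi^2$ approximation [10] to set the threshold $\eta$ in the EDT.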
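As a reference point for the threshold discussion, the decision rule of Section 3.2 and the feasible null pair used in its derivation can be sketched as follows. This is a minimal illustration; the function names are hypothetical, and parameters are represented as plain Python lists.

```python
import math

def empirical_difference(theta_ml, theta_ml_prime):
    """EDT statistic: the Euclidean norm of theta_ml_prime - theta_ml."""
    return math.dist(theta_ml_prime, theta_ml)

def edt_decision(theta_ml, theta_ml_prime, eta):
    """Return 1 (significant change) if the statistic reaches eta, else 0."""
    return 1 if empirical_difference(theta_ml, theta_ml_prime) >= eta else 0

def feasible_null_pair(theta_ml, theta_ml_prime, rho, mu):
    """Pair on the segment between the two MLEs whose distance is exactly rho.

    The first point sits at distance mu from theta_ml and the second at
    distance mu + rho, both along the unit vector toward theta_ml_prime.
    """
    norm = empirical_difference(theta_ml, theta_ml_prime)
    if norm <= rho:
        raise ValueError("the construction assumes the MLEs differ by more than rho")
    if not 0.0 <= mu <= norm - rho:
        raise ValueError("mu must lie in [0, ||difference|| - rho]")
    unit = [(b - a) / norm for a, b in zip(theta_ml, theta_ml_prime)]
    tilde = [a + mu * u for a, u in zip(theta_ml, unit)]
    tilde_prime = [a + (mu + rho) * u for a, u in zip(theta_ml, unit)]
    return tilde, tilde_prime
```

With `theta_ml = [0.0, 0.0]` and `theta_ml_prime = [3.0, 4.0]`, the statistic is 5, so the test declares a change for any `eta` up to 5. The pair returned by `feasible_null_pair` lies in the null set because its two points are exactly `rho` apart.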
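Since $\hat{\theta}_{\mathrm{ML}}$ and $\hat{\theta}'_{\mathrm{ML}}$ are the MLEs of $\theta$ and $\theta'$ with $n$ and $n'$ samples, respectively, we have

$$\sqrt{n}\,(\hat{\theta}_{\mathrm{ML}} - \theta) \xrightarrow{d} \mathcal{N}(0, I_{\theta}^{-1}), \qquad \sqrt{n'}\,(\hat{\theta}'_{\mathrm{ML}} - \theta') \xrightarrow{d} \mathcal{N}(0, I_{\theta'}^{-1}),$$

from the asymptotic normality of the MLE [9], where $I_{\theta}$ denotes the Fisher information matrix of the probabilistic model $p(z|\theta)$. Thus, we can approximate the distribution of $\Delta\hat{\theta}$ using a Gaussian distribution $\mathcal{N}(\Delta\theta, \Sigma_{\Delta})$, where $\Delta\theta \triangleq \theta' - \theta$ and $\Sigma_{\Delta} \triangleq \frac{1}{n} I_{\theta}^{-1} + \frac{1}{n'} I_{\theta'}^{-1}$. In practice, $I_{\theta}$ and $I_{\theta'}$ can be estimated by replacing $\theta$ and $\theta'$ with the corresponding MLEs $\hat{\theta}_{\mathrm{ML}}$ and $\hat{\theta}'_{\mathrm{ML}}$, respectively.

To satisfy the false alarm constraint in (5), we need to set the threshold $\eta_{\alpha}$ in the EDT based on the following equation:

$$\max_{(\theta, \theta') \in \chi_0} P_{(\theta, \theta')}\{\|\Delta\hat{\theta}\|_2 \ge \eta_{\alpha}\} = \alpha. \tag{16}$$

The following theorem characterizes the distribution of $\|\Delta\hat{\theta}\|_2^2$ that results from the Gaussian approximation.

Theorem 1. Suppose $\Delta\hat{\theta} \sim \mathcal{N}(\Delta\theta, \Sigma_{\Delta})$, and the covariance matrix $\Sigma_{\Delta}$ has the eigen-decomposition $\Sigma_{\Delta} = P \Lambda P^\top$, where $\Lambda = \mathrm{diag}(\lambda_1, \dots, \lambda_d)$ contains all the eigenvalues, and $P$ is an orthogonal matrix. Then,

$$\|\Delta\hat{\theta}\|_2^2 \overset{d}{=} \sum_{i=1}^{d} \lambda_i (U_i + b_i)^2, \tag{17}$$

where the $U_i \sim \mathcal{N}(0, 1)$ are independent, and $b = \Lambda^{-1/2} P^\top \Delta\theta$.

The distribution of $\|\Delta\hat{\theta}\|_2^2$ is thus a linear combination of independent non-central chi-squared random variables with one degree of freedom each, which does not have a simple closed form [11]. We therefore propose the following approximation method to set the threshold in the EDT. Note that

$$P_F(\delta_{ED}) = P_{(\theta, \theta')}\left\{\sum_{i=1}^{d} \lambda_i (U_i + b_i)^2 \ge \eta^2\right\} \le P_{(\theta, \theta')}\left\{\sum_{i=1}^{d} (U_i + b_i)^2 \ge \eta^2 / \lambda_{\max}(\Sigma_{\Delta})\right\}, \tag{18}$$

for $(\theta, \theta') \in \chi_0$, and $\sum_{i=1}^{d} (U_i + b_i)^2$ is a non-central chi-squared $\chi^2(k, \gamma)$ random variable with $k = d$ degrees of freedom and non-centrality parameter $\gamma = \sum_{i=1}^{d} b_i^2 \le \rho^2 / \lambda_{\min}(\Sigma_{\Delta})$, where the inequality follows from the fact that $\|\Delta\theta\|_2 \le \rho$ under $H_0$. Thus,

$$\max_{(\theta, \theta') \in \chi_0} P_{(\theta, \theta')}\{\|\Delta\hat{\theta}\|_2 \ge \eta\} \le \max_{(\theta, \theta') \in \chi_0} P\left\{\chi^2\Big(d, \sum_{i=1}^{d} b_i^2\Big) \ge \eta^2 / \lambda_{\max}(\Sigma_{\Delta})\right\}. \tag{19}$$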
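The bound above replaces the weighted sum by a single non-central chi-squared variable, so a conservative threshold follows from one tail-probability inversion at the largest admissible non-centrality $\rho^2/\lambda_{\min}(\Sigma_{\Delta})$. The sketch below does this with the standard library only, writing the non-central tail as a Poisson mixture of central chi-squared tails and inverting it by bisection. The function names are hypothetical, and the series used is only intended for moderate arguments (roughly, tail arguments and non-centralities in the low hundreds at most).

```python
import math

def gamma_p(a, x):
    """Regularized lower incomplete gamma P(a, x), computed by its power series."""
    if x <= 0.0:
        return 0.0
    term = math.exp(a * math.log(x) - x - math.lgamma(a + 1.0))
    total = term
    k = 0
    while True:
        k += 1
        term *= x / (a + k)
        total += term
        # Stop once the terms are decreasing and negligible.
        if a + k > x and term <= 1e-16 * total:
            break
    return min(total, 1.0)

def ncx2_sf(x, d, nc):
    """P{chi2(d, nc) >= x} as a Poisson(nc / 2) mixture of central chi2 tails."""
    if x <= 0.0:
        return 1.0
    half = nc / 2.0
    weight = math.exp(-half)  # Poisson probability of j = 0
    covered, tail, j = 0.0, 0.0, 0
    while covered < 1.0 - 1e-13 and j < 10000:
        tail += weight * (1.0 - gamma_p(d / 2.0 + j, x / 2.0))
        covered += weight
        j += 1
        weight *= half / j
    return tail

def chi2_threshold(alpha, d, rho, lam_min, lam_max):
    """eta solving P{chi2(d, rho^2 / lam_min) >= eta^2 / lam_max} = alpha."""
    nc = rho ** 2 / lam_min
    lo, hi = 0.0, 1.0
    while ncx2_sf(hi, d, nc) > alpha:  # grow the bracket until the tail is below alpha
        hi *= 2.0
    for _ in range(100):  # bisection on the decreasing tail function
        mid = 0.5 * (lo + hi)
        if ncx2_sf(mid, d, nc) > alpha:
            lo = mid
        else:
            hi = mid
    return math.sqrt(lam_max * hi)
```

Here `lam_min` and `lam_max` stand for the extreme eigenvalues of the estimated covariance $\Sigma_{\Delta}$ and are taken as inputs. Where SciPy is available, `scipy.stats.ncx2.isf` can replace the hand-written tail inversion.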

We can set the threshold $\tilde{\eta}_{\alpha}$ with the $\chi^2$ approximation [10] using the following equation,

$$P\left\{\chi^2\big(d, \rho^2 / \lambda_{\min}(\Sigma_{\Delta})\big) \ge \tilde{\eta}_{\alpha}^2 / \lambda_{\max}(\Sigma_{\Delta})\right\} = \alpha, \tag{20}$$

to ensure that the false alarm probability is bounded by $\alpha$.

5. NUMERICAL RESULTS

In this section, we evaluate the performance of the proposed empirical difference test $\delta_{ED}$ in linear regression and logistic regression models.

Linear regression model: The datasets $S$ and $S'$ are generated from the linear model $y = X^\top \theta + \xi$, where $X \in \mathbb{R}^{d \times n}$ denotes the input variables, $y \in \mathbb{R}^{n}$ denotes the response variables and $\theta \in \mathbb{R}^{d}$ denotes the weight vector. We assume that all the elements of the noise $\xi \in \mathbb{R}^{n}$ are i.i.d. zero-mean Gaussian random variables generated from $\mathcal{N}(0, \sigma^2)$. Then, the Fisher information matrix $I_{\theta} = X X^\top / \sigma^2$ is independent of $\theta$. In the simulations, we set the dimension $d = $ […]0, the number of samples $n = n' = 40$, $\sigma = $ […] and $\rho = $ […].

Logistic regression model: The datasets $S$ and $S'$ are generated from the following logistic model

$$p(y_i | x_i, \theta) = \frac{1}{1 + \exp(-y_i \theta^\top x_i)}, \quad (x_i, y_i) \in S, \tag{21}$$

where $x_i \in \mathbb{R}^{d}$ denotes the feature vector, $y_i \in \{\pm 1\}$ denotes the label, and $\theta \in \mathbb{R}^{d}$ with $\|\theta\|_2 = 1$ is the normalized model parameter vector. Then, the Fisher information matrix is

$$I_{\theta} = \mathbb{E}_{x}\left[\frac{1}{1 + \exp(\theta^\top x_i)} \cdot \frac{1}{1 + \exp(-\theta^\top x_i)}\, x_i x_i^\top\right]. \tag{22}$$

In the simulations, we choose the dimension $d = 5$, the number of samples $n = n' = 60$, and set $\rho$ such that the angle between $\theta$ and $\theta'$ is $\pi/4$.

Fig. 1. Comparison of the performance of the GLRT and EDT for the linear regression model.

Fig. 2. Comparison of the performance of EDT with the threshold $\eta_{\alpha}$ and the $\chi^2$ approximation $\tilde{\eta}_{\alpha}$, for the linear regression model with $\alpha = 0.$[…].

Fig. 3. Comparison of the performance of EDT with the threshold $\eta_{\alpha}$ and the $\chi^2$ approximation $\tilde{\eta}_{\alpha}$, for the logistic regression model with $\alpha = 0.$[…].

To illustrate the performance of the proposed algorithms, we plot the probability $P_{(\theta, \theta')}\{\delta = 1\}$ as a function of $\|\theta - \theta'\|_2$ in all three figures, where the normalized model change $\|\theta - \theta'\|_2 / \rho$ ranges from 0 to […]. Note that when $\|\theta - \theta'\|_2 < \rho$, i.e., $(\theta, \theta') \in \chi_0$, $P_{(\theta, \theta')}\{\delta = 1\}$ denotes the false alarm probability $P_F(\delta)$ (the left side of the figures). In contrast, when $\|\theta - \theta'\|_2 > \rho$, $(\theta, \theta') \in \chi_1$ and $P_{(\theta, \theta')}\{\delta = 1\}$ denotes the detection probability $P_D(\delta)$ (the right side of the figures). Thus, the plot of $P_{(\theta, \theta')}\{\delta = 1\}$ illustrates the test performance under both hypotheses with different model parameters.

To verify the approximation of the GLRT by the proposed EDT, we first compare the performance of these tests for the linear regression model (the GLRT is not computationally feasible for logistic regression) for two values of the false alarm constraint, $\alpha = 0.$[…] and $\alpha = 0.3$. The thresholds $\eta_{\alpha}$ of these tests are set using […]000 runs of Monte-Carlo simulations such that the false alarm probabilities are equal to $\alpha$ as in (16). Fig. 1 shows that the difference between the performance of the EDT and that of the GLRT is negligible with only $n = n' = 40$ samples, which justifies the use of the EDT. We note that when $\|\theta - \theta'\|_2 / \rho = 1$, it is impossible to distinguish $H_0$ and $H_1$ even if the numbers of samples $n$ and $n'$ go to infinity, i.e., the probabilities of false alarm and detection are both equal to $\alpha$ in this case.

Fig. 2 and Fig. 3 compare the performance of the EDT with the threshold $\eta_{\alpha}$ computed by […]000 runs of Monte-Carlo simulations as in (16), and the threshold $\tilde{\eta}_{\alpha}$ set by the proposed $\chi^2$ approximation in (20), for linear regression and logistic regression, respectively, when $\alpha = 0.$[…]. It can be observed that in both cases, the non-central chi-squared approximation in (20) provides conservative estimates of the test threshold $\eta$, thereby ensuring that the false alarm constraint is met.
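The Monte-Carlo calibration and the plotted probability of deciding "change" can be mimicked with a small simulation. The sketch below works directly with the Gaussian approximation of the empirical difference, using a diagonal covariance and a single boundary point for calibration. These choices, the numerical values, and the function names are illustrative assumptions and do not reproduce the paper's experiments.

```python
import math
import random

def sample_diff_norm(shift, variances, rng):
    """Draw the norm of a Gaussian vector with mean `shift` and diagonal covariance."""
    total = 0.0
    for m, v in zip(shift, variances):
        total += (m + math.sqrt(v) * rng.gauss(0.0, 1.0)) ** 2
    return math.sqrt(total)

def monte_carlo_threshold(alpha, boundary_shift, variances, runs, rng):
    """Empirical (1 - alpha) quantile of the statistic at a boundary point of the null set."""
    norms = sorted(sample_diff_norm(boundary_shift, variances, rng) for _ in range(runs))
    index = min(runs - 1, math.ceil((1.0 - alpha) * runs))
    return norms[index]

def decision_rate(shift, variances, eta, runs, rng):
    """Fraction of runs in which the test decides 1, i.e. the statistic reaches eta."""
    hits = sum(sample_diff_norm(shift, variances, rng) >= eta for _ in range(runs))
    return hits / runs

# Hypothetical setup: d = 3, diagonal covariance, shift along the first axis.
rng = random.Random(0)
rho, alpha, runs = 1.0, 0.1, 2000
variances = [0.05, 0.1, 0.2]
direction = [1.0, 0.0, 0.0]

eta = monte_carlo_threshold(alpha, [rho * u for u in direction], variances, runs, rng)
curve = {
    ratio: decision_rate([ratio * rho * u for u in direction], variances, eta, runs, rng)
    for ratio in (0.0, 0.5, 1.0, 1.5, 2.0)
}
```

The dictionary `curve` plays the role of one plotted line: entries with ratio below 1 estimate false alarm probabilities, entries above 1 estimate detection probabilities, and the entry at ratio 1 should be close to `alpha` by construction.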

6. REFERENCES

[1] C. Wilson, V. V. Veeravalli, and A. Nedich, "Adaptive sequential stochastic optimization," IEEE Transactions on Automatic Control, 2018.
[2] C. Wilson and V. V. Veeravalli, "Adaptive sequential optimization with applications to machine learning," in Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing, 2016.
[3] Y. Bu, J. Lu, and V. V. Veeravalli, "Active and adaptive sequential learning," arXiv preprint, 2018.
[4] M. Elahi, F. Ricci, and N. Rubens, "A survey of active learning in collaborative filtering recommender systems," Computer Science Review, 2016.
[5] N. Rubens, M. Elahi, M. Sugiyama, and D. Kaplan, "Active learning in recommender systems," in Recommender Systems Handbook, Springer, 2015.
[6] S. J. Pan and Q. Yang, "A survey on transfer learning," IEEE Transactions on Knowledge and Data Engineering, 2010.
[7] J. Geng, B. Zhang, L. M. Huie, and L. Lai, "Online change detection of linear regression models," in Acoustics, Speech and Signal Processing (ICASSP), 2016 IEEE International Conference on, IEEE, 2016.
[8] S. Zou, G. Fellouris, and V. V. Veeravalli, "Quickest change detection under transient dynamics: Theory and asymptotic analysis," arXiv preprint, 2017.
[9] A. W. van der Vaart, Asymptotic Statistics, Cambridge Series in Statistical and Probabilistic Mathematics, Cambridge University Press, 2000.
[10] A. DasGupta, Asymptotic Theory of Statistics and Probability, Springer, 2008.
[11] S. J. Press, "Linear combinations of non-central chi-square variates," The Annals of Mathematical Statistics, 1966.
