Higher-Order Markov Chain Models for Categorical Data Sequences*

Wai Ki Ching, Eric S. Fung, Michael K. Ng

Department of Mathematics, The University of Hong Kong, Pokfulam Road, Hong Kong, People's Republic of China

Received 12 March 2002; revised 21 October 2003; accepted 28 January 2004

Abstract: In this paper we study higher-order Markov chain models for analyzing categorical data sequences. We propose an efficient estimation method for the model parameters. Data sequences such as DNA and sales demand are used to illustrate the predicting power of our proposed models. In particular, we apply the developed higher-order Markov chain model to server log data. The objective here is to model the users' behavior in accessing information and to predict their behavior in the future. Our tests are based on a realistic web log, and our model shows an improvement in prediction. © 2004 Wiley Periodicals, Inc. Naval Research Logistics 51, 2004.

Keywords: higher-order Markov model; categorical data; linear programming

1. INTRODUCTION

Data sequences (or time series) occur frequently in many real-world applications. The most important step in analyzing a data sequence (or time series) is the selection of an appropriate mathematical model for the data, because it helps in predictions, hypothesis testing, and rule discovery. A data sequence X can be logically represented as a vector $(X_1, X_2, \ldots, X_T)$, where T is the length of the sequence and each $X_i \in \mathrm{DOM}(A)$ $(1 \le i \le T)$ is associated with a defined semantic and a data type. In this paper, we consider two general data types, numeric and categorical, and assume that other types used can be mapped to one of these two types. The domains of attributes associated with these two types are called numeric and categorical, respectively. A numeric domain consists of real numbers. A domain DOM(A) is defined as categorical if it is finite and unordered, e.g., for any $a, b \in \mathrm{DOM}(A)$, either $a = b$ or $a \ne b$ (see, e.g., [8]). Numerical data sequences have been studied in detail (see, e.g., [5]).
Mathematical tools such as the Fourier transform and spectral analysis are employed frequently in the analysis of numerical data sequences. Different time sequence models have been proposed and developed in the literature [5].

* The research of this project is supported by RGC Grant Nos. HKU 7130/02P, HKU 7126/02P, and 7046/03P and by HKU CRCG Grant Nos. […]. Correspondence to: M. K. Ng (kkpog@hkusua.hku.hk).

For categorical data sequences, there are many situations in which one would like to employ higher-order Markov chain models as a mathematical tool (see, e.g., [2, 11, 13-15]). A number of applications can be found in the literature [9, 14, 16, 18]. For example, in sales demand prediction, products are classified into several states such as very high sales volume, high sales volume, standard, low sales volume, and very low sales volume (categorical type: ordinal data). A higher-order Markov chain model is then used to fit the observed data; such models have also been applied in wind turbine design. Alignment of sequences (categorical type: nominal data) is an important topic in DNA sequence analysis [18]. It involves searching for patterns in a DNA sequence of huge size. In these applications and many others, one would like to (i) characterize categorical data sequences for the purpose of comparison and classification, or (ii) model categorical data sequences and hence make predictions in the control and planning process. It has been shown that higher-order Markov chain models can be a promising approach for these purposes [9, 15, 16, 18].

For simplicity of discussion, in the following we assume that each data point $X_t$ in a categorical data sequence takes values in $\{1, 2, \ldots, m\}$, where m is finite, i.e., it has m possible categories or states. The conventional model for an nth-order Markov chain has $(m-1)m^n$ model parameters. The major problem in using this kind of model is that the number of parameters (the transition probabilities) increases exponentially with respect to the order of the model. This large number of parameters discourages people from using a higher-order Markov chain directly. In [15], Raftery proposed a higher-order Markov chain model which involves only one additional parameter for each extra lag. The model can be written as follows:

$$P(X_t = k_0 \mid X_{t-1} = k_1, \ldots, X_{t-n} = k_n) = \sum_{i=1}^{n} \lambda_i q_{k_0 k_i}, \tag{1}$$

where

$$\sum_{i=1}^{n} \lambda_i = 1$$

and $Q = [q_{ij}]$ is a transition matrix with column sums equal to one, such that

$$0 \le \sum_{i=1}^{n} \lambda_i q_{k_0 k_i} \le 1, \quad k_0, k_i \in \{1, 2, \ldots, m\}. \tag{2}$$

The constraint in (2) guarantees that the right-hand side of (1) is a probability. The total number of independent parameters in this model is $n + m^2$. Raftery proved that (1) is analogous to the standard AR(n) model in the sense that each additional lag, after the first, is specified by a single parameter and the autocorrelations satisfy a system of linear equations similar to the Yule-Walker equations. Moreover, the parameters $q_{k_0 k_i}$ and $\lambda_i$ can be estimated numerically by maximizing the log-likelihood of (1) subject to the constraints (2). However, this approach involves solving a highly nonlinear optimization problem (a coded program for solving the maximum log-likelihood problem can be found at mtd). The proposed method guarantees neither convergence nor a global maximum.

The main contribution of this paper is to generalize the Raftery model by allowing Q to vary with different lags. Numerical examples are given to demonstrate that our generalized model has better prediction power than the Raftery model does. This means that our model is not overparameterized in general. We also develop an efficient method to estimate the model parameters.

The rest of the paper is organized as follows. In Section 2, we propose our higher-order Markov chain models and discuss some properties of the proposed model. In Section 3, we propose an estimation method for the model parameters required in our higher-order Markov chain model. In Section 4, numerical examples on a DNA sequence and sales demand data are given to demonstrate the predicting power of our model. In Section 5, we apply our higher-order Markov chain models to a real data set for web prediction. Finally, concluding remarks are given in Section 6.

2. HIGHER-ORDER MARKOV CHAIN MODELS

In this section we extend the Raftery model [15] to a more general higher-order Markov model by allowing Q to vary with different lags. Here we assume that the weight $\lambda_i$ is non-negative such that

$$\sum_{i=1}^{n} \lambda_i = 1. \tag{3}$$

We first notice that (1) can be rewritten as

$$X_{t+n+1} = \sum_{i=1}^{n} \lambda_i Q X_{t+n+1-i}, \tag{4}$$

where $X_{t+n+1-i}$ is the probability distribution of the states at time $(t+n+1-i)$. Using (3) and the fact that Q is a transition probability matrix, we note that each entry of $X_{t+n+1}$ is between 0 and 1, and the sum of all entries is equal to 1. We remark that the Raftery model does not assume that $\lambda_i$ is non-negative, and therefore the additional constraints (2) must be added to guarantee that $X_{t+n+1}$ is a probability distribution of the states.

The Raftery model in (4) can be generalized as follows:

$$X_{t+n+1} = \sum_{i=1}^{n} \lambda_i Q_i X_{t+n+1-i}. \tag{5}$$

The total number of independent parameters in the new model is $n + nm^2$. We note that if $Q_1 = Q_2 = \cdots = Q_n$, then (5) is just the Raftery model in (4).
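To make the recursion (5) concrete, the following is a minimal sketch of one update step, using only the Python standard library and exact fractions. The 2-state matrices `Q1`, `Q2` and the weights `lams` are hypothetical values chosen for illustration (they are not taken from the paper), and the function names are likewise assumptions.

```python
from fractions import Fraction

def mat_vec(Q, x):
    """Multiply an m-by-m matrix (given as a list of rows) by a length-m vector."""
    return [sum(q * xi for q, xi in zip(row, x)) for row in Q]

def next_distribution(lams, Qs, history):
    """One step of model (5): X_{t+n+1} = sum_i lambda_i Q_i X_{t+n+1-i}.

    lams    : weights lambda_1..lambda_n (non-negative, summing to 1)
    Qs      : matrices Q_1..Q_n, each with column sums equal to 1
    history : the n most recent state distributions, most recent first
    """
    m = len(history[0])
    out = [Fraction(0)] * m
    for lam, Q, x in zip(lams, Qs, history):
        out = [o + lam * v for o, v in zip(out, mat_vec(Q, x))]
    return out

# Hypothetical second-order example with two states (illustrative values only).
Q1 = [[Fraction(9, 10), Fraction(2, 10)],
      [Fraction(1, 10), Fraction(8, 10)]]
Q2 = [[Fraction(5, 10), Fraction(3, 10)],
      [Fraction(5, 10), Fraction(7, 10)]]
lams = [Fraction(3, 5), Fraction(2, 5)]

x_now = [Fraction(1), Fraction(0)]   # state 1 observed at the current time
x_prev = [Fraction(0), Fraction(1)]  # state 2 observed one step earlier
x_next = next_distribution(lams, [Q1, Q2], [x_now, x_prev])
print(x_next)  # [Fraction(33, 50), Fraction(17, 50)]
```

Because every $Q_i$ has unit column sums and the weights sum to one, the output is again a probability vector, which is the observation made after (4).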

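In our model we assume that $X_{t+n+1}$ depends on $X_{t+i}$ $(i = 1, 2, \ldots, n)$ via the matrix $Q_i$ and the weight $\lambda_i$. One may relate $Q_i$ to the ith step transition matrix of the process, and we will use this idea to estimate $Q_i$. Here we assume that each $Q_i$ is a non-negative stochastic matrix with column sums equal to 1. Before we present our estimation method for the model parameters, we first discuss some properties of our proposed model in the following proposition.

PROPOSITION 1: If $Q_n$ is irreducible and $\lambda_n > 0$ such that $0 \le \lambda_i \le 1$ and $\sum_{i=1}^{n} \lambda_i = 1$, then the model in (5) has a stationary distribution $\bar{X}$ when $t \to \infty$, independent of the initial state vectors $X_0, X_1, \ldots, X_{n-1}$. The stationary distribution $\bar{X}$ is also the unique solution of the following linear system of equations:

$$\Big(I - \sum_{i=1}^{n} \lambda_i Q_i\Big)\bar{X} = 0 \quad\text{and}\quad \mathbf{1}^T \bar{X} = 1.$$

Here I is the m-by-m identity matrix (m is the number of possible states taken by each data point) and $\mathbf{1}$ is an m-vector of ones.

PROOF: We first note that if $\lambda_n = 0$, then this is not an nth-order Markov chain. Therefore, $\lambda_n > 0$ is a reasonable assumption. Secondly, if $Q_n$ is not irreducible, then we consider the case that $\lambda_n = 1$, and in this case clearly there is no unique stationary distribution for the system. Therefore, irreducibility of $Q_n$ is a necessary condition for the existence of a unique stationary distribution. Now we let

$$Y_{t+n+1} = (X_{t+n+1}, X_{t+n}, \ldots, X_{t+2})^T$$

be an nm-by-1 vector. Then one may write

$$Y_{n+1} = R Y_n,$$

where

$$R = \begin{pmatrix} \lambda_1 Q_1 & \lambda_2 Q_2 & \cdots & \lambda_{n-1} Q_{n-1} & \lambda_n Q_n \\ I & 0 & \cdots & 0 & 0 \\ 0 & I & \cdots & 0 & 0 \\ \vdots & & \ddots & & \vdots \\ 0 & \cdots & 0 & I & 0 \end{pmatrix} \tag{6}$$

is an nm-by-nm square matrix. We then define

$$\tilde{R} = \begin{pmatrix} \lambda_1 Q_1 & I & 0 & \cdots & 0 \\ \lambda_2 Q_2 & 0 & I & & \vdots \\ \vdots & \vdots & & \ddots & 0 \\ \lambda_{n-1} Q_{n-1} & 0 & \cdots & 0 & I \\ \lambda_n Q_n & 0 & \cdots & 0 & 0 \end{pmatrix}. \tag{7}$$

We note that $R$ and $\tilde{R}$ have the same characteristic polynomial in $\lambda$:

$$\det\Big[(-1)^{n-1}\Big((\lambda_1 Q_1 - \lambda I)\lambda^{n-1} + \sum_{i=2}^{n} \lambda_i Q_i \lambda^{n-i}\Big)\Big].$$

Thus $R$ and $\tilde{R}$ have the same set of eigenvalues. It is clear that $\tilde{R}$ is an irreducible stochastic matrix with column sums equal to 1. Then from the Perron-Frobenius theorem [4, p. 134], all the eigenvalues of $\tilde{R}$ (or equivalently $R$) lie in the interval (0, 1] and there is exactly one eigenvalue equal to one. This implies that

$$\lim_{t \to \infty} R^t = V U^T$$

is a positive rank-one matrix, as $\tilde{R}$ is irreducible. Therefore, we have

$$\lim_{t \to \infty} Y_{t+n+1} = \lim_{t \to \infty} R^t Y_{n+1} = V U^T Y_{n+1} = \alpha V.$$

Here $\alpha$ is a positive number because $Y_{n+1} \ne 0$ and is non-negative. This implies that $X_t$ also tends to a stationary distribution as t goes to infinity. Hence we have

$$\lim_{t \to \infty} X_{t+n+1} = \lim_{t \to \infty} \sum_{i=1}^{n} \lambda_i Q_i X_{t+n+1-i},$$

and therefore we have

$$\bar{X} = \sum_{i=1}^{n} \lambda_i Q_i \bar{X}.$$

The stationary distribution vector $\bar{X}$ satisfies

$$\Big(I - \sum_{i=1}^{n} \lambda_i Q_i\Big)\bar{X} = 0 \quad\text{with}\quad \mathbf{1}^T \bar{X} = 1. \tag{8}$$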
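The system (8) can be solved directly for small models. The sketch below is one possible way to do so, not the authors' implementation: it replaces the last (redundant) equation with the normalization constraint and applies Gauss-Jordan elimination in exact rational arithmetic. The matrices and weights are hypothetical 2-state values chosen for illustration, and `stationary_distribution` is an assumed helper name.

```python
from fractions import Fraction

def stationary_distribution(lams, Qs):
    """Solve (I - sum_i lambda_i Q_i) X = 0 together with 1^T X = 1.

    The last equation of the singular system is replaced by the
    normalization row; the square system is then solved exactly.
    """
    m = len(Qs[0])
    # A = I - sum_i lambda_i Q_i
    A = [[Fraction(int(r == c)) - sum(l * Q[r][c] for l, Q in zip(lams, Qs))
          for c in range(m)] for r in range(m)]
    b = [Fraction(0)] * m
    A[-1] = [Fraction(1)] * m   # normalization row: 1^T X = 1
    b[-1] = Fraction(1)
    # Gauss-Jordan elimination with row swaps on the augmented matrix.
    M = [row + [rhs] for row, rhs in zip(A, b)]
    for col in range(m):
        pivot = next(r for r in range(col, m) if M[r][col] != 0)
        M[col], M[pivot] = M[pivot], M[col]
        M[col] = [v / M[col][col] for v in M[col]]
        for r in range(m):
            if r != col and M[r][col] != 0:
                factor = M[r][col]
                M[r] = [v - factor * p for v, p in zip(M[r], M[col])]
    return [M[r][m] for r in range(m)]

# Hypothetical second-order, two-state model (illustrative values only).
Q1 = [[Fraction(9, 10), Fraction(2, 10)],
      [Fraction(1, 10), Fraction(8, 10)]]
Q2 = [[Fraction(5, 10), Fraction(3, 10)],
      [Fraction(5, 10), Fraction(7, 10)]]
lams = [Fraction(3, 5), Fraction(2, 5)]

x_bar = stationary_distribution(lams, [Q1, Q2])
print(x_bar)  # [Fraction(12, 25), Fraction(13, 25)]

# Fixed-point check: sum_i lambda_i Q_i x_bar should reproduce x_bar.
mixed = [sum(l * sum(Q[r][c] * x_bar[c] for c in range(2))
             for l, Q in zip(lams, [Q1, Q2])) for r in range(2)]
```

The final lines check the fixed-point property $\bar{X} = \sum_i \lambda_i Q_i \bar{X}$ on the computed vector.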

The normalization constraint is necessary as the matrix $\big(I - \sum_{i=1}^{n} \lambda_i Q_i\big)$ has a one-dimensional null space. The result is then proved.

We remark that if some $\lambda_i$ are equal to zero, we can rewrite the vector $Y_{t+n+1}$ in terms of those $X_i$ for which $\lambda_i$ is nonzero. Then the model in (5) still has a stationary distribution $\bar{X}$ when t goes to infinity, independent of the initial state vectors, and the stationary distribution $\bar{X}$ can be obtained by solving the corresponding linear system of equations with the normalization constraint.

3. PARAMETERS ESTIMATION

In this section, we present two efficient methods to estimate the parameters $Q_i$ and $\lambda_i$ for $i = 1, 2, \ldots, n$. To estimate $Q_i$, we regard $Q_i$ as the ith step transition matrix of the categorical data sequence $\{X_t\}$. Given the categorical data sequence $\{X_t\}$, one can count the transition frequency $f^{(i)}_{jk}$ in the sequence from state k to state j in the ith step. Hence one can construct the ith step transition frequency matrix for the sequence $\{X_t\}$ as follows, so that column k collects the transitions out of state k:

$$F^{(i)} = \begin{pmatrix} f^{(i)}_{11} & \cdots & f^{(i)}_{1m} \\ f^{(i)}_{21} & \cdots & f^{(i)}_{2m} \\ \vdots & \ddots & \vdots \\ f^{(i)}_{m1} & \cdots & f^{(i)}_{mm} \end{pmatrix}. \tag{9}$$

From $F^{(i)}$, we get the estimates for $Q_i = [q^{(i)}_{jk}]$ as follows:

$$\hat{Q}_i = \begin{pmatrix} \hat{q}^{(i)}_{11} & \cdots & \hat{q}^{(i)}_{1m} \\ \hat{q}^{(i)}_{21} & \cdots & \hat{q}^{(i)}_{2m} \\ \vdots & \ddots & \vdots \\ \hat{q}^{(i)}_{m1} & \cdots & \hat{q}^{(i)}_{mm} \end{pmatrix}, \tag{10}$$

where

$$\hat{q}^{(i)}_{jk} = \begin{cases} \dfrac{f^{(i)}_{jk}}{\sum_{j=1}^{m} f^{(i)}_{jk}} & \text{if } \sum_{j=1}^{m} f^{(i)}_{jk} \ne 0, \\[2ex] 0 & \text{otherwise.} \end{cases} \tag{11}$$

We note that the computational complexity of the construction of $F^{(i)}$ is of $O(L^2)$ operations, where L is the length of the given data sequence. Hence the total computational complexity of the construction of $\{F^{(i)}\}_{i=1}^{n}$ is of $O(nL^2)$ operations. Here n is the number of lags. The following proposition shows that these estimators are unbiased.
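The counting procedure behind (9)-(11) can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' code: states are assumed to be labeled 1, ..., m, entry `[j][k]` of the matrix holds transitions from state k+1 to state j+1 (so columns are normalized), and the toy sequence `seq` is hypothetical.

```python
from fractions import Fraction

def transition_counts(seq, m, step):
    """F[j][k] = number of times state k+1 is followed, `step` positions later, by state j+1."""
    F = [[0] * m for _ in range(m)]
    for t in range(len(seq) - step):
        k, j = seq[t] - 1, seq[t + step] - 1
        F[j][k] += 1
    return F

def normalize_columns(F):
    """Divide each column by its sum; an all-zero column stays zero, as in (11)."""
    m = len(F)
    col_sums = [sum(F[j][k] for j in range(m)) for k in range(m)]
    return [[Fraction(F[j][k], col_sums[k]) if col_sums[k] else Fraction(0)
             for k in range(m)] for j in range(m)]

# Hypothetical toy sequence over the states {1, 2, 3}; not data from the paper.
seq = [1, 2, 3, 1, 2, 3, 1, 3, 2, 1]
F1 = transition_counts(seq, m=3, step=1)
Q1_hat = normalize_columns(F1)
print(F1)  # [[0, 1, 2], [2, 0, 1], [1, 2, 0]]
```

Calling `transition_counts` with `step=i` for each lag i = 1, ..., n gives the n frequency matrices needed by the model.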

PROPOSITION 2: The estimators in (11) satisfy

$$E\big(f^{(i)}_{jk}\big) = q^{(i)}_{jk}\, E\Big(\sum_{j=1}^{m} f^{(i)}_{jk}\Big).$$

PROOF: Let T be the length of the sequence, $[q^{(i)}_{jk}]$ be the ith step transition probability matrix, and $\bar{X}_l$ be the steady-state probability that the process is in state l. Then we have

$$E\big(f^{(i)}_{jk}\big) = T \cdot \bar{X}_k \cdot q^{(i)}_{jk}$$

and

$$E\Big(\sum_{j=1}^{m} f^{(i)}_{jk}\Big) = T \cdot \bar{X}_k \cdot \sum_{j=1}^{m} q^{(i)}_{jk} = T \cdot \bar{X}_k.$$

Therefore we have

$$E\big(f^{(i)}_{jk}\big) = q^{(i)}_{jk} \cdot E\Big(\sum_{j=1}^{m} f^{(i)}_{jk}\Big).$$

If the sequence is too short, $\hat{Q}_i$ (especially $\hat{Q}_n$) may contain a lot of zeros, and therefore $\hat{Q}_n$ may not be irreducible. We remark that this did not occur in our tested examples. For such situations we propose a second method. Let $W^{(i)}$ be the probability distribution of the ith step transition sequence; then another possible estimate for $Q_i$ is $W^{(i)} \mathbf{1}^T$. We note that if $W^{(i)}$ is a positive vector, then $W^{(i)} \mathbf{1}^T$ is a positive matrix and hence an irreducible matrix.

3.1. Linear Programming Formulation for Estimation of $\lambda_i$

Proposition 1 gives a sufficient condition for the sequence $X_t$ to converge to a stationary distribution $\bar{X}$. Suppose $X_t \to \bar{X}$ as t goes to infinity; then $\bar{X}$ can be estimated from the sequence $\{X_t\}$ by computing the proportion of occurrences of each state in the sequence. Let us denote this estimate by $\hat{X}$. From (8) one would expect

$$\sum_{i=1}^{n} \lambda_i \hat{Q}_i \hat{X} \approx \hat{X}. \tag{12}$$

This suggests one possible way to estimate the parameters $\lambda = (\lambda_1, \ldots, \lambda_n)$ as follows. We consider the following optimization problem:

$$\min_{\lambda} \max_{k} \Big| \Big[ \sum_{i=1}^{n} \lambda_i \hat{Q}_i \hat{X} - \hat{X} \Big]_k \Big|,$$

subject to

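$$\sum_{i=1}^{n} \lambda_i = 1 \quad\text{and}\quad \lambda_i \ge 0, \ \forall i.$$

Here $[\cdot]_k$ denotes the kth entry of the vector. The constraints in the optimization problem guarantee the existence of the stationary distribution $\bar{X}$. Next we see that the above optimization problem can be formulated as a linear programming problem:

$$\min_{\lambda} w$$

subject to

$$\begin{pmatrix} w \\ w \\ \vdots \\ w \end{pmatrix} \ge \hat{X} - \big[\hat{Q}_1 \hat{X} \mid \hat{Q}_2 \hat{X} \mid \cdots \mid \hat{Q}_n \hat{X}\big] \begin{pmatrix} \lambda_1 \\ \lambda_2 \\ \vdots \\ \lambda_n \end{pmatrix},$$

$$\begin{pmatrix} w \\ w \\ \vdots \\ w \end{pmatrix} \ge -\hat{X} + \big[\hat{Q}_1 \hat{X} \mid \hat{Q}_2 \hat{X} \mid \cdots \mid \hat{Q}_n \hat{X}\big] \begin{pmatrix} \lambda_1 \\ \lambda_2 \\ \vdots \\ \lambda_n \end{pmatrix},$$

$$w \ge 0, \quad \sum_{i=1}^{n} \lambda_i = 1, \quad\text{and}\quad \lambda_i \ge 0, \ \forall i.$$

We can solve the above linear programming problem efficiently and obtain the parameters $\lambda_i$. In the next subsection, we demonstrate the estimation method by a simple example.

Instead of solving a min-max problem, we remark that we can also formulate the following optimization problem:

$$\min_{\lambda} \sum_{k=1}^{m} \Big| \Big[ \sum_{i=1}^{n} \lambda_i \hat{Q}_i \hat{X} - \hat{X} \Big]_k \Big|$$

subject to

$$\sum_{i=1}^{n} \lambda_i = 1 \quad\text{and}\quad \lambda_i \ge 0, \ \forall i.$$

The corresponding linear programming problem is given as follows: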

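$$\min_{\lambda} \sum_{k=1}^{m} w_k$$

subject to

$$\begin{pmatrix} w_1 \\ w_2 \\ \vdots \\ w_m \end{pmatrix} \ge \hat{X} - \big[\hat{Q}_1 \hat{X} \mid \hat{Q}_2 \hat{X} \mid \cdots \mid \hat{Q}_n \hat{X}\big] \begin{pmatrix} \lambda_1 \\ \lambda_2 \\ \vdots \\ \lambda_n \end{pmatrix},$$

$$\begin{pmatrix} w_1 \\ w_2 \\ \vdots \\ w_m \end{pmatrix} \ge -\hat{X} + \big[\hat{Q}_1 \hat{X} \mid \hat{Q}_2 \hat{X} \mid \cdots \mid \hat{Q}_n \hat{X}\big] \begin{pmatrix} \lambda_1 \\ \lambda_2 \\ \vdots \\ \lambda_n \end{pmatrix},$$

$$w_i \ge 0, \ \forall i, \quad \sum_{i=1}^{n} \lambda_i = 1, \quad\text{and}\quad \lambda_i \ge 0, \ \forall i.$$

In the above linear programming formulation, the number of variables is equal to n and the number of constraints is equal to $(2m + 1)$. The order of the linear program is linear in the number of lags and in the number of states. Therefore, the expected computational complexity of solving the above linear programming problem is of $O(nm^2)$ [7, p. 96].

3.2. An Example

We consider a sequence $\{X_t\}$ of three states (m = 3) given by

$$1, 1, 2, 2, 1, 3, 2, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 1, 2. \tag{13}$$

The sequence $\{X_t\}$ can be written in vector form:

$$X_1 = (1, 0, 0)^T, \quad X_2 = (1, 0, 0)^T, \quad X_3 = (0, 1, 0)^T, \quad \ldots, \quad X_{20} = (0, 1, 0)^T.$$

We consider n = 2; then from (13) we have the transition frequency matrices

$$F^{(1)} = \begin{pmatrix} 1 & 3 & 3 \\ 6 & 1 & 1 \\ 1 & 3 & 0 \end{pmatrix} \quad\text{and}\quad F^{(2)} = \begin{pmatrix} 1 & 4 & 1 \\ 3 & 2 & 3 \\ 3 & 1 & 0 \end{pmatrix}. \tag{14}$$

Therefore from (14) we have the i-step transition matrices (i = 1, 2) as follows:

$$\hat{Q}_1 = \begin{pmatrix} 1/8 & 3/7 & 3/4 \\ 3/4 & 1/7 & 1/4 \\ 1/8 & 3/7 & 0 \end{pmatrix} \quad\text{and}\quad \hat{Q}_2 = \begin{pmatrix} 1/7 & 4/7 & 1/4 \\ 3/7 & 2/7 & 3/4 \\ 3/7 & 1/7 & 0 \end{pmatrix}. \tag{15}$$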
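As a cross-check on the hand calculation for this example, the sketch below recomputes the quantities for the sequence (13) in exact arithmetic and solves the min-max problem of Section 3.1 for n = 2. It is not the authors' implementation and it is not a general linear programming solver: because $\lambda_2 = 1 - \lambda_1$, the problem has a single free variable, so the sketch simply evaluates the objective at the endpoints of [0, 1] and at every crossing point of the residual lines, which are the only places where the optimum of this small linear program can lie. The helper names are assumptions; a general-purpose LP solver would be needed for larger n.

```python
from fractions import Fraction as Fr
from itertools import combinations

seq = [1, 1, 2, 2, 1, 3, 2, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 1, 2]  # sequence (13)
m = 3

def q_hat(seq, m, step):
    """Column-normalized `step`-ahead transition frequencies, as in (9)-(11)."""
    F = [[0] * m for _ in range(m)]
    for t in range(len(seq) - step):
        F[seq[t + step] - 1][seq[t] - 1] += 1
    sums = [sum(F[j][k] for j in range(m)) for k in range(m)]
    return [[Fr(F[j][k], sums[k]) if sums[k] else Fr(0) for k in range(m)]
            for j in range(m)]

def mat_vec(Q, x):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(q * v for q, v in zip(row, x)) for row in Q]

# Proportion of occurrence of each state, then Q1_hat x_hat and Q2_hat x_hat.
x_hat = [Fr(seq.count(s), len(seq)) for s in range(1, m + 1)]
u1 = mat_vec(q_hat(seq, m, 1), x_hat)
u2 = mat_vec(q_hat(seq, m, 2), x_hat)

def worst_residual(lam1):
    """max_k |[lam1*Q1x + (1 - lam1)*Q2x - x]_k|, the min-max objective."""
    return max(abs(lam1 * a + (1 - lam1) * b - x)
               for a, b, x in zip(u1, u2, x_hat))

# Each residual is a line c + d*lam1; the objective is the upper envelope of
# the 2m lines +/-(c + d*lam1), so its minimum over [0, 1] is attained at an
# endpoint or where two of these lines cross.
lines = []
for a, b, x in zip(u1, u2, x_hat):
    c, d = b - x, a - b
    lines += [(c, d), (-c, -d)]
candidates = {Fr(0), Fr(1)}
for (c1, d1), (c2, d2) in combinations(lines, 2):
    if d1 != d2:
        lam = (c2 - c1) / (d1 - d2)
        if 0 <= lam <= 1:
            candidates.add(lam)

lam1 = min(candidates, key=worst_residual)
w = worst_residual(lam1)
print(lam1, 1 - lam1, w)  # 1 0 1/35
```

The optimum $w = 1/35 \approx 0.0286$ is attained at $\lambda_1 = 1$, $\lambda_2 = 0$.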

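The proportion of occurrences of each state in (13) gives $\hat{X} = (2/5, 2/5, 1/5)^T$. Hence we have

$$\hat{Q}_1 \hat{X} = \Big(\frac{13}{35}, \frac{57}{140}, \frac{31}{140}\Big)^T \quad\text{and}\quad \hat{Q}_2 \hat{X} = \Big(\frac{47}{140}, \frac{61}{140}, \frac{8}{35}\Big)^T.$$

To estimate $\lambda_i$ we consider the optimization problem

$$\min_{\lambda_1, \lambda_2} w$$

subject to

$$\begin{aligned}
w &\ge \tfrac{2}{5} - \tfrac{13}{35}\lambda_1 - \tfrac{47}{140}\lambda_2, &\quad w &\ge -\tfrac{2}{5} + \tfrac{13}{35}\lambda_1 + \tfrac{47}{140}\lambda_2, \\
w &\ge \tfrac{2}{5} - \tfrac{57}{140}\lambda_1 - \tfrac{61}{140}\lambda_2, &\quad w &\ge -\tfrac{2}{5} + \tfrac{57}{140}\lambda_1 + \tfrac{61}{140}\lambda_2, \\
w &\ge \tfrac{1}{5} - \tfrac{31}{140}\lambda_1 - \tfrac{8}{35}\lambda_2, &\quad w &\ge -\tfrac{1}{5} + \tfrac{31}{140}\lambda_1 + \tfrac{8}{35}\lambda_2, \\
w &\ge 0, \quad \lambda_1 + \lambda_2 = 1, &\quad \lambda_1, \lambda_2 &\ge 0.
\end{aligned}$$

The optimal solution is

$$(\lambda_1^*, \lambda_2^*, w^*) = (1, 0, 0.0286),$$

and we have the model

$$X_{t+1} = \hat{Q}_1 X_t. \tag{16}$$

We remark that if we do not require $\lambda_1$ and $\lambda_2$ to be non-negative, the optimal solution becomes

$$(\lambda_1^{**}, \lambda_2^{**}, w^{**}) = (1.80, -0.80, 0.0157),$$

and the corresponding model is

$$X_{t+1} = 1.80\,\hat{Q}_1 X_t - 0.80\,\hat{Q}_2 X_{t-1}. \tag{17}$$

Although $w^{**}$ is less than $w^*$, the model (17) is not suitable. It is easy to check that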

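$$1.80\,\hat{Q}_1 \begin{pmatrix} 1 \\ 0 \\ 0 \end{pmatrix} - 0.80\,\hat{Q}_2 \begin{pmatrix} 0 \\ 1 \\ 0 \end{pmatrix} = \begin{pmatrix} -0.2321 \\ 1.1214 \\ 0.1107 \end{pmatrix},$$

which is not a probability vector; therefore, $\lambda_1^{**}$ and $\lambda_2^{**}$ are not valid parameters.

We note that if we consider the optimization problem

$$\min_{\lambda_1, \lambda_2} \ w_1 + w_2 + w_3$$

subject to

$$\begin{aligned}
w_1 &\ge \tfrac{2}{5} - \tfrac{13}{35}\lambda_1 - \tfrac{47}{140}\lambda_2, &\quad w_1 &\ge -\tfrac{2}{5} + \tfrac{13}{35}\lambda_1 + \tfrac{47}{140}\lambda_2, \\
w_2 &\ge \tfrac{2}{5} - \tfrac{57}{140}\lambda_1 - \tfrac{61}{140}\lambda_2, &\quad w_2 &\ge -\tfrac{2}{5} + \tfrac{57}{140}\lambda_1 + \tfrac{61}{140}\lambda_2, \\
w_3 &\ge \tfrac{1}{5} - \tfrac{31}{140}\lambda_1 - \tfrac{8}{35}\lambda_2, &\quad w_3 &\ge -\tfrac{1}{5} + \tfrac{31}{140}\lambda_1 + \tfrac{8}{35}\lambda_2, \\
w_1, w_2, w_3 &\ge 0, \quad \lambda_1 + \lambda_2 = 1, &\quad \lambda_1, \lambda_2 &\ge 0,
\end{aligned}$$

the optimal solution is the same as in the previous min-max formulation and is equal to

$$(\lambda_1^*, \lambda_2^*, w_1^*, w_2^*, w_3^*) = (1, 0, 0.0286, 0.0071, 0.0214).$$

4. SOME PRACTICAL EXAMPLES

In this section we apply our model to some data sequences. The data sequences are a DNA sequence and a sales demand data sequence. Given the state vectors $X_i$, $i = t-n, t-n+1, \ldots, t-1$, the state probability distribution at time t can be estimated as follows:

$$\hat{X}_t = \sum_{i=1}^{n} \lambda_i \hat{Q}_i X_{t-i}.$$

In many applications, one would like to make use of higher-order Markov models for the purpose of prediction. According to this state probability distribution, the prediction of the next state $\hat{X}_t$ at time t can be taken as the state with the maximum probability, i.e.,

$$\hat{X}_t = j, \quad\text{if } [\hat{X}_t]_i \le [\hat{X}_t]_j, \ \forall\, 1 \le i \le m.$$

To evaluate the performance and effectiveness of our higher-order Markov chain model, a prediction result is measured by the prediction accuracy r defined as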

$$r = \frac{1}{T} \sum_{t=n+1}^{T} \delta_t,$$

where T is the length of the data sequence and

$$\delta_t = \begin{cases} 1, & \text{if } \hat{X}_t = X_t, \\ 0, & \text{otherwise.} \end{cases}$$

Using the example in the previous section, there are two possible prediction rules:

$$\hat{X}_{t+1} = 2 \ \text{if } X_t = 1, \qquad \hat{X}_{t+1} = 1 \ \text{if } X_t = 2, \qquad \hat{X}_{t+1} = 1 \ \text{if } X_t = 3,$$

or

$$\hat{X}_{t+1} = 2 \ \text{if } X_t = 1, \qquad \hat{X}_{t+1} = 3 \ \text{if } X_t = 2, \qquad \hat{X}_{t+1} = 1 \ \text{if } X_t = 3.$$

The prediction accuracy r for the sequence in (13) is equal to 12/19 for both prediction rules (12 of the 19 one-step predictions are correct). We note that the prediction accuracies of other rules for the sequence in (13) are less than 12/19.

Next, the test results on different data sequences are discussed. In the following tests, we solve min-max optimization problems to determine the parameters $\lambda_i$ of the higher-order Markov models. However, we remark that the results of using the 1-norm optimization problem discussed in the previous section are about the same as those of using the min-max formulation. All the computations here are done in MATLAB on a PC.

4.1. The DNA Sequence

In order to determine whether certain short DNA sequences (a categorical data sequence of four possible categories) occurred more often than would be expected by chance, Avery [3] examined the Markovian structure of introns from several other genes in mice. Here we apply our model to the introns from the mouse αA-crystallin gene; see, for instance, [16]. We compare our second-order model with the Raftery second-order model. The model parameters of the Raftery model are given in [16]. The results are reported in Table 1 below.

Table 1. Prediction accuracy in the DNA sequence.

                   2-state model   3-state model   4-state model
New model          […]             […]             […]
Raftery's model    […]             […]             […]
Random chosen      […]             […]             […]

The comparison is made with different groupings of states as suggested in [16]. In grouping states 1 and 3, and states 2 and 4, we have a 2-state model. Our model gives

$\hat{Q}_1$ = […], $\hat{Q}_2$ = […], $\hat{X}$ = ([…], […])$^T$, $\lambda_1$ = […], and $\lambda_2$ = […].

In grouping states 1 and 3 we have a 3-state model. Our model gives

$\hat{Q}_1$ = […], $\hat{Q}_2$ = […], $\hat{X}$ = ([…], […], […])$^T$, $\lambda_1 = 1.0$, and $\lambda_2 = 0.0$.

If there is no grouping, we have a 4-state model. Our model gives

$\hat{Q}_1$ = […], $\hat{Q}_2$ = […], $\hat{X}$ = ([…], […], […], […])$^T$, $\lambda_1$ = […], and $\lambda_2$ = […].

When using the expected errors (assuming that the next state is randomly chosen with equal probability for all states) as a reference, the percentage gain in effectiveness of using higher-order Markov chain models is […] in the 3-state model. In this case, our model also gives a better estimation compared with the Raftery model. Raftery [15] refers to using BIC to weigh the efficiency gained against the extra parameters used. This is important in his approach since his method requires solving a highly nonlinear optimization problem. The complexity of solving the optimization problem increases when there are many parameters to be estimated. We remark that our estimation method is quite efficient. The main cost is to solve a linear programming problem, and the expected computational complexity of solving this linear programming problem is of $O(nm^2)$, where m is the number of states and n is the order of the model (see Section 3).

4.2. The Sales Demand Data

A large soft-drink company in Hong Kong presently faces an in-house problem of production planning and inventory control. A pressing issue that stands out is the storage space of its central warehouse, which often finds itself in a state of overflow or near capacity. The company is thus in urgent need of studying the interplay between the storage space requirement and the overall growing sales demand. There are five product states, defined by the level of sales volume. The states include:

State 1: very slow-moving (very low sales volume)
State 2: slow-moving
State 3: standard
State 4: fast-moving
State 5: very fast-moving (very high sales volume)

Such labelings are useful from both marketing and production planning points of view. For instance, in production planning, the company develops a dynamic programming (DP) model to recommend better production planning so as to minimize its inventory build-up and to maximize demand satisfaction as well. Since the number of alternatives at each stage (each day in the planning horizon) is very large (the number of products raised to the power of the number of production lines), the computational complexity of the DP model is enormous. A priority scheme based on the state (the level of sales volume) of the product is introduced to tackle this combinatorial problem, and therefore an effective and efficient production plan can be obtained. It is obvious that accurate prediction of the state (the level of sales volume) of the product is important in the production planning model.

In Figure 1, we show the states of four products of the soft-drink company for some sales periods. Here we employ higher-order Markov models to predict the categories of these four products separately. For our new model, we consider the second-order (n = 2) model and use the data to estimate $\hat{Q}_i$ and $\lambda_i$ (i = 1, 2). The results are reported in Table 2. For comparison, we also study the first-order and the second-order full Markov chain models. Results show the effectiveness of our new model. We also see from Figure 1 that the change of the states of products A, B, and D is more regular than that of product C. We find in Table 2 that the prediction results for products A, B, and D are better than those for C.

Figure 1. The states of four products A, B, C, and D.

Table 2. Prediction accuracy in the sales demand data.

                                   Product A   Product B   Product C   Product D
First-order Markov model           […]         […]         […]         […]
Second-order Markov model          […]         […]         […]         […]
New model (n = 2)                  […]         […]         […]         […]
Random chosen                      […]         […]         […]         […]

5. APPLICATIONS TO WEB PREDICTION

The Internet provides a rich environment for users to retrieve information. However, it is easy for a user to get lost in the sea of information. One way to assist users with their informational needs is to predict a user's future request and use the prediction for recommendation. Recommendation systems rely on a prediction model to make inferences about users' interests, upon which recommendations are based. Examples are the WebWatcher [10] system and the Letizia [12] system. Accurate prediction can potentially shorten the users' access times and reduce network traffic when the recommendation is handled correctly. In this section, we use a higher-order Markov chain model to exploit the information from Web server logs for predicting users' actions on the Web.

Our higher-order Markov chain model is built on a Web server log file. We consider the Web server log file to be preprocessed into a collection of user sessions. Each session is indexed by a unique user id and starting time [17]. Each session is a sequence of requests where each request corresponds to a visit to a Web page. For simplicity, we represent each request as a state. Then each session is just a categorical data sequence.
For simplicity, we denote each Web page (state) by an integer.

5.1. Web Log Files and Preprocessing

Experiments were conducted on a real Web log file taken from the Internet. We first implemented a data preprocessing program to extract sessions from the log file. We downloaded two Web log files from the Internet. The data set was a Web log file from the EPA WWW server located at Research Triangle Park, NC. This log contained […] transactions generated in 24 hours from 23:53:25 EDT, August 29, to 23:53:07, August 30, […]. In preprocessing, we removed all the invalid requests and the requests for images. We used the host id to identify visitors and a 30-min time threshold to identify sessions. In total, 428 sessions of lengths between 16 and 20 were identified from the EPA log file. The total number of Web pages (states) involved is […].

5.2. Prediction Models

By exploring the session data from the Web log file, we have observed that a large number of similar sessions rarely exist. This is because in a complex Web site with a variety of pages and many paths and links, one should not expect that, in a given time period, a large number of visitors follow only a few paths. If this were true, it would mean that the structure and contents of the Web site had a serious problem, because only a few pages and paths were of interest to the visitors. In fact, most Web site designers expect that the majority of their pages, if not every one, are visited and paths followed (equally) frequently.

In Figure 2, we depict the first and the second step transition matrices of all sessions. It is clear that these matrices are very sparse. There are 3900 and 4747 nonzero entries in the first and the second step transition matrices, respectively. Nonzero entries account for only about 0.033% of the total elements of the first and the second step transition matrices.

Figure 2. The first (left) and the second (right) step transition matrices of all sessions.

Based on these observations, if we directly use these transition matrices to build prediction models, they may not be effective. Since the number of pages (states) is very large, the prediction probability for each page may be very low. Moreover, the computational work for solving the linear programming problem in the estimation of $\lambda_i$ is also high, since the number of constraints in the linear programming problem depends on the number of pages (states). Here we propose to use clustering algorithms [9] to cluster the sessions. The idea is to form a transition probability matrix for each session, to construct the distance between two sessions based on the Frobenius norm of the difference of their transition probability matrices, and then to use the k-means algorithm to cluster the sessions. As a result of the cluster analysis, the Web page cluster can be used to construct a higher-order Markov chain model. Then we prefetch those Web documents that are close to a user-requested document in a Markov chain model.

We find that there is a clear similarity among the sessions in each cluster for the EPA log file.
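The session distance described above can be sketched as follows. This is a minimal illustration under stated assumptions rather than the authors' implementation: each session is turned into a column-normalized one-step transition matrix, and two sessions are compared by the Frobenius norm of the difference. The three short sessions are hypothetical (they are not taken from the EPA log), and the k-means step itself is not shown.

```python
import math

def transition_matrix(session, m):
    """Column-normalized one-step transition matrix of one session (states 1..m)."""
    counts = [[0.0] * m for _ in range(m)]
    for a, b in zip(session, session[1:]):
        counts[b - 1][a - 1] += 1.0   # row = destination page, column = source page
    for k in range(m):
        total = sum(counts[j][k] for j in range(m))
        if total:
            for j in range(m):
                counts[j][k] /= total
    return counts

def frobenius_distance(P, Q):
    """Frobenius norm of the difference P - Q."""
    return math.sqrt(sum((p - q) ** 2
                         for row_p, row_q in zip(P, Q)
                         for p, q in zip(row_p, row_q)))

# Hypothetical sessions over three pages (illustrative only).
s1 = [1, 2, 3, 1, 2, 3]
s2 = [1, 2, 3, 1, 2]
s3 = [3, 2, 1, 3, 2, 1]
m = 3

d12 = frobenius_distance(transition_matrix(s1, m), transition_matrix(s2, m))
d13 = frobenius_distance(transition_matrix(s1, m), transition_matrix(s3, m))
print(d12, d13)
```

Sessions `s1` and `s2` follow the same cycle of pages and are at distance zero, while `s3` traverses the cycle in reverse and lies at distance $\sqrt{6}$ from `s1`, so a clustering step would be expected to separate it from the other two.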
As an example, we show in Figure 3 the first, the second, and the third step transition probability matrices of a cluster in the EPA log file. There are 70 pages involved in this cluster. Nonzero entries account for about 1.92%, 2.06%, and 2.20%, respectively, of the total elements of the first, the second, and the third step transition matrices. Usually, the prediction of the next Web page is based on the current page and the previous few pages [1]. Therefore, we use a third-order model (n = 3) and consider the first, the second, and the third transition matrices in the construction of the Markov model. After we find the transition matrices, we determine $\lambda_i$ and build our new higher-order Markov chain model for each cluster. For the above-mentioned cluster, its corresponding $\lambda_1$, $\lambda_2$, and $\lambda_3$ are […], […], and […], respectively. The parameters show that the prediction of the next Web page strongly depends on the current and the previous pages.

Figure 3. The first (left), the second (middle), and the third (right) step transition matrices of a cluster.

Below we present the prediction results for the EPA log file. We perform clustering based on the transition matrices and parameters. Sixteen clusters are found experimentally based on their average within-cluster distance, and therefore 16 third-order Markov chain models for these clusters are determined for the prediction of user-requested documents. For comparison, we also compute the first-order Markov chain model for each cluster. In total, there are 6255 Web documents for the prediction test. We find that the prediction accuracy of our method is about 0.77, but the prediction accuracy of using the first-order full Markov chain model is only […]. Results show an improvement in the prediction. We have applied these prediction results to the problem of integrated Web caching and prefetching [19]. The slight increase in prediction accuracy can power a prefetching engine. Experimental results in [19] show that the resultant system outperforms Web systems that are based on caching alone.

6. CONCLUDING REMARKS

In this paper, we proposed and developed a higher-order Markov chain model for categorical data sequences. The number of model parameters increases linearly with respect to the number of lags. Efficient estimation methods for the model parameters are also proposed by making use of the observed transition frequencies and the steady-state distribution. The expected computational complexity of our estimation methods is of $O(n(L^2 + m^2))$, where n is the number of lags, m is the number of states, and L is the length of the sequence. Numerical examples on DNA sequences and sales demand data are given to demonstrate the predicting power of our model.
We also apply the developed higher-order Markov chain model to server log data. Our tests are based on a realistic Web log, and our model has shown an improvement in the prediction of the users' behavior in accessing information.

We conclude the paper by giving the following possible extensions of our model for future research:

(i) For the problem of modeling sales demands, we have assumed that the products are independent. We then constructed a higher-order Markov chain model for each product individually. However, the demands of the products can be correlated. Therefore, it is natural to further develop Markov models for modeling multiple categorical data sequences together, and to get better prediction rules.

(ii) It is possible to extend our model to the case of Hidden Markov Models (HMMs) [14]. It is well known that HMMs are first-order Markov models. It would be interesting to develop a higher-order HMM based on our proposed approach.

REFERENCES

[1] D. Albrecht, I. Zukerman, and A. Nicholson, Pre-sending documents on the WWW: A comparative study, Proc Sixteenth Int Joint Conf Artif Intell IJCAI99, 1999.
[2] S. Adke and D. Deshmukh, Limit distribution of a high order Markov chain, J Roy Statist Soc Ser B 50 (1988).
[3] P. Avery, The analysis of intron data and their use in the detection of short signals, J Mol Evol 26 (1987).
[4] O. Axelsson, Iterative solution methods, Cambridge University Press, Cambridge.
[5] P. Brockwell and R. Davis, Time series: Theory and methods, Springer-Verlag, New York.
[6] W. Craig, The song of the wood pewee, University of the State of New York, Albany.
[7] S. Fang and S. Puthenpura, Linear optimization and extensions, Prentice-Hall, Englewood Cliffs, NJ.
[8] K. Gowda and E. Diday, Symbolic clustering using a new dissimilarity measure, Pattern Recognition 24(6) (1991).
[9] J. Huang, M. Ng, W. Ching, D. Cheung, and J. Ng, A cube model for Web access sessions and cluster analysis, WEBKDD 2001, Workshop on Mining Web Log Data Across All Customer Touch Points, Seventh ACM SIGKDD Int Conf Knowledge Discovery Data Mining, August 2001.
[10] T. Joachims, D. Freitag, and T. Mitchell, WebWatcher: A tour guide for the World Wide Web, Proc Fifteenth Int Joint Conf Artif Intell IJCAI 97, 1997.
[11] W. Li and M. Kwok, Some results on the estimation of a higher order Markov chain, Department of Statistics, The University of Hong Kong, Hong Kong.
[12] H. Lieberman, Letizia: An agent that assists Web browsing, Proc Fourteenth Int Joint Conf Artif Intell IJCAI 95, 1995.
[13] J. Logan, A structural model of the higher-order Markov process incorporating reversion effects, J Math Sociol 8 (1981).
[14] I. MacDonald and W. Zucchini, Hidden Markov and other models for discrete-valued time series, Chapman & Hall, London.
[15] A. Raftery, A model for high-order Markov chains, J Roy Statist Soc Ser B 47 (1985).
[16] A. Raftery and S. Tavare, Estimation and modelling repeated patterns in high order Markov chains with the mixture transition distribution model, Appl Statist 43 (1994).
[17] C. Shahabi, A. Faisal, F. Kashani, and J. Faruque, INSITE: A tool for real time knowledge discovery from users' Web navigation, Proc VLDB2000, Cairo, Egypt, 2000.
[18] M. Waterman, Introduction to computational biology, Chapman & Hall, Cambridge.
[19] Q. Yang, Z. Huang, and M. Ng, A data cube model for prediction-based Web prefetching, J Intell Inform Syst 20 (2003).
