A Monte Carlo Method to Data Stream Analysis


TRANSACTIONS ON ENGINEERING, COMPUTING AND TECHNOLOGY, VOLUME 14, AUGUST 2006

Kittisak Kerdprasop, Nittaya Kerdprasop, and Pairote Sattayatham

Abstract: Data stream analysis is the process of computing various summaries and derived values from large amounts of data that are continuously generated at a rapid rate. The nature of a stream does not allow a revisit of each data element. Furthermore, data processing must be fast to produce timely analysis results. These requirements constrain the design of the algorithms, which must balance correctness against timely responses. Several techniques have been proposed over the past few years to address these challenges. These techniques can be categorized as either data-oriented or task-oriented. The data-oriented approach analyzes a subset of the data or a smaller transformed representation, whereas the task-oriented scheme solves the problem directly via approximation techniques. We propose a hybrid approach to tackle the data stream analysis problem. The data stream is both statistically transformed to a smaller size and its characteristics are computationally approximated. We adopt a Monte Carlo method in the approximation step. The data reduction is performed horizontally and vertically through our EMR sampling method. The proposed method is analyzed by a series of experiments. We apply our algorithm to clustering and classification tasks to evaluate the utility of our approach.

Keywords: Data Stream, Monte Carlo, Sampling, Density Estimation.

I. INTRODUCTION

Data analysis is the process of computing various summaries and derived values from collected data. Data mining can be viewed as an intelligent data analysis aiming at extracting valuable knowledge from large amounts of information stored in data repositories [1], [3]. The techniques used in data mining have been adopted from the areas of machine learning and statistics, but scaled to deal with the problem of huge repositories of information.
The recent advances in hardware and software have enabled the rapid generation of continuous streams of information such as customer click streams, telephone records, retail chain transactions, web page visits, and so on. Mining stream data that grow at an unlimited rate poses a new challenge to researchers and practitioners in the area of data mining [1], [9]. A data stream is defined as massive amounts of data continuously generated at a rapid rate, possibly time-varying and unpredictable [2], [9]. Major characteristics of data streams are the continuous online arrival of data elements, the uncontrolled order of such elements upon arrival, variable sizes, and a one-time processing of an element before it is discarded or archived, because the massive size of the data far exceeds the storage capacity. The requirements of timely analysis and efficient memory usage constrain most data stream mining algorithms to sacrifice accuracy of the analysis results for fast and feasible processing. Development of approximation algorithms [5], [13] is a direct solution to the problem of data stream mining. However, the large volumes of data continuously arriving in a stream could eventually make the algorithms inefficient. A more practical solution is to apply a data reduction technique along with the approximation algorithms.

This work was supported by the Thailand Research Fund under Grant MRG, the National Research Council, and Suranaree University of Technology for the sponsorship of Data Engineering and Knowledge Discovery Research Unit. Kittisak Kerdprasop is with the School of Computer Engineering, and the director of Data Engineering and Knowledge Discovery Research Unit, Suranaree University of Technology, Nakhon Ratchasima 30000, Thailand (kerdpras@sut.ac.th). Nittaya Kerdprasop is with the School of Computer Engineering, and a member of Data Engineering and Knowledge Discovery Research Unit, Suranaree University of Technology, Nakhon Ratchasima 30000, Thailand (nittaya@sut.ac.th). Pairote Sattayatham is with the School of Mathematics, Suranaree University of Technology, Nakhon Ratchasima 30000, Thailand (pairote@sut.ac.th).

Data summarization techniques, such as wavelet analysis [10] and histograms [2], have been proposed as synopsis data structures to provide a summary presentation of data. The issue of dynamic space allocation as the underlying data distribution changes over time is a fundamental problem of these approaches. Data stream analysis by choosing a subset of the incoming stream is another class of techniques for producing approximate results. Sampling is a statistics-based technique widely used to scale up mining algorithms [7]. Nevertheless, in the context of a data stream in which the data size is unknown, simply applying a sampling method cannot give a reliable approximation. We, therefore, propose a Monte Carlo method to draw representatives from a data stream. Monte Carlo simulation is a widely used method to produce a good approximation to a true value or quantity. Our algorithm has been designed to produce data elements from which the approximate analysis is close to the exact one. We perform cluster and classification analyses on several data sets to verify the reliability of the method.

The paper is organized as follows. Section II presents the theoretical background of a general Monte Carlo method. Section III sketches the idea of density estimation from a sample. Our proposed method, which is efficiently applicable to data stream analysis, is explained in Section IV. Some of the experimental results from cluster and classification analyses over the reduced data stream are shown in Section V. We conclude in Section VI with a discussion of future work.

ENFORMATIKA V14 2006, WORLD ENFORMATIKA SOCIETY

II. PRINCIPLES OF MONTE CARLO

The Monte Carlo method is a class of stochastic algorithms for simulating the behavior of physical or mathematical systems [11], [14]. The term stochastic implies that the methods are non-deterministic in that they are based on the use of random numbers and probability statistics to investigate problems. To understand the method of Monte Carlo, it is useful to think of it as a general technique of numerical integration. Suppose we need to evaluate the $d$-dimensional integral of a function $f$ over the unit interval

$$\alpha = \int_0^1 \cdots \int_0^1 f(x_1, x_2, \ldots, x_d)\, dx_1\, dx_2 \cdots dx_d = \int_{(0,1)^d} f(x)\, dx. \tag{1}$$

The integral is a non-random problem, but the Monte Carlo method represents the integral as an approximation problem by introducing a random vector $U$ that is uniformly distributed between 0 and 1. Applying the function $f$ to $U$, we obtain a random variable $f(U)$ with expectation

$$E[f(U)] = \int_{(0,1)^d} f(x)\, \phi(x)\, dx \tag{2}$$

where $\phi$ is the probability density function of $U$. Since the value of $\phi$ on the region of integration is 1, equation (2) becomes

$$E[f(U)] = \int_{(0,1)^d} f(x)\, dx. \tag{3}$$

Equations (1) and (3) allow us to represent the integral $\alpha$ as a probabilistic expression as follows:

$$\alpha = E[f(U)]. \tag{4}$$

To estimate $\alpha$, we need a mechanism for drawing points $U_1, U_2, \ldots, U_n$. Applying the function $f$ to each of these random points yields independent and identically distributed (iid) random variables $f(U_1), f(U_2), \ldots, f(U_n)$, each with expectation $\alpha$ and standard deviation $\sigma$. Averaging the results produces the Monte Carlo estimator

$$\hat{\alpha} = \frac{1}{n} \sum_{i=1}^{n} f(U_i) \tag{5}$$

which is an unbiased estimator for $\alpha$, with the error $\hat{\alpha} - \alpha$ approximately normally distributed with mean 0 and standard deviation $\sigma/\sqrt{n}$. The form of the standard error $\sigma/\sqrt{n}$ is an important property of Monte Carlo methods. First, it tells us that if we increase the number of samples by a factor of four, we halve the standard error. Second, the standard error does not depend on the dimensionality of the integral. A Monte Carlo estimator based on $n$ draws from the domain $[0,1]^d$ still has a standard error of the form $\sigma/\sqrt{n}$ for all dimensions.
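The estimator (5) and its standard error can be illustrated with a short Python sketch. The integrand, the dimension d = 5, the sample sizes, and the function names are assumptions chosen for illustration and do not come from the paper; the integrand is picked because its exact integral over the unit cube is known to be d/3.

```python
import math
import random


def mc_integrate(f, d, n, rng):
    """Estimate the integral of f over (0,1)^d from n uniform draws.

    Returns (estimate, standard_error), where the standard error is the
    sample standard deviation of f(U_i) divided by sqrt(n).
    """
    values = [f([rng.random() for _ in range(d)]) for _ in range(n)]
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / (n - 1)
    return mean, math.sqrt(var / n)


def sum_of_squares(x):
    # Illustrative integrand (an assumption); exact integral over (0,1)^d is d/3.
    return sum(xi * xi for xi in x)


rng = random.Random(0)  # fixed seed so the run is repeatable
est_small, se_small = mc_integrate(sum_of_squares, d=5, n=2_000, rng=rng)
est_large, se_large = mc_integrate(sum_of_squares, d=5, n=8_000, rng=rng)
print("n=2000:", est_small, "+/-", se_small)
print("n=8000:", est_large, "+/-", se_large)
```

With four times as many draws, the reported standard error should shrink by roughly a factor of two, which is the behavior the text attributes to the $\sigma/\sqrt{n}$ form.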
Most techniques of numerical integration, such as the trapezoidal rule, degrade in convergence rate with increasing dimensions. We consider a Monte Carlo method to be useful in the domain of data stream analysis, in which the amount of data is overwhelming and the exact data distribution is unknown. The focus of our study is to generate samples from stream data, which is a prior step to data modeling and analysis. Once the samples have been successfully drawn, the characteristics of the stream can be estimated. We concentrate on the sampling problem because it can provide a satisfactory estimation, which will be shown through experiments on cluster and classification analyses.

III. SAMPLING METHOD AND DENSITY ESTIMATION

Basically, the Monte Carlo method employs any technique of statistical sampling to approximate solutions to quantitative problems. With the Monte Carlo method, a large system can be sampled in a number of random configurations, and that data can be used to describe the system as a whole. The efficiency of the method depends largely on the ability to draw samples effectively. For the particular domain of stream data, we consider the rejection sampling method.

Rejection sampling, or acceptance-rejection sampling, is a sampling method first introduced by von Neumann [16]. This method is used in cases where a target distribution, $f(x)$, is too complicated for us to sample from directly. Suppose we have a simpler distribution, $g(x)$, which we can evaluate and generate samples from; then the difficult sampling problem can be avoided by sampling from $g(x)$ instead. By generating a uniform random variable $u$ from the interval [0,1], we accept $x$ if the condition $u \le f(x) / Cg(x)$ holds; otherwise we reject the value of $x$ and repeat the sampling step. Posing the restriction $Cg(x) \ge f(x)$ for some $C > 1$, we say that $Cg$ envelopes $f$. The validation of this method is the envelope principle. When simulating the point $(x, v)$ where $v = u \cdot Cg(x)$, we produce a uniform simulation over the subgraph of $Cg(x)$.
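The accept/reject rule just described can be sketched in Python as follows. This is a minimal illustration under assumed choices that are not from the paper: the target $f$ is the Beta(2,2) density $6x(1-x)$ on [0,1], the proposal $g$ is the uniform density on [0,1], and $C = 1.5$ so that $Cg(x) \ge f(x)$ everywhere. The function name `rejection_sample` is hypothetical.

```python
import random


def rejection_sample(f, g, sample_g, C, n, rng):
    """Draw n samples from density f using proposal density g and constant C.

    Requires C * g(x) >= f(x) for all x. Returns the accepted samples and
    the total number of proposals that were needed.
    """
    samples = []
    proposals = 0
    while len(samples) < n:
        x = sample_g(rng)          # candidate drawn from g
        u = rng.random()           # uniform variable on [0, 1)
        proposals += 1
        if u <= f(x) / (C * g(x)):
            samples.append(x)      # accept; otherwise the candidate is dropped
    return samples, proposals


def target(x):
    # Assumed target: Beta(2,2) density, whose maximum is 1.5 at x = 0.5.
    return 6.0 * x * (1.0 - x)


def proposal_density(x):
    # Assumed proposal: uniform density on [0, 1].
    return 1.0


rng = random.Random(0)
samples, proposals = rejection_sample(
    target, proposal_density, lambda r: r.random(), C=1.5, n=5000, rng=rng
)
print("sample mean:", sum(samples) / len(samples))
print("acceptance rate:", len(samples) / proposals)
```

For normalized densities the expected acceptance rate is $1/C$, so a loose envelope (large $C$) translates directly into wasted proposals.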
Accepting only points such that $u \le f(x) / Cg(x)$ then produces points $(x, v)$ uniformly distributed over the subgraph of $f(x)$ and thus, marginally, a simulation from $f(x)$. Rejection sampling works best if $g$ is a good approximation to $f$. However, in a high-dimensional problem the value of $C$ needs to be chosen very large to ensure the requirement $Cg(x) > f(x)$ for all $x$. The result is an enormous rejection rate.

The difficulty of applying the rejection sampling method directly to the problem of data stream analysis is that we do not know beforehand where the modes of $f$ are located or how high they are. In other words, we do not know the exact characteristics of the target density. We thus propose to apply the EM (Expectation-Maximization) technique [6] to approximate the density $f(x)$. We consider multi-dimensional stream data as mixtures of Gaussian, or normal, probability density functions (pdf). Gaussian mixtures [8], [12] are combinations of Gaussian distributions written as

$$g(x) = \sum_{i=1}^{K} p_i\, f(x \mid \theta_i). \tag{6}$$

The random variable $x$ denotes an independent observation from the $K$ mixture components. The $p_i$ are the mixing proportions, $0 < p_i < 1$ for all $i = 1, \ldots, K$, and $p_1 + \cdots + p_K = 1$. The $f(x \mid \theta_i)$ denotes the density of a $d$-dimensional Gaussian distribution with mean vector $\mu$ and covariance matrix $\Sigma$, that is $\theta = (\mu, \Sigma)$, and the Gaussian pdf is given by [4], [15]

$$g_{(\mu,\Sigma)}(x) = \frac{1}{\sqrt{(2\pi)^d \det(\Sigma)}} \exp\left\{ -\frac{1}{2} (x - \mu)^T \Sigma^{-1} (x - \mu) \right\}. \tag{7}$$

By varying the number of Gaussians $K$, the mixing proportions $p_i$, and the parameter $\theta_i$ of each Gaussian density function, Gaussian mixtures can be used to describe any complex pdf (Fig. 1).

Fig. 1 One-dimensional Gaussian mixture densities for K = 3 (first row) and a second value of K (second row). The left column shows the histogram of Gaussian density, the right column gives the corresponding Gaussian mixture pdf

In stream data a mixture density $\sum_i p_i f(x \mid \theta_i)$ has been observed with unknown parameters $\theta_i$ and $p_i$. To find the parameters that optimally fit a mixture model to a given set of data, the EM algorithm [6], [12], [15] can be used. The EM algorithm is a broadly applicable approach to the iterative computation of maximum likelihood estimates. For a set of iid samples $X = \{x_1, \ldots, x_N\}$ drawn from a data generation model $f(x \mid \theta)$, $\theta = (\mu, \Sigma)$, the resulting density for the samples is

$$f(X \mid \theta) = \prod_{i=1}^{N} f(x_i \mid \theta) = L(\theta \mid X). \tag{8}$$

The likelihood function $L(\theta \mid X)$ is the likelihood of the parameters given the data. In the maximum likelihood problem, the goal is to find the $\theta$ that maximizes $L$, that is $\theta^* = \arg\max_{\theta} L(\theta \mid X)$. In the Gaussian case, the computation of the exponential can be avoided by maximizing $\log L(\theta \mid X)$ instead of $L(\theta \mid X)$.

The EM algorithm is an approach to finding the maximum of likelihood functions in incomplete data problems. Let $X$ be the observed data, $Z$ be the unobserved data, and $Y = X \cup Z$ be the full data set. The probability distribution of $Z$ depends on $X$ and the unknown parameter $\theta$. Given an initial parameter $\theta^{(0)}$, the EM algorithm produces a sequence $\{\theta^{(0)}, \theta^{(1)}, \theta^{(2)}, \ldots\}$ that converges to a stationary point of the likelihood function.

IV. EMR SAMPLING

In our particular case of data stream analysis, we assume that the observed data have a normal distribution. Given a specific number of models, the EM algorithm is applied to estimate the mean of each model.
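The EM estimation step can be sketched for the one-dimensional case as follows. This is an illustrative sketch rather than the authors' implementation: the synthetic data (two Gaussians centered at 0 and 5), the hand-picked initial means, the fixed iteration count, and the function names are all assumptions.

```python
import math
import random


def normal_pdf(x, mu, var):
    """Density of a 1-D Gaussian with mean mu and variance var."""
    return math.exp(-0.5 * (x - mu) ** 2 / var) / math.sqrt(2 * math.pi * var)


def em_gmm_1d(data, init_means, n_iter=50):
    """Fit a 1-D Gaussian mixture by EM, starting from the given means."""
    k = len(init_means)
    means = list(init_means)
    variances = [1.0] * k
    weights = [1.0 / k] * k
    for _ in range(n_iter):
        # E-step: posterior probability of each component for each point.
        resp = []
        for x in data:
            joint = [w * normal_pdf(x, m, v)
                     for w, m, v in zip(weights, means, variances)]
            total = sum(joint)
            resp.append([j / total for j in joint])
        # M-step: re-estimate means, variances, and mixing proportions.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            means[j] = sum(r[j] * x for r, x in zip(resp, data)) / nj
            var_j = sum(r[j] * (x - means[j]) ** 2
                        for r, x in zip(resp, data)) / nj
            variances[j] = max(var_j, 1e-6)  # guard against collapse
            weights[j] = nj / len(data)
    return weights, means, variances


rng = random.Random(1)
# Assumed synthetic data: two well-separated Gaussian components.
data = ([rng.gauss(0.0, 1.0) for _ in range(300)]
        + [rng.gauss(5.0, 1.0) for _ in range(300)])
weights, means, variances = em_gmm_1d(data, init_means=[1.0, 4.0])
print("weights:", weights)
print("means:", means)
print("variances:", variances)
```

The initial means here are set by hand; the paper instead initializes them with k-means, which matters when the components are less clearly separated than in this toy data.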
These mean values are scaled up to produce an upper bound for the underlying, partially observed target density. The idea of the proposed method is illustrated in Fig. 2. The target function is represented as a one-dimensional mixture of three Gaussians (the three solid lines at the bottom of Fig. 2) from which we want to draw samples. The density $E(x)$ is estimated with the upper bound requirement that $E(x) > f(x)$ for all $x$. $\tilde{E}(x)$ is the approximation (shown as a thick dashed line in Fig. 2) of the unknown target density. A broad distance between $E$ and $\tilde{E}$ (e.g., at x = 1) represents a rejection area, whereas a narrow distance (e.g., at x = 6.5) is an acceptance one.

It should be noted that EM requires a pre-specified number $K$ of components to be incorporated into the mixture models. In our proposed method, a suitable number should be selected by the user. To cope with multi-dimensional problems, we propose to use a statistical method, principal component analysis (PCA), to reduce the complicated problem to a simpler two-dimensional problem. That is, we take into account only the first and second principal components of the data set. The two-dimensional data are used to train the EM algorithm to estimate the parameters $\mu$ and $\Sigma$ of the Gaussian mixture models. The estimated Gaussian pdf is a distribution $\tilde{E}$ (as shown in Fig. 2). To sample from the estimated density we scale up this distribution to obtain an approximate $E$, which is a simpler distribution that we can evaluate and generate samples from. The outline of our EMR sampling algorithm is illustrated in Fig. 3. The subroutine Density_Estimator, which approximates the density function, is shown in Fig. 4.

Fig. 2 EM-based rejection (EMR) sampling

From the estimated density $E$ and the rough approximation $\tilde{E}$, we perform rejection sampling with the decision criterion $E(x) / (d\,\tilde{E}(x)) \ge u$, where $u$ is a uniform variable distributed between 0 and 1, and $d$ is the dimensionality of the data. The input from the stream data is taken one item at a time. A data item that satisfies the criterion is included in the sample until the specified sample size is filled. The sample data set is then a representative of the whole stream, and any analysis method can now be performed on this set.

Input:
  - a d-dimensional data set D with N points
  - an integer K to specify the number of models, and
  - a sample size SS
Output:
  - a sample set S drawn from the mixture models

// Data preprocessing steps //
1. If d > 0 then apply PCA to obtain the 1st and 2nd components
2. Transform D to a two-dimensional data set X
// Density estimation with EM and getting a rough pdf $\tilde{E}(X)$ //
3. Set max_iteration = max{, *K}
4. $(E(X), \tilde{E}(X))$ = Density_Estimator(X, K, max_iteration)
5. Set count = 0
6. While count < SS
   // Sampling steps //
7.    Sample x from E(X)
8.    Generate u from U(0,1)
9.    If $u \le E(x) / (d\,\tilde{E}(x))$ then accept x, add it to S, and increment count
10. Return S

Fig. 3 EMR sampling algorithm

Density_Estimator(X, K, max_iteration)
1. Initialize the parameter $\theta = (\mu, \Sigma)$ for each of the K Gaussian models by running K-means
2. Initialize the prior probabilities $P(m_k)$ of each model $m_k$ to 1/K, k = 1, ..., K
3. Repeat
4.    Compute the probability
      $$P(m_k \mid x_n, \theta^{(i)}) = \frac{P^{(i)}(m_k)\, p(x_n \mid \mu_k^{(i)}, \Sigma_k^{(i)})}{\sum_j P^{(i)}(m_j)\, p(x_n \mid \mu_j^{(i)}, \Sigma_j^{(i)})}$$
5.    Update means $\mu_k$, variances $\Sigma_k$, and priors $P(m_k)$
      $$\mu_k^{(i+1)} = \frac{\sum_{n=1}^{N} x_n\, P(m_k \mid x_n, \theta^{(i)})}{\sum_{n=1}^{N} P(m_k \mid x_n, \theta^{(i)})}$$
      $$\Sigma_k^{(i+1)} = \frac{\sum_{n=1}^{N} P(m_k \mid x_n, \theta^{(i)})\, (x_n - \mu_k^{(i+1)})(x_n - \mu_k^{(i+1)})^T}{\sum_{n=1}^{N} P(m_k \mid x_n, \theta^{(i)})}$$
      $$P^{(i+1)}(m_k) = \frac{1}{N} \sum_{n=1}^{N} P(m_k \mid x_n, \theta^{(i)})$$
6. Until max_iteration has been reached or the joint likelihood of all data with respect to all the models is greater than the lower boundary criterion $CL(\theta)$
      $$L(\theta) \ge CL(\theta) = \sum_{k=1}^{K} \sum_{n=1}^{N} P(m_k \mid x_n, \theta)\, \log p(x_n \mid \theta_k)$$
7. Return $\theta_k^{(i)} = (\mu_k^{(i)}, \Sigma_k^{(i)})$ for k = 1, ..., K, and a rough $\theta_k^{r} = (\mu_k^{r}, \Sigma_k^{r})$ from the first r iterations

Fig. 4 Density_Estimator algorithm

V. EXPERIMENTATIONS

A. Evaluation of Density Estimator

The objective of our initial experiments is to empirically evaluate the closeness of the estimated density to the real one.
The closeness is determined by comparing the Euclidean distance of the estimated mean vector to the original mean vector, and by comparing the estimated covariance matrix to the original covariance matrix. We use a synthetic data generator to produce two-dimensional Gaussian mixtures. The number of mixture models, the number of points in each model, the original mean vector, and the covariance matrix are input parameters. We vary the number of models from 2 upward, with up to 1,000 data points in each model. To properly initialize the component means for the $\theta$-parameter learning, we find the approximate mean points by running max{, *K} iterations of the k-means algorithm [17]. Component elements and main diagonal covariance matrix elements are also initialized accordingly, and off-diagonal matrix elements are constrained to zero.

Some of our experimental results on the accuracy of our density estimator compared with simple uniform sampling are given in Table I. The EMR sampling results are compared against uniform sampling, which always assumes a single Gaussian model. The efficiency of the sampling methods is evaluated on the basis of the closeness of the estimated $\theta_i = (\mu_i, \Sigma_i)$ to the original means and covariance matrices of the generative models. The $\mu$-differences and $\Sigma$-differences are averaged over the K models. The experimental results confirm the applicability of the EMR approach to the problem of $\theta$-parameter approximation. The estimated means and variances are very close to the original parameter values.

TABLE I EXPERIMENTAL RESULTS OF EMR SAMPLING FROM VARIOUS MIXING OF GAUSSIAN MODELS
(columns: Number of Mixture Models; EMR Sampling $\mu$-difference, $\Sigma$-difference; Uniform Sampling $\mu$-difference, $\Sigma$-difference)
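The averaged $\mu$- and $\Sigma$-differences could be computed along the following lines. This is a hedged sketch: the text specifies Euclidean distance for the mean vectors but does not say how two covariance matrices are compared, so the Frobenius norm used here is an assumption, as are the function names and the example parameter values. Estimated and original components are assumed to be listed in matching order.

```python
import math


def mean_difference(est_means, true_means):
    """Average Euclidean distance between paired mean vectors."""
    dists = [math.dist(e, t) for e, t in zip(est_means, true_means)]
    return sum(dists) / len(dists)


def covariance_difference(est_covs, true_covs):
    """Average Frobenius norm of the difference of paired covariance matrices.

    The Frobenius norm is an assumed choice, not taken from the paper.
    """
    norms = []
    for est, true in zip(est_covs, true_covs):
        sq = sum((a - b) ** 2
                 for row_e, row_t in zip(est, true)
                 for a, b in zip(row_e, row_t))
        norms.append(math.sqrt(sq))
    return sum(norms) / len(norms)


# Hypothetical parameters for K = 2 two-dimensional models.
true_means = [(0.0, 0.0), (0.0, 0.0)]
est_means = [(0.0, 0.0), (3.0, 4.0)]
true_covs = [[[1.0, 0.0], [0.0, 2.0]], [[1.0, 0.0], [0.0, 1.0]]]
est_covs = [[[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]]]

mu_diff = mean_difference(est_means, true_means)
sigma_diff = covariance_difference(est_covs, true_covs)
print(mu_diff, sigma_diff)
```

In practice the components returned by EM come in arbitrary order, so a matching step (for example, pairing each estimated mean with its nearest original mean) would be needed before averaging.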

B. Cluster and Classification Analyses

To verify the utility of the proposed method on real-world data, we run the k-means clustering algorithm [17] on various sample data from the UCI repository [uci.edu/~mlearn/MLRepository.html]. We test our algorithm on four data sets: Wisconsin diagnostic breast cancer (466 data points, 2 classes), diabetes (512 data points, 2 classes), DNA (2000 data points, 3 classes), and satellite image (4435 data points, 6 classes). In each data set, we assume that the class labels are the correct clusters to be found by the k-means algorithm. By assuming prior knowledge about the known clusters, we can evaluate the error rate of the cluster learning. To evaluate the efficiency of the Monte Carlo approach, we simulate a data stream by generating several samples for each data set. In our experiments we observe the performance of cluster learning on increasing sample sizes of 1%, 5%, 10%, 15%, and so on, up to the complete data set. The experimental results are shown in Fig. 5. The clustering results reveal the efficiency of the proposed method: a sample of only a fraction of the data is sufficient for the accurate learning of data clusters. The classification task has been performed in the same experimental setting with the C4.5 algorithm [17]. The results are shown in Fig. 6.

VI. CONCLUSION

In this paper we propose a technique of Monte Carlo estimation to analyze the major characteristics of a data stream. At the sampling phase of the Monte Carlo method we propose the EMR sampling algorithm to efficiently draw representative samples from data containing mixture models. We propose to apply the expectation-maximization technique to estimate the means and variances of the mixture models. The algorithm Density_Estimator produces two density functions, $E$ and $\tilde{E}$. The distance between $E$ and $\tilde{E}$ at each sampling point is the decision criterion for either sample acceptance or rejection.
A narrow distance between the two estimated densities leads to acceptance if the distance ratio is greater than the uniform random variable generated from the interval [0, 1]. The experimental results verify the utility of the proposed Density_Estimator algorithm and the EMR sampling method. The clustering and classification experiments on real-world data also confirm the efficiency of our method. We plan to extend our study to skewed data in which the distributions are not uniformly distributed.

REFERENCES

[1] C. Aggarwal, J. Han, J. Wang, and P. Yu, "A framework for clustering evolving data streams," in Proc. Very Large Data Bases, 2003.
[2] B. Babcock, S. Babu, M. Datar, R. Motwani, and J. Widom, "Models and issues in data stream systems," in Proc. ACM PODS, 2002.
[3] M. Berthold and D.J. Hand, Intelligent Data Analysis: An Introduction. Springer-Verlag, 2003.
[4] J. Bilmes, "A gentle tutorial of the EM algorithm and its application to parameter estimation for Gaussian mixture and hidden Markov models," Dept. Electrical Engineering and Computer Science, University of California Berkeley, Technical Report.
[5] G. Cormode and S. Muthukrishnan, "What's hot and what's not: Tracking most frequent items dynamically," in Proc. ACM PODS, 2003.
[6] A. P. Dempster, N. M. Laird, and D. B. Rubin, "Maximum likelihood from incomplete data via the EM algorithm," Journal of the Royal Statistical Society B, vol. 39, pp. 1-22.
[7] P. Domingos and G. Hulten, "A general method to scaling up machine learning algorithms and its application to clustering," in Proc. ICML, 2001.
[8] M. A. T. Figueiredo and A. K. Jain, "Unsupervised learning of finite mixture models," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 24, 2002.
[9] M. Gaber, A. Zaslavsky, and S. Krishnaswamy, "Mining data stream: A review," SIGMOD Record, vol. 34, 2005.
[10] A. Gilbert, Y. Kotidis, S. Muthukrishnan, and M. Strauss, "One-pass wavelet decompositions of data streams," IEEE Trans. Knowledge and Data Engineering, vol. 15, 2003.
[11] D. Mackay, "Introduction to Monte Carlo," in Learning in Graphical Models, M. Jordan, Ed. MIT Press, 1996.
[12] J. M. Marin, K. Mengersen, and C. Robert, "Bayesian modelling and inference on mixtures of distributions," in Handbook of Statistics, Elsevier-Science, 2005.
[13] S. Muthukrishnan, "Data streams: Algorithms and applications," in Proc. ACM-SIAM Symposium on Discrete Algorithm, 2003.
[14] R. Neal, "Probabilistic inference using Markov chain Monte Carlo methods," Dept. Computer Science, University of Toronto, Technical Report CRG-TR93-1.
[15] B. Resch, "A tutorial for the course computational intelligence."
[16] J. von Neumann, "Various techniques used in connection with random digits," Applied Mathematics Series, vol. 12, National Bureau of Standards, Washington, D.C.
[17] I. Witten and E. Frank, Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations. Morgan Kaufmann, 2000.

Fig. 5 Clustering results on four data sets (panels: Wisconsin diagnostic breast cancer, Diabetes, DNA, Satellite image; series: Uniform Sampling, EMR Sampling)

Fig. 6 Classification results on four data sets (panels: Wisconsin diagnostic breast cancer, Diabetes, DNA, Satellite image; series: Uniform Sampling, Sampling from Estimated Distribution)
