Evaluating Data Minability Through Compression An Experimental Study
|
|
- Solomon Boyd
- 5 years ago
- Views:
Transcription
1 Evaluatig Data Miability Through Compressio A Experimetal Study Da Simovici Uiv. of Massachusetts Bosto, Bosto, USA, dsim at cs.umb.edu Da Pletea Uiv. of Massachusetts Bosto, Bosto, USA, dpletea at cs.umb.edu Saaid Baraty Uiv. of Massachusetts Bosto, Bosto, USA, sbaraty at cs.umb.edu Abstract The effectiveess of compressio algorithms is icreasig as the data subjected to compressio cotais repetitive patters. This basic idea is used to detect the existece of regularities i various types of data ragig from market basket data to udirected graphs. The results are quite idepedet of the particular algorithms used for compressio ad offer a idicatio of the potetial of discoverig patters i data before the actual miig process takes place. Keywords-data miig; lossless compressio; LZW; market basket data; patters; Kroecker product. I. INTRODUCTION Our goal is to show that compressio ca be used as a tool to evaluate the potetial of a data set of producig iterestig results i a data miig process. The basic idea that data that displays repetitive patters or patters that occur with a certai regularity will be compressed more efficietly compared to data that has o such characteristics. Thus, a pre-processig phase of the miig process should allow to decide whether a data set is worth miig, or compare the iterestigess of applyig miig algorithms to several data sets. Sice compressio is geerally iexpesive ad compressio methods are well-studied ad uderstood, pre-miig usig compressio will help data miig aalysts to focus their efforts o miig resources that ca provide a highest payout without a exorbitat cost. Compressio has received lots of attetio i the data miig literature. As observed by Maila [7], data compressio ca be regarded as oe of the fudametal approaches to data miig [7], sice the goal of the data miig is to compress data by fidig some structure i it. The role of compressio developig parameter-free data miig algorithms i aomaly detectio, classificatio ad clusterig was examied i [4]. The size C(x) of a compressed file x is as a approximatio of Kolmogorov complexity [2] ad allows the defiitio of a pseudo-distace betwee two files x ad y as d(x, y) = C(xy) C(x) + C(y). Further advaces i this directio were developed i [8][5][6]. A Kolmogorov complexity-based dissimilarity was successfully used to texture matchig problems i [1] which have a broad spectrum of applicatios i areas like bioiformatics, atural laguages. ad music. We illustrate the use of lossless compressio i pre-miig data by focusig o several distict data miig processes: files with frequet patters, frequet itemsets i market basket data, ad explorig similarity of graphs. The LZW (Lempel-Ziv-Welch) algorithm was itroduced i 1984 by T. Welch i [9] ad is amog the most popular compressio techiques. The algorithm does ot eed to check all the data before startig the compressio ad the performace is based o the umber of the repetitios ad the legths of the strigs ad the ratio of 0s/1s or true/false at the bit level. There are several versios of the LZW algorithm. Popular programs (such as Wizip) use variatios of the LZW compressio. The Wizip/Zip type of algorithms also work at the bit level ad ot at a charater/byte level. We explore three experimetal settigs that provide strog empirical evidece of the correlatio betwee compressio ratio ad the existece of hidde patters i data. I Sectio II, we compress biary strigs that cotai patters; i Sectio III, we study the compressibility of adjacecy matrix for graphs relative to the etropy of distributio of subgraphs. Fially, i Sectio IV, we examie the compressibility of files that cotai market basket data sets. II. PATTERNS IN STRINGS AND COMPRESSION Let A be the set of strigs o the alphabet A. The legth of a strig w is deoted by w. The ull strig o A is deoted by λ ad we defie A + as A + = A {λ}. If w A ca be writte as w = utv, where u, v A ad t A +, we say that the pair (t, m) is a occurrece of t i w, where m is the legth of u. The occurreces (x, m) ad (y, p) are overlappig if p < m + x. If this is the case, there is a proper suffix of x that equals a proper prefix of y. If t is a word such that the sets of its proper prefixes ad its proper suffixes are disjoit, there are o overlappig occurreces of x i ay word. The umber of occurreces of a strig t i a strig w is deoted by t (w). Clearly, we have { a (w) a A} = w. The prevalece of t i w is the umber f t (w) = t(w) t w which gives the ratio of the characters cotaied i the occurreces of t relative to the total umber of characters i the strig.
2 The result of applyig a compressio algorithm C to a strig w A is deoted by C(w) ad the compressio ratio is the umber CR C (w) = C(w). w I this sectio, we shall use the biary alphabet B = {0, 1} ad the LZW algorithm or the compressio algorithm of the package java.util.zip. We geerated radom strigs of bits (0s ad 1s) ad computed the compressio ratio strigs with a variety of symbol distributios. A strig w that cotais oly 0s (or oly 1s) achieves a very good compressio ratio of CR jzip (w) = for 100K bits ad CR jzip = for 500K bits, where jzip deotes the compressio algorithm from the package java.util.zip. Figure 1 shows, as expected, that the worst compressio ratio is achieved whe 0s ad 1s occur with equal frequecies. Table I PATTERN 001 PREVALENCE VERSUS THE CR jzip Prevalece of CR jzip Baselie 001 patter 0% % % % % % % % % % % % Figure 2. Variatio of compressio rate depeds o the prevalece of the patter 001 Figure 1. Baselie CR jzip Behavior For strigs of small legth (less tha 10 4 bits) the compressio ratio may exceed 1 because of the overhead itroduced by the algorithm. However, whe the size of the radom strig exceeds 10 6 bits this pheomeo disappears ad the compressio ratio depeds oly o the prevalece of the bits ad is relatively idepedet o the size of the file. Thus, i Figure 1, the curves that correspod to files of size 10 6 ad overlap. We refer to the compressio ratio of a radom strig w with a ( 0 (w), 1 (w)) distributio as the baselie compressio ratio. We created a series of biary strigs ϕ t,m which have a miimum guarateed umber m of occurreces of patters t {0, 1} k, where 0 m 100. Specifically, we created 101 files ϕ 001,m for the patter 001, each cotaiig 100K bits ad we geerated similar series for t {01, 0010, 00010}. The compressio ratio is show i Figure 2. The compressio ratio starts at a value of 0.94 ad after the prevalece of the patter becomes more frequet tha 20% the compressio ratio drops dramatically. Results of the experimet are show i Table I ad i Figure 3. Figure 3. Depedecy of Compressio Ratio o Patter Prevalece We coclude that the presece of repeated patters i strigs leads to a high degree of compressio (that is, to low compressio ratios). Thus, a low compressio ratio for a file idicates that the miig process may produce iterestig results. III. RANDOM INSERTION AND COMPRESSION For a matrix M {0, 1} u v deote by i (M) the umber of etries of M that equal i, where i {0, 1}. Clearly, we have 0 (B)+ 1 (B) = uv. For a radom variable V which
3 rages over the set of matrices {0, 1} u v let ν i (V ) be the radom variable whose values equal the umber of etries of V that equal i, where i {0, 1}. Let A {0, 1} p q be a 0/1 matrix ad let ( ) B1 B B : 2 B k, p 1 p 2 p k be a matrix-valued radom variable where B j R r s, p j 0 for 1 j k, ad k j=1 p j = 1. Defiitio 3.1: The radom variable A B obtaied by the isertio of B ito A is give by a 11 B... a 1 B A B =..... R mr s a m1 B... a m B I other words, the etries of A B are obtaied by substitutig the block a ij B l with the probability p l for a ij i A. Note that this operatio is a probabilistic geeralizatio of Kroecker s product for if ( ) B1 B :, 1 the A B has as its uique value the Kroecker product A B. The expected umber of 1s i the isertio A B is k E[ν 1 (A B)] = 1 (A) 1 (B j )p j j=1 Whe 1 (B 1 ) = = 1 (B k ) =, we have E[ν 1 (A B)] = 1 (A). I the experimet that ivolves isertio, we used a matrix-valued radom variable such that 1 (B 1 ) = = 1 (B k ) =. Thus, the variability of the values of A B is caused by the variability of the cotets of the matrices B 1,..., B k which ca be evaluated usig the etropy of the distributio of B, k H(B) = p j log 2 p j. j=1 We expect to obtai a strog positive correlatio betwee the etropy of B ad the degree of compressio achieved o the file that represets the matrix A B, ad the experimets support this expectatio. I a first series of compressios, we worked with a matrix A {0, 1} ad with a matrix-valued radom variable ( ) B1 B B : 2 B 3, p 1 p 2 p 3 where B j {0, 1} 3 3, ad 1 (B 1 ) = 1 (B 2 ) = 1 (B 3 ) = 4. Several probability distributios were cosidered, as show i Table II. Values of A B had = etries. I Table II, we had 39% 1s ad the baselie compressio rate for a biary file with this ratio of 1s is We also computed the correlatio betwee the CR jzip ad the Shao etropy of the probability distributio ad obtaied the value for 3 matrices. I Table III, we did the same experimet but with 4 differet matrices of 4 4. A strog correlatio (0.992.) was observed betwee CR jzip ad the Shao etropy of the probability distributio. Table II MATRIX INSERTIONS, ENTROPY AND COMPRESSION RATIOS Probability distributio CR jzip Shao Etropy (0, 1, 0) (1, 0, 0) (0, 0, 1) (0.2, 0.2, 0.6) (0.6, 0.2, 0.2) (0.33, 0.33, 0.34) (0, 0.3, 0.7) (0.9, 0.1, 0) (0.8, 0, 0.2) (0.49, 0.25, 0.26) (0.15, 0.35, 0.5) Table III KRONECKER PRODUCT AND PROBABILITY DISTRIBUTION FOR 4 MATRICES Probability distributio CR jzip Shao Etropy (0, 1,0, 0) (0, 1,0, 0) (0.2, 0.2, 0.2, 0.4) (0.25, 0.25, 0.25, 0.25) (0.4, 0, 0.2, 0.4) (0.3, 0.1, 0.2, 0.4) (0.45, 0.12, 0.22, 0.21) Figure 4. Evolutio CR jzip ad Shao Etropy of Probability Distributio. I Figure 4, we have the evolutio of CR jzip o the y axis ad o the x axis the Shao Etropy of the probability distributio for both experimets. We ca see clearly the liear correlatio betwee the two.
4 This experimet proves us agai that i case of repetitios/patters the CR jzip is better tha i the case of radomly geerated files. Next, we examie the compressibility of biary square matrices ad its relatioship with the distributio of pricipal submatrices. A biary square matrix is compressed by first vectorizig the matrix ad the compressig the biary sequece. The issue is relevat i graph theory, where the pricipal submatrices of the adjacecy matrix of a graph correspod to the adjacecy matrices of the subgraphs of that graph. The patters i a graph are captured i the form of frequet isomorphic subgraphs. There is a strog correlatio betwee the compressio ratio of the adjacecy matrix of a graph ad the frequecies of the occurreces of isomorphic subgraphs of it. Specifically, the lower the compressio ratio is, the higher are the frequecies of isomorphic subgraphs ad hece the worthier is the graph for beig mied. Let G be a udirected graph havig {v 1,..., v } as its set of odes. The adjacecy matrix of G, A G {0, 1} is defied as { 1 if there is a edge betwee v i ad v j i G (A G ) ij = 0 otherwise. We deote with CR C (A G ) the compressio ratio of the adjacecy matrix of graph G obtaied by applyig the compressio algorithm C. Defie the pricipal subcompoet of matrix A G with respect to the set of idices S = {s 1,..., s k } {1, 2,..., } to be the k k matrix A G (S) such that 1 if there is a edge betwee v si ad v sj A G (S) ij = i G 0 otherwise. The matrix A G (S) is the adjacecy matrix of the subgraph of G which cosists of the odes with idices i S alog with those edges that coect these odes. We deote by P (k) the collectio of all subsets of {1, 2,..., } of size k where 2 k. We have P (k) = ( k). Let (M k 1,..., M k l k ) be a eumeratio of possible adjacecy matrices of graphs with k odes where l k = 2 k(k 1) 2. We defie the fiite probability distributio ( ) k 1 P(G, k) = (G ) P (k),..., k l k (G ), P (k) where k i (G ) for 1 i l k is the umber of subgraphs of G with adjacecy matrix M k i. The Shao etropy of this probability distributio is: H P (G, k) = l k i=1 k i (G ) P (k) log k i (G ) 2 P (k). If H P (G, k) is low, there are to be fewer ad larger sets of isomorphic subgraphs of G of size k. I other words, small values of H P (G, k) for various values of k suggest that the graph G cotais repeated patters ad is susceptible to produce iterestig results. Note that although two isomorphic subgraphs do ot ecessarily have the same adjacecy matrix, the umber H P (G, k) is a good idicator of the frequecy of isomorphic subgraphs ad hece subgraph patters. We evaluated the correlatio betwee CR jzip (A G ) ad H P (G, k) for differet values of k. As expected, the compressio ratio of the adjacecy matrix ad the distributio etropy of graphs are roughly the same for isomorphic graphs, so both umbers are characteristic for a isomorphism type. If is a permutatio of the vertices of G, the adjacecy matrix of the graph G obtaied by applyig the permutatio is defied by A G is give by A G = P A G P 1. We compute this adjacecy matrix of A G, the etropy H P (G, k) the compressio ratio CR jzip (A G ) for several values of k ad permutatios. We radomly geerated graphs with = 60 odes ad various umber of edges ragig from 5 to For each geerated graph, we radomly produced twety permutatios of its set of odes ad computed H P (G, k) ad CR jzip (A G ). Fially, for each graph we calculated the ratio of stadard deviatio over average for the computed compressio ratios, followed by the same computatio for distributio etropies. The results of this experimet are show i Figures 5 ad 6 agaist the umber of edges. As it ca be see, the deviatio over mea of the compressio ratios for = 60 does ot exceed the umber Also, the deviatio over average of the distributio etropies for various values of k do ot exceed I particular, the deviatio of the distributio etropy for the graphs of 100 to 1500 edges falls below 0.001, which allows us to coclude that the deviatios of both compressio ratio ad distributio etropy with respect to isomorphisms are egligible. For each k {3, 4, 5}, we geerated radomly 560 graphs havig 60 vertices ad sets of edges whose size were varyig from 10 to The, the umbers H P (G, k) ad CR jzip (A G ) were computed. Figure 7 captures the results of the experimet. Each plot cotais two curves. The first curve represets the chages i average CR jzip (A G ) for forty radomly geerated graphs of equal umber of edges. The secod curve represets the variatio of the average H P (G, k) for the same forty graphs. The treds of these two curves are very similar for differet values of k. Table IV cotais the correlatio betwee CR jzip (A G ) ad H P (G, k) calculated for the 560 radomly geerated graphs for each value of k.
5 Figure 5. Stadard deviatio vs. average of the CR jzip (A G ) for a umber of differet permutatios of odes for the same graph. The horizotal axis is labelled with the umber of edges of the graph. = 60 ad k = 3 Figure 6. Stadard deviatio vs. average of the H P (G, k) of a umber of differet permutatios of odes for the same graph. The horizotal axis is labelled with the umber of edges of the graph. Each curve correspods to oe value of k. IV. FREQUENT ITEMS SETS AND COMPRESSION RATIO = 60 ad k = 4 A market basket data set cosists of a multiset T of trasactios. Each trasactio t is a subset of a set of items I = {i 1,...,i N }. A trasactio is described by its characteristic N-tuple t = (t 1,..., t N ), where { 1 if i k t. t k = 0 otherwise, for 1 k N. The legth of a trasactio t is t = N k=0 t k, while the average size of trasactios is { t t i T } T. The support of a set of items K of the data set T is {t T K t the umber supp(k) = T. The set of items K is s-frequet if supp(k) > s. The study of market basket data sets is cocered with the idetificatio of associatio rules. A pair of item sets (X, Y ) is a associatio rule. Its support, supp(x Y ) equals supp(x) ad its cofidece cof(x Y ) is defied as cof(x Y ) = supp(xy ) supp(x). = 60 ad k = 5 Figure 7. Plots of average CR jzip (A G ) (CMP RTIO) ad average H P (G, k) (DIST ENT) for radomly geerated graphs G of equal umber of edges with respect to the umber of edges.
6 Table IV CORRELATIONS BETWEEN CR jzip (A G ) AND H P (G, k) k Correlatio Usig the artificial trasactio ARMier geerator described i [3], we created a basket data set. Trasactios are represeted by sequeces of bits (t 1,, t N ). The multiset of M trasactios was represeted as a biary strig of legth M N obtaied by cocateatig the strigs that represet trasactios. We geerated files with 1000 trasactios, with 100 items available i the basket, addig up to 100K bits. For data sets havig the same umber of items ad trasactios, the efficiecy of the compressio icreases whe the umber of patters is lower (causig more repetitios). I a experimet with a average size of a frequet item set equal to 10, the average size of a trasactio equal to 15, ad the umber of frequet item sets varyig i the set {5, 10, 20, 30, 50, 75, 100, 200, 500, 1000}, the compressio ratio had a sigificat variatio ragig betwee 0.20 ad 0.75, as show i Table V. The correlatio betwee the umber of patters ad CR was Although the frequecy of 1s ad baselie compressio ratio were roughly costat (at 0.75), the umber of patters ad compressio ratio were correlated. Table V NUMBER OF ASSOCIATION RULES AT 0.05 SUPPORT LEVEL AND 0.9 CONFIDENCE Number of Patters Frequecy of 1s Baselie Compressio Number of assoc. compressio ratio rules 5 16% ,128, % ,539, % ,233, % , % ,910, % , % , % % % ivestigatios show that idetifyig compressible areas of huma DNA is a useful tool for detectig areas where the gee replicatio mechaisms are disturbed (a pheomeo that occurs i certai geetically based diseases. REFERENCES [1] B. J. L. Campaa ad E. J. Keogh. A compressio based distace measure for texture. I SDM, pages , [2] R. Cilibrasi ad P. M. B. Vitayi. Clusterig by compressio. IEEE Trasactios o Iformatio Theory, 51: , [3] L. Cristofor. ARMier project, [4] E. Keogh, S. Loardi, ad C. A. Rataamahataa. Towards parameter-free data miig. I Proc. 10th ACM SIGKDD Itl Cof. Kowledge Discovery ad Data Miig, pages ACM Press, [5] E. Keogh, S. Loardi, C. A. Rataamahataa, L. Wei, S. Lee, ad J. Hadley. Compressio-based data miig of sequetial data. Data Miig ad Kowledge Discovery, 14:99 129, [6] E. J. Keogh, L. Keogh, ad J. Hadley. Compressiobased data miig. I Ecyclopedia of Data Warehousig ad Miig, pages [7] H. Maila. Theoretical frameworks for data miig. SIGKDD Exploratio, 1:30 32, [8] L. Wei, J. Hadley, N. Marti, T. Su, ad E. J. Keogh. Clusterig workflow requiremets usig compressio dissimilarity measure. I ICDM Workshops, pages 50 54, [9] T. Welch. A techique for high performace data compressio. IEEE Computer, 17:8 19, Further, there was a strog egative correlatio (-0.92) betwee the compressio ratio ad the umber of associatio rules idicatig that market basket data sets that satisfy may associatio rules are very compressible V. CONCLUDING REMARKS Compressio ratio of a file ca be computed fast ad easy, ad i may cases offers a cheap way of predictig the existece of embedded patters i data. Thus, it becomes possible to obtai a approximative estimatio of the usefuless of a i-depth exploratio of a data set usig more sophisticated ad expesive algorithms. The use of compressio as a measure of miability is illustrated o a variety of paradigms: graph data, market basket data, etc. Recet
Evaluating Data Minability Through Compression An Experimental Study
Evaluatig Data Miability Through Compressio A Experimetal Study Da Simovici Uiv. of Massachusetts Bosto, Bosto, USA, dsim at cs.umb.edu Da Pletea Uiv. of Massachusetts Bosto, Bosto, USA, dpletea at cs.umb.edu
More informationInformation Theory and Statistics Lecture 4: Lempel-Ziv code
Iformatio Theory ad Statistics Lecture 4: Lempel-Ziv code Łukasz Dębowski ldebowsk@ipipa.waw.pl Ph. D. Programme 203/204 Etropy rate is the limitig compressio rate Theorem For a statioary process (X i)
More informationUC Berkeley CS 170: Efficient Algorithms and Intractable Problems Handout 17 Lecturer: David Wagner April 3, Notes 17 for CS 170
UC Berkeley CS 170: Efficiet Algorithms ad Itractable Problems Hadout 17 Lecturer: David Wager April 3, 2003 Notes 17 for CS 170 1 The Lempel-Ziv algorithm There is a sese i which the Huffma codig was
More informationLecture 14: Graph Entropy
15-859: Iformatio Theory ad Applicatios i TCS Sprig 2013 Lecture 14: Graph Etropy March 19, 2013 Lecturer: Mahdi Cheraghchi Scribe: Euiwoog Lee 1 Recap Bergma s boud o the permaet Shearer s Lemma Number
More informationBasics of Probability Theory (for Theory of Computation courses)
Basics of Probability Theory (for Theory of Computatio courses) Oded Goldreich Departmet of Computer Sciece Weizma Istitute of Sciece Rehovot, Israel. oded.goldreich@weizma.ac.il November 24, 2008 Preface.
More information1 Inferential Methods for Correlation and Regression Analysis
1 Iferetial Methods for Correlatio ad Regressio Aalysis I the chapter o Correlatio ad Regressio Aalysis tools for describig bivariate cotiuous data were itroduced. The sample Pearso Correlatio Coefficiet
More information( ) = p and P( i = b) = q.
MATH 540 Radom Walks Part 1 A radom walk X is special stochastic process that measures the height (or value) of a particle that radomly moves upward or dowward certai fixed amouts o each uit icremet of
More informationThe Boolean Ring of Intervals
MATH 532 Lebesgue Measure Dr. Neal, WKU We ow shall apply the results obtaied about outer measure to the legth measure o the real lie. Throughout, our space X will be the set of real umbers R. Whe ecessary,
More informationCS284A: Representations and Algorithms in Molecular Biology
CS284A: Represetatios ad Algorithms i Molecular Biology Scribe Notes o Lectures 3 & 4: Motif Discovery via Eumeratio & Motif Represetatio Usig Positio Weight Matrix Joshua Gervi Based o presetatios by
More informationTopic 18: Composite Hypotheses
Toc 18: November, 211 Simple hypotheses limit us to a decisio betwee oe of two possible states of ature. This limitatio does ot allow us, uder the procedures of hypothesis testig to address the basic questio:
More informationProperties and Hypothesis Testing
Chapter 3 Properties ad Hypothesis Testig 3.1 Types of data The regressio techiques developed i previous chapters ca be applied to three differet kids of data. 1. Cross-sectioal data. 2. Time series data.
More informationProbability, Expectation Value and Uncertainty
Chapter 1 Probability, Expectatio Value ad Ucertaity We have see that the physically observable properties of a quatum system are represeted by Hermitea operators (also referred to as observables ) such
More informationCEE 522 Autumn Uncertainty Concepts for Geotechnical Engineering
CEE 5 Autum 005 Ucertaity Cocepts for Geotechical Egieerig Basic Termiology Set A set is a collectio of (mutually exclusive) objects or evets. The sample space is the (collectively exhaustive) collectio
More informationOptimally Sparse SVMs
A. Proof of Lemma 3. We here prove a lower boud o the umber of support vectors to achieve geeralizatio bouds of the form which we cosider. Importatly, this result holds ot oly for liear classifiers, but
More informationAxioms of Measure Theory
MATH 532 Axioms of Measure Theory Dr. Neal, WKU I. The Space Throughout the course, we shall let X deote a geeric o-empty set. I geeral, we shall ot assume that ay algebraic structure exists o X so that
More informationBinary classification, Part 1
Biary classificatio, Part 1 Maxim Ragisky September 25, 2014 The problem of biary classificatio ca be stated as follows. We have a radom couple Z = (X,Y ), where X R d is called the feature vector ad Y
More informationInfinite Sequences and Series
Chapter 6 Ifiite Sequeces ad Series 6.1 Ifiite Sequeces 6.1.1 Elemetary Cocepts Simply speakig, a sequece is a ordered list of umbers writte: {a 1, a 2, a 3,...a, a +1,...} where the elemets a i represet
More informationECE 8527: Introduction to Machine Learning and Pattern Recognition Midterm # 1. Vaishali Amin Fall, 2015
ECE 8527: Itroductio to Machie Learig ad Patter Recogitio Midterm # 1 Vaishali Ami Fall, 2015 tue39624@temple.edu Problem No. 1: Cosider a two-class discrete distributio problem: ω 1 :{[0,0], [2,0], [2,2],
More informationExpectation and Variance of a random variable
Chapter 11 Expectatio ad Variace of a radom variable The aim of this lecture is to defie ad itroduce mathematical Expectatio ad variace of a fuctio of discrete & cotiuous radom variables ad the distributio
More informationLecture 2: April 3, 2013
TTIC/CMSC 350 Mathematical Toolkit Sprig 203 Madhur Tulsiai Lecture 2: April 3, 203 Scribe: Shubhedu Trivedi Coi tosses cotiued We retur to the coi tossig example from the last lecture agai: Example. Give,
More informationThis is an introductory course in Analysis of Variance and Design of Experiments.
1 Notes for M 384E, Wedesday, Jauary 21, 2009 (Please ote: I will ot pass out hard-copy class otes i future classes. If there are writte class otes, they will be posted o the web by the ight before class
More informationLinear Regression Demystified
Liear Regressio Demystified Liear regressio is a importat subject i statistics. I elemetary statistics courses, formulae related to liear regressio are ofte stated without derivatio. This ote iteds to
More informationLecture 4 February 16, 2016
MIT 6.854/18.415: Advaced Algorithms Sprig 16 Prof. Akur Moitra Lecture 4 February 16, 16 Scribe: Be Eysebach, Devi Neal 1 Last Time Cosistet Hashig - hash fuctios that evolve well Radom Trees - routig
More information1 of 7 7/16/2009 6:06 AM Virtual Laboratories > 6. Radom Samples > 1 2 3 4 5 6 7 6. Order Statistics Defiitios Suppose agai that we have a basic radom experimet, ad that X is a real-valued radom variable
More informationMath 4707 Spring 2018 (Darij Grinberg): homework set 4 page 1
Math 4707 Sprig 2018 Darij Griberg): homewor set 4 page 1 Math 4707 Sprig 2018 Darij Griberg): homewor set 4 due date: Wedesday 11 April 2018 at the begiig of class, or before that by email or moodle Please
More informationInformation-based Feature Selection
Iformatio-based Feature Selectio Farza Faria, Abbas Kazeroui, Afshi Babveyh Email: {faria,abbask,afshib}@staford.edu 1 Itroductio Feature selectio is a topic of great iterest i applicatios dealig with
More informationOn Random Line Segments in the Unit Square
O Radom Lie Segmets i the Uit Square Thomas A. Courtade Departmet of Electrical Egieerig Uiversity of Califoria Los Ageles, Califoria 90095 Email: tacourta@ee.ucla.edu I. INTRODUCTION Let Q = [0, 1] [0,
More informationOn the Linear Complexity of Feedback Registers
O the Liear Complexity of Feedback Registers A. H. Cha M. Goresky A. Klapper Northeaster Uiversity Abstract I this paper, we study sequeces geerated by arbitrary feedback registers (ot ecessarily feedback
More information4.1 Sigma Notation and Riemann Sums
0 the itegral. Sigma Notatio ad Riema Sums Oe strategy for calculatig the area of a regio is to cut the regio ito simple shapes, calculate the area of each simple shape, ad the add these smaller areas
More informationLecture 9: Expanders Part 2, Extractors
Lecture 9: Expaders Part, Extractors Topics i Complexity Theory ad Pseudoradomess Sprig 013 Rutgers Uiversity Swastik Kopparty Scribes: Jaso Perry, Joh Kim I this lecture, we will discuss further the pseudoradomess
More informationLecture 2. The Lovász Local Lemma
Staford Uiversity Sprig 208 Math 233A: No-costructive methods i combiatorics Istructor: Ja Vodrák Lecture date: Jauary 0, 208 Origial scribe: Apoorva Khare Lecture 2. The Lovász Local Lemma 2. Itroductio
More informationDiscrete Mathematics and Probability Theory Summer 2014 James Cook Note 15
CS 70 Discrete Mathematics ad Probability Theory Summer 2014 James Cook Note 15 Some Importat Distributios I this ote we will itroduce three importat probability distributios that are widely used to model
More information4 The Sperner property.
4 The Sperer property. I this sectio we cosider a surprisig applicatio of certai adjacecy matrices to some problems i extremal set theory. A importat role will also be played by fiite groups. I geeral,
More informationLecture 4: April 10, 2013
TTIC/CMSC 1150 Mathematical Toolkit Sprig 01 Madhur Tulsiai Lecture 4: April 10, 01 Scribe: Haris Agelidakis 1 Chebyshev s Iequality recap I the previous lecture, we used Chebyshev s iequality to get a
More informationConfidence interval for the two-parameter exponentiated Gumbel distribution based on record values
Iteratioal Joural of Applied Operatioal Research Vol. 4 No. 1 pp. 61-68 Witer 2014 Joural homepage: www.ijorlu.ir Cofidece iterval for the two-parameter expoetiated Gumbel distributio based o record values
More informationBinary codes from graphs on triples and permutation decoding
Biary codes from graphs o triples ad permutatio decodig J. D. Key Departmet of Mathematical Scieces Clemso Uiversity Clemso SC 29634 U.S.A. J. Moori ad B. G. Rodrigues School of Mathematics Statistics
More informationDiscrete Mathematics and Probability Theory Spring 2012 Alistair Sinclair Note 15
CS 70 Discrete Mathematics ad Probability Theory Sprig 2012 Alistair Siclair Note 15 Some Importat Distributios The first importat distributio we leared about i the last Lecture Note is the biomial distributio
More informationEECS564 Estimation, Filtering, and Detection Hwk 2 Solns. Winter p θ (z) = (2θz + 1 θ), 0 z 1
EECS564 Estimatio, Filterig, ad Detectio Hwk 2 Sols. Witer 25 4. Let Z be a sigle observatio havig desity fuctio where. p (z) = (2z + ), z (a) Assumig that is a oradom parameter, fid ad plot the maximum
More informationLecture 11: Pseudorandom functions
COM S 6830 Cryptography Oct 1, 2009 Istructor: Rafael Pass 1 Recap Lecture 11: Pseudoradom fuctios Scribe: Stefao Ermo Defiitio 1 (Ge, Ec, Dec) is a sigle message secure ecryptio scheme if for all uppt
More informationSensitivity Analysis of Daubechies 4 Wavelet Coefficients for Reduction of Reconstructed Image Error
Proceedigs of the 6th WSEAS Iteratioal Coferece o SIGNAL PROCESSING, Dallas, Texas, USA, March -4, 7 67 Sesitivity Aalysis of Daubechies 4 Wavelet Coefficiets for Reductio of Recostructed Image Error DEVINDER
More informationAnalysis of Algorithms. Introduction. Contents
Itroductio The focus of this module is mathematical aspects of algorithms. Our mai focus is aalysis of algorithms, which meas evaluatig efficiecy of algorithms by aalytical ad mathematical methods. We
More informationAn Introduction to Randomized Algorithms
A Itroductio to Radomized Algorithms The focus of this lecture is to study a radomized algorithm for quick sort, aalyze it usig probabilistic recurrece relatios, ad also provide more geeral tools for aalysis
More informationExercises Advanced Data Mining: Solutions
Exercises Advaced Data Miig: Solutios Exercise 1 Cosider the followig directed idepedece graph. 5 8 9 a) Give the factorizatio of P (X 1, X 2,..., X 9 ) correspodig to this idepedece graph. P (X) = 9 P
More information1 Statement of the Game
ANALYSIS OF THE CHOW-ROBBINS GAME JON LU May 10, 2016 Abstract Flip a coi repeatedly ad stop wheever you wat. Your payoff is the proportio of heads ad you wish to maximize this payoff i expectatio. I this
More informationChapter 22. Comparing Two Proportions. Copyright 2010, 2007, 2004 Pearson Education, Inc.
Chapter 22 Comparig Two Proportios Copyright 2010, 2007, 2004 Pearso Educatio, Ic. Comparig Two Proportios Read the first two paragraphs of pg 504. Comparisos betwee two percetages are much more commo
More informationRandom Variables, Sampling and Estimation
Chapter 1 Radom Variables, Samplig ad Estimatio 1.1 Itroductio This chapter will cover the most importat basic statistical theory you eed i order to uderstad the ecoometric material that will be comig
More informationSkip lists: A randomized dictionary
Discrete Math for Bioiformatics WS 11/12:, by A. Bocmayr/K. Reiert, 31. Otober 2011, 09:53 3001 Sip lists: A radomized dictioary The expositio is based o the followig sources, which are all recommeded
More informationOverview. p 2. Chapter 9. Pooled Estimate of. q = 1 p. Notation for Two Proportions. Inferences about Two Proportions. Assumptions
Chapter 9 Slide Ifereces from Two Samples 9- Overview 9- Ifereces about Two Proportios 9- Ifereces about Two Meas: Idepedet Samples 9-4 Ifereces about Matched Pairs 9-5 Comparig Variatio i Two Samples
More information6.895 Essential Coding Theory October 20, Lecture 11. This lecture is focused in comparisons of the following properties/parameters of a code:
6.895 Essetial Codig Theory October 0, 004 Lecture 11 Lecturer: Madhu Suda Scribe: Aastasios Sidiropoulos 1 Overview This lecture is focused i comparisos of the followig properties/parameters of a code:
More informationOn a Smarandache problem concerning the prime gaps
O a Smaradache problem cocerig the prime gaps Felice Russo Via A. Ifate 7 6705 Avezzao (Aq) Italy felice.russo@katamail.com Abstract I this paper, a problem posed i [] by Smaradache cocerig the prime gaps
More informationDistribution of Random Samples & Limit theorems
STAT/MATH 395 A - PROBABILITY II UW Witer Quarter 2017 Néhémy Lim Distributio of Radom Samples & Limit theorems 1 Distributio of i.i.d. Samples Motivatig example. Assume that the goal of a study is to
More informationConvergence of random variables. (telegram style notes) P.J.C. Spreij
Covergece of radom variables (telegram style otes).j.c. Spreij this versio: September 6, 2005 Itroductio As we kow, radom variables are by defiitio measurable fuctios o some uderlyig measurable space
More informationShannon s noiseless coding theorem
18.310 lecture otes May 4, 2015 Shao s oiseless codig theorem Lecturer: Michel Goemas I these otes we discuss Shao s oiseless codig theorem, which is oe of the foudig results of the field of iformatio
More informationOPTIMAL ALGORITHMS -- SUPPLEMENTAL NOTES
OPTIMAL ALGORITHMS -- SUPPLEMENTAL NOTES Peter M. Maurer Why Hashig is θ(). As i biary search, hashig assumes that keys are stored i a array which is idexed by a iteger. However, hashig attempts to bypass
More informationCS322: Network Analysis. Problem Set 2 - Fall 2009
Due October 9 009 i class CS3: Network Aalysis Problem Set - Fall 009 If you have ay questios regardig the problems set, sed a email to the course assistats: simlac@staford.edu ad peleato@staford.edu.
More information11 Correlation and Regression
11 Correlatio Regressio 11.1 Multivariate Data Ofte we look at data where several variables are recorded for the same idividuals or samplig uits. For example, at a coastal weather statio, we might record
More informationFortgeschrittene Datenstrukturen Vorlesung 11
Fortgeschrittee Datestruture Vorlesug 11 Schriftführer: Marti Weider 19.01.2012 1 Succict Data Structures (ctd.) 1.1 Select-Queries A slightly differet approach, compared to ra, is used for select. B represets
More informationChapter 3. Strong convergence. 3.1 Definition of almost sure convergence
Chapter 3 Strog covergece As poited out i the Chapter 2, there are multiple ways to defie the otio of covergece of a sequece of radom variables. That chapter defied covergece i probability, covergece i
More informationClassification of problem & problem solving strategies. classification of time complexities (linear, logarithmic etc)
Classificatio of problem & problem solvig strategies classificatio of time complexities (liear, arithmic etc) Problem subdivisio Divide ad Coquer strategy. Asymptotic otatios, lower boud ad upper boud:
More information# fixed points of g. Tree to string. Repeatedly select the leaf with the smallest label, write down the label of its neighbour and remove the leaf.
Combiatorics Graph Theory Coutig labelled ad ulabelled graphs There are 2 ( 2) labelled graphs of order. The ulabelled graphs of order correspod to orbits of the actio of S o the set of labelled graphs.
More informationThe method of types. PhD short course Information Theory and Statistics Siena, September, Mauro Barni University of Siena
PhD short course Iformatio Theory ad Statistics Siea, 15-19 September, 2014 The method of types Mauro Bari Uiversity of Siea Outlie of the course Part 1: Iformatio theory i a utshell Part 2: The method
More informationEntropies & Information Theory
Etropies & Iformatio Theory LECTURE I Nilajaa Datta Uiversity of Cambridge,U.K. For more details: see lecture otes (Lecture 1- Lecture 5) o http://www.qi.damtp.cam.ac.uk/ode/223 Quatum Iformatio Theory
More information4.1 Data processing inequality
ECE598: Iformatio-theoretic methods i high-dimesioal statistics Sprig 206 Lecture 4: Total variatio/iequalities betwee f-divergeces Lecturer: Yihog Wu Scribe: Matthew Tsao, Feb 8, 206 [Ed. Mar 22] Recall
More information4.3 Growth Rates of Solutions to Recurrences
4.3. GROWTH RATES OF SOLUTIONS TO RECURRENCES 81 4.3 Growth Rates of Solutios to Recurreces 4.3.1 Divide ad Coquer Algorithms Oe of the most basic ad powerful algorithmic techiques is divide ad coquer.
More informationStatistical inference: example 1. Inferential Statistics
Statistical iferece: example 1 Iferetial Statistics POPULATION SAMPLE A clothig store chai regularly buys from a supplier large quatities of a certai piece of clothig. Each item ca be classified either
More informationt distribution [34] : used to test a mean against an hypothesized value (H 0 : µ = µ 0 ) or the difference
EXST30 Backgroud material Page From the textbook The Statistical Sleuth Mea [0]: I your text the word mea deotes a populatio mea (µ) while the work average deotes a sample average ( ). Variace [0]: The
More informationPrivacy of Dependent Users Against Statistical Matching
Privacy of Depedet Users Agaist Statistical Matchig Nazai Takbiri Studet Member, IEEE, Amir Houmasadr Member, IEEE, Deis Goeckel Fellow, IEEE, Hossei Pishro-Nik Member, IEEE, Abstract Moder applicatios
More informationEntropy and Ergodic Theory Lecture 5: Joint typicality and conditional AEP
Etropy ad Ergodic Theory Lecture 5: Joit typicality ad coditioal AEP 1 Notatio: from RVs back to distributios Let (Ω, F, P) be a probability space, ad let X ad Y be A- ad B-valued discrete RVs, respectively.
More informationMath 155 (Lecture 3)
Math 55 (Lecture 3) September 8, I this lecture, we ll cosider the aswer to oe of the most basic coutig problems i combiatorics Questio How may ways are there to choose a -elemet subset of the set {,,,
More informationMachine Learning Theory Tübingen University, WS 2016/2017 Lecture 11
Machie Learig Theory Tübige Uiversity, WS 06/07 Lecture Tolstikhi Ilya Abstract We will itroduce the otio of reproducig kerels ad associated Reproducig Kerel Hilbert Spaces (RKHS). We will cosider couple
More informationLecture 12: November 13, 2018
Mathematical Toolkit Autum 2018 Lecturer: Madhur Tulsiai Lecture 12: November 13, 2018 1 Radomized polyomial idetity testig We will use our kowledge of coditioal probability to prove the followig lemma,
More informationChapter 6: Mining Frequent Patterns, Association and Correlations
Chapter 6: Miig Frequet Patters, Associatio ad Correlatios Basic cocepts Frequet itemset miig methods Costrait-based frequet patter miig (ch7) Associatio rules 1 What Is Frequet Patter Aalysis? Frequet
More informationLecture 10: Universal coding and prediction
0-704: Iformatio Processig ad Learig Sprig 0 Lecture 0: Uiversal codig ad predictio Lecturer: Aarti Sigh Scribes: Georg M. Goerg Disclaimer: These otes have ot bee subjected to the usual scrutiy reserved
More informationFACULTY OF MATHEMATICAL STUDIES MATHEMATICS FOR PART I ENGINEERING. Lectures
FACULTY OF MATHEMATICAL STUDIES MATHEMATICS FOR PART I ENGINEERING Lectures MODULE 5 STATISTICS II. Mea ad stadard error of sample data. Biomial distributio. Normal distributio 4. Samplig 5. Cofidece itervals
More informationHOMEWORK 2 SOLUTIONS
HOMEWORK SOLUTIONS CSE 55 RANDOMIZED AND APPROXIMATION ALGORITHMS 1. Questio 1. a) The larger the value of k is, the smaller the expected umber of days util we get all the coupos we eed. I fact if = k
More informationLecture 16: Monotone Formula Lower Bounds via Graph Entropy. 2 Monotone Formula Lower Bounds via Graph Entropy
15-859: Iformatio Theory ad Applicatios i TCS CMU: Sprig 2013 Lecture 16: Mootoe Formula Lower Bouds via Graph Etropy March 26, 2013 Lecturer: Mahdi Cheraghchi Scribe: Shashak Sigh 1 Recap Graph Etropy:
More informationRank Modulation with Multiplicity
Rak Modulatio with Multiplicity Axiao (Adrew) Jiag Computer Sciece ad Eg. Dept. Texas A&M Uiversity College Statio, TX 778 ajiag@cse.tamu.edu Abstract Rak modulatio is a scheme that uses the relative order
More informationMath 61CM - Solutions to homework 3
Math 6CM - Solutios to homework 3 Cédric De Groote October 2 th, 208 Problem : Let F be a field, m 0 a fixed oegative iteger ad let V = {a 0 + a x + + a m x m a 0,, a m F} be the vector space cosistig
More information1036: Probability & Statistics
036: Probability & Statistics Lecture 0 Oe- ad Two-Sample Tests of Hypotheses 0- Statistical Hypotheses Decisio based o experimetal evidece whether Coffee drikig icreases the risk of cacer i humas. A perso
More informationGeneral IxJ Contingency Tables
page1 Geeral x Cotigecy Tables We ow geeralize our previous results from the prospective, retrospective ad cross-sectioal studies ad the Poisso samplig case to x cotigecy tables. For such tables, the test
More informationMASSACHUSETTS INSTITUTE OF TECHNOLOGY 6.436J/15.085J Fall 2008 Lecture 19 11/17/2008 LAWS OF LARGE NUMBERS II THE STRONG LAW OF LARGE NUMBERS
MASSACHUSTTS INSTITUT OF TCHNOLOGY 6.436J/5.085J Fall 2008 Lecture 9 /7/2008 LAWS OF LARG NUMBRS II Cotets. The strog law of large umbers 2. The Cheroff boud TH STRONG LAW OF LARG NUMBRS While the weak
More information(6) Fundamental Sampling Distribution and Data Discription
34 Stat Lecture Notes (6) Fudametal Samplig Distributio ad Data Discriptio ( Book*: Chapter 8,pg5) Probability& Statistics for Egieers & Scietists By Walpole, Myers, Myers, Ye 8.1 Radom Samplig: Populatio:
More informationThe Method of Least Squares. To understand least squares fitting of data.
The Method of Least Squares KEY WORDS Curve fittig, least square GOAL To uderstad least squares fittig of data To uderstad the least squares solutio of icosistet systems of liear equatios 1 Motivatio Curve
More informationLecture 3 The Lebesgue Integral
Lecture 3: The Lebesgue Itegral 1 of 14 Course: Theory of Probability I Term: Fall 2013 Istructor: Gorda Zitkovic Lecture 3 The Lebesgue Itegral The costructio of the itegral Uless expressly specified
More informationSection 5.1 The Basics of Counting
1 Sectio 5.1 The Basics of Coutig Combiatorics, the study of arragemets of objects, is a importat part of discrete mathematics. I this chapter, we will lear basic techiques of coutig which has a lot of
More informationAN INTRODUCTION TO SPECTRAL GRAPH THEORY
AN INTRODUCTION TO SPECTRAL GRAPH THEORY JIAQI JIANG Abstract. Spectral graph theory is the study of properties of the Laplacia matrix or adjacecy matrix associated with a graph. I this paper, we focus
More informationÉCOLE POLYTECHNIQUE FÉDÉRALE DE LAUSANNE
ÉCOLE POLYTECHNIQUE FÉDÉRALE DE LAUSANNE School of Computer ad Commuicatio Scieces Hadout Iformatio Theory ad Sigal Processig Compressio ad Quatizatio November 0, 207 Data compressio Notatio Give a set
More information7. Modern Techniques. Data Encryption Standard (DES)
7. Moder Techiques. Data Ecryptio Stadard (DES) The objective of this chapter is to illustrate the priciples of moder covetioal ecryptio. For this purpose, we focus o the most widely used covetioal ecryptio
More informationData Analysis and Statistical Methods Statistics 651
Data Aalysis ad Statistical Methods Statistics 651 http://www.stat.tamu.edu/~suhasii/teachig.html Suhasii Subba Rao Review of testig: Example The admistrator of a ursig home wats to do a time ad motio
More informationThere is no straightforward approach for choosing the warmup period l.
B. Maddah INDE 504 Discrete-Evet Simulatio Output Aalysis () Statistical Aalysis for Steady-State Parameters I a otermiatig simulatio, the iterest is i estimatig the log ru steady state measures of performace.
More informationCHAPTER 5. Theory and Solution Using Matrix Techniques
A SERIES OF CLASS NOTES FOR 2005-2006 TO INTRODUCE LINEAR AND NONLINEAR PROBLEMS TO ENGINEERS, SCIENTISTS, AND APPLIED MATHEMATICIANS DE CLASS NOTES 3 A COLLECTION OF HANDOUTS ON SYSTEMS OF ORDINARY DIFFERENTIAL
More information4. Partial Sums and the Central Limit Theorem
1 of 10 7/16/2009 6:05 AM Virtual Laboratories > 6. Radom Samples > 1 2 3 4 5 6 7 4. Partial Sums ad the Cetral Limit Theorem The cetral limit theorem ad the law of large umbers are the two fudametal theorems
More informationResponse Variable denoted by y it is the variable that is to be predicted measure of the outcome of an experiment also called the dependent variable
Statistics Chapter 4 Correlatio ad Regressio If we have two (or more) variables we are usually iterested i the relatioship betwee the variables. Associatio betwee Variables Two variables are associated
More informationSummary: CORRELATION & LINEAR REGRESSION. GC. Students are advised to refer to lecture notes for the GC operations to obtain scatter diagram.
Key Cocepts: 1) Sketchig of scatter diagram The scatter diagram of bivariate (i.e. cotaiig two variables) data ca be easily obtaied usig GC. Studets are advised to refer to lecture otes for the GC operatios
More informationRandomized Algorithms I, Spring 2018, Department of Computer Science, University of Helsinki Homework 1: Solutions (Discussed January 25, 2018)
Radomized Algorithms I, Sprig 08, Departmet of Computer Sciece, Uiversity of Helsiki Homework : Solutios Discussed Jauary 5, 08). Exercise.: Cosider the followig balls-ad-bi game. We start with oe black
More informationSome Explicit Formulae of NAF and its Left-to-Right. Analogue Based on Booth Encoding
Vol.7, No.6 (01, pp.69-74 http://dx.doi.org/10.1457/ijsia.01.7.6.7 Some Explicit Formulae of NAF ad its Left-to-Right Aalogue Based o Booth Ecodig Dog-Guk Ha, Okyeo Yi, ad Tsuyoshi Takagi Kookmi Uiversity,
More informationSEQUENCES AND SERIES
9 SEQUENCES AND SERIES INTRODUCTION Sequeces have may importat applicatios i several spheres of huma activities Whe a collectio of objects is arraged i a defiite order such that it has a idetified first
More informationVector Quantization: a Limiting Case of EM
. Itroductio & defiitios Assume that you are give a data set X = { x j }, j { 2,,, }, of d -dimesioal vectors. The vector quatizatio (VQ) problem requires that we fid a set of prototype vectors Z = { z
More informationENGI 4421 Confidence Intervals (Two Samples) Page 12-01
ENGI 44 Cofidece Itervals (Two Samples) Page -0 Two Sample Cofidece Iterval for a Differece i Populatio Meas [Navidi sectios 5.4-5.7; Devore chapter 9] From the cetral limit theorem, we kow that, for sufficietly
More informationContext-free grammars and. Basics of string generation methods
Cotext-free grammars ad laguages Basics of strig geeratio methods What s so great about regular expressios? A regular expressio is a strig represetatio of a regular laguage This allows the storig a whole
More information