Nonlinear Estimators and Tail Bounds for Dimension Reduction in l1 Using Cauchy Random Projections


Journal of Machine Learning Research (2007)    Submitted 10/06; Published 2007

Nonlinear Estimators and Tail Bounds for Dimension Reduction in l1 Using Cauchy Random Projections

Ping Li (PINGLI@CORNELL.EDU)
Department of Statistical Science, Faculty of Computing and Information Science, Cornell University, Ithaca, NY 14853, USA

Trevor J. Hastie (HASTIE@STANFORD.EDU)
Department of Statistics, Stanford University, Stanford, CA 94305, USA

Kenneth W. Church (CHURCH@MICROSOFT.COM)
Microsoft Research, Microsoft Corporation, Redmond, WA 98052, USA

Editor: Sam Roweis

Abstract

For dimension reduction in the l1 norm, the method of Cauchy random projections multiplies the original data matrix A ∈ R^{n×D} with a random matrix R ∈ R^{D×k} (k ≪ D) whose entries are i.i.d. samples of the standard Cauchy C(0, 1). Because of the impossibility result, one can not hope to recover the pairwise l1 distances in A from B = AR ∈ R^{n×k} using linear estimators without incurring large errors. However, nonlinear estimators are still useful for certain applications in data stream computations, information retrieval, learning, and data mining. We study three types of nonlinear estimators: the sample median estimators, the geometric mean estimators, and the maximum likelihood estimators (MLE). We derive tail bounds for the geometric mean estimators and establish that k = O(log n / ǫ²) suffices, with the constants explicitly given. Asymptotically (as k → ∞), both the sample median and the geometric mean estimators are about 80% efficient compared to the MLE. We analyze the moments of the MLE and propose approximating its distribution by an inverse Gaussian.

Keywords: dimension reduction, l1 norm, Johnson-Lindenstrauss (JL) Lemma, Cauchy random projections

1. Introduction

There has been considerable interest in the l1 norm in statistics and machine learning, as it is now well-known that the l1 distance is far more robust than the l2 distance against outliers (Huber, 1981). It is sometimes good practice to replace the l2 norm minimization with the l1 norm minimization, for example, in the Least Absolute Deviation (LAD) Boost (Friedman, 2001).

A preliminary version of this paper appeared in COLT 2007 (Li et al., 2007b).

(c) 2007 Ping Li, Trevor J. Hastie and Kenneth W. Church.

Chapelle et al. (1999) demonstrated that using the l1 (Laplacian) radial basis kernel produced better classification results than the usual l2 (Gaussian) radial basis kernel in their histogram-based image classification project using support vector machines (SVM). Recently, it has also become popular to use the l1 norm for variable (feature) selection; success stories include LASSO (Tibshirani, 1996), LARS (Efron et al., 2004), and 1-norm SVM (Zhu et al., 2003).

This paper focuses on dimension reduction in the l1 norm, in particular on the method based on Cauchy random projections, which is a special case of linear stable random projections (Johnson and Schechtman, 1982; Indyk, 2000, 2006; Li, 2008). The idea of linear random projections is to multiply the original data matrix A ∈ R^{n×D} with a random projection matrix R ∈ R^{D×k}, resulting in a projected matrix B = AR ∈ R^{n×k}. We would like k to be as small as possible. If k ≪ D, then it should be much more efficient to compute certain summary statistics (e.g., pairwise distances) from B as opposed to A. Moreover, B may be small enough to reside in physical memory while A is often too large to fit in the main memory.

The choice of the random projection matrix R depends on which norm we would like to work with. For dimension reduction in l_p (0 < p ≤ 2), it is common practice to construct R from i.i.d. samples of p-stable distributions (Johnson and Schechtman, 1982; Indyk, 2000, 2006; Li, 2008). In the stable distribution family (Zolotarev, 1986), normal is 2-stable and Cauchy is 1-stable. Thus, we will call random projections for l2 and l1 normal random projections and Cauchy random projections, respectively.

In normal random projections (Vempala, 2004), we can estimate the original pairwise l2 distances in A directly using the corresponding l2 distances in B (up to a normalizing constant). Furthermore, the Johnson-Lindenstrauss (JL) Lemma (Johnson and Lindenstrauss, 1984) provides the performance guarantee. We will review normal random projections in more detail in Section 2.

For Cauchy random projections, however, one shall not use the l1 distance in B to approximate the original l1 distance in A, as the Cauchy distribution does not even have a finite first moment. The impossibility results (Brinkman and Charikar, 2003; Lee and Naor, 2004; Brinkman and Charikar, 2005) have proved that one can not hope to recover the l1 distance using linear projections and linear estimators (e.g., the sample mean) without incurring large errors. Fortunately, the impossibility results do not rule out nonlinear estimators, which may still be useful in certain applications in data stream computations, information retrieval, learning, and data mining.

In this paper, we study three types of nonlinear estimators: the sample median estimators, the geometric mean estimators, and the maximum likelihood estimators (MLE). The sample median and the geometric mean estimators are asymptotically (as k → ∞) equivalent (i.e., both are about 80% efficient as the MLE), but the latter is more accurate at small sample size k. Furthermore, we derive explicit tail bounds for the geometric mean estimators and establish an analog of the JL Lemma for dimension reduction in l1. This analog of the JL Lemma for l1 is weaker than the classical JL Lemma for l2, as the geometric mean is not convex and hence is not a metric. Many efficient algorithms, such as some sub-linear time (using super-linear memory) nearest neighbor algorithms (Shakhnarovich et al., 2005), rely on metric properties (e.g., the triangle inequality).

Nevertheless, nonlinear estimators may still be useful in important scenarios.

- Estimating l1 distances online. The original data matrix A ∈ R^{n×D} requires O(nD) storage space, and hence it is often too large for physical memory. The storage cost of materializing all pairwise distances is O(n²), which may also be too large for the memory. For example, in information retrieval, n could be the total number of word types or documents at Web scale. To avoid page faults, it may be more efficient to estimate the distances on the fly from the projected data matrix B in the memory.

- Computing all pairwise l1 distances. In distance-based clustering, classification, and kernels (e.g., for SVM), we need to compute all pairwise distances in A, at the cost of time O(n²D), which can be prohibitive, especially when A does not fit in the memory. Using Cauchy random projections, the cost is reduced to O(nDk + n²k).

- Linear scan nearest neighbor searching. Nearest neighbor searching is notorious for being inefficient, especially when the data matrix A is too large for the memory. Searching for the nearest neighbors from the projected data matrix B (which is in the memory) becomes much more efficient, even by linear scans. The cost of searching for the nearest neighbor of one data point is reduced from O(nD) to O(nk).

- Data stream computations. Massive data streams come from Internet routers, phone switches, atmospheric observations, sensor networks, highway traffic, finance data, and more (Henzinger et al., 1999; Feigenbaum et al., 1999; Indyk, 2000; Babcock et al., 2002; Cormode et al., 2002). Unlike in traditional databases, it is not common to store massive data streams, and hence the processing is often done on the fly. In data stream computations, Cauchy random projections can be used for (A) approximating the l1 frequency moments of individual streams and (B) approximating the l1 differences between a pair of streams.

We briefly comment on random coordinate sampling, another strategy for dimension reduction. One can randomly sample k columns from A ∈ R^{n×D} and estimate the summary statistics (including l1 and l2 distances). Despite its simplicity, this strategy has two major drawbacks. First, in heavy-tailed data, one may have to choose k very large in order to achieve sufficient accuracy. Second, large datasets are often highly sparse, for example, text data (Dhillon and Modha, 2001) and market-basket data (Aggarwal and Wolf, 1999; Strehl and Ghosh, 2000). For sparse data, Li and Church (2005, 2007) and Li et al. (2007a) provided an alternative coordinate sampling strategy, called Conditional Random Sampling (CRS). For non-sparse data, however, methods based on linear stable random projections are superior.

The rest of the paper is organized as follows. Section 2 reviews linear random projections. Section 3 summarizes the main results for the three types of nonlinear estimators. Section 4 presents the sample median estimators. Section 5 concerns the geometric mean estimators. Section 6 is devoted to the maximum likelihood estimators. Section 7 concludes the paper.

2. Introduction to Linear Stable Random Projections

We give a review of linear random projections, including normal and Cauchy random projections. Denote the original data matrix by A ∈ R^{n×D}, i.e., n data points in D dimensions. Let {u_i^T}_{i=1}^{n} ⊂ R^D be the rows of A. Let R ∈ R^{D×k} be a projection matrix and denote the entries of R by {r_ij}, i = 1, ..., D, j = 1, ..., k. The projected data matrix is B = AR ∈ R^{n×k}. Let {v_i^T}_{i=1}^{n} ⊂ R^k be the rows of B, i.e., v_i = R^T u_i.
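To make the setup concrete, here is a minimal sketch (not from the original paper) in Python/NumPy; the matrix sizes, the random seed, and the synthetic data matrix A are arbitrary illustrative choices:

    import numpy as np

    rng = np.random.default_rng(0)
    n, D, k = 5, 10000, 100                 # illustrative sizes with k << D

    A = rng.standard_normal((n, D))         # stand-in for the original data matrix A (n x D)
    R = rng.standard_cauchy(size=(D, k))    # entries r_ij ~ C(0, 1), i.i.d.
    B = A @ R                               # projected data matrix B = A R (n x k)

    v1, v2 = B[0], B[1]                     # leading two rows of B
    x = v1 - v2                             # x_j = v_{1,j} - v_{2,j}, j = 1, ..., k
    d = np.abs(A[0] - A[1]).sum()           # original l1 distance (unknown in practice)

The projected differences x above are exactly the samples {x_j} used by the estimators studied in this paper.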

For simplicity, we focus on the leading two rows, u_1 and u_2, in A, and the leading two rows, v_1 and v_2, in B. Define {x_j}_{j=1}^{k} to be

\[ x_j = v_{1,j} - v_{2,j} = \sum_{i=1}^{D} r_{ij}\,(u_{1,i} - u_{2,i}), \qquad j = 1, 2, \ldots, k. \]

If we sample r_ij i.i.d. from a p-stable distribution (Zolotarev, 1986), then the x_j's are also i.i.d. samples of a p-stable distribution with a different scale parameter. In the family of stable distributions, normal (p = 2) and Cauchy (p = 1) are two important special cases.

2.1 Normal Random Projections

When r_ij is sampled from the standard normal, i.e., r_ij ~ N(0, 1), i.i.d., then

\[ x_j = v_{1,j} - v_{2,j} = \sum_{i=1}^{D} r_{ij}\,(u_{1,i} - u_{2,i}) \sim N\Big(0, \sum_{i=1}^{D} (u_{1,i} - u_{2,i})^2\Big), \qquad j = 1, 2, \ldots, k, \]

because a weighted sum of normals is also normal. Denote the squared l2 distance between u_1 and u_2 by

\[ d_{l_2} = \|u_1 - u_2\|_2^2 = \sum_{i=1}^{D} (u_{1,i} - u_{2,i})^2. \]

We can estimate d_{l_2} from the sample squared l2 distance (i.e., the sample mean)

\[ \hat{d}_{l_2} = \frac{1}{k}\sum_{j=1}^{k} x_j^2. \]

Note that k ˆd_{l_2} / d_{l_2} follows a chi-square distribution with k degrees of freedom, χ²_k.
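As an illustration (again not part of the paper), a minimal sketch of this sample-mean estimator with synthetic vectors u_1, u_2:

    import numpy as np

    rng = np.random.default_rng(0)
    D, k = 10000, 100
    u1, u2 = rng.standard_normal(D), rng.standard_normal(D)   # synthetic data vectors

    R = rng.standard_normal((D, k))      # r_ij ~ N(0, 1), i.i.d.
    x = (u1 - u2) @ R                    # x_j ~ N(0, d_l2), i.i.d.
    d_l2_hat = np.mean(x ** 2)           # sample-mean estimator of the squared l2 distance
    d_l2 = np.sum((u1 - u2) ** 2)        # true squared l2 distance, for comparison
    print(d_l2_hat / d_l2)               # close to 1; k * d_l2_hat / d_l2 is chi-square(k)

The chi-square law of k ˆd_{l_2}/d_{l_2} is what drives the tail bounds below.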

Therefore, it is easy to prove the following tail bounds.

Lemma 1

\[ \Pr\big(\hat{d}_{l_2} - d_{l_2} \ge ǫ d_{l_2}\big) \le \exp\Big(-\frac{k}{2}\big(ǫ - \log(1+ǫ)\big)\Big) = \exp\Big(-k\,\frac{ǫ^2}{G_R}\Big), \qquad ǫ > 0, \]

\[ \Pr\big(\hat{d}_{l_2} - d_{l_2} \le -ǫ d_{l_2}\big) \le \exp\Big(-\frac{k}{2}\big(-ǫ - \log(1-ǫ)\big)\Big) = \exp\Big(-k\,\frac{ǫ^2}{G_L}\Big), \qquad 0 < ǫ < 1, \]

where the constants are

\[ G_R = \frac{2ǫ^2}{ǫ - \log(1+ǫ)} \le \frac{4}{1 - 2ǫ/3}, \qquad G_L = \frac{2ǫ^2}{-ǫ - \log(1-ǫ)} \le \frac{4}{1 + 2ǫ/3}. \]

Proof. Using the standard Chernoff inequality (Chernoff, 1952), for 0 < t < 1/2,

\[ \Pr\big(\hat{d}_{l_2} - d_{l_2} \ge ǫ d_{l_2}\big) = \Pr\big(k\hat{d}_{l_2}/d_{l_2} \ge k(1+ǫ)\big) \le E\big(\exp(t\,k\hat{d}_{l_2}/d_{l_2})\big)\exp\big(-(1+ǫ)kt\big) = (1-2t)^{-k/2}\exp\big(-(1+ǫ)kt\big), \]

which is minimized at t = ǫ/(2(1+ǫ)). Thus, for any ǫ > 0,

\[ \Pr\big(\hat{d}_{l_2} - d_{l_2} > ǫ d_{l_2}\big) \le \exp\Big(-\frac{k}{2}\big(ǫ - \log(1+ǫ)\big)\Big). \]

We can similarly prove the other tail bound for Pr(ˆd_{l_2} − d_{l_2} ≤ −ǫ d_{l_2}).

For convenience, sometimes we would like to write the tail bounds in a symmetric form,

\[ \Pr\big(|\hat{d}_{l_2} - d_{l_2}| \ge ǫ d_{l_2}\big) \le 2\exp\Big(-k\,\frac{ǫ^2}{G}\Big), \qquad 0 < ǫ < 1, \]

and we know that it suffices to let G = max{G_R, G_L} ≤ 4/(1 − 2ǫ/3).

Since there are in total n(n−1)/2 < n²/2 pairs among n data points, we would like to bound the tail probabilities simultaneously for all pairs. By the Bonferroni union bound, it suffices if

\[ n^2\exp\Big(-k\,\frac{ǫ^2}{G}\Big) \le \delta, \qquad \text{i.e., it suffices if} \qquad k \ge G\,\frac{2\log n - \log\delta}{ǫ^2}. \]

Therefore, we obtain one version of the Johnson-Lindenstrauss (JL) Lemma:

Lemma 2 If k ≥ G (2 log n − log δ)/ǫ², where G = 4/(1 − 2ǫ/3), then with probability at least 1 − δ, the squared l2 distance between any pair of data points among the n data points can be approximated within a 1 ± ǫ factor (0 < ǫ < 1), using the squared l2 distance of the projected data after normal random projections.

Many versions of the JL Lemma have been proved (Johnson and Lindenstrauss, 1984; Frankl and Maehara, 1987; Indyk and Motwani, 1998; Arriaga and Vempala, 1999; Dasgupta and Gupta, 2003; Indyk, 2000, 2001; Achlioptas, 2003; Arriaga and Vempala, 2006; Ailon and Chazelle, 2006).
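For illustration, a small helper (our own, based on the constant G as reconstructed above) that evaluates the sample size suggested by Lemma 2:

    import numpy as np

    def jl_k(n, eps, delta):
        # Sample size from Lemma 2: k >= G (2 log n - log delta) / eps^2,
        # with G = 4 / (1 - 2 eps / 3) as reconstructed above.
        G = 4.0 / (1.0 - 2.0 * eps / 3.0)
        return int(np.ceil(G * (2.0 * np.log(n) - np.log(delta)) / eps ** 2))

    print(jl_k(n=10**6, eps=0.1, delta=0.05))   # on the order of a few thousand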

Note that we do not have to use r_ij ~ N(0, 1) for dimension reduction in l2. For example, we can sample r_ij from the following sparse projection distribution:

\[ r_{ij} = \sqrt{s}\times\begin{cases} 1 & \text{with prob. } \tfrac{1}{2s} \\ 0 & \text{with prob. } 1 - \tfrac{1}{s} \\ -1 & \text{with prob. } \tfrac{1}{2s}. \end{cases} \]

When s ≤ 3, Achlioptas (2001, 2003) proved the JL Lemma for the above sparse projection. Recently, Li et al. (2006b) proposed very sparse random projections, using s ≫ 3 in the above distribution, based on two practical considerations:

- D should be very large, otherwise there would be no need for dimension reduction.
- The original l2 distance should make engineering sense, in that the second or higher moments should be bounded (otherwise various term-weighting schemes will be applied).

Based on these two practical assumptions, the projected data are asymptotically normal at a fast rate of convergence when s = √D and the data have bounded third moments. Of course, very sparse random projections do not have worst-case performance guarantees.

2.2 Cauchy Random Projections

In Cauchy random projections, we sample r_ij i.i.d. from the standard Cauchy distribution, i.e., r_ij ~ C(0, 1). By the 1-stability of the Cauchy distribution (Zolotarev, 1986), we know that

\[ x_j = v_{1,j} - v_{2,j} \sim C\Big(0, \sum_{i=1}^{D} |u_{1,i} - u_{2,i}|\Big). \]

That is, the projected differences x_j = v_{1,j} − v_{2,j} are also Cauchy random variables, with the scale parameter being the l1 distance, d = |u_1 − u_2| = Σ_{i=1}^{D} |u_{1,i} − u_{2,i}|, in the original space. Recall that a Cauchy random variable z ~ C(0, γ) has the density

\[ f(z) = \frac{\gamma}{\pi\,(z^2 + \gamma^2)}, \qquad \gamma > 0,\ -\infty < z < \infty. \]

The easiest way to see the 1-stability is via the characteristic function,

\[ E\big(\exp(\sqrt{-1}\,z t)\big) = \exp(-\gamma|t|), \qquad E\Big(\exp\Big(\sqrt{-1}\,t\sum_{i=1}^{D} c_i z_i\Big)\Big) = \exp\Big(-\gamma|t|\sum_{i=1}^{D}|c_i|\Big), \]

for z_1, z_2, ..., z_D i.i.d. C(0, γ) and any constants c_1, c_2, ..., c_D.

Therefore, in Cauchy random projections, the problem boils down to estimating the Cauchy scale parameter of C(0, d) from k i.i.d. samples x_j ~ C(0, d). Unlike in normal random projections, we can no longer estimate d from the sample mean (e.g., (1/k) Σ_{j=1}^{k} |x_j|) because E|x_j| = ∞.
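The following small simulation (not from the paper; synthetic data and arbitrary sizes) illustrates the 1-stability and why the sample mean of |x_j| is useless while simple nonlinear summaries are not:

    import numpy as np

    rng = np.random.default_rng(0)
    D, k = 10000, 100
    u1, u2 = rng.standard_normal(D), rng.standard_normal(D)
    d = np.abs(u1 - u2).sum()              # true l1 distance

    R = rng.standard_cauchy(size=(D, k))
    x = (u1 - u2) @ R                      # x_j ~ C(0, d) by 1-stability

    print(np.mean(np.abs(x)) / d)          # sample mean of |x_j|: erratic, E|x_j| is infinite
    print(np.median(np.abs(x)) / d)        # sample median of |x_j|: close to 1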

3. Main Results

Although the impossibility results (Lee and Naor, 2004; Brinkman and Charikar, 2005) have ruled out accurate estimators that are also metrics, there is enough information to recover d from the k samples {x_j}_{j=1}^{k} with high accuracy. We analyze three types of nonlinear estimators: the sample median estimators, the geometric mean estimators, and the maximum likelihood estimators.

3.1 The Sample Median Estimators

The sample median estimator,

\[ \hat{d}_{me} = \mathrm{median}\{|x_j|,\ j = 1, 2, \ldots, k\}, \]

is simple and computationally convenient. We recommend the bias-corrected version

\[ \hat{d}_{me,c} = \frac{\hat{d}_{me}}{b_{me}}, \qquad b_{me} = \frac{(2m+1)!}{m!\,m!}\int_0^1 \tan\Big(\frac{\pi}{2}t\Big)\big(t(1-t)\big)^m\,dt, \qquad k = 2m+1. \]

Here, for convenience, we only consider k = 2m + 1, m = 1, 2, 3, .... Some properties of ˆd_me,c:

- E(ˆd_me,c) = d, i.e., ˆd_me,c is unbiased.
- When k ≥ 5, the variance of ˆd_me,c is

\[ \mathrm{Var}\big(\hat{d}_{me,c}\big) = d^2\left(\frac{\frac{(2m+1)!}{m!\,m!}\int_0^1 \tan^2\big(\frac{\pi}{2}t\big)(t(1-t))^m\,dt}{\Big(\frac{(2m+1)!}{m!\,m!}\int_0^1 \tan\big(\frac{\pi}{2}t\big)(t(1-t))^m\,dt\Big)^2} - 1\right) = \frac{\pi^2}{4}\frac{d^2}{k} + O\Big(\frac{1}{k^2}\Big). \]

- b_me ≥ 1, and b_me ↓ 1 monotonically with increasing k.

3.2 The Geometric Mean Estimators

The geometric mean estimator,

\[ \hat{d}_{gm} = \prod_{j=1}^{k} |x_j|^{1/k}, \]

has the tail bounds

\[ \Pr\big(\hat{d}_{gm} \ge (1+ǫ)d\big) \le U_{R,gm} = \exp\Big(-k\,\frac{ǫ^2}{G_{R,gm}}\Big), \qquad ǫ > 0, \]

\[ \Pr\big(\hat{d}_{gm} \le (1-ǫ)d\big) \le U_{L,gm} = \exp\Big(-k\,\frac{ǫ^2}{G_{L,gm}}\Big), \qquad 0 < ǫ < 1, \]

where

\[ G_{R,gm} = \frac{ǫ^2}{\frac{2}{\pi}\log(1+ǫ)\,\tan^{-1}\big(\frac{2}{\pi}\log(1+ǫ)\big) - \frac{1}{2}\log\big(1+\frac{4}{\pi^2}\log^2(1+ǫ)\big)}, \]

\[ G_{L,gm} = \frac{ǫ^2}{\frac{2}{\pi}\log\frac{1}{1-ǫ}\,\tan^{-1}\big(\frac{2}{\pi}\log\frac{1}{1-ǫ}\big) - \frac{1}{2}\log\big(1+\frac{4}{\pi^2}\log^2(1-ǫ)\big)}. \]
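For illustration, minimal implementations of the sample median and geometric mean estimators (our own sketch; both take the projected differences x_j, e.g., x = B[0] - B[1] from the earlier sketch, as input):

    import numpy as np

    def d_me(x):
        # Sample median estimator (biased at small k; see the correction factor b_me)
        return np.median(np.abs(x))

    def d_gm(x, bias_correct=True):
        # Geometric mean estimator; the cos^k(pi/(2k)) factor is the bias correction
        # introduced with the estimator d_gm,c below
        x = np.abs(np.asarray(x, dtype=float))
        k = len(x)
        est = np.exp(np.mean(np.log(x)))
        return est * np.cos(np.pi / (2 * k)) ** k if bias_correct else est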

For small ǫ, we obtain the convenient approximations

\[ G_{R,gm} \approx \frac{\pi^2}{2}(1 + ǫ), \qquad G_{L,gm} \approx \frac{\pi^2}{2}(1 - ǫ). \]

Consequently, we establish an analog of the Johnson-Lindenstrauss (JL) Lemma for dimension reduction in l1:

If k ≥ G_gm (2 log n − log δ)/ǫ², then with probability at least 1 − δ, one can recover the original l1 distance between any pair of data points among all n data points within a 1 ± ǫ factor (0 < ǫ < 1), using ˆd_gm. The constant G_gm can be specified from G_R,gm and G_L,gm: G_gm = max{G_R,gm, G_L,gm}.

To remove the bias and also reduce the variance, we recommend the bias-corrected geometric mean estimator

\[ \hat{d}_{gm,c} = \cos^k\Big(\frac{\pi}{2k}\Big)\prod_{j=1}^{k}|x_j|^{1/k}, \]

which is unbiased and has variance

\[ \mathrm{Var}\big(\hat{d}_{gm,c}\big) = d^2\left(\frac{\cos^{2k}\big(\frac{\pi}{2k}\big)}{\cos^{k}\big(\frac{\pi}{k}\big)} - 1\right) = \frac{\pi^2}{4}\frac{d^2}{k} + \frac{\pi^4}{32}\frac{d^2}{k^2} + O\Big(\frac{1}{k^3}\Big). \]

We also derive tail bounds for ˆd_gm,c,

\[ \Pr\big(\hat{d}_{gm,c} \ge (1+ǫ)d\big) \le U_{R,gm,c}, \quad ǫ > 0, \qquad \Pr\big(\hat{d}_{gm,c} \le (1-ǫ)d\big) \le U_{L,gm,c}, \quad 0 < ǫ < 1, \]

and show that, compared with ˆd_gm, the ratios of the tail bounds converge,

\[ \rho_{R,k} = \frac{U_{R,gm,c}}{U_{R,gm}} \to \rho_{R,\infty}, \qquad \rho_{L,k} = \frac{U_{L,gm,c}}{U_{L,gm}} \to \rho_{L,\infty}, \]

as k → ∞, where ρ_{R,∞} and ρ_{L,∞} are functions of ǫ only, given explicitly in Lemma 9 in terms of constants A_1, C_1, A_2, and C_2.

3.3 The Maximum Likelihood Estimators

Denoted by ˆd_MLE,c, the bias-corrected maximum likelihood estimator (MLE) is

\[ \hat{d}_{MLE,c} = \hat{d}_{MLE}\Big(1 - \frac{1}{k}\Big), \]

where ˆd_MLE solves the nonlinear MLE equation

\[ -\frac{k}{\hat{d}_{MLE}} + \sum_{j=1}^{k}\frac{2\hat{d}_{MLE}}{x_j^2 + \hat{d}_{MLE}^2} = 0. \]

Some properties of ˆd_MLE,c:

- It is nearly unbiased: E(ˆd_MLE,c) = d + O(1/k²).
- Its asymptotic variance is

\[ \mathrm{Var}\big(\hat{d}_{MLE,c}\big) = \frac{2d^2}{k} + \frac{3d^2}{k^2} + O\Big(\frac{1}{k^3}\Big), \qquad \text{i.e.,} \qquad \frac{\mathrm{Var}(\hat{d}_{MLE,c})}{\mathrm{Var}(\hat{d}_{me,c})} \to \frac{8}{\pi^2}, \quad \frac{\mathrm{Var}(\hat{d}_{MLE,c})}{\mathrm{Var}(\hat{d}_{gm,c})} \to \frac{8}{\pi^2} \approx 80\%, \quad \text{as } k \to \infty. \]

- Its distribution can be accurately approximated by an inverse Gaussian, at least in the small deviation range, which suggests the following approximate tail bound,

\[ \Pr\big(|\hat{d}_{MLE,c} - d| \ge ǫ d\big) \lessapprox 2\exp\Big(-\frac{ǫ^2/2}{1+ǫ}\cdot\frac{k^2}{2k+3}\Big), \qquad 0 < ǫ < 1, \]

which is verified by simulations down to the 10^{-10} tail probability range.

4. The Sample Median Estimators

Recall that in Cauchy random projections, B = AR, we denote the leading two rows in A by u_1, u_2 ∈ R^D and the leading two rows in B by v_1, v_2 ∈ R^k. Our goal is to estimate the l1 distance d = |u_1 − u_2| = Σ_{i=1}^{D}|u_{1,i} − u_{2,i}| from {x_j}_{j=1}^{k}, x_j = v_{1,j} − v_{2,j} ~ C(0, d), i.i.d.

A widely-used estimator in statistics is based on sample inter-quantiles (Fama and Roll, 1968, 1971; McCulloch, 1986). For the symmetric Cauchy, the absolute sample median estimator

\[ \hat{d}_{me} = \mathrm{median}\{|x_j|,\ j = 1, 2, \ldots, k\} \]

is convenient because the population median of the absolute Cauchy is exactly d (Indyk, 2006). It is well known in statistics that ˆd_me is asymptotically unbiased and normal; see Lemma 3. For small samples (e.g., k ≤ 20), however, ˆd_me is severely biased.

Lemma 3 The sample median estimator ˆd_me is asymptotically unbiased and normal,

\[ \sqrt{k}\big(\hat{d}_{me} - d\big) \xrightarrow{D} N\Big(0, \frac{\pi^2}{4}d^2\Big). \]

When k = 2m + 1, m = 1, 2, 3, ..., the r-th moment of ˆd_me can be represented as

\[ E\big(\hat{d}_{me}^{\,r}\big) = d^r\,\frac{(2m+1)!}{m!\,m!}\int_0^1 \tan^r\Big(\frac{\pi}{2}t\Big)\big(t(1-t)\big)^m\,dt, \qquad m \ge r. \]

If m < r, then E(ˆd_me^r) = ∞.

Proof of Lemma 3. Let f(z; d) and F(z; d) be the probability density and cumulative distribution function, respectively, of |C(0, d)|:

\[ f(z; d) = \frac{2}{\pi}\frac{d}{z^2 + d^2}, \qquad F(z; d) = \frac{2}{\pi}\tan^{-1}\Big(\frac{z}{d}\Big), \qquad z \ge 0. \]

The inverse of F(z; d) is F^{-1}(q; d) = d tan(πq/2). Here we take q = 1/2, to consider the sample median. By the asymptotic normality of sample quantiles (David, 1981, Theorem 9.2),

\[ \sqrt{k}\big(\hat{d}_{me} - d\big) \xrightarrow{D} N\left(0, \frac{1/4}{f^2\big(d\tan(\pi/4);\,d\big)}\right) = N\Big(0, \frac{\pi^2}{4}d^2\Big), \]

i.e., ˆd_me is asymptotically unbiased and normal with variance Var(ˆd_me) = (π²/4) d²/k + O(1/k²).

For convenience, we assume k = 2m + 1. Again, by properties of sample quantiles (David, 1981), the probability density of ˆd_me is

\[ f_{\hat{d}_{me}}(z) = \frac{(2m+1)!}{m!\,m!}\,F(z; d)^m\big(1 - F(z; d)\big)^m f(z; d), \]

from which we can write down the r-th moment of ˆd_me after a change of variables.

Once we know E(ˆd_me), we can design an unbiased estimator, as described in Lemma 4.

Lemma 4 The estimator ˆd_me,c = ˆd_me / b_me is unbiased, i.e., E(ˆd_me,c) = d, where the bias-correction factor is

\[ b_{me} = \frac{E(\hat{d}_{me})}{d} = \frac{(2m+1)!}{m!\,m!}\int_0^1 \tan\Big(\frac{\pi}{2}t\Big)\big(t(1-t)\big)^m\,dt, \qquad k = 2m+1. \]

The variance of ˆd_me,c is, for k = 2m + 1 ≥ 5,

\[ \mathrm{Var}\big(\hat{d}_{me,c}\big) = d^2\left(\frac{\frac{(2m+1)!}{m!\,m!}\int_0^1 \tan^2\big(\frac{\pi}{2}t\big)(t(1-t))^m\,dt}{\Big(\frac{(2m+1)!}{m!\,m!}\int_0^1 \tan\big(\frac{\pi}{2}t\big)(t(1-t))^m\,dt\Big)^2} - 1\right). \]

ˆd_me,c and ˆd_me are asymptotically equivalent, i.e.,

\[ \sqrt{k}\big(\hat{d}_{me,c} - d\big) \xrightarrow{D} N\Big(0, \frac{\pi^2}{4}d^2\Big). \]

The bias-correction factor b_me decreases monotonically with increasing m, with b_me ≥ 1 and lim_{m→∞} b_me = 1.
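The correction factor b_me has no closed form but is a one-dimensional integral; the small sketch below (ours, using SciPy quadrature) evaluates it numerically:

    import numpy as np
    from math import factorial
    from scipy.integrate import quad

    def b_me(k):
        # Bias-correction factor of Lemma 4, for odd k = 2m + 1 with m >= 1
        assert k % 2 == 1 and k >= 3
        m = (k - 1) // 2
        c = factorial(2 * m + 1) / (factorial(m) ** 2)
        val, _ = quad(lambda t: np.tan(np.pi * t / 2) * (t * (1 - t)) ** m, 0, 1)
        return c * val

    # Bias-corrected sample median estimate: d_me_c = median(|x_j|) / b_me(k)
    print(b_me(11), b_me(101))   # noticeably above 1 at small k, approaches 1 as k grows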

Proof of Lemma 4. Most of the results follow directly from Lemma 3. Here we only show that b_me decreases monotonically and that b_me → 1 as m → ∞. Note that ((2m+1)!/(m! m!)) (t(1−t))^m, 0 ≤ t ≤ 1, is the probability density of a Beta distribution, Beta(m+1, m+1), whose r-th moment is

\[ E(z^r) = \frac{(2m+1)!\,(m+r)!}{(2m+1+r)!\,m!}. \]

By Taylor expansion (Gradshteyn and Ryzhik, 1994, 1.411.6),

\[ \tan\Big(\frac{\pi}{2}t\Big) = \sum_{j=1}^{\infty}\frac{2^{2j}(2^{2j}-1)\,|B_{2j}|}{(2j)!}\Big(\frac{\pi}{2}\Big)^{2j-1} t^{2j-1}, \]

where B_{2j} are the Bernoulli numbers (Gradshteyn and Ryzhik, 1994, 9.61). Therefore,

\[ b_{me} = \sum_{j=1}^{\infty}\frac{2^{2j}(2^{2j}-1)\,|B_{2j}|}{(2j)!}\Big(\frac{\pi}{2}\Big)^{2j-1}\frac{(2m+1)!\,(m+2j-1)!}{(2m+2j)!\,m!}. \]

It is easy to show that (2m+1)!(m+2j−1)!/((2m+2j)! m!) decreases monotonically with increasing m and converges to (1/2)^{2j−1}. Thus b_me also decreases monotonically with increasing m, and from the Taylor expansion of tan we know that

\[ \lim_{m\to\infty} b_{me} = \sum_{j=1}^{\infty}\frac{2^{2j}(2^{2j}-1)\,|B_{2j}|}{(2j)!}\Big(\frac{\pi}{2}\Big)^{2j-1}\Big(\frac{1}{2}\Big)^{2j-1} = \tan\Big(\frac{\pi}{4}\Big) = 1. \]

It is well known that bias corrections are not always beneficial, because of the bias-variance trade-off. In our case, because the correction factor b_me is always at least 1, the bias correction not only removes the bias of ˆd_me but also reduces its variance. The factor b_me can be numerically evaluated and tabulated, at least for small k. Figure 1 plots b_me as a function of k, indicating that ˆd_me is severely biased when k ≤ 20. When k > 50, the bias becomes negligible.

5. The Geometric Mean Estimators

This section derives estimators based on the geometric mean, which are more accurate than the sample median estimators. The geometric mean estimators allow us to derive tail bounds in explicit forms and consequently establish an analog of the Johnson-Lindenstrauss (JL) Lemma for dimension reduction in the l1 norm.

Lemma 5 Assume x ~ C(0, d). Then

\[ E\big(|x|^{\lambda}\big) = \frac{d^{\lambda}}{\cos(\lambda\pi/2)}, \qquad |\lambda| < 1. \]

Figure 1: The bias-correction factor b_me as a function of the sample size k = 2m + 1. After k > 50, the bias is negligible.

Proof of Lemma 5. Using the integral table (Gradshteyn and Ryzhik, 1994, page 337),

\[ E\big(|x|^{\lambda}\big) = \frac{2}{\pi}\int_0^{\infty}\frac{y^{\lambda}\,d}{y^2 + d^2}\,dy = \frac{d^{\lambda}}{\cos(\lambda\pi/2)}. \]

From Lemma 5, by taking λ = 1/k, we obtain an unbiased estimator, ˆd_gm,c, based on the bias-corrected geometric mean, in the next lemma, which is proved in Appendix A.

Lemma 6 The estimator

\[ \hat{d}_{gm,c} = \cos^k\Big(\frac{\pi}{2k}\Big)\prod_{j=1}^{k}|x_j|^{1/k}, \qquad k > 1, \]

is unbiased, with variance (valid when k > 2)

\[ \mathrm{Var}\big(\hat{d}_{gm,c}\big) = d^2\left(\frac{\cos^{2k}\big(\frac{\pi}{2k}\big)}{\cos^{k}\big(\frac{\pi}{k}\big)} - 1\right) = \frac{\pi^2}{4}\frac{d^2}{k} + \frac{\pi^4}{32}\frac{d^2}{k^2} + O\Big(\frac{1}{k^3}\Big). \]

The third and fourth central moments are

\[ E\big(\hat{d}_{gm,c} - E\hat{d}_{gm,c}\big)^3 = \frac{3\pi^4}{16}\frac{d^3}{k^2} + O\Big(\frac{1}{k^3}\Big), \qquad E\big(\hat{d}_{gm,c} - E\hat{d}_{gm,c}\big)^4 = \frac{3\pi^4}{16}\frac{d^4}{k^2} + O\Big(\frac{1}{k^3}\Big). \]

The higher (third or fourth) moments may be useful for approximating the distribution of ˆd_gm,c. In Section 6, we will show how to approximate the distribution of the maximum likelihood estimator by matching the first four moments in the leading terms. We could apply a similar technique to approximate ˆd_gm,c; fortunately, we do not have to do so, because we are able to derive exact tail bounds for ˆd_gm,c in Lemma 9.

Note that, as k → ∞, the bias-correction term converges to one quickly:

\[ \cos^k\Big(\frac{\pi}{2k}\Big) = \Big(1 - \frac{\pi^2}{8k^2} + \cdots\Big)^k = 1 - \frac{\pi^2}{8k} + \cdots \to 1. \]

When k is not too small (e.g., k > 50), the geometric mean estimator without bias correction,

\[ \hat{d}_{gm} = \prod_{j=1}^{k}|x_j|^{1/k}, \qquad k > 1, \]

gives similar results as ˆd_gm,c. As shown in Figure 2, the ratio of the mean square errors (MSE),

\[ \frac{\mathrm{MSE}\big(\hat{d}_{gm}\big)}{\mathrm{MSE}\big(\hat{d}_{gm,c}\big)} = \frac{\dfrac{1}{\cos^{k}(\pi/k)} - \dfrac{2}{\cos^{k}(\pi/(2k))} + 1}{\dfrac{\cos^{2k}(\pi/(2k))}{\cos^{k}(\pi/k)} - 1}, \]

demonstrates that the two geometric mean estimators are similar when k > 50, in terms of the MSE.

Figure 2: The ratio of the mean square errors, MSE(ˆd_gm)/MSE(ˆd_gm,c), indicating that the difference between ˆd_gm and ˆd_gm,c becomes negligible when k > 50.

One advantage of ˆd_gm is the convenience for deriving tail bounds. Thus, before presenting Lemma 9 for ˆd_gm,c, we prove tail bounds for ˆd_gm in Lemma 7, which is proved in Appendix B.

Lemma 7

\[ \Pr\big(\hat{d}_{gm} \ge (1+ǫ)d\big) \le U_{R,gm} = \exp\Big(-k\,\frac{ǫ^2}{G_{R,gm}}\Big), \qquad ǫ > 0, \]

\[ \Pr\big(\hat{d}_{gm} \le (1-ǫ)d\big) \le U_{L,gm} = \exp\Big(-k\,\frac{ǫ^2}{G_{L,gm}}\Big), \qquad 0 < ǫ < 1, \]

where

\[ G_{R,gm} = \frac{ǫ^2}{\frac{2}{\pi}\log(1+ǫ)\,\tan^{-1}\big(\frac{2}{\pi}\log(1+ǫ)\big) - \frac{1}{2}\log\big(1+\frac{4}{\pi^2}\log^2(1+ǫ)\big)}, \]

\[ G_{L,gm} = \frac{ǫ^2}{\frac{2}{\pi}\log\frac{1}{1-ǫ}\,\tan^{-1}\big(\frac{2}{\pi}\log\frac{1}{1-ǫ}\big) - \frac{1}{2}\log\big(1+\frac{4}{\pi^2}\log^2(1-ǫ)\big)}. \]

Moreover, for small ǫ, we have the convenient approximations

\[ G_{R,gm} \approx \frac{\pi^2}{2}(1+ǫ), \qquad G_{L,gm} \approx \frac{\pi^2}{2}(1-ǫ). \]

Consequently, as ǫ → 0+, we know that G_R,gm, G_L,gm → π²/2.

Figure 3 plots the constants G_R,gm and G_L,gm along with their convenient approximations. For G_R,gm, the exact and approximate expressions are indistinguishable when ǫ < 1. For G_L,gm, the exact and approximate expressions are indistinguishable when ǫ < 0.7. The plots also suggest that the approximations are upper bounds of the exact constants.

Figure 3: The constants G_R,gm (panel a) and G_L,gm (panel b), along with their convenient approximations.
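For illustration, the constants G_R,gm and G_L,gm plotted in Figure 3, and the resulting sample size for the l1 analog of the JL Lemma (Lemma 8 below), can be evaluated numerically as follows (our own sketch, based on the expressions above):

    import numpy as np

    def G_R_gm(eps):
        L = np.log(1.0 + eps)
        return eps**2 / ((2.0/np.pi) * L * np.arctan(2.0 * L / np.pi)
                         - 0.5 * np.log1p((2.0 * L / np.pi) ** 2))

    def G_L_gm(eps):
        L = np.log(1.0 / (1.0 - eps))
        return eps**2 / ((2.0/np.pi) * L * np.arctan(2.0 * L / np.pi)
                         - 0.5 * np.log1p((2.0 * L / np.pi) ** 2))

    def k_gm(n, eps, delta):
        # Sample size for the l1 analog of the JL Lemma (Lemma 8 below)
        G = max(G_R_gm(eps), G_L_gm(eps))
        return int(np.ceil(G * (2.0 * np.log(n) - np.log(delta)) / eps ** 2))

    print(G_R_gm(0.2), G_L_gm(0.2))    # both near pi^2/2 ~ 4.93 for small eps
    print(k_gm(n=10**6, eps=0.2, delta=0.05))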

Consequently, Lemma 7 establishes an analog of the Johnson-Lindenstrauss (JL) Lemma for dimension reduction in l1:

Lemma 8 If k ≥ G_gm (2 log n − log δ)/ǫ², then with probability at least 1 − δ, one can recover the original l1 distance between any pair of data points among all n data points within a 1 ± ǫ factor (0 < ǫ < 1), using ˆd_gm. It suffices to specify the constant G_gm = max{G_R,gm, G_L,gm}.

Similarly, we derive tail bounds for the unbiased geometric mean estimator ˆd_gm,c in Lemma 9, which is proved in Appendix C.

Lemma 9

\[ \Pr\big(\hat{d}_{gm,c} \ge (1+ǫ)d\big) \le U_{R,gm,c} = \frac{\cos^{k t_1}\big(\frac{\pi}{2k}\big)}{\cos^{k}\big(\frac{\pi t_1}{2k}\big)\,(1+ǫ)^{t_1}}, \qquad ǫ > 0, \]

where

\[ t_1 = \frac{2k}{\pi}\tan^{-1}\Big(\frac{2}{\pi}\Big(\log(1+ǫ) - k\log\cos\Big(\frac{\pi}{2k}\Big)\Big)\Big); \]

\[ \Pr\big(\hat{d}_{gm,c} \le (1-ǫ)d\big) \le U_{L,gm,c} = \frac{(1-ǫ)^{t_2}}{\cos^{k t_2}\big(\frac{\pi}{2k}\big)\,\cos^{k}\big(\frac{\pi t_2}{2k}\big)}, \qquad 0 < ǫ < 1,\ k \ge \frac{\pi^2}{8ǫ}, \]

where

\[ t_2 = \frac{2k}{\pi}\tan^{-1}\Big(\frac{2}{\pi}\Big(\log\frac{1}{1-ǫ} + k\log\cos\Big(\frac{\pi}{2k}\Big)\Big)\Big). \]

As k → ∞, for any fixed ǫ,

\[ \rho_{R,k} = \frac{U_{R,gm,c}}{U_{R,gm}} \to \rho_{R,\infty} = (1+ǫ)^{-C_1}\exp\Big(-\frac{\pi}{4}A_1 + \frac{\pi}{2}C_1\tan A_1\Big), \]

\[ \rho_{L,k} = \frac{U_{L,gm,c}}{U_{L,gm}} \to \rho_{L,\infty} = (1-ǫ)^{-C_2}\exp\Big(\frac{\pi}{4}A_2 - \frac{\pi}{2}C_2\tan A_2\Big), \]

where U_R,gm and U_L,gm are the upper bounds for ˆd_gm derived in Lemma 7, and

\[ A_1 = \tan^{-1}\Big(\frac{2}{\pi}\log(1+ǫ)\Big), \quad C_1 = \frac{1/2}{1+\frac{4}{\pi^2}\log^2(1+ǫ)}, \quad A_2 = \tan^{-1}\Big(\frac{2}{\pi}\log\frac{1}{1-ǫ}\Big), \quad C_2 = \frac{1/2}{1+\frac{4}{\pi^2}\log^2(1-ǫ)}. \]

Figure 4 plots the tail bound ratios ρ_{R,k} and ρ_{L,k} defined in Lemma 9, indicating that the asymptotic expressions ρ_{R,∞} and ρ_{L,∞} are in fact very accurate even for small k.

Figure 4: Tail bound ratios ρ_{R,k} (panel a) and ρ_{L,k} (panel b) for several small values of k, along with the asymptotic expressions ρ_{R,∞} and ρ_{L,∞}.

Figure 4 also illustrates that introducing the bias-correction term in ˆd_gm,c reduces the right tail bound but amplifies the left tail bound. Because the left tail bound is usually much smaller than the right tail bound, we expect that overall the bias correction should be beneficial, as shown in Figure 5, which plots the overall ratio of tail bounds,

\[ \rho_k = \frac{U_{R,gm,c} + U_{L,gm,c}}{U_{R,gm} + U_{L,gm}}. \]

Figure 5: The overall ratios of tail bounds, ρ_k, for several values of k, are almost always below one, demonstrating that the bias-corrected estimator ˆd_gm,c may exhibit better overall tail behavior than the biased estimator ˆd_gm.

Finally, Figure 6 compares ˆd_gm,c with the sample median estimators ˆd_me and ˆd_me,c in terms of the mean square errors. ˆd_gm,c is considerably more accurate than ˆd_me at small k, and the bias correction significantly reduces the mean square error of ˆd_me.

Figure 6: The ratios of the mean square errors, MSE(ˆd_me)/MSE(ˆd_gm,c) (no correction) and MSE(ˆd_me,c)/MSE(ˆd_gm,c) (bias-corrected), demonstrating that the bias-corrected geometric mean estimator ˆd_gm,c is considerably more accurate than the sample median estimator ˆd_me, and that the bias correction on ˆd_me considerably reduces its MSE. Note that when k = 3, the ratios are infinite.

6. The Maximum Likelihood Estimators

This section analyzes the maximum likelihood estimators (MLE), which are asymptotically optimum in terms of the variance. In comparison, the sample median and geometric mean estimators are not optimum. Our contributions in this section include the higher-order analysis of the bias and moments, and accurate closed-form approximations to the distribution of the MLE.

The method of maximum likelihood is widely used. For example, Li et al. (2006a) applied the maximum likelihood method to normal random projections and provided an improved estimator of the l2 distance by taking advantage of the marginal information.

Recall that our goal is to estimate d from k i.i.d. samples x_j ~ C(0, d), j = 1, 2, ..., k. The log joint likelihood of {x_j}_{j=1}^{k} is

\[ L(x_1, x_2, \ldots, x_k; d) = k\log d - k\log\pi - \sum_{j=1}^{k}\log\big(x_j^2 + d^2\big), \]

whose first and second derivatives with respect to d are

\[ L'(d) = \frac{k}{d} - \sum_{j=1}^{k}\frac{2d}{x_j^2 + d^2}, \qquad L''(d) = -\frac{k}{d^2} - \sum_{j=1}^{k}\frac{2\big(x_j^2 - d^2\big)}{\big(x_j^2 + d^2\big)^2} = -\frac{L'(d)}{d} - \sum_{j=1}^{k}\frac{4x_j^2}{\big(x_j^2 + d^2\big)^2}. \]

The maximum likelihood estimator of d, denoted by ˆd_MLE, is the solution to L'(d) = 0, i.e.,

\[ -\frac{k}{\hat{d}_{MLE}} + \sum_{j=1}^{k}\frac{2\hat{d}_{MLE}}{x_j^2 + \hat{d}_{MLE}^2} = 0. \]

Because L''(ˆd_MLE) ≤ 0, ˆd_MLE indeed maximizes the joint likelihood and is the only solution to the MLE equation. Solving the equation numerically is not difficult (e.g., a few iterations of Newton's method). For better accuracy, we recommend the following bias-corrected estimator:

\[ \hat{d}_{MLE,c} = \hat{d}_{MLE}\Big(1 - \frac{1}{k}\Big). \]

Lemma 10 concerns the asymptotic moments of ˆd_MLE and ˆd_MLE,c; it is proved in Appendix D.

Lemma 10 Both ˆd_MLE and ˆd_MLE,c are asymptotically unbiased and normal. The first four moments of ˆd_MLE are

\[ E\big(\hat{d}_{MLE}\big) - d = \frac{d}{k} + O\Big(\frac{1}{k^2}\Big), \qquad \mathrm{Var}\big(\hat{d}_{MLE}\big) = \frac{2d^2}{k} + \frac{7d^2}{k^2} + O\Big(\frac{1}{k^3}\Big), \]

\[ E\big(\hat{d}_{MLE} - E\hat{d}_{MLE}\big)^3 = \frac{12d^3}{k^2} + O\Big(\frac{1}{k^3}\Big), \qquad E\big(\hat{d}_{MLE} - E\hat{d}_{MLE}\big)^4 = \frac{12d^4}{k^2} + O\Big(\frac{1}{k^3}\Big). \]

The first four moments of ˆd_MLE,c are

\[ E\big(\hat{d}_{MLE,c}\big) - d = O\Big(\frac{1}{k^2}\Big), \qquad \mathrm{Var}\big(\hat{d}_{MLE,c}\big) = \frac{2d^2}{k} + \frac{3d^2}{k^2} + O\Big(\frac{1}{k^3}\Big), \]

\[ E\big(\hat{d}_{MLE,c} - E\hat{d}_{MLE,c}\big)^3 = \frac{12d^3}{k^2} + O\Big(\frac{1}{k^3}\Big), \qquad E\big(\hat{d}_{MLE,c} - E\hat{d}_{MLE,c}\big)^4 = \frac{12d^4}{k^2} + O\Big(\frac{1}{k^3}\Big). \]

The O(1/k) term of the variance, i.e., 2d²/k, is well known (Haas et al., 1970). We derive the bias-corrected estimator ˆd_MLE,c and the higher-order moments using stochastic Taylor expansions (Bartlett, 1953; Shenton and Bowman, 1963; Ferrari et al., 1996; Cysneiros et al., 2001). We will propose an inverse Gaussian distribution to approximate the distribution of ˆd_MLE,c, by matching the first four moments, at least in the leading terms.
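A minimal sketch of the estimator (ours; the Newton starting point and the safeguards are arbitrary choices) that solves the MLE equation above and applies the bias correction:

    import numpy as np

    def d_mle(x, tol=1e-10, max_iter=100):
        # Solve  -k/d + sum_j 2d/(x_j^2 + d^2) = 0  by Newton's method and apply
        # the (1 - 1/k) bias correction.
        x = np.abs(np.asarray(x, dtype=float))
        k = len(x)
        d = np.median(x)                  # start from the sample median estimate
        for _ in range(max_iter):
            g = -k / d + np.sum(2.0 * d / (x**2 + d**2))                     # the MLE equation
            gp = k / d**2 + np.sum(2.0 * (x**2 - d**2) / (x**2 + d**2)**2)   # its derivative in d
            d_new = d - g / gp
            if d_new <= 0.0:
                d_new = d / 2.0           # keep the iterate positive
            if abs(d_new - d) < tol * d:
                d = d_new
                break
            d = d_new
        return d * (1.0 - 1.0 / k)        # bias-corrected estimate d_MLE,c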

6.1 A Numerical Example

The maximum likelihood estimators are tested on some Microsoft Web crawl data, a term-by-document matrix with D = 2^16 Web pages. We conduct Cauchy random projections and estimate the l1 distances between words. In this experiment, we compare the empirical and asymptotic (theoretical) moments, using one pair of words. Figure 7 illustrates that the bias correction is effective and that the asymptotic formulas for the first four moments of ˆd_MLE,c in Lemma 10 are accurate once k is not too small (roughly k ≥ 20).

Figure 7: One pair of words is selected from a Microsoft term-by-document matrix with D = 2^16 Web pages. We conduct Cauchy random projections and estimate the l1 distance between the pair of words using the maximum likelihood estimator ˆd_MLE and the bias-corrected version ˆd_MLE,c. Panel (a) plots the biases of ˆd_MLE and ˆd_MLE,c, indicating that the bias correction is effective. Panels (b), (c), and (d) plot the variance, third moment, and fourth moment of ˆd_MLE,c, respectively. The dashed curves are the theoretical asymptotic moments, which are accurate once k is not too small.

6.2 Approximate Distributions

Theoretical analysis of the exact distribution of a maximum likelihood estimator is difficult. It is common practice to assume normality, which, however, is inaccurate. (The simple normal approximation can be improved by taking advantage of the conditional density on the ancillary configuration statistic, based on the observations x_1, x_2, ..., x_k; Fisher, 1934; Lawless, 1972; Hinkley, 1978.) The Edgeworth expansion improves the normal approximation by matching higher moments (Feller, 1971; Bhattacharya and Ghosh, 1978; Severini, 2000), but it has some well-known drawbacks. The resultant expressions are quite sophisticated and are not accurate at the tails. It is possible that the approximate probability takes values below zero. Also, Edgeworth expansions consider the support to be (−∞, ∞), while ˆd_MLE,c is non-negative. The saddle-point approximation (Jensen, 1995) in general improves on Edgeworth expansions, often considerably. Unfortunately, we can not apply the saddle-point approximation in our case (at least not directly), because it requires a bounded moment generating function.

We propose approximating the distribution of ˆd_MLE,c directly using some well-studied common distributions. We first consider a gamma distribution with the same first two asymptotic moments as ˆd_MLE,c. That is, the gamma distribution is asymptotically equivalent to the normal approximation. While a normal has zero third central moment, a gamma has a nonzero third central moment, which, to an extent, speeds up the rate of convergence. Another important reason why a gamma is more accurate is that it has the same support as ˆd_MLE,c, i.e., [0, ∞).

We furthermore consider a generalized gamma distribution, which allows us to match the first three asymptotic moments of ˆd_MLE,c. Interestingly, in this case the generalized gamma approximation turns out to be an inverse Gaussian distribution, which has a closed-form probability density. More interestingly, this inverse Gaussian distribution also matches the fourth central moment of ˆd_MLE,c in the leading term and almost in the next-order term. By simulations, the inverse Gaussian approximation is highly accurate.

Note that, since we are interested in the very small (e.g., 10^{-10}) tail probability range, an O(k^{-3/2}) error term is not too meaningful; for example, k^{-3/2} = 10^{-3} if k = 100. Therefore, we have to rely on simulations to assess the accuracy of the approximations. On the other hand, an upper bound may hold exactly (verified by simulations) even if it is based on an approximate distribution.

As related work, Li et al. (2006c) applied gamma and generalized gamma approximations to model the performance measure distribution in some wireless communication channels using random matrix theory, and produced accurate results in evaluating the error probabilities.

6.2.1 THE GAMMA APPROXIMATION

The gamma approximation is an obvious improvement over the normal approximation. (Recall that, in normal random projections for dimension reduction in l2, see Lemma 1, the resultant estimator of the squared l2 distance has a chi-square distribution, which is a special case of the gamma.) A gamma distribution G(α, β) has two parameters, α and β, which can be determined by matching the first two asymptotic moments of ˆd_MLE,c. That is, we assume ˆd_MLE,c ~ G(α, β) with

\[ \alpha\beta = d, \qquad \alpha\beta^2 = \frac{2d^2}{k} + \frac{3d^2}{k^2}, \qquad \Longrightarrow \qquad \alpha = \frac{k^2}{2k+3}, \qquad \beta = \frac{2d}{k} + \frac{3d}{k^2}. \]
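For illustration, the gamma approximation with these parameters can be evaluated directly (our own sketch using SciPy; d is set to 1, since the distribution of ˆd_MLE,c / d depends only on k):

    import numpy as np
    from scipy.stats import gamma

    def gamma_tails(eps, k):
        # Tail probabilities under the gamma approximation matched above (d = 1)
        alpha = k**2 / (2.0 * k + 3.0)
        beta = (2.0 * k + 3.0) / k**2          # beta with d = 1
        right = gamma.sf(1.0 + eps, a=alpha, scale=beta)
        left = gamma.cdf(1.0 - eps, a=alpha, scale=beta)
        return right, left

    print(gamma_tails(0.2, k=100))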

Assuming a gamma distribution, it is easy to obtain the following Chernoff bounds:

\[ \Pr\big(\hat{d}_{MLE,c} \ge (1+ǫ)d\big) \lessapprox \exp\big(-\alpha\big(ǫ - \log(1+ǫ)\big)\big), \qquad ǫ > 0, \]

\[ \Pr\big(\hat{d}_{MLE,c} \le (1-ǫ)d\big) \lessapprox \exp\big(-\alpha\big(-ǫ - \log(1-ǫ)\big)\big), \qquad 0 < ǫ < 1, \]

where we use ⪅ to indicate that these inequalities are based on an approximate distribution.

Note that the distribution of ˆd_MLE/d (and hence ˆd_MLE,c/d) is only a function of k (Antle and Bain, 1969; Haas et al., 1970). Therefore, we can evaluate the accuracy of the gamma approximation by simulations with d = 1, as presented in Figure 8.

Figure 8: We consider k = 10, 20, 50, 100, 200, and 400. For each k, we simulate standard Cauchy samples, from which we estimate the Cauchy parameter by the MLE ˆd_MLE,c and compute the tail probabilities. Panel (a) compares the empirical tail probabilities (thick solid) with the gamma tail probabilities (thin solid), indicating that the gamma distribution is better than the normal (dashed) for approximating the distribution of ˆd_MLE,c. Panel (b) compares the empirical tail probabilities with the gamma Chernoff upper bounds.

Figure 8(a) shows that both the gamma and normal approximations are fairly accurate when the tail probability is not too small, and the gamma approximation is obviously better. Figure 8(b) compares the empirical tail probabilities with the gamma Chernoff upper bounds, indicating that these bounds are reliable when the tail probability is not too small.

6.2.2 THE INVERSE GAUSSIAN (GENERALIZED GAMMA) APPROXIMATION

The distribution of ˆd_MLE,c can be well approximated by an inverse Gaussian distribution, which is a special case of the three-parameter generalized gamma distribution (Hougaard, 1986; Gerber, 1991; Li et al., 2006c), denoted by GG(α, β, η). Note that the usual gamma distribution is a special case with η = 1. If z ~ GG(α, β, η), then the first three moments are

\[ E(z) = \alpha\beta, \qquad \mathrm{Var}(z) = \alpha\beta^2, \qquad E\big(z - E(z)\big)^3 = (1+\eta)\,\alpha\beta^3. \]

We can approximate the distribution of ˆd_MLE,c by matching the first three moments, i.e.,

\[ \alpha\beta = d, \qquad \alpha\beta^2 = \frac{2d^2}{k} + \frac{3d^2}{k^2}, \qquad (1+\eta)\,\alpha\beta^3 = \frac{12d^3}{k^2}, \]

from which we obtain

\[ \alpha = \frac{k^2}{2k+3}, \qquad \beta = \frac{2d}{k} + \frac{3d}{k^2}, \qquad \eta = 2 + O\Big(\frac{1}{k}\Big). \]

Taking only the leading term for η, the generalized gamma approximation of ˆd_MLE,c is

\[ GG\Big(\frac{k^2}{2k+3},\ \frac{2d}{k} + \frac{3d}{k^2},\ 2\Big). \]

In general, a generalized gamma distribution does not have a closed-form density function (although it always has a closed-form moment generating function). In our case, the above is actually an inverse Gaussian (IG) distribution, which has a closed-form density function. (The inverse Gaussian distribution was first noted as the distribution of the first passage time of Brownian motion with a positive drift; it has many interesting properties, such as infinite divisibility. Two monographs, Chhikara and Folks, 1989, and Seshadri, 1993, are devoted entirely to the inverse Gaussian distributions.) Assuming ˆd_MLE,c ~ IG(α, β), with the parameters α and β defined above, the moment generating function (MGF), the probability density function (PDF), and the cumulative distribution function (CDF) are (Seshadri, 1993; Tweedie, 1957a,b)

\[ E\big(\exp(\hat{d}_{MLE,c}\,t)\big) \doteq \exp\Big(\alpha\big(1 - (1 - 2\beta t)^{1/2}\big)\Big), \]

\[ f_{\hat{d}_{MLE,c}}(y) \doteq \sqrt{\frac{\alpha d}{2\pi y^3}}\,\exp\Big(-\frac{\alpha(y-d)^2}{2dy}\Big), \]

\[ \Pr\big(\hat{d}_{MLE,c} \le y\big) \doteq \Phi\Big(\sqrt{\frac{\alpha}{dy}}\,(y - d)\Big) + e^{2\alpha}\,\Phi\Big(-\sqrt{\frac{\alpha}{dy}}\,(y + d)\Big), \]

where Φ(·) is the standard normal CDF, i.e., Φ(z) = ∫_{−∞}^{z} (2π)^{-1/2} e^{-t²/2} dt. Here we use ≐ (and ⪅ below) to indicate that these equalities are based on an approximate distribution.

Assuming ˆd_MLE,c ~ IG(α, β), the fourth central moment is

\[ E\big(\hat{d}_{MLE,c} - E\hat{d}_{MLE,c}\big)^4 \doteq 15\alpha\beta^4 + 3\alpha^2\beta^4 = \frac{12d^4}{k^2} + O\Big(\frac{1}{k^3}\Big). \]
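For illustration, the closed-form CDF above gives the tail probabilities of the inverse Gaussian approximation directly (our own sketch; d is set to 1):

    import numpy as np
    from scipy.stats import norm

    def ig_tails(eps, k):
        # Tail probabilities of the inverse Gaussian approximation, via the closed-form
        # CDF above with d = 1.  The e^{2 alpha} Phi(.) term is evaluated on the log
        # scale to avoid overflow at large k.
        alpha = k**2 / (2.0 * k + 3.0)
        s_r = np.sqrt(alpha / (1.0 + eps))
        s_l = np.sqrt(alpha / (1.0 - eps))
        right = norm.cdf(-eps * s_r) - np.exp(2.0 * alpha + norm.logcdf(-(2.0 + eps) * s_r))
        left = norm.cdf(-eps * s_l) + np.exp(2.0 * alpha + norm.logcdf(-(2.0 - eps) * s_l))
        return right, left

    print(ig_tails(0.2, k=100))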

Lemma 10 has shown that the true asymptotic fourth central moment of ˆd_MLE,c is 12d⁴/k² + O(1/k³). That is, the inverse Gaussian approximation matches the leading term, 12d⁴/k², and nearly matches the next-order term of the true asymptotic fourth moment of ˆd_MLE,c.

Assuming ˆd_MLE,c ~ IG(α, β), the tail probabilities of ˆd_MLE,c can be expressed as

\[ \Pr\big(\hat{d}_{MLE,c} \ge (1+ǫ)d\big) \doteq \Phi\Big(-ǫ\sqrt{\frac{\alpha}{1+ǫ}}\Big) - e^{2\alpha}\,\Phi\Big(-(2+ǫ)\sqrt{\frac{\alpha}{1+ǫ}}\Big), \qquad ǫ > 0, \]

\[ \Pr\big(\hat{d}_{MLE,c} \le (1-ǫ)d\big) \doteq \Phi\Big(-ǫ\sqrt{\frac{\alpha}{1-ǫ}}\Big) + e^{2\alpha}\,\Phi\Big(-(2-ǫ)\sqrt{\frac{\alpha}{1-ǫ}}\Big), \qquad 0 < ǫ < 1. \]

Assuming ˆd_MLE,c ~ IG(α, β), it is easy to show the following Chernoff bounds:

\[ \Pr\big(\hat{d}_{MLE,c} \ge (1+ǫ)d\big) \lessapprox \exp\Big(-\frac{\alpha ǫ^2}{2(1+ǫ)}\Big), \qquad ǫ > 0, \]

\[ \Pr\big(\hat{d}_{MLE,c} \le (1-ǫ)d\big) \lessapprox \exp\Big(-\frac{\alpha ǫ^2}{2(1-ǫ)}\Big), \qquad 0 < ǫ < 1. \]

To see the first bound, assume z ~ IG(α, β). Then, using the Chernoff inequality,

\[ \Pr\big(z \ge (1+ǫ)d\big) \le E\big(e^{zt}\big)\exp\big(-(1+ǫ)dt\big) = \exp\Big(\alpha\big(1 - (1-2\beta t)^{1/2}\big) - (1+ǫ)dt\Big), \]

whose minimum, exp(−αǫ²/(2(1+ǫ))), is attained at t = (1 − (1+ǫ)^{-2})/(2β). We can similarly show the left bound. Combining the two bounds yields a symmetric approximate bound,

\[ \Pr\big(|\hat{d}_{MLE,c} - d| \ge ǫ d\big) \lessapprox 2\exp\Big(-\frac{ǫ^2/2}{1+ǫ}\cdot\frac{k^2}{2k+3}\Big), \qquad 0 < ǫ < 1. \]

Figure 9 compares the inverse Gaussian approximation with the same simulations as presented in Figure 8, indicating that the inverse Gaussian approximation is highly accurate; in that range of tail probabilities we can essentially treat the inverse Gaussian as the exact distribution of ˆd_MLE,c. The Chernoff upper bounds for the inverse Gaussian are always reliable in our simulation range of tail probabilities (down to about 10^{-10}).

Figure 9: We compare the inverse Gaussian approximation with the same simulations as presented in Figure 8. Panel (a) compares the empirical tail probabilities with the inverse Gaussian tail probabilities, indicating that the approximation is highly accurate. Panel (b) compares the empirical tail probabilities with the inverse Gaussian Chernoff upper bounds. The upper bounds are all above the corresponding empirical curves, indicating that the proposed bounds are reliable, at least in our simulation range.

7. Conclusion

In machine learning, it is well known that the l1 distance is far more robust than the l2 distance against outliers. Dimension reduction in the l1 norm, however, has been proved impossible if we use linear random projections and linear estimators. In this study, we analyze three types of nonlinear estimators for Cauchy random projections: the sample median estimators, the geometric mean estimators, and the maximum likelihood estimators. Our theoretical analysis has shown that these nonlinear estimators can accurately recover the original l1 distance, even though none of them can be a metric.

The sample median estimators and the geometric mean estimators are asymptotically equivalent, but the latter are more accurate at small sample size. We have derived explicit tail bounds for the geometric mean estimators in exponential forms. Using these tail bounds, we have established an analog of the Johnson-Lindenstrauss (JL) Lemma for dimension reduction in l1, which is weaker than the classical JL Lemma for dimension reduction in l2.

We conduct theoretical analysis of the maximum likelihood estimators (MLE), which are asymptotically optimum. Both the sample median and geometric mean estimators are about 80% efficient as the MLE. We propose approximating the distribution of the MLE by an inverse Gaussian, which has the same support and matches the leading terms of the first four moments of the MLE. Approximate tail bounds have been provided based on the inverse Gaussian approximation. Verified by simulations, these approximate tail bounds hold at least in the 10^{-10} tail probability range.

Although these nonlinear estimators are not metrics, they are still useful for certain applications in, e.g., data stream computations, information retrieval, learning, and data mining, whenever the goal is to compute the l1 distances efficiently using a small storage space, in a single pass of the data.

Li (2008) generalized the geometric mean estimators to the stable distribution family, for dimension reduction in the l_p norm (0 < p ≤ 2). Li (2008) also proposed the harmonic mean estimator for p → 0+, which is far more accurate than the geometric mean estimator. (Stable random projections with very small p, p → 0+, have been applied to approximating the Hamming distances, Cormode et al., 2002, 2003, and the max-dominance norm, Cormode and Muthukrishnan, 2003.)

In addition, Li (2007) suggested very sparse stable random projections, which replace the stable distribution with a mixture of a symmetric Pareto distribution and a point mass at the origin, in order to considerably simplify the sampling procedure used to generate the projection matrix and to achieve a significant cost reduction in the matrix multiplication operations.

The general method of linear stable random projections is an appealing paradigm for applications involving massive, high-dimensional, non-sparse, and heavy-tailed data. If there is prior information that the data are highly sparse (e.g., text data), other alternative dimension reduction methods may be more suitable; for example, the technique called Conditional Random Sampling (CRS) (Li and Church, 2005, 2007; Li et al., 2007a) was particularly designed for approximating distances and other summary statistics in highly sparse data.

Acknowledgment

We thank Piotr Indyk, Assaf Naor, Art Owen, and Anand Vidyashankar. Ping Li was partially supported by an NSF grant (DMS) and the Cornell University junior faculty startup fund. Trevor Hastie was partially supported by an NSF grant (DMS) and an NIH grant (R01 CA).

Appendix A. Proof of Lemma 6

Assume that x_1, x_2, ..., x_k are i.i.d. C(0, d). The estimator

\[ \hat{d}_{gm,c} = \cos^k\Big(\frac{\pi}{2k}\Big)\prod_{j=1}^{k}|x_j|^{1/k} \]

is unbiased because, from Lemma 5,

\[ E\big(\hat{d}_{gm,c}\big) = \cos^k\Big(\frac{\pi}{2k}\Big)\prod_{j=1}^{k}E\big(|x_j|^{1/k}\big) = \cos^k\Big(\frac{\pi}{2k}\Big)\frac{d}{\cos^k\big(\frac{\pi}{2k}\big)} = d. \]

The variance is

\[ \mathrm{Var}\big(\hat{d}_{gm,c}\big) = \cos^{2k}\Big(\frac{\pi}{2k}\Big)\prod_{j=1}^{k}E\big(|x_j|^{2/k}\big) - d^2 = d^2\left(\frac{\cos^{2k}\big(\frac{\pi}{2k}\big)}{\cos^{k}\big(\frac{\pi}{k}\big)} - 1\right) = \frac{\pi^2}{4}\frac{d^2}{k} + \frac{\pi^4}{32}\frac{d^2}{k^2} + O\Big(\frac{1}{k^3}\Big), \]

where the expansion follows from

\[ \frac{\cos^{2k}\big(\frac{\pi}{2k}\big)}{\cos^{k}\big(\frac{\pi}{k}\big)} = \exp\Big(2k\log\cos\Big(\frac{\pi}{2k}\Big) - k\log\cos\Big(\frac{\pi}{k}\Big)\Big) = \exp\Big(\frac{\pi^2}{4k} + O\Big(\frac{1}{k^3}\Big)\Big) = 1 + \frac{\pi^2}{4k} + \frac{\pi^4}{32k^2} + O\Big(\frac{1}{k^3}\Big). \]

Similar algebra yields the third and fourth central moments,

\[ E\big(\hat{d}_{gm,c} - E\hat{d}_{gm,c}\big)^3 = \frac{3\pi^4}{16}\frac{d^3}{k^2} + O\Big(\frac{1}{k^3}\Big), \qquad E\big(\hat{d}_{gm,c} - E\hat{d}_{gm,c}\big)^4 = \frac{3\pi^4}{16}\frac{d^4}{k^2} + O\Big(\frac{1}{k^3}\Big). \]

Appendix B. Proof of Lemma 7

We use the Markov moment bound, because ˆd_gm does not have a moment generating function (E(ˆd_gm^t) = ∞ if t ≥ k). In fact, even when the Chernoff bound is applicable, for any positive random variable the Markov moment bound is always sharper than the Chernoff bound (Philips and Nelson, 1995; Lugosi, 2004).

By the Markov moment bound, for any ǫ > 0 and 0 < t < k,

\[ \Pr\big(\hat{d}_{gm} \ge (1+ǫ)d\big) \le \frac{E\big(\hat{d}_{gm}^{\,t}\big)}{\big((1+ǫ)d\big)^{t}} = \frac{1}{\cos^{k}\big(\frac{\pi t}{2k}\big)(1+ǫ)^{t}}, \]

whose minimum is attained at t = (2k/π) tan^{-1}((2/π) log(1+ǫ)). Thus

\[ \Pr\big(\hat{d}_{gm} \ge (1+ǫ)d\big) \le \exp\Big(-k\Big(\frac{2}{\pi}\log(1+ǫ)\tan^{-1}\Big(\frac{2}{\pi}\log(1+ǫ)\Big) - \frac{1}{2}\log\Big(1+\frac{4}{\pi^2}\log^2(1+ǫ)\Big)\Big)\Big) = \exp\Big(-k\,\frac{ǫ^2}{G_{R,gm}}\Big), \]

with G_R,gm as defined in Lemma 7.

Again, by the Markov moment bound, for any 0 < ǫ < 1 and 0 < t < k,

\[ \Pr\big(\hat{d}_{gm} \le (1-ǫ)d\big) = \Pr\big(\hat{d}_{gm}^{-t} \ge \big((1-ǫ)d\big)^{-t}\big) \le E\big(\hat{d}_{gm}^{-t}\big)\big((1-ǫ)d\big)^{t} = \frac{(1-ǫ)^{t}}{\cos^{k}\big(\frac{\pi t}{2k}\big)}, \]

whose minimum is attained at t = (2k/π) tan^{-1}((2/π) log(1/(1−ǫ))). Thus

\[ \Pr\big(\hat{d}_{gm} \le (1-ǫ)d\big) \le \exp\Big(-k\Big(\frac{2}{\pi}\log\frac{1}{1-ǫ}\tan^{-1}\Big(\frac{2}{\pi}\log\frac{1}{1-ǫ}\Big) - \frac{1}{2}\log\Big(1+\frac{4}{\pi^2}\log^2(1-ǫ)\Big)\Big)\Big) = \exp\Big(-k\,\frac{ǫ^2}{G_{L,gm}}\Big), \]

with G_L,gm as defined in Lemma 7.

Finally, we derive the convenient approximations for G_R,gm and G_L,gm for small ǫ. Using the Taylor expansions log(1+x) = x − x²/2 + x³/3 − ... and tan^{-1}(x) = x − x³/3 + ... (for |x| < 1), expanding the denominator of G_R,gm in powers of log(1+ǫ), and then expanding log(1+ǫ) in powers of ǫ, gives

\[ G_{R,gm} \approx \frac{\pi^2}{2}\big(1 + ǫ + \cdots\big). \]

Similarly, for small ǫ, expanding the denominator of G_L,gm in powers of log(1/(1−ǫ)) gives

\[ G_{L,gm} \approx \frac{\pi^2}{2}\big(1 - ǫ + \cdots\big). \]

Appendix C. Proof of Lemma 9

For any ǫ > 0 and 0 < t < k, the Markov inequality gives

\[ \Pr\big(\hat{d}_{gm,c} \ge (1+ǫ)d\big) \le \frac{E\big(\hat{d}_{gm,c}^{\,t}\big)}{\big((1+ǫ)d\big)^{t}} = \frac{\cos^{kt}\big(\frac{\pi}{2k}\big)}{\cos^{k}\big(\frac{\pi t}{2k}\big)(1+ǫ)^{t}}, \]

which is minimized by choosing the optimum t = t_1, where

\[ t_1 = \frac{2k}{\pi}\tan^{-1}\Big(\frac{2}{\pi}\Big(\log(1+ǫ) - k\log\cos\Big(\frac{\pi}{2k}\Big)\Big)\Big). \]

We need to make sure that 0 ≤ t_1 < k: t_1 ≥ 0 because log cos(π/(2k)) ≤ 0, and t_1 < k because tan^{-1}(·) < π/2.

Now we show the other tail bound. For 0 < t < k,

\[ \Pr\big(\hat{d}_{gm,c} \le (1-ǫ)d\big) = \Pr\Big(\prod_{j=1}^{k}|x_j|^{-t/k} \ge \Big(\frac{(1-ǫ)d}{\cos^k(\pi/(2k))}\Big)^{-t}\Big) \le \frac{(1-ǫ)^{t}}{\cos^{kt}\big(\frac{\pi}{2k}\big)\cos^{k}\big(\frac{\pi t}{2k}\big)}, \]

which is minimized at t = t_2,

\[ t_2 = \frac{2k}{\pi}\tan^{-1}\Big(\frac{2}{\pi}\Big(\log\frac{1}{1-ǫ} + k\log\cos\Big(\frac{\pi}{2k}\Big)\Big)\Big), \]

provided k ≥ π²/(8ǫ); otherwise t_2 may be less than 0. To see this, in order for t_2 ≥ 0 we must have log(1/(1−ǫ)) ≥ −k log cos(π/(2k)). Because −k log cos(π/(2k)) ≤ π²/(8k) up to higher-order terms, and log(1/(1−ǫ)) ≥ ǫ, it suffices if ǫ ≥ π²/(8k), i.e., k ≥ π²/(8ǫ).

Now we prove the asymptotic (as k → ∞) expressions for the ratios of tail bounds, another way to compare ˆd_gm and ˆd_gm,c. First, we consider the right tail bounds. For large k, the optimal t_1 can be approximated as

\[ t_1 \approx \frac{2k}{\pi}\tan^{-1}\Big(\frac{2}{\pi}\Big(\log(1+ǫ) + \frac{\pi^2}{8k}\Big)\Big) \approx \frac{2k}{\pi}A_1 + C_1, \]

by Taylor expansion, where

\[ A_1 = \tan^{-1}\Big(\frac{2}{\pi}\log(1+ǫ)\Big), \qquad C_1 = \frac{1/2}{1+\frac{4}{\pi^2}\log^2(1+ǫ)}. \]

In this asymptotic decomposition, the term (2k/π)A_1 is the optimal t in the proof of the right tail bound for ˆd_gm. Thus, to study the asymptotic ratio of the right tail bounds, we only need to keep track of the additional terms in

\[ U_{R,gm,c} = \frac{\cos^{kt_1}\big(\frac{\pi}{2k}\big)}{\cos^{k}\big(\frac{\pi t_1}{2k}\big)(1+ǫ)^{t_1}}. \]

Because

\[ \cos^{kt_1}\Big(\frac{\pi}{2k}\Big) \approx \Big(1 - \frac{\pi^2}{8k^2}\Big)^{\frac{2k^2}{\pi}A_1 + kC_1} \to \exp\Big(-\frac{\pi}{4}A_1\Big), \]

and

\[ \frac{\cos^{k}(A_1)}{\cos^{k}\big(A_1 + \frac{\pi C_1}{2k}\big)} \approx \Big(1 + \frac{\pi C_1}{2k}\tan A_1\Big)^{k} \to \exp\Big(\frac{\pi}{2}C_1\tan A_1\Big), \]

we know that the ratio of the right tail bounds converges,

\[ \rho_{R,k} = \frac{U_{R,gm,c}}{U_{R,gm}} \to \rho_{R,\infty} = (1+ǫ)^{-C_1}\exp\Big(-\frac{\pi}{4}A_1 + \frac{\pi}{2}C_1\tan A_1\Big). \]

Next, we consider the left tail bounds. First, we obtain an asymptotic decomposition of t_2,

\[ t_2 \approx \frac{2k}{\pi}\tan^{-1}\Big(\frac{2}{\pi}\Big(\log\frac{1}{1-ǫ} - \frac{\pi^2}{8k}\Big)\Big) \approx \frac{2k}{\pi}A_2 - C_2, \]

by Taylor expansion, where

\[ A_2 = \tan^{-1}\Big(\frac{2}{\pi}\log\frac{1}{1-ǫ}\Big), \qquad C_2 = \frac{1/2}{1+\frac{4}{\pi^2}\log^2(1-ǫ)}. \]

Again, in this decomposition the term (2k/π)A_2 is the optimal t in the proof of the left tail bound for ˆd_gm. Thus, to study the asymptotic ratio of the left tail bounds, we only need to keep track of the additional terms in U_L,gm,c. Because

\[ \cos^{-kt_2}\Big(\frac{\pi}{2k}\Big) \to \exp\Big(\frac{\pi}{4}A_2\Big), \qquad \frac{\cos^{k}(A_2)}{\cos^{k}\big(A_2 - \frac{\pi C_2}{2k}\big)} \approx \Big(1 - \frac{\pi C_2}{2k}\tan A_2\Big)^{k} \to \exp\Big(-\frac{\pi}{2}C_2\tan A_2\Big), \]

we know that the ratio of the left tail bounds converges,

\[ \rho_{L,k} = \frac{U_{L,gm,c}}{U_{L,gm}} \to \rho_{L,\infty} = (1-ǫ)^{-C_2}\exp\Big(\frac{\pi}{4}A_2 - \frac{\pi}{2}C_2\tan A_2\Big). \]

Appendix D. Proof of Lemma 10

Assume x ~ C(0, d). The log-likelihood l(x; d) and its first three derivatives are

\[ l(x; d) = \log d - \log\pi - \log\big(x^2 + d^2\big), \]

\[ l'(d) = \frac{1}{d} - \frac{2d}{x^2 + d^2}, \qquad l''(d) = -\frac{1}{d^2} - \frac{2\big(x^2 - d^2\big)}{\big(x^2 + d^2\big)^2}, \qquad l'''(d) = \frac{2}{d^3} + \frac{4d}{\big(x^2 + d^2\big)^2} + \frac{8d\big(x^2 - d^2\big)}{\big(x^2 + d^2\big)^3}. \]

The MLE ˆd_MLE is asymptotically normal with mean d and variance 1/(k I(d)), where I(d), the expected Fisher information, is

\[ I(d) = -E\big(l''(d)\big) = \frac{1}{d^2} + E\left(\frac{2\big(x^2 - d^2\big)}{\big(x^2 + d^2\big)^2}\right) = \frac{1}{2d^2}, \]

because

\[ E\left(\frac{2\big(x^2 - d^2\big)}{\big(x^2 + d^2\big)^2}\right) = \frac{2}{\pi}\int_{-\infty}^{\infty}\frac{\big(x^2 - d^2\big)\,d}{\big(x^2 + d^2\big)^3}\,dx = -\frac{1}{2d^2}. \]

Therefore,

\[ \mathrm{Var}\big(\hat{d}_{MLE}\big) = \frac{1}{k\,I(d)} + O\Big(\frac{1}{k^2}\Big) = \frac{2d^2}{k} + O\Big(\frac{1}{k^2}\Big). \]

General formulas for the bias and higher moments of the MLE are available in Bartlett (1953) and Shenton and Bowman (1963).

Evaluating the expressions of Shenton and Bowman (1963) involves tedious algebra; the required ingredients are expectations of products of the derivatives l', l'', and l'''. The following formula is useful:

\[ E\left(\frac{1}{\big(x^2 + d^2\big)^m}\right) = \frac{(2m-1)!!}{(2m)!!}\,\frac{1}{d^{2m}}, \qquad m = 1, 2, 3, \ldots \]

Without giving the details, we report

\[ E\big((l')^3\big) = 0, \qquad E\big(l''\,l'\big) = -\frac{1}{2d^3}, \qquad E\big((l')^4\big) = \frac{3}{8d^4}, \qquad E\big(l''\,(l')^2\big) = -\frac{1}{8d^4}, \qquad E\big(l'''\,l'\big) = \frac{3}{4d^4}. \]

Substituting these quantities into the general formulas, we obtain

\[ E\big(\hat{d}_{MLE}\big) = d + \frac{d}{k} + O\Big(\frac{1}{k^2}\Big), \qquad \mathrm{Var}\big(\hat{d}_{MLE}\big) = \frac{2d^2}{k} + \frac{7d^2}{k^2} + O\Big(\frac{1}{k^3}\Big), \]

\[ E\big(\hat{d}_{MLE} - E\hat{d}_{MLE}\big)^3 = \frac{12d^3}{k^2} + O\Big(\frac{1}{k^3}\Big), \qquad E\big(\hat{d}_{MLE} - E\hat{d}_{MLE}\big)^4 = \frac{12d^4}{k^2} + O\Big(\frac{1}{k^3}\Big). \]

The moments of ˆd_MLE,c = (1 − 1/k) ˆd_MLE stated in Lemma 10 then follow directly.


More information

Inference for High Dimensional Robust Regression

Inference for High Dimensional Robust Regression Department of Statistics UC Berkeley Stanford-Berkeley Joint Colloquium, 2015 Table of Contents 1 Background 2 Main Results 3 OLS: A Motivating Example Table of Contents 1 Background 2 Main Results 3 OLS:

More information

ECE 302 Division 2 Exam 2 Solutions, 11/4/2009.

ECE 302 Division 2 Exam 2 Solutions, 11/4/2009. NAME: ECE 32 Division 2 Exam 2 Solutions, /4/29. You will be required to show your student ID during the exam. This is a closed-book exam. A formula sheet is provided. No calculators are allowed. Total

More information

Asymptotic Multivariate Kriging Using Estimated Parameters with Bayesian Prediction Methods for Non-linear Predictands

Asymptotic Multivariate Kriging Using Estimated Parameters with Bayesian Prediction Methods for Non-linear Predictands Asymptotic Multivariate Kriging Using Estimated Parameters with Bayesian Prediction Methods for Non-linear Predictands Elizabeth C. Mannshardt-Shamseldin Advisor: Richard L. Smith Duke University Department

More information

COMS 4771 Introduction to Machine Learning. Nakul Verma

COMS 4771 Introduction to Machine Learning. Nakul Verma COMS 4771 Introduction to Machine Learning Nakul Verma Announcements HW1 due next lecture Project details are available decide on the group and topic by Thursday Last time Generative vs. Discriminative

More information

Multivariate Statistics Random Projections and Johnson-Lindenstrauss Lemma

Multivariate Statistics Random Projections and Johnson-Lindenstrauss Lemma Multivariate Statistics Random Projections and Johnson-Lindenstrauss Lemma Suppose again we have n sample points x,..., x n R p. The data-point x i R p can be thought of as the i-th row X i of an n p-dimensional

More information

Sharp Generalization Error Bounds for Randomly-projected Classifiers

Sharp Generalization Error Bounds for Randomly-projected Classifiers Sharp Generalization Error Bounds for Randomly-projected Classifiers R.J. Durrant and A. Kabán School of Computer Science The University of Birmingham Birmingham B15 2TT, UK http://www.cs.bham.ac.uk/ axk

More information

STAT 7032 Probability Spring Wlodek Bryc

STAT 7032 Probability Spring Wlodek Bryc STAT 7032 Probability Spring 2018 Wlodek Bryc Created: Friday, Jan 2, 2014 Revised for Spring 2018 Printed: January 9, 2018 File: Grad-Prob-2018.TEX Department of Mathematical Sciences, University of Cincinnati,

More information

Behavioral Data Mining. Lecture 7 Linear and Logistic Regression

Behavioral Data Mining. Lecture 7 Linear and Logistic Regression Behavioral Data Mining Lecture 7 Linear and Logistic Regression Outline Linear Regression Regularization Logistic Regression Stochastic Gradient Fast Stochastic Methods Performance tips Linear Regression

More information

Analysis of Type-II Progressively Hybrid Censored Data

Analysis of Type-II Progressively Hybrid Censored Data Analysis of Type-II Progressively Hybrid Censored Data Debasis Kundu & Avijit Joarder Abstract The mixture of Type-I and Type-II censoring schemes, called the hybrid censoring scheme is quite common in

More information

ABC-Boost: Adaptive Base Class Boost for Multi-class Classification

ABC-Boost: Adaptive Base Class Boost for Multi-class Classification ABC-Boost: Adaptive Base Class Boost for Multi-class Classification Ping Li Department of Statistical Science, Cornell University, Ithaca, NY 14853 USA pingli@cornell.edu Abstract We propose -boost (adaptive

More information

Lecture Notes 5 Convergence and Limit Theorems. Convergence with Probability 1. Convergence in Mean Square. Convergence in Probability, WLLN

Lecture Notes 5 Convergence and Limit Theorems. Convergence with Probability 1. Convergence in Mean Square. Convergence in Probability, WLLN Lecture Notes 5 Convergence and Limit Theorems Motivation Convergence with Probability Convergence in Mean Square Convergence in Probability, WLLN Convergence in Distribution, CLT EE 278: Convergence and

More information

Lecture 13 October 6, Covering Numbers and Maurey s Empirical Method

Lecture 13 October 6, Covering Numbers and Maurey s Empirical Method CS 395T: Sublinear Algorithms Fall 2016 Prof. Eric Price Lecture 13 October 6, 2016 Scribe: Kiyeon Jeon and Loc Hoang 1 Overview In the last lecture we covered the lower bound for p th moment (p > 2) and

More information

Review. DS GA 1002 Statistical and Mathematical Models. Carlos Fernandez-Granda

Review. DS GA 1002 Statistical and Mathematical Models.   Carlos Fernandez-Granda Review DS GA 1002 Statistical and Mathematical Models http://www.cims.nyu.edu/~cfgranda/pages/dsga1002_fall16 Carlos Fernandez-Granda Probability and statistics Probability: Framework for dealing with

More information

Lecture 3. Linear Regression II Bastian Leibe RWTH Aachen

Lecture 3. Linear Regression II Bastian Leibe RWTH Aachen Advanced Machine Learning Lecture 3 Linear Regression II 02.11.2015 Bastian Leibe RWTH Aachen http://www.vision.rwth-aachen.de/ leibe@vision.rwth-aachen.de This Lecture: Advanced Machine Learning Regression

More information

High Dimensional Geometry, Curse of Dimensionality, Dimension Reduction

High Dimensional Geometry, Curse of Dimensionality, Dimension Reduction Chapter 11 High Dimensional Geometry, Curse of Dimensionality, Dimension Reduction High-dimensional vectors are ubiquitous in applications (gene expression data, set of movies watched by Netflix customer,

More information

Multiple Random Variables

Multiple Random Variables Multiple Random Variables Joint Probability Density Let X and Y be two random variables. Their joint distribution function is F ( XY x, y) P X x Y y. F XY ( ) 1, < x

More information

Fast Dimension Reduction

Fast Dimension Reduction Fast Dimension Reduction MMDS 2008 Nir Ailon Google Research NY Fast Dimension Reduction Using Rademacher Series on Dual BCH Codes (with Edo Liberty) The Fast Johnson Lindenstrauss Transform (with Bernard

More information

Mathematical Institute, University of Utrecht. The problem of estimating the mean of an observed Gaussian innite-dimensional vector

Mathematical Institute, University of Utrecht. The problem of estimating the mean of an observed Gaussian innite-dimensional vector On Minimax Filtering over Ellipsoids Eduard N. Belitser and Boris Y. Levit Mathematical Institute, University of Utrecht Budapestlaan 6, 3584 CD Utrecht, The Netherlands The problem of estimating the mean

More information

Simulating Uniform- and Triangular- Based Double Power Method Distributions

Simulating Uniform- and Triangular- Based Double Power Method Distributions Journal of Statistical and Econometric Methods, vol.6, no.1, 2017, 1-44 ISSN: 1792-6602 (print), 1792-6939 (online) Scienpress Ltd, 2017 Simulating Uniform- and Triangular- Based Double Power Method Distributions

More information

Introduction to Probability and Statistics (Continued)

Introduction to Probability and Statistics (Continued) Introduction to Probability and Statistics (Continued) Prof. icholas Zabaras Center for Informatics and Computational Science https://cics.nd.edu/ University of otre Dame otre Dame, Indiana, USA Email:

More information

DS-GA 1002 Lecture notes 2 Fall Random variables

DS-GA 1002 Lecture notes 2 Fall Random variables DS-GA 12 Lecture notes 2 Fall 216 1 Introduction Random variables Random variables are a fundamental tool in probabilistic modeling. They allow us to model numerical quantities that are uncertain: the

More information

MINIMUM EXPECTED RISK PROBABILITY ESTIMATES FOR NONPARAMETRIC NEIGHBORHOOD CLASSIFIERS. Maya Gupta, Luca Cazzanti, and Santosh Srivastava

MINIMUM EXPECTED RISK PROBABILITY ESTIMATES FOR NONPARAMETRIC NEIGHBORHOOD CLASSIFIERS. Maya Gupta, Luca Cazzanti, and Santosh Srivastava MINIMUM EXPECTED RISK PROBABILITY ESTIMATES FOR NONPARAMETRIC NEIGHBORHOOD CLASSIFIERS Maya Gupta, Luca Cazzanti, and Santosh Srivastava University of Washington Dept. of Electrical Engineering Seattle,

More information

STAT 200C: High-dimensional Statistics

STAT 200C: High-dimensional Statistics STAT 200C: High-dimensional Statistics Arash A. Amini April 27, 2018 1 / 80 Classical case: n d. Asymptotic assumption: d is fixed and n. Basic tools: LLN and CLT. High-dimensional setting: n d, e.g. n/d

More information

Lecture 14 October 16, 2014

Lecture 14 October 16, 2014 CS 224: Advanced Algorithms Fall 2014 Prof. Jelani Nelson Lecture 14 October 16, 2014 Scribe: Jao-ke Chin-Lee 1 Overview In the last lecture we covered learning topic models with guest lecturer Rong Ge.

More information

Fall 2017 STAT 532 Homework Peter Hoff. 1. Let P be a probability measure on a collection of sets A.

Fall 2017 STAT 532 Homework Peter Hoff. 1. Let P be a probability measure on a collection of sets A. 1. Let P be a probability measure on a collection of sets A. (a) For each n N, let H n be a set in A such that H n H n+1. Show that P (H n ) monotonically converges to P ( k=1 H k) as n. (b) For each n

More information

On Reparametrization and the Gibbs Sampler

On Reparametrization and the Gibbs Sampler On Reparametrization and the Gibbs Sampler Jorge Carlos Román Department of Mathematics Vanderbilt University James P. Hobert Department of Statistics University of Florida March 2014 Brett Presnell Department

More information

1 Uniform Distribution. 2 Gamma Distribution. 3 Inverse Gamma Distribution. 4 Multivariate Normal Distribution. 5 Multivariate Student-t Distribution

1 Uniform Distribution. 2 Gamma Distribution. 3 Inverse Gamma Distribution. 4 Multivariate Normal Distribution. 5 Multivariate Student-t Distribution A Few Special Distributions Their Properties Econ 675 Iowa State University November 1 006 Justin L Tobias (ISU Distributional Catalog November 1 006 1 / 0 Special Distributions Their Associated Properties

More information

Module 3. Function of a Random Variable and its distribution

Module 3. Function of a Random Variable and its distribution Module 3 Function of a Random Variable and its distribution 1. Function of a Random Variable Let Ω, F, be a probability space and let be random variable defined on Ω, F,. Further let h: R R be a given

More information

A fast randomized algorithm for approximating an SVD of a matrix

A fast randomized algorithm for approximating an SVD of a matrix A fast randomized algorithm for approximating an SVD of a matrix Joint work with Franco Woolfe, Edo Liberty, and Vladimir Rokhlin Mark Tygert Program in Applied Mathematics Yale University Place July 17,

More information

1 Appendix A: Matrix Algebra

1 Appendix A: Matrix Algebra Appendix A: Matrix Algebra. Definitions Matrix A =[ ]=[A] Symmetric matrix: = for all and Diagonal matrix: 6=0if = but =0if 6= Scalar matrix: the diagonal matrix of = Identity matrix: the scalar matrix

More information

ECE531 Lecture 10b: Maximum Likelihood Estimation

ECE531 Lecture 10b: Maximum Likelihood Estimation ECE531 Lecture 10b: Maximum Likelihood Estimation D. Richard Brown III Worcester Polytechnic Institute 05-Apr-2011 Worcester Polytechnic Institute D. Richard Brown III 05-Apr-2011 1 / 23 Introduction So

More information

Declaring Independence via the Sketching of Sketches. Until August Hire Me!

Declaring Independence via the Sketching of Sketches. Until August Hire Me! Declaring Independence via the Sketching of Sketches Piotr Indyk Andrew McGregor Massachusetts Institute of Technology University of California, San Diego Until August 08 -- Hire Me! The Problem The Problem

More information

Practice Problems Section Problems

Practice Problems Section Problems Practice Problems Section 4-4-3 4-4 4-5 4-6 4-7 4-8 4-10 Supplemental Problems 4-1 to 4-9 4-13, 14, 15, 17, 19, 0 4-3, 34, 36, 38 4-47, 49, 5, 54, 55 4-59, 60, 63 4-66, 68, 69, 70, 74 4-79, 81, 84 4-85,

More information

Course: ESO-209 Home Work: 1 Instructor: Debasis Kundu

Course: ESO-209 Home Work: 1 Instructor: Debasis Kundu Home Work: 1 1. Describe the sample space when a coin is tossed (a) once, (b) three times, (c) n times, (d) an infinite number of times. 2. A coin is tossed until for the first time the same result appear

More information

Lecture 1 Measure concentration

Lecture 1 Measure concentration CSE 29: Learning Theory Fall 2006 Lecture Measure concentration Lecturer: Sanjoy Dasgupta Scribe: Nakul Verma, Aaron Arvey, and Paul Ruvolo. Concentration of measure: examples We start with some examples

More information

Manifold Coarse Graining for Online Semi-supervised Learning

Manifold Coarse Graining for Online Semi-supervised Learning for Online Semi-supervised Learning Mehrdad Farajtabar, Amirreza Shaban, Hamid R. Rabiee, Mohammad H. Rohban Digital Media Lab, Department of Computer Engineering, Sharif University of Technology, Tehran,

More information

Locating the Source of Diffusion in Large-Scale Networks

Locating the Source of Diffusion in Large-Scale Networks Locating the Source of Diffusion in Large-Scale Networks Supplemental Material Pedro C. Pinto, Patrick Thiran, Martin Vetterli Contents S1. Detailed Proof of Proposition 1..................................

More information

Compressed Sensing and Sparse Recovery

Compressed Sensing and Sparse Recovery ELE 538B: Sparsity, Structure and Inference Compressed Sensing and Sparse Recovery Yuxin Chen Princeton University, Spring 217 Outline Restricted isometry property (RIP) A RIPless theory Compressed sensing

More information

Regularization Paths

Regularization Paths December 2005 Trevor Hastie, Stanford Statistics 1 Regularization Paths Trevor Hastie Stanford University drawing on collaborations with Brad Efron, Saharon Rosset, Ji Zhu, Hui Zhou, Rob Tibshirani and

More information

Stochastic Subgradient Methods

Stochastic Subgradient Methods Stochastic Subgradient Methods Stephen Boyd and Almir Mutapcic Notes for EE364b, Stanford University, Winter 26-7 April 13, 28 1 Noisy unbiased subgradient Suppose f : R n R is a convex function. We say

More information

Learning Theory. Ingo Steinwart University of Stuttgart. September 4, 2013

Learning Theory. Ingo Steinwart University of Stuttgart. September 4, 2013 Learning Theory Ingo Steinwart University of Stuttgart September 4, 2013 Ingo Steinwart University of Stuttgart () Learning Theory September 4, 2013 1 / 62 Basics Informal Introduction Informal Description

More information

A Randomized Algorithm for the Approximation of Matrices

A Randomized Algorithm for the Approximation of Matrices A Randomized Algorithm for the Approximation of Matrices Per-Gunnar Martinsson, Vladimir Rokhlin, and Mark Tygert Technical Report YALEU/DCS/TR-36 June 29, 2006 Abstract Given an m n matrix A and a positive

More information

Unsupervised Learning

Unsupervised Learning 2018 EE448, Big Data Mining, Lecture 7 Unsupervised Learning Weinan Zhang Shanghai Jiao Tong University http://wnzhang.net http://wnzhang.net/teaching/ee448/index.html ML Problem Setting First build and

More information

Monte-Carlo MMD-MA, Université Paris-Dauphine. Xiaolu Tan

Monte-Carlo MMD-MA, Université Paris-Dauphine. Xiaolu Tan Monte-Carlo MMD-MA, Université Paris-Dauphine Xiaolu Tan tan@ceremade.dauphine.fr Septembre 2015 Contents 1 Introduction 1 1.1 The principle.................................. 1 1.2 The error analysis

More information

Terminology Suppose we have N observations {x(n)} N 1. Estimators as Random Variables. {x(n)} N 1

Terminology Suppose we have N observations {x(n)} N 1. Estimators as Random Variables. {x(n)} N 1 Estimation Theory Overview Properties Bias, Variance, and Mean Square Error Cramér-Rao lower bound Maximum likelihood Consistency Confidence intervals Properties of the mean estimator Properties of the

More information

Hybrid Censoring; An Introduction 2

Hybrid Censoring; An Introduction 2 Hybrid Censoring; An Introduction 2 Debasis Kundu Department of Mathematics & Statistics Indian Institute of Technology Kanpur 23-rd November, 2010 2 This is a joint work with N. Balakrishnan Debasis Kundu

More information

SYSM 6303: Quantitative Introduction to Risk and Uncertainty in Business Lecture 4: Fitting Data to Distributions

SYSM 6303: Quantitative Introduction to Risk and Uncertainty in Business Lecture 4: Fitting Data to Distributions SYSM 6303: Quantitative Introduction to Risk and Uncertainty in Business Lecture 4: Fitting Data to Distributions M. Vidyasagar Cecil & Ida Green Chair The University of Texas at Dallas Email: M.Vidyasagar@utdallas.edu

More information

Hypothesis Testing. 1 Definitions of test statistics. CB: chapter 8; section 10.3

Hypothesis Testing. 1 Definitions of test statistics. CB: chapter 8; section 10.3 Hypothesis Testing CB: chapter 8; section 0.3 Hypothesis: statement about an unknown population parameter Examples: The average age of males in Sweden is 7. (statement about population mean) The lowest

More information

Essentials on the Analysis of Randomized Algorithms

Essentials on the Analysis of Randomized Algorithms Essentials on the Analysis of Randomized Algorithms Dimitris Diochnos Feb 0, 2009 Abstract These notes were written with Monte Carlo algorithms primarily in mind. Topics covered are basic (discrete) random

More information

Lecture 1: Supervised Learning

Lecture 1: Supervised Learning Lecture 1: Supervised Learning Tuo Zhao Schools of ISYE and CSE, Georgia Tech ISYE6740/CSE6740/CS7641: Computational Data Analysis/Machine from Portland, Learning Oregon: pervised learning (Supervised)

More information

IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 57, NO. 11, NOVEMBER On the Performance of Sparse Recovery

IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 57, NO. 11, NOVEMBER On the Performance of Sparse Recovery IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 57, NO. 11, NOVEMBER 2011 7255 On the Performance of Sparse Recovery Via `p-minimization (0 p 1) Meng Wang, Student Member, IEEE, Weiyu Xu, and Ao Tang, Senior

More information

Probabilistic Hashing for Efficient Search and Learning

Probabilistic Hashing for Efficient Search and Learning Ping Li Probabilistic Hashing for Efficient Search and Learning July 12, 2012 MMDS2012 1 Probabilistic Hashing for Efficient Search and Learning Ping Li Department of Statistical Science Faculty of omputing

More information

Maximum Likelihood, Logistic Regression, and Stochastic Gradient Training

Maximum Likelihood, Logistic Regression, and Stochastic Gradient Training Maximum Likelihood, Logistic Regression, and Stochastic Gradient Training Charles Elkan elkan@cs.ucsd.edu January 17, 2013 1 Principle of maximum likelihood Consider a family of probability distributions

More information

The Central Limit Theorem: More of the Story

The Central Limit Theorem: More of the Story The Central Limit Theorem: More of the Story Steven Janke November 2015 Steven Janke (Seminar) The Central Limit Theorem:More of the Story November 2015 1 / 33 Central Limit Theorem Theorem (Central Limit

More information

Acceleration of Randomized Kaczmarz Method

Acceleration of Randomized Kaczmarz Method Acceleration of Randomized Kaczmarz Method Deanna Needell [Joint work with Y. Eldar] Stanford University BIRS Banff, March 2011 Problem Background Setup Setup Let Ax = b be an overdetermined consistent

More information

Combining Interval and Probabilistic Uncertainty in Engineering Applications

Combining Interval and Probabilistic Uncertainty in Engineering Applications Combining Interval and Probabilistic Uncertainty in Engineering Applications Andrew Pownuk Computational Science Program University of Texas at El Paso El Paso, Texas 79968, USA ampownuk@utep.edu Page

More information

Brief Introduction to Machine Learning

Brief Introduction to Machine Learning Brief Introduction to Machine Learning Yuh-Jye Lee Lab of Data Science and Machine Intelligence Dept. of Applied Math. at NCTU August 29, 2016 1 / 49 1 Introduction 2 Binary Classification 3 Support Vector

More information

A General Overview of Parametric Estimation and Inference Techniques.

A General Overview of Parametric Estimation and Inference Techniques. A General Overview of Parametric Estimation and Inference Techniques. Moulinath Banerjee University of Michigan September 11, 2012 The object of statistical inference is to glean information about an underlying

More information

Sketching as a Tool for Numerical Linear Algebra

Sketching as a Tool for Numerical Linear Algebra Sketching as a Tool for Numerical Linear Algebra David P. Woodruff presented by Sepehr Assadi o(n) Big Data Reading Group University of Pennsylvania February, 2015 Sepehr Assadi (Penn) Sketching for Numerical

More information

Approximate Dynamic Programming

Approximate Dynamic Programming Master MVA: Reinforcement Learning Lecture: 5 Approximate Dynamic Programming Lecturer: Alessandro Lazaric http://researchers.lille.inria.fr/ lazaric/webpage/teaching.html Objectives of the lecture 1.

More information

STAT 512 sp 2018 Summary Sheet

STAT 512 sp 2018 Summary Sheet STAT 5 sp 08 Summary Sheet Karl B. Gregory Spring 08. Transformations of a random variable Let X be a rv with support X and let g be a function mapping X to Y with inverse mapping g (A = {x X : g(x A}

More information

Probability for Statistics and Machine Learning

Probability for Statistics and Machine Learning ~Springer Anirban DasGupta Probability for Statistics and Machine Learning Fundamentals and Advanced Topics Contents Suggested Courses with Diffe~ent Themes........................... xix 1 Review of Univariate

More information

Compressed Sensing with Very Sparse Gaussian Random Projections

Compressed Sensing with Very Sparse Gaussian Random Projections Compressed Sensing with Very Sparse Gaussian Random Projections arxiv:408.504v stat.me] Aug 04 Ping Li Department of Statistics and Biostatistics Department of Computer Science Rutgers University Piscataway,

More information

Lecture 18: March 15

Lecture 18: March 15 CS71 Randomness & Computation Spring 018 Instructor: Alistair Sinclair Lecture 18: March 15 Disclaimer: These notes have not been subjected to the usual scrutiny accorded to formal publications. They may

More information

Lecture 8: Information Theory and Statistics

Lecture 8: Information Theory and Statistics Lecture 8: Information Theory and Statistics Part II: Hypothesis Testing and I-Hsiang Wang Department of Electrical Engineering National Taiwan University ihwang@ntu.edu.tw December 23, 2015 1 / 50 I-Hsiang

More information

Optimization. Benjamin Recht University of California, Berkeley Stephen Wright University of Wisconsin-Madison

Optimization. Benjamin Recht University of California, Berkeley Stephen Wright University of Wisconsin-Madison Optimization Benjamin Recht University of California, Berkeley Stephen Wright University of Wisconsin-Madison optimization () cost constraints might be too much to cover in 3 hours optimization (for big

More information