Nonlinear Estimators and Tail Bounds for Dimension Reduction in l1 Using Cauchy Random Projections


Journal of Machine Learning Research (2007)    Submitted 10/06; Published 2007

Nonlinear Estimators and Tail Bounds for Dimension Reduction in l1 Using Cauchy Random Projections

Ping Li (PINGLI@CORNELL.EDU)
Department of Statistical Science, Faculty of Computing and Information Science, Cornell University, Ithaca, NY 14853, USA

Trevor J. Hastie (HASTIE@STANFORD.EDU)
Department of Statistics, Stanford University, Stanford, CA 94305, USA

Kenneth W. Church (CHURCH@MICROSOFT.COM)
Microsoft Research, Microsoft Corporation, Redmond, WA 98052, USA

Editor: Sam Roweis

Abstract

For dimension reduction in the l1 norm, the method of Cauchy random projections multiplies the original data matrix A ∈ R^{n×D} with a random matrix R ∈ R^{D×k} (k ≪ D) whose entries are i.i.d. samples of the standard Cauchy C(0, 1). Because of the impossibility result, one can not hope to recover the pairwise l1 distances in A from B = AR ∈ R^{n×k} using linear estimators without incurring large errors. However, nonlinear estimators are still useful for certain applications in data stream computations, information retrieval, learning, and data mining. We study three types of nonlinear estimators: the sample median estimators, the geometric mean estimators, and the maximum likelihood estimators (MLE). We derive tail bounds for the geometric mean estimators and establish that k = O(log n / ǫ²) suffices, with the constants explicitly given. Asymptotically (as k → ∞), both the sample median and the geometric mean estimators are about 80% efficient compared to the MLE. We analyze the moments of the MLE and propose approximating its distribution by an inverse Gaussian.

Keywords: dimension reduction, l1 norm, Johnson-Lindenstrauss (JL) Lemma, Cauchy random projections

1. Introduction

There has been considerable interest in the l1 norm in statistics and machine learning, as it is now well-known that the l1 distance is far more robust than the l2 distance against outliers (Huber, 1981). It is sometimes good practice to replace the l2 norm minimization with the l1 norm minimization, for example, in the Least Absolute Deviation (LAD) Boost (Friedman, 2001).

A preliminary version of this paper appeared in COLT 2007 (Li et al., 2007b).

(c) 2007 Ping Li, Trevor J. Hastie and Kenneth W. Church.

Chapelle et al. (1999) demonstrated that using the l1 (Laplacian) radial basis kernel produced better classification results than the usual l2 (Gaussian) radial basis kernel in their histogram-based image classification project using support vector machines (SVM). Recently, it has also become popular to use the l1 norm for variable (feature) selection; success stories include LASSO (Tibshirani, 1996), LARS (Efron et al., 2004), and 1-norm SVM (Zhu et al., 2003).

This paper focuses on dimension reduction in the l1 norm, in particular on the method based on Cauchy random projections, which is a special case of linear stable random projections (Johnson and Schechtman, 1982; Indyk, 2000, 2006; Li, 2008). The idea of linear random projections is to multiply the original data matrix A ∈ R^{n×D} with a random projection matrix R ∈ R^{D×k}, resulting in a projected matrix B = AR ∈ R^{n×k}. We would like k to be as small as possible. If k ≪ D, then it should be much more efficient to compute certain summary statistics (e.g., pairwise distances) from B as opposed to A. Moreover, B may be small enough to reside in physical memory while A is often too large to fit in the main memory.

The choice of the random projection matrix R depends on which norm we would like to work with. For dimension reduction in l_p (0 < p ≤ 2), it is common practice to construct R from i.i.d. samples of p-stable distributions (Johnson and Schechtman, 1982; Indyk, 2000, 2006; Li, 2008). In the stable distribution family (Zolotarev, 1986), normal is 2-stable and Cauchy is 1-stable. Thus, we will call random projections for l2 and l1 normal random projections and Cauchy random projections, respectively.

In normal random projections (Vempala, 2004), we can estimate the original pairwise l2 distances in A directly using the corresponding l2 distances in B (up to a normalizing constant). Furthermore, the Johnson-Lindenstrauss (JL) Lemma (Johnson and Lindenstrauss, 1984) provides the performance guarantee. We will review normal random projections in more detail in Section 2.

For Cauchy random projections, however, one shall not use the l1 distance in B to approximate the original l1 distance in A, as the Cauchy distribution does not even have a finite first moment. The impossibility results (Brinkman and Charikar, 2003; Lee and Naor, 2004; Brinkman and Charikar, 2005) have proved that one can not hope to recover the l1 distance using linear projections and linear estimators (e.g., the sample mean) without incurring large errors. Fortunately, the impossibility results do not rule out nonlinear estimators, which may still be useful in certain applications in data stream computations, information retrieval, learning, and data mining.

In this paper, we study three types of nonlinear estimators: the sample median estimators, the geometric mean estimators, and the maximum likelihood estimators (MLE). The sample median and the geometric mean estimators are asymptotically (as k → ∞) equivalent (i.e., both are about 80% efficient as the MLE), but the latter is more accurate at small sample size k. Furthermore, we derive explicit tail bounds for the geometric mean estimators and establish an analog of the JL Lemma for dimension reduction in l1. This analog of the JL Lemma for l1 is weaker than the classical JL Lemma for l2, as the geometric mean is not convex and hence is not a metric. Many efficient algorithms, such as some sub-linear time (using super-linear memory) nearest neighbor algorithms (Shakhnarovich et al., 2005), rely on metric properties (e.g., the triangle inequality).

Nevertheless, nonlinear estimators may still be useful in important scenarios.

- Estimating l1 distances online. The original data matrix A ∈ R^{n×D} requires O(nD) storage space, and hence it is often too large for physical memory. The storage cost of materializing all pairwise distances is O(n²), which may also be too large for the memory. For example, in information retrieval, n could be the total number of word types or documents at Web scale. To avoid page faults, it may be more efficient to estimate the distances on the fly from the projected data matrix B in the memory.

- Computing all pairwise l1 distances. In distance-based clustering, classification, and kernels (e.g., for SVM), we need to compute all pairwise distances in A, at the cost of time O(n²D), which can be prohibitive, especially when A does not fit in the memory. Using Cauchy random projections, the cost is reduced to O(nDk + n²k).

- Linear scan nearest neighbor searching. Nearest neighbor searching is notorious for being inefficient, especially when the data matrix A is too large for the memory. Searching for the nearest neighbors from the projected data matrix B (which is in the memory) becomes much more efficient, even by linear scans. The cost of searching for the nearest neighbor of one data point is reduced from O(nD) to O(nk).

- Data stream computations. Massive data streams come from Internet routers, phone switches, atmospheric observations, sensor networks, highway traffic, finance data, and more (Henzinger et al., 1999; Feigenbaum et al., 1999; Indyk, 2000; Babcock et al., 2002; Cormode et al., 2002). Unlike in traditional databases, it is not common to store massive data streams, and hence the processing is often done on the fly. In data stream computations, Cauchy random projections can be used for (A) approximating the l1 frequency moments of individual streams and (B) approximating the l1 differences between a pair of streams.

We briefly comment on random coordinate sampling, another strategy for dimension reduction. One can randomly sample k columns from A ∈ R^{n×D} and estimate the summary statistics (including l1 and l2 distances). Despite its simplicity, this strategy has two major drawbacks. First, in heavy-tailed data, one may have to choose k very large in order to achieve sufficient accuracy. Second, large datasets are often highly sparse, for example, text data (Dhillon and Modha, 2001) and market-basket data (Aggarwal and Wolf, 1999; Strehl and Ghosh, 2000). For sparse data, Li and Church (2005, 2007) and Li et al. (2007a) provided an alternative coordinate sampling strategy, called Conditional Random Sampling (CRS). For non-sparse data, however, methods based on linear stable random projections are superior.

The rest of the paper is organized as follows. Section 2 reviews linear random projections. Section 3 summarizes the main results for the three types of nonlinear estimators. Section 4 presents the sample median estimators. Section 5 concerns the geometric mean estimators. Section 6 is devoted to the maximum likelihood estimators. Section 7 concludes the paper.

2. Introduction to Linear Stable Random Projections

We give a review of linear random projections, including normal and Cauchy random projections. Denote the original data matrix by A ∈ R^{n×D}, i.e., n data points in D dimensions. Let {u_i^T}_{i=1}^{n} ⊂ R^D be the rows of A. Let R ∈ R^{D×k} be a projection matrix and denote the entries of R by {r_ij}, i = 1, ..., D, j = 1, ..., k. The projected data matrix is B = AR ∈ R^{n×k}. Let {v_i^T}_{i=1}^{n} ⊂ R^k be the rows of B, i.e., v_i = R^T u_i.
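To make the setup concrete, here is a minimal sketch (not from the original paper) in Python/NumPy; the matrix sizes, the random seed, and the synthetic data matrix A are arbitrary illustrative choices:

    import numpy as np

    rng = np.random.default_rng(0)
    n, D, k = 5, 10000, 100                 # illustrative sizes with k << D

    A = rng.standard_normal((n, D))         # stand-in for the original data matrix A (n x D)
    R = rng.standard_cauchy(size=(D, k))    # entries r_ij ~ C(0, 1), i.i.d.
    B = A @ R                               # projected data matrix B = A R (n x k)

    v1, v2 = B[0], B[1]                     # leading two rows of B
    x = v1 - v2                             # x_j = v_{1,j} - v_{2,j}, j = 1, ..., k
    d = np.abs(A[0] - A[1]).sum()           # original l1 distance (unknown in practice)

The projected differences x above are exactly the samples {x_j} used by the estimators studied in this paper.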

For simplicity, we focus on the leading two rows, u_1 and u_2, in A, and the leading two rows, v_1 and v_2, in B. Define {x_j}_{j=1}^{k} to be

\[ x_j = v_{1,j} - v_{2,j} = \sum_{i=1}^{D} r_{ij}\,(u_{1,i} - u_{2,i}), \qquad j = 1, 2, \ldots, k. \]

If we sample r_ij i.i.d. from a p-stable distribution (Zolotarev, 1986), then the x_j's are also i.i.d. samples of a p-stable distribution with a different scale parameter. In the family of stable distributions, normal (p = 2) and Cauchy (p = 1) are two important special cases.

2.1 Normal Random Projections

When r_ij is sampled from the standard normal, i.e., r_ij ~ N(0, 1), i.i.d., then

\[ x_j = v_{1,j} - v_{2,j} = \sum_{i=1}^{D} r_{ij}\,(u_{1,i} - u_{2,i}) \sim N\Big(0, \sum_{i=1}^{D} (u_{1,i} - u_{2,i})^2\Big), \qquad j = 1, 2, \ldots, k, \]

because a weighted sum of normals is also normal. Denote the squared l2 distance between u_1 and u_2 by

\[ d_{l_2} = \|u_1 - u_2\|_2^2 = \sum_{i=1}^{D} (u_{1,i} - u_{2,i})^2. \]

We can estimate d_{l_2} from the sample squared l2 distance (i.e., the sample mean)

\[ \hat{d}_{l_2} = \frac{1}{k}\sum_{j=1}^{k} x_j^2. \]

Note that k ˆd_{l_2} / d_{l_2} follows a chi-square distribution with k degrees of freedom, χ²_k.
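As an illustration (again not part of the paper), a minimal sketch of this sample-mean estimator with synthetic vectors u_1, u_2:

    import numpy as np

    rng = np.random.default_rng(0)
    D, k = 10000, 100
    u1, u2 = rng.standard_normal(D), rng.standard_normal(D)   # synthetic data vectors

    R = rng.standard_normal((D, k))      # r_ij ~ N(0, 1), i.i.d.
    x = (u1 - u2) @ R                    # x_j ~ N(0, d_l2), i.i.d.
    d_l2_hat = np.mean(x ** 2)           # sample-mean estimator of the squared l2 distance
    d_l2 = np.sum((u1 - u2) ** 2)        # true squared l2 distance, for comparison
    print(d_l2_hat / d_l2)               # close to 1; k * d_l2_hat / d_l2 is chi-square(k)

The chi-square law of k ˆd_{l_2}/d_{l_2} is what drives the tail bounds below.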

Therefore, it is easy to prove the following tail bounds.

Lemma 1

\[ \Pr\big(\hat{d}_{l_2} - d_{l_2} \ge ǫ d_{l_2}\big) \le \exp\Big(-\frac{k}{2}\big(ǫ - \log(1+ǫ)\big)\Big) = \exp\Big(-k\,\frac{ǫ^2}{G_R}\Big), \qquad ǫ > 0, \]

\[ \Pr\big(\hat{d}_{l_2} - d_{l_2} \le -ǫ d_{l_2}\big) \le \exp\Big(-\frac{k}{2}\big(-ǫ - \log(1-ǫ)\big)\Big) = \exp\Big(-k\,\frac{ǫ^2}{G_L}\Big), \qquad 0 < ǫ < 1, \]

where the constants are

\[ G_R = \frac{2ǫ^2}{ǫ - \log(1+ǫ)} \le \frac{4}{1 - 2ǫ/3}, \qquad G_L = \frac{2ǫ^2}{-ǫ - \log(1-ǫ)} \le \frac{4}{1 + 2ǫ/3}. \]

Proof. Using the standard Chernoff inequality (Chernoff, 1952), for 0 < t < 1/2,

\[ \Pr\big(\hat{d}_{l_2} - d_{l_2} \ge ǫ d_{l_2}\big) = \Pr\big(k\hat{d}_{l_2}/d_{l_2} \ge k(1+ǫ)\big) \le E\big(\exp(t\,k\hat{d}_{l_2}/d_{l_2})\big)\exp\big(-(1+ǫ)kt\big) = (1-2t)^{-k/2}\exp\big(-(1+ǫ)kt\big), \]

which is minimized at t = ǫ/(2(1+ǫ)). Thus, for any ǫ > 0,

\[ \Pr\big(\hat{d}_{l_2} - d_{l_2} > ǫ d_{l_2}\big) \le \exp\Big(-\frac{k}{2}\big(ǫ - \log(1+ǫ)\big)\Big). \]

We can similarly prove the other tail bound for Pr(ˆd_{l_2} − d_{l_2} ≤ −ǫ d_{l_2}).

For convenience, sometimes we would like to write the tail bounds in a symmetric form,

\[ \Pr\big(|\hat{d}_{l_2} - d_{l_2}| \ge ǫ d_{l_2}\big) \le 2\exp\Big(-k\,\frac{ǫ^2}{G}\Big), \qquad 0 < ǫ < 1, \]

and we know that it suffices to let G = max{G_R, G_L} ≤ 4/(1 − 2ǫ/3).

Since there are in total n(n−1)/2 < n²/2 pairs among n data points, we would like to bound the tail probabilities simultaneously for all pairs. By the Bonferroni union bound, it suffices if

\[ n^2\exp\Big(-k\,\frac{ǫ^2}{G}\Big) \le \delta, \qquad \text{i.e., it suffices if} \qquad k \ge G\,\frac{2\log n - \log\delta}{ǫ^2}. \]

Therefore, we obtain one version of the Johnson-Lindenstrauss (JL) Lemma:

Lemma 2 If k ≥ G (2 log n − log δ)/ǫ², where G = 4/(1 − 2ǫ/3), then with probability at least 1 − δ, the squared l2 distance between any pair of data points among the n data points can be approximated within a 1 ± ǫ factor (0 < ǫ < 1), using the squared l2 distance of the projected data after normal random projections.

Many versions of the JL Lemma have been proved (Johnson and Lindenstrauss, 1984; Frankl and Maehara, 1987; Indyk and Motwani, 1998; Arriaga and Vempala, 1999; Dasgupta and Gupta, 2003; Indyk, 2000, 2001; Achlioptas, 2003; Arriaga and Vempala, 2006; Ailon and Chazelle, 2006).
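For illustration, a small helper (our own, based on the constant G as reconstructed above) that evaluates the sample size suggested by Lemma 2:

    import numpy as np

    def jl_k(n, eps, delta):
        # Sample size from Lemma 2: k >= G (2 log n - log delta) / eps^2,
        # with G = 4 / (1 - 2 eps / 3) as reconstructed above.
        G = 4.0 / (1.0 - 2.0 * eps / 3.0)
        return int(np.ceil(G * (2.0 * np.log(n) - np.log(delta)) / eps ** 2))

    print(jl_k(n=10**6, eps=0.1, delta=0.05))   # on the order of a few thousand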

Note that we do not have to use r_ij ~ N(0, 1) for dimension reduction in l2. For example, we can sample r_ij from the following sparse projection distribution:

\[ r_{ij} = \sqrt{s}\times\begin{cases} 1 & \text{with prob. } \tfrac{1}{2s} \\ 0 & \text{with prob. } 1 - \tfrac{1}{s} \\ -1 & \text{with prob. } \tfrac{1}{2s}. \end{cases} \]

When s ≤ 3, Achlioptas (2001, 2003) proved the JL Lemma for the above sparse projection. Recently, Li et al. (2006b) proposed very sparse random projections, using s ≫ 3 in the above distribution, based on two practical considerations:

- D should be very large, otherwise there would be no need for dimension reduction.
- The original l2 distance should make engineering sense, in that the second or higher moments should be bounded (otherwise various term-weighting schemes will be applied).

Based on these two practical assumptions, the projected data are asymptotically normal at a fast rate of convergence when s = √D and the data have bounded third moments. Of course, very sparse random projections do not have worst-case performance guarantees.

2.2 Cauchy Random Projections

In Cauchy random projections, we sample r_ij i.i.d. from the standard Cauchy distribution, i.e., r_ij ~ C(0, 1). By the 1-stability of the Cauchy distribution (Zolotarev, 1986), we know that

\[ x_j = v_{1,j} - v_{2,j} \sim C\Big(0, \sum_{i=1}^{D} |u_{1,i} - u_{2,i}|\Big). \]

That is, the projected differences x_j = v_{1,j} − v_{2,j} are also Cauchy random variables, with the scale parameter being the l1 distance, d = |u_1 − u_2| = Σ_{i=1}^{D} |u_{1,i} − u_{2,i}|, in the original space. Recall that a Cauchy random variable z ~ C(0, γ) has the density

\[ f(z) = \frac{\gamma}{\pi\,(z^2 + \gamma^2)}, \qquad \gamma > 0,\ -\infty < z < \infty. \]

The easiest way to see the 1-stability is via the characteristic function,

\[ E\big(\exp(\sqrt{-1}\,z t)\big) = \exp(-\gamma|t|), \qquad E\Big(\exp\Big(\sqrt{-1}\,t\sum_{i=1}^{D} c_i z_i\Big)\Big) = \exp\Big(-\gamma|t|\sum_{i=1}^{D}|c_i|\Big), \]

for z_1, z_2, ..., z_D i.i.d. C(0, γ) and any constants c_1, c_2, ..., c_D.

Therefore, in Cauchy random projections, the problem boils down to estimating the Cauchy scale parameter of C(0, d) from k i.i.d. samples x_j ~ C(0, d). Unlike in normal random projections, we can no longer estimate d from the sample mean (e.g., (1/k) Σ_{j=1}^{k} |x_j|) because E|x_j| = ∞.
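The following small simulation (not from the paper; synthetic data and arbitrary sizes) illustrates the 1-stability and why the sample mean of |x_j| is useless while simple nonlinear summaries are not:

    import numpy as np

    rng = np.random.default_rng(0)
    D, k = 10000, 100
    u1, u2 = rng.standard_normal(D), rng.standard_normal(D)
    d = np.abs(u1 - u2).sum()              # true l1 distance

    R = rng.standard_cauchy(size=(D, k))
    x = (u1 - u2) @ R                      # x_j ~ C(0, d) by 1-stability

    print(np.mean(np.abs(x)) / d)          # sample mean of |x_j|: erratic, E|x_j| is infinite
    print(np.median(np.abs(x)) / d)        # sample median of |x_j|: close to 1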

3. Main Results

Although the impossibility results (Lee and Naor, 2004; Brinkman and Charikar, 2005) have ruled out accurate estimators that are also metrics, there is enough information to recover d from the k samples {x_j}_{j=1}^{k} with high accuracy. We analyze three types of nonlinear estimators: the sample median estimators, the geometric mean estimators, and the maximum likelihood estimators.

3.1 The Sample Median Estimators

The sample median estimator,

\[ \hat{d}_{me} = \mathrm{median}\{|x_j|,\ j = 1, 2, \ldots, k\}, \]

is simple and computationally convenient. We recommend the bias-corrected version

\[ \hat{d}_{me,c} = \frac{\hat{d}_{me}}{b_{me}}, \qquad b_{me} = \frac{(2m+1)!}{m!\,m!}\int_0^1 \tan\Big(\frac{\pi}{2}t\Big)\big(t(1-t)\big)^m\,dt, \qquad k = 2m+1. \]

Here, for convenience, we only consider k = 2m + 1, m = 1, 2, 3, .... Some properties of ˆd_me,c:

- E(ˆd_me,c) = d, i.e., ˆd_me,c is unbiased.
- When k ≥ 5, the variance of ˆd_me,c is

\[ \mathrm{Var}\big(\hat{d}_{me,c}\big) = d^2\left(\frac{\frac{(2m+1)!}{m!\,m!}\int_0^1 \tan^2\big(\frac{\pi}{2}t\big)(t(1-t))^m\,dt}{\Big(\frac{(2m+1)!}{m!\,m!}\int_0^1 \tan\big(\frac{\pi}{2}t\big)(t(1-t))^m\,dt\Big)^2} - 1\right) = \frac{\pi^2}{4}\frac{d^2}{k} + O\Big(\frac{1}{k^2}\Big). \]

- b_me ≥ 1, and b_me ↓ 1 monotonically with increasing k.

3.2 The Geometric Mean Estimators

The geometric mean estimator,

\[ \hat{d}_{gm} = \prod_{j=1}^{k} |x_j|^{1/k}, \]

has the tail bounds

\[ \Pr\big(\hat{d}_{gm} \ge (1+ǫ)d\big) \le U_{R,gm} = \exp\Big(-k\,\frac{ǫ^2}{G_{R,gm}}\Big), \qquad ǫ > 0, \]

\[ \Pr\big(\hat{d}_{gm} \le (1-ǫ)d\big) \le U_{L,gm} = \exp\Big(-k\,\frac{ǫ^2}{G_{L,gm}}\Big), \qquad 0 < ǫ < 1, \]

where

\[ G_{R,gm} = \frac{ǫ^2}{\frac{2}{\pi}\log(1+ǫ)\,\tan^{-1}\big(\frac{2}{\pi}\log(1+ǫ)\big) - \frac{1}{2}\log\big(1+\frac{4}{\pi^2}\log^2(1+ǫ)\big)}, \]

\[ G_{L,gm} = \frac{ǫ^2}{\frac{2}{\pi}\log\frac{1}{1-ǫ}\,\tan^{-1}\big(\frac{2}{\pi}\log\frac{1}{1-ǫ}\big) - \frac{1}{2}\log\big(1+\frac{4}{\pi^2}\log^2(1-ǫ)\big)}. \]
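For illustration, minimal implementations of the sample median and geometric mean estimators (our own sketch; both take the projected differences x_j, e.g., x = B[0] - B[1] from the earlier sketch, as input):

    import numpy as np

    def d_me(x):
        # Sample median estimator (biased at small k; see the correction factor b_me)
        return np.median(np.abs(x))

    def d_gm(x, bias_correct=True):
        # Geometric mean estimator; the cos^k(pi/(2k)) factor is the bias correction
        # introduced with the estimator d_gm,c below
        x = np.abs(np.asarray(x, dtype=float))
        k = len(x)
        est = np.exp(np.mean(np.log(x)))
        return est * np.cos(np.pi / (2 * k)) ** k if bias_correct else est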

For small ǫ, we obtain the convenient approximations

\[ G_{R,gm} \approx \frac{\pi^2}{2}(1 + ǫ), \qquad G_{L,gm} \approx \frac{\pi^2}{2}(1 - ǫ). \]

Consequently, we establish an analog of the Johnson-Lindenstrauss (JL) Lemma for dimension reduction in l1:

If k ≥ G_gm (2 log n − log δ)/ǫ², then with probability at least 1 − δ, one can recover the original l1 distance between any pair of data points among all n data points within a 1 ± ǫ factor (0 < ǫ < 1), using ˆd_gm. The constant G_gm can be specified from G_R,gm and G_L,gm: G_gm = max{G_R,gm, G_L,gm}.

To remove the bias and also reduce the variance, we recommend the bias-corrected geometric mean estimator

\[ \hat{d}_{gm,c} = \cos^k\Big(\frac{\pi}{2k}\Big)\prod_{j=1}^{k}|x_j|^{1/k}, \]

which is unbiased and has variance

\[ \mathrm{Var}\big(\hat{d}_{gm,c}\big) = d^2\left(\frac{\cos^{2k}\big(\frac{\pi}{2k}\big)}{\cos^{k}\big(\frac{\pi}{k}\big)} - 1\right) = \frac{\pi^2}{4}\frac{d^2}{k} + \frac{\pi^4}{32}\frac{d^2}{k^2} + O\Big(\frac{1}{k^3}\Big). \]

We also derive tail bounds for ˆd_gm,c,

\[ \Pr\big(\hat{d}_{gm,c} \ge (1+ǫ)d\big) \le U_{R,gm,c}, \quad ǫ > 0, \qquad \Pr\big(\hat{d}_{gm,c} \le (1-ǫ)d\big) \le U_{L,gm,c}, \quad 0 < ǫ < 1, \]

and show that, compared with ˆd_gm, the ratios of the tail bounds converge,

\[ \rho_{R,k} = \frac{U_{R,gm,c}}{U_{R,gm}} \to \rho_{R,\infty}, \qquad \rho_{L,k} = \frac{U_{L,gm,c}}{U_{L,gm}} \to \rho_{L,\infty}, \]

as k → ∞, where ρ_{R,∞} and ρ_{L,∞} are functions of ǫ only, given explicitly in Lemma 9 in terms of constants A_1, C_1, A_2, and C_2.

3.3 The Maximum Likelihood Estimators

Denoted by ˆd_MLE,c, the bias-corrected maximum likelihood estimator (MLE) is

\[ \hat{d}_{MLE,c} = \hat{d}_{MLE}\Big(1 - \frac{1}{k}\Big), \]

where ˆd_MLE solves the nonlinear MLE equation

\[ -\frac{k}{\hat{d}_{MLE}} + \sum_{j=1}^{k}\frac{2\hat{d}_{MLE}}{x_j^2 + \hat{d}_{MLE}^2} = 0. \]

Some properties of ˆd_MLE,c:

- It is nearly unbiased: E(ˆd_MLE,c) = d + O(1/k²).
- Its asymptotic variance is

\[ \mathrm{Var}\big(\hat{d}_{MLE,c}\big) = \frac{2d^2}{k} + \frac{3d^2}{k^2} + O\Big(\frac{1}{k^3}\Big), \qquad \text{i.e.,} \qquad \frac{\mathrm{Var}(\hat{d}_{MLE,c})}{\mathrm{Var}(\hat{d}_{me,c})} \to \frac{8}{\pi^2}, \quad \frac{\mathrm{Var}(\hat{d}_{MLE,c})}{\mathrm{Var}(\hat{d}_{gm,c})} \to \frac{8}{\pi^2} \approx 80\%, \quad \text{as } k \to \infty. \]

- Its distribution can be accurately approximated by an inverse Gaussian, at least in the small deviation range, which suggests the following approximate tail bound,

\[ \Pr\big(|\hat{d}_{MLE,c} - d| \ge ǫ d\big) \lessapprox 2\exp\Big(-\frac{ǫ^2/2}{1+ǫ}\cdot\frac{k^2}{2k+3}\Big), \qquad 0 < ǫ < 1, \]

which is verified by simulations down to the 10^{-10} tail probability range.

4. The Sample Median Estimators

Recall that in Cauchy random projections, B = AR, we denote the leading two rows in A by u_1, u_2 ∈ R^D and the leading two rows in B by v_1, v_2 ∈ R^k. Our goal is to estimate the l1 distance d = |u_1 − u_2| = Σ_{i=1}^{D}|u_{1,i} − u_{2,i}| from {x_j}_{j=1}^{k}, x_j = v_{1,j} − v_{2,j} ~ C(0, d), i.i.d.

A widely-used estimator in statistics is based on sample inter-quantiles (Fama and Roll, 1968, 1971; McCulloch, 1986). For the symmetric Cauchy, the absolute sample median estimator

\[ \hat{d}_{me} = \mathrm{median}\{|x_j|,\ j = 1, 2, \ldots, k\} \]

is convenient because the population median of the absolute Cauchy is exactly d (Indyk, 2006). It is well known in statistics that ˆd_me is asymptotically unbiased and normal; see Lemma 3. For small samples (e.g., k ≤ 20), however, ˆd_me is severely biased.

Lemma 3 The sample median estimator ˆd_me is asymptotically unbiased and normal,

\[ \sqrt{k}\big(\hat{d}_{me} - d\big) \xrightarrow{D} N\Big(0, \frac{\pi^2}{4}d^2\Big). \]

When k = 2m + 1, m = 1, 2, 3, ..., the r-th moment of ˆd_me can be represented as

\[ E\big(\hat{d}_{me}^{\,r}\big) = d^r\,\frac{(2m+1)!}{m!\,m!}\int_0^1 \tan^r\Big(\frac{\pi}{2}t\Big)\big(t(1-t)\big)^m\,dt, \qquad m \ge r. \]

If m < r, then E(ˆd_me^r) = ∞.

Proof of Lemma 3. Let f(z; d) and F(z; d) be the probability density and cumulative distribution function, respectively, of |C(0, d)|:

\[ f(z; d) = \frac{2}{\pi}\frac{d}{z^2 + d^2}, \qquad F(z; d) = \frac{2}{\pi}\tan^{-1}\Big(\frac{z}{d}\Big), \qquad z \ge 0. \]

The inverse of F(z; d) is F^{-1}(q; d) = d tan(πq/2). Here we take q = 1/2, to consider the sample median. By the asymptotic normality of sample quantiles (David, 1981, Theorem 9.2),

\[ \sqrt{k}\big(\hat{d}_{me} - d\big) \xrightarrow{D} N\left(0, \frac{1/4}{f^2\big(d\tan(\pi/4);\,d\big)}\right) = N\Big(0, \frac{\pi^2}{4}d^2\Big), \]

i.e., ˆd_me is asymptotically unbiased and normal with variance Var(ˆd_me) = (π²/4) d²/k + O(1/k²).

For convenience, we assume k = 2m + 1. Again, by properties of sample quantiles (David, 1981), the probability density of ˆd_me is

\[ f_{\hat{d}_{me}}(z) = \frac{(2m+1)!}{m!\,m!}\,F(z; d)^m\big(1 - F(z; d)\big)^m f(z; d), \]

from which we can write down the r-th moment of ˆd_me after a change of variables.

Once we know E(ˆd_me), we can design an unbiased estimator, as described in Lemma 4.

Lemma 4 The estimator ˆd_me,c = ˆd_me / b_me is unbiased, i.e., E(ˆd_me,c) = d, where the bias-correction factor is

\[ b_{me} = \frac{E(\hat{d}_{me})}{d} = \frac{(2m+1)!}{m!\,m!}\int_0^1 \tan\Big(\frac{\pi}{2}t\Big)\big(t(1-t)\big)^m\,dt, \qquad k = 2m+1. \]

The variance of ˆd_me,c is, for k = 2m + 1 ≥ 5,

\[ \mathrm{Var}\big(\hat{d}_{me,c}\big) = d^2\left(\frac{\frac{(2m+1)!}{m!\,m!}\int_0^1 \tan^2\big(\frac{\pi}{2}t\big)(t(1-t))^m\,dt}{\Big(\frac{(2m+1)!}{m!\,m!}\int_0^1 \tan\big(\frac{\pi}{2}t\big)(t(1-t))^m\,dt\Big)^2} - 1\right). \]

ˆd_me,c and ˆd_me are asymptotically equivalent, i.e.,

\[ \sqrt{k}\big(\hat{d}_{me,c} - d\big) \xrightarrow{D} N\Big(0, \frac{\pi^2}{4}d^2\Big). \]

The bias-correction factor b_me decreases monotonically with increasing m, with b_me ≥ 1 and lim_{m→∞} b_me = 1.
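The correction factor b_me has no closed form but is a one-dimensional integral; the small sketch below (ours, using SciPy quadrature) evaluates it numerically:

    import numpy as np
    from math import factorial
    from scipy.integrate import quad

    def b_me(k):
        # Bias-correction factor of Lemma 4, for odd k = 2m + 1 with m >= 1
        assert k % 2 == 1 and k >= 3
        m = (k - 1) // 2
        c = factorial(2 * m + 1) / (factorial(m) ** 2)
        val, _ = quad(lambda t: np.tan(np.pi * t / 2) * (t * (1 - t)) ** m, 0, 1)
        return c * val

    # Bias-corrected sample median estimate: d_me_c = median(|x_j|) / b_me(k)
    print(b_me(11), b_me(101))   # noticeably above 1 at small k, approaches 1 as k grows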

Proof of Lemma 4. Most of the results follow directly from Lemma 3. Here we only show that b_me decreases monotonically and that b_me → 1 as m → ∞. Note that ((2m+1)!/(m! m!)) (t(1−t))^m, 0 ≤ t ≤ 1, is the probability density of a Beta distribution, Beta(m+1, m+1), whose r-th moment is

\[ E(z^r) = \frac{(2m+1)!\,(m+r)!}{(2m+1+r)!\,m!}. \]

By Taylor expansion (Gradshteyn and Ryzhik, 1994, 1.411.6),

\[ \tan\Big(\frac{\pi}{2}t\Big) = \sum_{j=1}^{\infty}\frac{2^{2j}(2^{2j}-1)\,|B_{2j}|}{(2j)!}\Big(\frac{\pi}{2}\Big)^{2j-1} t^{2j-1}, \]

where B_{2j} are the Bernoulli numbers (Gradshteyn and Ryzhik, 1994, 9.61). Therefore,

\[ b_{me} = \sum_{j=1}^{\infty}\frac{2^{2j}(2^{2j}-1)\,|B_{2j}|}{(2j)!}\Big(\frac{\pi}{2}\Big)^{2j-1}\frac{(2m+1)!\,(m+2j-1)!}{(2m+2j)!\,m!}. \]

It is easy to show that (2m+1)!(m+2j−1)!/((2m+2j)! m!) decreases monotonically with increasing m and converges to (1/2)^{2j−1}. Thus b_me also decreases monotonically with increasing m, and from the Taylor expansion of tan we know that

\[ \lim_{m\to\infty} b_{me} = \sum_{j=1}^{\infty}\frac{2^{2j}(2^{2j}-1)\,|B_{2j}|}{(2j)!}\Big(\frac{\pi}{2}\Big)^{2j-1}\Big(\frac{1}{2}\Big)^{2j-1} = \tan\Big(\frac{\pi}{4}\Big) = 1. \]

It is well known that bias corrections are not always beneficial, because of the bias-variance trade-off. In our case, because the correction factor b_me is always at least 1, the bias correction not only removes the bias of ˆd_me but also reduces its variance. The factor b_me can be numerically evaluated and tabulated, at least for small k. Figure 1 plots b_me as a function of k, indicating that ˆd_me is severely biased when k ≤ 20. When k > 50, the bias becomes negligible.

5. The Geometric Mean Estimators

This section derives estimators based on the geometric mean, which are more accurate than the sample median estimators. The geometric mean estimators allow us to derive tail bounds in explicit forms and consequently establish an analog of the Johnson-Lindenstrauss (JL) Lemma for dimension reduction in the l1 norm.

Lemma 5 Assume x ~ C(0, d). Then

\[ E\big(|x|^{\lambda}\big) = \frac{d^{\lambda}}{\cos(\lambda\pi/2)}, \qquad |\lambda| < 1. \]

Figure 1: The bias-correction factor b_me as a function of the sample size k = 2m + 1. After k > 50, the bias is negligible.

Proof of Lemma 5. Using the integral table (Gradshteyn and Ryzhik, 1994, page 337),

\[ E\big(|x|^{\lambda}\big) = \frac{2}{\pi}\int_0^{\infty}\frac{y^{\lambda}\,d}{y^2 + d^2}\,dy = \frac{d^{\lambda}}{\cos(\lambda\pi/2)}. \]

From Lemma 5, by taking λ = 1/k, we obtain an unbiased estimator, ˆd_gm,c, based on the bias-corrected geometric mean, in the next lemma, which is proved in Appendix A.

Lemma 6 The estimator

\[ \hat{d}_{gm,c} = \cos^k\Big(\frac{\pi}{2k}\Big)\prod_{j=1}^{k}|x_j|^{1/k}, \qquad k > 1, \]

is unbiased, with variance (valid when k > 2)

\[ \mathrm{Var}\big(\hat{d}_{gm,c}\big) = d^2\left(\frac{\cos^{2k}\big(\frac{\pi}{2k}\big)}{\cos^{k}\big(\frac{\pi}{k}\big)} - 1\right) = \frac{\pi^2}{4}\frac{d^2}{k} + \frac{\pi^4}{32}\frac{d^2}{k^2} + O\Big(\frac{1}{k^3}\Big). \]

The third and fourth central moments are

\[ E\big(\hat{d}_{gm,c} - E\hat{d}_{gm,c}\big)^3 = \frac{3\pi^4}{16}\frac{d^3}{k^2} + O\Big(\frac{1}{k^3}\Big), \qquad E\big(\hat{d}_{gm,c} - E\hat{d}_{gm,c}\big)^4 = \frac{3\pi^4}{16}\frac{d^4}{k^2} + O\Big(\frac{1}{k^3}\Big). \]

The higher (third or fourth) moments may be useful for approximating the distribution of ˆd_gm,c. In Section 6, we will show how to approximate the distribution of the maximum likelihood estimator by matching the first four moments in the leading terms. We could apply a similar technique to approximate ˆd_gm,c; fortunately, we do not have to do so, because we are able to derive exact tail bounds for ˆd_gm,c in Lemma 9.

Note that, as k → ∞, the bias-correction term converges to one quickly:

\[ \cos^k\Big(\frac{\pi}{2k}\Big) = \Big(1 - \frac{\pi^2}{8k^2} + \cdots\Big)^k = 1 - \frac{\pi^2}{8k} + \cdots \to 1. \]

When k is not too small (e.g., k > 50), the geometric mean estimator without bias correction,

\[ \hat{d}_{gm} = \prod_{j=1}^{k}|x_j|^{1/k}, \qquad k > 1, \]

gives similar results as ˆd_gm,c. As shown in Figure 2, the ratio of the mean square errors (MSE),

\[ \frac{\mathrm{MSE}\big(\hat{d}_{gm}\big)}{\mathrm{MSE}\big(\hat{d}_{gm,c}\big)} = \frac{\dfrac{1}{\cos^{k}(\pi/k)} - \dfrac{2}{\cos^{k}(\pi/(2k))} + 1}{\dfrac{\cos^{2k}(\pi/(2k))}{\cos^{k}(\pi/k)} - 1}, \]

demonstrates that the two geometric mean estimators are similar when k > 50, in terms of the MSE.

Figure 2: The ratio of the mean square errors, MSE(ˆd_gm)/MSE(ˆd_gm,c), indicating that the difference between ˆd_gm and ˆd_gm,c becomes negligible when k > 50.

One advantage of ˆd_gm is the convenience for deriving tail bounds. Thus, before presenting Lemma 9 for ˆd_gm,c, we prove tail bounds for ˆd_gm in Lemma 7, which is proved in Appendix B.

Lemma 7

\[ \Pr\big(\hat{d}_{gm} \ge (1+ǫ)d\big) \le U_{R,gm} = \exp\Big(-k\,\frac{ǫ^2}{G_{R,gm}}\Big), \qquad ǫ > 0, \]

\[ \Pr\big(\hat{d}_{gm} \le (1-ǫ)d\big) \le U_{L,gm} = \exp\Big(-k\,\frac{ǫ^2}{G_{L,gm}}\Big), \qquad 0 < ǫ < 1, \]

where

\[ G_{R,gm} = \frac{ǫ^2}{\frac{2}{\pi}\log(1+ǫ)\,\tan^{-1}\big(\frac{2}{\pi}\log(1+ǫ)\big) - \frac{1}{2}\log\big(1+\frac{4}{\pi^2}\log^2(1+ǫ)\big)}, \]

\[ G_{L,gm} = \frac{ǫ^2}{\frac{2}{\pi}\log\frac{1}{1-ǫ}\,\tan^{-1}\big(\frac{2}{\pi}\log\frac{1}{1-ǫ}\big) - \frac{1}{2}\log\big(1+\frac{4}{\pi^2}\log^2(1-ǫ)\big)}. \]

Moreover, for small ǫ, we have the convenient approximations

\[ G_{R,gm} \approx \frac{\pi^2}{2}(1+ǫ), \qquad G_{L,gm} \approx \frac{\pi^2}{2}(1-ǫ). \]

Consequently, as ǫ → 0+, we know that G_R,gm, G_L,gm → π²/2.

Figure 3 plots the constants G_R,gm and G_L,gm along with their convenient approximations. For G_R,gm, the exact and approximate expressions are indistinguishable when ǫ < 1. For G_L,gm, the exact and approximate expressions are indistinguishable when ǫ < 0.7. The plots also suggest that the approximations are upper bounds of the exact constants.

Figure 3: The constants G_R,gm (panel a) and G_L,gm (panel b), along with their convenient approximations.
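For illustration, the constants G_R,gm and G_L,gm plotted in Figure 3, and the resulting sample size for the l1 analog of the JL Lemma (Lemma 8 below), can be evaluated numerically as follows (our own sketch, based on the expressions above):

    import numpy as np

    def G_R_gm(eps):
        L = np.log(1.0 + eps)
        return eps**2 / ((2.0/np.pi) * L * np.arctan(2.0 * L / np.pi)
                         - 0.5 * np.log1p((2.0 * L / np.pi) ** 2))

    def G_L_gm(eps):
        L = np.log(1.0 / (1.0 - eps))
        return eps**2 / ((2.0/np.pi) * L * np.arctan(2.0 * L / np.pi)
                         - 0.5 * np.log1p((2.0 * L / np.pi) ** 2))

    def k_gm(n, eps, delta):
        # Sample size for the l1 analog of the JL Lemma (Lemma 8 below)
        G = max(G_R_gm(eps), G_L_gm(eps))
        return int(np.ceil(G * (2.0 * np.log(n) - np.log(delta)) / eps ** 2))

    print(G_R_gm(0.2), G_L_gm(0.2))    # both near pi^2/2 ~ 4.93 for small eps
    print(k_gm(n=10**6, eps=0.2, delta=0.05))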

Consequently, Lemma 7 establishes an analog of the Johnson-Lindenstrauss (JL) Lemma for dimension reduction in l1:

Lemma 8 If k ≥ G_gm (2 log n − log δ)/ǫ², then with probability at least 1 − δ, one can recover the original l1 distance between any pair of data points among all n data points within a 1 ± ǫ factor (0 < ǫ < 1), using ˆd_gm. It suffices to specify the constant G_gm = max{G_R,gm, G_L,gm}.

Similarly, we derive tail bounds for the unbiased geometric mean estimator ˆd_gm,c in Lemma 9, which is proved in Appendix C.

Lemma 9

\[ \Pr\big(\hat{d}_{gm,c} \ge (1+ǫ)d\big) \le U_{R,gm,c} = \frac{\cos^{k t_1}\big(\frac{\pi}{2k}\big)}{\cos^{k}\big(\frac{\pi t_1}{2k}\big)\,(1+ǫ)^{t_1}}, \qquad ǫ > 0, \]

where

\[ t_1 = \frac{2k}{\pi}\tan^{-1}\Big(\frac{2}{\pi}\Big(\log(1+ǫ) - k\log\cos\Big(\frac{\pi}{2k}\Big)\Big)\Big); \]

\[ \Pr\big(\hat{d}_{gm,c} \le (1-ǫ)d\big) \le U_{L,gm,c} = \frac{(1-ǫ)^{t_2}}{\cos^{k t_2}\big(\frac{\pi}{2k}\big)\,\cos^{k}\big(\frac{\pi t_2}{2k}\big)}, \qquad 0 < ǫ < 1,\ k \ge \frac{\pi^2}{8ǫ}, \]

where

\[ t_2 = \frac{2k}{\pi}\tan^{-1}\Big(\frac{2}{\pi}\Big(\log\frac{1}{1-ǫ} + k\log\cos\Big(\frac{\pi}{2k}\Big)\Big)\Big). \]

As k → ∞, for any fixed ǫ,

\[ \rho_{R,k} = \frac{U_{R,gm,c}}{U_{R,gm}} \to \rho_{R,\infty} = (1+ǫ)^{-C_1}\exp\Big(-\frac{\pi}{4}A_1 + \frac{\pi}{2}C_1\tan A_1\Big), \]

\[ \rho_{L,k} = \frac{U_{L,gm,c}}{U_{L,gm}} \to \rho_{L,\infty} = (1-ǫ)^{-C_2}\exp\Big(\frac{\pi}{4}A_2 - \frac{\pi}{2}C_2\tan A_2\Big), \]

where U_R,gm and U_L,gm are the upper bounds for ˆd_gm derived in Lemma 7, and

\[ A_1 = \tan^{-1}\Big(\frac{2}{\pi}\log(1+ǫ)\Big), \quad C_1 = \frac{1/2}{1+\frac{4}{\pi^2}\log^2(1+ǫ)}, \quad A_2 = \tan^{-1}\Big(\frac{2}{\pi}\log\frac{1}{1-ǫ}\Big), \quad C_2 = \frac{1/2}{1+\frac{4}{\pi^2}\log^2(1-ǫ)}. \]

Figure 4 plots the tail bound ratios ρ_{R,k} and ρ_{L,k} defined in Lemma 9, indicating that the asymptotic expressions ρ_{R,∞} and ρ_{L,∞} are in fact very accurate even for small k.

Figure 4: Tail bound ratios ρ_{R,k} (panel a) and ρ_{L,k} (panel b) for several small values of k, along with the asymptotic expressions ρ_{R,∞} and ρ_{L,∞}.

Figure 4 also illustrates that introducing the bias-correction term in ˆd_gm,c reduces the right tail bound but amplifies the left tail bound. Because the left tail bound is usually much smaller than the right tail bound, we expect that overall the bias correction should be beneficial, as shown in Figure 5, which plots the overall ratio of tail bounds,

\[ \rho_k = \frac{U_{R,gm,c} + U_{L,gm,c}}{U_{R,gm} + U_{L,gm}}. \]

Figure 5: The overall ratios of tail bounds, ρ_k, for several values of k, are almost always below one, demonstrating that the bias-corrected estimator ˆd_gm,c may exhibit better overall tail behavior than the biased estimator ˆd_gm.

Finally, Figure 6 compares ˆd_gm,c with the sample median estimators ˆd_me and ˆd_me,c in terms of the mean square errors. ˆd_gm,c is considerably more accurate than ˆd_me at small k, and the bias correction significantly reduces the mean square error of ˆd_me.

Figure 6: The ratios of the mean square errors, MSE(ˆd_me)/MSE(ˆd_gm,c) (no correction) and MSE(ˆd_me,c)/MSE(ˆd_gm,c) (bias-corrected), demonstrating that the bias-corrected geometric mean estimator ˆd_gm,c is considerably more accurate than the sample median estimator ˆd_me, and that the bias correction on ˆd_me considerably reduces its MSE. Note that when k = 3, the ratios are infinite.

6. The Maximum Likelihood Estimators

This section analyzes the maximum likelihood estimators (MLE), which are asymptotically optimum in terms of the variance. In comparison, the sample median and geometric mean estimators are not optimum. Our contributions in this section include the higher-order analysis of the bias and moments, and accurate closed-form approximations to the distribution of the MLE.

The method of maximum likelihood is widely used. For example, Li et al. (2006a) applied the maximum likelihood method to normal random projections and provided an improved estimator of the l2 distance by taking advantage of the marginal information.

Recall that our goal is to estimate d from k i.i.d. samples x_j ~ C(0, d), j = 1, 2, ..., k. The log joint likelihood of {x_j}_{j=1}^{k} is

\[ L(x_1, x_2, \ldots, x_k; d) = k\log d - k\log\pi - \sum_{j=1}^{k}\log\big(x_j^2 + d^2\big), \]

whose first and second derivatives with respect to d are

\[ L'(d) = \frac{k}{d} - \sum_{j=1}^{k}\frac{2d}{x_j^2 + d^2}, \qquad L''(d) = -\frac{k}{d^2} - \sum_{j=1}^{k}\frac{2\big(x_j^2 - d^2\big)}{\big(x_j^2 + d^2\big)^2} = -\frac{L'(d)}{d} - \sum_{j=1}^{k}\frac{4x_j^2}{\big(x_j^2 + d^2\big)^2}. \]

The maximum likelihood estimator of d, denoted by ˆd_MLE, is the solution to L'(d) = 0, i.e.,

\[ -\frac{k}{\hat{d}_{MLE}} + \sum_{j=1}^{k}\frac{2\hat{d}_{MLE}}{x_j^2 + \hat{d}_{MLE}^2} = 0. \]

Because L''(ˆd_MLE) ≤ 0, ˆd_MLE indeed maximizes the joint likelihood and is the only solution to the MLE equation. Solving the equation numerically is not difficult (e.g., a few iterations of Newton's method). For better accuracy, we recommend the following bias-corrected estimator:

\[ \hat{d}_{MLE,c} = \hat{d}_{MLE}\Big(1 - \frac{1}{k}\Big). \]

Lemma 10 concerns the asymptotic moments of ˆd_MLE and ˆd_MLE,c; it is proved in Appendix D.

Lemma 10 Both ˆd_MLE and ˆd_MLE,c are asymptotically unbiased and normal. The first four moments of ˆd_MLE are

\[ E\big(\hat{d}_{MLE}\big) - d = \frac{d}{k} + O\Big(\frac{1}{k^2}\Big), \qquad \mathrm{Var}\big(\hat{d}_{MLE}\big) = \frac{2d^2}{k} + \frac{7d^2}{k^2} + O\Big(\frac{1}{k^3}\Big), \]

\[ E\big(\hat{d}_{MLE} - E\hat{d}_{MLE}\big)^3 = \frac{12d^3}{k^2} + O\Big(\frac{1}{k^3}\Big), \qquad E\big(\hat{d}_{MLE} - E\hat{d}_{MLE}\big)^4 = \frac{12d^4}{k^2} + O\Big(\frac{1}{k^3}\Big). \]

The first four moments of ˆd_MLE,c are

\[ E\big(\hat{d}_{MLE,c}\big) - d = O\Big(\frac{1}{k^2}\Big), \qquad \mathrm{Var}\big(\hat{d}_{MLE,c}\big) = \frac{2d^2}{k} + \frac{3d^2}{k^2} + O\Big(\frac{1}{k^3}\Big), \]

\[ E\big(\hat{d}_{MLE,c} - E\hat{d}_{MLE,c}\big)^3 = \frac{12d^3}{k^2} + O\Big(\frac{1}{k^3}\Big), \qquad E\big(\hat{d}_{MLE,c} - E\hat{d}_{MLE,c}\big)^4 = \frac{12d^4}{k^2} + O\Big(\frac{1}{k^3}\Big). \]

The O(1/k) term of the variance, i.e., 2d²/k, is well known (Haas et al., 1970). We derive the bias-corrected estimator ˆd_MLE,c and the higher-order moments using stochastic Taylor expansions (Bartlett, 1953; Shenton and Bowman, 1963; Ferrari et al., 1996; Cysneiros et al., 2001). We will propose an inverse Gaussian distribution to approximate the distribution of ˆd_MLE,c, by matching the first four moments, at least in the leading terms.
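A minimal sketch of the estimator (ours; the Newton starting point and the safeguards are arbitrary choices) that solves the MLE equation above and applies the bias correction:

    import numpy as np

    def d_mle(x, tol=1e-10, max_iter=100):
        # Solve  -k/d + sum_j 2d/(x_j^2 + d^2) = 0  by Newton's method and apply
        # the (1 - 1/k) bias correction.
        x = np.abs(np.asarray(x, dtype=float))
        k = len(x)
        d = np.median(x)                  # start from the sample median estimate
        for _ in range(max_iter):
            g = -k / d + np.sum(2.0 * d / (x**2 + d**2))                     # the MLE equation
            gp = k / d**2 + np.sum(2.0 * (x**2 - d**2) / (x**2 + d**2)**2)   # its derivative in d
            d_new = d - g / gp
            if d_new <= 0.0:
                d_new = d / 2.0           # keep the iterate positive
            if abs(d_new - d) < tol * d:
                d = d_new
                break
            d = d_new
        return d * (1.0 - 1.0 / k)        # bias-corrected estimate d_MLE,c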

6.1 A Numerical Example

The maximum likelihood estimators are tested on some Microsoft Web crawl data, a term-by-document matrix with D = 2^16 Web pages. We conduct Cauchy random projections and estimate the l1 distances between words. In this experiment, we compare the empirical and asymptotic (theoretical) moments, using one pair of words. Figure 7 illustrates that the bias correction is effective and that the asymptotic formulas for the first four moments of ˆd_MLE,c in Lemma 10 are accurate once k is not too small (roughly k ≥ 20).

Figure 7: One pair of words is selected from a Microsoft term-by-document matrix with D = 2^16 Web pages. We conduct Cauchy random projections and estimate the l1 distance between the pair of words using the maximum likelihood estimator ˆd_MLE and the bias-corrected version ˆd_MLE,c. Panel (a) plots the biases of ˆd_MLE and ˆd_MLE,c, indicating that the bias correction is effective. Panels (b), (c), and (d) plot the variance, third moment, and fourth moment of ˆd_MLE,c, respectively. The dashed curves are the theoretical asymptotic moments, which are accurate once k is not too small.

6.2 Approximate Distributions

Theoretical analysis of the exact distribution of a maximum likelihood estimator is difficult. It is common practice to assume normality, which, however, is inaccurate. (The simple normal approximation can be improved by taking advantage of the conditional density on the ancillary configuration statistic, based on the observations x_1, x_2, ..., x_k; Fisher, 1934; Lawless, 1972; Hinkley, 1978.) The Edgeworth expansion improves the normal approximation by matching higher moments (Feller, 1971; Bhattacharya and Ghosh, 1978; Severini, 2000), but it has some well-known drawbacks. The resultant expressions are quite sophisticated and are not accurate at the tails. It is possible that the approximate probability takes values below zero. Also, Edgeworth expansions consider the support to be (−∞, ∞), while ˆd_MLE,c is non-negative. The saddle-point approximation (Jensen, 1995) in general improves on Edgeworth expansions, often considerably. Unfortunately, we can not apply the saddle-point approximation in our case (at least not directly), because it requires a bounded moment generating function.

We propose approximating the distribution of ˆd_MLE,c directly using some well-studied common distributions. We first consider a gamma distribution with the same first two asymptotic moments as ˆd_MLE,c. That is, the gamma distribution is asymptotically equivalent to the normal approximation. While a normal has zero third central moment, a gamma has a nonzero third central moment, which, to an extent, speeds up the rate of convergence. Another important reason why a gamma is more accurate is that it has the same support as ˆd_MLE,c, i.e., [0, ∞).

We furthermore consider a generalized gamma distribution, which allows us to match the first three asymptotic moments of ˆd_MLE,c. Interestingly, in this case the generalized gamma approximation turns out to be an inverse Gaussian distribution, which has a closed-form probability density. More interestingly, this inverse Gaussian distribution also matches the fourth central moment of ˆd_MLE,c in the leading term and almost in the next-order term. By simulations, the inverse Gaussian approximation is highly accurate.

Note that, since we are interested in the very small (e.g., 10^{-10}) tail probability range, an O(k^{-3/2}) error term is not too meaningful; for example, k^{-3/2} = 10^{-3} if k = 100. Therefore, we have to rely on simulations to assess the accuracy of the approximations. On the other hand, an upper bound may hold exactly (verified by simulations) even if it is based on an approximate distribution.

As related work, Li et al. (2006c) applied gamma and generalized gamma approximations to model the performance measure distribution in some wireless communication channels using random matrix theory, and produced accurate results in evaluating the error probabilities.

6.2.1 THE GAMMA APPROXIMATION

The gamma approximation is an obvious improvement over the normal approximation. (Recall that, in normal random projections for dimension reduction in l2, see Lemma 1, the resultant estimator of the squared l2 distance has a chi-square distribution, which is a special case of the gamma.) A gamma distribution G(α, β) has two parameters, α and β, which can be determined by matching the first two asymptotic moments of ˆd_MLE,c. That is, we assume ˆd_MLE,c ~ G(α, β) with

\[ \alpha\beta = d, \qquad \alpha\beta^2 = \frac{2d^2}{k} + \frac{3d^2}{k^2}, \qquad \Longrightarrow \qquad \alpha = \frac{k^2}{2k+3}, \qquad \beta = \frac{2d}{k} + \frac{3d}{k^2}. \]
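For illustration, the gamma approximation with these parameters can be evaluated directly (our own sketch using SciPy; d is set to 1, since the distribution of ˆd_MLE,c / d depends only on k):

    import numpy as np
    from scipy.stats import gamma

    def gamma_tails(eps, k):
        # Tail probabilities under the gamma approximation matched above (d = 1)
        alpha = k**2 / (2.0 * k + 3.0)
        beta = (2.0 * k + 3.0) / k**2          # beta with d = 1
        right = gamma.sf(1.0 + eps, a=alpha, scale=beta)
        left = gamma.cdf(1.0 - eps, a=alpha, scale=beta)
        return right, left

    print(gamma_tails(0.2, k=100))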

Assuming a gamma distribution, it is easy to obtain the following Chernoff bounds:

\[ \Pr\big(\hat{d}_{MLE,c} \ge (1+ǫ)d\big) \lessapprox \exp\big(-\alpha\big(ǫ - \log(1+ǫ)\big)\big), \qquad ǫ > 0, \]

\[ \Pr\big(\hat{d}_{MLE,c} \le (1-ǫ)d\big) \lessapprox \exp\big(-\alpha\big(-ǫ - \log(1-ǫ)\big)\big), \qquad 0 < ǫ < 1, \]

where we use ⪅ to indicate that these inequalities are based on an approximate distribution.

Note that the distribution of ˆd_MLE/d (and hence ˆd_MLE,c/d) is only a function of k (Antle and Bain, 1969; Haas et al., 1970). Therefore, we can evaluate the accuracy of the gamma approximation by simulations with d = 1, as presented in Figure 8.

Figure 8: We consider k = 10, 20, 50, 100, 200, and 400. For each k, we simulate standard Cauchy samples, from which we estimate the Cauchy parameter by the MLE ˆd_MLE,c and compute the tail probabilities. Panel (a) compares the empirical tail probabilities (thick solid) with the gamma tail probabilities (thin solid), indicating that the gamma distribution is better than the normal (dashed) for approximating the distribution of ˆd_MLE,c. Panel (b) compares the empirical tail probabilities with the gamma Chernoff upper bounds.

Figure 8(a) shows that both the gamma and normal approximations are fairly accurate when the tail probability is not too small, and the gamma approximation is obviously better. Figure 8(b) compares the empirical tail probabilities with the gamma Chernoff upper bounds, indicating that these bounds are reliable when the tail probability is not too small.

6.2.2 THE INVERSE GAUSSIAN (GENERALIZED GAMMA) APPROXIMATION

The distribution of ˆd_MLE,c can be well approximated by an inverse Gaussian distribution, which is a special case of the three-parameter generalized gamma distribution (Hougaard, 1986; Gerber, 1991; Li et al., 2006c), denoted by GG(α, β, η). Note that the usual gamma distribution is a special case with η = 1. If z ~ GG(α, β, η), then the first three moments are

\[ E(z) = \alpha\beta, \qquad \mathrm{Var}(z) = \alpha\beta^2, \qquad E\big(z - E(z)\big)^3 = (1+\eta)\,\alpha\beta^3. \]

We can approximate the distribution of ˆd_MLE,c by matching the first three moments, i.e.,

\[ \alpha\beta = d, \qquad \alpha\beta^2 = \frac{2d^2}{k} + \frac{3d^2}{k^2}, \qquad (1+\eta)\,\alpha\beta^3 = \frac{12d^3}{k^2}, \]

from which we obtain

\[ \alpha = \frac{k^2}{2k+3}, \qquad \beta = \frac{2d}{k} + \frac{3d}{k^2}, \qquad \eta = 2 + O\Big(\frac{1}{k}\Big). \]

Taking only the leading term for η, the generalized gamma approximation of ˆd_MLE,c is

\[ GG\Big(\frac{k^2}{2k+3},\ \frac{2d}{k} + \frac{3d}{k^2},\ 2\Big). \]

In general, a generalized gamma distribution does not have a closed-form density function (although it always has a closed-form moment generating function). In our case, the above is actually an inverse Gaussian (IG) distribution, which has a closed-form density function. (The inverse Gaussian distribution was first noted as the distribution of the first passage time of Brownian motion with a positive drift; it has many interesting properties, such as infinite divisibility. Two monographs, Chhikara and Folks, 1989, and Seshadri, 1993, are devoted entirely to the inverse Gaussian distributions.) Assuming ˆd_MLE,c ~ IG(α, β), with the parameters α and β defined above, the moment generating function (MGF), the probability density function (PDF), and the cumulative distribution function (CDF) are (Seshadri, 1993; Tweedie, 1957a,b)

\[ E\big(\exp(\hat{d}_{MLE,c}\,t)\big) \doteq \exp\Big(\alpha\big(1 - (1 - 2\beta t)^{1/2}\big)\Big), \]

\[ f_{\hat{d}_{MLE,c}}(y) \doteq \sqrt{\frac{\alpha d}{2\pi y^3}}\,\exp\Big(-\frac{\alpha(y-d)^2}{2dy}\Big), \]

\[ \Pr\big(\hat{d}_{MLE,c} \le y\big) \doteq \Phi\Big(\sqrt{\frac{\alpha}{dy}}\,(y - d)\Big) + e^{2\alpha}\,\Phi\Big(-\sqrt{\frac{\alpha}{dy}}\,(y + d)\Big), \]

where Φ(·) is the standard normal CDF, i.e., Φ(z) = ∫_{−∞}^{z} (2π)^{-1/2} e^{-t²/2} dt. Here we use ≐ (and ⪅ below) to indicate that these equalities are based on an approximate distribution.

Assuming ˆd_MLE,c ~ IG(α, β), the fourth central moment is

\[ E\big(\hat{d}_{MLE,c} - E\hat{d}_{MLE,c}\big)^4 \doteq 15\alpha\beta^4 + 3\alpha^2\beta^4 = \frac{12d^4}{k^2} + O\Big(\frac{1}{k^3}\Big). \]
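For illustration, the closed-form CDF above gives the tail probabilities of the inverse Gaussian approximation directly (our own sketch; d is set to 1):

    import numpy as np
    from scipy.stats import norm

    def ig_tails(eps, k):
        # Tail probabilities of the inverse Gaussian approximation, via the closed-form
        # CDF above with d = 1.  The e^{2 alpha} Phi(.) term is evaluated on the log
        # scale to avoid overflow at large k.
        alpha = k**2 / (2.0 * k + 3.0)
        s_r = np.sqrt(alpha / (1.0 + eps))
        s_l = np.sqrt(alpha / (1.0 - eps))
        right = norm.cdf(-eps * s_r) - np.exp(2.0 * alpha + norm.logcdf(-(2.0 + eps) * s_r))
        left = norm.cdf(-eps * s_l) + np.exp(2.0 * alpha + norm.logcdf(-(2.0 - eps) * s_l))
        return right, left

    print(ig_tails(0.2, k=100))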

Lemma 10 has shown that the true asymptotic fourth central moment of ˆd_MLE,c is 12d⁴/k² + O(1/k³). That is, the inverse Gaussian approximation matches the leading term, 12d⁴/k², and nearly matches the next-order term of the true asymptotic fourth moment of ˆd_MLE,c.

Assuming ˆd_MLE,c ~ IG(α, β), the tail probabilities of ˆd_MLE,c can be expressed as

\[ \Pr\big(\hat{d}_{MLE,c} \ge (1+ǫ)d\big) \doteq \Phi\Big(-ǫ\sqrt{\frac{\alpha}{1+ǫ}}\Big) - e^{2\alpha}\,\Phi\Big(-(2+ǫ)\sqrt{\frac{\alpha}{1+ǫ}}\Big), \qquad ǫ > 0, \]

\[ \Pr\big(\hat{d}_{MLE,c} \le (1-ǫ)d\big) \doteq \Phi\Big(-ǫ\sqrt{\frac{\alpha}{1-ǫ}}\Big) + e^{2\alpha}\,\Phi\Big(-(2-ǫ)\sqrt{\frac{\alpha}{1-ǫ}}\Big), \qquad 0 < ǫ < 1. \]

Assuming ˆd_MLE,c ~ IG(α, β), it is easy to show the following Chernoff bounds:

\[ \Pr\big(\hat{d}_{MLE,c} \ge (1+ǫ)d\big) \lessapprox \exp\Big(-\frac{\alpha ǫ^2}{2(1+ǫ)}\Big), \qquad ǫ > 0, \]

\[ \Pr\big(\hat{d}_{MLE,c} \le (1-ǫ)d\big) \lessapprox \exp\Big(-\frac{\alpha ǫ^2}{2(1-ǫ)}\Big), \qquad 0 < ǫ < 1. \]

To see the first bound, assume z ~ IG(α, β). Then, using the Chernoff inequality,

\[ \Pr\big(z \ge (1+ǫ)d\big) \le E\big(e^{zt}\big)\exp\big(-(1+ǫ)dt\big) = \exp\Big(\alpha\big(1 - (1-2\beta t)^{1/2}\big) - (1+ǫ)dt\Big), \]

whose minimum, exp(−αǫ²/(2(1+ǫ))), is attained at t = (1 − (1+ǫ)^{-2})/(2β). We can similarly show the left bound. Combining the two bounds yields a symmetric approximate bound,

\[ \Pr\big(|\hat{d}_{MLE,c} - d| \ge ǫ d\big) \lessapprox 2\exp\Big(-\frac{ǫ^2/2}{1+ǫ}\cdot\frac{k^2}{2k+3}\Big), \qquad 0 < ǫ < 1. \]

Figure 9 compares the inverse Gaussian approximation with the same simulations as presented in Figure 8, indicating that the inverse Gaussian approximation is highly accurate; in that range of tail probabilities we can essentially treat the inverse Gaussian as the exact distribution of ˆd_MLE,c. The Chernoff upper bounds for the inverse Gaussian are always reliable in our simulation range of tail probabilities (down to about 10^{-10}).

Figure 9: We compare the inverse Gaussian approximation with the same simulations as presented in Figure 8. Panel (a) compares the empirical tail probabilities with the inverse Gaussian tail probabilities, indicating that the approximation is highly accurate. Panel (b) compares the empirical tail probabilities with the inverse Gaussian Chernoff upper bounds. The upper bounds are all above the corresponding empirical curves, indicating that the proposed bounds are reliable, at least in our simulation range.

7. Conclusion

In machine learning, it is well known that the l1 distance is far more robust than the l2 distance against outliers. Dimension reduction in the l1 norm, however, has been proved impossible if we use linear random projections and linear estimators. In this study, we analyze three types of nonlinear estimators for Cauchy random projections: the sample median estimators, the geometric mean estimators, and the maximum likelihood estimators. Our theoretical analysis has shown that these nonlinear estimators can accurately recover the original l1 distance, even though none of them can be a metric.

The sample median estimators and the geometric mean estimators are asymptotically equivalent, but the latter are more accurate at small sample size. We have derived explicit tail bounds for the geometric mean estimators in exponential forms. Using these tail bounds, we have established an analog of the Johnson-Lindenstrauss (JL) Lemma for dimension reduction in l1, which is weaker than the classical JL Lemma for dimension reduction in l2.

We conduct theoretical analysis of the maximum likelihood estimators (MLE), which are asymptotically optimum. Both the sample median and geometric mean estimators are about 80% efficient as the MLE. We propose approximating the distribution of the MLE by an inverse Gaussian, which has the same support and matches the leading terms of the first four moments of the MLE. Approximate tail bounds have been provided based on the inverse Gaussian approximation. Verified by simulations, these approximate tail bounds hold at least in the 10^{-10} tail probability range.

Although these nonlinear estimators are not metrics, they are still useful for certain applications in, e.g., data stream computations, information retrieval, learning, and data mining, whenever the goal is to compute the l1 distances efficiently using a small storage space, in a single pass of the data.

Li (2008) generalized the geometric mean estimators to the stable distribution family, for dimension reduction in the l_p norm (0 < p ≤ 2). Li (2008) also proposed the harmonic mean estimator for p → 0+, which is far more accurate than the geometric mean estimator. (Stable random projections with very small p, p → 0+, have been applied to approximating the Hamming distances, Cormode et al., 2002, 2003, and the max-dominance norm, Cormode and Muthukrishnan, 2003.)

In addition, Li (2007) suggested very sparse stable random projections, which replace the stable distribution with a mixture of a symmetric Pareto distribution and a point mass at the origin, in order to considerably simplify the sampling procedure used to generate the projection matrix and to achieve a significant cost reduction in the matrix multiplication operations.

The general method of linear stable random projections is an appealing paradigm for applications involving massive, high-dimensional, non-sparse, and heavy-tailed data. If there is prior information that the data are highly sparse (e.g., text data), other alternative dimension reduction methods may be more suitable; for example, the technique called Conditional Random Sampling (CRS) (Li and Church, 2005, 2007; Li et al., 2007a) was particularly designed for approximating distances and other summary statistics in highly sparse data.

Acknowledgment

We thank Piotr Indyk, Assaf Naor, Art Owen, and Anand Vidyashankar. Ping Li was partially supported by an NSF grant (DMS) and the Cornell University junior faculty startup fund. Trevor Hastie was partially supported by an NSF grant (DMS) and an NIH grant (R01 CA).

Appendix A. Proof of Lemma 6

Assume that x_1, x_2, ..., x_k are i.i.d. C(0, d). The estimator

\[ \hat{d}_{gm,c} = \cos^k\Big(\frac{\pi}{2k}\Big)\prod_{j=1}^{k}|x_j|^{1/k} \]

is unbiased because, from Lemma 5,

\[ E\big(\hat{d}_{gm,c}\big) = \cos^k\Big(\frac{\pi}{2k}\Big)\prod_{j=1}^{k}E\big(|x_j|^{1/k}\big) = \cos^k\Big(\frac{\pi}{2k}\Big)\frac{d}{\cos^k\big(\frac{\pi}{2k}\big)} = d. \]

The variance is

\[ \mathrm{Var}\big(\hat{d}_{gm,c}\big) = \cos^{2k}\Big(\frac{\pi}{2k}\Big)\prod_{j=1}^{k}E\big(|x_j|^{2/k}\big) - d^2 = d^2\left(\frac{\cos^{2k}\big(\frac{\pi}{2k}\big)}{\cos^{k}\big(\frac{\pi}{k}\big)} - 1\right) = \frac{\pi^2}{4}\frac{d^2}{k} + \frac{\pi^4}{32}\frac{d^2}{k^2} + O\Big(\frac{1}{k^3}\Big), \]

where the expansion follows from

\[ \frac{\cos^{2k}\big(\frac{\pi}{2k}\big)}{\cos^{k}\big(\frac{\pi}{k}\big)} = \exp\Big(2k\log\cos\Big(\frac{\pi}{2k}\Big) - k\log\cos\Big(\frac{\pi}{k}\Big)\Big) = \exp\Big(\frac{\pi^2}{4k} + O\Big(\frac{1}{k^3}\Big)\Big) = 1 + \frac{\pi^2}{4k} + \frac{\pi^4}{32k^2} + O\Big(\frac{1}{k^3}\Big). \]

Similar algebra yields the third and fourth central moments,

\[ E\big(\hat{d}_{gm,c} - E\hat{d}_{gm,c}\big)^3 = \frac{3\pi^4}{16}\frac{d^3}{k^2} + O\Big(\frac{1}{k^3}\Big), \qquad E\big(\hat{d}_{gm,c} - E\hat{d}_{gm,c}\big)^4 = \frac{3\pi^4}{16}\frac{d^4}{k^2} + O\Big(\frac{1}{k^3}\Big). \]

Appendix B. Proof of Lemma 7

We use the Markov moment bound, because ˆd_gm does not have a moment generating function (E(ˆd_gm^t) = ∞ if t ≥ k). In fact, even when the Chernoff bound is applicable, for any positive random variable the Markov moment bound is always sharper than the Chernoff bound (Philips and Nelson, 1995; Lugosi, 2004).

By the Markov moment bound, for any ǫ > 0 and 0 < t < k,

\[ \Pr\big(\hat{d}_{gm} \ge (1+ǫ)d\big) \le \frac{E\big(\hat{d}_{gm}^{\,t}\big)}{\big((1+ǫ)d\big)^{t}} = \frac{1}{\cos^{k}\big(\frac{\pi t}{2k}\big)(1+ǫ)^{t}}, \]

whose minimum is attained at t = (2k/π) tan^{-1}((2/π) log(1+ǫ)). Thus

\[ \Pr\big(\hat{d}_{gm} \ge (1+ǫ)d\big) \le \exp\Big(-k\Big(\frac{2}{\pi}\log(1+ǫ)\tan^{-1}\Big(\frac{2}{\pi}\log(1+ǫ)\Big) - \frac{1}{2}\log\Big(1+\frac{4}{\pi^2}\log^2(1+ǫ)\Big)\Big)\Big) = \exp\Big(-k\,\frac{ǫ^2}{G_{R,gm}}\Big), \]

with G_R,gm as defined in Lemma 7.

Again, by the Markov moment bound, for any 0 < ǫ < 1 and 0 < t < k,

\[ \Pr\big(\hat{d}_{gm} \le (1-ǫ)d\big) = \Pr\big(\hat{d}_{gm}^{-t} \ge \big((1-ǫ)d\big)^{-t}\big) \le E\big(\hat{d}_{gm}^{-t}\big)\big((1-ǫ)d\big)^{t} = \frac{(1-ǫ)^{t}}{\cos^{k}\big(\frac{\pi t}{2k}\big)}, \]

whose minimum is attained at t = (2k/π) tan^{-1}((2/π) log(1/(1−ǫ))). Thus

\[ \Pr\big(\hat{d}_{gm} \le (1-ǫ)d\big) \le \exp\Big(-k\Big(\frac{2}{\pi}\log\frac{1}{1-ǫ}\tan^{-1}\Big(\frac{2}{\pi}\log\frac{1}{1-ǫ}\Big) - \frac{1}{2}\log\Big(1+\frac{4}{\pi^2}\log^2(1-ǫ)\Big)\Big)\Big) = \exp\Big(-k\,\frac{ǫ^2}{G_{L,gm}}\Big), \]

with G_L,gm as defined in Lemma 7.

Finally, we derive the convenient approximations for G_R,gm and G_L,gm for small ǫ. Using the Taylor expansions log(1+x) = x − x²/2 + x³/3 − ... and tan^{-1}(x) = x − x³/3 + ... (for |x| < 1), expanding the denominator of G_R,gm in powers of log(1+ǫ), and then expanding log(1+ǫ) in powers of ǫ, gives

\[ G_{R,gm} \approx \frac{\pi^2}{2}\big(1 + ǫ + \cdots\big). \]

Similarly, for small ǫ, expanding the denominator of G_L,gm in powers of log(1/(1−ǫ)) gives

\[ G_{L,gm} \approx \frac{\pi^2}{2}\big(1 - ǫ + \cdots\big). \]

Appendix C. Proof of Lemma 9

For any ǫ > 0 and 0 < t < k, the Markov inequality gives

\[ \Pr\big(\hat{d}_{gm,c} \ge (1+ǫ)d\big) \le \frac{E\big(\hat{d}_{gm,c}^{\,t}\big)}{\big((1+ǫ)d\big)^{t}} = \frac{\cos^{kt}\big(\frac{\pi}{2k}\big)}{\cos^{k}\big(\frac{\pi t}{2k}\big)(1+ǫ)^{t}}, \]

which is minimized by choosing the optimum t = t_1, where

\[ t_1 = \frac{2k}{\pi}\tan^{-1}\Big(\frac{2}{\pi}\Big(\log(1+ǫ) - k\log\cos\Big(\frac{\pi}{2k}\Big)\Big)\Big). \]

We need to make sure that 0 ≤ t_1 < k: t_1 ≥ 0 because log cos(π/(2k)) ≤ 0, and t_1 < k because tan^{-1}(·) < π/2.

Now we show the other tail bound. For 0 < t < k,

\[ \Pr\big(\hat{d}_{gm,c} \le (1-ǫ)d\big) = \Pr\Big(\prod_{j=1}^{k}|x_j|^{-t/k} \ge \Big(\frac{(1-ǫ)d}{\cos^k(\pi/(2k))}\Big)^{-t}\Big) \le \frac{(1-ǫ)^{t}}{\cos^{kt}\big(\frac{\pi}{2k}\big)\cos^{k}\big(\frac{\pi t}{2k}\big)}, \]

which is minimized at t = t_2,

\[ t_2 = \frac{2k}{\pi}\tan^{-1}\Big(\frac{2}{\pi}\Big(\log\frac{1}{1-ǫ} + k\log\cos\Big(\frac{\pi}{2k}\Big)\Big)\Big), \]

provided k ≥ π²/(8ǫ); otherwise t_2 may be less than 0. To see this, in order for t_2 ≥ 0 we must have log(1/(1−ǫ)) ≥ −k log cos(π/(2k)). Because −k log cos(π/(2k)) ≤ π²/(8k) up to higher-order terms, and log(1/(1−ǫ)) ≥ ǫ, it suffices if ǫ ≥ π²/(8k), i.e., k ≥ π²/(8ǫ).

Now we prove the asymptotic (as k → ∞) expressions for the ratios of tail bounds, another way to compare ˆd_gm and ˆd_gm,c. First, we consider the right tail bounds. For large k, the optimal t_1 can be approximated as

\[ t_1 \approx \frac{2k}{\pi}\tan^{-1}\Big(\frac{2}{\pi}\Big(\log(1+ǫ) + \frac{\pi^2}{8k}\Big)\Big) \approx \frac{2k}{\pi}A_1 + C_1, \]

by Taylor expansion, where

\[ A_1 = \tan^{-1}\Big(\frac{2}{\pi}\log(1+ǫ)\Big), \qquad C_1 = \frac{1/2}{1+\frac{4}{\pi^2}\log^2(1+ǫ)}. \]

In this asymptotic decomposition, the term (2k/π)A_1 is the optimal t in the proof of the right tail bound for ˆd_gm. Thus, to study the asymptotic ratio of the right tail bounds, we only need to keep track of the additional terms in

\[ U_{R,gm,c} = \frac{\cos^{kt_1}\big(\frac{\pi}{2k}\big)}{\cos^{k}\big(\frac{\pi t_1}{2k}\big)(1+ǫ)^{t_1}}. \]

Because

\[ \cos^{kt_1}\Big(\frac{\pi}{2k}\Big) \approx \Big(1 - \frac{\pi^2}{8k^2}\Big)^{\frac{2k^2}{\pi}A_1 + kC_1} \to \exp\Big(-\frac{\pi}{4}A_1\Big), \]

and

\[ \frac{\cos^{k}(A_1)}{\cos^{k}\big(A_1 + \frac{\pi C_1}{2k}\big)} \approx \Big(1 + \frac{\pi C_1}{2k}\tan A_1\Big)^{k} \to \exp\Big(\frac{\pi}{2}C_1\tan A_1\Big), \]

we know that the ratio of the right tail bounds converges,

\[ \rho_{R,k} = \frac{U_{R,gm,c}}{U_{R,gm}} \to \rho_{R,\infty} = (1+ǫ)^{-C_1}\exp\Big(-\frac{\pi}{4}A_1 + \frac{\pi}{2}C_1\tan A_1\Big). \]

Next, we consider the left tail bounds. First, we obtain an asymptotic decomposition of t_2,

\[ t_2 \approx \frac{2k}{\pi}\tan^{-1}\Big(\frac{2}{\pi}\Big(\log\frac{1}{1-ǫ} - \frac{\pi^2}{8k}\Big)\Big) \approx \frac{2k}{\pi}A_2 - C_2, \]

by Taylor expansion, where

\[ A_2 = \tan^{-1}\Big(\frac{2}{\pi}\log\frac{1}{1-ǫ}\Big), \qquad C_2 = \frac{1/2}{1+\frac{4}{\pi^2}\log^2(1-ǫ)}. \]

Again, in this decomposition the term (2k/π)A_2 is the optimal t in the proof of the left tail bound for ˆd_gm. Thus, to study the asymptotic ratio of the left tail bounds, we only need to keep track of the additional terms in U_L,gm,c. Because

\[ \cos^{-kt_2}\Big(\frac{\pi}{2k}\Big) \to \exp\Big(\frac{\pi}{4}A_2\Big), \qquad \frac{\cos^{k}(A_2)}{\cos^{k}\big(A_2 - \frac{\pi C_2}{2k}\big)} \approx \Big(1 - \frac{\pi C_2}{2k}\tan A_2\Big)^{k} \to \exp\Big(-\frac{\pi}{2}C_2\tan A_2\Big), \]

we know that the ratio of the left tail bounds converges,

\[ \rho_{L,k} = \frac{U_{L,gm,c}}{U_{L,gm}} \to \rho_{L,\infty} = (1-ǫ)^{-C_2}\exp\Big(\frac{\pi}{4}A_2 - \frac{\pi}{2}C_2\tan A_2\Big). \]

Appendix D. Proof of Lemma 10

Assume x ~ C(0, d). The log-likelihood l(x; d) and its first three derivatives are

\[ l(x; d) = \log d - \log\pi - \log\big(x^2 + d^2\big), \]

\[ l'(d) = \frac{1}{d} - \frac{2d}{x^2 + d^2}, \qquad l''(d) = -\frac{1}{d^2} - \frac{2\big(x^2 - d^2\big)}{\big(x^2 + d^2\big)^2}, \qquad l'''(d) = \frac{2}{d^3} + \frac{4d}{\big(x^2 + d^2\big)^2} + \frac{8d\big(x^2 - d^2\big)}{\big(x^2 + d^2\big)^3}. \]

The MLE ˆd_MLE is asymptotically normal with mean d and variance 1/(k I(d)), where I(d), the expected Fisher information, is

\[ I(d) = -E\big(l''(d)\big) = \frac{1}{d^2} + E\left(\frac{2\big(x^2 - d^2\big)}{\big(x^2 + d^2\big)^2}\right) = \frac{1}{2d^2}, \]

because

\[ E\left(\frac{2\big(x^2 - d^2\big)}{\big(x^2 + d^2\big)^2}\right) = \frac{2}{\pi}\int_{-\infty}^{\infty}\frac{\big(x^2 - d^2\big)\,d}{\big(x^2 + d^2\big)^3}\,dx = -\frac{1}{2d^2}. \]

Therefore,

\[ \mathrm{Var}\big(\hat{d}_{MLE}\big) = \frac{1}{k\,I(d)} + O\Big(\frac{1}{k^2}\Big) = \frac{2d^2}{k} + O\Big(\frac{1}{k^2}\Big). \]

General formulas for the bias and higher moments of the MLE are available in Bartlett (1953) and Shenton and Bowman (1963).

Evaluating the expressions of Shenton and Bowman (1963) involves tedious algebra; the required ingredients are expectations of products of the derivatives l', l'', and l'''. The following formula is useful:

\[ E\left(\frac{1}{\big(x^2 + d^2\big)^m}\right) = \frac{(2m-1)!!}{(2m)!!}\,\frac{1}{d^{2m}}, \qquad m = 1, 2, 3, \ldots \]

Without giving the details, we report

\[ E\big((l')^3\big) = 0, \qquad E\big(l''\,l'\big) = -\frac{1}{2d^3}, \qquad E\big((l')^4\big) = \frac{3}{8d^4}, \qquad E\big(l''\,(l')^2\big) = -\frac{1}{8d^4}, \qquad E\big(l'''\,l'\big) = \frac{3}{4d^4}. \]

Substituting these quantities into the general formulas, we obtain

\[ E\big(\hat{d}_{MLE}\big) = d + \frac{d}{k} + O\Big(\frac{1}{k^2}\Big), \qquad \mathrm{Var}\big(\hat{d}_{MLE}\big) = \frac{2d^2}{k} + \frac{7d^2}{k^2} + O\Big(\frac{1}{k^3}\Big), \]

\[ E\big(\hat{d}_{MLE} - E\hat{d}_{MLE}\big)^3 = \frac{12d^3}{k^2} + O\Big(\frac{1}{k^3}\Big), \qquad E\big(\hat{d}_{MLE} - E\hat{d}_{MLE}\big)^4 = \frac{12d^4}{k^2} + O\Big(\frac{1}{k^3}\Big). \]

The moments of ˆd_MLE,c = (1 − 1/k) ˆd_MLE stated in Lemma 10 then follow directly.


More information

Inference for High Dimensional Robust Regression

Inference for High Dimensional Robust Regression Department of Statistics UC Berkeley Stanford-Berkeley Joint Colloquium, 2015 Table of Contents 1 Background 2 Main Results 3 OLS: A Motivating Example Table of Contents 1 Background 2 Main Results 3 OLS:

More information

ECE 302 Division 2 Exam 2 Solutions, 11/4/2009.

ECE 302 Division 2 Exam 2 Solutions, 11/4/2009. NAME: ECE 32 Division 2 Exam 2 Solutions, /4/29. You will be required to show your student ID during the exam. This is a closed-book exam. A formula sheet is provided. No calculators are allowed. Total

More information

Asymptotic Multivariate Kriging Using Estimated Parameters with Bayesian Prediction Methods for Non-linear Predictands

Asymptotic Multivariate Kriging Using Estimated Parameters with Bayesian Prediction Methods for Non-linear Predictands Asymptotic Multivariate Kriging Using Estimated Parameters with Bayesian Prediction Methods for Non-linear Predictands Elizabeth C. Mannshardt-Shamseldin Advisor: Richard L. Smith Duke University Department

More information

COMS 4771 Introduction to Machine Learning. Nakul Verma

COMS 4771 Introduction to Machine Learning. Nakul Verma COMS 4771 Introduction to Machine Learning Nakul Verma Announcements HW1 due next lecture Project details are available decide on the group and topic by Thursday Last time Generative vs. Discriminative

More information

Multivariate Statistics Random Projections and Johnson-Lindenstrauss Lemma

Multivariate Statistics Random Projections and Johnson-Lindenstrauss Lemma Multivariate Statistics Random Projections and Johnson-Lindenstrauss Lemma Suppose again we have n sample points x,..., x n R p. The data-point x i R p can be thought of as the i-th row X i of an n p-dimensional

More information

Sharp Generalization Error Bounds for Randomly-projected Classifiers

Sharp Generalization Error Bounds for Randomly-projected Classifiers Sharp Generalization Error Bounds for Randomly-projected Classifiers R.J. Durrant and A. Kabán School of Computer Science The University of Birmingham Birmingham B15 2TT, UK http://www.cs.bham.ac.uk/ axk

More information

STAT 7032 Probability Spring Wlodek Bryc

STAT 7032 Probability Spring Wlodek Bryc STAT 7032 Probability Spring 2018 Wlodek Bryc Created: Friday, Jan 2, 2014 Revised for Spring 2018 Printed: January 9, 2018 File: Grad-Prob-2018.TEX Department of Mathematical Sciences, University of Cincinnati,

More information

Behavioral Data Mining. Lecture 7 Linear and Logistic Regression

Behavioral Data Mining. Lecture 7 Linear and Logistic Regression Behavioral Data Mining Lecture 7 Linear and Logistic Regression Outline Linear Regression Regularization Logistic Regression Stochastic Gradient Fast Stochastic Methods Performance tips Linear Regression

More information

Analysis of Type-II Progressively Hybrid Censored Data

Analysis of Type-II Progressively Hybrid Censored Data Analysis of Type-II Progressively Hybrid Censored Data Debasis Kundu & Avijit Joarder Abstract The mixture of Type-I and Type-II censoring schemes, called the hybrid censoring scheme is quite common in

More information

ABC-Boost: Adaptive Base Class Boost for Multi-class Classification

ABC-Boost: Adaptive Base Class Boost for Multi-class Classification ABC-Boost: Adaptive Base Class Boost for Multi-class Classification Ping Li Department of Statistical Science, Cornell University, Ithaca, NY 14853 USA pingli@cornell.edu Abstract We propose -boost (adaptive

More information

Lecture Notes 5 Convergence and Limit Theorems. Convergence with Probability 1. Convergence in Mean Square. Convergence in Probability, WLLN

Lecture Notes 5 Convergence and Limit Theorems. Convergence with Probability 1. Convergence in Mean Square. Convergence in Probability, WLLN Lecture Notes 5 Convergence and Limit Theorems Motivation Convergence with Probability Convergence in Mean Square Convergence in Probability, WLLN Convergence in Distribution, CLT EE 278: Convergence and

More information

Lecture 13 October 6, Covering Numbers and Maurey s Empirical Method

Lecture 13 October 6, Covering Numbers and Maurey s Empirical Method CS 395T: Sublinear Algorithms Fall 2016 Prof. Eric Price Lecture 13 October 6, 2016 Scribe: Kiyeon Jeon and Loc Hoang 1 Overview In the last lecture we covered the lower bound for p th moment (p > 2) and

More information

Review. DS GA 1002 Statistical and Mathematical Models. Carlos Fernandez-Granda

Review. DS GA 1002 Statistical and Mathematical Models.   Carlos Fernandez-Granda Review DS GA 1002 Statistical and Mathematical Models http://www.cims.nyu.edu/~cfgranda/pages/dsga1002_fall16 Carlos Fernandez-Granda Probability and statistics Probability: Framework for dealing with

More information

Lecture 3. Linear Regression II Bastian Leibe RWTH Aachen

Lecture 3. Linear Regression II Bastian Leibe RWTH Aachen Advanced Machine Learning Lecture 3 Linear Regression II 02.11.2015 Bastian Leibe RWTH Aachen http://www.vision.rwth-aachen.de/ leibe@vision.rwth-aachen.de This Lecture: Advanced Machine Learning Regression

More information

High Dimensional Geometry, Curse of Dimensionality, Dimension Reduction

High Dimensional Geometry, Curse of Dimensionality, Dimension Reduction Chapter 11 High Dimensional Geometry, Curse of Dimensionality, Dimension Reduction High-dimensional vectors are ubiquitous in applications (gene expression data, set of movies watched by Netflix customer,

More information

Multiple Random Variables

Multiple Random Variables Multiple Random Variables Joint Probability Density Let X and Y be two random variables. Their joint distribution function is F ( XY x, y) P X x Y y. F XY ( ) 1, < x

More information

Fast Dimension Reduction

Fast Dimension Reduction Fast Dimension Reduction MMDS 2008 Nir Ailon Google Research NY Fast Dimension Reduction Using Rademacher Series on Dual BCH Codes (with Edo Liberty) The Fast Johnson Lindenstrauss Transform (with Bernard

More information

Mathematical Institute, University of Utrecht. The problem of estimating the mean of an observed Gaussian innite-dimensional vector

Mathematical Institute, University of Utrecht. The problem of estimating the mean of an observed Gaussian innite-dimensional vector On Minimax Filtering over Ellipsoids Eduard N. Belitser and Boris Y. Levit Mathematical Institute, University of Utrecht Budapestlaan 6, 3584 CD Utrecht, The Netherlands The problem of estimating the mean

More information

Simulating Uniform- and Triangular- Based Double Power Method Distributions

Simulating Uniform- and Triangular- Based Double Power Method Distributions Journal of Statistical and Econometric Methods, vol.6, no.1, 2017, 1-44 ISSN: 1792-6602 (print), 1792-6939 (online) Scienpress Ltd, 2017 Simulating Uniform- and Triangular- Based Double Power Method Distributions

More information

Introduction to Probability and Statistics (Continued)

Introduction to Probability and Statistics (Continued) Introduction to Probability and Statistics (Continued) Prof. icholas Zabaras Center for Informatics and Computational Science https://cics.nd.edu/ University of otre Dame otre Dame, Indiana, USA Email:

More information

DS-GA 1002 Lecture notes 2 Fall Random variables

DS-GA 1002 Lecture notes 2 Fall Random variables DS-GA 12 Lecture notes 2 Fall 216 1 Introduction Random variables Random variables are a fundamental tool in probabilistic modeling. They allow us to model numerical quantities that are uncertain: the

More information

MINIMUM EXPECTED RISK PROBABILITY ESTIMATES FOR NONPARAMETRIC NEIGHBORHOOD CLASSIFIERS. Maya Gupta, Luca Cazzanti, and Santosh Srivastava

MINIMUM EXPECTED RISK PROBABILITY ESTIMATES FOR NONPARAMETRIC NEIGHBORHOOD CLASSIFIERS. Maya Gupta, Luca Cazzanti, and Santosh Srivastava MINIMUM EXPECTED RISK PROBABILITY ESTIMATES FOR NONPARAMETRIC NEIGHBORHOOD CLASSIFIERS Maya Gupta, Luca Cazzanti, and Santosh Srivastava University of Washington Dept. of Electrical Engineering Seattle,

More information

STAT 200C: High-dimensional Statistics

STAT 200C: High-dimensional Statistics STAT 200C: High-dimensional Statistics Arash A. Amini April 27, 2018 1 / 80 Classical case: n d. Asymptotic assumption: d is fixed and n. Basic tools: LLN and CLT. High-dimensional setting: n d, e.g. n/d

More information

Lecture 14 October 16, 2014

Lecture 14 October 16, 2014 CS 224: Advanced Algorithms Fall 2014 Prof. Jelani Nelson Lecture 14 October 16, 2014 Scribe: Jao-ke Chin-Lee 1 Overview In the last lecture we covered learning topic models with guest lecturer Rong Ge.

More information

Fall 2017 STAT 532 Homework Peter Hoff. 1. Let P be a probability measure on a collection of sets A.

Fall 2017 STAT 532 Homework Peter Hoff. 1. Let P be a probability measure on a collection of sets A. 1. Let P be a probability measure on a collection of sets A. (a) For each n N, let H n be a set in A such that H n H n+1. Show that P (H n ) monotonically converges to P ( k=1 H k) as n. (b) For each n

More information

On Reparametrization and the Gibbs Sampler

On Reparametrization and the Gibbs Sampler On Reparametrization and the Gibbs Sampler Jorge Carlos Román Department of Mathematics Vanderbilt University James P. Hobert Department of Statistics University of Florida March 2014 Brett Presnell Department

More information

1 Uniform Distribution. 2 Gamma Distribution. 3 Inverse Gamma Distribution. 4 Multivariate Normal Distribution. 5 Multivariate Student-t Distribution

1 Uniform Distribution. 2 Gamma Distribution. 3 Inverse Gamma Distribution. 4 Multivariate Normal Distribution. 5 Multivariate Student-t Distribution A Few Special Distributions Their Properties Econ 675 Iowa State University November 1 006 Justin L Tobias (ISU Distributional Catalog November 1 006 1 / 0 Special Distributions Their Associated Properties

More information

Module 3. Function of a Random Variable and its distribution

Module 3. Function of a Random Variable and its distribution Module 3 Function of a Random Variable and its distribution 1. Function of a Random Variable Let Ω, F, be a probability space and let be random variable defined on Ω, F,. Further let h: R R be a given

More information

A fast randomized algorithm for approximating an SVD of a matrix

A fast randomized algorithm for approximating an SVD of a matrix A fast randomized algorithm for approximating an SVD of a matrix Joint work with Franco Woolfe, Edo Liberty, and Vladimir Rokhlin Mark Tygert Program in Applied Mathematics Yale University Place July 17,

More information

1 Appendix A: Matrix Algebra

1 Appendix A: Matrix Algebra Appendix A: Matrix Algebra. Definitions Matrix A =[ ]=[A] Symmetric matrix: = for all and Diagonal matrix: 6=0if = but =0if 6= Scalar matrix: the diagonal matrix of = Identity matrix: the scalar matrix

More information

ECE531 Lecture 10b: Maximum Likelihood Estimation

ECE531 Lecture 10b: Maximum Likelihood Estimation ECE531 Lecture 10b: Maximum Likelihood Estimation D. Richard Brown III Worcester Polytechnic Institute 05-Apr-2011 Worcester Polytechnic Institute D. Richard Brown III 05-Apr-2011 1 / 23 Introduction So

More information

Declaring Independence via the Sketching of Sketches. Until August Hire Me!

Declaring Independence via the Sketching of Sketches. Until August Hire Me! Declaring Independence via the Sketching of Sketches Piotr Indyk Andrew McGregor Massachusetts Institute of Technology University of California, San Diego Until August 08 -- Hire Me! The Problem The Problem

More information

Practice Problems Section Problems

Practice Problems Section Problems Practice Problems Section 4-4-3 4-4 4-5 4-6 4-7 4-8 4-10 Supplemental Problems 4-1 to 4-9 4-13, 14, 15, 17, 19, 0 4-3, 34, 36, 38 4-47, 49, 5, 54, 55 4-59, 60, 63 4-66, 68, 69, 70, 74 4-79, 81, 84 4-85,

More information

Course: ESO-209 Home Work: 1 Instructor: Debasis Kundu

Course: ESO-209 Home Work: 1 Instructor: Debasis Kundu Home Work: 1 1. Describe the sample space when a coin is tossed (a) once, (b) three times, (c) n times, (d) an infinite number of times. 2. A coin is tossed until for the first time the same result appear

More information

Lecture 1 Measure concentration

Lecture 1 Measure concentration CSE 29: Learning Theory Fall 2006 Lecture Measure concentration Lecturer: Sanjoy Dasgupta Scribe: Nakul Verma, Aaron Arvey, and Paul Ruvolo. Concentration of measure: examples We start with some examples

More information

Manifold Coarse Graining for Online Semi-supervised Learning

Manifold Coarse Graining for Online Semi-supervised Learning for Online Semi-supervised Learning Mehrdad Farajtabar, Amirreza Shaban, Hamid R. Rabiee, Mohammad H. Rohban Digital Media Lab, Department of Computer Engineering, Sharif University of Technology, Tehran,

More information

Locating the Source of Diffusion in Large-Scale Networks

Locating the Source of Diffusion in Large-Scale Networks Locating the Source of Diffusion in Large-Scale Networks Supplemental Material Pedro C. Pinto, Patrick Thiran, Martin Vetterli Contents S1. Detailed Proof of Proposition 1..................................

More information

Compressed Sensing and Sparse Recovery

Compressed Sensing and Sparse Recovery ELE 538B: Sparsity, Structure and Inference Compressed Sensing and Sparse Recovery Yuxin Chen Princeton University, Spring 217 Outline Restricted isometry property (RIP) A RIPless theory Compressed sensing

More information

Regularization Paths

Regularization Paths December 2005 Trevor Hastie, Stanford Statistics 1 Regularization Paths Trevor Hastie Stanford University drawing on collaborations with Brad Efron, Saharon Rosset, Ji Zhu, Hui Zhou, Rob Tibshirani and

More information

Stochastic Subgradient Methods

Stochastic Subgradient Methods Stochastic Subgradient Methods Stephen Boyd and Almir Mutapcic Notes for EE364b, Stanford University, Winter 26-7 April 13, 28 1 Noisy unbiased subgradient Suppose f : R n R is a convex function. We say

More information

Learning Theory. Ingo Steinwart University of Stuttgart. September 4, 2013

Learning Theory. Ingo Steinwart University of Stuttgart. September 4, 2013 Learning Theory Ingo Steinwart University of Stuttgart September 4, 2013 Ingo Steinwart University of Stuttgart () Learning Theory September 4, 2013 1 / 62 Basics Informal Introduction Informal Description

More information

A Randomized Algorithm for the Approximation of Matrices

A Randomized Algorithm for the Approximation of Matrices A Randomized Algorithm for the Approximation of Matrices Per-Gunnar Martinsson, Vladimir Rokhlin, and Mark Tygert Technical Report YALEU/DCS/TR-36 June 29, 2006 Abstract Given an m n matrix A and a positive

More information

Unsupervised Learning

Unsupervised Learning 2018 EE448, Big Data Mining, Lecture 7 Unsupervised Learning Weinan Zhang Shanghai Jiao Tong University http://wnzhang.net http://wnzhang.net/teaching/ee448/index.html ML Problem Setting First build and

More information

Monte-Carlo MMD-MA, Université Paris-Dauphine. Xiaolu Tan

Monte-Carlo MMD-MA, Université Paris-Dauphine. Xiaolu Tan Monte-Carlo MMD-MA, Université Paris-Dauphine Xiaolu Tan tan@ceremade.dauphine.fr Septembre 2015 Contents 1 Introduction 1 1.1 The principle.................................. 1 1.2 The error analysis

More information

Terminology Suppose we have N observations {x(n)} N 1. Estimators as Random Variables. {x(n)} N 1

Terminology Suppose we have N observations {x(n)} N 1. Estimators as Random Variables. {x(n)} N 1 Estimation Theory Overview Properties Bias, Variance, and Mean Square Error Cramér-Rao lower bound Maximum likelihood Consistency Confidence intervals Properties of the mean estimator Properties of the

More information

Hybrid Censoring; An Introduction 2

Hybrid Censoring; An Introduction 2 Hybrid Censoring; An Introduction 2 Debasis Kundu Department of Mathematics & Statistics Indian Institute of Technology Kanpur 23-rd November, 2010 2 This is a joint work with N. Balakrishnan Debasis Kundu

More information

SYSM 6303: Quantitative Introduction to Risk and Uncertainty in Business Lecture 4: Fitting Data to Distributions

SYSM 6303: Quantitative Introduction to Risk and Uncertainty in Business Lecture 4: Fitting Data to Distributions SYSM 6303: Quantitative Introduction to Risk and Uncertainty in Business Lecture 4: Fitting Data to Distributions M. Vidyasagar Cecil & Ida Green Chair The University of Texas at Dallas Email: M.Vidyasagar@utdallas.edu

More information

Hypothesis Testing. 1 Definitions of test statistics. CB: chapter 8; section 10.3

Hypothesis Testing. 1 Definitions of test statistics. CB: chapter 8; section 10.3 Hypothesis Testing CB: chapter 8; section 0.3 Hypothesis: statement about an unknown population parameter Examples: The average age of males in Sweden is 7. (statement about population mean) The lowest

More information

Essentials on the Analysis of Randomized Algorithms

Essentials on the Analysis of Randomized Algorithms Essentials on the Analysis of Randomized Algorithms Dimitris Diochnos Feb 0, 2009 Abstract These notes were written with Monte Carlo algorithms primarily in mind. Topics covered are basic (discrete) random

More information

Lecture 1: Supervised Learning

Lecture 1: Supervised Learning Lecture 1: Supervised Learning Tuo Zhao Schools of ISYE and CSE, Georgia Tech ISYE6740/CSE6740/CS7641: Computational Data Analysis/Machine from Portland, Learning Oregon: pervised learning (Supervised)

More information

IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 57, NO. 11, NOVEMBER On the Performance of Sparse Recovery

IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 57, NO. 11, NOVEMBER On the Performance of Sparse Recovery IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 57, NO. 11, NOVEMBER 2011 7255 On the Performance of Sparse Recovery Via `p-minimization (0 p 1) Meng Wang, Student Member, IEEE, Weiyu Xu, and Ao Tang, Senior

More information

Probabilistic Hashing for Efficient Search and Learning

Probabilistic Hashing for Efficient Search and Learning Ping Li Probabilistic Hashing for Efficient Search and Learning July 12, 2012 MMDS2012 1 Probabilistic Hashing for Efficient Search and Learning Ping Li Department of Statistical Science Faculty of omputing

More information

Maximum Likelihood, Logistic Regression, and Stochastic Gradient Training

Maximum Likelihood, Logistic Regression, and Stochastic Gradient Training Maximum Likelihood, Logistic Regression, and Stochastic Gradient Training Charles Elkan elkan@cs.ucsd.edu January 17, 2013 1 Principle of maximum likelihood Consider a family of probability distributions

More information

The Central Limit Theorem: More of the Story

The Central Limit Theorem: More of the Story The Central Limit Theorem: More of the Story Steven Janke November 2015 Steven Janke (Seminar) The Central Limit Theorem:More of the Story November 2015 1 / 33 Central Limit Theorem Theorem (Central Limit

More information

Acceleration of Randomized Kaczmarz Method

Acceleration of Randomized Kaczmarz Method Acceleration of Randomized Kaczmarz Method Deanna Needell [Joint work with Y. Eldar] Stanford University BIRS Banff, March 2011 Problem Background Setup Setup Let Ax = b be an overdetermined consistent

More information

Combining Interval and Probabilistic Uncertainty in Engineering Applications

Combining Interval and Probabilistic Uncertainty in Engineering Applications Combining Interval and Probabilistic Uncertainty in Engineering Applications Andrew Pownuk Computational Science Program University of Texas at El Paso El Paso, Texas 79968, USA ampownuk@utep.edu Page

More information

Brief Introduction to Machine Learning

Brief Introduction to Machine Learning Brief Introduction to Machine Learning Yuh-Jye Lee Lab of Data Science and Machine Intelligence Dept. of Applied Math. at NCTU August 29, 2016 1 / 49 1 Introduction 2 Binary Classification 3 Support Vector

More information

A General Overview of Parametric Estimation and Inference Techniques.

A General Overview of Parametric Estimation and Inference Techniques. A General Overview of Parametric Estimation and Inference Techniques. Moulinath Banerjee University of Michigan September 11, 2012 The object of statistical inference is to glean information about an underlying

More information

Sketching as a Tool for Numerical Linear Algebra

Sketching as a Tool for Numerical Linear Algebra Sketching as a Tool for Numerical Linear Algebra David P. Woodruff presented by Sepehr Assadi o(n) Big Data Reading Group University of Pennsylvania February, 2015 Sepehr Assadi (Penn) Sketching for Numerical

More information

Approximate Dynamic Programming

Approximate Dynamic Programming Master MVA: Reinforcement Learning Lecture: 5 Approximate Dynamic Programming Lecturer: Alessandro Lazaric http://researchers.lille.inria.fr/ lazaric/webpage/teaching.html Objectives of the lecture 1.

More information

STAT 512 sp 2018 Summary Sheet

STAT 512 sp 2018 Summary Sheet STAT 5 sp 08 Summary Sheet Karl B. Gregory Spring 08. Transformations of a random variable Let X be a rv with support X and let g be a function mapping X to Y with inverse mapping g (A = {x X : g(x A}

More information

Probability for Statistics and Machine Learning

Probability for Statistics and Machine Learning ~Springer Anirban DasGupta Probability for Statistics and Machine Learning Fundamentals and Advanced Topics Contents Suggested Courses with Diffe~ent Themes........................... xix 1 Review of Univariate

More information

Compressed Sensing with Very Sparse Gaussian Random Projections

Compressed Sensing with Very Sparse Gaussian Random Projections Compressed Sensing with Very Sparse Gaussian Random Projections arxiv:408.504v stat.me] Aug 04 Ping Li Department of Statistics and Biostatistics Department of Computer Science Rutgers University Piscataway,

More information

Lecture 18: March 15

Lecture 18: March 15 CS71 Randomness & Computation Spring 018 Instructor: Alistair Sinclair Lecture 18: March 15 Disclaimer: These notes have not been subjected to the usual scrutiny accorded to formal publications. They may

More information

Lecture 8: Information Theory and Statistics

Lecture 8: Information Theory and Statistics Lecture 8: Information Theory and Statistics Part II: Hypothesis Testing and I-Hsiang Wang Department of Electrical Engineering National Taiwan University ihwang@ntu.edu.tw December 23, 2015 1 / 50 I-Hsiang

More information

Optimization. Benjamin Recht University of California, Berkeley Stephen Wright University of Wisconsin-Madison

Optimization. Benjamin Recht University of California, Berkeley Stephen Wright University of Wisconsin-Madison Optimization Benjamin Recht University of California, Berkeley Stephen Wright University of Wisconsin-Madison optimization () cost constraints might be too much to cover in 3 hours optimization (for big

More information