arxiv: v1 [stat.me] 7 Jun 2015

Size: px
Start display at page:

Download "arxiv: v1 [stat.me] 7 Jun 2015"

Transcription

1 Optimal Ridge Detection using Coverage Risk arxiv:156.78v1 [stat.me] 7 Jun 15 Yen-Chi Chen Department of Statistics Carnegie Mellon University yenchic@andrew.cmu.edu Shirley Ho Department of Physics Carnegie Mellon University shirleyh@andrew.cmu.edu Abstract Christopher R. Genovese Department of Statistics Carnegie Mellon University genovese@stat.cmu.edu Larry Wasserman Department of Statistics Carnegie Mellon University larry@stat.cmu.edu We introduce the concept of coverage risk as an error measure for density ridge estimation. The coverage risk generalizes the mean integrated square error to set estimation. We propose two risk estimators for the coverage risk and we show that we can select tuning parameters by minimizing the estimated risk. We study the rate of convergence for coverage risk and prove consistency of the risk estimators. We apply our method to three simulated datasets and to cosmology data. In all the examples, the proposed method successfully recover the underlying density structure. 1 Introduction Density ridges [1, 1, 16, 6] are one-dimensional curve like structures that characterize high density regions. Density ridges have been applied to computer vision [], remote sensing [], biomedical imaging [1], and cosmology [5, 7]. Figure 1 provides an example for applying density ridges to learn the structure of our Universe. To detect the density ridges from data, [1] proposed the Subspace Constrained Mean Shift (SCMS algorithm. SCMS is a modification of usual mean shift algorithm [15, 8] to adapt to the local geometry. Unlike mean shift that pushes every mesh point to a local mode, SCMS moves the meshes along a projected gradient until arriving at nearby ridges. Essentially, the SCMS algorithm detects the ridges of the kernel density estimator (KDE. Therefore, the SCMS algorithm requires a pre-selected parameter h, which acts as the role of smoothing bandwidth in the kernel density estimator. Despite the wide application of the SCMS algorithm, the choice of h remains an unsolved problem. Similar to the density estimation problem, a poor choice of h results in over-smoothing or undersmoothing for the density ridges. See the second row of Figure 1. In this paper, we introduce the concept of coverage risk which is a generalization of the mean integrated expected error from function estimation. We then show that one can consistently estimate the coverage risk by using data splitting or the smoothed bootstrap. This leads us to a data-driven selection rule for choosing the parameter h for the SCMS algorithm. We apply the proposed method to several famous datasets including the spiral dataset, the three spirals dataset, and the NIPS dataset. In all simulations, our selection rule allows the SCMS algorithm to detect the underlying structure of the data. 1

2 Figure 1: The cosmic web. This is a slice of the observed Universe from the Sloan Digital Sky Survey. We apply the density ridge method to detect filaments [7]. The top row is one example for the detected filaments. The bottom row shows the effect of smoothing. Bottom-Left: optimal smoothing. Bottom-Middle: under-smoothing. Bottom-Right: over-smoothing. Under optimal smoothing, we detect an intricate filament network. If we under-smooth or over-smooth the dataset, we cannot find the structure. 1.1 Density Ridges Density ridges are defined as follows. Assume X1,, Xn are independently and identically distributed from a smooth probability density function p with compact support K. The density ridges [1, 17, 6] are defined as R {x K : V (xv (xt p(x, λ (x < }, where V (x [v (x, vd (x] with vj (x being the eigenvector associated with the ordered eigenvalue λj (x (λ1 (x λd (x for Hessian matrix H(x p(x. That is, R is the collection of points whose projected gradient V (xv (xt p(x. It can be shown that (under appropriate conditions, R is a collection of 1-dimensional smooth curves (1-dimensional manifolds in Rd. The SCMS algorithm is a plug-in estimate for R by using n o b (x <, bn x K : Vbn (xvbn (xt b R pn (x, λ Pn i b are the associated quantities defined where pbn (x nh1 d i1 K x X is the KDE and Vbn and λ h by pbn. Hence, one can clearly see that the parameter h in the SCMS algorithm plays the same role of smoothing bandwidth for the KDE.

3 Coverage Risk Before we introduce the coverage risk, we first define some geometric concepts. Let µ l be the l- dimensional Hausdorff measure [13]. Namely, µ 1 (A is the length of set A and µ (A is the area of A. Let d(x, A be the projection distance from point x to a set A. We define U R and U Rn as random variables uniformly distributed over the true density ridges R and the ridge estimator R n respectively. Assuming R and R n are given, we define the following two random variables W n d(u R, R n, Wn d(u Rn, R. (1 Note that U R, U Rn are random variables while R, R n are sets. W n is the distance from a randomly selected point on R to the estimator R n and W n is the distance from a random point on R n to R. Let Haus(A, B inf{r : A B r, B A r} be the Hausdorff distance between A and B where A r {x : d(x, A r}. The following lemma gives some useful properties about W n and W n. Lemma 1 Both random variables W n and W n are bounded by Haus( M n, M. Namely, W n Haus( R n, R, W n Haus( R n, R. ( The cumulative distribution function (CDF for W n and W n are P(W n r R µ 1 (R ( R n r n, P( W n r µ 1 (R R n ( µ 1 Rn (R r ( µ 1 Rn. (3 Thus, P(W n r R n is the ratio of R being covered by padding the regions around R n at distance r. This lemma follows trivially by definition so that we omit its proof. Lemma 1 links the random variables W n and W n to the Hausdorff distance and the coverage for R and R n. Thus, we call them coverage random variables. Now we define the L 1 and L coverage risk for estimating R by R n as Risk 1,n E(W n + W n, Risk,n E(W n + W n. (4 That is, Risk 1,n (and Risk,n is the expected (square projected distance between R and R n. Note that the expectation in (4 applies to both R n and U R. One can view Risk,n as a generalized mean integrated square errors (MISE for sets. A nice property of Risk 1,n and Risk,n is that they are not sensitive to outliers of R in the sense that a small perturbation of R will not change the risk too much. On the contrary, the Hausdorff distance is very sensitive to outliers..1 Selection for Tuning Parameters Based on Risk Minimization In this section, we will show how to choose h by minimizing an estimate of the risk. We propose two risk estimators. The first estimator is based on the smoothed bootstrap [4]. We sample X 1, X n from the KDE p n and recompute the estimator R n. The we estimate the risk by Risk 1,n E(W n + W n X 1,, X n E(Wn +, Risk,n where W n d(u Rn, R n and W n d(u R n, R n. W n X 1,, X n, (5 3

4 The second approach is to use data splitting. We randomly split the data into X 11,, X 1m and X 1,, X m, assuming n is even and m n. We compute the estimated manifolds by using half of the data, which we denote as R 1,n and R,n. Then we compute Risk 1,n E(W 1,n + W,n X 1,, X n E(W 1,n, Risk,n + W,n X 1,, X n, (6 where W 1,n d(u, R R,n and W,n d(u, R 1,n R 1,n.,n Having estimated the risk, we select h by h argmin Risk 1,n, (7 h h n where h n is an upper bound by the normal reference rule [5] (which is known to oversmooth, so that we only select h below this rule. Moreover, one can choose h by minimizing L risk as well. In [11], they consider selecting the smoothing bandwidth for local principal curves by self-coverage. This criterion is a different from ours. The self-coverage counts data points. The self-coverage is a monotonic increasing function and they propose to select the bandwidth such that the derivative is highest. Our coverage risk yields a simple trade-off curve and one can easily pick the optimal bandwidth by minimizing the estimated risk. 3 Manifold Comparison by Coverage The concepts of coverage in previous section can be generalized to comparing two manifolds. Let M 1 and M be an l 1 -dimensional and an l -dimensional manifolds (l 1 and l are not necessarily the same. We define the coverage random variables W 1 d(u M1, M, W 1 d(u M, M 1. (8 Then by Lemma 1, the CDF for W 1 and W 1 contains information about how M 1 and M are different from each other: P(W 1 r µ l 1 (M 1 (M r µ l (M 1, P(W 1 r µ l (M (M 1 r. (9 µ r (M 1 P(W 1 r is the coverage on M 1 by padding regions with distance r around M. We call the plots of the CDF of W 1 and W 1 coverage diagrams since they are linked to the coverage over M 1 and M. The coverage diagram allows us to study how two manifolds are different from each other. When l 1 l, the coverage diagram can be used as a similarity measure for two manifolds. When l 1 l, the coverage diagram serves as a measure for quality of representing high dimensional objects by low dimensional ones. A nice property for coverage diagram is that we can approximate the CDF for W 1 and W 1 by a mesh of points (or points uniformly distributed over M 1 and M. In Figure we consider a Helix dataset whose support has dimension d 3 and we compare two curves, a spiral curve (green and a straight line (orange, to represent the Helix dataset. As can be seen from the coverage diagram (right panel, the green curve has better coverage at each distance (compared to the orange curve so that the spiral curve provides a better representation for the Helix dataset. In addition to the coverage diagram, we can also use the following L 1 and L losses as summary for the difference: Loss 1 (M 1, M E(W 1 + W 1, Loss (M 1, M E(W 1 + W1. (1 The expectation is take over U M1 and U M and both M 1 and M here are fixed. The risks in (4 are the expected losses: Risk 1,n E ( ( Loss 1 ( M n, M, Risk,n E Loss ( M n, M. (11 4

5 r Coverage Figure : The Helix dataset. The original support for the Helix dataset (black dots are a 3- dimensional regions. We can use green spiral curves (d 1 to represent the regions. Note that we also provide a bad representation using a straight line (orange. The coverage plot reveals the quality for representation. Left: the original data. Right: the coverage plot for the spiral curve (green versus a straight line (orange. 4 Theoretical Analysis In this section, we analyze the asymptotic behavior for the coverage risk and prove the consistency for estimating the coverage risk by the proposed method. In particular, we derive the asymptotic properties for the density ridges. We only focus on L risk since by Jensen s inequality, the L risk can be bounded by the L 1 risk. Before we state our assumption, we first define the orientation of density ridges. Recall that the density ridge R is a collection of one dimensional curves. Thus, for each point x R, we can associate a unit vector e(x that represent the orientation of R at x. The explicit formula for e(x can be found in Lemma 1 of [6]. Assumptions. (R There exist β, β 1, β, δ R > such that for all x R δ R, λ (x β 1, λ 1 (x β β 1, p(x p (3 (x max β (β 1 β, (1 where p (3 (x max is the element wise norm to the third derivative. And for each x R, e(x T p(x λ 1(x λ 1(x λ (x. (K1 The kernel function K BC 3 and is symmetric, non-negative and x K (α (xdx <, ( K (α (x dx < for all α, 1,, 3. (K The kernel function K and its partial derivative satisfies condition K 1 in [18]. Specifically, let K { y K (α ( x y h : x R d, h >, α, 1, } (13 We require that K satisfies sup P N ( K, L (P, ɛ F L(P ( A ɛ v (14 5

6 for some positive number A, v, where N(T, d, ɛ denotes the ɛ-covering number of the metric space (T, d and F is the envelope function of K and the supreme is taken over the whole R d. The A and v are usually called the VC characteristics of K. The norm F L(P sup P F (x dp (x. Assumption (R appears in [6] and is very mild. The first two inequality in (1 are just the bound on eigenvalues. The last inequality requires the density around ridges to be smooth. The latter part of (R requires the direction of ridges to be similar to the gradient direction. Assumption (K1 is the common condition for kernel density estimator see e.g. [6] and [3]. Assumption (K is to regularize the classes of kernel functions that is widely assumed [1, 17, 4]; any bounded kernel function with compact support satisfies this condition. Both (K1 and (K hold for the Gaussian kernel. Under the above condition, we derive the rate of convergence for the L risk. Theorem Let Risk,n be the L coverage risk for estimating the density ridges and level sets. Assume (K1 and (R and p is at least four times bounded differentiable. Then as n, h and log n nh d+6 ( Risk,n BRh 4 + σ R 1 nh d+ + o(h4 + o nh d+, for some B R and σr that depends only on the density p and the kernel function K. The rate in Theorem shows a bias-variance decomposition. The first term involving h 4 is the bias term while the latter term is the variance part. Thanks to the Jensen s inequality, the rate of convergence for L 1 risk is the square root of the rate Theorem. Note that we require the smoothing parameter h to decay slowly to by log n. This constraint comes from the uniform bound nh for estimating third derivatives for p. We d+6 need this constraint since we need the smoothness for estimated ridges to converge to the smoothness for the true ridges. Similar result for density level set appears in [3, 19]. By Lemma 1, we can upper bound the L risk by expected square of the Hausdorff distance which gives the rate ( Risk,n E Haus ( R ( log n n, R O(h 4 + O nh d+ (15 The rate under Hausdorff distance for density ridges can be found in [6] and the rate for density ridges appears in [9]. The rate induced by Theorem agrees with the bound from the Hausdorff distance and has a slightly better rate for variance (without a log-n factor. This phenomena is similar to the MISE and L error for nonparametric estimation for functions. The MISE converges slightly faster (by a log-n factor than square to the L error. Now we prove the consistency of the risk estimators. In particular, we prove the consistency for the smoothed bootstrap. The case of data splitting can be proved in the similar way. Theorem 3 Let Risk,n be the L coverage risk for estimating the density ridges and level sets. Let Risk,n be the corresponding risk estimator by the smoothed bootstrap. Assume (K1 and (R and p is at least four times bounded differentiable. Then as n, h and log n nh d+6, Risk,n Risk,n Risk,n P. Theorem 3 proves the consistency for risk estimation using the smoothed bootstrap. This also leads to the consistency for data splitting. 6

7 (Estimated L1 Coverage Risk Smoothing Parameter (Estimated L1 Coverage Risk Smoothing Parameter (Estimated L1 Coverage Risk Smoothing Parameter Figure 3: Three different simulation datasets. Top row: the spiral dataset. Middle row: the three spirals dataset. Bottom row: NIPS character dataset. For each row, the leftmost panel shows the estimated L1 coverage risk using data splitting. Then the rest three panels, are the result using different smoothing parameters. From left to right, we show the result for under-smoothing, optimal smoothing (using the coverage risk, and over-smoothing Applications Simulation Data We now apply the data splitting technique (7 to choose the smoothing bandwidth for density ridge estimation. The density ridge estimation can be done by the subspace constrain mean shift algorithm [1]. We consider three famous datasets: the spiral dataset, the three spirals dataset and a NIPS dataset. Figure 3 shows the result for the three simulation datasets. The top row is the spiral dataset; the middle row is the three spirals dataset; the bottom row is the NIPS character dataset. For each row, from left to right the first panel is the estimated L1 risk by using data splitting. The second to fourth panels are under-smoothing, optimal smoothing, and over-smoothing. Note that we also remove the ridges whose density is below.5 maxx pbn (x since they behave like random noise. As can be seen easily, the optimal bandwidth allows the density ridges to capture the underlying structures in every dataset. On the contrary, the under-smoothing and the over-smoothing does not capture the structure and have a higher risk. 5. Cosmic Web Now we apply our technique to the Sloan Digital Sky Survey, a huge dataset that contains millions of galaxies. In our data, each point is an observed galaxy with three features: 7

8 (Estimated L1 Coverage Risk Smoothing Parameter Figure 4: Another slice for the cosmic web data from the Sloan Digital Sky Survey. The leftmost panel shows the (estimated L1 coverage risk (right panel for estimating density ridges under different smoothing parameters. We estimated the L1 coverage risk by using data splitting. For the rest panels, from left to right, we display the case for under-smoothing, optimal smoothing, and over-smoothing. As can be seen easily, the optimal smoothing method allows the SCMS algorithm to detect the intricate cosmic network structure. z: the redshift, which is the distance from the galaxy to Earth. RA: the right ascension, which is the longitude of the Universe. dec: the declination, which is the latitude of the Universe. These three features (z, RA, dec uniquely determine the location of a given galaxy. To demonstrate the effectiveness of our method, we select a -D slice of our Universe at redshift z.5.55 with (RA, dec [, 4] [, 4]. Since the redshift difference is very tiny, we ignore the redshift value of the galaxies within this region and treat them as a -D data points. Thus, we only use RA and dec. Then we apply the SCMS algorithm (version of [7] with data splitting method introduced in section.1 to select the smoothing parameter h. The result is given in Figure 4. The left panel provides the estimated coverage risk at different smoothing bandwidth. The rest panels give the result for under-smoothing (second panel, optimal smoothing (third panel and over-smoothing (right most panel. In the third panel of Figure 4, we see that the SCMS algorithm detects the filament structure in the data. 6 Discussion In this paper, we propose a method using coverage risk, a generalization of mean integrated square error, to select the smoothing parameter for the density ridge estimation problem. We show that the coverage risk can be estimated using data splitting or smoothed bootstrap and we derive the statistical consistency for risk estimators. Both simulation and real data analysis show that the proposed bandwidth selector works very well in practice. The concept of coverage risk is not limited to density ridges; instead, it can be easily generalized to other manifold learning technique. Thus, we can use data splitting to estimate the risk and use the risk estimator to select the tuning parameters. This is related to the so-called stability selection [], which allows us to select tuning parameters even in an unsupervised learning settings. A Proofs Before we prove Theorem, we need the following lemma for comparing two curves. Lemma 4 Let S1, S be two bounded smooth curves in Rd. Let π1 : S1 7 S and π1 : S 7 S1 be the projections between them. For a S1 and b S, define g1 (a and g (b as the unit tangent vectors for S1 and S at a and b respectively. Assume S1 and S are similar in the following sense: 8

9 (S1 π 1 and π 1 are one-one and onto, (S the projections are similar: { } max sup π 1 (x π1 1 (x, sup π 1 (x π1 1 (x O(ɛ 1, x S 1 x S (S3 the tangent vectors are similar: { } max sup g 1 (x T g (π 1 (x, sup g (x T g 1 (π 1 (x 1 + O(ɛ, x S 1 x S (S4 the length are similar: length(s1 length(s O(ɛ 3 with ɛ 1, ɛ, ɛ 3 being very small. Let I 1 S 1 x π 1 (x dx and I S y π 1 (y dy. Then we have I 1 I I O(ɛ 1 + I O(ɛ + ɛ 3. Moreover, if we further assume (S5 the Hausdorff distance Haus(S 1, S O(ɛ 4 is small, then for any function ξ : R d R that has bounded continuous derivative, we have ξ(γ 1 (tdt ξ(γ (tdt(1 + O(ɛ + ɛ 3 + ɛ 4. PROOF. Since S 1 and S are two bounded, smooth curves. We may parametrized them by γ 1 : [, 1] S 1 and γ : [, 1] S with γ 1(t g 1 (γ 1 (t, γ 1 ( s 1, γ (t g (γ (t, γ ( s π 1 (γ 1 (, where g 1 l 1 g 1 and g l g for l 1, l being the length of S 1 and S and s 1 one of the end point of S 1. The constant l j works as a normalization constant since g j is an unit vector; it is easy to verify that length(s j g j (t dt l j g j (t dt l j. The starting point s S must be the projection π 1 (s 1 otherwise the condition (S1 will not hold. Let I 1 γ 1 (t π 1 (γ 1 (t dt, I Then the goal is to prove I 1 I O(ɛ 1 + O(ɛ. (16 γ (t π 1 (γ (t dt. (17 Now we consider another parametrization for S. Let η : [, 1] S such that η (t π 1 (γ 1 (t. By (S1, η is a parametrization for S. The parametrization η (t has the following useful properties: η ( π 1 (γ 1 ( s, η (t g (η (tg (η (t T γ 1(t g (η (tg (π 1 (γ 1 (t T (18 g 1 (γ 1 (t. By condition (S3 and (S4, we have g (π 1 (γ 1 (t T g 1 (γ 1 (t l 1 g (π 1 (γ 1 (t T g 1 (γ 1 (t l 1 (1 + O(ɛ l (1 + O(ɛ + O(ɛ 3 9 (19

10 uniformly for all t [, 1]. Now apply this result to η (t, we obtain that Together with η ( γ (, we have η (t g (η (t(1 + O(ɛ + O(ɛ 3. ( sup η (t γ (t O(ɛ + O(ɛ 3. (1 t [,1] Now by definition of I 1 and the fact that π1 1 (η (t γ 1 (t, we have I 1 γ 1 (t π 1 (γ 1 (t dt π 1 1 (η (t η (t dt π 1 (η (t + O(ɛ 1 η (t dt I + I O(ɛ 1, by (S ( where I π 1(η (t η (t dt. Now we bound the difference between I and I. Let U be an uniform distribution over [, 1] and define h(x : [, 1] R as h(x π 1 (γ (x γ (x. Note that it is easy to see that h(x has bounded derivative. Then, I E π 1 (γ (U γ (U Eh(U. (3 Since both γ and η are parametrization for the curve S, γ 1 is well defined for all image of η. We define the random variable W γ 1 (η (U. Then by definition of I, I E π 1 (η (U η (U Eh(W. (4 Since sup t [,1] γ (t η (t O(ɛ + O(ɛ 3, we have γ 1 (η (x x + O(ɛ + O(ɛ 3. Thus, the p W (t p U (t O(ɛ +O(ɛ 3, where p W and p U are the probability density for random variable W and U. Since U is uniform distribution, p U 1 so that Eh(W h(tp W (tdt h(t(p U (t + O(ɛ + O(ɛ 3 dt h(t(1 + O(ɛ + O(ɛ 3 dt Eh(U(1 + O(ɛ + O(ɛ 3. This implies I I (1 + O(ɛ + O(ɛ 3. Therefore, by ( we conclude (5 I 1 I + I O(ɛ 1 which completes the proof for the first assertion. I + I O(ɛ 1 + I (O(ɛ + O(ɛ 3, (6 Now we prove the second assertion, here we will assume (S5. Since ξ has bounded first derivative, ξ(γ 1 (tdt ξ(π 1 (γ 1 (tdt(1 + O(Haus(S 1, S ξ(η (tdt(1 + O(ɛ 4. (7 1

11 Again, let U be the uniform distribution and W γ 1 (η (U. We now define the function h(t ξ(γ (t for t [, 1]. Since both ξ and γ are bounded differentiable, h is also bounded differentiable. Then it is easy to see that ξ(η (tdt ξ(γ (tγ 1 (η (tdt E h(w ξ(γ (tdt E h(u. Now by the same derivation of (5, we conclude Thus, by (7 and (9, we conclude which completes the proof. (8 ξ(η (tdt E h(w E h(u(1 + O(ɛ + O(ɛ 3. (9 ξ(γ 1 (tdt ξ(γ (tdt(1 + O(ɛ + O(ɛ 3 + O(ɛ 4, (3 The following Lemma bounds the rate of convergence for the kernel density estimator and will be used frequently in the following derivation. Lemma 5 (Lemma 1 of [6]; see also [17] Assume (K1 K and that log n/n h d b for some < b < 1. Then we have ( log n p n p k,max O(h + O P nh d+k (31 for k,, 3. Moreover, E p n p k,max O(h + O ( log n nh d+k. (3 PROOF FOR THEOREM. Here we prove the case for density ridges. The case for density level set can be proved by the similar method. We will use Lemma 4 to obtain the rate. Our strategy is that first we derive E(d(U R, R n and then show that the other part E(d(U Rn, R is similar to the first part. Part 1. We first introduce the concept of reach [14]. For a smooth set A, the reach is defined as reach(a inf{r : every point in A r has an unique projection onto A.}. (33 The reach condition is essential to establish a one-one projection between two smooth sets. By Lemma, property 7 of [6], reach(r min { δr, β } A ( p (3 max + p (4 max for some constant A. Note that δ R and β are the constants in condition (R. (34 Thus, as long as R n is close to R, every point on R n has an unique projection onto R. Similarly, reach( R n will have a similar bound to reach(r whenever p n p 4,max is small (reach only depends on fourth derivatives. Hence, every point on R will have an unique projection onto R n. 11

12 The projections between R and R n will be one-one and onto except for points near the end points for R and R n. That is, when p n p 4,max is sufficiently small, there exists R R and R n R n such that the projection between R and R n are one-one and onto. Moreover, the length difference length(r length(r O(Haus( R n, R, length( R n length( R n O(Haus( R n, R. Note that by Theorem 6 in [17], (35 Haus( R n, R O( p n p,max. (36 Let x R, and let x π Rn (x R n be its projection onto R n. Then by Theorem 3 in [6] (see their derivation in the proof, the empirical approximation, page 3-3 and equation (79, we have x x W (x(ĝ n (x g(x(1 + O( p n p 3,max, (37 where W (x N(xH 1 N (xn(x (38 H N (x N(x T H(xN(x and N(x is a d (d 1 matrix called the normal matrix for R at x whose columns space spanned the normal space for R at x. The existence for N(x is given in Section 3. and Lemma in [6]. Thus, we have E ( d(x, R n E ( x x E W (x(ĝ n (x g(x + n, (39 where n is the remaining term and by Cauchy-Schwartz inequality, Thus, n E W (x(ĝ n (x g(x O(E p n p 3,max. ( E d(x, R n E W (x(ĝ n (x g(x + n E W (x(ĝ n (x E(ĝ n (x + E(ĝ n (x g(x + n Tr(Cov(W (xĝ n (x + W (x(e(ĝ n (x g(x (4 + n 1 ( 1 nh d+ Tr(Σ(x + h4 b(x T b(x + o nh d+ + o ( h 4, where Σ(x W (xσ(kw (xp(x, b(x c(kw (x ( (41 p(x are related to the variance and bias for nonparametric gradient estimation (Σ(Kp(x is the asymptotic covariance matrix for p n and c(k ( p(x is the asymptotic bias for p n. Σ(K is a matrix and c(k is a scalar; they both depends only on the kernel function K. + + is x 1 x d the Laplacian operator. Now we compute E(d(U R, R n. bounded by (35 and (36: Note that since the length difference between R and R is P(U R R 1 O(E( p n p,max ( log n 1 O(h O nh d+4. Note that we use Lemma 5 to convert the norm into probability bound. By tower property (law of total expectation, E(d(U R, R n E(E(d(U R, R n U R E(E(d(U R, R n U R, U R R P(U R R + E(E(d(U R, R n U R, U R / R P(U R / R ( ( 1 1 E nh d+ Tr(Σ(U R + h 4 b(u R T b(u R + o nh d+ + o ( h 4. 1 (4 (43

13 Note that by (4, the contribution from P(U R / R is smaller than the main effect in (4 so we absorb it into the small o terms. Defining BR E(b(U R T b(u R and σr E(Tr(Σ(U R, we obtain ( E(d(U R, R n BRh 4 + σ R 1 nh d+ + o nh d+ + o ( h 4. (44 Part. We have proved the first part for the L coverage risk. Now we prove the result for E(d(U Rn, R ; this will apply Lemma 4. If we think of R as S 1 and R n as S in Lemma 4, then E(d(U R, R n X 1,, X n E(d(U R n, R X 1,, X n γ 1 (t π 1 (γ 1 (t dt I 1 γ (t π 1 (γ (t dt I. Thus, E(d(U R n, R is approximated by E(d(U R, R n if the ɛ 1, ɛ, ɛ 3 in Lemma 4 is small. Here we bound ɛ j. The bound for ɛ 1 is simple. For all x S 1, let θ be the angle between the two vectors v 1 π 1 (x x and v π1 1 (x x. By the property of projection, v 1 is normal to R n at π 1 (x and v is normal to R at x. Thus, by Lemma properties 5 and 6 of [6], the angle θ is bounded by O( p n p 3,max. Note that their Lemma proves the normal matrices N(x and N n (π 1 (x are close which implies the canonical angle between two subspace are close so that θ is bounded. Now by the fact that both π 1 (x x and π1 1 (x x are bounded by Haus( R n, R, we conclude ɛ 1 Haus( R n, R θ O( p n p 3,max. For ɛ, we will use the property of normal matrix N(x. Let N n (x be the normal matrix for R n at x. By Lemma, properties 5 and 6 of [6], N(xN(x T N n (π Rn (x N n (π Rn (x T max O(Haus( R n, R + O( p n p 3,max O( p n p 3,max. N(xN(x T is the projection matrix onto normal space; so the tangent vector is perpendicular to that projection. The bounds for the two projection matrix implies the bound to the two tangent vectors. Thus, ɛ O( p n p 3,max. For ɛ 3, since the smoothness for R n is similar to R (the normal direction is similar by ɛ and their Hausdorff distance is bounded by O( p n p,max. The length difference is at the same rate of Hausdorff distance. Thus, we may pick ɛ 3 O( p n p,max. Let I 1 E(d(U R, R n X 1,, X n and I E(d(U R n, R X 1,, X n. By Lemma 4 and the above choice for ɛ j, we conclude (45 I 1 I (1 + O( p n p 3,max + I O( p n p 3,max. (46 Thus, by tower property again (taking expectation over both side and Lemma 5 E p n p 3,max ( O(h log n + O o(1, nh d+6 E(d(U R, R n E(I 1 E(I + o(1 E(d(U R n, R + o(1. (47 Now since by (35 and the fact that EHaus( R n, R o(1, we have E(d(U R, R n E(d(U R, R n (1 + o(1 E(d(U R n, R E(d(U Rn, R (1 + o(1. (48 13

14 Combining by (44, (47 and (48, we conclude Risk,n E(d(U R, R n + E(d(U Rn, R E(d(U R, R n + o(1 B Rh 4 + σ R nh d+ + o ( 1 nh d+ + o ( h 4, (49 where BR E(b(U R T b(u R and σr E(Tr(Σ(U R. Note that all the above derivation works only when ( log n E p n p 3,max O(h + O nh d+6 o(1. (5 This requires h and log n, which constitutes the conditions on h we need. nh d+6 PROOF FOR THEOREM 3. are given. Since we are proving the bootstrap consistency, we assume X 1,, X n By Theorem, the estimated risk Risk n, has the following asymptotic behavior ( Risk n, B Rh 4 + σ R 1 nh d+ + o nh d+ + o ( h 4, (51 where B R E ( bn (U T bn Rn (U X Rn 1,, X n, (5 σ R E (Tr( Σ n (U Rn X 1,, X n with b n (x c(kw (x ( p n (x and Σ n (x W (xσ(kw (x p n (x from (41. To prove the bootstrap consistency, it is equivalent to prove that B R and σ R converges to B R and σ R. Here we prove the consistency for B R. The consistency for σ R can be proved in the similar way. We define the following two functions Ω n (x c(kw (x ( p n (x, Ω(x c(kw (x ( p(x. It is easy to see that B R E ( Ωn (U Rn X 1,, X n and B R E (Ω(U R. (53 Similarly as in the proof for Theorem, we define R n R n that has one-one and onto projection to R. By (4, we can replace U Rn by U R n and U R by U R at the cost of probability O(h + ( O log n nh d+4. Now we will apply Lemma 4 again to prove the result. Again, we think of R as S 1 and R n as S. Let U be an uniform distribution over [, 1]. Then the random variable U R γ 1 (U and γ (U. Thus, U R n E (Ω(U R Ω(γ 1 (tdt, E ( Ωn (U R X 1,, X n n Ω n (γ (tdt. (54 14

15 By the second assertion in Lemma 4, E (Ω(U R Ω(γ 1 (tdt Ω(γ (tdt(1 + O(ɛ + O(ɛ 3 + O(ɛ 4 Ω(γ (tdt(1 + O( p n p 3,max. Note that we use the fact that Haus( R n, R O( p n p,max. Since Ω only involves third derivative for the density p, we have sup x R d Ω(x Ω n (x O( p n p 3,max. This implies Ω(γ (tdt (55 Ω n (γ (tdt + O( p n p 3,max. (56 Now combining all the above and the definition for B R, we conclude B R E ( Ωn (U X Rn 1,, X n E ( Ωn (U R X 1,, X n n + O(Haus( R n, R Ω n (γ (tdt + O(Haus( R n, R (by (54 Ω(γ (tdt + O( p n p 3,max (by (56 E (Ω(U R + O( p n p 3,max (by (55 E (Ω(U R + O( p n p 3,max B R + O( p n p 3,max. (57 Therefore, as along as we have p n p 3,max o P (1, we have B R BR o P (1. (58 Similarly, we the same condition implies σ R σr o P (1. (59 Now recall from (51 and Theorem, the risk difference is Risk n, Risk n, ( B R B Rh 4 + σ R σ R nh d+ ( o P h 4 ( 1 + o P nh d+ + o ( h 4 ( 1 + o nh d+ (by (58 and (59. Since Theorem implies Risk n, O ( h 4 + O ( 1 nh d+, by (6 we have Risk n, Risk n, Risk n, o P (1 (61 which proves the theorem. Note that in order (61 to hold, we need p n p 3,max o P (1. By Lemma 5, ( log n p n p 3,max O(h + O P nh d+6. (6 log n Thus, a sufficient condition to p n p 3,max o P (1 is to pick h such that and h. nh This gives the restriction for the smoothing parameter h. d+6 (6 15

16 References [1] E. Bas, N. Ghadarghadar, and D. Erdogmus. Automated extraction of blood vessel networks from 3d microscopy image stacks via multi-scale principal curve tracing. In Biomedical Imaging: From Nano to Macro, 11 IEEE International Symposium on, pages IEEE, 11. [] E. Bas, D. Erdogmus, R. Draft, and J. W. Lichtman. Local tracing of curvilinear structures in volumetric color images: application to the brainbow analysis. Journal of Visual Communication and Image Representation, 3(8:16 171, 1. [3] B. Cadre. Kernel estimation of density level sets. Journal of multivariate analysis, 6. [4] Y.-C. Chen, C. R. Genovese, R. J. Tibshirani, and L. Wasserman. Nonparametric modal regression. arxiv preprint arxiv: , 14. [5] Y.-C. Chen, C. R. Genovese, and L. Wasserman. Generalized mode and ridge estimation. arxiv: , June 14. [6] Y.-C. Chen, C. R. Genovese, and L. Wasserman. Asymptotic theory for density ridges. arxiv preprint arxiv: , 14. [7] Y.-C. Chen, S. Ho, P. E. Freeman, C. R. Genovese, and L. Wasserman. Cosmic web reconstruction through density ridges: Method and algorithm. arxiv preprint arxiv: , 15. [8] Y. Cheng. Mean shift, mode seeking, and clustering. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 17(8:79 799, [9] A. Cuevas, W. Gonzalez-Manteiga, and A. Rodriguez-Casal. Plug-in estimation of general level sets. Aust. N. Z. J. Stat., 6. [1] D. Eberly. Ridges in Image and Data Analysis. Springer, [11] J. Einbeck. Bandwidth selection for mean-shift based unsupervised learning techniques: a unified approach via self-coverage. Journal of pattern recognition research., 6(:175 19, 11. [1] U. Einmahl and D. M. Mason. Uniform in bandwidth consistency for kernel-type function estimators. The Annals of Statistics, 5. [13] L. C. Evans and R. F. Gariepy. Measure theory and fine properties of functions, volume 5. CRC press, [14] H. Federer. Curvature measures. Trans. Am. Math. Soc, 93, [15] K. Fukunaga and L. Hostetler. The estimation of the gradient of a density function, with applications in pattern recognition. Information Theory, IEEE Transactions on, 1(1:3 4, [16] C. R. Genovese, M. Perone-Pacifico, I. Verdinelli, and L. Wasserman. Nonparametric ridge estimation. arxiv: v1, 1. [17] C. R. Genovese, M. Perone-Pacifico, I. Verdinelli, L. Wasserman, et al. Nonparametric ridge estimation. The Annals of Statistics, 4(4: , 14. [18] E. Gine and A. Guillou. Rates of strong uniform consistency for multivariate kernel density estimators. In Annales de l Institut Henri Poincare (B Probability and Statistics,. [19] D. M. Mason, W. Polonik, et al. Asymptotic normality of plug-in level set estimates. The Annals of Applied Probability, 19(3: , 9. [] Z. Miao, B. Wang, W. Shi, and H. Wu. A method for accurate road centerline extraction from a classified image. 14. [1] U. Ozertem and D. Erdogmus. Locally defined principal curves and surfaces. Journal of Machine Learning Research, 11. [] A. Rinaldo and L. Wasserman. Generalized density clustering. The Annals of Statistics, 1. [3] D. W. Scott. Multivariate density estimation: theory, practice, and visualization, volume 383. John Wiley & Sons, 9. [4] B. Silverman and G. Young. The bootstrap: To smooth or not to smooth? Biometrika, 74(3: , [5] B. W. Silverman. Density Estimation for Statistics and Data Analysis. Chapman and Hall, [6] L. Wasserman. All of Nonparametric Statistics. Springer-Verlag New York, Inc., 6. 16

Optimal Ridge Detection using Coverage Risk

Optimal Ridge Detection using Coverage Risk Optimal Ridge Detection using Coverage Risk Yen-Chi Chen Department of Statistics Carnegie Mellon University yenchic@andrew.cmu.edu Shirley Ho Department of Physics Carnegie Mellon University shirleyh@andrew.cmu.edu

More information

Nonparametric Inference via Bootstrapping the Debiased Estimator

Nonparametric Inference via Bootstrapping the Debiased Estimator Nonparametric Inference via Bootstrapping the Debiased Estimator Yen-Chi Chen Department of Statistics, University of Washington ICSA-Canada Chapter Symposium 2017 1 / 21 Problem Setup Let X 1,, X n be

More information

arxiv: v3 [stat.me] 13 Oct 2015

arxiv: v3 [stat.me] 13 Oct 2015 The Annals of Statistics 2015, Vol. 43, No. 5, 1896 1928 DOI: 10.1214/15-AOS1329 c Institute of Mathematical Statistics, 2015 ASYMPTOTIC THEORY FOR DENSITY RIDGES arxiv:1406.5663v3 [stat.me] 13 Oct 2015

More information

arxiv: v2 [stat.me] 12 Sep 2017

arxiv: v2 [stat.me] 12 Sep 2017 1 arxiv:1704.03924v2 [stat.me] 12 Sep 2017 A Tutorial on Kernel Density Estimation and Recent Advances Yen-Chi Chen Department of Statistics University of Washington September 13, 2017 This tutorial provides

More information

Lectures in AstroStatistics: Topics in Machine Learning for Astronomers

Lectures in AstroStatistics: Topics in Machine Learning for Astronomers Lectures in AstroStatistics: Topics in Machine Learning for Astronomers Jessi Cisewski Yale University American Astronomical Society Meeting Wednesday, January 6, 2016 1 Statistical Learning - learning

More information

arxiv: v1 [stat.me] 10 May 2018

arxiv: v1 [stat.me] 10 May 2018 / 1 Analysis of a Mode Clustering Diagram Isabella Verdinelli and Larry Wasserman, arxiv:1805.04187v1 [stat.me] 10 May 2018 Department of Statistics and Data Science and Machine Learning Department Carnegie

More information

Discriminative Direction for Kernel Classifiers

Discriminative Direction for Kernel Classifiers Discriminative Direction for Kernel Classifiers Polina Golland Artificial Intelligence Lab Massachusetts Institute of Technology Cambridge, MA 02139 polina@ai.mit.edu Abstract In many scientific and engineering

More information

Estimation of density level sets with a given probability content

Estimation of density level sets with a given probability content Estimation of density level sets with a given probability content Benoît CADRE a, Bruno PELLETIER b, and Pierre PUDLO c a IRMAR, ENS Cachan Bretagne, CNRS, UEB Campus de Ker Lann Avenue Robert Schuman,

More information

Learning Eigenfunctions: Links with Spectral Clustering and Kernel PCA

Learning Eigenfunctions: Links with Spectral Clustering and Kernel PCA Learning Eigenfunctions: Links with Spectral Clustering and Kernel PCA Yoshua Bengio Pascal Vincent Jean-François Paiement University of Montreal April 2, Snowbird Learning 2003 Learning Modal Structures

More information

Lecture Notes 15 Prediction Chapters 13, 22, 20.4.

Lecture Notes 15 Prediction Chapters 13, 22, 20.4. Lecture Notes 15 Prediction Chapters 13, 22, 20.4. 1 Introduction Prediction is covered in detail in 36-707, 36-701, 36-715, 10/36-702. Here, we will just give an introduction. We observe training data

More information

Cheng Soon Ong & Christian Walder. Canberra February June 2018

Cheng Soon Ong & Christian Walder. Canberra February June 2018 Cheng Soon Ong & Christian Walder Research Group and College of Engineering and Computer Science Canberra February June 2018 Outlines Overview Introduction Linear Algebra Probability Linear Regression

More information

Nonparametric Estimation of Luminosity Functions

Nonparametric Estimation of Luminosity Functions x x Nonparametric Estimation of Luminosity Functions Chad Schafer Department of Statistics, Carnegie Mellon University cschafer@stat.cmu.edu 1 Luminosity Functions The luminosity function gives the number

More information

The Laplacian PDF Distance: A Cost Function for Clustering in a Kernel Feature Space

The Laplacian PDF Distance: A Cost Function for Clustering in a Kernel Feature Space The Laplacian PDF Distance: A Cost Function for Clustering in a Kernel Feature Space Robert Jenssen, Deniz Erdogmus 2, Jose Principe 2, Torbjørn Eltoft Department of Physics, University of Tromsø, Norway

More information

Advances in Manifold Learning Presented by: Naku Nak l Verm r a June 10, 2008

Advances in Manifold Learning Presented by: Naku Nak l Verm r a June 10, 2008 Advances in Manifold Learning Presented by: Nakul Verma June 10, 008 Outline Motivation Manifolds Manifold Learning Random projection of manifolds for dimension reduction Introduction to random projections

More information

A Novel Nonparametric Density Estimator

A Novel Nonparametric Density Estimator A Novel Nonparametric Density Estimator Z. I. Botev The University of Queensland Australia Abstract We present a novel nonparametric density estimator and a new data-driven bandwidth selection method with

More information

Unsupervised Learning with Permuted Data

Unsupervised Learning with Permuted Data Unsupervised Learning with Permuted Data Sergey Kirshner skirshne@ics.uci.edu Sridevi Parise sparise@ics.uci.edu Padhraic Smyth smyth@ics.uci.edu School of Information and Computer Science, University

More information

Gaussian Estimation under Attack Uncertainty

Gaussian Estimation under Attack Uncertainty Gaussian Estimation under Attack Uncertainty Tara Javidi Yonatan Kaspi Himanshu Tyagi Abstract We consider the estimation of a standard Gaussian random variable under an observation attack where an adversary

More information

From CDF to PDF A Density Estimation Method for High Dimensional Data

From CDF to PDF A Density Estimation Method for High Dimensional Data From CDF to PDF A Density Estimation Method for High Dimensional Data Shengdong Zhang Simon Fraser University sza75@sfu.ca arxiv:1804.05316v1 [stat.ml] 15 Apr 2018 April 17, 2018 1 Introduction Probability

More information

A proposal of a bivariate Conditional Tail Expectation

A proposal of a bivariate Conditional Tail Expectation A proposal of a bivariate Conditional Tail Expectation Elena Di Bernardino a joint works with Areski Cousin b, Thomas Laloë c, Véronique Maume-Deschamps d and Clémentine Prieur e a, b, d Université Lyon

More information

Frontier estimation based on extreme risk measures

Frontier estimation based on extreme risk measures Frontier estimation based on extreme risk measures by Jonathan EL METHNI in collaboration with Ste phane GIRARD & Laurent GARDES CMStatistics 2016 University of Seville December 2016 1 Risk measures 2

More information

Random Forests. These notes rely heavily on Biau and Scornet (2016) as well as the other references at the end of the notes.

Random Forests. These notes rely heavily on Biau and Scornet (2016) as well as the other references at the end of the notes. Random Forests One of the best known classifiers is the random forest. It is very simple and effective but there is still a large gap between theory and practice. Basically, a random forest is an average

More information

12 - Nonparametric Density Estimation

12 - Nonparametric Density Estimation ST 697 Fall 2017 1/49 12 - Nonparametric Density Estimation ST 697 Fall 2017 University of Alabama Density Review ST 697 Fall 2017 2/49 Continuous Random Variables ST 697 Fall 2017 3/49 1.0 0.8 F(x) 0.6

More information

Half of Final Exam Name: Practice Problems October 28, 2014

Half of Final Exam Name: Practice Problems October 28, 2014 Math 54. Treibergs Half of Final Exam Name: Practice Problems October 28, 24 Half of the final will be over material since the last midterm exam, such as the practice problems given here. The other half

More information

Monte Carlo Methods for Statistical Inference: Variance Reduction Techniques

Monte Carlo Methods for Statistical Inference: Variance Reduction Techniques Monte Carlo Methods for Statistical Inference: Variance Reduction Techniques Hung Chen hchen@math.ntu.edu.tw Department of Mathematics National Taiwan University 3rd March 2004 Meet at NS 104 On Wednesday

More information

Non-linear Dimensionality Reduction

Non-linear Dimensionality Reduction Non-linear Dimensionality Reduction CE-725: Statistical Pattern Recognition Sharif University of Technology Spring 2013 Soleymani Outline Introduction Laplacian Eigenmaps Locally Linear Embedding (LLE)

More information

Modal Regression using Kernel Density Estimation: a Review

Modal Regression using Kernel Density Estimation: a Review Modal Regression using Kernel Density Estimation: a Review Yen-Chi Chen arxiv:1710.07004v2 [stat.me] 7 Dec 2017 Abstract We review recent advances in modal regression studies using kernel density estimation.

More information

Tutorial on Machine Learning for Advanced Electronics

Tutorial on Machine Learning for Advanced Electronics Tutorial on Machine Learning for Advanced Electronics Maxim Raginsky March 2017 Part I (Some) Theory and Principles Machine Learning: estimation of dependencies from empirical data (V. Vapnik) enabling

More information

Cheng Soon Ong & Christian Walder. Canberra February June 2018

Cheng Soon Ong & Christian Walder. Canberra February June 2018 Cheng Soon Ong & Christian Walder Research Group and College of Engineering and Computer Science Canberra February June 2018 (Many figures from C. M. Bishop, "Pattern Recognition and ") 1of 254 Part V

More information

Math 350 Fall 2011 Notes about inner product spaces. In this notes we state and prove some important properties of inner product spaces.

Math 350 Fall 2011 Notes about inner product spaces. In this notes we state and prove some important properties of inner product spaces. Math 350 Fall 2011 Notes about inner product spaces In this notes we state and prove some important properties of inner product spaces. First, recall the dot product on R n : if x, y R n, say x = (x 1,...,

More information

EE613 Machine Learning for Engineers. Kernel methods Support Vector Machines. jean-marc odobez 2015

EE613 Machine Learning for Engineers. Kernel methods Support Vector Machines. jean-marc odobez 2015 EE613 Machine Learning for Engineers Kernel methods Support Vector Machines jean-marc odobez 2015 overview Kernel methods introductions and main elements defining kernels Kernelization of k-nn, K-Means,

More information

PCA & ICA. CE-717: Machine Learning Sharif University of Technology Spring Soleymani

PCA & ICA. CE-717: Machine Learning Sharif University of Technology Spring Soleymani PCA & ICA CE-717: Machine Learning Sharif University of Technology Spring 2015 Soleymani Dimensionality Reduction: Feature Selection vs. Feature Extraction Feature selection Select a subset of a given

More information

Lecture 13. Principal Component Analysis. Brett Bernstein. April 25, CDS at NYU. Brett Bernstein (CDS at NYU) Lecture 13 April 25, / 26

Lecture 13. Principal Component Analysis. Brett Bernstein. April 25, CDS at NYU. Brett Bernstein (CDS at NYU) Lecture 13 April 25, / 26 Principal Component Analysis Brett Bernstein CDS at NYU April 25, 2017 Brett Bernstein (CDS at NYU) Lecture 13 April 25, 2017 1 / 26 Initial Question Intro Question Question Let S R n n be symmetric. 1

More information

Linear Regression and Its Applications

Linear Regression and Its Applications Linear Regression and Its Applications Predrag Radivojac October 13, 2014 Given a data set D = {(x i, y i )} n the objective is to learn the relationship between features and the target. We usually start

More information

Simulation study on using moment functions for sufficient dimension reduction

Simulation study on using moment functions for sufficient dimension reduction Michigan Technological University Digital Commons @ Michigan Tech Dissertations, Master's Theses and Master's Reports - Open Dissertations, Master's Theses and Master's Reports 2012 Simulation study on

More information

Statistical Machine Learning

Statistical Machine Learning Statistical Machine Learning Christoph Lampert Spring Semester 2015/2016 // Lecture 12 1 / 36 Unsupervised Learning Dimensionality Reduction 2 / 36 Dimensionality Reduction Given: data X = {x 1,..., x

More information

CLOSE-TO-CLEAN REGULARIZATION RELATES

CLOSE-TO-CLEAN REGULARIZATION RELATES Worshop trac - ICLR 016 CLOSE-TO-CLEAN REGULARIZATION RELATES VIRTUAL ADVERSARIAL TRAINING, LADDER NETWORKS AND OTHERS Mudassar Abbas, Jyri Kivinen, Tapani Raio Department of Computer Science, School of

More information

Nonparametric Inference and the Dark Energy Equation of State

Nonparametric Inference and the Dark Energy Equation of State Nonparametric Inference and the Dark Energy Equation of State Christopher R. Genovese Peter E. Freeman Larry Wasserman Department of Statistics Carnegie Mellon University http://www.stat.cmu.edu/ ~ genovese/

More information

Goodness-of-fit tests for the cure rate in a mixture cure model

Goodness-of-fit tests for the cure rate in a mixture cure model Biometrika (217), 13, 1, pp. 1 7 Printed in Great Britain Advance Access publication on 31 July 216 Goodness-of-fit tests for the cure rate in a mixture cure model BY U.U. MÜLLER Department of Statistics,

More information

4 Bias-Variance for Ridge Regression (24 points)

4 Bias-Variance for Ridge Regression (24 points) Implement Ridge Regression with λ = 0.00001. Plot the Squared Euclidean test error for the following values of k (the dimensions you reduce to): k = {0, 50, 100, 150, 200, 250, 300, 350, 400, 450, 500,

More information

Machine learning comes from Bayesian decision theory in statistics. There we want to minimize the expected value of the loss function.

Machine learning comes from Bayesian decision theory in statistics. There we want to minimize the expected value of the loss function. Bayesian learning: Machine learning comes from Bayesian decision theory in statistics. There we want to minimize the expected value of the loss function. Let y be the true label and y be the predicted

More information

CS281 Section 4: Factor Analysis and PCA

CS281 Section 4: Factor Analysis and PCA CS81 Section 4: Factor Analysis and PCA Scott Linderman At this point we have seen a variety of machine learning models, with a particular emphasis on models for supervised learning. In particular, we

More information

d(x n, x) d(x n, x nk ) + d(x nk, x) where we chose any fixed k > N

d(x n, x) d(x n, x nk ) + d(x nk, x) where we chose any fixed k > N Problem 1. Let f : A R R have the property that for every x A, there exists ɛ > 0 such that f(t) > ɛ if t (x ɛ, x + ɛ) A. If the set A is compact, prove there exists c > 0 such that f(x) > c for all x

More information

Inferring Galaxy Morphology Through Texture Analysis

Inferring Galaxy Morphology Through Texture Analysis Inferring Galaxy Morphology Through Texture Analysis 1 Galaxy Morphology Kinman Au, Christopher Genovese, Andrew Connolly Galaxies are not static objects; they evolve by interacting with the gas, dust

More information

Introduction to Machine Learning. PCA and Spectral Clustering. Introduction to Machine Learning, Slides: Eran Halperin

Introduction to Machine Learning. PCA and Spectral Clustering. Introduction to Machine Learning, Slides: Eran Halperin 1 Introduction to Machine Learning PCA and Spectral Clustering Introduction to Machine Learning, 2013-14 Slides: Eran Halperin Singular Value Decomposition (SVD) The singular value decomposition (SVD)

More information

Lecture 2: CDF and EDF

Lecture 2: CDF and EDF STAT 425: Introduction to Nonparametric Statistics Winter 2018 Instructor: Yen-Chi Chen Lecture 2: CDF and EDF 2.1 CDF: Cumulative Distribution Function For a random variable X, its CDF F () contains all

More information

Learning gradients: prescriptive models

Learning gradients: prescriptive models Department of Statistical Science Institute for Genome Sciences & Policy Department of Computer Science Duke University May 11, 2007 Relevant papers Learning Coordinate Covariances via Gradients. Sayan

More information

Motivating the Covariance Matrix

Motivating the Covariance Matrix Motivating the Covariance Matrix Raúl Rojas Computer Science Department Freie Universität Berlin January 2009 Abstract This note reviews some interesting properties of the covariance matrix and its role

More information

A New Combined Approach for Inference in High-Dimensional Regression Models with Correlated Variables

A New Combined Approach for Inference in High-Dimensional Regression Models with Correlated Variables A New Combined Approach for Inference in High-Dimensional Regression Models with Correlated Variables Niharika Gauraha and Swapan Parui Indian Statistical Institute Abstract. We consider the problem of

More information

Statistical Pattern Recognition

Statistical Pattern Recognition Statistical Pattern Recognition Feature Extraction Hamid R. Rabiee Jafar Muhammadi, Alireza Ghasemi, Payam Siyari Spring 2014 http://ce.sharif.edu/courses/92-93/2/ce725-2/ Agenda Dimensionality Reduction

More information

J. Cwik and J. Koronacki. Institute of Computer Science, Polish Academy of Sciences. to appear in. Computational Statistics and Data Analysis

J. Cwik and J. Koronacki. Institute of Computer Science, Polish Academy of Sciences. to appear in. Computational Statistics and Data Analysis A Combined Adaptive-Mixtures/Plug-In Estimator of Multivariate Probability Densities 1 J. Cwik and J. Koronacki Institute of Computer Science, Polish Academy of Sciences Ordona 21, 01-237 Warsaw, Poland

More information

A Program for Data Transformations and Kernel Density Estimation

A Program for Data Transformations and Kernel Density Estimation A Program for Data Transformations and Kernel Density Estimation John G. Manchuk and Clayton V. Deutsch Modeling applications in geostatistics often involve multiple variables that are not multivariate

More information

QUALIFYING EXAMINATION Harvard University Department of Mathematics Tuesday August 31, 2010 (Day 1)

QUALIFYING EXAMINATION Harvard University Department of Mathematics Tuesday August 31, 2010 (Day 1) QUALIFYING EXAMINATION Harvard University Department of Mathematics Tuesday August 31, 21 (Day 1) 1. (CA) Evaluate sin 2 x x 2 dx Solution. Let C be the curve on the complex plane from to +, which is along

More information

Statistical and Learning Techniques in Computer Vision Lecture 1: Random Variables Jens Rittscher and Chuck Stewart

Statistical and Learning Techniques in Computer Vision Lecture 1: Random Variables Jens Rittscher and Chuck Stewart Statistical and Learning Techniques in Computer Vision Lecture 1: Random Variables Jens Rittscher and Chuck Stewart 1 Motivation Imaging is a stochastic process: If we take all the different sources of

More information

Sparse Nonparametric Density Estimation in High Dimensions Using the Rodeo

Sparse Nonparametric Density Estimation in High Dimensions Using the Rodeo Outline in High Dimensions Using the Rodeo Han Liu 1,2 John Lafferty 2,3 Larry Wasserman 1,2 1 Statistics Department, 2 Machine Learning Department, 3 Computer Science Department, Carnegie Mellon University

More information

Variance Function Estimation in Multivariate Nonparametric Regression

Variance Function Estimation in Multivariate Nonparametric Regression Variance Function Estimation in Multivariate Nonparametric Regression T. Tony Cai 1, Michael Levine Lie Wang 1 Abstract Variance function estimation in multivariate nonparametric regression is considered

More information

On variable bandwidth kernel density estimation

On variable bandwidth kernel density estimation JSM 04 - Section on Nonparametric Statistics On variable bandwidth kernel density estimation Janet Nakarmi Hailin Sang Abstract In this paper we study the ideal variable bandwidth kernel estimator introduced

More information

Density Estimation (II)

Density Estimation (II) Density Estimation (II) Yesterday Overview & Issues Histogram Kernel estimators Ideogram Today Further development of optimization Estimating variance and bias Adaptive kernels Multivariate kernel estimation

More information

Learning sets and subspaces: a spectral approach

Learning sets and subspaces: a spectral approach Learning sets and subspaces: a spectral approach Alessandro Rudi DIBRIS, Università di Genova Optimization and dynamical processes in Statistical learning and inverse problems Sept 8-12, 2014 A world of

More information

41903: Introduction to Nonparametrics

41903: Introduction to Nonparametrics 41903: Notes 5 Introduction Nonparametrics fundamentally about fitting flexible models: want model that is flexible enough to accommodate important patterns but not so flexible it overspecializes to specific

More information

Introduction. Chapter 1

Introduction. Chapter 1 Chapter 1 Introduction In this book we will be concerned with supervised learning, which is the problem of learning input-output mappings from empirical data (the training dataset). Depending on the characteristics

More information

O Combining cross-validation and plug-in methods - for kernel density bandwidth selection O

O Combining cross-validation and plug-in methods - for kernel density bandwidth selection O O Combining cross-validation and plug-in methods - for kernel density selection O Carlos Tenreiro CMUC and DMUC, University of Coimbra PhD Program UC UP February 18, 2011 1 Overview The nonparametric problem

More information

Small sample size in high dimensional space - minimum distance based classification.

Small sample size in high dimensional space - minimum distance based classification. Small sample size in high dimensional space - minimum distance based classification. Ewa Skubalska-Rafaj lowicz Institute of Computer Engineering, Automatics and Robotics, Department of Electronics, Wroc

More information

PCA, Kernel PCA, ICA

PCA, Kernel PCA, ICA PCA, Kernel PCA, ICA Learning Representations. Dimensionality Reduction. Maria-Florina Balcan 04/08/2015 Big & High-Dimensional Data High-Dimensions = Lot of Features Document classification Features per

More information

These slides follow closely the (English) course textbook Pattern Recognition and Machine Learning by Christopher Bishop

These slides follow closely the (English) course textbook Pattern Recognition and Machine Learning by Christopher Bishop Music and Machine Learning (IFT68 Winter 8) Prof. Douglas Eck, Université de Montréal These slides follow closely the (English) course textbook Pattern Recognition and Machine Learning by Christopher Bishop

More information

Introduction to Empirical Processes and Semiparametric Inference Lecture 09: Stochastic Convergence, Continued

Introduction to Empirical Processes and Semiparametric Inference Lecture 09: Stochastic Convergence, Continued Introduction to Empirical Processes and Semiparametric Inference Lecture 09: Stochastic Convergence, Continued Michael R. Kosorok, Ph.D. Professor and Chair of Biostatistics Professor of Statistics and

More information

Introduction to Machine Learning

Introduction to Machine Learning 10-701 Introduction to Machine Learning PCA Slides based on 18-661 Fall 2018 PCA Raw data can be Complex, High-dimensional To understand a phenomenon we measure various related quantities If we knew what

More information

Optimal Polynomial Admissible Meshes on the Closure of C 1,1 Bounded Domains

Optimal Polynomial Admissible Meshes on the Closure of C 1,1 Bounded Domains Optimal Polynomial Admissible Meshes on the Closure of C 1,1 Bounded Domains Constructive Theory of Functions Sozopol, June 9-15, 2013 F. Piazzon, joint work with M. Vianello Department of Mathematics.

More information

Exploiting k-nearest Neighbor Information with Many Data

Exploiting k-nearest Neighbor Information with Many Data Exploiting k-nearest Neighbor Information with Many Data 2017 TEST TECHNOLOGY WORKSHOP 2017. 10. 24 (Tue.) Yung-Kyun Noh Robotics Lab., Contents Nonparametric methods for estimating density functions Nearest

More information

Local regression, intrinsic dimension, and nonparametric sparsity

Local regression, intrinsic dimension, and nonparametric sparsity Local regression, intrinsic dimension, and nonparametric sparsity Samory Kpotufe Toyota Technological Institute - Chicago and Max Planck Institute for Intelligent Systems I. Local regression and (local)

More information

Mean-Shift Tracker Computer Vision (Kris Kitani) Carnegie Mellon University

Mean-Shift Tracker Computer Vision (Kris Kitani) Carnegie Mellon University Mean-Shift Tracker 16-385 Computer Vision (Kris Kitani) Carnegie Mellon University Mean Shift Algorithm A mode seeking algorithm Fukunaga & Hostetler (1975) Mean Shift Algorithm A mode seeking algorithm

More information

Analysis Qualifying Exam

Analysis Qualifying Exam Analysis Qualifying Exam Spring 2017 Problem 1: Let f be differentiable on R. Suppose that there exists M > 0 such that f(k) M for each integer k, and f (x) M for all x R. Show that f is bounded, i.e.,

More information

Classifier Complexity and Support Vector Classifiers

Classifier Complexity and Support Vector Classifiers Classifier Complexity and Support Vector Classifiers Feature 2 6 4 2 0 2 4 6 8 RBF kernel 10 10 8 6 4 2 0 2 4 6 Feature 1 David M.J. Tax Pattern Recognition Laboratory Delft University of Technology D.M.J.Tax@tudelft.nl

More information

Adaptive Piecewise Polynomial Estimation via Trend Filtering

Adaptive Piecewise Polynomial Estimation via Trend Filtering Adaptive Piecewise Polynomial Estimation via Trend Filtering Liubo Li, ShanShan Tu The Ohio State University li.2201@osu.edu, tu.162@osu.edu October 1, 2015 Liubo Li, ShanShan Tu (OSU) Trend Filtering

More information

Heat kernels of some Schrödinger operators

Heat kernels of some Schrödinger operators Heat kernels of some Schrödinger operators Alexander Grigor yan Tsinghua University 28 September 2016 Consider an elliptic Schrödinger operator H = Δ + Φ, where Δ = n 2 i=1 is the Laplace operator in R

More information

A Note on Hilbertian Elliptically Contoured Distributions

A Note on Hilbertian Elliptically Contoured Distributions A Note on Hilbertian Elliptically Contoured Distributions Yehua Li Department of Statistics, University of Georgia, Athens, GA 30602, USA Abstract. In this paper, we discuss elliptically contoured distribution

More information

CS 534: Computer Vision Segmentation III Statistical Nonparametric Methods for Segmentation

CS 534: Computer Vision Segmentation III Statistical Nonparametric Methods for Segmentation CS 534: Computer Vision Segmentation III Statistical Nonparametric Methods for Segmentation Ahmed Elgammal Dept of Computer Science CS 534 Segmentation III- Nonparametric Methods - - 1 Outlines Density

More information

Discussion of High-dimensional autocovariance matrices and optimal linear prediction,

Discussion of High-dimensional autocovariance matrices and optimal linear prediction, Electronic Journal of Statistics Vol. 9 (2015) 1 10 ISSN: 1935-7524 DOI: 10.1214/15-EJS1007 Discussion of High-dimensional autocovariance matrices and optimal linear prediction, Xiaohui Chen University

More information

STAT 200C: High-dimensional Statistics

STAT 200C: High-dimensional Statistics STAT 200C: High-dimensional Statistics Arash A. Amini May 30, 2018 1 / 57 Table of Contents 1 Sparse linear models Basis Pursuit and restricted null space property Sufficient conditions for RNS 2 / 57

More information

Sliced Inverse Regression

Sliced Inverse Regression Sliced Inverse Regression Ge Zhao gzz13@psu.edu Department of Statistics The Pennsylvania State University Outline Background of Sliced Inverse Regression (SIR) Dimension Reduction Definition of SIR Inversed

More information

Grassmann Averages for Scalable Robust PCA Supplementary Material

Grassmann Averages for Scalable Robust PCA Supplementary Material Grassmann Averages for Scalable Robust PCA Supplementary Material Søren Hauberg DTU Compute Lyngby, Denmark sohau@dtu.dk Aasa Feragen DIKU and MPIs Tübingen Denmark and Germany aasa@diku.dk Michael J.

More information

Robustness of Principal Components

Robustness of Principal Components PCA for Clustering An objective of principal components analysis is to identify linear combinations of the original variables that are useful in accounting for the variation in those original variables.

More information

21.2 Example 1 : Non-parametric regression in Mean Integrated Square Error Density Estimation (L 2 2 risk)

21.2 Example 1 : Non-parametric regression in Mean Integrated Square Error Density Estimation (L 2 2 risk) 10-704: Information Processing and Learning Spring 2015 Lecture 21: Examples of Lower Bounds and Assouad s Method Lecturer: Akshay Krishnamurthy Scribes: Soumya Batra Note: LaTeX template courtesy of UC

More information

Unsupervised Learning

Unsupervised Learning 2018 EE448, Big Data Mining, Lecture 7 Unsupervised Learning Weinan Zhang Shanghai Jiao Tong University http://wnzhang.net http://wnzhang.net/teaching/ee448/index.html ML Problem Setting First build and

More information

Defect Detection using Nonparametric Regression

Defect Detection using Nonparametric Regression Defect Detection using Nonparametric Regression Siana Halim Industrial Engineering Department-Petra Christian University Siwalankerto 121-131 Surabaya- Indonesia halim@petra.ac.id Abstract: To compare

More information

Convergence of Eigenspaces in Kernel Principal Component Analysis

Convergence of Eigenspaces in Kernel Principal Component Analysis Convergence of Eigenspaces in Kernel Principal Component Analysis Shixin Wang Advanced machine learning April 19, 2016 Shixin Wang Convergence of Eigenspaces April 19, 2016 1 / 18 Outline 1 Motivation

More information

On the simplest expression of the perturbed Moore Penrose metric generalized inverse

On the simplest expression of the perturbed Moore Penrose metric generalized inverse Annals of the University of Bucharest (mathematical series) 4 (LXII) (2013), 433 446 On the simplest expression of the perturbed Moore Penrose metric generalized inverse Jianbing Cao and Yifeng Xue Communicated

More information

SHARP BOUNDARY TRACE INEQUALITIES. 1. Introduction

SHARP BOUNDARY TRACE INEQUALITIES. 1. Introduction SHARP BOUNDARY TRACE INEQUALITIES GILES AUCHMUTY Abstract. This paper describes sharp inequalities for the trace of Sobolev functions on the boundary of a bounded region R N. The inequalities bound (semi-)norms

More information

Pattern Recognition and Machine Learning. Bishop Chapter 2: Probability Distributions

Pattern Recognition and Machine Learning. Bishop Chapter 2: Probability Distributions Pattern Recognition and Machine Learning Chapter 2: Probability Distributions Cécile Amblard Alex Kläser Jakob Verbeek October 11, 27 Probability Distributions: General Density Estimation: given a finite

More information

PRINCIPAL COMPONENTS ANALYSIS

PRINCIPAL COMPONENTS ANALYSIS 121 CHAPTER 11 PRINCIPAL COMPONENTS ANALYSIS We now have the tools necessary to discuss one of the most important concepts in mathematical statistics: Principal Components Analysis (PCA). PCA involves

More information

Chapter 9. Non-Parametric Density Function Estimation

Chapter 9. Non-Parametric Density Function Estimation 9-1 Density Estimation Version 1.2 Chapter 9 Non-Parametric Density Function Estimation 9.1. Introduction We have discussed several estimation techniques: method of moments, maximum likelihood, and least

More information

Recursive Generalized Eigendecomposition for Independent Component Analysis

Recursive Generalized Eigendecomposition for Independent Component Analysis Recursive Generalized Eigendecomposition for Independent Component Analysis Umut Ozertem 1, Deniz Erdogmus 1,, ian Lan 1 CSEE Department, OGI, Oregon Health & Science University, Portland, OR, USA. {ozertemu,deniz}@csee.ogi.edu

More information

Nonparametric Inference in Cosmology and Astrophysics: Biases and Variants

Nonparametric Inference in Cosmology and Astrophysics: Biases and Variants Nonparametric Inference in Cosmology and Astrophysics: Biases and Variants Christopher R. Genovese Department of Statistics Carnegie Mellon University http://www.stat.cmu.edu/ ~ genovese/ Collaborators:

More information

Aggregation of Spectral Density Estimators

Aggregation of Spectral Density Estimators Aggregation of Spectral Density Estimators Christopher Chang Department of Mathematics, University of California, San Diego, La Jolla, CA 92093-0112, USA; chrchang@alumni.caltech.edu Dimitris Politis Department

More information

Learning Theory. Ingo Steinwart University of Stuttgart. September 4, 2013

Learning Theory. Ingo Steinwart University of Stuttgart. September 4, 2013 Learning Theory Ingo Steinwart University of Stuttgart September 4, 2013 Ingo Steinwart University of Stuttgart () Learning Theory September 4, 2013 1 / 62 Basics Informal Introduction Informal Description

More information

Algorithms for Nonsmooth Optimization

Algorithms for Nonsmooth Optimization Algorithms for Nonsmooth Optimization Frank E. Curtis, Lehigh University presented at Center for Optimization and Statistical Learning, Northwestern University 2 March 2018 Algorithms for Nonsmooth Optimization

More information

Boundary Correction Methods in Kernel Density Estimation Tom Alberts C o u(r)a n (t) Institute joint work with R.J. Karunamuni University of Alberta

Boundary Correction Methods in Kernel Density Estimation Tom Alberts C o u(r)a n (t) Institute joint work with R.J. Karunamuni University of Alberta Boundary Correction Methods in Kernel Density Estimation Tom Alberts C o u(r)a n (t) Institute joint work with R.J. Karunamuni University of Alberta November 29, 2007 Outline Overview of Kernel Density

More information

Kernel-Based Contrast Functions for Sufficient Dimension Reduction

Kernel-Based Contrast Functions for Sufficient Dimension Reduction Kernel-Based Contrast Functions for Sufficient Dimension Reduction Michael I. Jordan Departments of Statistics and EECS University of California, Berkeley Joint work with Kenji Fukumizu and Francis Bach

More information

Problem set 7 Math 207A, Fall 2011 Solutions

Problem set 7 Math 207A, Fall 2011 Solutions Problem set 7 Math 207A, Fall 2011 s 1. Classify the equilibrium (x, y) = (0, 0) of the system x t = x, y t = y + x 2. Is the equilibrium hyperbolic? Find an equation for the trajectories in (x, y)- phase

More information

Self-Tuning Semantic Image Segmentation

Self-Tuning Semantic Image Segmentation Self-Tuning Semantic Image Segmentation Sergey Milyaev 1,2, Olga Barinova 2 1 Voronezh State University sergey.milyaev@gmail.com 2 Lomonosov Moscow State University obarinova@graphics.cs.msu.su Abstract.

More information

A Modified Incremental Principal Component Analysis for On-Line Learning of Feature Space and Classifier

A Modified Incremental Principal Component Analysis for On-Line Learning of Feature Space and Classifier A Modified Incremental Principal Component Analysis for On-Line Learning of Feature Space and Classifier Seiichi Ozawa 1, Shaoning Pang 2, and Nikola Kasabov 2 1 Graduate School of Science and Technology,

More information