New ways of dimension reduction? Cutting data sets into small pieces
Roman Vershynin, University of Michigan, Department of Mathematics
Statistical Machine Learning, Ann Arbor, June 5, 2012
Joint work with Yaniv Plan, University of Michigan, Mathematics
Plan:
1. Dimension reduction: projecting vs. cutting data sets
2. Data recovery
3. Applications in sparse recovery
4. Applications in sparse binomial regression
Dimension reduction
Data set: $K \subset \mathbb{R}^n$. Dimension reduction $\Phi : \mathbb{R}^n \to \mathbb{R}^m$: (1) $m \ll n$; (2) $\Phi$ preserves the structure (geometry) of $K$.
Classical way: $\Phi$ linear; orthogonal projection; random.
Johnson-Lindenstrauss Lemma. Let $K \subset \mathbb{R}^n$ be a finite set. Consider an orthogonal projection $\Phi$ onto a random $m$-dimensional subspace in $\mathbb{R}^n$. If $m \gtrsim_{\delta} \log|K|$, then with high probability,
$\|\Phi(x) - \Phi(y)\|_2 \approx_{\delta} \|x - y\|_2$ for all $x, y \in K$.
So $\Phi$ is an almost isometric embedding $K \to \mathbb{R}^m$.
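A minimal numpy sketch of the JL projection (an illustration, not part of the talk; the dimensions and the rescaling by $\sqrt{n/m}$, which compensates for the shrinkage of an orthogonal projection, are choices made here):

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, N = 1000, 200, 50          # ambient dim, target dim, number of points

# Finite data set K: N random points in R^n.
K = rng.normal(size=(N, n))

# Orthogonal projection onto a random m-dimensional subspace:
# take an n x m Gaussian matrix and orthonormalize its columns.
Q, _ = np.linalg.qr(rng.normal(size=(n, m)))   # columns = orthonormal basis
proj = K @ Q                                   # coordinates of the projections

# Compare pairwise distances before and after projection.
i, j = np.triu_indices(N, k=1)
orig = np.linalg.norm(K[i] - K[j], axis=1)
new  = np.linalg.norm(proj[i] - proj[j], axis=1) * np.sqrt(n / m)
print("max relative distortion:", np.max(np.abs(new / orig - 1)))
```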
An alternative way of dimension reduction
Cut rather than project. Use $m$ hyperplanes. For a pair $x, y \in K$, count $d_H(x, y)$ = number of hyperplanes separating $x, y$. For simplicity, $K \subset S^{n-1}$.
Lemma (Cutting finite sets). Let $K \subset S^{n-1}$ be a finite set, and consider $m \gtrsim_{\delta} \log|K|$ independent random hyperplanes. Then with high probability,
$\frac{1}{m} d_H(x, y) = \frac{1}{\pi}\|x - y\|_2 \pm \delta$ for all $x, y \in K$.
A similar result holds for $K \subset \mathbb{R}^n$. Dimension reduction of $K$?
Recap: $d_H(x, y)$ = number of separating hyperplanes; $\frac{1}{m} d_H(x, y) \approx \frac{1}{\pi}\|x - y\|_2$.
Dimension reduction of $K$: for $x \in K$, record the orientations of $x$ with respect to the $m$ hyperplanes:
$\Phi(x)$ = the vector of orientations $\in \{-1, 1\}^m$.
$\{-1, 1\}^m$ is the Hamming cube; the Hamming distance satisfies $\mathrm{dist}(\Phi(x), \Phi(y)) = d_H(x, y)$.
Conclusion of the Cutting Lemma: the map $\Phi : K \to \{-1, 1\}^m$, $m \sim \log|K|$, is an almost isometric embedding.
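A small numpy sketch of the cut embedding (illustrative only; it uses Gaussian normals for the random hyperplanes, the construction made explicit on the linear-algebraic slide below, and compares the fraction of separating hyperplanes with the exact value $\arccos\langle x, y\rangle / \pi$, which for nearby points matches the $\frac{1}{\pi}\|x - y\|_2$ form above):

```python
import numpy as np

rng = np.random.default_rng(1)
n, m, N = 500, 2000, 20

# Finite set K on the unit sphere S^{n-1}.
K = rng.normal(size=(N, n))
K /= np.linalg.norm(K, axis=1, keepdims=True)

# m random hyperplanes through the origin, encoded by Gaussian normals a_i.
A = rng.normal(size=(m, n))

# Cut embedding: Phi(x) = vector of orientations sign(<a_i, x>) in {-1, 1}^m.
Phi = np.sign(K @ A.T)

# Fraction of separating hyperplanes vs. normalized angle between points.
# For a Gaussian normal, P(sign<a,x> != sign<a,y>) = arccos(<x,y>) / pi.
i, j = np.triu_indices(N, k=1)
hamming = np.mean(Phi[i] != Phi[j], axis=1)            # (1/m) d_H(x, y)
angle   = np.arccos(np.clip(np.sum(K[i] * K[j], axis=1), -1, 1)) / np.pi
print("max deviation:", np.max(np.abs(hamming - angle)))
```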
$\Phi : K \to \{-1, 1\}^m$, $m \sim \log|K|$, is an almost isometric embedding.
Target dimension $m \sim \log|K|$, the same as a binary encoding requires. Optimal.
But $\Phi$ is simpler than binary encoding: it is non-iterative and close to linear.
What is the cut embedding good for? First extend to general sets $K \subset \mathbb{R}^n$, usually infinite. (The JL lemma is available for infinite sets [Klartag-Mendelson 05, Schechtman 06].)
What is the size of $K$? The width of $K$ in the direction of $\eta \in \mathbb{R}^n$:
$\sup_{u \in K} \langle \eta, u \rangle - \inf_{u \in K} \langle \eta, u \rangle = \sup_{x \in K - K} \langle \eta, x \rangle$.
Mean width of $K$: average over all directions,
$w(K) = \mathbb{E} \sup_{x \in K - K} \langle g, x \rangle$ where $g \sim N(0, I_n)$.
Mean width: $w(K) = \mathbb{E} \sup_{x \in K - K} \langle g, x \rangle$ where $g \sim N(0, I_n)$.
Observation. Let $K \subset S^{n-1}$.
1. If $K$ is finite, then $w(K) \lesssim \sqrt{\log|K|}$.
2. If $\dim K = k$, then $w(K) \sim \sqrt{k}$.
So $w(K)^2$ = effective dimension of $K$.
Theorem (Cutting general sets). Let $K \subset S^{n-1}$, and consider $m \gtrsim_{\delta} w(K)^2$ independent random hyperplanes. Then with high probability,
$\frac{1}{m} d_H(x, y) = \frac{1}{\pi}\|x - y\|_2 \pm \delta$ for all $x, y \in K$.
Conclusion: $\Phi : K \to \{-1, 1\}^m$ is an almost isometric embedding.
Corollary. If $K$ is finite, then $m \sim \log|K|$, just like in JL.
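A quick Monte Carlo estimate of the mean width for a finite set (an illustration with arbitrary sizes, not from the talk), consistent with the observation $w(K) \lesssim \sqrt{\log|K|}$:

```python
import numpy as np

rng = np.random.default_rng(2)
n, N, trials = 400, 1000, 2000

# Finite K on the sphere: N random unit vectors.
K = rng.normal(size=(N, n))
K /= np.linalg.norm(K, axis=1, keepdims=True)

# Monte Carlo estimate of w(K) = E sup_{x in K-K} <g, x>
# = E [ max_{u in K} <g, u> - min_{u in K} <g, u> ] with g ~ N(0, I_n).
G = rng.normal(size=(trials, n))
inner = G @ K.T                              # <g, u> for all draws g and points u
w_est = np.mean(inner.max(axis=1) - inner.min(axis=1))

print("estimated w(K):", w_est)
print("sqrt(log|K|)  :", np.sqrt(np.log(N)))   # w(K) is of this order for finite K
```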
Theorem (Cutting general sets). Let $K \subset S^{n-1}$, and consider $m \gtrsim_{\delta} w(K)^2$ independent random hyperplanes. Then with high probability,
$\frac{1}{m} d_H(x, y) = \frac{1}{\pi}\|x - y\|_2 \pm \delta$ for all $x, y \in K$.
An immediate geometric consequence:
Corollary (Cutting data sets into small pieces). These hyperplanes cut $K$ into pieces of diameter $\lesssim \delta$.
Proof. If $x, y \in K$ are in the same cell, then $d_H(x, y) = 0$, thus by the Theorem $\frac{1}{\pi}\|x - y\|_2 \le \delta$. Q.E.D.
$\Phi : K \to \{-1, 1\}^m$, $m \sim w(K)^2$, is an almost isometric embedding.
Data recovery
Recovery Problem. Estimate $x \in K$ from $y = \Phi(x) \in \{-1, 1\}^m$.
Suppose $K$ is convex. (If not, pass to $\mathrm{conv}(K)$; the mean width won't change.)
Corollary (Recovery). One can accurately estimate $x \in K$ from $y = \Phi(x)$ by solving the convex feasibility program:
find $x' \in K$ subject to $y = \Phi(x')$.
Indeed, the solution satisfies $\|x' - x\|_2 \lesssim \delta$.
Proof. The feasible set of this program is some cell of $K$. Both $x$ and $x'$ belong to this cell. But as we know, all cells have diameter $\lesssim \delta$. Q.E.D.
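To see why this feasibility program is convex, the constraint $y = \Phi(x')$ can be written as linear inequalities in $x'$ (a sketch of the reformulation, writing $a_i$ for the hyperplane normals, which are made explicit two slides below):

```latex
% Each bit y_i fixes the side of the i-th hyperplane on which x' must lie:
% a feasibility problem over the convex set K with m linear constraints.
\begin{align*}
\text{find } x' \in K
\quad \text{subject to} \quad
y_i \, \langle a_i, x' \rangle \ \ge \ 0,
\qquad i = 1, \dots, m.
\end{align*}
```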
Corollary (Recovery). One can estimate $x \in K$ from $y = \Phi(x) \in \{-1, 1\}^m$ by solving the convex feasibility program: find $x' \in K$ subject to $y = \Phi(x')$. Indeed, $\|x' - x\|_2 \lesssim \delta$.
Robust? No: flip a few bits of $y$ and the program becomes infeasible.
Is robust recovery possible? Yes. Change the viewpoint a little:
Linear algebraic view of cutting:
Oriented hyperplane $\leftrightarrow$ normal $a \in \mathbb{R}^n$. Random hyperplane $\leftrightarrow$ random normal $a \sim N(0, I_n)$.
$m$ random hyperplanes $\leftrightarrow$ $m \times n$ Gaussian random matrix $A$ of normals, with rows $a_1, \dots, a_m$.
Vector of orientations of $x \in \mathbb{R}^n$: $\Phi(x) = (\operatorname{sign}\langle a_i, x \rangle)_{i=1}^m = \operatorname{sign}(Ax)$.
$A$ is an $m \times n$ Gaussian matrix.
Robust Recovery Problem. Estimate $x \in K$ from $y = \operatorname{sign}(Ax) \in \{-1, 1\}^m$ after some proportion of the bits of $y$ are corrupted (flipped).
Answer: maximize correlation with the data:
max $\langle Ax', y \rangle$ subject to $x' \in K$.
Theorem (Robust recovery). Let $x \in K$ and $y = \operatorname{sign}(Ax)$. We corrupt $\tau m$ bits of $y$, getting $\tilde{y}$. One can still accurately estimate $x$ from $\tilde{y}$ by solving the convex program above (with $\tilde{y}$). Indeed, the solution satisfies $\|x' - x\|_2 \lesssim \delta + \tau \log(1/\tau)$.
The proof is based on the full power of the Cutting Theorem.
Applications in sparse recovery
Specialize to sparse $x$: few non-zeros, $\|x\|_0 \le s$.
$S_{n,s} = \{x \in \mathbb{R}^n : \|x\|_2 \le 1, \ \|x\|_0 \le s\}$. But $S_{n,s}$ is not convex. Convexify:
$\mathrm{conv}(S_{n,s}) \subset K_{n,s} = \{x \in \mathbb{R}^n : \|x\|_2 \le 1, \ \|x\|_1 \le \sqrt{s}\}$.
$K_{n,s}$ = {approximately sparse vectors}. $w(K_{n,s}) \sim \sqrt{s \log(n/s)}$.
Thus we have dimension reduction $K \to \{-1, 1\}^m$ with $m \sim w(K)^2 \sim s \log(n/s)$.
Note: $m$ is linear in the sparsity $s$.
Approximately sparse: $K_{n,s} = \{x \in \mathbb{R}^n : \|x\|_2 \le 1, \ \|x\|_1 \le \sqrt{s}\}$; $m \sim w(K)^2 \sim s \log(n/s)$.
Specialize the Robust Recovery Theorem to the sparse case:
Corollary (Sparse recovery). Let $x$ be approximately sparse, $x \in K_{n,s}$, and let $y = \operatorname{sign}(Ax) \in \{-1, 1\}^m$ where $m \gtrsim_{\delta} s \log(n/s)$. We can estimate $x$ from $y$ by solving the convex program
max $\langle Ax', y \rangle$ subject to $x' \in K_{n,s}$.
Indeed, the solution satisfies $\|x' - x\|_2 \lesssim \delta$.
The recovery is robust as before (i.e. one can flip bits of $y$).
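A minimal numpy sketch of this program (not the talk's implementation; the constants, the corruption level, and the helper name argmax_over_Kns are choices made here). Since the objective $\langle Ax', y\rangle = \langle A^{\top}y, x'\rangle$ is linear, its maximizer over $K_{n,s}$ is a soft-thresholded, $\ell_2$-normalized copy of $A^{\top}y$, with the threshold found by bisection:

```python
import numpy as np

def argmax_over_Kns(v, s, tol=1e-10):
    """Maximize <v, x> over K_{n,s} = {x : ||x||_2 <= 1, ||x||_1 <= sqrt(s)}.

    The maximizer is x = S_t(v) / ||S_t(v)||_2, where S_t is soft thresholding
    and t >= 0 is the smallest threshold making ||x||_1 <= sqrt(s)
    (t = 0 if no thresholding is needed); t is found by bisection.
    """
    def candidate(t):
        z = np.sign(v) * np.maximum(np.abs(v) - t, 0.0)   # soft thresholding
        norm = np.linalg.norm(z)
        return z / norm if norm > 0 else z

    x = candidate(0.0)
    if np.sum(np.abs(x)) <= np.sqrt(s) + tol:
        return x
    lo, hi = 0.0, float(np.max(np.abs(v)))
    for _ in range(100):                                   # bisection on the threshold
        mid = (lo + hi) / 2
        if np.sum(np.abs(candidate(mid))) > np.sqrt(s):
            lo = mid
        else:
            hi = mid
    return candidate(hi)

# Demo: one-bit measurements of an s-sparse unit vector, with a few bits flipped.
rng = np.random.default_rng(3)
n, s, m = 1000, 10, 2000                # m is of order s*log(n/s), generous constant

x = np.zeros(n)
x[:s] = rng.normal(size=s)
x /= np.linalg.norm(x)

A = rng.normal(size=(m, n))
y = np.sign(A @ x)
flip = rng.choice(m, size=m // 100, replace=False)
y[flip] *= -1                           # corrupt about 1% of the bits

x_hat = argmax_over_Kns(A.T @ y, s)     # maximize <Ax', y> = <A^T y, x'> over K_{n,s}
print("recovery error:", np.linalg.norm(x_hat - x))
```

The soft-threshold form follows from the optimality conditions for maximizing a linear function over the intersection of the $\ell_1$ and $\ell_2$ balls; the bisection works because the $\ell_1$ norm of the normalized candidate decreases as the threshold grows.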
Single-bit compressed sensing
Traditional compressed sensing: recover an $s$-sparse signal $x \in \mathbb{R}^n$ from $m$ linear measurements given by $y = Ax \in \mathbb{R}^m$. Available results: recovery by convex programming with $m \sim s \log(n/s)$.
Single-bit compressed sensing: recover an $s$-sparse signal $x \in \mathbb{R}^n$ from $m$ single-bit measurements given by $y = \operatorname{sign}(Ax) \in \{-1, 1\}^m$. An extreme form of measurement quantization (A/D conversion).
[Boufounos-Baraniuk 08] formulated single-bit CS: connections with embeddings into the Hamming cube, algorithms. Further work: [Gupta-Nowak-Recht 10, Jacques-Laska-Boufounos-Baraniuk 11].
No tractable algorithms with guarantees were known (unless $x$ has constant dynamic range, or for adaptive measurements).
Present work: robust sparse recovery via convex programming.
Applications in binomial regression
Our model of $m$ one-bit measurements was $y_i = \operatorname{sign}\langle a_i, x \rangle$, $i = 1, \dots, m$, with $a_i \sim N(0, I_n)$.
More general stochastic model: $y_i = \pm 1$ random variables, independent given $\{a_i\}$, with
$\mathbb{E}[y_i \mid a_i] = \theta(\langle a_i, x \rangle)$, $i = 1, \dots, m$.
Here $\theta(u)$ is some function satisfying the correlation assumption:
$\mathbb{E}\, \theta(g) g =: \lambda > 0$, where $g \sim N(0, 1)$.
Reason: this ensures positive correlation with the data. Since $\langle a_i, x \rangle \sim N(0, 1)$ (as $\|x\|_2 = 1$), $\mathbb{E}\, y_i \langle a_i, x \rangle = \mathbb{E}\, \theta(g) g = \lambda$.
Model: $\mathbb{E}[y_i \mid a_i] = \theta(\langle a_i, x \rangle)$, $i = 1, \dots, m$. Correlation assumption: $\mathbb{E}\, \theta(g) g =: \lambda > 0$.
This is the generalized linear model (GLM) with link function $\theta^{-1}$.
Example. $\theta(z) = \tanh(z/2)$: logistic regression,
$\mathbb{P}\{y_i = 1\} = f(\langle a_i, x \rangle)$, $f(z) = \frac{e^z}{e^z + 1}$.
Statistical notation: $x$ = unknown coefficient vector ($\beta$), $y_i$ = binary response variables, $a_i$ = independent variables ($x_i$);
$\mathbb{P}\{y_i = 1\} = f\Big(\sum_{j=1}^{n} \beta_j x_{ij}\Big)$, $i = 1, \dots, m$.
Recent work on sparse logistic regression: [Negahban-Ravikumar-Wainwright-Yu 11, Bunea 08, Van De Geer 08, Bach 10, Ravikumar-Wainwright-Lafferty 10, Meier-Van De Geer-Bühlmann 08, Kakade-Shamir-Sridharan-Tewari 11].
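A short simulation of the logistic instance (illustrative sizes, not from the talk), generating $y_i$ from $\mathbb{P}\{y_i = 1\} = f(\langle a_i, x\rangle)$ and checking the correlation assumption $\mathbb{E}\, y_i \langle a_i, x\rangle = \lambda > 0$ empirically:

```python
import numpy as np

rng = np.random.default_rng(4)
n, m = 1000, 5000

# Unit-norm coefficient vector x and Gaussian design a_i ~ N(0, I_n).
x = rng.normal(size=n)
x /= np.linalg.norm(x)
A = rng.normal(size=(m, n))
z = A @ x                                    # <a_i, x> ~ N(0, 1)

# Logistic model: P{y_i = 1} = f(<a_i, x>) with f(z) = e^z / (e^z + 1),
# so E[y_i | a_i] = 2 f(z) - 1 = tanh(z / 2) = theta(<a_i, x>).
p = 1.0 / (1.0 + np.exp(-z))
y = np.where(rng.random(m) < p, 1.0, -1.0)

# Empirical check of the correlation assumption E theta(g) g = lambda > 0:
lam_hat = np.mean(y * z)                     # estimates E[ y_i <a_i, x> ] = lambda
print("empirical lambda:", lam_hat)
```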
GLM: $\mathbb{E}[y_i \mid a_i] = \theta(\langle a_i, x \rangle)$, $i = 1, \dots, m$. Correlation assumption: $\mathbb{E}\, \theta(g) g =: \lambda > 0$.
Theorem (Sparse binomial regression). Suppose we have a GLM whose coefficient vector $x \in \mathbb{R}^n$, $\|x\|_2 = 1$, is approximately $s$-sparse, $x \in K_{n,s}$. If the sample size is $m \gtrsim_{\delta} s \log(n/s)$, then we can estimate $x$ by solving the convex program
max $\sum_{i=1}^{m} y_i \langle a_i, x' \rangle$ subject to $x' \in K_{n,s}$.
Indeed, the solution satisfies $\|x' - x\|_2 \lesssim \delta / \lambda$ w.h.p.
In statistics notation, the sample size is $n \sim s \log(p/s)$, so $n \ll p$ is allowed.
New, unusual feature: knowledge of the link function $\theta$ is not needed (unlike in maximum-likelihood approaches). The form of the GLM may be unknown; the estimator is non-parametric.
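A one-line computation (a sketch of the intuition, using only the correlation assumption and $\|x\|_2 = 1$) for why the program favors $x$ even though $\theta$ is unknown: in expectation the objective is proportional to $\langle x, x' \rangle$.

```latex
% Decompose <a_i, x'> along x; this is valid since <a_i, x> ~ N(0,1) when ||x||_2 = 1
% and the remaining component of a_i is independent of <a_i, x> with mean zero:
%   <a_i, x'> = <x, x'> <a_i, x> + (term independent of <a_i, x>).
\begin{align*}
\mathbb{E}\, y_i \langle a_i, x' \rangle
  \;=\; \mathbb{E}\,\theta(\langle a_i, x\rangle)\,\langle a_i, x'\rangle
  \;=\; \langle x, x' \rangle\, \mathbb{E}\,\theta(g)\, g
  \;=\; \lambda \,\langle x, x' \rangle .
\end{align*}
% Hence the expected objective is m * lambda * <x, x'>, maximized over unit vectors
% x' in K_{n,s} at x' = x; the theorem controls the deviations from the expectation.
```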
Summary:
JL lemma: dimension reduction $K \to \mathbb{R}^m$ by projecting $K$ onto an $m$-dimensional subspace.
Alternative way: dimension reduction $K \to \{-1, 1\}^m$ by cutting $K$ into small pieces with $m$ hyperplanes.
Dimension reduction map: $y = \operatorname{sign}(Ax)$, where $A$ is an $m \times n$ random Gaussian matrix.
Target dimension $m \sim w(K)^2$, the effective dimension of $K$.
If $K$ = {approximately $s$-sparse vectors}, then $m \sim s \log(n/s)$.
One can accurately and robustly estimate $x$ from $y$ by a convex program.
More generally, one can accurately estimate a sparse solution to a GLM $\mathbb{E}\, y = \theta(Ax)$, without even knowing the link function $\theta$.