New ways of dimension reduction? Cutting data sets into small pieces

Transcription:

New ways of dimension reduction? Cutting data sets into small pieces
Roman Vershynin, University of Michigan, Department of Mathematics
Statistical Machine Learning, Ann Arbor, June 5, 2012
Joint work with Yaniv Plan, University of Michigan, Mathematics

Plan:
1. Dimension reduction: projecting vs. cutting data sets
2. Data recovery
3. Applications in sparse recovery
4. Applications in sparse binomial regression

Dimension reduction
Data set: K ⊂ R^n. Dimension reduction is a map Φ : R^n → R^m with (1) m ≪ n; (2) Φ preserves the structure (geometry) of K.
Classical way: Φ linear; an orthogonal projection; random.
Johnson-Lindenstrauss Lemma. Let K ⊂ R^n be a finite set. Consider an orthogonal projection Φ onto a random m-dimensional subspace of R^n. If m ≥ C(δ) log|K|, then with high probability,
‖Φ(x) − Φ(y)‖_2 = (1 ± δ) ‖x − y‖_2 for all x, y ∈ K
(after rescaling Φ by √(n/m)); here and below C(δ) denotes a constant depending only on δ. Thus Φ is an almost isometric embedding K → R^m.
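A minimal numerical sketch of this, assuming NumPy (the talk itself includes no code): project a finite set onto a random m-dimensional subspace, rescale by √(n/m), and compare a pairwise distance before and after.

```python
# Minimal JL-style sketch (illustration; not from the talk). Assumes NumPy.
import numpy as np

rng = np.random.default_rng(0)
n, m, N = 1000, 200, 50                       # ambient dim, target dim, |K|
K = rng.standard_normal((N, n))               # a finite data set K in R^n

# Orthonormal basis Q of a random m-dimensional subspace (n x m).
Q, _ = np.linalg.qr(rng.standard_normal((n, m)))

# Project onto the subspace and rescale by sqrt(n/m) so that Phi is
# approximately norm-preserving on K.
Phi_K = np.sqrt(n / m) * (K @ Q)

i, j = 3, 7
print(np.linalg.norm(K[i] - K[j]))            # original distance
print(np.linalg.norm(Phi_K[i] - Phi_K[j]))    # distance after projection
```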

An alternative way of dimension reduction
Cut rather than project. Use m hyperplanes. For a pair x, y ∈ K, count d_H(x, y) = number of hyperplanes separating x and y. For simplicity, K ⊂ S^{n−1}.
Lemma (Cutting finite sets). Let K ⊂ S^{n−1} be a finite set, and consider m ≥ C(δ) log|K| independent random hyperplanes. Then with high probability,
(1/m) d_H(x, y) = (1/π) ‖x − y‖_2 ± δ for all x, y ∈ K.
A similar result holds for K ⊂ R^n. Dimension reduction of K?

d_H(x, y) = number of separating hyperplanes; (1/m) d_H(x, y) ≈ (1/π) ‖x − y‖_2.
Dimension reduction of K: for x ∈ K, record the orientations of x with respect to the m hyperplanes:
Φ(x) = the vector of orientations ∈ {−1, 1}^m.
{−1, 1}^m is the Hamming cube; the Hamming distance satisfies dist(Φ(x), Φ(y)) = d_H(x, y).
Conclusion of the Cutting Lemma: the map Φ : K → {−1, 1}^m, m ≍ log|K|, is an almost isometric embedding.
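A small simulation of the Cutting Lemma, again assuming NumPy and not part of the slides: for two nearby points on the sphere, the fraction of separating hyperplanes should be close to (1/π)‖x − y‖_2.

```python
# Sketch of the cut embedding (illustration; not from the talk). Assumes NumPy.
import numpy as np

rng = np.random.default_rng(1)
n, m = 100, 20000                       # ambient dimension, number of hyperplanes

# Two nearby points on the unit sphere S^{n-1}.
x = rng.standard_normal(n)
x /= np.linalg.norm(x)
e = rng.standard_normal(n)
e /= np.linalg.norm(e)
y = x + 0.3 * e
y /= np.linalg.norm(y)

A = rng.standard_normal((m, n))         # rows = random hyperplane normals
Phi_x = np.sign(A @ x)                  # orientation vectors in {-1, 1}^m
Phi_y = np.sign(A @ y)

hamming = np.mean(Phi_x != Phi_y)       # (1/m) * d_H(x, y)
print(hamming)                          # close to the value below, up to delta
print(np.linalg.norm(x - y) / np.pi)    # (1/pi) * ||x - y||_2
```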

Φ : K → {−1, 1}^m, m ≍ log|K|, is an almost isometric embedding. The target dimension m ≍ log|K| is the same as binary encoding requires, so it is optimal. Yet Φ is simpler than binary encoding: it is non-iterative and close to linear.

What is the cut embedding good for? First extend to general sets K ⊂ R^n, usually infinite. (The JL lemma is available for infinite sets [Klartag-Mendelson 05, Schechtman 06].)
What is the size of K? The width of K in the direction of η ∈ R^n:
sup_{u ∈ K} ⟨η, u⟩ − inf_{u ∈ K} ⟨η, u⟩ = sup_{x ∈ K−K} ⟨η, x⟩.
Mean width of K: average over all directions:
w(K) = E sup_{x ∈ K−K} ⟨g, x⟩, where g ∼ N(0, I_n).

Mean width: w(k) = E sup x K K g, x where g N(0, I n). Observation. Let K S n 1. 1. If K is finite, then w(k) log K. 2. If dim K = k, then w(k) k. So w(k) 2 = effective dimension of K. Theorem (Cutting general sets). Let K S n 1, and consider m δ w(k) 2 independent random hyperplanes. Then with high probability, 1 m d H(x, y) = 1 π x y 2 ± δ for all x, y K. Conclusion: Φ : K { 1, 1} m is an almost isometric embedding. Corollary. If K is finite, then m log K, just like in JL.

Theorem (Cutting general sets). Let K ⊂ S^{n−1}, and consider m ≥ C(δ) w(K)² independent random hyperplanes. Then with high probability,
(1/m) d_H(x, y) = (1/π) ‖x − y‖_2 ± δ for all x, y ∈ K.
An immediate geometric consequence:
Corollary (Cutting data sets into small pieces). These hyperplanes cut K into pieces of diameter ≲ δ.
Proof. If x, y ∈ K are in the same cell, then d_H(x, y) = 0, thus by the Theorem (1/π) ‖x − y‖_2 ≤ δ. Q.E.D.

Φ : K { 1, 1} m, m w(k) 2 is an almost isometric embedding. Data recovery Recovery Problem. Estimate x K from y = Φ(x) { 1, 1} m. Suppose K is convex. (If not, pass to conv(k); the mean width won t change.) Corollary (Recovery). One can accurately estimate x K from y = Φ(x) by solving the convex feasibility program Find x K subject to y = Φ(x ). Indeed, the solution satisfies x x 2 δ. Proof. The feasible set of this program is some cell of K. Both x and x belong to this cell. But as we know, all cells have diameter δ. Q.E.D.

Corollary (Recovery). One can estimate x ∈ K from y = Φ(x) ∈ {−1, 1}^m by solving the convex feasibility program: find x' ∈ K subject to Φ(x') = y. Indeed, ‖x̂ − x‖_2 ≲ δ.
Robust? No: flip a few bits of y and the program becomes infeasible. Is robust recovery possible? Yes. Change the viewpoint a little:

Linear algebraic view of cutting:
Oriented hyperplane ↔ normal a ∈ R^n. Random hyperplane ↔ random normal a ∼ N(0, I_n).
m random hyperplanes ↔ m × n Gaussian random matrix A whose rows are the normals a_1, ..., a_m.
Vector of orientations of x ∈ R^n:
Φ(x) = (sign⟨a_i, x⟩)_{i=1}^m = sign(Ax).

A is an m × n Gaussian matrix.
Robust Recovery Problem. Estimate x ∈ K from y = sign(Ax) ∈ {−1, 1}^m after some proportion of the bits of y are corrupted (flipped).
Answer: maximize correlation with the data:
max ⟨Ax', y⟩ subject to x' ∈ K.
Theorem (Robust recovery). Let x ∈ K and y = sign(Ax). Suppose we corrupt τm bits of y, obtaining ỹ. One can still accurately estimate x from ỹ by solving the convex program above (with ỹ). Indeed, the solution x̂ satisfies ‖x̂ − x‖_2 ≲ δ + τ log(1/τ).
The proof is based on the full power of the Cutting Theorem.

Applications in sparse recovery
Specialize to sparse x: few non-zeros, ‖x‖_0 ≤ s. Let
S_{n,s} = {x ∈ R^n : ‖x‖_2 ≤ 1, ‖x‖_0 ≤ s}.
But S_{n,s} is not convex. Convexify:
conv(S_{n,s}) ⊆ K_{n,s} = {x ∈ R^n : ‖x‖_2 ≤ 1, ‖x‖_1 ≤ √s}.
K_{n,s} = {approximately sparse vectors}, and w(K_{n,s}) ≲ √(s log(n/s)).
Thus we have a dimension reduction K → {−1, 1}^m with m ≍ w(K)² ≍ s log(n/s).
Note: m is linear in the sparsity s.

Approximately sparse vectors: K_{n,s} = {x ∈ R^n : ‖x‖_2 ≤ 1, ‖x‖_1 ≤ √s}; m ≍ w(K_{n,s})² ≍ s log(n/s).
Specialize the Robust Recovery Theorem to the sparse case:
Corollary (Sparse recovery). Let x be approximately sparse, x ∈ K_{n,s}, and let y = sign(Ax) ∈ {−1, 1}^m, where m ≥ C(δ) s log(n/s). We can estimate x from y by solving the convex program
max ⟨Ax', y⟩ subject to x' ∈ K_{n,s}.
Indeed, the solution x̂ satisfies ‖x̂ − x‖_2 ≲ δ. The recovery is robust as before (i.e. one can flip bits of y).
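A hedged sketch of this corollary, assuming the NumPy and CVXPY packages (the talk does not prescribe any particular solver; any convex solver would do). A small fraction of the bits of y is flipped to illustrate the robustness remark.

```python
# Sparse recovery from one-bit measurements by maximizing correlation over
# K_{n,s}. Illustration only; assumes NumPy and CVXPY.
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(3)
n, s, m = 200, 5, 500                         # m on the order of s*log(n/s)

# Ground truth: an s-sparse unit vector, hence a member of K_{n,s}.
x = np.zeros(n)
x[rng.choice(n, s, replace=False)] = rng.standard_normal(s)
x /= np.linalg.norm(x)

A = rng.standard_normal((m, n))
y = np.sign(A @ x)                            # one-bit measurements
flip = rng.choice(m, m // 20, replace=False)  # corrupt 5% of the bits
y[flip] *= -1

# max <Ax', y>  subject to  ||x'||_2 <= 1, ||x'||_1 <= sqrt(s)
xp = cp.Variable(n)
problem = cp.Problem(cp.Maximize(cp.sum(cp.multiply(y, A @ xp))),
                     [cp.norm(xp, 2) <= 1, cp.norm(xp, 1) <= np.sqrt(s)])
problem.solve()
print(np.linalg.norm(xp.value - x))           # recovery error; decreases as m grows
```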

Single-bit compressed sensing
Traditional compressed sensing: recover an s-sparse signal x ∈ R^n from m linear measurements y = Ax ∈ R^m. Available results: recovery by convex programming with m ≍ s log(n/s).
Single-bit compressed sensing: recover an s-sparse signal x ∈ R^n from m single-bit measurements y = sign(Ax) ∈ {−1, 1}^m. This is an extreme form of measurement quantization (A/D conversion).
[Boufounos-Baraniuk 08] formulated single-bit CS, with connections to embeddings into the Hamming cube, and algorithms; see also [Gupta-Nowak-Recht 10, Jacques-Laska-Boufounos-Baraniuk 11]. No tractable algorithms had been known (unless x has constant dynamic range, or for adaptive measurements).
Present work: robust sparse recovery via convex programming.

Applications in binomial regression
Our model of m one-bit measurements was y_i = sign⟨a_i, x⟩, i = 1, ..., m, with a_i ∼ N(0, I_n).
More general stochastic model: the y_i = ±1 are random variables, independent given {a_i}, with
E[y_i | a_i] = θ(⟨a_i, x⟩), i = 1, ..., m.
Here θ(u) is some function satisfying the correlation assumption:
E[θ(g) g] =: λ > 0, g ∼ N(0, 1).
Reason: this ensures positive correlation with the data. Since ⟨a_i, x⟩ ∼ N(0, 1) (as ‖x‖_2 = 1), we have E[y_i ⟨a_i, x⟩] = E[θ(g) g] = λ.

Model: E[y_i | a_i] = θ(⟨a_i, x⟩), i = 1, ..., m. Correlation assumption: E[θ(g) g] =: λ > 0.
This is the generalized linear model (GLM) with link function θ^{−1}.
Example. θ(z) = tanh(z/2) gives logistic regression:
P{y_i = 1} = f(⟨a_i, x⟩), f(z) = e^z / (e^z + 1),
since then E[y_i | a_i] = 2f(⟨a_i, x⟩) − 1 = tanh(⟨a_i, x⟩/2).
Statistical notation: x = unknown coefficient vector (β), y_i = binary response variables, a_i = independent variables (x_i);
P{y_i = 1} = f(Σ_{j=1}^n β_j x_{ij}), i = 1, ..., m.
Recent work on sparse logistic regression: [Negahban-Ravikumar-Wainwright-Yu 11, Bunea 08, Van De Geer 08, Bach 10, Ravikumar-Wainwright-Lafferty 10, Meier-Van De Geer-Bühlmann 08, Kakade-Shamir-Sridharan-Tewari 11]
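A quick sanity check, assuming NumPy (not from the slides): for the logistic link θ(z) = tanh(z/2), a Monte Carlo estimate of λ = E[θ(g)g] is strictly positive, as the correlation assumption requires.

```python
# Monte Carlo check of the correlation assumption for the logistic link.
# Illustration only; assumes NumPy.
import numpy as np

rng = np.random.default_rng(4)
g = rng.standard_normal(1_000_000)            # g ~ N(0, 1)
lam = np.mean(np.tanh(g / 2) * g)             # estimate of lambda = E[theta(g) g]
print(lam)                                    # roughly 0.41 > 0
```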

GLM: E[y_i | a_i] = θ(⟨a_i, x⟩), i = 1, ..., m. Correlation assumption: E[θ(g) g] =: λ > 0.
Theorem (Sparse binomial regression). Suppose we have a GLM whose coefficient vector x ∈ R^n, ‖x‖_2 = 1, is approximately s-sparse, x ∈ K_{n,s}. If the sample size satisfies m ≥ C(δ) s log(n/s), then we can estimate x by solving the convex program
max Σ_{i=1}^m y_i ⟨a_i, x'⟩ subject to x' ∈ K_{n,s}.
Indeed, the solution x̂ satisfies ‖x̂ − x‖_2 ≲ δ/λ with high probability.
In statistics notation, the sample size is n ≍ s log(p/s), so we can have n ≪ p.
A new, unusual feature: knowledge of the link function θ is not needed (unlike in maximum-likelihood approaches). The form of the GLM may be unknown; the solution is non-parametric.
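The sketch below, assuming NumPy and CVXPY (the logistic link is used only to generate data, not by the estimator), mirrors the earlier one-bit recovery example and illustrates the link-free feature: the responses follow a logistic GLM, yet the convex program never uses θ.

```python
# Link-free sparse GLM estimation: data generated with a logistic link,
# estimated by the same correlation-maximization program as before.
# Illustration only; assumes NumPy and CVXPY.
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(5)
n, s, m = 200, 5, 800

x = np.zeros(n)                               # approximately sparse truth, ||x||_2 = 1
x[rng.choice(n, s, replace=False)] = rng.standard_normal(s)
x /= np.linalg.norm(x)

A = rng.standard_normal((m, n))
p = 1.0 / (1.0 + np.exp(-(A @ x)))            # P{y_i = +1} under the logistic link
y = np.where(rng.random(m) < p, 1.0, -1.0)    # binary responses

# max sum_i y_i <a_i, x'>  subject to  x' in K_{n,s}; no link function needed.
xp = cp.Variable(n)
cp.Problem(cp.Maximize(cp.sum(cp.multiply(y, A @ xp))),
           [cp.norm(xp, 2) <= 1, cp.norm(xp, 1) <= np.sqrt(s)]).solve()
print(np.linalg.norm(xp.value - x))           # error ~ delta/lambda; shrinks as m grows
```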

Summary:
JL lemma: dimension reduction K → R^m by projecting K onto an m-dimensional subspace.
Alternative way: dimension reduction K → {−1, 1}^m by cutting K into small pieces with m hyperplanes.
Dimension reduction map: y = sign(Ax), where A is an m × n random Gaussian matrix.
Target dimension: m ≍ w(K)², the effective dimension of K. If K = {approximately s-sparse vectors}, then m ≍ s log(n/s).
One can accurately and robustly estimate x from y by a convex program.
More generally, one can accurately estimate a sparse solution to the GLM E[y] = θ(Ax), without even knowing the link function θ.