Least-Squares Regression on Sparse Spaces

Yuri Grinberg, Mahdi Milani Fard, Joelle Pineau
School of Computer Science, McGill University, Montreal, Canada
{ygrinb,mmilan1,jpineau}@cs.mcgill.ca

1 Introduction

Compressed sampling has been studied in the context of regression theory from two perspectives. One is when, given a training set, we aim to compress that set into a smaller one by combining training instances using random projections (see e.g. [1]). Such a method is useful, for instance, when the training set is too large or one has to handle privacy issues. Another application is when one uses random projections to project each input vector into a lower-dimensional space, and then trains a predictor in the new compressed space (compression on the feature space). As is typical of dimensionality reduction techniques, this reduces the variance of most predictors at the expense of introducing some bias. Random projections on the feature space, along with least-squares predictors, are studied in [2], and the method is shown to reduce the estimation error at the price of a controlled approximation error. The analysis in [2] provides on-sample error bounds and extends them to bounds on the sampling measure, assuming an i.i.d. sampling strategy.

This paper provides a bias-variance analysis of regression in compressed spaces when random projections are applied to sparse input signals. We show that the sparsity assumption lets us work with arbitrary non-i.i.d. sampling strategies, and we derive a worst-case bound on the entire space. Such a bound can be used to select the optimal size of the projection, so as to minimize the sum of the expected estimation and prediction errors. It also provides the means to compare the error of linear predictors in the original and compressed spaces.

2 Notation and Sparsity Assumption

Throughout this paper, column vectors are represented by lower-case bold letters, and matrices are represented by bold capital letters. $|\cdot|$ denotes the size of a set, and $\|\cdot\|_0$ is Donoho's zero "norm", indicating the number of non-zero elements in a vector. $\|\cdot\|$ denotes the $L_2$ norm for vectors and the operator norm for matrices: $\|M\| = \sup_v \|Mv\|/\|v\|$. We denote the Moore-Penrose pseudo-inverse of a matrix $M$ by $M^\dagger$ and the smallest singular value of $M$ by $\sigma^{(M)}_{\min}$.

We will be working in sparse input spaces for our prediction task. Our input is represented by a vector $x \in \mathcal{X}$ of $D$ features, with $\|x\| \le 1$. We assume that $x$ is $k$-sparse in some known or unknown basis $\Psi$, implying that $\mathcal{X} \subseteq \{\Psi z, \text{ s.t. } \|z\|_0 \le k \text{ and } \|z\| \le 1\}$.

3 Random Projections and Inner Products

It is well known that random projections of appropriate sizes preserve enough information for exact reconstruction with high probability (see e.g. [3, 4]). In this section, we show that a function that is almost linear in the original space remains almost linear in the projected space, when we use random projections of appropriate sizes.
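As a concrete illustration of the objects defined in Section 2 (this sketch is not part of the original paper; the orthonormal basis, dimensions, and sparsity level are arbitrary choices), the following NumPy snippet builds a $k$-sparse input $x = \Psi z$ with $\|z\|_0 \le k$ and $\|x\| \le 1$, and evaluates the norms and the pseudo-inverse used throughout the analysis.

```python
# Minimal illustrative sketch of the Section 2 notation (not from the paper):
# a k-sparse input x = Psi @ z, Donoho's zero "norm" ||z||_0, the operator
# norm, the Moore-Penrose pseudo-inverse, and the smallest singular value.
import numpy as np

rng = np.random.default_rng(0)
D, k = 100, 5

# Orthonormal basis Psi (here simply the Q factor of a random Gaussian matrix).
Psi, _ = np.linalg.qr(rng.standard_normal((D, D)))

# k-sparse coefficient vector z with ||z||_0 <= k and ||z|| <= 1.
z = np.zeros(D)
support = rng.choice(D, size=k, replace=False)
z[support] = rng.standard_normal(k)
z /= np.linalg.norm(z)                 # scale so that ||z|| = 1

x = Psi @ z                            # an input vector in the sparse space X
print("||z||_0 =", np.count_nonzero(z))   # Donoho's zero norm
print("||x||   =", np.linalg.norm(x))     # <= 1 since Psi is orthonormal

# Matrix notation: operator norm, pseudo-inverse, smallest singular value.
M = rng.standard_normal((20, D))
print("||M||        =", np.linalg.norm(M, 2))              # operator norm
M_pinv = np.linalg.pinv(M)                                  # Moore-Penrose pseudo-inverse
print("sigma_min(M) =", np.linalg.svd(M, compute_uv=False).min())
```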

There are several types of random projection matrices that can be used. In this work, we assume that each entry in a projection $\Phi_{D \times d}$ is an i.i.d. sample from a Gaussian¹:

$$\phi_{i,j} \sim \mathcal{N}(0, 1/d). \qquad (1)$$

We build our work on the following result, based on Theorem 4.1 from [3], which shows that for a finite set of points, the inner product with a fixed vector is almost preserved after a random projection.

Theorem 1. Let $\Phi_{D \times d}$ be a random projection according to Eqn 1. Let $S$ be a finite set of points in $\mathbb{R}^D$. Then for any fixed $w \in \mathbb{R}^D$ and $\epsilon > 0$:

$$\forall s \in S : \quad \left| \langle \Phi^T w, \Phi^T s \rangle - \langle w, s \rangle \right| \le \epsilon \|w\| \|s\| \qquad (2)$$

fails with probability less than $(4|S| + 2)\,e^{-d\epsilon^2/48}$.

The above theorem is based on the well-known Johnson-Lindenstrauss lemma (see [3]), which considers random projections of finite sets of points. We derive the corresponding theorem for sparse feature spaces.

Theorem 2. Let $\Phi_{D \times d}$ be a random projection according to Eqn 1. Let $\mathcal{X}$ be a $D$-dimensional $k$-sparse space. Then for any fixed $w$ and $\epsilon > 0$:

$$\forall x \in \mathcal{X} : \quad \left| \langle \Phi^T w, \Phi^T x \rangle - \langle w, x \rangle \right| \le \epsilon \|w\| \|x\| \qquad (3)$$

fails with probability less than:

$$(eD/k)^k \left( 4(12/\epsilon)^k + 2 \right) e^{-d\epsilon^2/192} \;\le\; e^{\,k \log(12eD/(\epsilon k)) - d\epsilon^2/192 + \log 5}.$$

Note that the above theorem does not require $w$ to be in the sparse space, and thus differs from guarantees on the preservation of the inner product between vectors within the sparse space.

Proof of Theorem 2. The proof follows the steps of the proof of Theorem 5.2 from [5]. Because $\Phi$ is a linear transformation, we only need to prove the theorem when $\|w\| = \|x\| = 1$. Denote by $\Psi$ the basis with respect to which $\mathcal{X}$ is sparse. Let $T \subseteq \{1, 2, \ldots, D\}$ be any set of $k$ indexes. For each set of indexes $T$, we define a $k$-dimensional hyperplane in the $D$-dimensional input space: $\mathcal{X}_T \triangleq \{\Psi z, \text{ s.t. } z \text{ is zero outside } T \text{ and } \|z\| \le 1\}$. By definition we have $\mathcal{X} = \bigcup_T \mathcal{X}_T$. We first show that Eqn 3 holds for each $\mathcal{X}_T$, and then use the union bound to prove the theorem.

For any given $T$, we choose a set $S \subset \mathcal{X}_T$ such that:

$$\forall x \in \mathcal{X}_T : \quad \min_{s \in S} \|x - s\| \le \epsilon/4. \qquad (4)$$

It is easy to prove (see e.g. Chapter 13 of [6]) that these conditions can be satisfied by choosing a grid of size $|S| \le (12/\epsilon)^k$, since $\mathcal{X}_T$ is a $k$-dimensional hyperplane in $\mathbb{R}^D$ ($S$ fills up the space within $\epsilon/4$ distance). Now applying Theorem 1, and with $\|w\| = 1$, we have that:

$$\forall s \in S : \quad \left| \langle \Phi^T w, \Phi^T s \rangle - \langle w, s \rangle \right| \le \frac{\epsilon}{2} \|s\| \qquad (5)$$

fails with probability less than $\left( 4(12/\epsilon)^k + 2 \right) e^{-d\epsilon^2/192}$. Let $a$ be the smallest number such that:

$$\forall x \in \mathcal{X}_T : \quad \left| \langle \Phi^T w, \Phi^T x \rangle - \langle w, x \rangle \right| \le a \|x\| \qquad (6)$$

holds when Eqn 5 holds. The goal is to show that $a \le \epsilon$. For any given $x \in \mathcal{X}_T$, we choose an $s \in S$ for which $\|x - s\| \le \epsilon/4$. Therefore we have:

$$\begin{aligned}
\left| \langle \Phi^T w, \Phi^T x \rangle - \langle w, x \rangle \right|
&\le \left| \langle \Phi^T w, \Phi^T x \rangle - \langle \Phi^T w, \Phi^T s \rangle - \langle w, x \rangle + \langle w, s \rangle \right| \qquad (7) \\
&\quad + \left| \langle \Phi^T w, \Phi^T s \rangle - \langle w, s \rangle \right| \qquad (8) \\
&= \left| \langle \Phi^T w, \Phi^T (x - s) \rangle - \langle w, x - s \rangle \right| \qquad (9) \\
&\quad + \left| \langle \Phi^T w, \Phi^T s \rangle - \langle w, s \rangle \right| \qquad (10) \\
&\le a\epsilon/4 + \epsilon/2. \qquad (11)
\end{aligned}$$

¹ The elements of the projection are typically taken to be distributed as $\mathcal{N}(0, 1/D)$, but we scale them by $\sqrt{D/d}$, so that we avoid scaling the projected values (see e.g. [3]).
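As an illustration (not an experiment from the paper), the following NumPy sketch draws a projection $\Phi$ with i.i.d. $\mathcal{N}(0, 1/d)$ entries as in Eqn 1 and empirically measures the inner-product distortion that Theorem 2 controls, for a fixed $w$ and $k$-sparse inputs $x$. The dimensions and the helper `sparse_x` are arbitrary choices, and the canonical basis plays the role of $\Psi$.

```python
# Minimal sketch: empirical check of inner-product preservation (Theorem 2)
# under a random projection Phi with i.i.d. N(0, 1/d) entries (Eqn 1).
import numpy as np

rng = np.random.default_rng(1)
D, d, k = 1000, 200, 5

Phi = rng.normal(0.0, np.sqrt(1.0 / d), size=(D, d))    # Eqn 1
w = rng.standard_normal(D)
w /= np.linalg.norm(w)                                   # fixed vector, ||w|| = 1

def sparse_x():
    """Draw a k-sparse x in the canonical basis with ||x|| = 1."""
    x = np.zeros(D)
    idx = rng.choice(D, size=k, replace=False)
    x[idx] = rng.standard_normal(k)
    return x / np.linalg.norm(x)

# Distortion |<Phi^T w, Phi^T x> - <w, x>| over many sparse inputs.
errs = []
for _ in range(2000):
    x = sparse_x()
    errs.append(abs(np.dot(Phi.T @ w, Phi.T @ x) - np.dot(w, x)))

# The maximum distortion stays small relative to ||w|| ||x|| = 1 when d is
# large enough, as the theorem suggests.
print("max distortion over sampled sparse inputs:", max(errs))
```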

The last line is by the definition of $a$, and by applying Eqn 5 (which holds with high probability). Because of the definition of $a$, there is an $x \in \mathcal{X}_T$ (and, by scaling, one with norm 1) for which Eqn 6 is tight. Therefore we have $a \le a\epsilon/4 + \epsilon/2$, which proves $a \le \epsilon$ for any choice of $\epsilon < 1$.

Note that there are $\binom{D}{k}$ possible sets $T$. Since $\binom{D}{k} \le (eD/k)^k$ and $\mathcal{X} = \bigcup_T \mathcal{X}_T$, the union bound gives us that the theorem fails with probability less than $(eD/k)^k \left( 4(12/\epsilon)^k + 2 \right) e^{-d\epsilon^2/192}$.

4 Bias-Variance Analysis of Ordinary Least-Squares

In this section, we analyze the worst-case prediction error made by the ordinary least-squares (OLS) solution. For completeness, we first provide bounds on OLS in the original space, which is partly a classical result in linear prediction theory. Then we proceed to the main result of this paper, which is the bias-variance analysis of OLS in the projected space.

We seek to predict a signal $f$ that is assumed to be a near-linear function of $x \in \mathcal{X}$:

$$f(x) = x^T w + b_f(x), \quad \text{where } |b_f(x)| \le \epsilon_f, \qquad (12)$$

for some $\epsilon_f > 0$, and where we assume $\|w\| \le 1$. We are given a training set of $n$ input-output pairs, consisting of a full-rank input matrix $X_{n \times D}$, along with noisy observations of $f$:

$$y = Xw + b_f + \eta, \qquad (13)$$

where, for the additive bias term, we overload the notation with $b_{f,i} = b_f(x_i)$; and we assume the homoscedastic noise term $\eta$ to be a vector of i.i.d. random variables distributed as $\mathcal{N}(0, \sigma^2_\eta)$. Given the above, we seek a predictor that, for any query $x \in \mathcal{X}$, predicts the target signal $f(x)$. The following lemma provides a bound on the worst-case error of the ordinary least-squares predictor.

Lemma 3. Let $w_{\mathrm{ols}}$ be the OLS solution of Eqn 13 with additive bias bounded by $\epsilon_f$ and i.i.d. noise with variance $\sigma^2_\eta$. Then for any $0 < \delta_{\mathrm{var}} \le \sqrt{2/e\pi}$, for all $x \in \mathcal{X}$, with probability no less than $1 - \delta_{\mathrm{var}}$, the error of the OLS prediction satisfies:

$$\left| f(x) - x^T w_{\mathrm{ols}} \right| \le \|x\| \|X^\dagger\| \left( \epsilon_f \sqrt{n} + \sigma_\eta \sqrt{\log(2/\pi\delta^2_{\mathrm{var}})} \right) + \epsilon_f. \qquad (14)$$

Proof of Lemma 3. For the OLS solution of Eqn 13 we have:

$$w_{\mathrm{ols}} = X^\dagger y = X^\dagger (Xw + b_f + \eta) = w + X^\dagger b_f + X^\dagger \eta. \qquad (15)$$

Therefore for all $x \in \mathcal{X}$ the error satisfies:

$$\begin{aligned}
\left| f(x) - x^T w_{\mathrm{ols}} \right| &\le \left| x^T w_{\mathrm{ols}} - x^T w \right| + \epsilon_f \qquad (16) \\
&\le \left| x^T X^\dagger b_f \right| + \left| x^T X^\dagger \eta \right| + \epsilon_f. \qquad (17)
\end{aligned}$$

For the first term on the right hand side (part of the prediction bias), we have:

$$\left| x^T X^\dagger b_f \right| \le \|x^T X^\dagger\| \|b_f\| \le \|x\| \|X^\dagger\| \epsilon_f \sqrt{n}. \qquad (18)$$

For the second term in line 17 (prediction variance), the expectation of $x^T X^\dagger \eta$ is 0, as $\eta$ is independent of the data and has zero mean. We also know that it is a weighted sum of normally distributed random variables, and thus is normal with variance:

$$\begin{aligned}
\mathrm{Var}[x^T X^\dagger \eta] &= \mathbb{E}[x^T X^\dagger \eta \eta^T X^{\dagger T} x] \qquad (19) \\
&= \sigma^2_\eta \, x^T X^\dagger X^{\dagger T} x \qquad (20) \\
&\le \sigma^2_\eta \|x^T X^\dagger\| \|X^{\dagger T} x\| \qquad (21) \\
&\le \sigma^2_\eta \|x\|^2 \|X^\dagger\|^2, \qquad (22)
\end{aligned}$$

where in line 20 we used the i.i.d. assumption on the noise. Thereby we can bound $x^T X^\dagger \eta$ by the tail probability of the normal distribution as needed. Using a standard upper bound on the tail probability of normals, when $0 < \delta_{\mathrm{var}} \le \sqrt{2/e\pi}$, with probability no less than $1 - \delta_{\mathrm{var}}$:

$$\left| x^T X^\dagger \eta \right| \le \sigma_\eta \|x\| \|X^\dagger\| \sqrt{\log(2/\pi\delta^2_{\mathrm{var}})}. \qquad (23)$$

Adding up the bias and the variance terms gives us the bound in the lemma.
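To make Lemma 3 concrete, here is a small numerical sketch (illustrative only, with arbitrary problem sizes and noise levels): it fits OLS through the pseudo-inverse as in Eqn 15 and compares the observed prediction error at a test point with the right-hand side of Eqn 14.

```python
# Minimal sketch: OLS via the pseudo-inverse and a numerical evaluation of
# the worst-case bound of Lemma 3,
#   |f(x) - x^T w_ols| <= ||x|| ||X^+|| (eps_f sqrt(n)
#                          + sigma_eta sqrt(log(2/(pi delta^2)))) + eps_f.
import numpy as np

rng = np.random.default_rng(2)
n, D = 500, 50
eps_f, sigma_eta, delta_var = 0.01, 0.1, 0.05   # delta_var <= sqrt(2/(e*pi))

w_true = rng.standard_normal(D)
w_true /= np.linalg.norm(w_true)                    # ||w|| <= 1

X = rng.normal(0.0, np.sqrt(1.0 / D), size=(n, D))  # full-rank input matrix
b_f = eps_f * (2 * rng.random(n) - 1)               # bounded additive bias
eta = rng.normal(0.0, sigma_eta, size=n)            # homoscedastic noise
y = X @ w_true + b_f + eta                          # Eqn 13

X_pinv = np.linalg.pinv(X)
w_ols = X_pinv @ y                                  # Eqn 15

# Observed error at a test point vs. the bound of Eqn 14.
x = rng.standard_normal(D)
x /= np.linalg.norm(x)
f_x = x @ w_true + eps_f * (2 * rng.random() - 1)
err = abs(f_x - x @ w_ols)
bound = (np.linalg.norm(x) * np.linalg.norm(X_pinv, 2)
         * (eps_f * np.sqrt(n)
            + sigma_eta * np.sqrt(np.log(2 / (np.pi * delta_var**2))))
         + eps_f)
print(f"observed error {err:.4f}  vs  Lemma 3 bound {bound:.4f}")
```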

5 Compressed Ordinary Least-Squares

We are now ready to derive an upper bound for the worst-case error of the OLS predictor in a compressed space. In this setting, we first project the inputs into a lower-dimensional space using random projections, and then use the OLS estimator on the compressed input signals.

Theorem 4. Let $\Phi_{D \times d}$ be a random projection according to Eqn 1, and let $w^\Phi_{\mathrm{ols}}$ be the OLS solution in the compressed space induced by the projection. Assume an additive bias in the original space bounded by some $\epsilon_f > 0$ and i.i.d. noise with variance $\sigma^2_\eta$. Choose any $0 < \delta_{\mathrm{prj}}, \delta_\Phi < 1$ and $0 < \delta_{\mathrm{var}} \le \sqrt{2/e\pi}$. Then, with probability no less than $1 - (\delta_{\mathrm{prj}} + \delta_\Phi)$, we have for all $x \in \mathcal{X}$, with probability no less than $1 - \delta_{\mathrm{var}}$:

$$\left| f(x) - x^T \Phi w^\Phi_{\mathrm{ols}} \right| \le \frac{\left(1 + \sqrt{(2/d)\log(1/\delta_\Phi)}\right)\left((\epsilon_f + \epsilon_{\mathrm{prj}})\sqrt{n} + \sigma_\eta \sqrt{\log(2/\pi\delta^2_{\mathrm{var}})}\right)}{\sigma^{(X)}_{\min}\left(\sqrt{D/d} - 1 - \sqrt{(2/d)\log(1/\delta_\Phi)}\right)} + \epsilon_f + \epsilon_{\mathrm{prj}}, \qquad (24, 25)$$

where $\epsilon_{\mathrm{prj}} = c\,\sqrt{\dfrac{k \log(12eD/k) + \log(1/\delta_{\mathrm{prj}})}{d}}$ for a constant $c$.

Proof of Theorem 4. Using Theorem 2, the following holds with probability no less than $1 - \delta_{\mathrm{prj}}$:

$$f(x) = (\Phi^T x)^T (\Phi^T w) + b_f(x) + b_{\mathrm{prj}}(x), \qquad (26)$$

where $|b_f(x)| \le \epsilon_f$ and $|b_{\mathrm{prj}}(x)| \le \epsilon_{\mathrm{prj}}$. Note that $\|(X\Phi)^\dagger\| \le \|X^\dagger\| \|\Phi^\dagger\| \le 1/(\sigma^{(X)}_{\min}\sigma^{(\Phi)}_{\min})$. Using the bound discussed in [7], we have with probability $1 - \delta_\Phi$:

$$\sigma^{(\Phi)}_{\min} \ge \sqrt{D/d} - 1 - \sqrt{(2/d)\log(1/\delta_\Phi)}.$$

Now, using Lemma 3 with a function of the form described in Eqn 26, we have:

$$\left| f(x) - x^T \Phi w^\Phi_{\mathrm{ols}} \right| \le \|x^T \Phi\| \|(X\Phi)^\dagger\| \left( (\epsilon_f + \epsilon_{\mathrm{prj}})\sqrt{n} + \sigma_\eta \sqrt{\log(2/\pi\delta^2_{\mathrm{var}})} \right) + \epsilon_f + \epsilon_{\mathrm{prj}}, \qquad (27)$$

which yields the theorem after the substitution of $\epsilon_{\mathrm{prj}}$ and the matrix norms.

Assuming $\epsilon_f = 0$ for simplicity, we can rewrite the bound as:

$$\left| f(x) - x^T \Phi w^\Phi_{\mathrm{ols}} \right| \le \tilde{O}\!\left(\frac{1}{\sigma^{(X)}_{\min}}\sqrt{\frac{n\,k\log(D/k)}{D}} + \sqrt{\frac{k\log(D/k)}{d}}\right) + \tilde{O}\!\left(\frac{\sigma_\eta}{\sigma^{(X)}_{\min}}\sqrt{\frac{d}{D}}\right).$$

The first $\tilde{O}$ term on the right hand side is part of the bias due to the projection. The second $\tilde{O}$ term is the variance term. This bound is particularly useful when $n > D$. With that assumption, and in order to illustrate a clearer bias-variance trade-off, assume that $X_{i,j}$ is drawn from $\mathcal{N}(0, 1/D)$. Then we have $\sigma^{(X)}_{\min} \simeq \sqrt{n/D}$. Fixing the values of the $\delta$'s and ignoring the log term (a slowly growing function of $d$), we get that the error is bounded by:

$$c_0 + c_1 \sqrt{\frac{k\log(D/k)}{d}} + c_2\, \sigma_\eta\, \frac{\sqrt{d} + 1}{\sqrt{n}},$$

in which case we clearly observe the trade-off with respect to the compressed dimension $d$. Now if $d < n < D$ with $X_{i,j} \sim \mathcal{N}(0, 1/D)$, we have $\sigma^{(X)}_{\min} \simeq 1 - \sqrt{n/D}$, which gives us the following bound:

$$c_0 + c_1 \sqrt{\frac{k\log(D/k)}{d}} + c_2\, \frac{\sigma_\eta}{1 - \sqrt{n/D}} \sqrt{\frac{d}{D}}.$$

This bound is, however, counter-intuitive, as the error grows when $n$ is increased. This is due to the fact that the bound $\|(X\Phi)^\dagger\| \le \|X^\dagger\| \|\Phi^\dagger\|$ gets looser as $n$ gets close to $D$. When $n$ is close to $D$, we might not have a tight closed-form bound on the $\|(X\Phi)^\dagger\|$ term, but we can still calculate it empirically for any given value of $d$ by sampling a specific projection of the corresponding size.
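The following sketch (an illustration, not the paper's experiment; the dimensions, the noise level, and the helper `sparse_rows` are arbitrary choices, and the canonical basis plays the role of $\Psi$) runs compressed OLS as described in this section: it projects sparse inputs with a $\Phi$ drawn according to Eqn 1, fits OLS in the $d$-dimensional space, and traces the worst-case test error as $d$ varies, which makes the bias-variance trade-off visible. It also computes $\|(X\Phi)^\dagger\|$ empirically, as suggested in the last paragraph.

```python
# Minimal sketch of compressed OLS (Section 5): project the sparse inputs
# with Phi ~ N(0, 1/d) entrywise, solve OLS in the compressed space, and
# report the empirical worst-case error and ||(X Phi)^+|| for several d.
import numpy as np

rng = np.random.default_rng(3)
n, D, k, sigma_eta = 400, 1000, 10, 0.1

w_true = rng.standard_normal(D)
w_true /= np.linalg.norm(w_true)

def sparse_rows(m):
    """m rows, each k-sparse in the canonical basis with unit norm."""
    Z = np.zeros((m, D))
    for i in range(m):
        idx = rng.choice(D, size=k, replace=False)
        Z[i, idx] = rng.standard_normal(k)
        Z[i] /= np.linalg.norm(Z[i])
    return Z

X = sparse_rows(n)
y = X @ w_true + rng.normal(0.0, sigma_eta, size=n)     # eps_f = 0 here

X_test = sparse_rows(2000)
f_test = X_test @ w_true

for d in (10, 50, 100, 200, 400):
    Phi = rng.normal(0.0, np.sqrt(1.0 / d), size=(D, d))
    XPhi = X @ Phi                                      # compressed inputs
    XPhi_pinv = np.linalg.pinv(XPhi)
    w_ols_phi = XPhi_pinv @ y                           # OLS in compressed space
    err = np.abs(f_test - X_test @ Phi @ w_ols_phi).max()
    print(f"d={d:4d}  worst-case test error={err:.3f}  "
          f"||(X Phi)^+||={np.linalg.norm(XPhi_pinv, 2):.3f}")
```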

References

[1] S. Zhou, J. Lafferty, and L. Wasserman. Compressed regression. In Proceedings of Advances in Neural Information Processing Systems, 2007.
[2] O.A. Maillard and R. Munos. Compressed least-squares regression. In Proceedings of Advances in Neural Information Processing Systems, 2009.
[3] M.A. Davenport, M.B. Wakin, and R.G. Baraniuk. Detection and estimation with compressive measurements. Dept. of ECE, Rice University, Tech. Rep., 2006.
[4] E.J. Candès and M.B. Wakin. An introduction to compressive sampling. IEEE Signal Processing Magazine, 25(2):21-30, 2008.
[5] R. Baraniuk, M. Davenport, R. DeVore, and M. Wakin. The Johnson-Lindenstrauss lemma meets compressed sensing. Constructive Approximation, 2007.
[6] G.G. Lorentz, M. von Golitschek, and Y. Makovoz. Constructive Approximation: Advanced Problems, volume 304. Springer, Berlin, 1996.
[7] E.J. Candès and T. Tao. Near-optimal signal recovery from random projections: Universal encoding strategies. IEEE Transactions on Information Theory, 52(12):5406-5425, 2006.