Linear Regression with Limited Observation

Elad Hazan (ehazan@ie.technion.ac.il) and Tomer Koren (tomerk@cs.technion.ac.il)
Technion - Israel Institute of Technology, Technion City 32000, Haifa, Israel

Appearing in Proceedings of the 29th International Conference on Machine Learning, Edinburgh, Scotland, UK, 2012. Copyright 2012 by the author(s)/owner(s).

Abstract

We consider the most common variants of linear regression, including Ridge, Lasso and Support-vector regression, in a setting where the learner is allowed to observe only a fixed number of attributes of each example at training time. We present simple and efficient algorithms for these problems: for Lasso and Ridge regression they need the same total number of attributes (up to constants) as do full-information algorithms for reaching a certain accuracy. For Support-vector regression, we require exponentially fewer attributes compared to the state of the art. By that, we resolve an open problem recently posed by Cesa-Bianchi et al. (2010). Experiments show the theoretical bounds to be justified by superior performance compared to the state of the art.

1. Introduction

In regression analysis the statistician attempts to learn from examples the underlying variables affecting a given phenomenon. For example, in medical diagnosis a certain combination of conditions reflects whether a patient is afflicted with a certain disease. In certain common regression cases various limitations are placed on the information available from the examples. In the medical example, not all parameters of a certain patient can be measured, due to cost, time and patient reluctance.

In this paper we study the problem of regression in which only a small subset of the attributes per example can be observed. In this setting, we have access to all attributes and we are required to choose which of them to observe. Recently, Cesa-Bianchi et al. (2010) studied this problem and asked the following interesting question: can we efficiently learn the optimal regressor in the attribute-efficient setting with the same total number of attributes as in the unrestricted regression setting? In other words, the question amounts to whether the information limitation hinders our ability to learn efficiently at all. Ideally, one would hope that instead of observing all attributes of every example, one could compensate for fewer attributes by analyzing more examples, but retain the same overall sample and computational complexity.

Indeed, we answer this question in the affirmative for the main variants of regression: Ridge and Lasso. For Support-vector regression we make significant advancement, reducing the parameter dependence by an exponential factor. Our results are summarized in Table 1 below¹, which gives bounds on the number of examples needed to attain an error of $\varepsilon$, such that at most $k$ attributes² are viewable per example. We denote by $d$ the dimension of the attribute space.

Table 1. Our sample complexity bounds.
Regression | New bound | Prev. bound
Ridge | $O\left(\frac{d}{k\varepsilon^2}\right)$ | $O\left(\frac{d^2}{k\varepsilon^2}\log\frac{d^2}{\varepsilon}\right)$
Lasso | $O\left(\frac{d\log d}{k\varepsilon^2}\right)$ | $O\left(\frac{d^2}{k\varepsilon^2}\log\frac{d^2}{\varepsilon}\right)$
SVR | $O\left(\frac{d}{k}\right)e^{O(\log^2\frac{1}{\varepsilon})}$ | $O\left(\frac{d^2 e^{d}}{k\varepsilon^2}\right)$

¹ The previous bounds are due to (Cesa-Bianchi et al., 2010). For SVR, the bound is obtained by additionally incorporating the methods of (Cesa-Bianchi et al., 2011).
² For SVR, the number of attributes viewed per example is a random variable whose expectation is $k$.

Our bounds imply that for reaching a certain accuracy, our algorithms need the same number of attributes as their full-information counterparts. In particular, when $k = \Omega(d)$ our bounds coincide with those of full-information regression, up to constants (cf. Kakade et al., 2008).

We complement these upper bounds and prove that $\Omega(d/\varepsilon^2)$ attributes are in fact necessary to learn an $\varepsilon$-accurate Ridge regressor.

For Lasso regression, Cesa-Bianchi et al. (2010) proved that $\Omega(d/\varepsilon)$ attributes are necessary, and asked what the correct dependence on the problem dimension is. Our bounds imply that the number of attributes necessary for regression learning grows linearly with the problem dimension.

The algorithms themselves are very simple to implement, and run in linear time. As we show in later sections, these theoretical improvements are clearly visible in experiments on standard datasets.

1.1. Related work

The setting of learning with limited attribute observation (LAO) was first put forth in (Ben-David & Dichterman, 1998), who coined the term "learning with restricted focus of attention". Cesa-Bianchi et al. (2010) were the first to discuss linear prediction in the LAO setting, and gave an efficient algorithm (as well as lower bounds) for linear regression, which is the primary focus of this paper.

2. Setting and Result Statement

2.1. Linear regression

In the linear regression problem, each instance is a pair $(x, y)$ of an attribute vector $x \in \mathbb{R}^d$ and a target variable $y \in \mathbb{R}$. We assume the standard framework of statistical learning (Haussler, 1992), in which the pairs $(x, y)$ follow a joint probability distribution $D$ over $\mathbb{R}^d \times \mathbb{R}$. The goal of the learner is to find a vector $w$ for which the linear rule $\hat y \leftarrow w \cdot x$ provides a good prediction of the target $y$. To measure the performance of the prediction, we use a convex loss function $\ell(\hat y, y) : \mathbb{R}^2 \to \mathbb{R}$. The most common choice is the square loss $\ell(\hat y, y) = \frac{1}{2}(\hat y - y)^2$, which stands for the popular least-squares regression. Hence, in terms of the distribution $D$, the learner would like to find a regressor $w \in \mathbb{R}^d$ with low expected loss, defined as
$$L_D(w) = \mathbb{E}_{(x,y)\sim D}\left[\ell(w \cdot x, y)\right]. \qquad (1)$$
The standard paradigm for learning such a regressor is seeking a vector $w \in \mathbb{R}^d$ that minimizes a trade-off between the expected loss and an additional regularization term, which is usually a norm of $w$. An equivalent form of this optimization problem is obtained by replacing the regularization term with a proper constraint, giving rise to the problem
$$\min_{w\in\mathbb{R}^d} L_D(w) \quad \text{s.t.} \quad \|w\|_p \le B, \qquad (2)$$
where $B > 0$ is a regularization parameter and $p \ge 1$. The main variants of regression differ in the type of $\ell_p$-norm constraint as well as the loss function in the above definition:

- Ridge regression: $p = 2$ and the square loss, $\ell(\hat y, y) = \frac{1}{2}(\hat y - y)^2$.
- Lasso regression: $p = 1$ and the square loss.
- Support-vector regression: $p = 2$ and the $\delta$-insensitive absolute loss (Vapnik, 1995), $\ell(\hat y, y) = |\hat y - y|_\delta := \max\{0, |\hat y - y| - \delta\}$.

Since the distribution $D$ is unknown, we learn by relying on a training set $S = \{(x_t, y_t)\}_{t=1}^m$ of examples, which are assumed to be sampled independently from $D$. We use the notation $\ell_t(w) := \ell(w \cdot x_t, y_t)$ to refer to the loss function induced by the instance $(x_t, y_t)$.

We distinguish between two learning scenarios. In the full-information setup, the learner has unrestricted access to the entire data set. In the limited attribute observation (LAO) setting, for any given example pair $(x, y)$, the learner can observe $y$, but only $k$ attributes of $x$ (where $k \ge 1$ is a parameter of the problem). The learner can actively choose which attributes to observe.
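The LAO access model can be simulated directly. The following sketch (ours, not from the paper; the function name is hypothetical) shows how uniformly sampling $k$ coordinates of $x$ and rescaling by $d$ yields an unbiased estimate of the full attribute vector, which is the basic primitive that the algorithms of Section 3 build on.

```python
import numpy as np

def observe_attributes(x, k, rng):
    """Simulate LAO access: pick k coordinates of x uniformly (with replacement),
    observe only those, and return an unbiased estimate of the full vector x.

    Each observed coordinate i contributes d * x[i] * e_i; averaging the k
    contributions keeps the estimate unbiased while reducing its variance."""
    d = x.shape[0]
    observed = rng.integers(0, d, size=k)     # the k attribute indices we query
    x_est = np.zeros(d)
    for i in observed:
        x_est[i] += d * x[i] / k              # (1/k) * sum_r d * x[i_r] * e_{i_r}
    return x_est

# Sanity check of unbiasedness: the average of many estimates approaches x.
rng = np.random.default_rng(0)
x = rng.standard_normal(20)
avg = np.mean([observe_attributes(x, k=4, rng=rng) for _ in range(20000)], axis=0)
print(np.max(np.abs(avg - x)))                # small, up to Monte-Carlo error
```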
2.2. Limitations on LAO regression

Cesa-Bianchi et al. (2010) proved the following sample complexity lower bound on any LAO Lasso regression algorithm.

Theorem 2.1. Let $0 < \varepsilon < \frac{1}{16}$, $k \ge 1$ and $d > 4k$. For any regression algorithm accessing at most $k$ attributes per training example, there exists a distribution $D$ over $\{x : \|x\|_\infty \le 1\} \times \{-1, +1\}$ and a regressor $w^\star$ with $\|w^\star\|_1 \le 1$ such that the algorithm must see (in expectation) at least $\Omega\!\left(\frac{d}{k\varepsilon}\right)$ examples in order to learn a linear regressor $w$ with $L_D(w) - L_D(w^\star) < \varepsilon$.

We complement this lower bound by providing a stronger lower bound on the sample complexity of any Ridge regression algorithm, using information-theoretic arguments.

Theorem 2.2. Let $\varepsilon = \Omega(1/\sqrt{d})$. For any regression algorithm accessing at most $k$ attributes per training example, there exists a distribution $D$ over $\{x : \|x\|_2 \le 1\} \times \{-1, +1\}$ and a regressor $w^\star$ with $\|w^\star\|_2 \le 1$ such that the algorithm must see (in expectation) at least $\Omega\!\left(\frac{d}{k\varepsilon^2}\right)$ examples in order to learn a linear regressor $w$, $\|w\|_2 \le 1$, with $L_D(w) - L_D(w^\star) \le \varepsilon$.

Our algorithm for LAO Ridge regression (see Section 3) implies this lower bound to be tight up to constants.

Note, however, that the bound applies only to a particular regime of the problem parameters.³

³ Indeed, there are (full-information) algorithms that are known to converge at a $O(1/\varepsilon)$ rate; see e.g. (Hazan et al., 2007).

2.3. Our algorithmic results

We give efficient regression algorithms that attain the following risk bounds. For our Ridge regression algorithm, we prove the risk bound
$$\mathbb{E}[L_D(\bar w)] \le \min_{\|w\|_2 \le B} L_D(w) + O\!\left(B^2\sqrt{\frac{d}{km}}\right),$$
while for our Lasso regression algorithm we establish the bound
$$\mathbb{E}[L_D(\bar w)] \le \min_{\|w\|_1 \le B} L_D(w) + O\!\left(B^2\sqrt{\frac{d\log d}{km}}\right).$$
Here we use $\bar w$ to denote the output of each algorithm on a training set of $m$ examples, and the expectations are taken with respect to the randomization of the algorithms. For Support-vector regression we obtain a risk bound that depends on the desired accuracy $\varepsilon$. Our bound implies that
$$m = O\!\left(\frac{d}{k}\right)\exp\!\left(O\!\left(\log^2\frac{B}{\varepsilon}\right)\right)$$
examples are needed (in expectation) for obtaining an $\varepsilon$-accurate regressor.

3. Algorithms for LAO least-squares regression

In this section we present and analyze our algorithms for Ridge and Lasso regression in the LAO setting. The loss function under consideration here is the square loss, that is, $\ell_t(w) = \frac{1}{2}(w \cdot x_t - y_t)^2$. For convenience, we show algorithms that use $k + 1$ attributes of each instance, for $k \ge 1$.⁴ Our algorithms are iterative and maintain a regressor $w_t$ along the iterations. The update of the regressor at iteration $t$ is based on gradient information, and specifically on $g_t := \nabla\ell_t(w_t)$, which equals $(w_t \cdot x_t - y_t)\,x_t$ for the square loss. In the LAO setting, however, we do not have access to this information, thus we build upon unbiased estimators $\tilde g_t$ of the gradients.

⁴ We note that by our approach it is impossible to learn using a single attribute of each example (i.e., with $k = 0$), and we are not aware of any algorithm that is able to do so. See (Cesa-Bianchi et al., 2011) for a related impossibility result.

3.1. Ridge regression

Recall that in Ridge regression, we are interested in the linear regressor that is the solution to the optimization problem (2) with $p = 2$, given explicitly as
$$\min_{w\in\mathbb{R}^d} L_D(w) \quad \text{s.t.} \quad \|w\|_2 \le B. \qquad (3)$$
Our algorithm for the LAO setting is based on a randomized Online Gradient Descent (OGD) strategy (Zinkevich, 2003). More specifically, at each iteration $t$ we use a randomized estimator $\tilde g_t$ of the gradient $g_t$ to update the regressor $w_t$ via an additive rule. Our gradient estimators make use of an importance-sampling method inspired by (Clarkson et al., 2010). The pseudo-code of our Attribute Efficient Ridge Regression (AERR) algorithm is given in Algorithm 1.

Algorithm 1 AERR
Parameters: $B, \eta > 0$
Input: training set $S = \{(x_t, y_t)\}_{t\in[m]}$ and $k > 0$
Output: regressor $\bar w$ with $\|\bar w\|_2 \le B$
1: Initialize $w_1 \ne 0$, $\|w_1\|_2 \le B$ arbitrarily
2: for $t = 1$ to $m$ do
3:   for $r = 1$ to $k$ do
4:     Pick $i_{t,r} \in [d]$ uniformly and observe $x_t[i_{t,r}]$
5:     $\tilde x_{t,r} \leftarrow d\, x_t[i_{t,r}]\, e_{i_{t,r}}$
6:   end for
7:   $\tilde x_t \leftarrow \frac{1}{k}\sum_{r=1}^{k} \tilde x_{t,r}$
8:   Choose $j_t \in [d]$ with probability $w_t[j]^2 / \|w_t\|_2^2$, and observe $x_t[j_t]$
9:   $\tilde\varphi_t \leftarrow \|w_t\|_2^2\, x_t[j_t] / w_t[j_t] - y_t$
10:  $\tilde g_t \leftarrow \tilde\varphi_t\, \tilde x_t$
11:  $v_t \leftarrow w_t - \eta\, \tilde g_t$
12:  $w_{t+1} \leftarrow v_t \cdot B / \max\{\|v_t\|_2, B\}$
13: end for
14: $\bar w \leftarrow \frac{1}{m}\sum_{t=1}^{m} w_t$

In the following theorem, we show that the regressor learned by our algorithm is competitive with the optimal linear regressor having 2-norm bounded by $B$.

Theorem 3.1. Assume the distribution $D$ is such that $\|x\|_2 \le 1$ and $|y| \le B$ with probability 1. Let $\bar w$ be the output of AERR, when run with $\eta = \sqrt{k/(2dm)}$. Then $\|\bar w\|_2 \le B$, and for any $w^\star \in \mathbb{R}^d$ with $\|w^\star\|_2 \le B$,
$$\mathbb{E}[L_D(\bar w)] \le L_D(w^\star) + 4B^2\sqrt{\frac{2d}{km}}.$$
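The NumPy sketch below is our own compact rendering of Algorithm 1 under the assumptions of Theorem 3.1; the synthetic data and all names are ours, and it is meant only to make the two sampling steps and the projection concrete, not to serve as a reference implementation.

```python
import numpy as np

def aerr(S, k, B, eta, rng):
    """Attribute Efficient Ridge Regression (Algorithm 1), sketched in NumPy.

    S is a list of (x, y) pairs with ||x||_2 <= 1 and |y| <= B; k is the
    per-example attribute budget (k + 1 attributes are observed in total)."""
    d = S[0][0].shape[0]
    w = np.full(d, B / np.sqrt(d))              # any nonzero w_1 with ||w_1||_2 <= B
    w_sum = np.zeros(d)
    for x, y in S:
        w_sum += w                              # line 14 averages the iterates w_t
        # Unbiased estimate of x from k uniformly sampled attributes (lines 3-7).
        idx = rng.integers(0, d, size=k)
        x_est = np.zeros(d)
        for i in idx:
            x_est[i] += d * x[i] / k
        # Unbiased estimate of w.x - y from one importance-sampled attribute (lines 8-9).
        p = w ** 2 / np.dot(w, w)
        j = rng.choice(d, p=p)
        phi = np.dot(w, w) * x[j] / w[j] - y
        g = phi * x_est                         # unbiased gradient estimate (line 10)
        v = w - eta * g                         # gradient step (line 11)
        w = v * B / max(np.linalg.norm(v), B)   # project onto the 2-norm ball (line 12)
    return w_sum / len(S)

# Toy usage with eta = sqrt(k / (2 d m)), as in our reading of Theorem 3.1.
rng = np.random.default_rng(0)
d, m, k, B = 30, 2000, 3, 1.0
w_true = rng.standard_normal(d)
w_true *= B / np.linalg.norm(w_true)
X = rng.standard_normal((m, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)
Y = np.clip(X @ w_true, -B, B)
w_hat = aerr(list(zip(X, Y)), k, B, eta=np.sqrt(k / (2 * d * m)), rng=rng)
print(float(np.mean((X @ w_hat - Y) ** 2) / 2))  # average square loss of w_hat
```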
3.1.1. Analysis

Theorem 3.1 is a consequence of the following two lemmas. The first lemma is obtained as a result of a standard regret bound for the OGD algorithm (see Zinkevich, 2003), applied to the vectors $\tilde g_1, \ldots, \tilde g_m$.

Lemma 3.2. For any $\|w^\star\|_2 \le B$ we have
$$\sum_{t=1}^{m} \tilde g_t \cdot (w_t - w^\star) \le \frac{2B^2}{\eta} + \frac{\eta}{2}\sum_{t=1}^{m} \|\tilde g_t\|_2^2. \qquad (4)$$

The second lemma shows that the vector $\tilde g_t$ is an unbiased estimator of the gradient $g_t := \nabla\ell_t(w_t)$ at iteration $t$, and establishes a variance bound for this estimator. To simplify notation, here and in the rest of the paper we use $\mathbb{E}_t[\cdot]$ to denote the conditional expectation with respect to all randomness up to time $t$.

Lemma 3.3. The vector $\tilde g_t$ is an unbiased estimator of the gradient $g_t := \nabla\ell_t(w_t)$, that is, $\mathbb{E}_t[\tilde g_t] = g_t$. In addition, for all $t$ we have $\mathbb{E}_t[\|\tilde g_t\|_2^2] \le 8B^2 d/k$.

For a proof of the lemma, see (Hazan & Koren, 2011). We now turn to prove Theorem 3.1.

Proof (of Theorem 3.1). First note that as $\|w_t\|_2 \le B$ for all $t$, we clearly have $\|\bar w\|_2 \le B$. Taking the expectation of (4) with respect to the randomization of the algorithm, and letting $\tilde G^2 := \max_t \mathbb{E}_t[\|\tilde g_t\|_2^2]$, we obtain
$$\mathbb{E}\left[\sum_{t=1}^{m} \tilde g_t \cdot (w_t - w^\star)\right] \le \frac{2B^2}{\eta} + \frac{\eta}{2}\tilde G^2 m.$$
On the other hand, the convexity of $\ell_t$ gives $\ell_t(w_t) - \ell_t(w^\star) \le g_t \cdot (w_t - w^\star)$, and $\mathbb{E}_t[\tilde g_t] = g_t$ by Lemma 3.3. Together with the above, this implies that for $\eta = 2B/(\tilde G\sqrt{m})$,
$$\mathbb{E}\left[\frac{1}{m}\sum_{t=1}^{m}\ell_t(w_t)\right] \le \frac{1}{m}\sum_{t=1}^{m}\ell_t(w^\star) + \frac{2B\tilde G}{\sqrt{m}}.$$
Taking the expectation of both sides with respect to the random choice of the training set, and using $\tilde G \le 2B\sqrt{2d/k}$ (according to Lemma 3.3), we get
$$\mathbb{E}\left[\frac{1}{m}\sum_{t=1}^{m} L_D(w_t)\right] \le L_D(w^\star) + 4B^2\sqrt{\frac{2d}{km}}.$$
Finally, recalling the convexity of $L_D$ and using Jensen's inequality, the theorem follows.

3.2. Lasso regression

We now turn to describe our algorithm for Lasso regression in the LAO setting, in which we would like to solve the problem
$$\min_{w\in\mathbb{R}^d} L_D(w) \quad \text{s.t.} \quad \|w\|_1 \le B. \qquad (5)$$
The algorithm we provide for this problem is based on a stochastic variant of the EG algorithm (Kivinen & Warmuth, 1997), which employs multiplicative updates based on an estimation of the gradients $\nabla\ell_t$. The multiplicative nature of the algorithm, however, makes it highly sensitive to the magnitude of the updates. To make the updates more robust, we clip the entries of the gradient estimator so as to prevent them from getting too large. Formally, this is accomplished via the following clip operation: $\operatorname{clip}(x, c) := \max\{\min\{x, c\}, -c\}$ for $x \in \mathbb{R}$ and $c > 0$. This clipping has an even stronger effect in the more general setting we consider in Section 4.

We give our Attribute Efficient Lasso Regression (AELR) algorithm in Algorithm 2, and establish a corresponding risk bound in the following theorem.

Algorithm 2 AELR
Parameters: $B, \eta > 0$
Input: training set $S = \{(x_t, y_t)\}_{t\in[m]}$ and $k > 0$
Output: regressor $\bar w$ with $\|\bar w\|_1 \le B$
1: Initialize $z_1^+ \leftarrow \mathbf{1}$, $z_1^- \leftarrow \mathbf{1}$
2: for $t = 1$ to $m$ do
3:   $w_t \leftarrow (z_t^+ - z_t^-)\, B / (\|z_t^+\|_1 + \|z_t^-\|_1)$
4:   for $r = 1$ to $k$ do
5:     Pick $i_{t,r} \in [d]$ uniformly and observe $x_t[i_{t,r}]$
6:     $\tilde x_{t,r} \leftarrow d\, x_t[i_{t,r}]\, e_{i_{t,r}}$
7:   end for
8:   $\tilde x_t \leftarrow \frac{1}{k}\sum_{r=1}^{k} \tilde x_{t,r}$
9:   Choose $j_t \in [d]$ with probability $|w_t[j]| / \|w_t\|_1$, and observe $x_t[j_t]$
10:  $\tilde\varphi_t \leftarrow \|w_t\|_1\, \operatorname{sign}(w_t[j_t])\, x_t[j_t] - y_t$
11:  $\tilde g_t \leftarrow \tilde\varphi_t\, \tilde x_t$
12:  for $i = 1$ to $d$ do
13:    $\bar g_t[i] \leftarrow \operatorname{clip}(\tilde g_t[i], 1/\eta)$
14:    $z_{t+1}^+[i] \leftarrow z_t^+[i]\exp(-\eta\, \bar g_t[i])$
15:    $z_{t+1}^-[i] \leftarrow z_t^-[i]\exp(+\eta\, \bar g_t[i])$
16:  end for
17: end for
18: $\bar w \leftarrow \frac{1}{m}\sum_{t=1}^{m} w_t$

Theorem 3.4. Assume the distribution $D$ is such that $\|x\|_\infty \le 1$ and $|y| \le B$ with probability 1. Let $\bar w$ be the output of AELR, when run with
$$\eta = \frac{1}{4B}\sqrt{\frac{2k\log 2d}{5md}}.$$
Then $\|\bar w\|_1 \le B$, and for any $w^\star \in \mathbb{R}^d$ with $\|w^\star\|_1 \le B$ we have
$$\mathbb{E}[L_D(\bar w)] \le L_D(w^\star) + 4B^2\sqrt{\frac{10\,d\log 2d}{km}},$$
provided that $m \ge \log 2d$.
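Analogously, here is our sketch of Algorithm 2 (names and the handling of the all-zero first iterate are ours): the estimate of $w_t \cdot x_t - y_t$ is importance-sampled from the 1-norm weights, the gradient estimate is clipped at $1/\eta$, and the update is multiplicative on the positive and negative parts $z^+$, $z^-$; the analysis of the clipping appears in the next subsection.

```python
import numpy as np

def aelr(S, k, B, eta, rng):
    """Attribute Efficient Lasso Regression (Algorithm 2), sketched in NumPy."""
    d = S[0][0].shape[0]
    z_pos, z_neg = np.ones(d), np.ones(d)
    w_sum = np.zeros(d)
    for x, y in S:
        w = (z_pos - z_neg) * B / (z_pos.sum() + z_neg.sum())
        w_sum += w
        # Unbiased estimate of x from k uniformly sampled attributes.
        idx = rng.integers(0, d, size=k)
        x_est = np.zeros(d)
        for i in idx:
            x_est[i] += d * x[i] / k
        # Unbiased estimate of w.x - y, importance-sampled with prob. |w[j]| / ||w||_1.
        l1 = np.abs(w).sum()
        if l1 > 0:
            j = rng.choice(d, p=np.abs(w) / l1)
            phi = l1 * np.sign(w[j]) * x[j] - y
        else:
            phi = -y                        # w = 0 (first round): w.x - y = -y exactly
        g = np.clip(phi * x_est, -1.0 / eta, 1.0 / eta)   # clip(., 1/eta)
        z_pos *= np.exp(-eta * g)           # multiplicative (EG-style) updates
        z_neg *= np.exp(+eta * g)
    return w_sum / len(S)
```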

3.2.1. Analysis

In the rest of the section, for a vector $v$ we let $v^2$ denote the vector for which $v^2[i] = (v[i])^2$ for all $i$. In order to prove Theorem 3.4, we first consider the augmented vectors $\bar z_t := (z_t^+, z_t^-) \in \mathbb{R}^{2d}$ and $\bar g_t' := (\bar g_t, -\bar g_t) \in \mathbb{R}^{2d}$, and let $p_t := \bar z_t / \|\bar z_t\|_1$. For these vectors, we have the following.

Lemma 3.5.
$$\sum_{t=1}^{m} p_t \cdot \bar g_t' \le \min_{i\in[2d]} \sum_{t=1}^{m} \bar g_t'[i] + \frac{\log 2d}{\eta} + \eta\sum_{t=1}^{m} p_t \cdot (\bar g_t')^2.$$

The lemma is a consequence of a second-order regret bound for the Multiplicative-Weights algorithm, essentially due to (Clarkson et al., 2010). By means of this lemma, we establish a risk bound with respect to the clipped linear functions $w \mapsto \bar g_t \cdot w$.

Lemma 3.6. Assume that $\mathbb{E}_t[\|\tilde g_t\|_\infty^2] \le G^2$ for all $t$, for some $G > 0$. Then, for any $\|w^\star\|_1 \le B$, we have
$$\mathbb{E}\left[\sum_{t=1}^{m} \bar g_t \cdot w_t\right] \le \mathbb{E}\left[\sum_{t=1}^{m} \bar g_t \cdot w^\star\right] + B\left(\frac{\log 2d}{\eta} + \eta G^2 m\right).$$

Our next step is to relate the risk generated by the linear functions $w \mapsto \tilde g_t \cdot w$ to that generated by the clipped functions $w \mapsto \bar g_t \cdot w$.

Lemma 3.7. Assume that $\mathbb{E}_t[\|\tilde g_t\|_\infty^2] \le G^2$ for all $t$, for some $G > 0$. Then, for $0 < \eta \le 1/(2G)$ we have
$$\mathbb{E}\left[\sum_{t=1}^{m} \tilde g_t \cdot w_t\right] \le \mathbb{E}\left[\sum_{t=1}^{m} \bar g_t \cdot w_t\right] + 4B\eta G^2 m.$$

The final component of the proof is a variance bound, similar to that of Lemma 3.3.

Lemma 3.8. The vector $\tilde g_t$ is an unbiased estimator of the gradient $g_t := \nabla\ell_t(w_t)$, that is, $\mathbb{E}_t[\tilde g_t] = g_t$. In addition, for all $t$ we have $\mathbb{E}_t[\|\tilde g_t\|_\infty^2] \le 8B^2 d/k$.

For the complete proofs, refer to (Hazan & Koren, 2011). We are now ready to prove Theorem 3.4.

Proof (of Theorem 3.4). Since $\|w_t\|_1 \le B$ for all $t$, we obtain $\|\bar w\|_1 \le B$. Next, note that as $\mathbb{E}_t[\tilde g_t] = g_t$, we have $\mathbb{E}[\sum_{t=1}^{m}\tilde g_t \cdot w_t] = \mathbb{E}[\sum_{t=1}^{m} g_t \cdot w_t]$. Putting Lemmas 3.6 and 3.7 together, we get for $\eta \le 1/(2G)$ that
$$\mathbb{E}\left[\sum_{t=1}^{m} \tilde g_t \cdot (w_t - w^\star)\right] \le B\left(\frac{\log 2d}{\eta} + 5\eta G^2 m\right).$$
Proceeding as in the proof of Theorem 3.1, and choosing $\eta = \frac{1}{G}\sqrt{\frac{\log 2d}{5m}}$, we obtain the bound
$$\mathbb{E}[L_D(\bar w)] \le L_D(w^\star) + 2BG\sqrt{\frac{5\log 2d}{m}}.$$
Note that for this choice of $\eta$ we indeed have $\eta \le 1/(2G)$, as we originally assumed that $m \ge \log 2d$. Finally, putting $G = 2B\sqrt{2d/k}$ as implied by Lemma 3.8, we obtain the bound in the statement of the theorem.

4. Support-vector regression

In this section we show how our approach can be extended to deal with loss functions other than the square loss, of the form
$$\ell(w \cdot x, y) = f(w \cdot x - y), \qquad (6)$$
(with $f$ real and convex) and, most importantly, with the $\delta$-insensitive absolute loss function of SVR, for which $f(x) = |x|_\delta := \max\{|x| - \delta, 0\}$ for some fixed $0 \le \delta \le B$ (recall that in our results we assume the labels $y_t$ have $|y_t| \le B$). For concreteness, we consider only the 2-norm variant of the problem (as in the standard formulation of SVR); the results we obtain can be easily adjusted to the 1-norm setting. We overload notation, and keep using the shorthand $\ell_t(w) := \ell(w \cdot x_t, y_t)$ for referring to the loss function induced by the instance $(x_t, y_t)$. It should be highlighted that our techniques can be adapted to deal with many other common loss functions, including classification losses (i.e., of the form $\ell(w \cdot x, y) = f(y\, w \cdot x)$). Due to its importance and popularity, we chose to describe our method in the context of SVR.

Unfortunately, there are strong indications that SVR learning (more generally, learning with a non-smooth loss function) in the LAO setting is impossible via our approach of unbiased gradient estimations (see Cesa-Bianchi et al. 2011 and the references therein). For that reason, we make two modifications to the learning setting: first, we shall henceforth relax the budget constraint to allow $k$ observed attributes per instance in expectation; and second, we shall aim for biased gradient estimators, instead of unbiased as before.
To obtain such biased estimators, we uniformly $\varepsilon$-approximate the function $f$ by an analytic function $f_\varepsilon$ and learn with the approximate loss function $\ell_t^\varepsilon(w) = f_\varepsilon(w \cdot x_t - y_t)$ instead. Clearly, any $\varepsilon$-suboptimal regressor of the approximate problem is a $2\varepsilon$-suboptimal regressor of the original problem. For learning the approximate problem we use a novel technique, inspired by (Cesa-Bianchi et al., 2011), for estimating gradients of analytic loss functions. Our estimators for $\nabla\ell_t^\varepsilon$ can then be viewed as biased estimators of $\nabla\ell_t$ (we note, however, that the resulting bias might be quite large).
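Before the formal procedure, a toy one-dimensional illustration (ours, not the paper's) of the randomized Taylor-series device may help: draw a degree $n$ with probability $2^{-(n+1)}$, multiply $n$ independent unbiased estimates of $z = w \cdot x - y$, and reweight by $2^{n+1} a_n$; in expectation this returns $f'(z) = \sum_n a_n z^n$ exactly. The sketch instantiates this for $f' = \operatorname{erf}$, the derivative that appears in Section 4.2, with Gaussian noise standing in for the attribute-sampled estimates.

```python
import math
import random

def erf_coeff(n):
    """Taylor coefficients of erf(z) = sum_n a_n z^n; the even ones vanish."""
    if n % 2 == 0:
        return 0.0
    m = (n - 1) // 2
    return 2.0 * (-1) ** m / (math.sqrt(math.pi) * math.factorial(m) * (2 * m + 1))

def randomized_taylor_estimate(noisy_z, rng):
    """One sample of the randomized Taylor estimator of erf(z).

    noisy_z() must return an independent unbiased estimate of z. The degree n is
    drawn with P(n) = 2**-(n+1); the product of n independent estimates is unbiased
    for z**n, so reweighting by 2**(n+1) * a_n makes the sample unbiased for erf(z)."""
    n = 0
    while rng.random() < 0.5:            # geometric draw: P(n) = 2**-(n+1)
        n += 1
    prod = 1.0
    for _ in range(n):
        prod *= noisy_z()
    return 2.0 ** (n + 1) * erf_coeff(n) * prod

rng = random.Random(0)
z = 0.7
noisy_z = lambda: z + rng.gauss(0.0, 0.1)    # unbiased, small-variance estimates of z
samples = [randomized_taylor_estimate(noisy_z, rng) for _ in range(200000)]
print(sum(samples) / len(samples), math.erf(z))   # the two values should be close
```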

4.1. Estimators for analytic loss functions

Let $f : \mathbb{R} \to \mathbb{R}$ be a real, analytic function (on the entire real line). The derivative $f'$ is thus also analytic and can be expressed as $f'(x) = \sum_{n=0}^{\infty} a_n x^n$, where $\{a_n\}$ are the Taylor expansion coefficients of $f'$. In Procedure 3 we give an unbiased estimator of $f'(w \cdot x - y)$ in the LAO setting, defined in terms of the coefficients $\{a_n\}$ of $f'$.

Procedure 3 GenEst
Parameters: $\{a_n\}_{n=0}^{\infty}$, the Taylor coefficients of $f'$
Input: regressor $w$, instance $(x, y)$
Output: $\hat\varphi$ with $\mathbb{E}[\hat\varphi] = f'(w \cdot x - y)$
1: Let $N = 4B^2$
2: Choose $n \ge 0$ with probability $\Pr[n] = (\tfrac{1}{2})^{n+1}$
3: if $n \le 2\log_2 N$ then
4:   for $r = 1, \ldots, n$ do
5:     Choose $j \in [d]$ with probability $w[j]^2 / \|w\|_2^2$, and observe $x[j]$
6:     $\tilde\theta_r \leftarrow \|w\|_2^2\, x[j] / w[j] - y$
7:   end for
8: else
9:   for $r = 1, \ldots, n$ do
10:    Choose $j_1, \ldots, j_N \in [d]$ w.p. $w[j]^2 / \|w\|_2^2$ (independently), and observe $x[j_1], \ldots, x[j_N]$
11:    $\tilde\theta_r \leftarrow \frac{1}{N}\sum_{s=1}^{N} \|w\|_2^2\, x[j_s] / w[j_s] - y$
12:  end for
13: end if
14: $\hat\varphi \leftarrow 2^{n+1} a_n\, \tilde\theta_1 \tilde\theta_2 \cdots \tilde\theta_n$

For this estimator, we have the following (proof is omitted).

Lemma 4.1. The estimator $\hat\varphi$ is an unbiased estimator of $f'(w \cdot x - y)$. Also, assuming $\|x\|_2 \le 1$, $\|w\|_2 \le B$ and $|y| \le B$, the second moment $\mathbb{E}[\hat\varphi^2]$ is upper bounded by $\exp(O(\log^2 B))$, provided that the Taylor series of $f'(x)$ converges absolutely for $|x| \le 1$. Finally, the expected number of attributes of $x$ used by this estimator is no more than 3.

4.2. Approximating SVR

In order to approximate the $\delta$-insensitive absolute loss function, we define
$$f_\varepsilon(x) = \frac{\varepsilon}{2}\,\rho\!\left(\frac{x - \delta}{\varepsilon}\right) + \frac{\varepsilon}{2}\,\rho\!\left(\frac{x + \delta}{\varepsilon}\right) - \delta,$$
where $\rho$ is expressed in terms of the error function $\operatorname{erf}$,
$$\rho(x) = x\operatorname{erf}(x) + \frac{1}{\sqrt{\pi}}e^{-x^2},$$
and consider the approximate loss functions $\ell_t^\varepsilon(w) = f_\varepsilon(w \cdot x_t - y_t)$. Indeed, we have the following.

Claim 4.2. For any $\varepsilon > 0$, $f_\varepsilon$ is convex, analytic on the entire real line, and
$$\sup_{x\in\mathbb{R}} \left|f_\varepsilon(x) - |x|_\delta\right| \le \varepsilon.$$

The claim follows easily from the identity $|x|_\delta = \frac{1}{2}|x - \delta| + \frac{1}{2}|x + \delta| - \delta$. In addition, for using Procedure 3 we need the following simple observation, which follows immediately from the series expansion of $\operatorname{erf}(x)$.

Claim 4.3. $\rho'(x) = \sum_{n=0}^{\infty} a_{2n+1} x^{2n+1}$, with the coefficients $\{a_{2n+1}\}_{n\ge 0}$ given in (7).

We now give the main result of this section, which is a sample complexity bound for the Attribute Efficient SVR (AESVR) algorithm, given in Algorithm 4.

Algorithm 4 AESVR
Parameters: $B, \delta, \eta > 0$ and accuracy $\varepsilon > 0$
Input: training set $S = \{(x_t, y_t)\}_{t\in[m]}$ and $k > 0$
Output: regressor $\bar w$ with $\|\bar w\|_2 \le B$
1: Let $a_{2n} = 0$ for $n \ge 0$, and
$$a_{2n+1} = \frac{2\,(-1)^n}{\sqrt{\pi}\, n!\,(2n+1)}, \quad n \ge 0. \qquad (7)$$
2: Execute Algorithm 1 with lines 8-9 replaced by:
   $x_t \leftarrow x_t / \varepsilon$
   $y_t^+ \leftarrow (y_t + \delta)/\varepsilon$, $\quad y_t^- \leftarrow (y_t - \delta)/\varepsilon$
   $\tilde\varphi_t \leftarrow \frac{1}{2}\left[\operatorname{GenEst}(w_t, x_t, y_t^+) + \operatorname{GenEst}(w_t, x_t, y_t^-)\right]$
3: Return the output $\bar w$ of the algorithm

Theorem 4.4. Assume the distribution $D$ is such that $\|x\|_2 \le 1$ and $|y| \le B$ with probability 1. Then, for any $w^\star \in \mathbb{R}^d$ with $\|w^\star\|_2 \le B$, we have
$$\mathbb{E}[L_D(\bar w)] \le L_D(w^\star) + \varepsilon,$$
where $\bar w$ is the output of AESVR (with $\eta$ properly tuned) on a training set of size
$$m = O\!\left(\frac{d}{k}\right)\exp\!\left(O\!\left(\log^2\frac{B}{\varepsilon}\right)\right). \qquad (8)$$
The algorithm queries at most $k + 6$ attributes of each instance in expectation.
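As a quick numerical illustration of Claim 4.2 (ours, using SciPy's erf), the following snippet evaluates the smoothed loss $f_\varepsilon$ on a grid and checks that it stays uniformly within $\varepsilon$ of the $\delta$-insensitive loss $|x|_\delta$.

```python
import numpy as np
from scipy.special import erf

def rho(x):
    # rho(x) = x erf(x) + exp(-x^2) / sqrt(pi), a smooth approximation of |x|
    return x * erf(x) + np.exp(-x ** 2) / np.sqrt(np.pi)

def f_eps(x, delta, eps):
    # f_eps(x) = (eps/2) rho((x - delta)/eps) + (eps/2) rho((x + delta)/eps) - delta
    return 0.5 * eps * (rho((x - delta) / eps) + rho((x + delta) / eps)) - delta

def delta_insensitive(x, delta):
    # |x|_delta = max(|x| - delta, 0), the SVR loss
    return np.maximum(np.abs(x) - delta, 0.0)

delta, eps = 0.1, 0.05
x = np.linspace(-3.0, 3.0, 10001)
gap = np.max(np.abs(f_eps(x, delta, eps) - delta_insensitive(x, delta)))
print(gap, gap <= eps)   # the uniform gap stays below eps, as Claim 4.2 states
```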

Proof. First, note that for the approximate loss functions $\ell_t^\varepsilon$ we have
$$\nabla\ell_t^\varepsilon(w_t) = \frac{1}{2}\left[\rho'(w_t \cdot x_t - y_t^+) + \rho'(w_t \cdot x_t - y_t^-)\right] x_t,$$
in terms of the rescaled quantities $x_t$, $y_t^+$, $y_t^-$ of Algorithm 4. Hence, Lemma 4.1 and Claim 4.3 above imply that $\tilde g_t$ in Algorithm 4 is an unbiased estimator of $\nabla\ell_t^\varepsilon(w_t)$. Furthermore, since $\|x_t\|_2 \le \frac{1}{\varepsilon}$ and $|y_t^\pm| \le \frac{2B}{\varepsilon}$, according to the same lemma we have $\mathbb{E}_t[\tilde\varphi_t^2] = \exp(O(\log^2\frac{B}{\varepsilon}))$. Repeating the proof of Lemma 3.3, we then have
$$\mathbb{E}_t[\|\tilde g_t\|_2^2] = \mathbb{E}_t[\tilde\varphi_t^2]\,\mathbb{E}_t[\|\tilde x_t\|_2^2] = \exp\!\left(O\!\left(\log^2\frac{B}{\varepsilon}\right)\right)\frac{d}{k}.$$
Replacing $\tilde G^2$ in the proof of Theorem 3.1 with the above bound, we get for the output $\bar w$ of Algorithm 4,
$$\mathbb{E}[L_D(\bar w)] \le L_D(w^\star) + \sqrt{\exp\!\left(O\!\left(\log^2\frac{B}{\varepsilon}\right)\right)\frac{d}{km}},$$
which implies that for obtaining an $\varepsilon$-accurate regressor $\bar w$ of the approximate problem, it is enough to take $m$ as given in (8). However, Claim 4.2 now gives that $\bar w$ itself is a $2\varepsilon$-accurate regressor of the original problem, and the proof is complete.

5. Experiments

In this section we give experimental evidence that supports our theoretical bounds, and demonstrates the superior performance of our algorithms compared to the state of the art. Naturally, we chose to compare our AERR and AELR algorithms⁵ with the AER algorithm of (Cesa-Bianchi et al., 2010). We note that AER is in fact a hybrid algorithm that combines 1-norm and 2-norm regularizations, thus we used it for benchmarking in both the Ridge and Lasso settings.

⁵ The AESVR algorithm is presented mainly for theoretical considerations, and was not implemented in the experiments.

We essentially repeated the experiments of (Cesa-Bianchi et al., 2010) and used the popular MNIST digit recognition dataset (LeCun et al., 1998). Each instance in this dataset is a 28x28 image of a handwritten digit 0-9. We focused on the "3" vs. "5" task, on a subset of the dataset that consists of the "3" digits (labeled -1) and the "5" digits (labeled +1). We applied the regression algorithms to this task by regressing to the labels. In all our experiments, we randomly split the data into training and test sets, and used 10-fold cross-validation for tuning the parameters of each algorithm. Then, we ran each algorithm on increasingly longer prefixes of the dataset and tracked the obtained square error on the test set. For faithfully comparing partial- and full-information algorithms, we also recorded the total number of attributes used by each algorithm.

In our first experiment, we executed AELR, AER and (offline) Lasso on the "3" vs. "5" task. We allowed both AELR and AER to use only k = 4 pixels of each training image, while giving Lasso unrestricted access to the entire set of attributes (a total of 784) of each instance. The results, averaged over 10 runs on random train/test splits, are presented in Figure 1.

[Figure 1. Test square error of Lasso algorithms with k = 4, over increasing prefixes of the "3" vs. "5" dataset. Curves: AELR, AER, Offline; x-axis: number of attributes (scale 10^4); y-axis: test square error.]

Note that the x-axis represents the cumulative number of attributes used for training. The graph ends at roughly 48500 attributes, which is the total number of attributes allowed for the partial-information algorithms. Lasso, however, exhausts this budget after seeing merely 62 examples. As we see from the results, AELR keeps its test error significantly lower than that of AER along the entire execution, almost bridging the gap with the full-information Lasso. Note that the latter has the clear advantage of being an offline algorithm, while both AELR and AER are online in nature. Indeed, when we compared AELR with an online Lasso solver, our algorithm obtained a test error almost 10 times better.

In the second experiment, we evaluated AERR, AER and Ridge regression on the same task, but now allowing the partial-information algorithms to use as many as k = 56 pixels (which amounts to 2 rows) of each instance. The results of this experiment are given in Figure 2. We see that even if we allow the algorithms to view a considerable number of attributes, the gap between AERR and AER is large.

[Figure 2. Test square error of Ridge algorithms with k = 56, over increasing prefixes of the "3" vs. "5" dataset. Curves: AERR, AER, Offline; x-axis: number of attributes (scale 10^5); y-axis: test square error.]
6. Conclusions and Open Questions

We have considered the fundamental problem of statistical regression analysis, and in particular Lasso and Ridge regression, in a setting where the observation upon each training instance is limited to a few attributes, and gave algorithms that improve over the state of the art by a leading order term with respect to the sample complexity. This resolves an open question of (Cesa-Bianchi et al., 2010). The algorithms are efficient, and give a clear experimental advantage in previously-considered benchmarks.

For the challenging case of regression with general convex loss functions, we described an exponential improvement in sample complexity, which applies in particular to support-vector regression.

It is interesting to resolve the sample complexity gap of $\frac{1}{\varepsilon}$ which still remains for Lasso regression, and to improve upon the pseudo-polynomial factor in $\varepsilon$ for support-vector regression. In addition, establishing analogous bounds for our algorithms that hold with high probability (rather than in expectation) appears to be non-trivial, and is left for future work. Another possible direction for future research is adapting our results to the setting of learning with (randomly) missing data, which was recently investigated; see e.g. (Rostamizadeh et al., 2011; Loh & Wainwright, 2011). The sample complexity bounds our algorithms obtain in this setting are slightly worse than those presented in the current paper, and it is interesting to check if one can do better.

Acknowledgments

We thank Shai Shalev-Shwartz for several useful discussions, and the anonymous referees for their detailed comments.

References

Ben-David, S. and Dichterman, E. Learning with restricted focus of attention. Journal of Computer and System Sciences, 56(3):277-298, 1998.

Cesa-Bianchi, N., Shalev-Shwartz, S., and Shamir, O. Efficient learning with partially observed attributes. In Proceedings of the 27th International Conference on Machine Learning, 2010.

Cesa-Bianchi, N., Shalev-Shwartz, S., and Shamir, O. Online learning of noisy data. IEEE Transactions on Information Theory, 57(12):7907-7931, Dec. 2011. ISSN 0018-9448. doi: 10.1109/TIT.2011.2164053.

Clarkson, K. L., Hazan, E., and Woodruff, D. P. Sublinear optimization for machine learning. In 2010 IEEE 51st Annual Symposium on Foundations of Computer Science, pp. 449-457. IEEE, 2010.

Haussler, D. Decision theoretic generalizations of the PAC model for neural net and other learning applications. Information and Computation, 100(1):78-150, 1992.

Hazan, E. and Koren, T. Optimal algorithms for ridge and lasso regression with partially observed attributes. arXiv preprint arXiv:1108.4559, 2011.

Hazan, E., Agarwal, A., and Kale, S. Logarithmic regret algorithms for online convex optimization. Machine Learning, 69(2):169-192, 2007.

Kakade, S. M., Sridharan, K., and Tewari, A. On the complexity of linear prediction: Risk bounds, margin bounds, and regularization. In Advances in Neural Information Processing Systems, volume 22, 2008.

Kivinen, J. and Warmuth, M. K. Exponentiated gradient versus gradient descent for linear predictors. Information and Computation, 132(1):1-63, 1997.

LeCun, Y., Bottou, L., Bengio, Y., and Haffner, P. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278-2324, 1998.

Loh, P. L. and Wainwright, M. J. High-dimensional regression with noisy and missing data: Provable guarantees with non-convexity. In Advances in Neural Information Processing Systems, 2011.

Rostamizadeh, A., Agarwal, A., and Bartlett, P. Learning with missing features. In The 27th Conference on Uncertainty in Artificial Intelligence, 2011.

Vapnik, V. N. The Nature of Statistical Learning Theory. Springer-Verlag, 1995.

Zinkevich, M. Online convex programming and generalized infinitesimal gradient ascent. In Proceedings of the 20th International Conference on Machine Learning, pp. 928-936. ACM, 2003.