Minimax Learning for Remote Prediction


Cheuk Ting Li*, Xiugang Wu†, Ayfer Ozgur†, Abbas El Gamal†
*Department of Electrical Engineering and Computer Sciences, University of California, Berkeley. Email: ctli@berkeley.edu
†Department of Electrical Engineering, Stanford University. Email: {x23wu, aozgur}@stanford.edu, abbas@ee.stanford.edu
arXiv v1 [cs.IT] 31 May 2018

Abstract: The classical problem of supervised learning is to infer an accurate predictor of a target variable Y from a measured variable X by using a finite number of labeled training samples. Motivated by the increasingly distributed nature of data and decision making, in this paper we consider a variation of this classical problem in which the prediction is performed remotely based on a rate-constrained description M of X. Upon receiving M, the remote node computes an estimate Ŷ of Y. We follow the recent minimax approach to study this learning problem and show that it corresponds to a one-shot minimax noisy source coding problem. We then establish information theoretic bounds on the risk-rate Lagrangian cost and a general method to design a near-optimal descriptor-estimator pair, which can be viewed as a rate-constrained analog to the maximum conditional entropy principle used in the classical minimax learning problem. Our results show that a naive estimate-compress scheme for rate-constrained prediction is not in general optimal.

I. INTRODUCTION

The classical problem of supervised learning is to infer an accurate predictor of a target variable Y from a measured variable X on the basis of n labeled training samples {(X_i, Y_i)}_{i=1}^n independently drawn from an unknown joint distribution P. The standard approach for solving this problem in statistical learning theory is empirical risk minimization (ERM). For a given set of allowable predictors and a loss function that quantifies the risk of each predictor, ERM chooses the predictor with minimal risk under the empirical distribution of the samples. To avoid overfitting, the set of allowable predictors is restricted to a class with limited complexity. Recently, an alternative viewpoint has emerged which seeks distributionally robust predictors. Given the labeled training samples, this approach learns a predictor by minimizing its worst-case risk over an ambiguity distribution set centered at the empirical distribution of the samples. In other words, instead of restricting the set of allowable predictors, it aims to avoid overfitting by requiring that the learned predictor perform well under any distribution in a chosen neighborhood of the empirical distribution. This minimax approach has been investigated under different assumptions on how the ambiguity set is constructed, e.g., by restricting the moments [1], or by forming f-divergence balls [2] or Wasserstein balls [3] (see also the references therein).

In these previous works, the learning algorithm finds a predictor that acts directly on a fresh unlabeled sample X to predict the corresponding target variable Y. Often, however, the fresh sample X may be only remotely available, and when designing the predictor it is desirable to also take into account the cost of communicating X. This is motivated by the fact that bandwidth and energy limitations on communication in networks and within multiprocessor systems often impose significant bottlenecks on the performance of algorithms. There are also an increasing number of applications in which data is generated in a distributed manner and it, or features of it, are communicated over bandwidth-limited links to a central processor to perform inference.
For instance, applications such as Google Goggles and Siri process the locally collected data on clouds. It is thus important to study prediction in distributed and rate-constrained settings. In this paper, we study an extension of the classical learning problem in which, given a finite set of training samples, the learning algorithm needs to infer a descriptor-estimator pair with a desired communication rate in between them. This is especially relevant when both X and Y come from a large alphabet or are continuous random variables, as in regression problems, so neither the sample X nor its predicted value of Y can simply be communicated in a lossless fashion. We adopt the minimax framework for learning the descriptor-estimator pair. Given a set of labeled training samples, our goal is to find a descriptor-estimator pair by minimizing their resultant worst-case risk over an ambiguity distribution set, where the risk now incorporates both the statistical risk and the communication cost.

One of the important conclusions that emerge from the minimax approach to supervised learning in [1] is that the problem of finding the predictor with minimal worst-case risk over an ambiguity set can be broken into two smaller steps: (1) find the worst-case distribution in the ambiguity set that maximizes the generalized conditional entropy of Y given X, and (2) find the optimal predictor under this worst-case distribution. In this paper, we show that an analogous principle approximately holds for rate-constrained prediction.

This paper will be presented in part at the IEEE International Symposium on Information Theory, Vail, Colorado, USA, June 2018.

The descriptor-estimator pair with minimal worst-case risk can be found in two steps: (1) find the worst-case distribution in the ambiguity set that maximizes the risk-information Lagrangian cost, and (2) find the optimal descriptor-estimator pair under this worst-case distribution. We then apply our results to characterize the optimal descriptor-estimator pairs for two applications: rate-constrained linear regression and rate-constrained classification. While a simple scheme, whereby we first find the optimal predictor ignoring the rate constraint and then compress and communicate the predictor output, is optimal for the linear regression application, we show via the classification application that such an estimate-compress approach is not optimal in general. We show that when prediction is rate-constrained, the optimal descriptor aims to send sufficiently (but not necessarily maximally) informative features of the observed variable, which are at the same time easy to communicate. When applied to the case in which the ambiguity distribution set contains only a single distribution (for example, the true or empirical distribution of (X, Y)) and the loss function for the prediction is the logarithmic loss, our results provide a new one-shot operational interpretation of the information bottleneck problem. A key technical ingredient in our results is the strong functional representation lemma (SFRL) developed in [4], which we use to design the optimal descriptor-estimator pair for the worst-case distribution.

Notation: We assume that log is base 2 and the entropy H is in bits. The length of a variable-length description M ∈ {0, 1}* is denoted by |M|. For random variables U, V, we denote the joint distribution by P_{U,V} and the conditional distribution of U given V by P_{U|V}. For brevity we denote the distribution of (X, Y) by P. We write I_P(X; Ŷ) for I(X; Ŷ) when (X, Y) ~ P and the conditional distribution P_{Ŷ|X} is clear from the context.

II. PROBLEM FORMULATION

We begin by reviewing the minimax approach to the classical learning problem [1].

A. Minimax Approach to Supervised Learning

Let X ∈ X and Y ∈ Y be jointly distributed random variables. The problem of statistical learning is to design an accurate predictor of a target variable Y from a measured variable X on the basis of a number of independent training samples {(X_i, Y_i)}_{i=1}^n drawn from an unknown joint distribution. The standard approach for solving this problem is to use empirical risk minimization (ERM), in which one defines an admissible class of predictors F consisting of functions f : X -> Ŷ (where the reconstruction alphabet Ŷ can in general be different from Y) and a loss function l : Ŷ x Y -> R. The risk associated with a predictor f when the underlying joint distribution of X and Y is P is

L(f, P) = E_P[l(f(X), Y)].

ERM simply chooses the predictor f_n ∈ F with minimal risk under the empirical distribution P_n of the training samples. Recently, an alternative approach has emerged which seeks distributionally robust predictors. This approach learns a predictor by minimizing its worst-case risk over an ambiguity distribution set Γ(P_n), i.e.,

f_n = argmin_f max_{P ∈ Γ(P_n)} L(f, P),     (1)

where f can be any function and Γ(P_n) can be constructed in various ways, e.g., by restricting the moments, or by forming f-divergence balls or Wasserstein balls. While in ERM it is important to restrict the set F of admissible predictors to a low-complexity class to prevent overfitting, in the minimax approach overfitting is prevented by explicitly requiring that the chosen predictor be distributionally robust. A toy numerical illustration of (1) is sketched below.
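The following brute-force sketch makes (1) concrete for finite alphabets and a finite ambiguity set; it is our own illustration (restricted to deterministic predictors), not a construction from the paper.

import itertools
import numpy as np

def minimax_predictor(gamma, loss):
    """Brute-force f_n = argmin_f max_{P in Gamma} E_P[ l(f(X), Y) ]   (cf. (1)).
    gamma: list of joint pmfs P[x, y]; loss[yhat, y]: loss table.
    Enumerates all deterministic predictors f: X -> Yhat, so only usable for tiny alphabets;
    the true minimizer in (1) may in general be a randomized predictor."""
    n_x, n_y = gamma[0].shape
    n_yhat = loss.shape[0]
    best_f, best_worst = None, np.inf
    for f in itertools.product(range(n_yhat), repeat=n_x):        # f[x] is the predicted label
        worst = max(sum(P[x, y] * loss[f[x], y] for x in range(n_x) for y in range(n_y))
                    for P in gamma)                               # worst-case risk over Gamma
        if worst < best_worst:
            best_f, best_worst = f, worst
    return best_f, best_worst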
The learned function f_n can then be used for predicting Y when presented with fresh samples of X. The learning and inference phases are illustrated in Fig. 1.

Fig. 1. Minimax approach to supervised learning.

B. Minimax Learning for Remote Prediction

In this paper, we extend the minimax learning approach to the setting in which the prediction needs to be performed based on a rate-constrained description of X. In particular, given a finite set of training samples {(X_i, Y_i)}_{i=1}^n independently drawn from an unknown joint distribution P, our goal is to learn a pair of functions (e, f), where e is a descriptor used to compress X into M = e(X) ∈ {0, 1}* (a prefix-free code), and f is an estimator that takes the compression M and generates an estimate Ŷ of Y; see Fig. 2. Let R(e, P) = E_P[|e(X)|] be the rate of the descriptor e and L(e, f, P) = E_P[l(f(e(X)), Y)] be the risk associated with the descriptor-estimator pair (e, f) when the underlying distribution of (X, Y) is P, and define the risk-rate Lagrangian cost parametrized by λ > 0 as

L_λ(e, f, P) = L(e, f, P) + λ R(e, P).     (2)

Note that this cost function takes into account both the resultant statistical prediction risk of (e, f) and the communication rate they require. The task of a minimax learner is to find an (e_n, f_n) pair that minimizes the worst-case L_λ(e, f, P) over the ambiguity distribution set Γ(P_n), i.e.,

(e_n, f_n) = argmin_{(e, f)} max_{P ∈ Γ(P_n)} L_λ(e, f, P),     (3)

for an appropriately chosen Γ(P_n) centered at the empirical distribution of the samples P_n. Note that we allow here all possible (e, f) pairs. We also assume that the descriptor and the estimator can use unlimited common randomness W which is independent of the data, i.e., e and f can be expressed as functions of (X, W) and (M, W), respectively, and the prefix-free codebook for M can depend on W. The availability of such common randomness can be justified by the fact that in practice, although the inference scheme is one-shot, it is used many times by the same user and by different users, hence the descriptor and the estimator can share a common randomness seed before communication commences without impacting the communication rate.

Fig. 2. Minimax learning for remote prediction.

III. MAIN RESULTS

We first consider the case where Γ consists of a single distribution P, which may be the empirical distribution P_n as in ERM. Define the minimax risk-rate cost as

L_λ(Γ) = inf_{(e, f)} sup_{P ∈ Γ} L_λ(e, f, P),     (4)

which for Γ = {P} reduces to inf_{(e, f)} L_λ(e, f, P). While it is difficult to minimize the risk-rate cost directly, the minimax risk-rate cost can be bounded in terms of the mutual information between X and Ŷ.

Theorem 1. Let Γ = {P}. Then

inf_{P_{Ŷ|X}} { E[l(Ŷ, Y)] + λ I(X; Ŷ) }  ≤  L_λ(Γ)  ≤  inf_{P_{Ŷ|X}} { E[l(Ŷ, Y)] + λ ( I(X; Ŷ) + log(I(X; Ŷ) + 1) + 5 ) }.

As in other one-shot compression results (e.g., zero-error compression), there is a gap between the upper and lower bounds. While the logarithmic gap in Theorem 1 is not as small as the 1-bit gap in zero-error compression, it is dominated by the linear term I(X; Ŷ) when the latter is large.
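As a concrete reading of the quantities in (2) and (4), the following small sketch (our own, using a deterministic descriptor and no common randomness) evaluates the risk-rate Lagrangian cost of a given descriptor-estimator pair on finite alphabets; Theorem 1 lower-bounds the infimum of this quantity over all pairs by E[l(Ŷ, Y)] + λ I(X; Ŷ).

import numpy as np

def risk_rate_cost(encoder, decoder, codeword_len, p_xy, loss, lam):
    """L_lambda(e, f, P) = E_P[ l(f(e(X)), Y) ] + lam * E_P[ |e(X)| ]   (cf. (2)).
    encoder(x) -> message index m; codeword_len[m]: length of the prefix-free codeword for m;
    decoder(m) -> yhat; p_xy[x, y]: joint pmf; loss[yhat, y]: loss table; lam: Lagrange multiplier."""
    cost = 0.0
    for x in range(p_xy.shape[0]):
        m = encoder(x)
        yhat = decoder(m)
        cost += lam * p_xy[x].sum() * codeword_len[m]                          # rate term
        cost += sum(p_xy[x, y] * loss[yhat, y] for y in range(p_xy.shape[1]))  # risk term
    return cost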

To prove Theorem 1, we use the strong functional representation lemma given in [4] (also see [5], [6]): for any random variables X, Ŷ, there exists a random variable W independent of X such that Ŷ is a function of (X, W) and

H(Ŷ | W) ≤ I(X; Ŷ) + log(I(X; Ŷ) + 1) + 4.     (5)

Here, W can be intuitively viewed as the part of Ŷ which is not contained in X. Note that for any W such that Ŷ is a function of (X, W) and W is independent of X, we have H(Ŷ | W) ≥ I(X; Ŷ). The statement (5) ensures the existence of a W, independent of X, which comes close to this lower bound, and in this sense is most informative about Ŷ. This is critical for the proof of Theorem 1, as we will see next. Identifying the part of Ŷ which is not contained in X allows us to generate and share this part between the descriptor and the estimator ahead of time, eliminating the need to communicate it during the course of inference. To find W, we use the Poisson functional representation construction detailed in [4].

Proof of Theorem 1: Recall that Ŷ = f(e(X, W), W). The lower bound follows from the fact that I_P(X; Ŷ) ≤ H_P(M | W) ≤ E[|M|]. To establish the upper bound, fix any P_{Ŷ|X}. Let W be obtained from (5). Note that W is independent of X and can be generated from a random seed shared between the descriptor and the estimator ahead of time. For a given w, take m = e(x, w) to be the Huffman codeword of ŷ(x, w) according to the distribution P_{Ŷ|W=w} (recall that Ŷ is a function of (X, W)), and take f(m, w) to be the decoding function of the Huffman code. The expected codeword length satisfies

E[|M|] ≤ H(Ŷ | W) + 1 ≤ I(X; Ŷ) + log(I(X; Ŷ) + 1) + 5.

Taking an infimum over all P_{Ŷ|X} completes the proof.

Remark 1. If we consider the logarithmic loss l(ŷ, y) = log(1/ŷ(y)), where ŷ is a distribution over Y, then the lower bound in Theorem 1 reduces to

inf_{P_{U|X}} { H(Y | U) + λ I(X; U) } = H(Y) + inf_{P_{U|X}} { λ I(X; U) - I(Y; U) },

which is the information bottleneck function [7]. Therefore the setting of remote prediction provides an approximate one-shot operational interpretation of the information bottleneck, up to a logarithmic gap. In [8], [9] it was shown that the asymptotic noisy source coding problem also provides an operational interpretation of the information bottleneck. Our operational interpretation, however, is more satisfying, since the feature extraction problem originally considered in [7] is by nature one-shot.

We now extend Theorem 1 to the minimax setting.

Theorem 2. Suppose Γ is convex. Then

inf_{P_{Ŷ|X}} sup_{P ∈ Γ} { E_P[l(Ŷ, Y)] + λ I_P(X; Ŷ) }  ≤  L_λ(Γ)  ≤  inf_{P_{Ŷ|X}} sup_{P ∈ Γ} { E_P[l(Ŷ, Y)] + λ ( I_P(X; Ŷ) + 2 log(I_P(X; Ŷ) + 2) + 2 ) }.

This result is related to minimax noisy source coding [10]. The main difference is that we consider the one-shot expected length instead of the asymptotic rate. To prove this theorem, we first invoke a minimax result for relative entropy in [11], which generalizes the redundancy-capacity theorem [12]. Then we apply the following refined version of the strong functional representation lemma, which is proved in the proof of Theorem 1 in [4] (also see [5]).

Lemma 1. For any P_{Ŷ|X} and P_Ŷ, there exist a random variable W independent of X and functions k(x, w) ∈ {1, 2, ...} and ŷ(k, w) such that ŷ(k(x, W), W) | {X = x} ~ P_{Ŷ|X=x} and

E[log k(x, W)] ≤ D(P_{Ŷ|X=x} ‖ P_Ŷ) + 1.
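Lemma 1 is realized by the Poisson functional representation selection rule of [4]. The following is a minimal sketch of one draw of that rule for finite alphabets (function and variable names are ours); it assumes q(ŷ) > 0 wherever p(ŷ) > 0.

import numpy as np

def poisson_functional_representation(p, q, rng):
    """One draw of the Poisson functional representation selection rule (cf. Lemma 1 and [4]).
    p: target pmf P_{Yhat|X=x};  q: reference pmf P_Yhat, with q > 0 wherever p > 0.
    The shared randomness W consists of the arrival times T_k of a rate-1 Poisson process
    and i.i.d. proposals Yhat_k ~ q.  Returns (K, Yhat_K); Yhat_K is distributed according to p,
    and E[log K] stays close to D(p || q), so K is cheap to describe."""
    w = np.divide(p, q, out=np.zeros_like(p, dtype=float), where=q > 0)   # density ratio dP/dQ
    t, best_score, best_k, best_y = 0.0, np.inf, None, None
    k = 0
    while t <= best_score * w.max():        # future atoms have T >= t, so none can beat best_score
        k += 1
        t += rng.exponential(1.0)           # T_k = T_{k-1} + Exp(1)
        y = rng.choice(len(q), p=q)         # proposal Yhat_k ~ q
        score = t / w[y] if w[y] > 0 else np.inf    # T_k / (dP/dQ)(Yhat_k)
        if score < best_score:
            best_score, best_k, best_y = score, k, y
    return best_k, best_y

Because the estimator can reproduce Ŷ_K from K and the shared randomness, the descriptor only has to communicate the integer K, e.g., with a code for the positive integers.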

We are now ready to prove Theorem 2.

Proof: For the lower bound, note that any (e, f) induces a conditional distribution P_{Ŷ|X} that does not depend on P, and for every P ∈ Γ we have E_P[|M|] ≥ H_P(M | W) ≥ I_P(X; Ŷ); hence sup_{P ∈ Γ} L_λ(e, f, P) ≥ sup_{P ∈ Γ} { E_P[l(Ŷ, Y)] + λ I_P(X; Ŷ) } ≥ inf_{P_{Ŷ|X}} sup_{P ∈ Γ} { E_P[l(Ŷ, Y)] + λ I_P(X; Ŷ) }. To prove the upper bound, we fix any P_{Ŷ|X} and show that the following risk-rate cost is achievable:

L = sup_{P ∈ Γ} { E_P[l(Ŷ, Y)] + λ ( I_P(X; Ŷ) + 2 log(I_P(X; Ŷ) + 2) + 2 ) }.

Let

g(P, P_Ŷ) = E_P[l(Ŷ, Y)] + λ ( ∫ D(P_{Ŷ|X=x} ‖ P_Ŷ) dP(x) + 2 log( ∫ D(P_{Ŷ|X=x} ‖ P_Ŷ) dP(x) + 2 ) + 2 ).

Note that g is concave in P for fixed P_Ŷ, since E_P[l(Ŷ, Y)] and ∫ D(P_{Ŷ|X=x} ‖ P_Ŷ) dP(x) are linear in P and t ↦ t + 2 log(t + 2) is concave and nondecreasing. Also, g is quasiconvex in P_Ŷ for fixed P, since ∫ D(P_{Ŷ|X=x} ‖ P_Ŷ) dP(x) is convex in P_Ŷ, and it is lower semicontinuous in P_Ŷ, since D(P_{Ŷ|X=x} ‖ P_Ŷ) is lower semicontinuous with respect to the topology of weak convergence [13], and hence ∫ D(P_{Ŷ|X=x} ‖ P_Ŷ) dP(x) is lower semicontinuous by Fatou's lemma.

Write P_Ŷ^P for the distribution of Ŷ when (X, Y) ~ P and Ŷ | {X = x} ~ P_{Ŷ|X=x}. Let Γ_Ŷ = {P_Ŷ^P : P ∈ Γ}, and let the closure of Γ_Ŷ in the topology of weak convergence be denoted Γ̄_Ŷ. It can be shown, using the same arguments as in [11] applied to g instead of the relative entropy, and using Sion's minimax theorem [14] instead of Lemma 2 in [11], that if Γ̄_Ŷ is uniformly tight then there exists P*_Ŷ ∈ Γ̄_Ŷ such that

sup_{P ∈ Γ} g(P, P*_Ŷ) = inf_{P_Ŷ} sup_{P ∈ Γ} g(P, P_Ŷ) = sup_{P ∈ Γ} inf_{P_Ŷ} g(P, P_Ŷ) ≤ L.

If Γ̄_Ŷ is not uniformly tight, then by Lemma 4 in [11], sup_{P ∈ Γ} inf_{P_Ŷ} ∫ D(P_{Ŷ|X=x} ‖ P_Ŷ) dP(x) = ∞, hence L = ∞ and the upper bound holds trivially; we may therefore assume Γ̄_Ŷ is uniformly tight.

Applying Lemma 1 to P_{Ŷ|X} and P*_Ŷ, we obtain W independent of X, a random variable K = k(X, W) ∈ {1, 2, ...}, and Ŷ = ŷ(K, W) following the conditional distribution P_{Ŷ|X}, such that E[log K | X = x] ≤ D(P_{Ŷ|X=x} ‖ P*_Ŷ) + 1 for any x. We then use the Elias delta code [15] for K to produce M. The codeword length of the Elias delta code is upper bounded by log K + 2 log(log K + 1) + 1, hence, by Jensen's inequality,

E_P[|M|] ≤ E_P[log K] + 2 log(E_P[log K] + 1) + 1 ≤ ∫ D(P_{Ŷ|X=x} ‖ P*_Ŷ) dP(x) + 2 log( ∫ D(P_{Ŷ|X=x} ‖ P*_Ŷ) dP(x) + 2 ) + 2.

Hence

L_λ(e, f, P) = E_P[l(Ŷ, Y)] + λ E_P[|M|] ≤ g(P, P*_Ŷ) ≤ L  for every P ∈ Γ.

Taking the infimum over all P_{Ŷ|X} completes the proof.

Theorem 2 suggests that we can simplify the analysis of the risk-rate cost L_λ(e, f, P) = E_P[l(Ŷ, Y)] + λ E_P[|M|] by replacing the rate E_P[|M|] with the mutual information I_P(X; Ŷ). Define the risk-information cost as

L_λ(P_{Ŷ|X}, P) = E_P[l(Ŷ, Y)] + λ I_P(X; Ŷ).     (7)

Theorem 2 implies that the minimax risk-rate cost L_λ(Γ) can be approximated, within a logarithmic gap, by the minimax risk-information cost

L*_λ(Γ) = inf_{P_{Ŷ|X}} sup_{P ∈ Γ} L_λ(P_{Ŷ|X}, P).     (8)

Theorem 2 can also be stated in the following slightly weaker form:

L*_λ(Γ) ≤ L_λ(Γ) ≤ L*_λ(Γ) + 2λ log(λ^{-1} L*_λ(Γ) + 2) + 2λ.

The risk-information cost has more desirable properties than the risk-rate cost. For example, it is convex in P_{Ŷ|X} for fixed P, and concave in P for fixed P_{Ŷ|X}. This allows us to exchange the infimum and supremum in (8) by Sion's minimax theorem [14], which gives the following proposition.

Proposition 1. Suppose X, Y and Ŷ are finite, Γ is convex and closed, and λ ≥ 0. Then

L*_λ(Γ) = inf_{P_{Ŷ|X}} sup_{P ∈ Γ} L_λ(P_{Ŷ|X}, P) = sup_{P ∈ Γ} inf_{P_{Ŷ|X}} L_λ(P_{Ŷ|X}, P).

Moreover, there exists P*_{Ŷ|X} attaining the infimum on the left-hand side, which also attains the infimum on the right-hand side when P is fixed to P*, the distribution that attains the supremum on the right-hand side.

Proposition 1 means that, in order to design a robust descriptor-estimator pair that works for any P ∈ Γ, we only need to design it according to the worst-case distribution P* as follows.

Principle of maximum risk-information cost: Given a convex and closed Γ, we design the descriptor-estimator pair based on the worst-case distribution

P* = argmax_{P ∈ Γ} inf_{P_{Ŷ|X}} L_λ(P_{Ŷ|X}, P).

We then find the P*_{Ŷ|X} that minimizes L_λ(P_{Ŷ|X}, P*), and design the descriptor-estimator pair accordingly, e.g., using Lemma 1 with P*_{Ŷ|X} and the marginal P*_Ŷ induced by P*_{Ŷ|X} and P*.
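For a fixed distribution P (the inner minimization above, leaving aside the supremum over Γ), the risk-information cost (7) can be minimized over P_{Ŷ|X} by a Blahut-Arimoto-style alternating minimization, since for a fixed Ŷ-marginal the optimal conditional is an exponential tilting of that marginal. The following is a minimal sketch for finite alphabets; the routine and all names are ours, not from the paper, and it assumes every x has positive probability.

import numpy as np

def min_risk_information_cost(p_xy, loss, lam, n_iter=500):
    """Approximates inf over P_{Yhat|X} of E_P[l(Yhat, Y)] + lam * I_P(X; Yhat)   (cf. (7))
    for a fixed joint pmf p_xy[x, y] and loss table loss[yhat, y]; log base 2, lam in bits."""
    p_x = p_xy.sum(axis=1)
    d = (p_xy / p_x[:, None]) @ loss.T                  # d[x, yhat] = E[ l(yhat, Y) | X = x ]
    q = np.full((p_xy.shape[0], loss.shape[0]), 1.0 / loss.shape[0])   # P_{Yhat|X}, uniform start
    for _ in range(n_iter):
        r = p_x @ q                                     # induced marginal of Yhat
        q = r[None, :] * np.exp2(-d / lam)              # tilt the marginal by the expected loss
        q /= q.sum(axis=1, keepdims=True)
    r = p_x @ q
    risk = float(np.sum(p_x[:, None] * q * d))
    with np.errstate(divide="ignore", invalid="ignore"):
        mi = float(np.nansum(np.where(q > 0, p_x[:, None] * q * np.log2(q / r[None, :]), 0.0)))
    return risk + lam * mi, q

The outer maximization over P ∈ Γ called for by the principle above still has to be handled separately, e.g., by convex optimization over the ambiguity set.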

IV. APPLICATIONS

A. Rate-Constrained Minimax Linear Regression

Suppose X ∈ R^d, Y ∈ R, l(ŷ, y) = (ŷ - y)^2 is the mean-squared loss, and we observe the data {(X_i, Y_i)}_{i=1}^n. Take Γ to be the set of distributions with the same first and second moments as given by the empirical distribution, i.e.,

Γ = { P_{XY} : E[X] = µ_X, E[Y] = µ_Y, Var[X] = Σ_X, Var[Y] = σ_Y^2, Cov[X, Y] = C_XY },     (9)

where µ_X, µ_Y, Σ_X, σ_Y^2, C_XY are the corresponding statistics of the empirical distribution. The following proposition shows that P* is Gaussian.

Proposition 2 (Linear regression with rate constraint). Consider the mean-squared loss and define Γ as in (9). Then the minimax risk-information cost (8) is

L*_λ = σ_Y^2 - C_XY^T Σ_X^{-1} C_XY + (λ/2) log( (2e C_XY^T Σ_X^{-1} C_XY) / (λ log e) ),     (10)

where the optimal P*_{XY} is Gaussian with its mean and covariance matrix specified in (9), and the optimal estimate is

Ŷ = a C_XY^T Σ_X^{-1} X + b + Z   if λ log e / 2 < C_XY^T Σ_X^{-1} C_XY,   and   Ŷ = µ_Y   if λ log e / 2 ≥ C_XY^T Σ_X^{-1} C_XY,

where a = 1 - (λ log e) / (2 C_XY^T Σ_X^{-1} C_XY), b = µ_Y - a C_XY^T Σ_X^{-1} µ_X, and Z ~ N(0, σ_Z^2) is independent of X with σ_Z^2 = λ a log e / 2.

Note that this setting does not satisfy the conditions in Proposition 1. We directly analyze (8) to obtain the optimal P*_{XY}. Given the optimal P*_{XY}, Theorem 2 and Lemma 1 can be used to construct the scheme. Operationally, e_n(X, W) is a random quantizer of a C_XY^T Σ_X^{-1} X + b such that the quantization noise follows N(0, σ_Z^2). With this natural choice of the ambiguity set, our formulation recovers a compressed version of the familiar MMSE estimator. Figure 3 plots the tradeoff between the rate and the risk when d = 1, µ_X = µ_Y = 0, σ_X^2 = σ_Y^2 = 1, C_XY = 0.95 for the scheme constructed using the Poisson functional representation in [4], with the lower bound given by the minimax risk-information cost L*_λ and the upper bound given in Theorem 2.

Fig. 3. Tradeoff between the rate and the risk in rate-constrained minimax linear regression.
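The closed-form quantities in Proposition 2 can be sanity-checked by simulation. The following sketch (our own; all variable names are ours) draws from the worst-case Gaussian distribution in the scalar setting of Fig. 3 and compares the empirical risk of the estimator above with the value σ_Y^2 - C_XY^T Σ_X^{-1} C_XY + λ log e / 2 used in the proof.

import numpy as np

# Scalar sanity check of the estimator in Proposition 2 (d = 1, the setting of Fig. 3).
rng = np.random.default_rng(0)
sigma_x2, sigma_y2, c_xy, lam = 1.0, 1.0, 0.95, 0.1   # moments and Lagrange multiplier
sbar2 = c_xy**2 / sigma_x2                            # C^T Sigma_X^{-1} C: variance of the MMSE estimate
D = lam * np.log2(np.e) / 2                           # distortion allowed by the rate constraint
assert D < sbar2                                      # non-degenerate regime of Proposition 2
a = 1 - D / sbar2
sigma_z2 = D * a

n = 1_000_000
x = rng.normal(0.0, np.sqrt(sigma_x2), n)
y = (c_xy / sigma_x2) * x + rng.normal(0.0, np.sqrt(sigma_y2 - sbar2), n)  # worst-case Gaussian P*
yhat = a * (c_xy / sigma_x2) * x + rng.normal(0.0, np.sqrt(sigma_z2), n)   # remote estimate

print("empirical risk :", np.mean((yhat - y) ** 2))
print("predicted risk :", sigma_y2 - sbar2 + D)              # sigma_Y^2 - sbar2 + lam*log(e)/2
print("I(X; Yhat) bits:", 0.5 * np.log2(sbar2 / D))          # information term of the cost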

Proof of Proposition 2: Without loss of generality, assume µ_X = 0 and µ_Y = 0. We first prove "≤" in (10). For this, fix P_{Ŷ|X} as given in the proposition and consider any P ∈ Γ. When λ log e / 2 < C_XY^T Σ_X^{-1} C_XY, we have

E_P[l(Ŷ, Y)] = E_P[(Ŷ - Y)^2] = σ_Y^2 + λ log e / 2 - C_XY^T Σ_X^{-1} C_XY,

and

I_P(X; Ŷ) = h(Ŷ) - h(Ŷ | X) ≤ (1/2) log( (2 C_XY^T Σ_X^{-1} C_XY) / (λ log e) ).

Therefore,

inf_{P_{Ŷ|X}} sup_{P ∈ Γ} { E_P[l(Ŷ, Y)] + λ I_P(X; Ŷ) } ≤ R.H.S. of (10).

It can also be checked that the above relation holds when λ log e / 2 ≥ C_XY^T Σ_X^{-1} C_XY, and thus we have proved "≤" in (10).

To prove "≥" in (10), fix the Gaussian P_{XY} with its mean and covariance matrix specified in (9) and consider an arbitrary P_{Ŷ|X}. We have

E_P[l(Ŷ, Y)] = E_P[(Y - Ŷ)^2] = σ_Y^2 - C_XY^T Σ_X^{-1} C_XY + E_P[(Ŷ - C_XY^T Σ_X^{-1} X)^2],

and

I_P(X; Ŷ) ≥ I_P(C_XY^T Σ_X^{-1} X ; Ŷ) = h(C_XY^T Σ_X^{-1} X) - h(C_XY^T Σ_X^{-1} X | Ŷ) ≥ (1/2) log(C_XY^T Σ_X^{-1} C_XY) - (1/2) log( E_P[(Ŷ - C_XY^T Σ_X^{-1} X)^2] ).

Letting γ = E_P[(Ŷ - C_XY^T Σ_X^{-1} X)^2], we have

E_P[l(Ŷ, Y)] + λ I_P(X; Ŷ) ≥ σ_Y^2 - C_XY^T Σ_X^{-1} C_XY + (λ/2) log(C_XY^T Σ_X^{-1} C_XY) + γ - (λ/2) log γ ≥ R.H.S. of (10),

where the second inequality follows by evaluating the minimum value of γ - (λ/2) log γ, attained at γ = λ log e / 2. Combining this with the above completes the proof of Proposition 2.

The optimal scheme in the above example corresponds to compressing and communicating the minimax optimal rate-unconstrained predictor Ȳ = C_XY^T Σ_X^{-1} (X - µ_X) + µ_Y, since the optimal Ŷ can be obtained from Ȳ by shifting, scaling and adding noise. This estimate-compress approach can be thought of as a separation scheme, since we first optimally estimate Ȳ and then optimally communicate it while satisfying the rate constraint. In the next application, we show that such separation is not optimal in general.

B. Rate-Constrained Minimax Classification

We assume Y = Ŷ = {1, ..., k} and X are finite, l(ŷ, y) = 1{ŷ ≠ y}, and Γ is closed and convex. The following proposition gives the minimax risk-information cost and the optimal estimator.

Proposition 3. Consider the setting described above. The minimax risk-information cost is given by

L*_λ = 1 + λ sup_{P ∈ Γ} inf_{P_Ŷ} E_P[ -log( Σ_ŷ P_Ŷ(ŷ) 2^{λ^{-1} P_{Y|X}(ŷ|X)} ) ],

the worst-case distribution P* is the one attaining the supremum, and the optimal estimator is given by

P*_{Ŷ|X}(ŷ | x) ∝ P*_Ŷ(ŷ) 2^{λ^{-1} P*_{Y|X}(ŷ|x)},

where P*_Ŷ attains the infimum and P*_{Y|X} is obtained from P*. In particular, if Γ is symmetric for different values of Y (i.e., for any y_1, y_2 ∈ Y, there exist a permutation π of Y and a permutation τ of X such that π(y_1) = y_2 and P_{X,Y} ∈ Γ if and only if P_{τ(X),π(Y)} ∈ Γ), then

L*_λ = 1 + λ log k - λ E_{P*}[ log( Σ_ŷ 2^{λ^{-1} P*_{Y|X}(ŷ|X)} ) ].

We can see that as λ → 0, P*_{Ŷ|X} tends to the maximum a posteriori estimator under P*, the worst-case distribution when λ = 0.
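The optimal estimator in Proposition 3 is an exponential tilting of the Ŷ-marginal by the posterior. The short sketch below (ours, with hypothetical numbers) computes this tilting for one value of x and illustrates the λ → 0 limit towards the MAP estimate.

import numpy as np

def tilted_estimator(p_yhat, posterior, lam):
    """P*_{Yhat|X}(yhat | x)  proportional to  P*_Yhat(yhat) * 2^{ posterior(yhat) / lam },
    where posterior(yhat) = P*_{Y|X}(yhat | x)   (cf. Proposition 3; 0-1 loss, log base 2)."""
    w = p_yhat * np.exp2(posterior / lam)
    return w / w.sum()

posterior = np.array([0.5, 0.3, 0.2])      # P*_{Y|X}(. | x) for some x (hypothetical numbers)
p_yhat = np.ones(3) / 3                    # marginal attaining the infimum (uniform under symmetry)
for lam in (1.0, 0.1, 0.01):
    print(lam, tilted_estimator(p_yhat, posterior, lam))
# As lam -> 0 the output concentrates on argmax of the posterior, i.e., the MAP estimate.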

Proof: Assume Γ is closed and convex. By Proposition 1, the minimax risk-information cost is L*_λ = sup_{P ∈ Γ} inf_{P_{Ŷ|X}} L_λ(P_{Ŷ|X}, P), where

inf_{P_{Ŷ|X}} L_λ(P_{Ŷ|X}, P)
= inf_{P_{Ŷ|X}} { E_P[l(Ŷ, Y)] + λ I_P(X; Ŷ) }
= inf_{P_{Ŷ|X}, P_Ŷ} { P{Ŷ ≠ Y} + λ ∫ D(P_{Ŷ|X=x} ‖ P_Ŷ) dP(x) }
= 1 + λ inf_{P_{Ŷ|X}, P_Ŷ} E_P[ Σ_ŷ P_{Ŷ|X}(ŷ|X) log( P_{Ŷ|X}(ŷ|X) / ( 2^{λ^{-1} P_{Y|X}(ŷ|X)} P_Ŷ(ŷ) ) ) ]
=(a) 1 + λ inf_{P_Ŷ} E_P[ -log( Σ_ŷ 2^{λ^{-1} P_{Y|X}(ŷ|X)} P_Ŷ(ŷ) ) ],

where (a) is due to the nonnegativity of relative entropy, with equality attained when P_{Ŷ|X}(ŷ|x) ∝ 2^{λ^{-1} P_{Y|X}(ŷ|x)} P_Ŷ(ŷ). This gives the general expression and the form of the optimal estimator.

Next we consider the case in which Γ is symmetric. Consider the minimax risk-information cost

L*_λ = inf_{P_{Ŷ|X}} sup_{P ∈ Γ} L_λ(P_{Ŷ|X}, P) = inf_{P_{Ŷ|X}} sup_{P ∈ Γ} { E_P[l(Ŷ, Y)] + λ I_P(X; Ŷ) }.

For any i, j ∈ Y = {1, ..., k}, let π_ij be the permutation over Y such that π_ij(i) = j, and let τ_ij be the corresponding permutation over X in the symmetry assumption. Since the function P_{Ŷ|X} ↦ L_λ(P_{Ŷ|X}, P) is convex and symmetric with respect to π_ij and τ_ij (i.e., L_λ(P_{Ŷ|X}, P) = L_λ(P_{π_ij(Ŷ)|τ_ij(X)}, P)), to find its infimum we only need to consider P_{Ŷ|X}'s satisfying P_{Ŷ|X} = P_{π_ij(Ŷ)|τ_ij(X)} for all i, j (if not, we can instead consider the average of P_{π_ij^a(Ŷ)|τ_ij^a(X)} for a from 1 up to the product of the periods of π_ij and τ_ij, which gives a value of the function not larger than that of P_{Ŷ|X}). For brevity we say P_{Ŷ|X} is symmetric if it satisfies this condition. Fix any symmetric P_{Ŷ|X}. Since the function P ↦ L_λ(P_{Ŷ|X}, P) is concave and symmetric with respect to π_ij and τ_ij (i.e., L_λ(P_{Ŷ|X}, P_{X,Y}) = L_λ(P_{Ŷ|X}, P_{τ_ij(X),π_ij(Y)})), to find its supremum we only need to consider symmetric P's. Hence,

L*_λ = inf_{symm. P_{Ŷ|X}} sup_{symm. P ∈ Γ} L_λ(P_{Ŷ|X}, P)
= inf_{symm. P_{Ŷ|X}} sup_{symm. P ∈ Γ} { E_P[l(Ŷ, Y)] + λ I_P(X; Ŷ) }
= inf_{symm. P_{Ŷ|X}} sup_{symm. P ∈ Γ} { P{Ŷ ≠ Y} + λ ( log k - H_P(Ŷ | X) ) }   (the marginal of Ŷ is uniform by symmetry)
= 1 + λ log k + λ inf_{symm. P_{Ŷ|X}} sup_{symm. P ∈ Γ} E_P[ Σ_ŷ P_{Ŷ|X}(ŷ|X) log( P_{Ŷ|X}(ŷ|X) / 2^{λ^{-1} P_{Y|X}(ŷ|X)} ) ]
≥ sup_{symm. P ∈ Γ} { 1 + λ log k - λ E_P[ log( Σ_ŷ 2^{λ^{-1} P_{Y|X}(ŷ|X)} ) ] },

where the inequality is because relative entropy is nonnegative, with equality attained when P_{Ŷ|X}(ŷ|x) ∝ 2^{λ^{-1} P_{Y|X}(ŷ|x)}. Note that

1 + λ log k - λ E_P[ log( Σ_ŷ 2^{λ^{-1} P_{Y|X}(ŷ|X)} ) ] = inf_{P_{Ŷ|X}} { P{Ŷ ≠ Y} + λ ( log k - H_P(Ŷ | X) ) }

is an infimum of affine functions of P, hence it is concave in P. It is also symmetric with respect to π and τ, hence

sup_{symm. P ∈ Γ} { 1 + λ log k - λ E_P[ log( Σ_ŷ 2^{λ^{-1} P_{Y|X}(ŷ|X)} ) ] } = sup_{P ∈ Γ} { 1 + λ log k - λ E_P[ log( Σ_ŷ 2^{λ^{-1} P_{Y|X}(ŷ|X)} ) ] }.

The other direction follows from setting P_Ŷ(ŷ) = 1/k in the general expression, which completes the proof.

To show that the estimate-compress approach is not always optimal, let l(ŷ, y) = 1{ŷ ≠ y} and Y = Y_1 ∪ Y_2, where Y_1 ∩ Y_2 = ∅ and |Y_i| = k_i is finite. Let Γ = {P}, where P is such that X = (X_1, X_2) ~ Unif(Y_1 x Y_2) and Y = X_i with probability q_i for i = 1, 2. By Proposition 3, the optimal risk-information cost is

1 - λ log max{ (1/k_1)(2^{λ^{-1} q_1} - 1) + 1,  (1/k_2)(2^{λ^{-1} q_2} - 1) + 1 },     (11)

and the optimal scheme is to generate Ŷ by compressing X_1 (by passing it through a symmetric channel) if (1/k_1)(2^{λ^{-1} q_1} - 1) > (1/k_2)(2^{λ^{-1} q_2} - 1), and by compressing X_2 otherwise. Assume q_1 > q_2; then the optimal MAP estimate is Ȳ = X_1. An estimate-compress approach would try to communicate a compressed version of Ȳ = X_1. However, the optimal rate-constrained descriptor communicates a lossy version of X_2 rather than X_1 if k_1 ≫ k_2. The risk-information cost achieved by the estimate-compress approach is

1 - λ log max{ (1/k_1)(2^{λ^{-1} q_1} - 1) + 1,  2^{λ^{-1} q_2 / k_2} },     (12)

which is larger than (11) when k_1 ≫ k_2. Moreover, the gap between the rates needed by the two approaches for a fixed risk can be unbounded. Take q_1 = 1 - q_2 = 2/3, k_2 = 2, and let k_1 grow. The minimum rate needed to achieve a risk of 2/3 is 1 bit (by Ŷ = X_2). For the estimate-compress approach, since Ŷ ~ Unif(Y_2) gives a risk of 5/6, we have to compress X_1 to achieve a risk of 2/3, which requires an unbounded rate of order log k_1.

Figure 4 compares the optimal scheme, the lower bound obtained from the optimal risk-information tradeoff (11), the upper bound on the optimal rate by Theorem 1, and the risk-information tradeoff of the estimate-compress approach (12), for q_1 = 1 - q_2 = 2/3, k_1 = 3, k_2 = 2. The optimal scheme performs time sharing, using common randomness, between encoding X_1 (with risk 1/3), encoding X_2 using 1 bit (with risk 2/3), and fixing the output at one value of X_2 using 0 bits (with risk 5/6). The mutual information needed by the estimate-compress approach (which is a lower bound on the actual rate needed by this approach) is strictly greater than the optimal rate, except when the risk is at its minimum 1/3 or maximum 5/6.

Fig. 4. Tradeoff between the rate and the risk in rate-constrained minimax classification for the optimal scheme, the lower bound (11), the upper bound by Theorem 1, and the estimate-compress approach (12).
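This gap can be reproduced numerically by comparing the best descriptor that sees all of X with one restricted to the MAP feature X_1, reusing the min_risk_information_cost routine sketched in Section III. The parameters below are illustrative choices of ours (a larger k_1 makes the gap visible at this λ), not the ones used for Fig. 4.

import itertools
import numpy as np

k1, k2, q1, lam = 16, 2, 2/3, 0.3            # illustrative parameters (ours)
Y1, Y2 = range(k1), range(k1, k1 + k2)
states = list(itertools.product(Y1, Y2))     # X = (X1, X2) uniform on Y1 x Y2
n_y = k1 + k2
loss = 1.0 - np.eye(n_y)                     # 0-1 loss

p_xy = np.zeros((len(states), n_y))          # joint pmf of X (flattened) and Y
p_x1y = np.zeros((k1, n_y))                  # joint pmf of X1 and Y (estimate-compress view)
for i, (x1, x2) in enumerate(states):
    p_xy[i, x1] += q1 / len(states)
    p_xy[i, x2] += (1 - q1) / len(states)
    p_x1y[x1, x1] += q1 / len(states)
    p_x1y[x1, x2] += (1 - q1) / len(states)

opt, _ = min_risk_information_cost(p_xy, loss, lam)    # descriptor may depend on all of X
ec, _ = min_risk_information_cost(p_x1y, loss, lam)    # restricted to X1: I(X; Yhat) = I(X1; Yhat)
print("optimal risk-information cost  :", opt)
print("estimate-compress cost (via X1):", ec)          # never smaller; strictly larger here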
V. ACKNOWLEDGEMENTS

This work was partially supported by a gift from Huawei Technologies and by the Center for Science of Information (CSoI), an NSF Science and Technology Center, under grant agreement CCF.

REFERENCES

[1] F. Farnia and D. Tse, "A minimax approach to supervised learning," in Advances in Neural Information Processing Systems, 2016.
[2] H. Namkoong and J. C. Duchi, "Variance-based regularization with convex objectives," in Advances in Neural Information Processing Systems, 2017.
[3] J. Lee and M. Raginsky, "Minimax statistical learning and domain adaptation with Wasserstein distances," arXiv preprint, 2017.
[4] C. T. Li and A. El Gamal, "Strong functional representation lemma and applications to coding theorems," in Proc. IEEE Int. Symp. Inf. Theory, June 2017.
[5] P. Harsha, R. Jain, D. McAllester, and J. Radhakrishnan, "The communication complexity of correlation," IEEE Trans. Inf. Theory, vol. 56, no. 1, Jan. 2010.
[6] M. Braverman and A. Garg, "Public vs private coin in bounded-round information," in International Colloquium on Automata, Languages, and Programming. Springer, 2014.
[7] N. Tishby, F. C. Pereira, and W. Bialek, "The information bottleneck method," arXiv preprint, 2000.
[8] P. Harremoës and N. Tishby, "The information bottleneck revisited or how to choose a good distortion measure," in Proc. IEEE Int. Symp. Inf. Theory, 2007.

[9] T. A. Courtade and T. Weissman, "Multiterminal source coding under logarithmic loss," IEEE Trans. Inf. Theory, vol. 60, no. 1, 2014.
[10] A. Dembo and T. Weissman, "The minimax distortion redundancy in noisy source coding," IEEE Trans. Inf. Theory, vol. 49, no. 11, 2003.
[11] D. Haussler, "A general minimax result for relative entropy," IEEE Trans. Inf. Theory, vol. 43, no. 4, Jul. 1997.
[12] R. G. Gallager, "Source coding with side information and universal coding," MIT Laboratory for Information and Decision Systems, Tech. Rep. LIDS-P-937.
[13] E. Posner, "Random coding strategies for minimum entropy," IEEE Trans. Inf. Theory, vol. 21, no. 4, Jul. 1975.
[14] M. Sion, "On general minimax theorems," Pacific Journal of Mathematics, vol. 8, no. 1, 1958.
[15] P. Elias, "Universal codeword sets and representations of the integers," IEEE Trans. Inf. Theory, vol. 21, no. 2, Mar. 1975.


More information

Lecture 5: GPs and Streaming regression

Lecture 5: GPs and Streaming regression Lecture 5: GPs and Streaming regression Gaussian Processes Information gain Confidence intervals COMP-652 and ECSE-608, Lecture 5 - September 19, 2017 1 Recall: Non-parametric regression Input space X

More information

Multiuser Successive Refinement and Multiple Description Coding

Multiuser Successive Refinement and Multiple Description Coding Multiuser Successive Refinement and Multiple Description Coding Chao Tian Laboratory for Information and Communication Systems (LICOS) School of Computer and Communication Sciences EPFL Lausanne Switzerland

More information

IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 52, NO. 4, APRIL

IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 52, NO. 4, APRIL IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 52, NO. 4, APRIL 2006 1545 Bounds on Capacity Minimum EnergyPerBit for AWGN Relay Channels Abbas El Gamal, Fellow, IEEE, Mehdi Mohseni, Student Member, IEEE,

More information

Channels with cost constraints: strong converse and dispersion

Channels with cost constraints: strong converse and dispersion Channels with cost constraints: strong converse and dispersion Victoria Kostina, Sergio Verdú Dept. of Electrical Engineering, Princeton University, NJ 08544, USA Abstract This paper shows the strong converse

More information

Randomized Quantization and Optimal Design with a Marginal Constraint

Randomized Quantization and Optimal Design with a Marginal Constraint Randomized Quantization and Optimal Design with a Marginal Constraint Naci Saldi, Tamás Linder, Serdar Yüksel Department of Mathematics and Statistics, Queen s University, Kingston, ON, Canada Email: {nsaldi,linder,yuksel}@mast.queensu.ca

More information

Probability and Statistical Decision Theory

Probability and Statistical Decision Theory Tufts COMP 135: Introduction to Machine Learning https://www.cs.tufts.edu/comp/135/2019s/ Probability and Statistical Decision Theory Many slides attributable to: Erik Sudderth (UCI) Prof. Mike Hughes

More information

Machine Learning Basics: Maximum Likelihood Estimation

Machine Learning Basics: Maximum Likelihood Estimation Machine Learning Basics: Maximum Likelihood Estimation Sargur N. srihari@cedar.buffalo.edu This is part of lecture slides on Deep Learning: http://www.cedar.buffalo.edu/~srihari/cse676 1 Topics 1. Learning

More information

Lossy Distributed Source Coding

Lossy Distributed Source Coding Lossy Distributed Source Coding John MacLaren Walsh, Ph.D. Multiterminal Information Theory, Spring Quarter, 202 Lossy Distributed Source Coding Problem X X 2 S {,...,2 R } S 2 {,...,2 R2 } Ẑ Ẑ 2 E d(z,n,

More information

Optimization in Information Theory

Optimization in Information Theory Optimization in Information Theory Dawei Shen November 11, 2005 Abstract This tutorial introduces the application of optimization techniques in information theory. We revisit channel capacity problem from

More information

On the Capacity of the Interference Channel with a Relay

On the Capacity of the Interference Channel with a Relay On the Capacity of the Interference Channel with a Relay Ivana Marić, Ron Dabora and Andrea Goldsmith Stanford University, Stanford, CA {ivanam,ron,andrea}@wsl.stanford.edu Abstract Capacity gains due

More information

An Information Theory For Preferences

An Information Theory For Preferences An Information Theor For Preferences Ali E. Abbas Department of Management Science and Engineering, Stanford Universit, Stanford, Ca, 94305 Abstract. Recent literature in the last Maimum Entrop workshop

More information