Minimax Learning for Remote Prediction


Cheuk Ting Li*, Xiugang Wu†, Ayfer Ozgur†, Abbas El Gamal†
*Department of Electrical Engineering and Computer Sciences, University of California, Berkeley. Email: ctli@berkeley.edu
†Department of Electrical Engineering, Stanford University. Email: {x23wu, aozgur}@stanford.edu, abbas@ee.stanford.edu
arXiv v1 [cs.IT] 31 May 2018

Abstract: The classical problem of supervised learning is to infer an accurate predictor of a target variable Y from a measured variable X by using a finite number of labeled training samples. Motivated by the increasingly distributed nature of data and decision making, in this paper we consider a variation of this classical problem in which the prediction is performed remotely based on a rate-constrained description M of X. Upon receiving M, the remote node computes an estimate Ŷ of Y. We follow the recent minimax approach to study this learning problem and show that it corresponds to a one-shot minimax noisy source coding problem. We then establish information theoretic bounds on the risk-rate Lagrangian cost and a general method to design a near-optimal descriptor-estimator pair, which can be viewed as a rate-constrained analog to the maximum conditional entropy principle used in the classical minimax learning problem. Our results show that a naive estimate-compress scheme for rate-constrained prediction is not in general optimal.

I. INTRODUCTION

The classical problem of supervised learning is to infer an accurate predictor of a target variable Y from a measured variable X on the basis of n labeled training samples {(X_i, Y_i)}_{i=1}^n independently drawn from an unknown joint distribution P. The standard approach for solving this problem in statistical learning theory is empirical risk minimization (ERM). For a given set of allowable predictors and a loss function that quantifies the risk of each predictor, ERM chooses the predictor with minimal risk under the empirical distribution of the samples. To avoid overfitting, the set of allowable predictors is restricted to a class with limited complexity. Recently, an alternative viewpoint has emerged which seeks distributionally robust predictors. Given the labeled training samples, this approach learns a predictor by minimizing its worst-case risk over an ambiguity distribution set centered at the empirical distribution of the samples. In other words, instead of restricting the set of allowable predictors, it aims to avoid overfitting by requiring that the learned predictor perform well under any distribution in a chosen neighborhood of the empirical distribution. This minimax approach has been investigated under different assumptions on how the ambiguity set is constructed, e.g., by restricting the moments [1], or by forming f-divergence balls [2] or Wasserstein balls [3] (see also the references therein).

In these previous works, the learning algorithm finds a predictor that acts directly on a fresh unlabeled sample X to predict the corresponding target variable Y. Often, however, the fresh sample X may be only remotely available, and when designing the predictor it is desirable to also take into account the cost of communicating X. This is motivated by the fact that bandwidth and energy limitations on communication in networks and within multiprocessor systems often impose significant bottlenecks on the performance of algorithms. There are also an increasing number of applications in which data is generated in a distributed manner and it, or features of it, are communicated over bandwidth-limited links to a central processor to perform inference.
For instance, applications such as Google Goggles and Siri process the locally collected data on clouds. It is thus important to study prediction in distributed and rate-constrained settings. In this paper, we study an extension of the classical learning problem in which, given a finite set of training samples, the learning algorithm needs to infer a descriptor-estimator pair with a desired communication rate in between them. This is especially relevant when both X and Y come from a large alphabet or are continuous random variables, as in regression problems, so neither the sample X nor its predicted value of Y can simply be communicated in a lossless fashion. We adopt the minimax framework for learning the descriptor-estimator pair. Given a set of labeled training samples, our goal is to find a descriptor-estimator pair by minimizing their resultant worst-case risk over an ambiguity distribution set, where the risk now incorporates both the statistical risk and the communication cost.

One of the important conclusions that emerge from the minimax approach to supervised learning in [1] is that the problem of finding the predictor with minimal worst-case risk over an ambiguity set can be broken into two smaller steps: (1) find the worst-case distribution in the ambiguity set that maximizes the generalized conditional entropy of Y given X, and (2) find the optimal predictor under this worst-case distribution. In this paper, we show that an analogous principle approximately holds for rate-constrained prediction.

This paper will be presented in part at the IEEE International Symposium on Information Theory, Vail, Colorado, USA, June 2018.

The descriptor-estimator pair with minimal worst-case risk can be found in two steps: (1) find the worst-case distribution in the ambiguity set that maximizes the risk-information Lagrangian cost, and (2) find the optimal descriptor-estimator pair under this worst-case distribution. We then apply our results to characterize the optimal descriptor-estimator pairs for two applications: rate-constrained linear regression and rate-constrained classification. While a simple scheme, whereby we first find the optimal predictor ignoring the rate constraint and then compress and communicate the predictor output, is optimal for the linear regression application, we show via the classification application that such an estimate-compress approach is not optimal in general. We show that when prediction is rate-constrained, the optimal descriptor aims to send sufficiently (but not necessarily maximally) informative features of the observed variable, which are at the same time easy to communicate. When applied to the case in which the ambiguity distribution set contains only a single distribution (for example, the true or empirical distribution of (X, Y)) and the loss function for the prediction is the logarithmic loss, our results provide a new one-shot operational interpretation of the information bottleneck problem. A key technical ingredient in our results is the strong functional representation lemma (SFRL) developed in [4], which we use to design the optimal descriptor-estimator pair for the worst-case distribution.

Notation: We assume that log is base 2 and the entropy H is in bits. The length of a variable-length description M ∈ {0, 1}* is denoted by |M|. For random variables U, V, we denote the joint distribution by P_{U,V} and the conditional distribution of U given V by P_{U|V}. For brevity we denote the distribution of (X, Y) by P. We write I_P(X; Ŷ) for I(X; Ŷ) when (X, Y) ~ P and the conditional distribution P_{Ŷ|X} is clear from the context.

II. PROBLEM FORMULATION

We begin by reviewing the minimax approach to the classical learning problem [1].

A. Minimax Approach to Supervised Learning

Let X ∈ X and Y ∈ Y be jointly distributed random variables. The problem of statistical learning is to design an accurate predictor of a target variable Y from a measured variable X on the basis of a number of independent training samples {(X_i, Y_i)}_{i=1}^n drawn from an unknown joint distribution. The standard approach for solving this problem is to use empirical risk minimization (ERM), in which one defines an admissible class of predictors F consisting of functions f : X -> Ŷ (where the reconstruction alphabet Ŷ can in general be different from Y) and a loss function l : Ŷ x Y -> R. The risk associated with a predictor f when the underlying joint distribution of X and Y is P is

L(f, P) = E_P[l(f(X), Y)].

ERM simply chooses the predictor f_n ∈ F with minimal risk under the empirical distribution P_n of the training samples. Recently, an alternative approach has emerged which seeks distributionally robust predictors. This approach learns a predictor by minimizing its worst-case risk over an ambiguity distribution set Γ(P_n), i.e.,

f_n = argmin_f max_{P ∈ Γ(P_n)} L(f, P),     (1)

where f can be any function and Γ(P_n) can be constructed in various ways, e.g., by restricting the moments, or by forming f-divergence balls or Wasserstein balls. While in ERM it is important to restrict the set F of admissible predictors to a low-complexity class to prevent overfitting, in the minimax approach overfitting is prevented by explicitly requiring that the chosen predictor be distributionally robust. A toy numerical illustration of (1) is sketched below.
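The following brute-force sketch makes (1) concrete for finite alphabets and a finite ambiguity set; it is our own illustration (restricted to deterministic predictors), not a construction from the paper.

import itertools
import numpy as np

def minimax_predictor(gamma, loss):
    """Brute-force f_n = argmin_f max_{P in Gamma} E_P[ l(f(X), Y) ]   (cf. (1)).
    gamma: list of joint pmfs P[x, y]; loss[yhat, y]: loss table.
    Enumerates all deterministic predictors f: X -> Yhat, so only usable for tiny alphabets;
    the true minimizer in (1) may in general be a randomized predictor."""
    n_x, n_y = gamma[0].shape
    n_yhat = loss.shape[0]
    best_f, best_worst = None, np.inf
    for f in itertools.product(range(n_yhat), repeat=n_x):        # f[x] is the predicted label
        worst = max(sum(P[x, y] * loss[f[x], y] for x in range(n_x) for y in range(n_y))
                    for P in gamma)                               # worst-case risk over Gamma
        if worst < best_worst:
            best_f, best_worst = f, worst
    return best_f, best_worst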
The learned function f_n can then be used for predicting Y when presented with fresh samples of X. The learning and inference phases are illustrated in Fig. 1.

Fig. 1. Minimax approach to supervised learning.

B. Minimax Learning for Remote Prediction

In this paper, we extend the minimax learning approach to the setting in which the prediction needs to be performed based on a rate-constrained description of X. In particular, given a finite set of training samples {(X_i, Y_i)}_{i=1}^n independently drawn from an unknown joint distribution P, our goal is to learn a pair of functions (e, f), where e is a descriptor used to compress X into M = e(X) ∈ {0, 1}* (a prefix-free code), and f is an estimator that takes the compression M and generates an estimate Ŷ of Y; see Fig. 2. Let R(e, P) = E_P[|e(X)|] be the rate of the descriptor e and L(e, f, P) = E_P[l(f(e(X)), Y)] be the risk associated with the descriptor-estimator pair (e, f) when the underlying distribution of (X, Y) is P, and define the risk-rate Lagrangian cost parametrized by λ > 0 as

L_λ(e, f, P) = L(e, f, P) + λ R(e, P).     (2)

Note that this cost function takes into account both the resultant statistical prediction risk of (e, f) and the communication rate they require. The task of a minimax learner is to find an (e_n, f_n) pair that minimizes the worst-case L_λ(e, f, P) over the ambiguity distribution set Γ(P_n), i.e.,

(e_n, f_n) = argmin_{(e, f)} max_{P ∈ Γ(P_n)} L_λ(e, f, P),     (3)

for an appropriately chosen Γ(P_n) centered at the empirical distribution of the samples P_n. Note that we allow here all possible (e, f) pairs. We also assume that the descriptor and the estimator can use unlimited common randomness W which is independent of the data, i.e., e and f can be expressed as functions of (X, W) and (M, W), respectively, and the prefix-free codebook for M can depend on W. The availability of such common randomness can be justified by the fact that in practice, although the inference scheme is one-shot, it is used many times by the same user and by different users, hence the descriptor and the estimator can share a common randomness seed before communication commences without impacting the communication rate.

Fig. 2. Minimax learning for remote prediction.

III. MAIN RESULTS

We first consider the case where Γ consists of a single distribution P, which may be the empirical distribution P_n as in ERM. Define the minimax risk-rate cost as

L_λ(Γ) = inf_{(e, f)} sup_{P ∈ Γ} L_λ(e, f, P),     (4)

which for Γ = {P} reduces to inf_{(e, f)} L_λ(e, f, P). While it is difficult to minimize the risk-rate cost directly, the minimax risk-rate cost can be bounded in terms of the mutual information between X and Ŷ.

Theorem 1. Let Γ = {P}. Then

inf_{P_{Ŷ|X}} { E[l(Ŷ, Y)] + λ I(X; Ŷ) }  ≤  L_λ(Γ)  ≤  inf_{P_{Ŷ|X}} { E[l(Ŷ, Y)] + λ ( I(X; Ŷ) + log(I(X; Ŷ) + 1) + 5 ) }.

As in other one-shot compression results (e.g., zero-error compression), there is a gap between the upper and lower bounds. While the logarithmic gap in Theorem 1 is not as small as the 1-bit gap in zero-error compression, it is dominated by the linear term I(X; Ŷ) when the latter is large.
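As a concrete reading of the quantities in (2) and (4), the following small sketch (our own, using a deterministic descriptor and no common randomness) evaluates the risk-rate Lagrangian cost of a given descriptor-estimator pair on finite alphabets; Theorem 1 lower-bounds the infimum of this quantity over all pairs by E[l(Ŷ, Y)] + λ I(X; Ŷ).

import numpy as np

def risk_rate_cost(encoder, decoder, codeword_len, p_xy, loss, lam):
    """L_lambda(e, f, P) = E_P[ l(f(e(X)), Y) ] + lam * E_P[ |e(X)| ]   (cf. (2)).
    encoder(x) -> message index m; codeword_len[m]: length of the prefix-free codeword for m;
    decoder(m) -> yhat; p_xy[x, y]: joint pmf; loss[yhat, y]: loss table; lam: Lagrange multiplier."""
    cost = 0.0
    for x in range(p_xy.shape[0]):
        m = encoder(x)
        yhat = decoder(m)
        cost += lam * p_xy[x].sum() * codeword_len[m]                          # rate term
        cost += sum(p_xy[x, y] * loss[yhat, y] for y in range(p_xy.shape[1]))  # risk term
    return cost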

To prove Theorem 1, we use the strong functional representation lemma given in [4] (also see [5], [6]): for any random variables X, Ŷ, there exists a random variable W independent of X such that Ŷ is a function of (X, W) and

H(Ŷ | W) ≤ I(X; Ŷ) + log(I(X; Ŷ) + 1) + 4.     (5)

Here, W can be intuitively viewed as the part of Ŷ which is not contained in X. Note that for any W such that Ŷ is a function of (X, W) and W is independent of X, we have H(Ŷ | W) ≥ I(X; Ŷ). The statement (5) ensures the existence of a W, independent of X, which comes close to this lower bound, and in this sense is most informative about Ŷ. This is critical for the proof of Theorem 1, as we will see next. Identifying the part of Ŷ which is not contained in X allows us to generate and share this part between the descriptor and the estimator ahead of time, eliminating the need to communicate it during the course of inference. To find W, we use the Poisson functional representation construction detailed in [4].

Proof of Theorem 1: Recall that Ŷ = f(e(X, W), W). The lower bound follows from the fact that I_P(X; Ŷ) ≤ H_P(M | W) ≤ E[|M|]. To establish the upper bound, fix any P_{Ŷ|X}. Let W be obtained from (5). Note that W is independent of X and can be generated from a random seed shared between the descriptor and the estimator ahead of time. For a given w, take m = e(x, w) to be the Huffman codeword of ŷ(x, w) according to the distribution P_{Ŷ|W=w} (recall that Ŷ is a function of (X, W)), and take f(m, w) to be the decoding function of the Huffman code. The expected codeword length satisfies

E[|M|] ≤ H(Ŷ | W) + 1 ≤ I(X; Ŷ) + log(I(X; Ŷ) + 1) + 5.

Taking an infimum over all P_{Ŷ|X} completes the proof.

Remark 1. If we consider the logarithmic loss l(ŷ, y) = log(1/ŷ(y)), where ŷ is a distribution over Y, then the lower bound in Theorem 1 reduces to

inf_{P_{U|X}} { H(Y | U) + λ I(X; U) } = H(Y) + inf_{P_{U|X}} { λ I(X; U) - I(Y; U) },

which is the information bottleneck function [7]. Therefore the setting of remote prediction provides an approximate one-shot operational interpretation of the information bottleneck, up to a logarithmic gap. In [8], [9] it was shown that the asymptotic noisy source coding problem also provides an operational interpretation of the information bottleneck. Our operational interpretation, however, is more satisfying, since the feature extraction problem originally considered in [7] is by nature one-shot.

We now extend Theorem 1 to the minimax setting.

Theorem 2. Suppose Γ is convex. Then

inf_{P_{Ŷ|X}} sup_{P ∈ Γ} { E_P[l(Ŷ, Y)] + λ I_P(X; Ŷ) }  ≤  L_λ(Γ)  ≤  inf_{P_{Ŷ|X}} sup_{P ∈ Γ} { E_P[l(Ŷ, Y)] + λ ( I_P(X; Ŷ) + 2 log(I_P(X; Ŷ) + 2) + 2 ) }.

This result is related to minimax noisy source coding [10]. The main difference is that we consider the one-shot expected length instead of the asymptotic rate. To prove this theorem, we first invoke a minimax result for relative entropy in [11], which generalizes the redundancy-capacity theorem [12]. Then we apply the following refined version of the strong functional representation lemma, which is proved in the proof of Theorem 1 in [4] (also see [5]).

Lemma 1. For any P_{Ŷ|X} and P_Ŷ, there exist a random variable W independent of X and functions k(x, w) ∈ {1, 2, ...} and ŷ(k, w) such that ŷ(k(x, W), W) | {X = x} ~ P_{Ŷ|X=x} and

E[log k(x, W)] ≤ D(P_{Ŷ|X=x} ‖ P_Ŷ) + 1.
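Lemma 1 is realized by the Poisson functional representation selection rule of [4]. The following is a minimal sketch of one draw of that rule for finite alphabets (function and variable names are ours); it assumes q(ŷ) > 0 wherever p(ŷ) > 0.

import numpy as np

def poisson_functional_representation(p, q, rng):
    """One draw of the Poisson functional representation selection rule (cf. Lemma 1 and [4]).
    p: target pmf P_{Yhat|X=x};  q: reference pmf P_Yhat, with q > 0 wherever p > 0.
    The shared randomness W consists of the arrival times T_k of a rate-1 Poisson process
    and i.i.d. proposals Yhat_k ~ q.  Returns (K, Yhat_K); Yhat_K is distributed according to p,
    and E[log K] stays close to D(p || q), so K is cheap to describe."""
    w = np.divide(p, q, out=np.zeros_like(p, dtype=float), where=q > 0)   # density ratio dP/dQ
    t, best_score, best_k, best_y = 0.0, np.inf, None, None
    k = 0
    while t <= best_score * w.max():        # future atoms have T >= t, so none can beat best_score
        k += 1
        t += rng.exponential(1.0)           # T_k = T_{k-1} + Exp(1)
        y = rng.choice(len(q), p=q)         # proposal Yhat_k ~ q
        score = t / w[y] if w[y] > 0 else np.inf    # T_k / (dP/dQ)(Yhat_k)
        if score < best_score:
            best_score, best_k, best_y = score, k, y
    return best_k, best_y

Because the estimator can reproduce Ŷ_K from K and the shared randomness, the descriptor only has to communicate the integer K, e.g., with a code for the positive integers.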

We are now ready to prove Theorem 2.

Proof: For the lower bound, note that any (e, f) induces a conditional distribution P_{Ŷ|X} that does not depend on P, and for every P ∈ Γ we have E_P[|M|] ≥ H_P(M | W) ≥ I_P(X; Ŷ); hence sup_{P ∈ Γ} L_λ(e, f, P) ≥ sup_{P ∈ Γ} { E_P[l(Ŷ, Y)] + λ I_P(X; Ŷ) } ≥ inf_{P_{Ŷ|X}} sup_{P ∈ Γ} { E_P[l(Ŷ, Y)] + λ I_P(X; Ŷ) }. To prove the upper bound, we fix any P_{Ŷ|X} and show that the following risk-rate cost is achievable:

L = sup_{P ∈ Γ} { E_P[l(Ŷ, Y)] + λ ( I_P(X; Ŷ) + 2 log(I_P(X; Ŷ) + 2) + 2 ) }.

Let

g(P, P_Ŷ) = E_P[l(Ŷ, Y)] + λ ( ∫ D(P_{Ŷ|X=x} ‖ P_Ŷ) dP(x) + 2 log( ∫ D(P_{Ŷ|X=x} ‖ P_Ŷ) dP(x) + 2 ) + 2 ).

Note that g is concave in P for fixed P_Ŷ, since E_P[l(Ŷ, Y)] and ∫ D(P_{Ŷ|X=x} ‖ P_Ŷ) dP(x) are linear in P and t ↦ t + 2 log(t + 2) is concave and nondecreasing. Also, g is quasiconvex in P_Ŷ for fixed P, since ∫ D(P_{Ŷ|X=x} ‖ P_Ŷ) dP(x) is convex in P_Ŷ, and it is lower semicontinuous in P_Ŷ, since D(P_{Ŷ|X=x} ‖ P_Ŷ) is lower semicontinuous with respect to the topology of weak convergence [13], and hence ∫ D(P_{Ŷ|X=x} ‖ P_Ŷ) dP(x) is lower semicontinuous by Fatou's lemma.

Write P_Ŷ^P for the distribution of Ŷ when (X, Y) ~ P and Ŷ | {X = x} ~ P_{Ŷ|X=x}. Let Γ_Ŷ = {P_Ŷ^P : P ∈ Γ}, and let the closure of Γ_Ŷ in the topology of weak convergence be denoted Γ̄_Ŷ. It can be shown, using the same arguments as in [11] applied to g instead of the relative entropy, and using Sion's minimax theorem [14] instead of Lemma 2 in [11], that if Γ̄_Ŷ is uniformly tight then there exists P*_Ŷ ∈ Γ̄_Ŷ such that

sup_{P ∈ Γ} g(P, P*_Ŷ) = inf_{P_Ŷ} sup_{P ∈ Γ} g(P, P_Ŷ) = sup_{P ∈ Γ} inf_{P_Ŷ} g(P, P_Ŷ) ≤ L.

If Γ̄_Ŷ is not uniformly tight, then by Lemma 4 in [11], sup_{P ∈ Γ} inf_{P_Ŷ} ∫ D(P_{Ŷ|X=x} ‖ P_Ŷ) dP(x) = ∞, hence L = ∞ and the upper bound holds trivially; we may therefore assume Γ̄_Ŷ is uniformly tight.

Applying Lemma 1 to P_{Ŷ|X} and P*_Ŷ, we obtain W independent of X, a random variable K = k(X, W) ∈ {1, 2, ...}, and Ŷ = ŷ(K, W) following the conditional distribution P_{Ŷ|X}, such that E[log K | X = x] ≤ D(P_{Ŷ|X=x} ‖ P*_Ŷ) + 1 for any x. We then use the Elias delta code [15] for K to produce M. The codeword length of the Elias delta code is upper bounded by log K + 2 log(log K + 1) + 1, hence, by Jensen's inequality,

E_P[|M|] ≤ E_P[log K] + 2 log(E_P[log K] + 1) + 1 ≤ ∫ D(P_{Ŷ|X=x} ‖ P*_Ŷ) dP(x) + 2 log( ∫ D(P_{Ŷ|X=x} ‖ P*_Ŷ) dP(x) + 2 ) + 2.

Hence

L_λ(e, f, P) = E_P[l(Ŷ, Y)] + λ E_P[|M|] ≤ g(P, P*_Ŷ) ≤ L  for every P ∈ Γ.

Taking the infimum over all P_{Ŷ|X} completes the proof.

Theorem 2 suggests that we can simplify the analysis of the risk-rate cost L_λ(e, f, P) = E_P[l(Ŷ, Y)] + λ E_P[|M|] by replacing the rate E_P[|M|] with the mutual information I_P(X; Ŷ). Define the risk-information cost as

L_λ(P_{Ŷ|X}, P) = E_P[l(Ŷ, Y)] + λ I_P(X; Ŷ).     (7)

Theorem 2 implies that the minimax risk-rate cost L_λ(Γ) can be approximated, within a logarithmic gap, by the minimax risk-information cost

L*_λ(Γ) = inf_{P_{Ŷ|X}} sup_{P ∈ Γ} L_λ(P_{Ŷ|X}, P).     (8)

Theorem 2 can also be stated in the following slightly weaker form:

L*_λ(Γ) ≤ L_λ(Γ) ≤ L*_λ(Γ) + 2λ log(λ^{-1} L*_λ(Γ) + 2) + 2λ.

The risk-information cost has more desirable properties than the risk-rate cost. For example, it is convex in P_{Ŷ|X} for fixed P, and concave in P for fixed P_{Ŷ|X}. This allows us to exchange the infimum and supremum in (8) by Sion's minimax theorem [14], which gives the following proposition.

Proposition 1. Suppose X, Y and Ŷ are finite, Γ is convex and closed, and λ ≥ 0. Then

L*_λ(Γ) = inf_{P_{Ŷ|X}} sup_{P ∈ Γ} L_λ(P_{Ŷ|X}, P) = sup_{P ∈ Γ} inf_{P_{Ŷ|X}} L_λ(P_{Ŷ|X}, P).

Moreover, there exists P*_{Ŷ|X} attaining the infimum on the left-hand side, which also attains the infimum on the right-hand side when P is fixed to P*, the distribution that attains the supremum on the right-hand side.

Proposition 1 means that, in order to design a robust descriptor-estimator pair that works for any P ∈ Γ, we only need to design it according to the worst-case distribution P* as follows.

Principle of maximum risk-information cost: Given a convex and closed Γ, we design the descriptor-estimator pair based on the worst-case distribution

P* = argmax_{P ∈ Γ} inf_{P_{Ŷ|X}} L_λ(P_{Ŷ|X}, P).

We then find the P*_{Ŷ|X} that minimizes L_λ(P_{Ŷ|X}, P*), and design the descriptor-estimator pair accordingly, e.g., using Lemma 1 with P*_{Ŷ|X} and the marginal P*_Ŷ induced by P*_{Ŷ|X} and P*.
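For a fixed distribution P (the inner minimization above, leaving aside the supremum over Γ), the risk-information cost (7) can be minimized over P_{Ŷ|X} by a Blahut-Arimoto-style alternating minimization, since for a fixed Ŷ-marginal the optimal conditional is an exponential tilting of that marginal. The following is a minimal sketch for finite alphabets; the routine and all names are ours, not from the paper, and it assumes every x has positive probability.

import numpy as np

def min_risk_information_cost(p_xy, loss, lam, n_iter=500):
    """Approximates inf over P_{Yhat|X} of E_P[l(Yhat, Y)] + lam * I_P(X; Yhat)   (cf. (7))
    for a fixed joint pmf p_xy[x, y] and loss table loss[yhat, y]; log base 2, lam in bits."""
    p_x = p_xy.sum(axis=1)
    d = (p_xy / p_x[:, None]) @ loss.T                  # d[x, yhat] = E[ l(yhat, Y) | X = x ]
    q = np.full((p_xy.shape[0], loss.shape[0]), 1.0 / loss.shape[0])   # P_{Yhat|X}, uniform start
    for _ in range(n_iter):
        r = p_x @ q                                     # induced marginal of Yhat
        q = r[None, :] * np.exp2(-d / lam)              # tilt the marginal by the expected loss
        q /= q.sum(axis=1, keepdims=True)
    r = p_x @ q
    risk = float(np.sum(p_x[:, None] * q * d))
    with np.errstate(divide="ignore", invalid="ignore"):
        mi = float(np.nansum(np.where(q > 0, p_x[:, None] * q * np.log2(q / r[None, :]), 0.0)))
    return risk + lam * mi, q

The outer maximization over P ∈ Γ called for by the principle above still has to be handled separately, e.g., by convex optimization over the ambiguity set.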

IV. APPLICATIONS

A. Rate-Constrained Minimax Linear Regression

Suppose X ∈ R^d, Y ∈ R, l(ŷ, y) = (ŷ - y)^2 is the mean-squared loss, and we observe the data {(X_i, Y_i)}_{i=1}^n. Take Γ to be the set of distributions with the same first and second moments as given by the empirical distribution, i.e.,

Γ = { P_{XY} : E[X] = µ_X, E[Y] = µ_Y, Var[X] = Σ_X, Var[Y] = σ_Y^2, Cov[X, Y] = C_XY },     (9)

where µ_X, µ_Y, Σ_X, σ_Y^2, C_XY are the corresponding statistics of the empirical distribution. The following proposition shows that P* is Gaussian.

Proposition 2 (Linear regression with rate constraint). Consider the mean-squared loss and define Γ as in (9). Then the minimax risk-information cost (8) is

L*_λ = σ_Y^2 - C_XY^T Σ_X^{-1} C_XY + (λ/2) log( (2e C_XY^T Σ_X^{-1} C_XY) / (λ log e) ),     (10)

where the optimal P*_{XY} is Gaussian with its mean and covariance matrix specified in (9), and the optimal estimate is

Ŷ = a C_XY^T Σ_X^{-1} X + b + Z   if λ log e / 2 < C_XY^T Σ_X^{-1} C_XY,   and   Ŷ = µ_Y   if λ log e / 2 ≥ C_XY^T Σ_X^{-1} C_XY,

where a = 1 - (λ log e) / (2 C_XY^T Σ_X^{-1} C_XY), b = µ_Y - a C_XY^T Σ_X^{-1} µ_X, and Z ~ N(0, σ_Z^2) is independent of X with σ_Z^2 = λ a log e / 2.

Note that this setting does not satisfy the conditions in Proposition 1. We directly analyze (8) to obtain the optimal P*_{XY}. Given the optimal P*_{XY}, Theorem 2 and Lemma 1 can be used to construct the scheme. Operationally, e_n(X, W) is a random quantizer of a C_XY^T Σ_X^{-1} X + b such that the quantization noise follows N(0, σ_Z^2). With this natural choice of the ambiguity set, our formulation recovers a compressed version of the familiar MMSE estimator. Figure 3 plots the tradeoff between the rate and the risk when d = 1, µ_X = µ_Y = 0, σ_X^2 = σ_Y^2 = 1, C_XY = 0.95 for the scheme constructed using the Poisson functional representation in [4], with the lower bound given by the minimax risk-information cost L*_λ and the upper bound given in Theorem 2.

Fig. 3. Tradeoff between the rate and the risk in rate-constrained minimax linear regression.
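The closed-form quantities in Proposition 2 can be sanity-checked by simulation. The following sketch (our own; all variable names are ours) draws from the worst-case Gaussian distribution in the scalar setting of Fig. 3 and compares the empirical risk of the estimator above with the value σ_Y^2 - C_XY^T Σ_X^{-1} C_XY + λ log e / 2 used in the proof.

import numpy as np

# Scalar sanity check of the estimator in Proposition 2 (d = 1, the setting of Fig. 3).
rng = np.random.default_rng(0)
sigma_x2, sigma_y2, c_xy, lam = 1.0, 1.0, 0.95, 0.1   # moments and Lagrange multiplier
sbar2 = c_xy**2 / sigma_x2                            # C^T Sigma_X^{-1} C: variance of the MMSE estimate
D = lam * np.log2(np.e) / 2                           # distortion allowed by the rate constraint
assert D < sbar2                                      # non-degenerate regime of Proposition 2
a = 1 - D / sbar2
sigma_z2 = D * a

n = 1_000_000
x = rng.normal(0.0, np.sqrt(sigma_x2), n)
y = (c_xy / sigma_x2) * x + rng.normal(0.0, np.sqrt(sigma_y2 - sbar2), n)  # worst-case Gaussian P*
yhat = a * (c_xy / sigma_x2) * x + rng.normal(0.0, np.sqrt(sigma_z2), n)   # remote estimate

print("empirical risk :", np.mean((yhat - y) ** 2))
print("predicted risk :", sigma_y2 - sbar2 + D)              # sigma_Y^2 - sbar2 + lam*log(e)/2
print("I(X; Yhat) bits:", 0.5 * np.log2(sbar2 / D))          # information term of the cost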

Proof of Proposition 2: Without loss of generality, assume µ_X = 0 and µ_Y = 0. We first prove "≤" in (10). For this, fix P_{Ŷ|X} as given in the proposition and consider any P ∈ Γ. When λ log e / 2 < C_XY^T Σ_X^{-1} C_XY, we have

E_P[l(Ŷ, Y)] = E_P[(Ŷ - Y)^2] = σ_Y^2 + λ log e / 2 - C_XY^T Σ_X^{-1} C_XY,

and

I_P(X; Ŷ) = h(Ŷ) - h(Ŷ | X) ≤ (1/2) log( (2 C_XY^T Σ_X^{-1} C_XY) / (λ log e) ).

Therefore,

inf_{P_{Ŷ|X}} sup_{P ∈ Γ} { E_P[l(Ŷ, Y)] + λ I_P(X; Ŷ) } ≤ R.H.S. of (10).

It can also be checked that the above relation holds when λ log e / 2 ≥ C_XY^T Σ_X^{-1} C_XY, and thus we have proved "≤" in (10).

To prove "≥" in (10), fix the Gaussian P_{XY} with its mean and covariance matrix specified in (9) and consider an arbitrary P_{Ŷ|X}. We have

E_P[l(Ŷ, Y)] = E_P[(Y - Ŷ)^2] = σ_Y^2 - C_XY^T Σ_X^{-1} C_XY + E_P[(Ŷ - C_XY^T Σ_X^{-1} X)^2],

and

I_P(X; Ŷ) ≥ I_P(C_XY^T Σ_X^{-1} X ; Ŷ) = h(C_XY^T Σ_X^{-1} X) - h(C_XY^T Σ_X^{-1} X | Ŷ) ≥ (1/2) log(C_XY^T Σ_X^{-1} C_XY) - (1/2) log( E_P[(Ŷ - C_XY^T Σ_X^{-1} X)^2] ).

Letting γ = E_P[(Ŷ - C_XY^T Σ_X^{-1} X)^2], we have

E_P[l(Ŷ, Y)] + λ I_P(X; Ŷ) ≥ σ_Y^2 - C_XY^T Σ_X^{-1} C_XY + (λ/2) log(C_XY^T Σ_X^{-1} C_XY) + γ - (λ/2) log γ ≥ R.H.S. of (10),

where the second inequality follows by evaluating the minimum value of γ - (λ/2) log γ, attained at γ = λ log e / 2. Combining this with the above completes the proof of Proposition 2.

The optimal scheme in the above example corresponds to compressing and communicating the minimax optimal rate-unconstrained predictor Ȳ = C_XY^T Σ_X^{-1} (X - µ_X) + µ_Y, since the optimal Ŷ can be obtained from Ȳ by shifting, scaling and adding noise. This estimate-compress approach can be thought of as a separation scheme, since we first optimally estimate Ȳ and then optimally communicate it while satisfying the rate constraint. In the next application, we show that such separation is not optimal in general.

B. Rate-Constrained Minimax Classification

We assume Y = Ŷ = {1, ..., k} and X are finite, l(ŷ, y) = 1{ŷ ≠ y}, and Γ is closed and convex. The following proposition gives the minimax risk-information cost and the optimal estimator.

Proposition 3. Consider the setting described above. The minimax risk-information cost is given by

L*_λ = 1 + λ sup_{P ∈ Γ} inf_{P_Ŷ} E_P[ -log( Σ_ŷ P_Ŷ(ŷ) 2^{λ^{-1} P_{Y|X}(ŷ|X)} ) ],

the worst-case distribution P* is the one attaining the supremum, and the optimal estimator is given by

P*_{Ŷ|X}(ŷ | x) ∝ P*_Ŷ(ŷ) 2^{λ^{-1} P*_{Y|X}(ŷ|x)},

where P*_Ŷ attains the infimum and P*_{Y|X} is obtained from P*. In particular, if Γ is symmetric for different values of Y (i.e., for any y_1, y_2 ∈ Y, there exist a permutation π of Y and a permutation τ of X such that π(y_1) = y_2 and P_{X,Y} ∈ Γ if and only if P_{τ(X),π(Y)} ∈ Γ), then

L*_λ = 1 + λ log k - λ E_{P*}[ log( Σ_ŷ 2^{λ^{-1} P*_{Y|X}(ŷ|X)} ) ].

We can see that as λ → 0, P*_{Ŷ|X} tends to the maximum a posteriori estimator under P*, the worst-case distribution when λ = 0.
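The optimal estimator in Proposition 3 is an exponential tilting of the Ŷ-marginal by the posterior. The short sketch below (ours, with hypothetical numbers) computes this tilting for one value of x and illustrates the λ → 0 limit towards the MAP estimate.

import numpy as np

def tilted_estimator(p_yhat, posterior, lam):
    """P*_{Yhat|X}(yhat | x)  proportional to  P*_Yhat(yhat) * 2^{ posterior(yhat) / lam },
    where posterior(yhat) = P*_{Y|X}(yhat | x)   (cf. Proposition 3; 0-1 loss, log base 2)."""
    w = p_yhat * np.exp2(posterior / lam)
    return w / w.sum()

posterior = np.array([0.5, 0.3, 0.2])      # P*_{Y|X}(. | x) for some x (hypothetical numbers)
p_yhat = np.ones(3) / 3                    # marginal attaining the infimum (uniform under symmetry)
for lam in (1.0, 0.1, 0.01):
    print(lam, tilted_estimator(p_yhat, posterior, lam))
# As lam -> 0 the output concentrates on argmax of the posterior, i.e., the MAP estimate.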

Proof: Assume Γ is closed and convex. By Proposition 1, the minimax risk-information cost is L*_λ = sup_{P ∈ Γ} inf_{P_{Ŷ|X}} L_λ(P_{Ŷ|X}, P), where

inf_{P_{Ŷ|X}} L_λ(P_{Ŷ|X}, P)
= inf_{P_{Ŷ|X}} { E_P[l(Ŷ, Y)] + λ I_P(X; Ŷ) }
= inf_{P_{Ŷ|X}, P_Ŷ} { P{Ŷ ≠ Y} + λ ∫ D(P_{Ŷ|X=x} ‖ P_Ŷ) dP(x) }
= 1 + λ inf_{P_{Ŷ|X}, P_Ŷ} E_P[ Σ_ŷ P_{Ŷ|X}(ŷ|X) log( P_{Ŷ|X}(ŷ|X) / ( 2^{λ^{-1} P_{Y|X}(ŷ|X)} P_Ŷ(ŷ) ) ) ]
=(a) 1 + λ inf_{P_Ŷ} E_P[ -log( Σ_ŷ 2^{λ^{-1} P_{Y|X}(ŷ|X)} P_Ŷ(ŷ) ) ],

where (a) is due to the nonnegativity of relative entropy, with equality attained when P_{Ŷ|X}(ŷ|x) ∝ 2^{λ^{-1} P_{Y|X}(ŷ|x)} P_Ŷ(ŷ). This gives the general expression and the form of the optimal estimator.

Next we consider the case in which Γ is symmetric. Consider the minimax risk-information cost

L*_λ = inf_{P_{Ŷ|X}} sup_{P ∈ Γ} L_λ(P_{Ŷ|X}, P) = inf_{P_{Ŷ|X}} sup_{P ∈ Γ} { E_P[l(Ŷ, Y)] + λ I_P(X; Ŷ) }.

For any i, j ∈ Y = {1, ..., k}, let π_ij be the permutation over Y such that π_ij(i) = j, and let τ_ij be the corresponding permutation over X in the symmetry assumption. Since the function P_{Ŷ|X} ↦ L_λ(P_{Ŷ|X}, P) is convex and symmetric with respect to π_ij and τ_ij (i.e., L_λ(P_{Ŷ|X}, P) = L_λ(P_{π_ij(Ŷ)|τ_ij(X)}, P)), to find its infimum we only need to consider P_{Ŷ|X}'s satisfying P_{Ŷ|X} = P_{π_ij(Ŷ)|τ_ij(X)} for all i, j (if not, we can instead consider the average of P_{π_ij^a(Ŷ)|τ_ij^a(X)} for a from 1 up to the product of the periods of π_ij and τ_ij, which gives a value of the function not larger than that of P_{Ŷ|X}). For brevity we say P_{Ŷ|X} is symmetric if it satisfies this condition. Fix any symmetric P_{Ŷ|X}. Since the function P ↦ L_λ(P_{Ŷ|X}, P) is concave and symmetric with respect to π_ij and τ_ij (i.e., L_λ(P_{Ŷ|X}, P_{X,Y}) = L_λ(P_{Ŷ|X}, P_{τ_ij(X),π_ij(Y)})), to find its supremum we only need to consider symmetric P's. Hence,

L*_λ = inf_{symm. P_{Ŷ|X}} sup_{symm. P ∈ Γ} L_λ(P_{Ŷ|X}, P)
= inf_{symm. P_{Ŷ|X}} sup_{symm. P ∈ Γ} { E_P[l(Ŷ, Y)] + λ I_P(X; Ŷ) }
= inf_{symm. P_{Ŷ|X}} sup_{symm. P ∈ Γ} { P{Ŷ ≠ Y} + λ ( log k - H_P(Ŷ | X) ) }   (the marginal of Ŷ is uniform by symmetry)
= 1 + λ log k + λ inf_{symm. P_{Ŷ|X}} sup_{symm. P ∈ Γ} E_P[ Σ_ŷ P_{Ŷ|X}(ŷ|X) log( P_{Ŷ|X}(ŷ|X) / 2^{λ^{-1} P_{Y|X}(ŷ|X)} ) ]
≥ sup_{symm. P ∈ Γ} { 1 + λ log k - λ E_P[ log( Σ_ŷ 2^{λ^{-1} P_{Y|X}(ŷ|X)} ) ] },

where the inequality is because relative entropy is nonnegative, with equality attained when P_{Ŷ|X}(ŷ|x) ∝ 2^{λ^{-1} P_{Y|X}(ŷ|x)}. Note that

1 + λ log k - λ E_P[ log( Σ_ŷ 2^{λ^{-1} P_{Y|X}(ŷ|X)} ) ] = inf_{P_{Ŷ|X}} { P{Ŷ ≠ Y} + λ ( log k - H_P(Ŷ | X) ) }

is an infimum of affine functions of P, hence it is concave in P. It is also symmetric with respect to π and τ, hence

sup_{symm. P ∈ Γ} { 1 + λ log k - λ E_P[ log( Σ_ŷ 2^{λ^{-1} P_{Y|X}(ŷ|X)} ) ] } = sup_{P ∈ Γ} { 1 + λ log k - λ E_P[ log( Σ_ŷ 2^{λ^{-1} P_{Y|X}(ŷ|X)} ) ] }.

The other direction follows from setting P_Ŷ(ŷ) = 1/k in the general expression, which completes the proof.

To show that the estimate-compress approach is not always optimal, let l(ŷ, y) = 1{ŷ ≠ y} and Y = Y_1 ∪ Y_2, where Y_1 ∩ Y_2 = ∅ and |Y_i| = k_i is finite. Let Γ = {P}, where P is such that X = (X_1, X_2) ~ Unif(Y_1 x Y_2) and Y = X_i with probability q_i for i = 1, 2. By Proposition 3, the optimal risk-information cost is

1 - λ log max{ (1/k_1)(2^{λ^{-1} q_1} - 1) + 1,  (1/k_2)(2^{λ^{-1} q_2} - 1) + 1 },     (11)

and the optimal scheme is to generate Ŷ by compressing X_1 (by passing it through a symmetric channel) if (1/k_1)(2^{λ^{-1} q_1} - 1) > (1/k_2)(2^{λ^{-1} q_2} - 1), and by compressing X_2 otherwise. Assume q_1 > q_2; then the optimal MAP estimate is Ȳ = X_1. An estimate-compress approach would try to communicate a compressed version of Ȳ = X_1. However, the optimal rate-constrained descriptor communicates a lossy version of X_2 rather than X_1 if k_1 ≫ k_2. The risk-information cost achieved by the estimate-compress approach is

1 - λ log max{ (1/k_1)(2^{λ^{-1} q_1} - 1) + 1,  2^{λ^{-1} q_2 / k_2} },     (12)

which is larger than (11) when k_1 ≫ k_2. Moreover, the gap between the rates needed by the two approaches for a fixed risk can be unbounded. Take q_1 = 1 - q_2 = 2/3, k_2 = 2, and let k_1 grow. The minimum rate needed to achieve a risk of 2/3 is 1 bit (by Ŷ = X_2). For the estimate-compress approach, since Ŷ ~ Unif(Y_2) gives a risk of 5/6, we have to compress X_1 to achieve a risk of 2/3, which requires an unbounded rate of order log k_1.

Figure 4 compares the optimal scheme, the lower bound obtained from the optimal risk-information tradeoff (11), the upper bound on the optimal rate by Theorem 1, and the risk-information tradeoff of the estimate-compress approach (12), for q_1 = 1 - q_2 = 2/3, k_1 = 3, k_2 = 2. The optimal scheme performs time sharing, using common randomness, between encoding X_1 (with risk 1/3), encoding X_2 using 1 bit (with risk 2/3), and fixing the output at one value of X_2 using 0 bits (with risk 5/6). The mutual information needed by the estimate-compress approach (which is a lower bound on the actual rate needed by this approach) is strictly greater than the optimal rate, except when the risk is at its minimum 1/3 or maximum 5/6.

Fig. 4. Tradeoff between the rate and the risk in rate-constrained minimax classification for the optimal scheme, the lower bound (11), the upper bound by Theorem 1, and the estimate-compress approach (12).
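This gap can be reproduced numerically by comparing the best descriptor that sees all of X with one restricted to the MAP feature X_1, reusing the min_risk_information_cost routine sketched in Section III. The parameters below are illustrative choices of ours (a larger k_1 makes the gap visible at this λ), not the ones used for Fig. 4.

import itertools
import numpy as np

k1, k2, q1, lam = 16, 2, 2/3, 0.3            # illustrative parameters (ours)
Y1, Y2 = range(k1), range(k1, k1 + k2)
states = list(itertools.product(Y1, Y2))     # X = (X1, X2) uniform on Y1 x Y2
n_y = k1 + k2
loss = 1.0 - np.eye(n_y)                     # 0-1 loss

p_xy = np.zeros((len(states), n_y))          # joint pmf of X (flattened) and Y
p_x1y = np.zeros((k1, n_y))                  # joint pmf of X1 and Y (estimate-compress view)
for i, (x1, x2) in enumerate(states):
    p_xy[i, x1] += q1 / len(states)
    p_xy[i, x2] += (1 - q1) / len(states)
    p_x1y[x1, x1] += q1 / len(states)
    p_x1y[x1, x2] += (1 - q1) / len(states)

opt, _ = min_risk_information_cost(p_xy, loss, lam)    # descriptor may depend on all of X
ec, _ = min_risk_information_cost(p_x1y, loss, lam)    # restricted to X1: I(X; Yhat) = I(X1; Yhat)
print("optimal risk-information cost  :", opt)
print("estimate-compress cost (via X1):", ec)          # never smaller; strictly larger here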
V. ACKNOWLEDGEMENTS

This work was partially supported by a gift from Huawei Technologies and by the Center for Science of Information (CSoI), an NSF Science and Technology Center, under grant agreement CCF.

REFERENCES

[1] F. Farnia and D. Tse, "A minimax approach to supervised learning," in Advances in Neural Information Processing Systems, 2016.
[2] H. Namkoong and J. C. Duchi, "Variance-based regularization with convex objectives," in Advances in Neural Information Processing Systems, 2017.
[3] J. Lee and M. Raginsky, "Minimax statistical learning and domain adaptation with Wasserstein distances," arXiv preprint, 2017.
[4] C. T. Li and A. El Gamal, "Strong functional representation lemma and applications to coding theorems," in Proc. IEEE Int. Symp. Inf. Theory, June 2017.
[5] P. Harsha, R. Jain, D. McAllester, and J. Radhakrishnan, "The communication complexity of correlation," IEEE Trans. Inf. Theory, vol. 56, no. 1, Jan. 2010.
[6] M. Braverman and A. Garg, "Public vs private coin in bounded-round information," in International Colloquium on Automata, Languages, and Programming. Springer, 2014.
[7] N. Tishby, F. C. Pereira, and W. Bialek, "The information bottleneck method," arXiv preprint, 2000.
[8] P. Harremoës and N. Tishby, "The information bottleneck revisited or how to choose a good distortion measure," in Proc. IEEE Int. Symp. Inf. Theory, 2007.

[9] T. A. Courtade and T. Weissman, "Multiterminal source coding under logarithmic loss," IEEE Trans. Inf. Theory, vol. 60, no. 1, 2014.
[10] A. Dembo and T. Weissman, "The minimax distortion redundancy in noisy source coding," IEEE Trans. Inf. Theory, vol. 49, no. 11, 2003.
[11] D. Haussler, "A general minimax result for relative entropy," IEEE Trans. Inf. Theory, vol. 43, no. 4, Jul. 1997.
[12] R. G. Gallager, "Source coding with side information and universal coding," MIT Laboratory for Information and Decision Systems, Tech. Rep. LIDS-P-937.
[13] E. Posner, "Random coding strategies for minimum entropy," IEEE Trans. Inf. Theory, vol. 21, no. 4, Jul. 1975.
[14] M. Sion, "On general minimax theorems," Pacific Journal of Mathematics, vol. 8, no. 1, 1958.
[15] P. Elias, "Universal codeword sets and representations of the integers," IEEE Trans. Inf. Theory, vol. 21, no. 2, Mar. 1975.


More information

Lecture 5: GPs and Streaming regression

Lecture 5: GPs and Streaming regression Lecture 5: GPs and Streaming regression Gaussian Processes Information gain Confidence intervals COMP-652 and ECSE-608, Lecture 5 - September 19, 2017 1 Recall: Non-parametric regression Input space X

More information

Multiuser Successive Refinement and Multiple Description Coding

Multiuser Successive Refinement and Multiple Description Coding Multiuser Successive Refinement and Multiple Description Coding Chao Tian Laboratory for Information and Communication Systems (LICOS) School of Computer and Communication Sciences EPFL Lausanne Switzerland

More information

IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 52, NO. 4, APRIL

IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 52, NO. 4, APRIL IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 52, NO. 4, APRIL 2006 1545 Bounds on Capacity Minimum EnergyPerBit for AWGN Relay Channels Abbas El Gamal, Fellow, IEEE, Mehdi Mohseni, Student Member, IEEE,

More information

Channels with cost constraints: strong converse and dispersion

Channels with cost constraints: strong converse and dispersion Channels with cost constraints: strong converse and dispersion Victoria Kostina, Sergio Verdú Dept. of Electrical Engineering, Princeton University, NJ 08544, USA Abstract This paper shows the strong converse

More information

Randomized Quantization and Optimal Design with a Marginal Constraint

Randomized Quantization and Optimal Design with a Marginal Constraint Randomized Quantization and Optimal Design with a Marginal Constraint Naci Saldi, Tamás Linder, Serdar Yüksel Department of Mathematics and Statistics, Queen s University, Kingston, ON, Canada Email: {nsaldi,linder,yuksel}@mast.queensu.ca

More information

Probability and Statistical Decision Theory

Probability and Statistical Decision Theory Tufts COMP 135: Introduction to Machine Learning https://www.cs.tufts.edu/comp/135/2019s/ Probability and Statistical Decision Theory Many slides attributable to: Erik Sudderth (UCI) Prof. Mike Hughes

More information

Machine Learning Basics: Maximum Likelihood Estimation

Machine Learning Basics: Maximum Likelihood Estimation Machine Learning Basics: Maximum Likelihood Estimation Sargur N. srihari@cedar.buffalo.edu This is part of lecture slides on Deep Learning: http://www.cedar.buffalo.edu/~srihari/cse676 1 Topics 1. Learning

More information

Lossy Distributed Source Coding

Lossy Distributed Source Coding Lossy Distributed Source Coding John MacLaren Walsh, Ph.D. Multiterminal Information Theory, Spring Quarter, 202 Lossy Distributed Source Coding Problem X X 2 S {,...,2 R } S 2 {,...,2 R2 } Ẑ Ẑ 2 E d(z,n,

More information

Optimization in Information Theory

Optimization in Information Theory Optimization in Information Theory Dawei Shen November 11, 2005 Abstract This tutorial introduces the application of optimization techniques in information theory. We revisit channel capacity problem from

More information

On the Capacity of the Interference Channel with a Relay

On the Capacity of the Interference Channel with a Relay On the Capacity of the Interference Channel with a Relay Ivana Marić, Ron Dabora and Andrea Goldsmith Stanford University, Stanford, CA {ivanam,ron,andrea}@wsl.stanford.edu Abstract Capacity gains due

More information

An Information Theory For Preferences

An Information Theory For Preferences An Information Theor For Preferences Ali E. Abbas Department of Management Science and Engineering, Stanford Universit, Stanford, Ca, 94305 Abstract. Recent literature in the last Maimum Entrop workshop

More information