Information Theoretic Guarantees for Empirical Risk Minimization with Applications to Model Selection and Large-Scale Optimization

Ibrahim Alabdulmohsin (Saudi Aramco, Dhahran 31311, Saudi Arabia; correspondence to: <ibrahim.alabdulmohsin@kaust.edu.sa>)

Proceedings of the 35th International Conference on Machine Learning, Stockholm, Sweden, PMLR 80, 2018. Copyright 2018 by the author(s).

Abstract

In this paper, we derive bounds on the mutual information of the empirical risk minimization (ERM) procedure for both 0-1 and strongly-convex loss classes. We prove that under the Axiom of Choice, the existence of an ERM learning rule with a vanishing mutual information is equivalent to the assertion that the loss class has a finite VC dimension, thus bridging information theory with statistical learning theory. Similarly, an asymptotic bound on the mutual information is established for strongly-convex loss classes in terms of the number of model parameters. The latter result rests on a central limit theorem (CLT) that we derive in this paper. In addition, we use our results to analyze the excess risk in stochastic convex optimization and unify previous works. Finally, we present two important applications. First, we show that the ERM of strongly-convex loss classes can be trivially scaled to big data using a naïve parallelization algorithm with provable guarantees. Second, we propose a simple information criterion for model selection and demonstrate experimentally that it outperforms the popular Akaike's information criterion (AIC) and Schwarz's Bayesian information criterion (BIC).

1. Introduction

Learning via the empirical risk minimization (ERM) of stochastic loss is a ubiquitous framework that has been widely applied in machine learning and statistics. It is often regarded as the default strategy to use, due to its simplicity, generality, and statistical efficiency (Shalev-Shwartz et al., 2009; Koren & Levy, 2015; Vapnik, 1999; Shalev-Shwartz & Ben-David, 2014). Given a fixed hypothesis space H, a domain Z, and a loss function on the product space f : H × Z → R, the ERM learning rule selects the hypothesis ĥ that minimizes the empirical risk:

    ĥ = argmin_{h ∈ H} { F_S(h) = (1/m) Σ_{i=1}^m f(h, z_i) },    (1)

where S = (z_1, ..., z_m) ∈ Z^m is a sample of observations drawn i.i.d. from some unknown probability distribution D. To simplify notation, we omit the dependence of ĥ on the sample S. By contrast, the true risk minimizer is denoted h*:

    h* = argmin_{h ∈ H} { F(h) = E_{z∼D}[f(h, z)] }.    (2)

Hence, learning via ERM is justified if and only if F(ĥ) ≤ F(h*) + ɛ, for some provably small ɛ. In some applications, such as classification, the stochastic loss of interest is often the 0-1 loss. However, minimizing the 0-1 loss is, in general, computationally intractable even for simple hypothesis spaces, such as linear classifiers (Feldman et al., 2009). To circumvent this difficulty, a convex, surrogate loss is used instead, which is justified by consistency results when the loss is calibrated (Bartlett et al., 2006). When the stochastic loss is convex, ERM has occasionally been referred to by various names in the literature, such as stochastic convex optimization and stochastic average approximation (Shalev-Shwartz et al., 2009; Feldman, 2016). Examples of the latter framework include least squares, logistic regression, and SVM. In the literature, several approaches have been proposed for analyzing the generalization risk and, consequently, the consistency of empirical risk minimization.
The most dominant approach is uniform convergence, with seminal results for 0-1 loss classes dating to the early work of Vapnik and Chervonenkis in the 1970s (Shalev-Shwartz & Ben-David, 2014; Abu-Mostafa et al., 2012; Vapnik, 1999). The Fundamental Theorem of Statistical Learning states that a hypothesis space H is agnostic PAC-learnable via ERM if and only if it is PAC-learnable at all, and that this occurs if and only if H has a finite VC dimension (Shalev-Shwartz & Ben-David, 2014). This elegant characterization shows that the generalization risk for 0-1 loss classes can be bounded by computing their VC dimensions.

Similarly, when the stochastic loss is both convex and of the generalized linear form, where f(h, z_i) = r(h) + g(⟨h, φ(z_i)⟩, z_i) and r : H → R is a strongly-convex regularizer, then, under mild additional conditions of smoothness and boundedness, it has been shown that the true risk sub-optimality F(h) − F(h*) is bounded uniformly in H by a constant multiple of the empirical risk sub-optimality F_S(h) − F_S(ĥ) plus an O(1/m) term (Sridharan et al., 2009). In general, it can be shown that for strongly-convex stochastic losses, the risk of the empirical risk minimizer converges to the optimal risk with rate O(1/m), which justifies learning via empirical risk minimization (Shalev-Shwartz et al., 2009). Hence, the consistency of empirical risk minimization for both 0-1 loss classes and strongly-convex loss functions is, generally, well-understood.

However, many information-theoretic approaches have been recently proposed for analyzing machine learning algorithms. These include generalization bounds that are based on max-information (Dwork et al., 2015), leave-one-out information (Raginsky et al., 2016), and the mutual information (Alabdulmohsin, 2015; Russo & Zou, 2016). They have found applications in important areas, such as privacy and adaptive data analysis, since they naturally lead to guaranteed stability, bounded information leakage, and robustness against post-processing. Compared to other information-theoretic approaches, uniform generalization bounds using the variational information, also called T-information (Raginsky et al., 2016), yield the tightest results for bounded loss functions, as deduced by the Pinsker inequality (Cover & Thomas, 1991). It was proposed in (Alabdulmohsin, 2015), who used it to prove that uniform generalization, algorithmic stability, and bounded information were, in fact, equivalent conditions on the learning algorithm. A similar notion, called "robust generalization", was later proposed in (Cummings et al., 2016), who analyzed its significance in the adaptive learning setting and showed that it could be achieved using sample compression schemes, finite description lengths, and differential privacy. However, these two notions of "uniform" and "robust" generalization turned out to be equivalent (Alabdulmohsin, 2017).

To describe uniform generalization in informal terms, suppose we have a learning algorithm L : Z^m → H, which selects a hypothesis h ∈ H according to a training sample S ∈ Z^m. Let the generalization risk of L w.r.t. some bounded loss function l : H × Z → [0, 1] be defined by:

    R_gen(L) = | E_{S∼D^m, h∼L(S)} [ L(h) − L_S(h) ] |,    (3)

where L(h) = E_{z∼D}[l(h, z)] and L_S is its empirical counterpart L_S(h) = E_{z∼S}[l(h, z)]. In Eq. (3), the expectation is taken over the random choice of the sample and the internal randomness (if any) in the learning algorithm. Note that l in Eq. (3) can be different from the loss f in Eq. (1) that is optimized during the learning stage, e.g. l can be a 0-1 misclassification error rate while f is some convex, surrogate loss. Then, L is said to generalize uniformly with rate ɛ > 0 if R_gen(L) ≤ ɛ is guaranteed to hold independently of the choice of l. That is, the generalization guarantee holds uniformly in expectation across all parametric loss functions, hence the name.

Definition 1 (Uniform Generalization). A learning algorithm L : Z^m → H generalizes uniformly with rate ɛ ≥ 0 if, for all bounded parametric losses l : H × Z → [0, 1], we have R_gen(L) ≤ ɛ, where R_gen(L) is given in Eq. (3).
Informally, Definition 1 states that once a hypothesis h is selected by a learning algorithm L that achieves uniform generalization, no "adversary" can post-process the hypothesis in a manner that causes over-fitting to occur (Alabdulmohsin, 2015; Cummings et al., 2016). Equivalently, uniform generalization implies that the empirical performance of h on the sample S is a faithful approximation to its true risk, regardless of how that performance is measured. For example, the loss function l in Eq. (3) can be the misclassification error rate, a cost-sensitive error rate in fraud detection and medical diagnosis (Elkan, 2001), or it can be the Brier score in probabilistic predictions (Kull & Flach, 2015). The generalization guarantee would hold in any case. The main theorem of (Alabdulmohsin, 2015) states that the uniform generalization risk has a precise information-theoretic characterization:

Theorem 1 (Alabdulmohsin, 2015). Given a fixed 0 ≤ ɛ ≤ 1 and a learning algorithm L : Z^m → H that selects a hypothesis h ∈ H according to a training sample S = {z_1, ..., z_m}, where z_i ∼ D are i.i.d., then L generalizes uniformly with rate ɛ if and only if J(h; ẑ) ≤ ɛ, where ẑ ∼ S is a single random training example, J(x; y) = ||p(x)p(y), p(x, y)||_T, and ||q_1, q_2||_T is the total variation distance between the probability measures q_1 and q_2.

Throughout this paper, we will adopt the terminology used in (Alabdulmohsin, 2017) and call J(x; y) the "variational information" between the random variables x and y. Variational information is an instance of the class of informativity measures using f-divergences, for which an axiomatic basis has been proposed (Csiszár, 1972; 2008).

To illustrate Theorem 1, consider the case of a finite hypothesis space |H| < ∞. Then, a classical argument using the union bound (Shalev-Shwartz & Ben-David, 2014) can be used to show that the generalization risk w.r.t. any fixed bounded loss l : H × Z → [0, 1] is Õ(√(log|H| / m)). This follows from the fact that the union bound argument does not make any additional assumptions on the loss l beyond the fact that it has a bounded range. Hence, the uniform generalization risk is Õ(√(log|H| / m)). However, Theorem 1 states that this bound must also hold for the variational information J(h; ẑ). This claim can, in fact, be verified readily using information-theoretic inequalities, since we have (Alabdulmohsin, 2015):

    J(h; ẑ)² ≤ I(h; ẑ)/2 ≤ I(h; S)/(2m) ≤ log|H| / (2m),

where I(x; y) is the Shannon mutual information measured in nats (i.e. using natural logarithms). Here, the first inequality is due to Pinsker (Cover & Thomas, 1991), the second inequality follows because the z_i are i.i.d., and the last inequality holds because the Shannon mutual information is bounded by the entropy. Note that in the latter case, only information-theoretic inequalities were used to recover this classical result, without relying on the union bound.

The generalization guarantee in Theorem 1 depends on the probability distribution D. It can be made distribution-free by defining the capacity of a learning algorithm L by:

    C(L) = sup_{p(z)} { J(h; ẑ) } = sup_{p(z)} E_{ẑ∼p(z)} [ ||p(h), p(h | ẑ)||_T ],    (4)

where the supremum is taken over all possible distributions of observations. This is analogous to the capacity of communication channels in information theory (Cover & Thomas, 1991). Then, the generalization risk of L is bounded by C(L) for any distribution D and any bounded loss function l : H × Z → [0, 1]. In many cases, such as in finite description lengths, countable domains, mean estimation, differential privacy, and sample compression schemes, the capacity C(L) can be tightly bounded (Alabdulmohsin, 2015; 2017). In this paper, we derive new bounds for the capacity of the empirical risk minimization (ERM) procedure.

Finally, we point out that even though uniform generalization guarantees are originally defined only in expectation, as shown in Eq. (3), it has recently been established that these guarantees of uniform generalization in expectation also imply generalization with high probability (Alabdulmohsin, 2017). This is in stark contrast with traditional generalization-in-expectation guarantees, which do not imply concentration.
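To make the finite-|H| illustration above concrete, the following Python sketch (not part of the original paper) estimates J(ĥ; ẑ) by Monte Carlo for a toy ERM problem and compares it against the √(log|H|/(2m)) bound. The setup, a symbol-guessing loss over ten symbols with m = 30 and 200,000 trials, is an arbitrary illustrative choice, and the Monte-Carlo estimate of the total variation distance is only approximate.

import numpy as np

rng = np.random.default_rng(0)

# Toy problem: Z = H = {0, ..., 9}, 0-1 loss f(h, z) = 1{h != z}, so the ERM
# outputs the most frequent symbol in the sample (ties broken by taking the
# smallest index, i.e. a fixed "least" minimizer).
K = 10                         # |Z| = |H|
p_z = np.full(K, 1.0 / K)      # the (unknown) distribution D over Z
m = 30                         # sample size
trials = 200_000               # Monte-Carlo repetitions

S = rng.choice(K, size=(trials, m), p=p_z)                      # one training sample per row
counts = (S[:, :, None] == np.arange(K)).sum(axis=1)            # symbol counts per sample
h_hat = counts.argmax(axis=1)                                   # ERM output per trial
z_hat = S[np.arange(trials), rng.integers(0, m, size=trials)]   # one random training example

joint = np.zeros((K, K))                                        # empirical joint pmf of (h_hat, z_hat)
np.add.at(joint, (h_hat, z_hat), 1.0)
joint /= trials
product = np.outer(joint.sum(axis=1), joint.sum(axis=0))

# J(h_hat; z_hat) = total variation between the joint and the product of marginals
J_est = 0.5 * np.abs(joint - product).sum()
print(f"Monte-Carlo estimate of J(h_hat; z_hat): {J_est:.4f}")
print(f"sqrt(log|H| / (2m))                    : {np.sqrt(np.log(K) / (2 * m)):.4f}")

By Theorem 1, the first printed quantity also upper-bounds the expected generalization gap of this ERM rule for every bounded parametric loss, not only the 0-1 loss used during training.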
2. Contributions

The first contribution of this paper is to establish tight relations between uniform generalization and the empirical risk minimization (ERM) procedure. This will allow us to bridge information theory with statistical learning theory. More specifically, we will prove that under the Axiom of Choice, an ERM learning rule always exists that has a vanishing learning capacity C(L) if and only if the 0-1 loss class has a finite VC dimension. Second, we prove that the empirical risk minimization of strongly-convex stochastic loss also generalizes uniformly in expectation. To establish the latter result, we prove the asymptotic normality of the empirical risk minimizer over the random choice of the training sample, which is a useful result in its own right, such as for uncertainty quantification and hypothesis testing. In the machine learning literature, central limit theorems have been employed to analyze online learning algorithms. For instance, (Polyak & Juditsky, 1992) derived a central limit theorem (CLT) when using an averaging method to accelerate convergence. More recently, (Mandt et al., 2016) derived a central limit theorem for SGD with a fixed learning rate under simplifying assumptions. Then, (Mandt et al., 2016) used their result to select the learning rate such that the trajectory of SGD generates samples from the posterior distribution. Unlike the work in (Mandt et al., 2016), we prove our CLT without relying on some unnecessary simplifying assumptions, such as the Gaussianity of noise.

As a consequence of our work, we present a generalization bound for stochastic convex optimization, which only depends on the number of model parameters and applies to many learning tasks, such as regression, multi-class classification, and ranking. Next, we use our results to analyze the excess risk in stochastic convex optimization and unify previous works. Finally, we demonstrate two important applications. First, we show that the ERM of strongly-convex loss classes can be trivially distributed and scaled to big data using a naïve parallelization algorithm whose performance is provably equivalent to that of the ERM learning rule. Unlike previous works, our parallelization algorithm does not rely on advanced treatments, such as the ADMM procedure (Boyd et al., 2011). Second, we use our results to propose a simple information criterion for model selection in terms of the number of model parameters and demonstrate experimentally that it outperforms the popular Akaike's information criterion (AIC) (Akaike, 1998; Bishop, 2006) and Schwarz's Bayesian information criterion (BIC) (Schwarz, 1978).

3. Notation

Our notation is fairly standard. Some exceptions that may require further clarification are as follows. First, when x is a random variable whose value is drawn uniformly at random from a finite set S, we will write x ∼ S to denote this fact. Second, if x is a predicate, then I{x} = 1 if and only if x is true, otherwise I{x} = 0. Third, A ⪯ B is a linear matrix inequality (LMI), i.e. A ⪯ B is equivalent to the assertion that B − A is positive semidefinite. Moreover, we will use the order-in-probability notation for real-valued random variables. Here, we adopt the notation used in (Janson, 2011; Tao, 2012). In particular, let x = x_n be a real-valued random variable that depends on some parameter n ∈ N. Then, we will write x_n = O_p(f(n)) if for any δ > 0, there exist absolute constants C and n_0 such that for any fixed n ≥ n_0, the inequality |x_n| < C·f(n) holds with probability at least 1 − δ. In other words, the ratio x_n/f(n) is stochastically bounded (Janson, 2011). Similarly, we write x_n = o_p(f(n)) if x_n/f(n) converges to zero in probability. As an example, if x ∼ N(0, I_d) is a standard multivariate Gaussian vector, then ||x||_2 = O_p(√d) even though ||x||_2 can be arbitrarily large. Intuitively, the probability of the event ||x||_2 ≥ d^{1/2+ɛ} for ɛ > 0 goes to zero as d → ∞, so ||x||_2 is effectively of the order O(√d). Finally, let H be a fixed hypothesis space and let Z be a fixed domain of observations. Given a 0-1 loss function f : H × Z → {0, 1}, we will abuse terminology slightly by speaking about the "VC dimension" of H, when we actually mean the VC dimension of the loss class {f(h, ·) : h ∈ H}. Because the 0-1 loss f will be clear from the context, this should not cause any ambiguity.

4. Empirical Risk Minimization of 0-1 Loss Classes

We begin by analyzing the ERM learning rule for 0-1 loss classes. Before we establish that a finite VC dimension is sufficient to guarantee the existence of ERM learning rules that generalize uniformly in expectation, we first describe why ERM by itself is not sufficient, even when the hypothesis space has a finite VC dimension. (Detailed formal proofs are available in the supplementary materials.)

Proposition 1. For any sample size m ≥ 1 and a positive constant ɛ > 0, there exists a hypothesis space H, a domain Z, and a 0-1 loss function f : H × Z → {0, 1} such that: (1) H has a VC dimension d = 1, and (2) a learning algorithm L : Z^m → H exists that outputs an empirical risk minimizer ĥ with J(ĥ; ẑ) ≥ 1 − ɛ, where ẑ ∼ S is a single random training example.

Proposition 1 shows that one cannot obtain a non-trivial bound on the uniform generalization risk of an ERM learning rule in terms of the VC dimension d and the sample size m without imposing some additional restrictions. Next, we prove that an ERM learning rule exists that satisfies the uniform generalization property if the hypothesis space has a finite VC dimension. We begin by recalling a fundamental result in modern set theory. A non-empty set Q is said to be well-ordered if Q is endowed with a total order such that every non-empty subset of Q contains a least element. The following fundamental result is due to Ernst Zermelo.

Theorem 2 (Well-Ordering Theorem). Under the Axiom of Choice, every non-empty set can be well-ordered.

Theorem 2 was proved by Zermelo in 1904 (Kolmogorov & Fomin, 1970).

Theorem 3. Given a hypothesis space H, a domain Z, and a 0-1 loss f : H × Z → {0, 1}, let ⪯ be a well-ordering on H and let L : Z^m → H be the learning rule that outputs the "least" empirical risk minimizer of the training sample S ∈ Z^m according to ⪯. Then, C(L) → 0 as m → ∞ if H has a finite VC dimension. In particular, C(L) = O( √( d·log(2em/d) / m ) ), where C(L) is given by Eq. (4) and d is the VC dimension of H, provided that d ≤ m.

Next, we prove a converse statement. Before we do this, we present a learning problem that shows why a converse to Theorem 3 is not generally possible without making some additional assumptions. Hence, our converse will later be established for the binary classification setting only.

Example 1 (Integer Subset Learning Problem). Let Z = {1, 2, 3, ..., d} be a finite set of positive integers. Let H = 2^Z and define the loss of a hypothesis h ∈ H to be f(h, z) = I{z ∉ h}. Then, the VC dimension is d. However, the learning rule that outputs h = Z is always an ERM learning rule that generalizes uniformly with rate ɛ = 0, regardless of the sample size m and the distribution of observations.
The previous example shows that a converse to Theorem 3 is not possible without imposing some additional constraints. In particular, in the Integer Subset Learning Problem, the VC dimension is not a useful measure of the complexity of the hypothesis space H because many hypotheses dominate others (i.e. perform better across all distributions of observations). For example, the hypothesis h = {1, 2, 3} dominates h' = {1} because there is no distribution on observations in which h' outperforms h. Even worse, the hypothesis h = Z dominates all other hypotheses. Consequently, in order to prove a lower bound for all ERM rules, we consider the standard binary classification setting.

Theorem 4. In any fixed domain Z = X × Y, let the hypothesis space H be a concept class on X and let f(h, x, y) = I{y ≠ h(x)} be the misclassification error. Then, any ERM learning rule L w.r.t. f has a learning capacity C(L) that is bounded from below by a quantity depending on the training sample size m and the VC dimension d of H, which vanishes as m → ∞ only if d is finite.

Combining Theorem 3 with Theorem 4, we arrive at a crisp characterization of the VC dimension of concept classes in terms of information theory.

Theorem 5. Given a fixed domain Z = X × Y, let the hypothesis space H be a concept class on X and let f(h, x, y) = I{y ≠ h(x)} be the misclassification error. Let m be the sample size. Then, the following statements are equivalent under the Axiom of Choice:

1. H admits an ERM learning rule L whose learning capacity C(L) satisfies C(L) → 0 as m → ∞.
2. H has a finite VC dimension.

Proof. The lower bound in Theorem 4 holds for all ERM learning rules. Hence, an ERM learning rule that generalizes uniformly with a vanishing rate across all distributions exists only if H has a finite VC dimension. However, under the Axiom of Choice, H can always be well-ordered by Theorem 2, so, by Theorem 3, a finite VC dimension is also sufficient to guarantee the existence of a learning rule that generalizes uniformly.

Theorem 5 presents a crisp characterization of the VC dimension in terms of information theory. According to the theorem, an ERM learning rule can be constructed that does not encode the training sample if and only if the hypothesis space has a finite VC dimension.

Remark 1. In (Cummings et al., 2016), it has been argued that uniform generalization, called "robust generalization" in that paper, is important in the adaptive learning setting because it implies that no adversary can post-process the hypothesis in a way that causes over-fitting to occur. In other words, no adversary can use the hypothesis to infer new conclusions that do not themselves generalize. It was shown in (Cummings et al., 2016) that this robust generalization guarantee was achievable by sample compression schemes and differential privacy. Theorem 3 shows that an ERM learning rule for 0-1 loss classes with finite VC dimension always exists that satisfies this robust generalization property.

Remark 2. One method of constructing a well-ordering on a hypothesis space H is to use the fact that computers are equipped with finite precision. Hence, in practice, every hypothesis space is enumerable, from which the normal ordering of the integers is a valid well-ordering.
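Remark 2 is easy to act on in code. The sketch below (an illustration, not the paper's construction) runs ERM over an enumerable space of threshold classifiers and breaks ties by returning the first minimizer in the enumeration order, i.e. the "least" empirical risk minimizer required by Theorem 3; the grid resolution, noise rate, and data-generating threshold are arbitrary choices.

import numpy as np

def least_erm_threshold(x, y, grid):
    """ERM over threshold classifiers h_t(x) = 1{x >= t} with t in `grid`,
    returning the *least* empirical-risk minimizer under the enumeration
    order of `grid` (a practical well-ordering, in the spirit of Remark 2)."""
    preds = (x[None, :] >= grid[:, None]).astype(int)   # (|grid|, m) predictions
    emp_risk = (preds != y[None, :]).mean(axis=1)       # empirical 0-1 risk per threshold
    return grid[int(np.argmin(emp_risk))]               # argmin returns the first minimizer

rng = np.random.default_rng(1)
m = 200
x = rng.uniform(0.0, 1.0, size=m)
y = ((x >= 0.3) ^ (rng.uniform(size=m) < 0.1)).astype(int)   # noisy labels around t = 0.3

grid = np.linspace(0.0, 1.0, 1001)   # finite-precision, enumerable hypothesis space
print("least ERM threshold:", least_erm_threshold(x, y, grid))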
realizations, applying the classical central liit theore, the first-order optiality condition on h, and the affine property of the ultivariate Gaussian distribution will yield the desired result. Reark 3. The CLT in Theore 6 is an asyptotic result; it holds for a sufficiently large. When the gradient f i (w) is, additionally, R-Lipschitz, it can be shown using the Matrix-Hoeffding bound (Tropp, 2012) that the noral approxiation is valid as long as L 2 /γ 2 + RL log d. To verify Theore 6 epirically, we have ipleented the l 2 -regularized logistic regression on a ixture of two Gaussians with different eans and covariance atrices, one corresponding to the positive class and one corresponding to the negative class. Both the actual and predicted covariance atrices of the epirical risk iniizer ĥ, whose randoness is derived fro the randoness of the training saple, are depicted in Fig. 1. As shown in the figure, the actual covariance atrix atches with the one predicted by Theore 6.

Theorem 6 shows that the sample complexity of stochastic convex optimization depends on the curvature of the risk E_{f∼D}[f(h)] at its minimizer h*. Next, we show that the dependence of ERM on this curvature is, in fact, optimal.

Proposition 2. There exists a stochastic loss f : R^d → R that satisfies the conditions of Theorem 6 and a distribution D such that if h̃ is an unbiased estimator of h* = argmin_{h∈R^d}{ E_{f∼D}[f(h)] }, then the covariance of the estimator is bounded from below (in the positive semidefinite sense) by:

    E_{S∼D^m}[ (h̃ − h*)(h̃ − h*)^T ] ⪰ (1/m) Σ,

where Σ is the covariance of ERM given by Theorem 6.

5.2. The Excess Risk

Theorem 6 states that when the loss f is strongly convex, the empirical risk minimizer ĥ is normally distributed around the population risk minimizer h* with a vanishing O(1/m) covariance. This justifies learning via the empirical risk minimization rule. We use this central limit theorem, next, to derive the asymptotic behavior of the excess risk F(ĥ) − F(h*), where F(h) is defined in Eq. (2). In particular, we would like to determine the impact of regularization on the asymptotic behavior of the excess risk. We write:

    f(h) = (λ/2) ||h||² + g(h),    (5)

where λ ≥ 0, g is a κ-strongly-convex function for some κ ≥ 0, and λ + κ = γ > 0. Because f satisfies the conditions of Theorem 6, we know that ĥ → h* as m → ∞. Hence, we can take the Taylor expansion of F(h) around h* and write:

    F(ĥ) − F(h*) = (1/2) (ĥ − h*)^T ∇²F(h*) (ĥ − h*) + o_p(1/m),

which holds since ∇F(h*) = 0 and ||ĥ − h*||_2 = O_p(1/√m). Using Theorem 6 and simplifying yields:

    F(ĥ) − F(h*) = (1/(2m)) x^T U^T ∇²F(h*) U x + o_p(1/m),    (6)

where Σ = UU^T, x = √m · U^{-1}(ĥ − h*), and Σ is given by Theorem 6. By substituting Eq. (5), we arrive at the following corollary.

Corollary 1. Let f be decomposed into the sum of a regularization term and an unregularized loss, as given by Eq. (5). Write G(h) to denote the unregularized risk: G(h) = E_{f∼D}[g(h)] = E_{f∼D}[f(h) − (λ/2)||h||²_2]. Then, under the conditions of Theorem 6, G(ĥ) − G(h*) converges in distribution (over the random choice of the training sample) to

    (λ/√m) c^T x + (1/(2m)) x^T D x,

for some c ∈ R^d and some D ⪰ 0 that are both independent of m, where x ∼ N(0, I_d) is a standard multivariate Gaussian random variable.

We can interpret Corollary 1 in sharp Θ(·) terms. In the following remarks, both "expectation" and "probability" are taken over the random choice of the training sample:

- In the absence of regularization, the ERM learning rule of strongly convex loss enjoys a fast learning rate of Θ(1/m) both in expectation and in probability.
- When regularization is used, the unregularized risk converges to its infinite-data limit with a Θ(1/m) rate only in expectation. By contrast, it has a Θ(1/√m) rate of convergence in probability.

Corollary 1 unifies previous results, such as those reported in (Shalev-Shwartz & Ben-David, 2014; Sridharan et al., 2009; Shalev-Shwartz et al., 2009). It also illustrates why fast Θ(1/m) learning rates in expectation, such as those reported for exp-concave minimization (Koren & Levy, 2015), do not necessarily correspond to fast learning rates in practice, because the same excess risk can be Θ(1/√m) in probability even if it is Θ(1/m) in expectation. To validate these claims experimentally, we have implemented a linear regression problem by minimizing the Huber loss of the residual. The mean and the standard deviation of G(ĥ) − G(h*) are plotted in Fig. 2 against the sample size. As shown in Fig. 2 (a, b), we have a fast Θ(1/m) convergence both in expectation and in probability when no regularization is used.
However, when regularization is used (γ = 0.01 in this experiment), the expectation of G(ĥ) − G(h*) is O(1/m) but its standard deviation is O(1/√m), as shown in Fig. 2 (c, d). These results are in agreement with Corollary 1.

Figure 2. This figure plots the quantity E(m) = G(ĥ) − G(h*) against the sample size m for a regression problem, where we minimize the Huber loss of the residual. Panels (a) and (b) display the expectation and standard deviation of E(m), respectively, when γ = 0 (no regularization). By contrast, panels (c) and (d) display the same quantities when γ = 0.01. The blue curves are the best fitted curves assuming a fast Θ(1/m) convergence, while the red curves are the best fitted curves assuming a standard Θ(1/√m) convergence.

Corollary 1 provides the excess risk between the regularized empirical risk minimizer ĥ and its infinite-data limit h*. We can use the same corollary to obtain a bound on the excess risk G(ĥ) − inf_{h∈R^d}{G(h)} using oracle inequalities, in a manner that is similar to (Sridharan et al., 2009; Shalev-Shwartz & Ben-David, 2014). Moreover, the regularization magnitude λ may be optimized to minimize this excess risk. Because this is quite similar to previous works, the reader is referred to (Sridharan et al., 2009; Shalev-Shwartz & Ben-David, 2014) for further details.
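The rate dichotomy of Corollary 1 can be probed with a small simulation in the spirit of the Figure 2 experiment (again an independent Python sketch, not the paper's MATLAB code; the design, noise model, and the values m ∈ {100, ..., 6400}, γ ∈ {0, 0.01} are illustrative assumptions). For each setting it prints m·mean and √m·std of G(ĥ) − G(h*): the corollary predicts that the first column stays roughly flat in both cases, while the second shrinks like 1/√m when γ = 0 but stabilizes when γ > 0.

import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
d = 3
w_true = np.array([1.0, -2.0, 0.5])

def sample(n):
    x = rng.normal(size=(n, d))
    y = x @ w_true + 0.5 * rng.standard_t(df=5, size=n)   # heavy-ish tailed noise
    return x, y

def huber(r):                                   # Huber loss of the residual (delta = 1)
    a = np.abs(r)
    return np.where(a <= 1.0, 0.5 * r * r, a - 0.5)

def fit(x, y, gamma):
    """ERM for the (optionally l2-regularized) Huber regression objective."""
    def obj(h):
        r = y - x @ h
        psi = np.clip(r, -1.0, 1.0)             # derivative of the Huber loss
        return (huber(r).mean() + 0.5 * gamma * h @ h,
                -(x * psi[:, None]).mean(axis=0) + gamma * h)
    return minimize(obj, np.zeros(d), jac=True, method="L-BFGS-B").x

Xb, Yb = sample(200_000)                        # large sample standing in for D
G = lambda h: huber(Yb - Xb @ h).mean()         # unregularized risk G(h), approximated

for gamma in (0.0, 0.01):
    h_star = fit(Xb, Yb, gamma)                 # infinite-data limit of the regularized ERM
    print(f"gamma = {gamma}")
    for m in (100, 400, 1600, 6400):
        e = np.array([G(fit(*sample(m), gamma)) - G(h_star) for _ in range(200)])
        print(f"  m={m:5d}   m*mean = {m * e.mean():7.3f}   sqrt(m)*std = {np.sqrt(m) * e.std():7.4f}")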

5.3. Uniform Generalization Bound

Theorem 6 can be a viable tool in uncertainty quantification and hypothesis testing. In this section, however, we focus on its implication for the uniform generalization risk. We begin with the following "conditional" version of the central limit theorem.

Proposition 3. Let f̂ ∼ D be a fixed instance of the stochastic loss, and let the training sample be S = {f̂} ∪ {f_2, f_3, ..., f_m} with f_i ∼ D drawn i.i.d. and independently of f̂. Let ĥ ∈ R^d be the empirical risk minimizer given by Eq. (1). Then, under the conditions of Theorem 6:

    p(ĥ | f̂) → N( µ̃, (1/(m−1)) Σ ),

where:

    µ̃ = argmin_{h ∈ R^d} { E_{f∼D}[f(h)] + (1/(m−1)) f̂(h) }.

Here, Σ is the covariance matrix given by Theorem 6.

Proposition 3 states, in other words, that a single realization of the stochastic loss shifts the expectation of the empirical risk minimizer ĥ and rescales its covariance. Both Theorem 6 and Proposition 3 imply that the capacity of the ERM learning rule of stochastic, strongly-convex loss classes satisfies C(L) → 0 as m → ∞. We establish a more useful quantitative result in the following theorem.

Theorem 7. Suppose that normality holds, where the empirical risk minimizer ĥ has the probability density p(ĥ) = N(µ, Σ/m) with µ and Σ given by Theorem 6, and for any given single realization of the stochastic loss f̂, we also have p(ĥ | f̂) = N( µ̃, Σ/(m−1) ) as given by Proposition 3. Then:

    I(ĥ; f̂) ≤ d/(m−1) + o(1/m),    (7)

where I(x; y) is the Shannon mutual information between the random variables x and y.

Corollary 2. Under the conditions of Theorem 7, the uniform generalization risk satisfies:

    J(ĥ; f̂) ≤ √( d/(2(m−1)) ) + o(1/√m).    (8)

Proof. This follows immediately from Theorem 7 and Pinsker's inequality (Cover & Thomas, 1991).

6. Applications

6.1. Large-Scale Stochastic Convex Optimization

Theorem 6 states that the empirical risk minimizer of strongly convex stochastic loss is normally distributed around the population risk minimizer h* with covariance (1/m)Σ. This justifies learning via the empirical risk minimization procedure. However, the true goal behind stochastic convex optimization in the machine learning setting is not to compute the empirical risk minimizer ĥ per se but to estimate h*. The empirical risk minimizer provides such an estimate. However, a different estimator can be constructed that is as effective as the empirical risk minimizer ĥ.

Theorem 8. Under the conditions of Theorem 6, let S = {f_1, ..., f_m} be i.i.d. realizations of the stochastic loss f ∼ D and fix a positive integer K ≥ 1. Let ∪_{j=1}^K S_j be a partitioning of S into K subsets of equal size and define ĥ_j to be the empirical risk minimizer for S_j only. Then, h̄ = (1/K) Σ_{j=1}^K ĥ_j is asymptotically normally distributed around the population risk minimizer h* with covariance (1/m)Σ, where Σ is given by Theorem 6.

Proof. By Theorem 6, every ĥ_j is asymptotically normally distributed around h* with covariance (K/m)Σ. Hence, the average of those hypotheses is asymptotically normally distributed around h* with covariance (1/m)Σ.

Theorem 8 shows that in the machine learning setting, one can trivially scale the empirical risk minimization procedure to big data using a naïve parallelization algorithm. Indeed, the estimator h̄ described in the theorem is not a minimizer of the empirical risk, but it is as effective as the empirical risk minimizer in estimating the population risk minimizer h*. Hence, both ĥ and h̄ enjoy the same performance guarantee.
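A minimal numerical sketch of Theorem 8 (illustrative Python, not the MiniBooNE experiment reported below) uses ridge regression, a strongly convex ERM with a closed-form solution: it compares the full ERM on all m examples, the ERM on a single chunk, and the average of K chunk-wise ERMs. The dimension, regularization strength, and sample sizes are assumptions made for the illustration.

import numpy as np

rng = np.random.default_rng(0)
d, lam = 20, 0.01
w_true = rng.normal(size=d)

def sample(n):
    x = rng.normal(size=(n, d))
    y = x @ w_true + rng.normal(size=n)
    return x, y

def erm(x, y):
    """Ridge regression: exact ERM of the strongly convex objective
    (1/n) sum_i (y_i - h'x_i)^2 / 2 + (lam/2) ||h||^2."""
    n = len(y)
    return np.linalg.solve(x.T @ x / n + lam * np.eye(d), x.T @ y / n)

m, K = 50_000, 50
X, Y = sample(m)
Xt, Yt = sample(100_000)                 # held-out set approximating the true risk

h_full = erm(X, Y)                                                   # ERM on all m examples
h_one = erm(X[: m // K], Y[: m // K])                                # ERM on a single chunk of size m/K
h_bar = np.mean([erm(X[j::K], Y[j::K]) for j in range(K)], axis=0)   # averaged estimator of Theorem 8

for name, h in [("full ERM", h_full), ("single chunk", h_one), ("averaged (K=50)", h_bar)]:
    print(f"{name:16s} held-out MSE = {np.mean((Yt - Xt @ h) ** 2):.4f}")

Per Theorem 8, the averaged estimator should track the full ERM closely, while the ERM trained on a single chunk of 1,000 examples is noticeably worse.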
In the literature, methods for scaling machine learning algorithms to big data using distributed algorithms do not always distinguish between optimization as it is used in the traditional setting and the optimization that is used in the machine learning setting, despite several calls that highlighted this subtle distinction (Bousquet & Bottou, 2008; Shalev-Shwartz et al., 2012).

One popular, and often-cited, procedure that does not make such a distinction is the alternating direction method of multipliers (ADMM) (Boyd et al., 2011). The ADMM procedure produces a distributed algorithm with message passing for minimizing the empirical risk by reformulating stochastic convex optimization into a "global consensus problem". However, the empirical risk is merely a proxy for the true risk that one seeks to minimize. Theorem 8, by contrast, presents a much simpler algorithm that achieves the desired goal. Table 1 validates this claim experimentally.

    Algorithm                            Test Error     Time
    ERM (m = 50,000)                     14.8 ± 0.3%    2.93s
    ERM (m = 1,000)                      15.3 ± 0.5%    0.03s
    Parallelized (K = 50, m = 1,000)     14.8 ± 0.1%    0.03s

Table 1. In this experiment, we run l_2-regularized logistic regression on the MiniBooNE Particle Identification problem (Blake & Merz, 1998). The first row corresponds to running the algorithm on 50,000 training examples, the second row is for 1,000 examples, while the last row is for the parallelization algorithm of Theorem 8, which splits 50,000 examples into 50 separate smaller problems.

Remark 4. The theoretical guarantee of Theorem 8 is established for stochastic convex optimization only. When the stochastic loss is non-convex, as is the case in neural networks, the parallelization method is likely to fail. Intuitively, the average of good solutions is not necessarily a good solution itself when the loss is non-convex.

6.2. Information Criterion for Model Selection

Theorem 7 shows that, under the assumption of normality, which is justified by the central limit theorems (Theorem 6 and Proposition 3), the uniform generalization risk of the ERM learning rule of stochastic, strongly-convex loss is asymptotically bounded by √(d/(2m)). Remarkably, this bound does not depend on the strong-convexity parameter γ, nor on any other properties of the stochastic loss f that is optimized during the training stage. In addition, because it is a uniform generalization bound, it also holds for any bounded loss function l : R^d × Z → [0, 1]. Hence, it can serve as a simple information criterion for model selection. As a result, a learning algorithm should aim at minimizing the Uniform Information Criterion (UIC) given by:

    UIC = E_{z∼S}[l(ĥ, z)] + √(d/(2m)),    (9)

where d is the number of learned parameters. The UIC above is analogous to the Akaike information criterion (AIC) (Akaike, 1998; Bishop, 2006), which seeks to minimize E_{z∼S}[f(ĥ, z)] + d/m. It is also similar to Schwarz's
However, the penalty for the nuber of odel paraeters is proportional to d in both AIC and BIC, which, by the discretization trick (Shalev- Shwartz & Ben-David, 2014), is known to be pessiistic. Indeed, the discretization trick suggests that the generalization risk is proportional to O( d/), which agrees with the UIC. Experientally, the UIC succeeds when both AIC and BIC fail, as deonstrated in Fig Conclusion In this paper, we derive bounds on the utual inforation of the ERM rule for both 0-1 and strongly-convex loss classes. We prove that under the Axio of Choice, the existence of an ERM rule with a vanishing utual inforation is equivalent to the assertion that the loss class has a finite VC diension, thus bridging inforation theory with statistical learning theory. Siilarly, an asyptotic bound on the utual inforation is established for strongly-convex loss classes in ters of the nuber of odel paraeters. The latter result uses a central liit theore that we derive in this paper. After that, we prove that the ERM learning rule for strongly-convex loss classes can be trivially scaled to big data. Finally, we propose a siple inforation criterion for odel selection and deonstrate experientally that it outperfors previous works. 2 The MATLAB codes that generate Table 1 and Fig. 3 are provided in the suppleentary aterials.

7. Conclusion

In this paper, we derive bounds on the mutual information of the ERM rule for both 0-1 and strongly-convex loss classes. We prove that under the Axiom of Choice, the existence of an ERM rule with a vanishing mutual information is equivalent to the assertion that the loss class has a finite VC dimension, thus bridging information theory with statistical learning theory. Similarly, an asymptotic bound on the mutual information is established for strongly-convex loss classes in terms of the number of model parameters. The latter result uses a central limit theorem that we derive in this paper. After that, we prove that the ERM learning rule for strongly-convex loss classes can be trivially scaled to big data. Finally, we propose a simple information criterion for model selection and demonstrate experimentally that it outperforms previous works.

References

Abu-Mostafa, Yaser S., Magdon-Ismail, Malik, and Lin, Hsuan-Tien. Learning from Data. AMLBook, New York, NY, USA, 2012.

Akaike, Hirotugu. Information theory and an extension of the maximum likelihood principle. In Selected Papers of Hirotugu Akaike. Springer, 1998.

Alabdulmohsin, Ibrahim M. Algorithmic stability and uniform generalization. In NIPS, 2015.

Alabdulmohsin, Ibrahim M. An information-theoretic route from generalization in expectation to generalization in probability. In AISTATS, 2017.

Bartlett, Peter L., Jordan, Michael I., and McAuliffe, Jon D. Convexity, classification, and risk bounds. Journal of the American Statistical Association, 101(473), 2006.

Bishop, Christopher M. Pattern Recognition and Machine Learning. Springer, 2006.

Blake, C. L. and Merz, C. J. UCI repository of machine learning databases, 1998.

Bousquet, Olivier and Bottou, Léon. The tradeoffs of large scale learning. In Advances in Neural Information Processing Systems, 2008.

Boyd, Stephen, Parikh, Neal, Chu, Eric, Peleato, Borja, and Eckstein, Jonathan. Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends in Machine Learning, 3(1):1-122, 2011.

Cover, Thomas M. and Thomas, Joy A. Elements of Information Theory. Wiley & Sons, 1991.

Csiszár, Imre. A class of measures of informativity of observation channels. Periodica Mathematica Hungarica, 2, 1972.

Csiszár, Imre. Axiomatic characterizations of information measures. Entropy, 10, 2008.

Cummings, Rachel, Ligett, Katrina, Nissim, Kobbi, Roth, Aaron, and Wu, Zhiwei Steven. Adaptive learning with robust generalization guarantees. In COLT, 2016.

Dwork, Cynthia, Feldman, Vitaly, Hardt, Moritz, Pitassi, Toniann, Reingold, Omer, and Roth, Aaron. Preserving statistical validity in adaptive data analysis. In Proceedings of the Forty-Seventh Annual ACM Symposium on Theory of Computing (STOC), 2015.

Elkan, Charles. The foundations of cost-sensitive learning. In IJCAI, 2001.

Feldman, V., Guruswami, V., Raghavendra, P., and Wu, Y. Agnostic learning of monomials by halfspaces is hard. In Annual IEEE Symposium on Foundations of Computer Science (FOCS), 2009.

Feldman, Vitaly. Generalization of ERM in stochastic convex optimization: The dimension strikes back. In NIPS, 2016.

Hager, William W. Updating the inverse of a matrix. SIAM Review, 31(2), 1989.

Janson, Svante. Probability asymptotics: notes on notation. arXiv preprint, 2011.

Kolmogorov, Andreĭ Nikolaevich and Fomin, S. V. Introductory Real Analysis. Dover Publications, 1970.

Koren, Tomer and Levy, Kfir. Fast rates for exp-concave empirical risk minimization. In NIPS, 2015.

Kull, Meelis and Flach, Peter. Novel decompositions of proper scoring rules for classification: score adjustment as precursor to calibration. In ECML-PKDD. Springer, 2015.

Mandt, Stephan, Hoffman, Matthew D., and Blei, David M. A variational analysis of stochastic gradient algorithms. In Proceedings of the 33rd International Conference on Machine Learning (ICML), 2016.

Mann, H. B. and Wald, A. On stochastic limit and order relationships. The Annals of Mathematical Statistics, 14(3), 1943.

Polyak, B. T. and Juditsky, A. B. Acceleration of stochastic approximation by averaging. SIAM Journal on Control and Optimization, 30(4), 1992.

Raginsky, Maxim, Rakhlin, Alexander, Tsao, Matthew, Wu, Yihong, and Xu, Aolin. Information-theoretic analysis of stability and bias of learning algorithms. In 2016 IEEE Information Theory Workshop (ITW), 2016.

Russo, Daniel and Zou, James. Controlling bias in adaptive data analysis using information theory. In Proceedings of the 19th International Conference on Artificial Intelligence and Statistics (AISTATS), 2016.

Schwarz, Gideon. Estimating the dimension of a model. The Annals of Statistics, 6(2), 1978.

Shalev-Shwartz, Shai and Ben-David, Shai. Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press, 2014.

Shalev-Shwartz, Shai, Shamir, Ohad, Srebro, Nathan, and Sridharan, Karthik. Stochastic convex optimization. In COLT, 2009.

Shalev-Shwartz, Shai, Shamir, Ohad, and Tromer, Eran. Using more data to speed-up training time. In AISTATS, 2012.

Sridharan, Karthik, Shalev-Shwartz, Shai, and Srebro, Nathan. Fast rates for regularized objectives. In NIPS, 2009.

Tao, Terence. Topics in Random Matrix Theory. American Mathematical Society, 2012.

Tropp, Joel A. User-friendly tail bounds for sums of random matrices. Foundations of Computational Mathematics, 12(4), 2012.

Vapnik, Vladimir N. An overview of statistical learning theory. IEEE Transactions on Neural Networks, 10(5), 1999.


OPTIMIZATION in multi-agent networks has attracted Distributed constrained optiization and consensus in uncertain networks via proxial iniization Kostas Margellos, Alessandro Falsone, Sione Garatti and Maria Prandini arxiv:603.039v3 [ath.oc] 3 May 07 Abstract

More information

Foundations of Machine Learning Boosting. Mehryar Mohri Courant Institute and Google Research

Foundations of Machine Learning Boosting. Mehryar Mohri Courant Institute and Google Research Foundations of Machine Learning Boosting Mehryar Mohri Courant Institute and Google Research ohri@cis.nyu.edu Weak Learning Definition: concept class C is weakly PAC-learnable if there exists a (weak)

More information

Combining Classifiers

Combining Classifiers Cobining Classifiers Generic ethods of generating and cobining ultiple classifiers Bagging Boosting References: Duda, Hart & Stork, pg 475-480. Hastie, Tibsharini, Friedan, pg 246-256 and Chapter 10. http://www.boosting.org/

More information

Ensemble Based on Data Envelopment Analysis

Ensemble Based on Data Envelopment Analysis Enseble Based on Data Envelopent Analysis So Young Sohn & Hong Choi Departent of Coputer Science & Industrial Systes Engineering, Yonsei University, Seoul, Korea Tel) 82-2-223-404, Fax) 82-2- 364-7807

More information

ASSUME a source over an alphabet size m, from which a sequence of n independent samples are drawn. The classical

ASSUME a source over an alphabet size m, from which a sequence of n independent samples are drawn. The classical IEEE TRANSACTIONS ON INFORMATION THEORY Large Alphabet Source Coding using Independent Coponent Analysis Aichai Painsky, Meber, IEEE, Saharon Rosset and Meir Feder, Fellow, IEEE arxiv:67.7v [cs.it] Jul

More information

Multi-Dimensional Hegselmann-Krause Dynamics

Multi-Dimensional Hegselmann-Krause Dynamics Multi-Diensional Hegselann-Krause Dynaics A. Nedić Industrial and Enterprise Systes Engineering Dept. University of Illinois Urbana, IL 680 angelia@illinois.edu B. Touri Coordinated Science Laboratory

More information

Keywords: Estimator, Bias, Mean-squared error, normality, generalized Pareto distribution

Keywords: Estimator, Bias, Mean-squared error, normality, generalized Pareto distribution Testing approxiate norality of an estiator using the estiated MSE and bias with an application to the shape paraeter of the generalized Pareto distribution J. Martin van Zyl Abstract In this work the norality

More information

On the Communication Complexity of Lipschitzian Optimization for the Coordinated Model of Computation

On the Communication Complexity of Lipschitzian Optimization for the Coordinated Model of Computation journal of coplexity 6, 459473 (2000) doi:0.006jco.2000.0544, available online at http:www.idealibrary.co on On the Counication Coplexity of Lipschitzian Optiization for the Coordinated Model of Coputation

More information

Improved Guarantees for Agnostic Learning of Disjunctions

Improved Guarantees for Agnostic Learning of Disjunctions Iproved Guarantees for Agnostic Learning of Disjunctions Pranjal Awasthi Carnegie Mellon University pawasthi@cs.cu.edu Avri Blu Carnegie Mellon University avri@cs.cu.edu Or Sheffet Carnegie Mellon University

More information

1 Identical Parallel Machines

1 Identical Parallel Machines FB3: Matheatik/Inforatik Dr. Syaantak Das Winter 2017/18 Optiizing under Uncertainty Lecture Notes 3: Scheduling to Miniize Makespan In any standard scheduling proble, we are given a set of jobs J = {j

More information

On the Inapproximability of Vertex Cover on k-partite k-uniform Hypergraphs

On the Inapproximability of Vertex Cover on k-partite k-uniform Hypergraphs On the Inapproxiability of Vertex Cover on k-partite k-unifor Hypergraphs Venkatesan Guruswai and Rishi Saket Coputer Science Departent Carnegie Mellon University Pittsburgh, PA 1513. Abstract. Coputing

More information

Bayes Decision Rule and Naïve Bayes Classifier

Bayes Decision Rule and Naïve Bayes Classifier Bayes Decision Rule and Naïve Bayes Classifier Le Song Machine Learning I CSE 6740, Fall 2013 Gaussian Mixture odel A density odel p(x) ay be ulti-odal: odel it as a ixture of uni-odal distributions (e.g.

More information

arxiv: v1 [cs.ds] 17 Mar 2016

arxiv: v1 [cs.ds] 17 Mar 2016 Tight Bounds for Single-Pass Streaing Coplexity of the Set Cover Proble Sepehr Assadi Sanjeev Khanna Yang Li Abstract arxiv:1603.05715v1 [cs.ds] 17 Mar 2016 We resolve the space coplexity of single-pass

More information

DEPARTMENT OF ECONOMETRICS AND BUSINESS STATISTICS

DEPARTMENT OF ECONOMETRICS AND BUSINESS STATISTICS ISSN 1440-771X AUSTRALIA DEPARTMENT OF ECONOMETRICS AND BUSINESS STATISTICS An Iproved Method for Bandwidth Selection When Estiating ROC Curves Peter G Hall and Rob J Hyndan Working Paper 11/00 An iproved

More information

Supplementary Material for Fast and Provable Algorithms for Spectrally Sparse Signal Reconstruction via Low-Rank Hankel Matrix Completion

Supplementary Material for Fast and Provable Algorithms for Spectrally Sparse Signal Reconstruction via Low-Rank Hankel Matrix Completion Suppleentary Material for Fast and Provable Algoriths for Spectrally Sparse Signal Reconstruction via Low-Ran Hanel Matrix Copletion Jian-Feng Cai Tianing Wang Ke Wei March 1, 017 Abstract We establish

More information

3.8 Three Types of Convergence

3.8 Three Types of Convergence 3.8 Three Types of Convergence 3.8 Three Types of Convergence 93 Suppose that we are given a sequence functions {f k } k N on a set X and another function f on X. What does it ean for f k to converge to

More information

Testing equality of variances for multiple univariate normal populations

Testing equality of variances for multiple univariate normal populations University of Wollongong Research Online Centre for Statistical & Survey Methodology Working Paper Series Faculty of Engineering and Inforation Sciences 0 esting equality of variances for ultiple univariate

More information

CS Lecture 13. More Maximum Likelihood

CS Lecture 13. More Maximum Likelihood CS 6347 Lecture 13 More Maxiu Likelihood Recap Last tie: Introduction to axiu likelihood estiation MLE for Bayesian networks Optial CPTs correspond to epirical counts Today: MLE for CRFs 2 Maxiu Likelihood

More information

Efficient Learning of Generalized Linear and Single Index Models with Isotonic Regression

Efficient Learning of Generalized Linear and Single Index Models with Isotonic Regression Efficient Learning of Generalized Linear and Single Index Models with Isotonic Regression Sha M Kakade Microsoft Research and Wharton, U Penn skakade@icrosoftco Varun Kanade SEAS, Harvard University vkanade@fasharvardedu

More information

Asynchronous Gossip Algorithms for Stochastic Optimization

Asynchronous Gossip Algorithms for Stochastic Optimization Asynchronous Gossip Algoriths for Stochastic Optiization S. Sundhar Ra ECE Dept. University of Illinois Urbana, IL 680 ssrini@illinois.edu A. Nedić IESE Dept. University of Illinois Urbana, IL 680 angelia@illinois.edu

More information

Intelligent Systems: Reasoning and Recognition. Artificial Neural Networks

Intelligent Systems: Reasoning and Recognition. Artificial Neural Networks Intelligent Systes: Reasoning and Recognition Jaes L. Crowley MOSIG M1 Winter Seester 2018 Lesson 7 1 March 2018 Outline Artificial Neural Networks Notation...2 Introduction...3 Key Equations... 3 Artificial

More information

Supplement to: Subsampling Methods for Persistent Homology

Supplement to: Subsampling Methods for Persistent Homology Suppleent to: Subsapling Methods for Persistent Hoology A. Technical results In this section, we present soe technical results that will be used to prove the ain theores. First, we expand the notation

More information

Constrained Consensus and Optimization in Multi-Agent Networks arxiv: v2 [math.oc] 17 Dec 2008

Constrained Consensus and Optimization in Multi-Agent Networks arxiv: v2 [math.oc] 17 Dec 2008 LIDS Report 2779 1 Constrained Consensus and Optiization in Multi-Agent Networks arxiv:0802.3922v2 [ath.oc] 17 Dec 2008 Angelia Nedić, Asuan Ozdaglar, and Pablo A. Parrilo February 15, 2013 Abstract We

More information

SPECTRUM sensing is a core concept of cognitive radio

SPECTRUM sensing is a core concept of cognitive radio World Acadey of Science, Engineering and Technology International Journal of Electronics and Counication Engineering Vol:6, o:2, 202 Efficient Detection Using Sequential Probability Ratio Test in Mobile

More information

On Constant Power Water-filling

On Constant Power Water-filling On Constant Power Water-filling Wei Yu and John M. Cioffi Electrical Engineering Departent Stanford University, Stanford, CA94305, U.S.A. eails: {weiyu,cioffi}@stanford.edu Abstract This paper derives

More information

Tight Bounds for the Expected Risk of Linear Classifiers and PAC-Bayes Finite-Sample Guarantees

Tight Bounds for the Expected Risk of Linear Classifiers and PAC-Bayes Finite-Sample Guarantees Tight Bounds for the Expected Risk of Linear Classifiers and PAC-Bayes Finite-Saple Guarantees Jean Honorio CSAIL, MIT Cabridge, MA 0239, USA jhonorio@csail.it.edu Toi Jaakkola CSAIL, MIT Cabridge, MA

More information

Extension of CSRSM for the Parametric Study of the Face Stability of Pressurized Tunnels

Extension of CSRSM for the Parametric Study of the Face Stability of Pressurized Tunnels Extension of CSRSM for the Paraetric Study of the Face Stability of Pressurized Tunnels Guilhe Mollon 1, Daniel Dias 2, and Abdul-Haid Soubra 3, M.ASCE 1 LGCIE, INSA Lyon, Université de Lyon, Doaine scientifique

More information

Quantum algorithms (CO 781, Winter 2008) Prof. Andrew Childs, University of Waterloo LECTURE 15: Unstructured search and spatial search

Quantum algorithms (CO 781, Winter 2008) Prof. Andrew Childs, University of Waterloo LECTURE 15: Unstructured search and spatial search Quantu algoriths (CO 781, Winter 2008) Prof Andrew Childs, University of Waterloo LECTURE 15: Unstructured search and spatial search ow we begin to discuss applications of quantu walks to search algoriths

More information

Support Vector Machines. Maximizing the Margin

Support Vector Machines. Maximizing the Margin Support Vector Machines Support vector achines (SVMs) learn a hypothesis: h(x) = b + Σ i= y i α i k(x, x i ) (x, y ),..., (x, y ) are the training exs., y i {, } b is the bias weight. α,..., α are the

More information

Fairness via priority scheduling

Fairness via priority scheduling Fairness via priority scheduling Veeraruna Kavitha, N Heachandra and Debayan Das IEOR, IIT Bobay, Mubai, 400076, India vavitha,nh,debayan}@iitbacin Abstract In the context of ulti-agent resource allocation

More information

A Note on the Applied Use of MDL Approximations

A Note on the Applied Use of MDL Approximations A Note on the Applied Use of MDL Approxiations Daniel J. Navarro Departent of Psychology Ohio State University Abstract An applied proble is discussed in which two nested psychological odels of retention

More information

The Weierstrass Approximation Theorem

The Weierstrass Approximation Theorem 36 The Weierstrass Approxiation Theore Recall that the fundaental idea underlying the construction of the real nubers is approxiation by the sipler rational nubers. Firstly, nubers are often deterined

More information

Lecture 20 November 7, 2013

Lecture 20 November 7, 2013 CS 229r: Algoriths for Big Data Fall 2013 Prof. Jelani Nelson Lecture 20 Noveber 7, 2013 Scribe: Yun Willia Yu 1 Introduction Today we re going to go through the analysis of atrix copletion. First though,

More information

Efficient Learning with Partially Observed Attributes

Efficient Learning with Partially Observed Attributes Nicolò Cesa-Bianchi DSI, Università degli Studi di Milano, Italy Shai Shalev-Shwartz The Hebrew University, Jerusale, Israel Ohad Shair The Hebrew University, Jerusale, Israel Abstract We describe and

More information

Research in Area of Longevity of Sylphon Scraies

Research in Area of Longevity of Sylphon Scraies IOP Conference Series: Earth and Environental Science PAPER OPEN ACCESS Research in Area of Longevity of Sylphon Scraies To cite this article: Natalia Y Golovina and Svetlana Y Krivosheeva 2018 IOP Conf.

More information

A Low-Complexity Congestion Control and Scheduling Algorithm for Multihop Wireless Networks with Order-Optimal Per-Flow Delay

A Low-Complexity Congestion Control and Scheduling Algorithm for Multihop Wireless Networks with Order-Optimal Per-Flow Delay A Low-Coplexity Congestion Control and Scheduling Algorith for Multihop Wireless Networks with Order-Optial Per-Flow Delay Po-Kai Huang, Xiaojun Lin, and Chih-Chun Wang School of Electrical and Coputer

More information

Fixed-to-Variable Length Distribution Matching

Fixed-to-Variable Length Distribution Matching Fixed-to-Variable Length Distribution Matching Rana Ali Ajad and Georg Böcherer Institute for Counications Engineering Technische Universität München, Gerany Eail: raa2463@gail.co,georg.boecherer@tu.de

More information

On the Use of A Priori Information for Sparse Signal Approximations

On the Use of A Priori Information for Sparse Signal Approximations ITS TECHNICAL REPORT NO. 3/4 On the Use of A Priori Inforation for Sparse Signal Approxiations Oscar Divorra Escoda, Lorenzo Granai and Pierre Vandergheynst Signal Processing Institute ITS) Ecole Polytechnique

More information

RANDOM GRADIENT EXTRAPOLATION FOR DISTRIBUTED AND STOCHASTIC OPTIMIZATION

RANDOM GRADIENT EXTRAPOLATION FOR DISTRIBUTED AND STOCHASTIC OPTIMIZATION RANDOM GRADIENT EXTRAPOLATION FOR DISTRIBUTED AND STOCHASTIC OPTIMIZATION GUANGHUI LAN AND YI ZHOU Abstract. In this paper, we consider a class of finite-su convex optiization probles defined over a distributed

More information

Ch 12: Variations on Backpropagation

Ch 12: Variations on Backpropagation Ch 2: Variations on Backpropagation The basic backpropagation algorith is too slow for ost practical applications. It ay take days or weeks of coputer tie. We deonstrate why the backpropagation algorith

More information

Lecture October 23. Scribes: Ruixin Qiang and Alana Shine

Lecture October 23. Scribes: Ruixin Qiang and Alana Shine CSCI699: Topics in Learning and Gae Theory Lecture October 23 Lecturer: Ilias Scribes: Ruixin Qiang and Alana Shine Today s topic is auction with saples. 1 Introduction to auctions Definition 1. In a single

More information

Experimental Design For Model Discrimination And Precise Parameter Estimation In WDS Analysis

Experimental Design For Model Discrimination And Precise Parameter Estimation In WDS Analysis City University of New York (CUNY) CUNY Acadeic Works International Conference on Hydroinforatics 8-1-2014 Experiental Design For Model Discriination And Precise Paraeter Estiation In WDS Analysis Giovanna

More information

Exact tensor completion with sum-of-squares

Exact tensor completion with sum-of-squares Proceedings of Machine Learning Research vol 65:1 54, 2017 30th Annual Conference on Learning Theory Exact tensor copletion with su-of-squares Aaron Potechin Institute for Advanced Study, Princeton David

More information

Shannon Sampling II. Connections to Learning Theory

Shannon Sampling II. Connections to Learning Theory Shannon Sapling II Connections to Learning heory Steve Sale oyota echnological Institute at Chicago 147 East 60th Street, Chicago, IL 60637, USA E-ail: sale@athberkeleyedu Ding-Xuan Zhou Departent of Matheatics,

More information

AN OPTIMAL SHRINKAGE FACTOR IN PREDICTION OF ORDERED RANDOM EFFECTS

AN OPTIMAL SHRINKAGE FACTOR IN PREDICTION OF ORDERED RANDOM EFFECTS Statistica Sinica 6 016, 1709-178 doi:http://dx.doi.org/10.5705/ss.0014.0034 AN OPTIMAL SHRINKAGE FACTOR IN PREDICTION OF ORDERED RANDOM EFFECTS Nilabja Guha 1, Anindya Roy, Yaakov Malinovsky and Gauri

More information