
MATHEMATICS OF OPERATIONS RESEARCH
Vol. 00, No. 0, Xxxxxx 20xx, pp. xxx–xxx
ISSN 0364-765X | EISSN 1526-5471
informs
DOI 10.1287/moor.xxxx.xxxx
© 20xx INFORMS

A Distributional Interpretation of Robust Optimization

Huan Xu
Department of Electrical and Computer Engineering, The University of Texas at Austin, USA, huan.xu@mail.utexas.edu

Constantine Caramanis
Department of Electrical and Computer Engineering, The University of Texas at Austin, USA, caramanis@mail.utexas.edu

Shie Mannor
Department of Electrical Engineering, Technion, Israel, shie.mannor@ee.technion.ac.il

Motivated by data-driven decision making and sampling problems, we investigate probabilistic interpretations of Robust Optimization (RO). We establish a connection between RO and Distributionally Robust Stochastic Programming (DRSP), showing that the solution to any RO problem is also a solution to a DRSP problem. Specifically, we consider the case where multiple uncertain parameters belong to the same fixed-dimensional space, and find the set of distributions of the equivalent DRSP. The equivalence we derive enables us to construct RO formulations for sampled problems (as in stochastic programming and machine learning) that are statistically consistent, even when the original sampled problem is not. In the process, this provides a systematic approach for tuning the uncertainty set. The equivalence further provides a probabilistic explanation for the common shrinkage heuristic, where the uncertainty set used in an RO problem is a shrunken version of the original uncertainty set.

Key words: robust optimization, distributionally robust stochastic program, consistency, machine learning, kernel density estimator
MSC2000 Subject Classification: 90C99
OR/MS subject classification: Programming: Stochastic; Statistics: Nonparametric
History: Received: Xxxx xx, xxxx; Revised: Yyyyyy yy, yyyy and Zzzzzz zz, zzzz.

1. Introduction  Robust Optimization (RO) considers deterministic (set-based) uncertainty models in optimization, where a (potentially malicious) adversary has a bounded capability to change the parameters of the function the decision-maker seeks to optimize. Thus, the standard optimization problem¹

    \max_v : f(v),    (1)

becomes

    \max_v : \min_{x \in Z} : f(v, x),    (2)

where the vector x denotes some uncertain parameters of the objective function f, and may take any value in the set Z. This approach to uncertainty has a long history in control; in optimization it traces back several decades to the early work of Soyster [32]. Particularly in the last decade, since the work of Ben-Tal and Nemirovski [3, 4], Bertsimas and Sim [7], and El Ghaoui and Lebret [16], it has become a common approach in operations research, computer science, engineering, and many other related fields (e.g., Shivaswamy et al. [31], Lanckriet et al. [20], El Ghaoui and Lebret [16], Ben-Tal et al. [5, 6], and Boyd et al. [10]); see the recent monograph by Ben-Tal and co-authors [2] for a detailed survey. A key reason for its success has been its computational tractability and the fact that robustified versions of many common optimization classes (linear programming, second order cone programming, among others) remain relatively easy to solve.

¹ We give the unconstrained version here without loss of generality, since one can let the objective function be −∞ for infeasible solutions.

A much-researched alternative to RO's set-based uncertainty is to represent the uncertain parameter in a probabilistic way, i.e., to assume that the parameter x is a random variable with distribution µ. If we assume the generating distribution µ is known, the result is the standard stochastic programming paradigm (e.g., Birge and Louveaux [8], Prékopa [23], and Shapiro [28]). If µ is not precisely known, and instead is only known to lie in some set of distributions D, the resulting optimization formulation is the so-called Distributionally Robust Stochastic Program (DRSP), initially proposed in Scarf [25], almost two decades before the first appearance of RO. In DRSP, the decision maker solves the following problem:

    \max_v : \min_{\mu \in D} : E_{x \sim \mu} f(v, x).    (3)

Since its introduction, DRSP has attracted extensive research (e.g., Kall [17], Dupačová [15], Popescu [22], Shapiro [29], Goh and Sim [19], Delage and Ye [12]).

The main focus of this paper is the relationship of these two paradigms. In particular, we show in Section 2 that RO can be reformulated as a DRSP with respect to a particular class of distributions. For the special case where each uncertain parameter belongs to a different space, such a re-interpretation is a well-known folk theorem (see Delage and Ye [12]). Yet, as we discuss below, in data-driven (or sample-based) optimization problems such as those appearing in stochastic optimization and machine learning, the uncertain parameters belong to the same space. To develop a framework for these problems, we generalize the equivalence of RO and DRSP to the setting where multiple uncertain parameters x_1, ..., x_n belong to the same space R^m. Instead of formulating the RO problem for such a problem as a DRSP with respect to a class of distributions supported on the product space R^{m×n}, as techniques from the standard literature would require, we seek to find a DRSP interpretation with respect to a class of distributions supported on R^m. We now elaborate on this, explaining in particular the significance of this generalized equivalence.

Relationship to Optimization from Samples  The setup we consider is motivated by sampling problems. In solving min_v : E_{x∼µ} f(v, x), a sampled distribution (1/n) Σ_{i=1}^n δ_{x_i} is often used instead of the true (potentially continuous) distribution µ. This is often done in machine learning because the true distribution is unknown, and the decision-maker only has access to a finite set of samples generated from that distribution. In stochastic programming this is widely applied, either when the distribution is unknown, or when it is known but too complicated to manipulate directly within an optimization problem, and hence the empirical distribution is used instead. It is well known that such a sampling approach may not always be consistent; that is, even as the number of samples goes to infinity, the solution recovered may remain a bounded distance away from the true optimal solution. A DRSP interpretation of RO with respect to a class of distributions supported on the same fixed-dimensional space would enable us to examine how well or poorly elements of this class of distributions approximate the true (potentially unknown) distribution as the sample size, n, increases. In Section 3, we explore how such an approach can be used to prove that a robust optimization formulation is asymptotically statistically consistent.
Moreover, we show how, given a sampled optimization problem, one can design a robustified formulation that is guaranteed to be consistent, even if the original sampled problem fails to be consistent. As an important byproduct, we obtain a systematic way to tune the uncertainty set to guarantee consistency.

In Section 4, we investigate other implications of the equivalence between RO and DRSP. We start by discussing optimization in a stochastic programming setup, and particularly in machine learning, in Section 4.1. In Section 4.2, we provide an explanation for the shrinkage heuristic in RO, where instead of using the original uncertainty set one uses a shrunken version to obtain better performance in practice. In Section 5 we further illustrate some possible extensions of the equivalence relationship.

Notation: Throughout the paper, without loss of generality, we assume the unknown parameters belong to R^m. We use P to denote the set of probability distributions on R^m (with respect to the Borel σ-algebra). We use [1:n] to denote the set {1, 2, ..., n}. A Euclidean ball centered at x with radius r is denoted by B(x, r).
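To make the sampled-optimization setup above concrete, the following is a minimal sketch of replacing the true distribution by the empirical distribution (1/n) Σ_i δ_{x_i}. The loss function and the sampling distribution are illustrative assumptions, not taken from the paper.

```python
# Sampled (empirical) version of min_v E_{x ~ mu} f(v, x); toy 1-D instance.
import numpy as np

rng = np.random.default_rng(0)
f = lambda v, x: (v - x) ** 2                          # illustrative loss f(v, x)
x_samples = rng.normal(loc=1.0, scale=2.0, size=200)   # i.i.d. draws of x ~ mu

v_grid = np.linspace(-3.0, 5.0, 801)
empirical_obj = np.array([f(v, x_samples).mean() for v in v_grid])
v_hat = v_grid[empirical_obj.argmin()]                  # minimizer of the sampled problem
print(v_hat)   # close to E[x] = 1 for this loss; in general the sampled
               # problem need not be consistent (see Section 3)
```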

2. Distributional interpretation for robust optimization  In this section, we turn our attention to the relationship between RO and DRSP. We consider the general case where multiple uncertain parameters lie in the same space, as opposed to the Cartesian product of the spaces of the individual parameters. As discussed above, this setting arises naturally in data-driven problems. We prove a statement that is slightly stronger than the equivalence of RO and DRSP: fixing a candidate solution, the worst-case reward of RO equals the minimal expected reward of the DRSP. That is, the inner minimization of RO is equivalent to the inner minimization of DRSP. Hence, in Theorem 2.1 and its proof, we suppress the decision variable v in order to reduce unnecessary notation.

Theorem 2.1  Given a measurable function f : R^m → R ∪ {−∞}, scalars c_1, ..., c_n > 0 such that Σ_{i=1}^n c_i = 1, and non-empty Borel sets Z_1, ..., Z_n ⊆ R^m, let

    P_n \triangleq \{\mu \in P \mid \forall S \subseteq [1:n] :\ \mu(\cup_{i \in S} Z_i) \ge \textstyle\sum_{i \in S} c_i\}.

Then the following holds:

    \sum_{i=1}^n c_i \inf_{x_i \in Z_i} f(x_i) \;=\; \inf_{\mu \in P_n} \int_{\mathbb{R}^m} f(x)\, d\mu(x).    (4)

Note that the uncertainty sets Z_1, ..., Z_n can have nonempty intersection, or even be identical, as is the case when the points are sampled from the same space.

Proof.  We can assume without loss of generality that inf_{x_i ∈ Z_i} f(x_i) > −∞ for i = 1, 2, ..., n, since otherwise both sides equal −∞ and the theorem holds trivially. Let x̂_i be an ε-optimal solution to inf_{x_i ∈ Z_i} f(x_i).

We first show that the left-hand side of Equation (4) is larger than or equal to the right-hand side. Consider the following distribution:

    \hat\mu(\{\hat x_i\}) = c_i + \sum_{j \ne i :\, \hat x_j = \hat x_i} c_j.

Observe that µ̂ belongs to P_n, and

    \sum_{i=1}^n c_i f(\hat x_i) = \int_{\mathbb{R}^m} f(x)\, d\hat\mu(x).

Thus we have

    \sum_{i=1}^n c_i \inf_{x_i \in Z_i} f(x_i) + \epsilon \;\ge\; \inf_{\mu \in P_n} \int_{\mathbb{R}^m} f(x)\, d\mu(x).

Since ε can be arbitrarily close to zero, we have

    \sum_{i=1}^n c_i \inf_{x_i \in Z_i} f(x_i) \;\ge\; \inf_{\mu \in P_n} \int_{\mathbb{R}^m} f(x)\, d\mu(x).    (5)

We next prove the reverse inequality to complete the proof. By re-indexing if necessary, assume

    f(\hat x_1) \le f(\hat x_2) \le \cdots \le f(\hat x_n).    (6)

Now construct the following function:

    \hat f(x) \triangleq \begin{cases} \max_{i :\, x \in Z_i} f(\hat x_i) & \text{if } x \in \cup_{j=1}^n Z_j; \\ f(x) & \text{otherwise.} \end{cases}

Observe that f(x) ≥ f̂(x) − ε for all x. Furthermore, fixing a µ ∈ P_n (note that µ(∪_{j=1}^n Z_j) = 1 by the definition of P_n), we have

    \int_{\mathbb{R}^m} f(x)\, d\mu(x) + \epsilon \;\ge\; \int_{\mathbb{R}^m} \hat f(x)\, d\mu(x) \;=\; \sum_{k=1}^n f(\hat x_k)\,\Big[\mu\big(\cup_{i=k}^n Z_i\big) - \mu\big(\cup_{i=k+1}^n Z_i\big)\Big],

with the convention that ∪_{i=n+1}^n Z_i = ∅.

Here the inequality holds because f(x) ≥ f̂(x) − ε, and the equality holds because f(x̂_1) ≤ f(x̂_2) ≤ ⋯ ≤ f(x̂_n). Define

    \alpha_k \triangleq \mu\big(\cup_{i=k}^n Z_i\big) - \mu\big(\cup_{i=k+1}^n Z_i\big).

Then, by telescoping and the fact that µ ∈ P_n, we have

    \sum_{k=t}^n \alpha_k = \mu\big(\cup_{k=t}^n Z_k\big) \ge \sum_{k=t}^n c_k; \qquad \sum_{k=1}^n \alpha_k = \mu\big(\cup_{k=1}^n Z_k\big) = 1.

Hence, since f(x̂_1) ≤ f(x̂_2) ≤ ⋯ ≤ f(x̂_n), we have

    \sum_{k=1}^n \alpha_k f(\hat x_k) \;\ge\; \sum_{k=1}^n c_k f(\hat x_k).

Thus, for any µ ∈ P_n, we have

    \int_{\mathbb{R}^m} f(x)\, d\mu(x) + \epsilon \;\ge\; \sum_{k=1}^n c_k f(\hat x_k) \;\ge\; \sum_{k=1}^n c_k \inf_{x_k \in Z_k} f(x_k).

Since ε and µ are arbitrary, this leads to

    \inf_{\mu \in P_n} \int_{\mathbb{R}^m} f(x)\, d\mu(x) \;\ge\; \sum_{k=1}^n c_k \inf_{x_k \in Z_k} f(x_k).

Combining this with Equation (5), we establish the theorem. □

We observe that if the Z_i are disjoint, then the equivalent distributional set P_n has the following form:

    P_n = \{\mu \in P \mid \mu(Z_i) = c_i,\ i = 1, \ldots, n\}.

Taking n = 1, this reduces to the following corollary, well known in the literature (e.g., Delage and Ye [12]), stating that the solution to a robust optimization problem is the solution to a special DRSP problem, where the distributional set is the one that contains all distributions whose support is contained in the uncertainty set (in the Cartesian product of the spaces of the individual parameters).

Corollary 2.1  Given a measurable function f : R^m → R ∪ {−∞} and a non-empty Borel set Z ⊆ R^m, the following holds:

    \inf_{x_1 \in Z} f(x_1) \;=\; \inf_{\mu \in P :\, \mu(Z) = 1} \int_{\mathbb{R}^m} f(x)\, d\mu(x).    (7)
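The identity (4) can be checked numerically on a small instance. The sketch below works on a one-dimensional grid: the left-hand side is computed directly, and the right-hand side is approximated by a linear program over distributions supported on the grid, with one constraint per subset S. The function f, the intervals Z_i, the weights c_i, and the grid are illustrative assumptions.

```python
# Numerical check of Theorem 2.1 (equation (4)) on a 1-D toy problem (m = 1).
import itertools
import numpy as np
from scipy.optimize import linprog

f = lambda x: (x - 0.3) ** 2                          # objective f(x)
Z = [(-1.0, 0.0), (-0.5, 0.5), (0.2, 1.0)]            # overlapping intervals Z_1, Z_2, Z_3
c = np.array([0.2, 0.3, 0.5])                         # weights c_i > 0 summing to 1
n = len(Z)

grid = np.linspace(-1.0, 1.0, 2001)
member = np.array([[lo <= g <= hi for g in grid] for (lo, hi) in Z])

# Left-hand side of (4): sum_i c_i * inf_{x in Z_i} f(x), evaluated on the grid.
lhs = sum(c[i] * f(grid[member[i]]).min() for i in range(n))

# Right-hand side of (4): minimize E_mu[f] over grid-supported mu subject to
# mu(union_{i in S} Z_i) >= sum_{i in S} c_i for every nonempty S.
A_ub, b_ub = [], []
for r in range(1, n + 1):
    for S in itertools.combinations(range(n), r):
        in_union = np.any(member[list(S)], axis=0)
        A_ub.append(-in_union.astype(float))          # -mu(union) <= -sum_{i in S} c_i
        b_ub.append(-c[list(S)].sum())
res = linprog(f(grid), A_ub=np.array(A_ub), b_ub=b_ub,
              A_eq=np.ones((1, grid.size)), b_eq=[1.0])
print(lhs, res.fun)   # the two values agree (up to solver tolerance)
```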

3. Consistency of robust optimization  The theoretical equivalence established in Theorem 2.1 has algorithmic consequences as well. In this section, we consider sampled stochastic optimization problems. As discussed above, it is well known that these problems may fail to be asymptotically consistent, i.e., even as the number of samples goes to infinity, we may not recover the optimal solution. We use the equivalence relationship stated in Theorem 2.1 to construct a sequence of robust optimization problems that are asymptotically consistent in a statistical sense, even if the original sampled problem fails consistency. En route, this construction provides a systematic approach to choosing the appropriate size of the uncertainty set.

The main theorem of the section states that as long as the utility function f(·,·) is bounded and satisfies a mild continuity condition, an explicitly stated robust optimization formulation of the sampled stochastic optimization is asymptotically consistent. That is, it recovers the optimal solution to max_v : E_{x∼µ}[f(v, x)], where µ is given only through a sequence of i.i.d. samples. We note that these conditions are weaker than those required for consistency of sampled stochastic programs, e.g., as in King and Wets [18]. Thus, Theorem 3.1 provides a computationally tractable avenue for developing algorithms for the solution of stochastic programs with stronger consistency guarantees (that is, they require weaker assumptions on the problem). We provide several examples of this in the next section.

Theorem 3.1  Suppose that x_1, ..., x_n, ... are i.i.d. samples of a distribution with density h* on R^m, and the utility function f(·,·) satisfies
(i) Boundedness: max_{v,x} |f(v, x)| ≤ C.
(ii) Equicontinuity: d(ε) → 0 as ε → 0, where d(ε) ≜ max_{v, x, ‖δ‖ ≤ ε} |f(v, x) − f(v, x + δ)|.
If {ε(n)} satisfies ε(n) → 0 and nε(n)^m → ∞, then the sequence of optimal solutions to the RO formulation

    v(n) \in \arg\max_v\, \inf_{\|\delta_i\|_\infty \le \epsilon(n)} \frac{1}{n} \sum_{i=1}^n f(v, x_i + \delta_i)

satisfies

    \lim_{n \to \infty} \int_{\mathbb{R}^m} f(v(n), x)\, h^*(x)\, dx \;=\; \sup_v \int_{\mathbb{R}^m} f(v, x)\, h^*(x)\, dx.

That is, the RO formulation is consistent.

Proof.  The key to the proof rests on the equivalence established in Theorem 2.1. This equivalence allows us to show that the RO formulation given in the theorem statement is equivalent to a DRSP whose distribution set contains a Kernel Density Estimator (KDE). From here, the proof is essentially immediate: exploiting the fact that a KDE converges to the generating distribution in the ℓ_1 sense, one can easily conclude that the sequence of solutions to the given RO problems is consistent. For completeness, we include a brief introduction to kernel density estimators in the Appendix.

Consider a fixed n. Define the sets Z_i ≜ {x_i + δ : ‖δ‖_∞ ≤ ε(n)}. Thus, the RO formulation is to maximize (1/n) Σ_{i=1}^n inf_{x'_i ∈ Z_i} f(v, x'_i). By Theorem 2.1, we know that for any v the following holds:

    \frac{1}{n} \sum_{i=1}^n \inf_{x'_i \in Z_i} f(v, x'_i) \;=\; \inf_{\mu \in P_n} \int_{\mathbb{R}^m} f(v, x)\, d\mu(x);    (8)

where

    P_n = \{\mu \in P \mid \forall S \subseteq [1:n] :\ \mu(\cup_{i \in S} Z_i) \ge |S|/n\}.

Next we show that the set of distributions P_n contains a kernel density estimator. Consider the distribution h_n defined as

    h_n(x) = \frac{1}{n\,\epsilon(n)^m} \sum_{i=1}^n K\!\left(\frac{x - x_i}{\epsilon(n)}\right), \qquad \text{where } K(z) = (1/2)^m\, \mathbf{1}\{\|z\|_\infty \le 1\}.

Indeed, observe that h_n is a kernel density estimator. Now, for any S ⊆ [1:n], we have

    \int_{\mathbb{R}^m} \mathbf{1}\{x \in \cup_{j \in S} Z_j\}\, h_n(x)\, dx
      = \int \mathbf{1}\{x \in \cup_{j \in S} Z_j\}\, \frac{1}{n\,\epsilon(n)^m} \sum_{i=1}^n K\!\left(\frac{x - x_i}{\epsilon(n)}\right) dx
      \ge \int \mathbf{1}\{x \in \cup_{j \in S} Z_j\}\, \frac{1}{n\,\epsilon(n)^m} \sum_{i \in S} K\!\left(\frac{x - x_i}{\epsilon(n)}\right) dx
      = \sum_{i \in S} \int \mathbf{1}\{x \in \cup_{j \in S} Z_j\}\, \frac{1}{n\,\epsilon(n)^m} K\!\left(\frac{x - x_i}{\epsilon(n)}\right) dx
      \overset{(a)}{=} \sum_{i \in S} \int \mathbf{1}\{x \in Z_i\}\, \frac{1}{n\,\epsilon(n)^m} K\!\left(\frac{x - x_i}{\epsilon(n)}\right) dx
      = \sum_{i \in S} \int \frac{1}{n\,\epsilon(n)^m} K\!\left(\frac{x - x_i}{\epsilon(n)}\right) dx \;=\; |S|/n.

Here, the second-to-last equality, (a), holds because, due to the definition of K and Z_i, K((x − x_i)/ε(n)) is non-zero only when x ∈ Z_i. Hence, h_n ∈ P_n (more precisely, h_n is the density function of a probability measure that belongs to P_n), which by Equation (8) implies

    \frac{1}{n} \sum_{i=1}^n \inf_{x'_i \in Z_i} f(v, x'_i) \;\le\; \int_{\mathbb{R}^m} f(v, x)\, h_n(x)\, dx.

Since h_n is a kernel density estimator, it is well known (e.g., see Devroye and Györfi [13]) that under the condition that ε(n) → 0 and nε(n)^m → ∞ the following holds:

    \int_{\mathbb{R}^m} |h_n(x) - h^*(x)|\, dx \;\to\; 0 \quad \text{as } n \to \infty.

Therefore, since |f(v, x)| ≤ C, there exists a sequence {M_n} ↓ 0 such that the following holds for all v:

    \int_{\mathbb{R}^m} f(v, x)\, h_n(x)\, dx \;\le\; \int_{\mathbb{R}^m} f(v, x)\, h^*(x)\, dx + M_n,

which leads to, for all v,

    \frac{1}{n} \sum_{i=1}^n \inf_{x'_i \in Z_i} f(v, x'_i) - M_n \;\le\; \int_{\mathbb{R}^m} f(v, x)\, h^*(x)\, dx.

By symmetry, we also have

    \int_{\mathbb{R}^m} f(v, x)\, h^*(x)\, dx \;\le\; \frac{1}{n} \sum_{i=1}^n \sup_{x'_i \in Z_i} f(v, x'_i) + M_n.

Further note that

    \sup_{x \in Z_i} f(v, x) - \inf_{x \in Z_i} f(v, x) \;\le\; d(2\epsilon(n)).

Thus we have, for all v,

    \frac{1}{n} \sum_{i=1}^n \inf_{x'_i \in Z_i} f(v, x'_i) - M_n \;\le\; \int_{\mathbb{R}^m} f(v, x)\, h^*(x)\, dx \;\le\; \frac{1}{n} \sum_{i=1}^n \inf_{x'_i \in Z_i} f(v, x'_i) + M_n + d(2\epsilon(n)).

Since both M_n and d(2ε(n)) go to zero, the theorem follows easily. □

Remark 3.1  Observe from the proof that if we relax the requirement of equicontinuity of {f(v, ·)}, then RO is essentially maximizing an asymptotic lower bound of the true expected reward. Furthermore, if we instead only require the equicontinuity of {f(v(n), ·) | n = 1, 2, ...}, then the consistency result still holds. In fact, as v(n) is the optimal solution of a robust optimization, this condition is much easier to satisfy than the equicontinuity of {f(v, ·)}.

Remark 3.2  Theorem 3.1 suggests a methodology for choosing an appropriate size ε(n) of the uncertainty set. Previous work on RO (e.g., Bertsimas and Sim [7]) considers the setting where the observed parameter is the result of corruption of the true parameter (via additive noise). Consequently, the decision maker tunes the size of the uncertainty set used in the RO formulation to satisfy some probabilistic bounds or risk measure constraints, given a priori information on the noise and in particular the noise magnitude. Theorem 3.1 provides a different paradigm for the design of the uncertainty set: when the uncertainty is due to inherent randomness of the parameters, then to approximately solve the stochastic program, the decision maker can use RO, with the size of the uncertainty set slowly decreasing in the number of samples. To be more specific, the theorem shows that if the size scales like o(1) and ω(n^{−1/m}), then we achieve asymptotic consistency. Finally, also note that it is straightforward to modify the uncertainty set from an ℓ_∞ ball to other parameterized uncertainty sets. Indeed, the only requirement is that the family of distributions obtained using Theorem 2.1 contains a kernel density estimator.
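As a concrete illustration of the uncertainty-set sizing discussed in Remark 3.2, the sketch below solves the robust sampled problem on a one-dimensional toy instance by brute force. The utility function, the sampling distribution, and the particular rate ε(n) = n^{−1/(2m)} are illustrative assumptions; any rate that is o(1) and ω(n^{−1/m}) satisfies the conditions of Theorem 3.1.

```python
# Consistent RO formulation of Theorem 3.1 on a toy 1-D instance (m = 1).
import numpy as np

rng = np.random.default_rng(1)
m = 1
f = lambda v, x: -np.abs(v - x)               # illustrative utility, continuous in x

x_samples = rng.normal(loc=1.0, scale=2.0, size=200)
eps_n = x_samples.size ** (-1.0 / (2 * m))    # eps(n) -> 0 and n * eps(n)**m -> infinity

def robust_objective(v, samples, eps, n_inner=11):
    """(1/n) * sum_i min_{|delta_i| <= eps} f(v, x_i + delta_i), inner min on a small grid."""
    deltas = np.linspace(-eps, eps, n_inner)
    return np.mean([min(f(v, x + d) for d in deltas) for x in samples])

v_grid = np.linspace(-4.0, 6.0, 201)
v_n = v_grid[np.argmax([robust_objective(v, x_samples, eps_n) for v in v_grid])]
print(v_n)   # approaches the maximizer of E[f(v, x)] (here the median of x) as n grows
```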

4. Implications of Theorem 3.1  In this section, we describe the applications of Theorem 3.1 to consistency of sampled optimization problems in machine learning (Support Vector Machines, and also Lasso, or ℓ_1-regularized regression), and then to the popular but as-of-yet not well-understood technique known as shrinkage, used, e.g., in Chapter 2 of [2], and in [40].

4.1 Sampled Optimization.  Many decision problems have the form

    \max_v : E_{x \sim P}\{f(v, x)\},    (9)

where the expectation is difficult or impossible to optimize, or even evaluate exactly. In machine learning problems, the typical setup is that the distribution P is unknown, and the decision-maker has knowledge of P only through a set of samples (cf. textbooks such as Anthony and Bartlett [1]). As mentioned above, one often sees the same setting in stochastic programming, where even if the distribution is known, it may be too complicated to evaluate (cf. textbooks such as Birge and Louveaux [8]). A sampling approach is often used instead (Shapiro and de Mello [30]), i.e., to make a decision simply by solving the sampled optimization problem:

    v_n \in \arg\max_v\, \frac{1}{n} \sum_{i=1}^n f(v, x_i).

One would hope that the solution to such a problem is a good approximation of the solution of Problem (9). However, due to the fact that the decision obtained is dependent on the samples, the empirical utility of v_n is a biased estimate of its expected utility. Thus, the sampling technique often yields overly optimistic solutions. Even worse, it may be the case that as n → ∞, the sequence {v_n} does not converge to the optimal decision. This is often termed over-fitting in the machine learning literature (Vapnik and Chervonenkis [36]), and has attracted extensive research (see textbooks such as Anthony and Bartlett [1], Devroye et al. [14], and many others). There are various approaches, often custom-tailored to the problem at hand, to try to control the problem of overfitting, with regularization being one of the foremost such tools.

Theorem 3.1 provides a unified approach to mitigate this unjustified optimism: one may assume some set-based uncertainty in the observed samples, and subsequently solve the corresponding robust counterpart. If the uncertainty sets are chosen properly (i.e., as in Theorem 3.1), then the corresponding class P_n of this robust optimization problem approximately contains the true distribution P, i.e., it contains a sequence of distributions that converges to P uniformly with respect to v. The theorem then guarantees that the sequence of min-max decisions converges to the optimal solution. It turns out that this property is in fact exploited implicitly (i.e., unwittingly) by many widely used learning algorithms, and hence one can show (post hoc) that this serves as an explanation for their success. We now give two examples, which we adapt from our work in [37] and [38]. We show that the classical and much used technique of regularization is in fact a special case of the more general approach outlined in Theorem 3.1.

Example 4.1 (Support Vector Machines)  Classification is a fundamental problem in machine learning. Here, a decision maker observes a set of training samples {x_i}_{i=1}^m (each sample is assumed to be in R^k) and their labels {y_i}_{i=1}^m (each label is in {−1, +1}). The goal is to learn the labeling rule, so as to be able to label future points x ∈ R^k. The Support Vector Machine (SVM) approach searches for a linear classification rule, of the form sgn(⟨w, x⟩ + b), to separate the space into points labeled +1 and −1. The linear rule given by (w, b) is selected by attempting to minimize the expected classification loss on future samples, i.e., the testing loss:

    \min_{w,b} : E_{(y,x) \sim \mu}\big[1 - y(\langle w, x \rangle + b)\big]_+.    (10)

Since the distribution µ is unknown, the decision-maker instead minimizes the loss on the observed samples, i.e., minimizes the training error,

    \min_{w,b} : \sum_{i=1}^m \big[1 - y_i(\langle w, x_i \rangle + b)\big]_+.    (11)
While we do not go into details here (but see Schölkopf and Smola [26]), this problem is in fact often very high-dimensional, possibly infinite-dimensional, because one may use a so-called kernel mapping to non-linearly map the data into a higher dimensional space, and look for a linear classifier in that space. Because of this, it has long been known that the empirical optimization as formulated in (11) will not be consistent in general. To correct for this fact, a long-standing technique in machine learning has been to add an ℓ_2-norm regularizer on w, as a so-called complexity penalty. The resulting problem is the norm-regularized SVM (Boser et al. [9], and Vapnik and Chervonenkis [35]):

    \min_{w,b} : c\,\|w\|_2 + \sum_{i=1}^m \big[1 - y_i(\langle w, x_i \rangle + b)\big]_+.    (12)
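The norm-regularized SVM (12) is a standard convex program. The following is a minimal sketch of solving it with cvxpy on synthetic two-dimensional data; the data-generating process and the value of the constant c are illustrative assumptions.

```python
# Norm-regularized SVM (12) on synthetic data, via cvxpy.
import cvxpy as cp
import numpy as np

rng = np.random.default_rng(2)
m, k, c = 60, 2, 0.1
X = np.vstack([rng.normal(+1.0, 1.0, size=(m // 2, k)),
               rng.normal(-1.0, 1.0, size=(m // 2, k))])
y = np.hstack([np.ones(m // 2), -np.ones(m // 2)])

w = cp.Variable(k)
b = cp.Variable()
hinge = cp.sum(cp.pos(1 - cp.multiply(y, X @ w + b)))    # sum_i [1 - y_i(<w, x_i> + b)]_+
cp.Problem(cp.Minimize(c * cp.norm(w, 2) + hinge)).solve()
print(w.value, b.value)
# By the equivalence recalled below, the same (w, b) also solves the robust
# problem in which the total l_2 perturbation of the samples is bounded by c.
```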

If the training samples {x_i, y_i}_{i=1}^m are non-separable, then we have shown in [37] that this regularized SVM is equivalent (i.e., has the same set of solutions) to the robust optimization problem

    \min_{w,b} : \max_{(\hat y_1, \hat x_1, \ldots, \hat y_m, \hat x_m) \in T_m} \sum_{i=1}^m \big[1 - \hat y_i(\langle w, \hat x_i \rangle + b)\big]_+,

where the uncertainty set is given by

    T_m \triangleq \Big\{(\hat y_1, \hat x_1, \ldots, \hat y_m, \hat x_m) : \hat y_j = y_j\ \forall j;\ \sum_{i=1}^m \|x_i - \hat x_i\|_2 \le c\Big\}.

Then, using Theorem 3.1, one can show that regularized SVM is consistent in a much more direct fashion than previous equivalent results (e.g., Steinwart [33]), which rely on concepts such as algorithmic stability or VC-dimension.

Example 4.2 (Lasso)  The next example considers regression. In this setup we are given m vectors in R^k, denoted by {x_i}_{i=1}^m, and m associated real values {b_i}_{i=1}^m. We are looking for a k-dimensional linear regressor v such that the expected regression error for a new testing sample is minimized, i.e., we want to solve

    \min_v : E_{(b,x) \sim \mu}\,(b - x^\top v)^2.    (13)

As in the previous example, the distribution is unknown except through the given training samples, which are i.i.d. realizations of µ. There are many ways to solve this regression problem, and we consider a specific popular framework known as Lasso (Tibshirani [34]). Let b denote the vector form of b_1, ..., b_m and let X denote the m × k matrix whose i-th row is x_i^⊤. In [38], we show that the ℓ_1-regularized regression problem (also known as Lasso),

    \min_v : \|b - Xv\|_2 + c\,\|v\|_1,

is equivalent to the robust regression problem

    \min_v : \max_{\Delta \in U_m} \|b - (X + \Delta)v\|_2,

with the uncertainty set

    U_m \triangleq \big\{[\delta_1, \ldots, \delta_k] : \|\delta_j\|_2 \le c,\ j = 1, \ldots, k\big\}.

Thus, as in the previous example, using Theorem 3.1 we can establish that the robust formulation above, and hence the well-known Lasso procedure, is statistically consistent.
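Analogously to the SVM sketch above, here is a minimal sketch of the Lasso formulation of Example 4.2 with cvxpy on synthetic data; the design matrix, the noise level, and the constant c are illustrative assumptions.

```python
# l_1-regularized regression (Lasso) of Example 4.2 on synthetic data, via cvxpy.
import cvxpy as cp
import numpy as np

rng = np.random.default_rng(3)
m, k, c = 50, 10, 0.5
X = rng.normal(size=(m, k))
v_true = np.zeros(k); v_true[:3] = [2.0, -1.0, 0.5]       # sparse ground truth
b = X @ v_true + 0.1 * rng.normal(size=m)

v = cp.Variable(k)
# ||b - X v||_2 + c * ||v||_1, equivalent to robust regression with
# column-wise perturbations delta_j bounded by c in l_2 norm.
cp.Problem(cp.Minimize(cp.norm(b - X @ v, 2) + c * cp.norm(v, 1))).solve()
print(np.round(v.value, 3))
```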

4.2 Uncertainty Set Shrinkage  When parameter deviation is not adversarial in nature, the standard RO formulation

    \max_v \min_{x_\delta \in \Delta} f(v, x_0 + x_\delta),

where ∆ is the set of all possible deviations, often leads to conservative solutions (cf. Xu and Mannor [39], and Delage and Mannor [11]). A natural remedy is to shrink the uncertainty set, i.e., to fix α ∈ (0, 1) and solve

    \max_v \min_{x_\delta \in \alpha\Delta} f(v, x_0 + x_\delta).

This method, though intuitively appealing, easily implemented, and hence widely used in applications, lacks justification. Specifically, the physical meaning of the set α∆ is unclear. Based on the probabilistic interpretation of RO, we provide a justification for such an approach: indeed, the shrinkage approach approximately solves (see Theorem 4.1 for the precise statement) the following DRSP

    \inf_{\mu \in \hat P} \int_{\mathbb{R}^m} f(v, x)\, d\mu(x),

where P̂ = {µ ∈ P | µ({x_0}) ≥ 1 − α, µ(x_0 + ∆) = 1}. Thus, the uncertainty set shrinkage method indeed describes a system having a two-scenario setup: with probability at least 1 − α the system is in a normal state, where the parameters take the nominal value x_0; otherwise the system is in an abnormal state, where the parameter has a deviation that belongs to ∆.

Let us first consider the important special case where f(v, ·) is linear with respect to the second argument. For example, linear programs with uncertain cost, or MDPs with uncertain reward, fall into this setting. In this case, the shrinkage approach exactly solves the DRSP. We formalize this statement in the following corollary. The corollary follows immediately from Theorem 4.1, which we provide below.

Corollary 4.1  If for all v, f(v, ·) is linear, then

    \min_{x_\delta \in \alpha\Delta} f(v, x_0 + x_\delta) \;=\; \inf_{\mu \in \hat P} \int_{\mathbb{R}^m} f(v, x)\, d\mu(x),

where, as above, P̂ = {µ ∈ P | µ({x_0}) ≥ 1 − α, µ(x_0 + ∆) = 1}.

In general, when f(v, ·) is not linear with respect to the second argument, such an equivalence relationship does not hold exactly. For example, with x_0 = 0 and ∆ = [−1, 1], one can construct a simple nonlinear f : R × R → R for which the optimal solution of the α-shrinkage problem max_v min_{x_δ ∈ α∆} f(v, x_0 + x_δ) varies with α, whereas the solution of the distributionally robust problem remains unchanged. Nevertheless, under some assumptions on the curvature of f(v, ·), the equivalence continues to approximately hold in much greater generality. This is the content of the following theorem.

Theorem 4.1  Suppose that for every v, f(v, ·) is twice differentiable with a uniformly bounded Hessian; that is, there exists h ≥ 0 such that for all v, x,

    -hI \preceq H_v(x) \preceq hI,

where H_v(x) is the Hessian of f(v, ·) evaluated at x. Then for all v,

    \inf_{\mu \in \hat P} \int_{\mathbb{R}^m} f(v, x)\, d\mu(x) - \alpha D^2 h \;\le\; \min_{x_\delta \in \alpha\Delta} f(v, x_0 + x_\delta) \;\le\; \inf_{\mu \in \hat P} \int_{\mathbb{R}^m} f(v, x)\, d\mu(x) + \alpha D^2 h,

where D = max_{x ∈ ∆} ‖x‖_2 and P̂ = {µ ∈ P | µ({x_0}) ≥ 1 − α, µ(x_0 + ∆) = 1}.

Proof.  Denote the gradient of f(v, ·) by g_v(·). Fix v and x' ∈ ∆, and let x'' = αx', which by definition belongs to α∆. Since f(v, ·) is twice differentiable, there exists β ∈ [0, 1] such that

    f(v, x_0 + x') = f(v, x_0) + g_v\big((1-\beta)x_0 + \beta(x_0 + x')\big)^\top x' = f(v, x_0) + g_v(x_0 + \beta x')^\top x'.

Similarly, there exists β' ∈ [0, 1] such that

    f(v, x_0 + x'') = f(v, x_0) + g_v\big((1-\beta')x_0 + \beta'(x_0 + x'')\big)^\top x'' = f(v, x_0) + \alpha\, g_v(x_0 + \alpha\beta' x')^\top x'.

Due to the boundedness of the Hessian we have

    \|g_v(x_0 + \beta x') - g_v(x_0 + \alpha\beta' x')\| \le h\,\|\beta x' - \alpha\beta' x'\| \le h\,\|x'\| \le hD,

which implies that

    f(v, x_0 + x'') \le f(v, x_0) + \alpha\, g_v(x_0 + \beta x')^\top x' + \alpha h D\,\|x'\| \le (1-\alpha) f(v, x_0) + \alpha f(v, x_0 + x') + \alpha h D^2,

and similarly

    f(v, x_0 + x'') \ge (1-\alpha) f(v, x_0) + \alpha f(v, x_0 + x') - \alpha h D^2.

Thus,

    (1-\alpha) f(v, x_0) + \alpha f(v, x_0 + x') - \alpha D^2 h \;\le\; f(v, x_0 + x'') \;\le\; (1-\alpha) f(v, x_0) + \alpha f(v, x_0 + x') + \alpha D^2 h.

Since this holds for all x' ∈ ∆, and recalling that x'' = αx', by the definition of α∆ we have

    (1-\alpha) f(v, x_0) + \alpha \min_{x_\delta \in \Delta} f(v, x_0 + x_\delta) - \alpha D^2 h \;\le\; \min_{x_\delta \in \alpha\Delta} f(v, x_0 + x_\delta) \;\le\; (1-\alpha) f(v, x_0) + \alpha \min_{x_\delta \in \Delta} f(v, x_0 + x_\delta) + \alpha D^2 h.

The theorem thus holds due to the following equality, implied by Corollary 5.2:

    (1-\alpha) f(v, x_0) + \alpha \min_{x_\delta \in \Delta} f(v, x_0 + x_\delta) \;=\; \inf_{\mu \in \hat P} \int_{\mathbb{R}^m} f(v, x)\, d\mu(x). \qquad \square
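The linear case (Corollary 4.1) is easy to check numerically: for linear f, the value of the RO problem over the shrunken set α∆ coincides with the two-scenario DRSP value (1 − α)f(v, x_0) + α min_{x_δ ∈ ∆} f(v, x_0 + x_δ). The sketch below does this for a fixed decision; the cost vector, the nominal point, and the (finite) deviation set are illustrative assumptions.

```python
# Numerical check of Corollary 4.1 (linear case of uncertainty set shrinkage).
import numpy as np

rng = np.random.default_rng(4)
a = rng.normal(size=3)                           # f(v, x) = <a, x> for a fixed decision v
x0 = rng.normal(size=3)                          # nominal parameter value
Delta = rng.uniform(-1.0, 1.0, size=(500, 3))    # finite deviation set (a sample of points)
alpha = 0.3

f = lambda x: a @ x
shrunken_ro = min(f(x0 + alpha * d) for d in Delta)                            # min over x0 + alpha*Delta
two_scenario = (1 - alpha) * f(x0) + alpha * min(f(x0 + d) for d in Delta)     # worst two-point mixture
print(shrunken_ro, two_scenario)                 # identical for linear f
```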

The following corollary considers shrinkage in the general case where f(v, ·) is not linear and where the uncertainty set is star-shaped. Instead of additive upper and lower bounds as in Theorem 4.1, we provide a probabilistic interpretation for both the upper bound and the lower bound.

Corollary 4.2  Let ∆ be star-shaped, i.e., if x ∈ ∆, then γx ∈ ∆ for all γ ∈ [0, 1]. If for all v, f(v, ·) is convex, satisfies f(v, x_0) ≥ min_{x_δ ∈ ∆} f(v, x_0 + x_δ) + 1 (this can be achieved by normalization), and is twice differentiable with a bounded Hessian, then

    \inf_{\mu \in \hat P_1} \int_{\mathbb{R}^m} f(v, x)\, d\mu(x) \;\le\; \min_{x_\delta \in \alpha\Delta} f(v, x_0 + x_\delta) \;\le\; \inf_{\mu \in \hat P} \int_{\mathbb{R}^m} f(v, x)\, d\mu(x),

where D, h, and P̂ are the same as in Theorem 4.1, and

    \hat P_1 = \{\mu \in P \mid \mu(\{x_0\}) \ge \max(0,\, 1 - \alpha - \alpha D^2 h),\ \mu(x_0 + \Delta) = 1\}.

Proof.  The right-hand side holds due to the following inequality, implied by the convexity of f(v, ·):

    f(v, x_0 + \alpha x_\delta) \le (1-\alpha) f(v, x_0) + \alpha f(v, x_0 + x_\delta), \qquad \forall x_\delta \in \Delta;

minimizing both sides over x_δ ∈ ∆ and applying the equality for inf_{µ∈P̂} implied by Corollary 5.2 yields the claim. By Theorem 4.1 we have

    (1-\alpha) f(v, x_0) + \alpha \min_{x_\delta \in \Delta} f(v, x_0 + x_\delta) - \alpha D^2 h \;\le\; \min_{x_\delta \in \alpha\Delta} f(v, x_0 + x_\delta).

Since f(v, x_0) ≥ min_{x_δ ∈ ∆} f(v, x_0 + x_δ) + 1, this implies

    (1 - \alpha - \alpha D^2 h) f(v, x_0) + (\alpha + \alpha D^2 h) \min_{x_\delta \in \Delta} f(v, x_0 + x_\delta) \;\le\; \min_{x_\delta \in \alpha\Delta} f(v, x_0 + x_\delta).

We thus establish the left-hand side by combining this with the following inequality, implied by ∆ being star-shaped:

    \min_{x_\delta \in \Delta} f(v, x_0 + x_\delta) \;\le\; \min_{x_\delta \in \alpha\Delta} f(v, x_0 + x_\delta). \qquad \square

Corollary 4.2 shows that for convex functions, the shrinkage method indeed solves a problem that is bounded between two DRSPs, both having a two-scenario setup: with a certain probability the system is in a normal state, where the parameters take the nominal value x_0; otherwise the system is in an abnormal state, where the parameter has a deviation that belongs to ∆. The difference between the normal-state probabilities of these two DRSPs depends on the curvature of the function, and diminishes to zero when the function f(v, ·) is linear.

5. Extensions: Correlated Uncertainty and Computation  In this section we show two extensions of Theorem 2.1 that address situations that arise naturally in practice. First, we show that it is possible to consider uncertainty sets (of different samples) that are coupled, and we derive the set of distributions of the equivalent DRSP. This is useful in settings where the samples are not generated in an independent manner, but also when they are generated i.i.d. and one seeks to reduce the conservativeness of the robust formulation by modeling them as satisfying some joint constraints. Note that, as stated, Theorem 2.1 cannot capture this, as it implicitly assumes that the realization of each parameter does not depend on that of the others. In practice it is often the case that the uncertain parameters need to satisfy some joint constraints (e.g., Bertsimas and Sim [7]). In particular, we consider the setting where each sample is subject to some perturbation, and the total perturbation is bounded. This would be a sensible constraint to place on a robust optimization formulation if the perturbations of each sample are generated independently according to some unknown distribution, in which case one would expect the total variation of the perturbations to be small. Note that the uncertain parameters of Example 4.1 (SVM) essentially satisfy such a joint constraint. We omit the proof of Corollary 5.1 since it is a straightforward extension of Theorem 2.1.

Corollary 5.1 (Bounded Total Perturbation)  The RO formulation with bounded total perturbation is equivalent to a DRSP; i.e., the following holds:

    \inf_{\hat x_1, \ldots, \hat x_n :\ \sum_{i=1}^n \|\hat x_i - x_i\|_2 \le r}\ \sum_{i=1}^n c_i f(\hat x_i) \;=\; \inf_{\mu \in \bar P} \int_{\mathbb{R}^m} f(x)\, d\mu(x),

where the set of distributions is given by

    \bar P = \Big\{\mu \in P \;\Big|\; \exists\, r_1, \ldots, r_n \ge 0,\ \textstyle\sum_{i=1}^n r_i = r, \text{ such that } \forall S \subseteq [1:n]:\ \mu\big(\cup_{i \in S} B(x_i, r_i)\big) \ge \textstyle\sum_{i \in S} c_i\Big\}.

Corollary 5.1 shows that Theorem 2.1 can be generalized to joint constraints on the uncertain parameters. Indeed, it is straightforward to adapt Corollary 5.1 to accommodate other types of joint constraints, the main limitation simply being that the resulting robust optimization problem should remain computationally tractable.

The next corollary demonstrates that the equivalence relationship of Theorem 2.1 has computational consequences. These consequences stem from the fact that, generically, it is considerably easier to solve a (deterministic) RO problem than it is to solve a DRSP. Indeed, we show that it is possible to solve certain DRSPs that arise naturally in practice by solving their equivalent RO formulations. To illustrate this point, consider the setting where the admissible distributions have a nested-set structure. That is, given sets Z_1 ⊆ Z_2 ⊆ ⋯ ⊆ Z_n and p_1 < p_2 < ⋯ < p_n = 1, suppose each of the admissible distributions satisfies µ(Z_i) ≥ p_i. This nested structure is natural; it comes from the setting where a distribution is estimated from samples. A common technique in such settings is to model the uncertainty using probabilistic confidence regions, which naturally shrink as the number of samples grows, hence yielding a nested structure.

Corollary 5.2 (Nested distributions)  Let Z_1 ⊆ Z_2 ⊆ ⋯ ⊆ Z_n, and 0 = p_0 < p_1 < p_2 < ⋯ < p_n = 1. Consider the set of distributions

    \hat P = \{\mu \in P \mid \mu(Z_i) \ge p_i,\ i = 1, \ldots, n\}.

Then the DRSP is equivalent to the RO:

    \inf_{\mu \in \hat P} \int_{\mathbb{R}^m} f(x)\, d\mu(x) \;=\; \sum_{i=1}^n (p_i - p_{i-1}) \inf_{x_i \in Z_i} f(x_i).

Proof.  Let c_i = p_i − p_{i−1}, and define

    \hat P' = \{\mu \in P \mid \forall S \subseteq [1:n]:\ \mu(\cup_{i \in S} Z_i) \ge \textstyle\sum_{i \in S} c_i\}.

Since Z_i ⊆ Z_j for i < j, we have ∪_{i∈S} Z_i = Z_{i*(S)}, where i*(S) ≜ max_{i∈S} i, and

    \sum_{i \in S} c_i \;\le\; \sum_{i \le i^*(S)} c_i \;=\; p_{i^*(S)}.

Thus P̂ = P̂', and by Theorem 2.1 we have

    \inf_{\mu \in \hat P'} \int_{\mathbb{R}^m} f(x)\, d\mu(x) \;=\; \sum_{i=1}^n (p_i - p_{i-1}) \inf_{x_i \in Z_i} f(x_i),

which implies the corollary. □
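Corollary 5.2 can likewise be checked numerically on a grid: the nested-set DRSP becomes a small linear program whose value matches the weighted sum of infima on the right-hand side. The function f, the nested intervals, and the probabilities p_i below are illustrative assumptions.

```python
# Numerical check of Corollary 5.2 (nested distribution sets) on a 1-D grid.
import numpy as np
from scipy.optimize import linprog

f = lambda x: np.cos(3 * x) + 0.5 * x
Z = [(-0.2, 0.2), (-0.6, 0.8), (-1.0, 1.5)]     # nested intervals Z_1 c Z_2 c Z_3
p = np.array([0.5, 0.8, 1.0])                   # 0 < p_1 < p_2 < p_3 = 1

grid = np.linspace(-1.0, 1.5, 2501)
member = np.array([(lo <= grid) & (grid <= hi) for (lo, hi) in Z])

# Right-hand side: sum_i (p_i - p_{i-1}) * inf_{Z_i} f, evaluated on the grid.
rhs = sum((p[i] - (p[i - 1] if i else 0.0)) * f(grid[member[i]]).min() for i in range(len(Z)))

# Left-hand side: DRSP over {mu : mu(Z_i) >= p_i for all i}, as an LP on the grid.
res = linprog(f(grid), A_ub=-member.astype(float), b_ub=-p,
              A_eq=np.ones((1, grid.size)), b_eq=[1.0])
print(rhs, res.fun)                             # the two values agree (up to solver tolerance)
```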

6. Conclusion  We show that robust optimization problems can be re-formulated as distributionally robust stochastic problems. That is, robust optimization is equivalent to maximizing the worst-case expected value over a class of distributions. While such an equivalence is well known in the special case where each uncertain parameter belongs to a different space, we generalize it to the case where multiple parameters belong to the same fixed-dimensional space. This setting arises naturally in stochastic problems that are attacked via sampling (as in machine learning or stochastic programming). Using this reformulation, we show how to construct robust optimization problems that are statistically consistent, even when the original empirical optimization is not. Our approach further provides a probabilistic interpretation of the common practice of shrinking the uncertainty set in robust optimization to avoid over-conservativeness.

Acknowledgements  We thank Aharon Ben-Tal for useful discussions and for pointing us to the shrinkage heuristic. The research of C.C. was partially supported by NSF grants (EFRI, CNS) and a DTRA grant (HDTRA). The research of S.M. was partially supported by the Israel Science Foundation (contract 89005).

References

[1] M. Anthony and P. L. Bartlett, Neural network learning: Theoretical foundations, Cambridge University Press, 1999.
[2] A. Ben-Tal, L. El Ghaoui, and A. Nemirovski, Robust optimization, Princeton University Press, 2009.
[3] A. Ben-Tal and A. Nemirovski, Robust solutions of uncertain linear programs, Operations Research Letters 25 (1999), no. 1.
[4] ___, Robust solutions of linear programming problems contaminated with uncertain data, Mathematical Programming, Series A 88 (2000).
[5] A. Ben-Tal, B. Golany, A. Nemirovski, and J. P. Vial, Supplier-retailer flexible commitments contracts: A robust optimization approach, Manufacturing and Service Operations Management 7 (2005), no. 3.
[6] A. Ben-Tal, T. Margalit, and A. Nemirovski, Robust modeling of multi-stage portfolio problems, High Performance Optimization (H. Frenk, C. Roos, T. Terlaky, and S. Zhang, eds.), 2000.
[7] D. Bertsimas and M. Sim, The price of robustness, Operations Research 52 (2004), no. 1.
[8] J. R. Birge and F. Louveaux, Introduction to stochastic programming, Springer-Verlag, New York, 1997.
[9] B. E. Boser, I. M. Guyon, and V. N. Vapnik, A training algorithm for optimal margin classifiers, Proceedings of the Fifth Annual ACM Workshop on Computational Learning Theory (New York, NY), 1992.
[10] S. P. Boyd, S. J. Kim, D. D. Patil, and M. A. Horowitz, Digital circuit optimization via geometric programming, Operations Research 53 (2005), no. 6.
[11] E. Delage and S. Mannor, Percentile optimization for Markov decision processes with parameter uncertainty, Operations Research 58 (2010), no. 1.
[12] E. Delage and Y. Ye, Distributionally robust optimization under moment uncertainty with applications to data-driven problems, to appear in Operations Research, 2010.
[13] L. Devroye and L. Györfi, Nonparametric density estimation: The L1 view, John Wiley & Sons, 1985.
[14] L. Devroye, L. Györfi, and G. Lugosi, A probabilistic theory of pattern recognition, Springer, New York, 1996.
[15] J. Dupačová, The minimax approach to stochastic programming and an illustrative application, Stochastics 20 (1987).
[16] L. El Ghaoui and H. Lebret, Robust solutions to least-squares problems with uncertain data, SIAM Journal on Matrix Analysis and Applications 18 (1997).
[17] P. Kall, Stochastic programming with recourse: Upper bounds and moment problems, a review, Advances in Mathematical Optimization, Akademie-Verlag, Berlin, 1988.

[18] A. J. King and R. J. B. Wets, Epi-consistency of convex stochastic programs, Stochastics and Stochastic Reports 34 (1991), no. 1.
[19] J. Goh and M. Sim, Distributionally robust optimization and its tractable approximations, Operations Research (in press), 2010.
[20] G. R. Lanckriet, L. El Ghaoui, C. Bhattacharyya, and M. I. Jordan, A robust minimax approach to classification, Journal of Machine Learning Research 3 (2003).
[21] E. Parzen, On the estimation of a probability density function and the mode, The Annals of Mathematical Statistics 33 (1962).
[22] I. Popescu, Robust mean-covariance solutions for stochastic optimization, Operations Research 55 (2007), no. 1.
[23] A. Prékopa, Stochastic programming, Kluwer, 1995.
[24] M. Rosenblatt, Remarks on some nonparametric estimates of a density function, The Annals of Mathematical Statistics 27 (1956).
[25] H. Scarf, A min-max solution of an inventory problem, Studies in Mathematical Theory of Inventory and Production, Stanford University Press, 1958.
[26] B. Schölkopf and A. J. Smola, Learning with kernels, MIT Press, 2002.
[27] D. W. Scott, Multivariate density estimation: Theory, practice and visualization, John Wiley & Sons, New York, 1992.
[28] A. Shapiro, Asymptotic analysis of stochastic programs, Annals of Operations Research 30 (1991), no. 1-4.
[29] ___, Worst-case distribution analysis of stochastic programs, Mathematical Programming 107 (2006), no. 1.
[30] A. Shapiro and T. Homem-de-Mello, On the rate of convergence of Monte Carlo approximations of stochastic programs, SIAM Journal on Optimization (2000).
[31] P. K. Shivaswamy, C. Bhattacharyya, and A. J. Smola, Second order cone programming approaches for handling missing and uncertain data, Journal of Machine Learning Research 7 (2006).
[32] A. L. Soyster, Convex programming with set-inclusive constraints and applications to inexact linear programming, Operations Research 21 (1973).
[33] I. Steinwart, Consistency of support vector machines and other regularized kernel classifiers, IEEE Transactions on Information Theory 51 (2005), no. 1.
[34] R. Tibshirani, Regression shrinkage and selection via the Lasso, Journal of the Royal Statistical Society, Series B 58 (1996), no. 1.
[35] V. N. Vapnik and A. Chervonenkis, Theory of pattern recognition, Nauka, Moscow, 1974.
[36] ___, The necessary and sufficient conditions for consistency in the empirical risk minimization method, Pattern Recognition and Image Analysis 1 (1991), no. 3.
[37] H. Xu, C. Caramanis, and S. Mannor, Robustness and regularization of support vector machines, Journal of Machine Learning Research 10 (2009).
[38] ___, Robust regression and Lasso, to appear in IEEE Transactions on Information Theory, 2010.
[39] H. Xu and S. Mannor, The robustness-performance tradeoff in Markov decision processes, Advances in Neural Information Processing Systems 19 (B. Schölkopf, J. C. Platt, and T. Hofmann, eds.), MIT Press, 2007.
[40] A. Ben-Tal, A. Goryashko, E. Guslitzer, and A. Nemirovski, Adjustable robust solutions of uncertain linear programs, Mathematical Programming, 2003.

Appendix. Kernel Density Estimation  The kernel density estimator for a density ĥ on R^d, originally proposed in Rosenblatt [24] and Parzen [21], is defined by

    h_n(x) = \frac{1}{n\, c_n^d} \sum_{i=1}^n K\!\left(\frac{x - \hat x_i}{c_n}\right),

where {c_n} is a sequence of positive numbers, the x̂_i are i.i.d. samples generated according to ĥ, and K is a Borel measurable function (kernel) satisfying K ≥ 0, ∫K = 1. See Devroye and Györfi [13], Scott [27], and the references therein for detailed discussions. Figure 1 illustrates a kernel density estimator using a Gaussian kernel for a randomly generated sample set. A celebrated property of the kernel density estimator is that it converges in ℓ_1 to ĥ, i.e.,

    \int_{\mathbb{R}^d} |h_n(x) - \hat h(x)|\, dx \;\to\; 0,

when c_n → 0 and n c_n^d → ∞ (Devroye and Györfi [13]).

[Figure 1: Illustration of kernel density estimation — samples, kernel functions, and the resulting estimated density.]
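For completeness, here is a minimal sketch of a kernel density estimator with the box kernel used in the proof of Theorem 3.1 (d = 1); the sampling density and the bandwidth rate are illustrative assumptions.

```python
# Kernel density estimator with the box kernel K(z) = 1/2 on |z| <= 1 (d = 1).
import numpy as np

rng = np.random.default_rng(5)
samples = rng.normal(loc=0.0, scale=1.0, size=500)   # i.i.d. draws from the density to estimate
c_n = samples.size ** (-0.5)                         # bandwidth: c_n -> 0 and n * c_n -> infinity

def kde(x, data, c):
    """h_n(x) = (1 / (n * c)) * sum_i K((x - x_i) / c), box kernel."""
    z = (x[:, None] - data[None, :]) / c
    return (0.5 * (np.abs(z) <= 1.0)).sum(axis=1) / (data.size * c)

xs = np.linspace(-4.0, 4.0, 401)
density = kde(xs, samples, c_n)
print(np.trapz(density, xs))                         # integrates to approximately 1
```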


More information

Margin Maximizing Loss Functions

Margin Maximizing Loss Functions Margin Maximizing Loss Functions Saharon Rosset, Ji Zhu and Trevor Hastie Department of Statistics Stanford University Stanford, CA, 94305 saharon, jzhu, hastie@stat.stanford.edu Abstract Margin maximizing

More information

Discriminative Models

Discriminative Models No.5 Discriminative Models Hui Jiang Department of Electrical Engineering and Computer Science Lassonde School of Engineering York University, Toronto, Canada Outline Generative vs. Discriminative models

More information

SUPPLEMENTARY MATERIAL. Authors: Alan A. Stocker (1) and Eero P. Simoncelli (2)

SUPPLEMENTARY MATERIAL. Authors: Alan A. Stocker (1) and Eero P. Simoncelli (2) SUPPLEMENTARY MATERIAL Authors: Alan A. Stocker () and Eero P. Simoncelli () Affiliations: () Dept. of Psychology, Uniersity of Pennsylania 34 Walnut Street 33C Philadelphia, PA 94-68 U.S.A. () Howard

More information

Optimal Joint Detection and Estimation in Linear Models

Optimal Joint Detection and Estimation in Linear Models Optimal Joint Detection and Estimation in Linear Models Jianshu Chen, Yue Zhao, Andrea Goldsmith, and H. Vincent Poor Abstract The problem of optimal joint detection and estimation in linear models with

More information

FORMULATION OF THE LEARNING PROBLEM

FORMULATION OF THE LEARNING PROBLEM FORMULTION OF THE LERNING PROBLEM MIM RGINSKY Now that we have seen an informal statement of the learning problem, as well as acquired some technical tools in the form of concentration inequalities, we

More information

LECTURE 2: CROSS PRODUCTS, MULTILINEARITY, AND AREAS OF PARALLELOGRAMS

LECTURE 2: CROSS PRODUCTS, MULTILINEARITY, AND AREAS OF PARALLELOGRAMS LECTURE : CROSS PRODUCTS, MULTILINEARITY, AND AREAS OF PARALLELOGRAMS MA1111: LINEAR ALGEBRA I, MICHAELMAS 016 1. Finishing up dot products Last time we stated the following theorem, for which I owe you

More information

Collective Risk Models with Dependence Uncertainty

Collective Risk Models with Dependence Uncertainty Collectie Risk Models with Dependence Uncertainty Haiyan Liu and Ruodu Wang ebruary 27, 2017 Abstract We bring the recently deeloped framework of dependence uncertainty into collectie risk models, one

More information

A Bahadur Representation of the Linear Support Vector Machine

A Bahadur Representation of the Linear Support Vector Machine A Bahadur Representation of the Linear Support Vector Machine Yoonkyung Lee Department of Statistics The Ohio State University October 7, 2008 Data Mining and Statistical Learning Study Group Outline Support

More information

Multi-Range Robust Optimization vs Stochastic Programming in Prioritizing Project Selection

Multi-Range Robust Optimization vs Stochastic Programming in Prioritizing Project Selection Multi-Range Robust Optimization vs Stochastic Programming in Prioritizing Project Selection Ruken Düzgün Aurélie Thiele July 2012 Abstract This paper describes a multi-range robust optimization approach

More information

Balanced Partitions of Vector Sequences

Balanced Partitions of Vector Sequences Balanced Partitions of Vector Sequences Imre Bárány Benjamin Doerr December 20, 2004 Abstract Let d,r N and be any norm on R d. Let B denote the unit ball with respect to this norm. We show that any sequence

More information

Does Unlabeled Data Help?

Does Unlabeled Data Help? Does Unlabeled Data Help? Worst-case Analysis of the Sample Complexity of Semi-supervised Learning. Ben-David, Lu and Pal; COLT, 2008. Presentation by Ashish Rastogi Courant Machine Learning Seminar. Outline

More information

SUPPORT VECTOR MACHINE

SUPPORT VECTOR MACHINE SUPPORT VECTOR MACHINE Mainly based on https://nlp.stanford.edu/ir-book/pdf/15svm.pdf 1 Overview SVM is a huge topic Integration of MMDS, IIR, and Andrew Moore s slides here Our foci: Geometric intuition

More information

Lightning Does Not Strike Twice: Robust MDPs with Coupled Uncertainty

Lightning Does Not Strike Twice: Robust MDPs with Coupled Uncertainty JMLR: Workshop and Conference Proceedings vol (212) 1 12 European Workshop on Reinforcement Learning Lightning Does Not Strike Twice: Robust MDPs with Coupled Uncertainty Shie Mannor Technion Ofir Mebel

More information

ECE-271B. Nuno Vasconcelos ECE Department, UCSD

ECE-271B. Nuno Vasconcelos ECE Department, UCSD ECE-271B Statistical ti ti Learning II Nuno Vasconcelos ECE Department, UCSD The course the course is a graduate level course in statistical learning in SLI we covered the foundations of Bayesian or generative

More information

Machine Learning. VC Dimension and Model Complexity. Eric Xing , Fall 2015

Machine Learning. VC Dimension and Model Complexity. Eric Xing , Fall 2015 Machine Learning 10-701, Fall 2015 VC Dimension and Model Complexity Eric Xing Lecture 16, November 3, 2015 Reading: Chap. 7 T.M book, and outline material Eric Xing @ CMU, 2006-2015 1 Last time: PAC and

More information

Insights into Cross-validation

Insights into Cross-validation Noname manuscript No. (will be inserted by the editor) Insights into Cross-alidation Amit Dhurandhar Alin Dobra Receied: date / Accepted: date Abstract Cross-alidation is one of the most widely used techniques,

More information

Solving Classification Problems By Knowledge Sets

Solving Classification Problems By Knowledge Sets Solving Classification Problems By Knowledge Sets Marcin Orchel a, a Department of Computer Science, AGH University of Science and Technology, Al. A. Mickiewicza 30, 30-059 Kraków, Poland Abstract We propose

More information

Support'Vector'Machines. Machine(Learning(Spring(2018 March(5(2018 Kasthuri Kannan

Support'Vector'Machines. Machine(Learning(Spring(2018 March(5(2018 Kasthuri Kannan Support'Vector'Machines Machine(Learning(Spring(2018 March(5(2018 Kasthuri Kannan kasthuri.kannan@nyumc.org Overview Support Vector Machines for Classification Linear Discrimination Nonlinear Discrimination

More information

A Simple Algorithm for Learning Stable Machines

A Simple Algorithm for Learning Stable Machines A Simple Algorithm for Learning Stable Machines Savina Andonova and Andre Elisseeff and Theodoros Evgeniou and Massimiliano ontil Abstract. We present an algorithm for learning stable machines which is

More information

An Alternative Characterization of Hidden Regular Variation in Joint Tail Modeling

An Alternative Characterization of Hidden Regular Variation in Joint Tail Modeling An Alternatie Characterization of Hidden Regular Variation in Joint Tail Modeling Grant B. Weller Daniel S. Cooley Department of Statistics, Colorado State Uniersity, Fort Collins, CO USA May 22, 212 Abstract

More information

Support Vector Machine (SVM) and Kernel Methods

Support Vector Machine (SVM) and Kernel Methods Support Vector Machine (SVM) and Kernel Methods CE-717: Machine Learning Sharif University of Technology Fall 2015 Soleymani Outline Margin concept Hard-Margin SVM Soft-Margin SVM Dual Problems of Hard-Margin

More information

Tractable Upper Bounds on the Restricted Isometry Constant

Tractable Upper Bounds on the Restricted Isometry Constant Tractable Upper Bounds on the Restricted Isometry Constant Alex d Aspremont, Francis Bach, Laurent El Ghaoui Princeton University, École Normale Supérieure, U.C. Berkeley. Support from NSF, DHS and Google.

More information

THE DESCRIPTIVE COMPLEXITY OF SERIES REARRANGEMENTS

THE DESCRIPTIVE COMPLEXITY OF SERIES REARRANGEMENTS THE DESCRIPTIVE COMPLEXITY OF SERIES REARRANGEMENTS MICHAEL P. COHEN Abstract. We consider the descriptie complexity of some subsets of the infinite permutation group S which arise naturally from the classical

More information

Robust Dual-Response Optimization

Robust Dual-Response Optimization Yanıkoğlu, den Hertog, and Kleijnen Robust Dual-Response Optimization 29 May 1 June 1 / 24 Robust Dual-Response Optimization İhsan Yanıkoğlu, Dick den Hertog, Jack P.C. Kleijnen Özyeğin University, İstanbul,

More information

Support Vector Regression with Automatic Accuracy Control B. Scholkopf y, P. Bartlett, A. Smola y,r.williamson FEIT/RSISE, Australian National University, Canberra, Australia y GMD FIRST, Rudower Chaussee

More information

Interval solutions for interval algebraic equations

Interval solutions for interval algebraic equations Mathematics and Computers in Simulation 66 (2004) 207 217 Interval solutions for interval algebraic equations B.T. Polyak, S.A. Nazin Institute of Control Sciences, Russian Academy of Sciences, 65 Profsoyuznaya

More information

Handout 6: Some Applications of Conic Linear Programming

Handout 6: Some Applications of Conic Linear Programming ENGG 550: Foundations of Optimization 08 9 First Term Handout 6: Some Applications of Conic Linear Programming Instructor: Anthony Man Cho So November, 08 Introduction Conic linear programming CLP, and

More information

Sparse Algorithms are not Stable: A No-free-lunch Theorem

Sparse Algorithms are not Stable: A No-free-lunch Theorem 1 Sparse Algorithms are not Stable: A No-free-lunch Theorem Huan Xu, Constantine Caramanis, Member, IEEE and Shie Mannor, Senior Member, IEEE Abstract We consider two desired properties of learning algorithms:

More information

An Introduction to Statistical Theory of Learning. Nakul Verma Janelia, HHMI

An Introduction to Statistical Theory of Learning. Nakul Verma Janelia, HHMI An Introduction to Statistical Theory of Learning Nakul Verma Janelia, HHMI Towards formalizing learning What does it mean to learn a concept? Gain knowledge or experience of the concept. The basic process

More information

A GENERAL FORMULATION FOR SUPPORT VECTOR MACHINES. Wei Chu, S. Sathiya Keerthi, Chong Jin Ong

A GENERAL FORMULATION FOR SUPPORT VECTOR MACHINES. Wei Chu, S. Sathiya Keerthi, Chong Jin Ong A GENERAL FORMULATION FOR SUPPORT VECTOR MACHINES Wei Chu, S. Sathiya Keerthi, Chong Jin Ong Control Division, Department of Mechanical Engineering, National University of Singapore 0 Kent Ridge Crescent,

More information

Online Companion to Pricing Services Subject to Congestion: Charge Per-Use Fees or Sell Subscriptions?

Online Companion to Pricing Services Subject to Congestion: Charge Per-Use Fees or Sell Subscriptions? Online Companion to Pricing Serices Subject to Congestion: Charge Per-Use Fees or Sell Subscriptions? Gérard P. Cachon Pnina Feldman Operations and Information Management, The Wharton School, Uniersity

More information

Vapnik-Chervonenkis Dimension of Axis-Parallel Cuts arxiv: v2 [math.st] 23 Jul 2012

Vapnik-Chervonenkis Dimension of Axis-Parallel Cuts arxiv: v2 [math.st] 23 Jul 2012 Vapnik-Chervonenkis Dimension of Axis-Parallel Cuts arxiv:203.093v2 [math.st] 23 Jul 202 Servane Gey July 24, 202 Abstract The Vapnik-Chervonenkis (VC) dimension of the set of half-spaces of R d with frontiers

More information

Distributionally Robust Convex Optimization

Distributionally Robust Convex Optimization Submitted to Operations Research manuscript OPRE-2013-02-060 Authors are encouraged to submit new papers to INFORMS journals by means of a style file template, which includes the journal title. However,

More information

Short Course Robust Optimization and Machine Learning. 3. Optimization in Supervised Learning

Short Course Robust Optimization and Machine Learning. 3. Optimization in Supervised Learning Short Course Robust Optimization and 3. Optimization in Supervised EECS and IEOR Departments UC Berkeley Spring seminar TRANSP-OR, Zinal, Jan. 16-19, 2012 Outline Overview of Supervised models and variants

More information

Robust Novelty Detection with Single Class MPM

Robust Novelty Detection with Single Class MPM Robust Novelty Detection with Single Class MPM Gert R.G. Lanckriet EECS, U.C. Berkeley gert@eecs.berkeley.edu Laurent El Ghaoui EECS, U.C. Berkeley elghaoui@eecs.berkeley.edu Michael I. Jordan Computer

More information

BAYESIAN PREMIUM AND ASYMPTOTIC APPROXIMATION FOR A CLASS OF LOSS FUNCTIONS

BAYESIAN PREMIUM AND ASYMPTOTIC APPROXIMATION FOR A CLASS OF LOSS FUNCTIONS TH USHNG HOUS ROCDNGS OF TH ROMANAN ACADMY Series A OF TH ROMANAN ACADMY Volume 6 Number 3/005 pp 000-000 AYSAN RMUM AND ASYMTOTC AROMATON FOR A CASS OF OSS FUNCTONS Roxana CUMARA Department of Mathematics

More information

Econometrics II - EXAM Outline Solutions All questions have 25pts Answer each question in separate sheets

Econometrics II - EXAM Outline Solutions All questions have 25pts Answer each question in separate sheets Econometrics II - EXAM Outline Solutions All questions hae 5pts Answer each question in separate sheets. Consider the two linear simultaneous equations G with two exogeneous ariables K, y γ + y γ + x δ

More information

A Note on Extending Generalization Bounds for Binary Large-Margin Classifiers to Multiple Classes

A Note on Extending Generalization Bounds for Binary Large-Margin Classifiers to Multiple Classes A Note on Extending Generalization Bounds for Binary Large-Margin Classifiers to Multiple Classes Ürün Dogan 1 Tobias Glasmachers 2 and Christian Igel 3 1 Institut für Mathematik Universität Potsdam Germany

More information

Instance-based Learning CE-717: Machine Learning Sharif University of Technology. M. Soleymani Fall 2016

Instance-based Learning CE-717: Machine Learning Sharif University of Technology. M. Soleymani Fall 2016 Instance-based Learning CE-717: Machine Learning Sharif University of Technology M. Soleymani Fall 2016 Outline Non-parametric approach Unsupervised: Non-parametric density estimation Parzen Windows Kn-Nearest

More information

Review of Matrices and Vectors 1/45

Review of Matrices and Vectors 1/45 Reiew of Matrices and Vectors /45 /45 Definition of Vector: A collection of comple or real numbers, generally put in a column [ ] T "! Transpose + + + b a b a b b a a " " " b a b a Definition of Vector

More information

ORIGINS OF STOCHASTIC PROGRAMMING

ORIGINS OF STOCHASTIC PROGRAMMING ORIGINS OF STOCHASTIC PROGRAMMING Early 1950 s: in applications of Linear Programming unknown values of coefficients: demands, technological coefficients, yields, etc. QUOTATION Dantzig, Interfaces 20,1990

More information