CS229T/STATS231: Statistical Learning Theory. Lecturer: Tengyu Ma Lecture 11 Scribe: Jongho Kim, Jamie Kang October 29th, 2018

Size: px
Start display at page:

Download "CS229T/STATS231: Statistical Learning Theory. Lecturer: Tengyu Ma Lecture 11 Scribe: Jongho Kim, Jamie Kang October 29th, 2018"

Transcription

1 CS229T/STATS231: Statistical Learning Theory Lecturer: Tengyu Ma Lecture 11 Scribe: Jongho Kim, Jamie Kang October 29th, Overview This lecture mainly covers Recall the statistical theory of GANs from the last lecture Introduce Kontorovich-Rubinstein duality of Wasserstein distance and provide the proof 2 Review Recall the following notations and two definitions from the last lecture: z 1,..., z n i.i.d Z where Z N (0, I k k ) X = G θ (Z) : G θ is the generator R k R d x 1,..., x n i.i.d P : training examples ˆp : Uniform distribution over {x 1,..., x n } p θ : distribution of X = G θ (Z) ˆp θ : uniform distribution over {G θ (z 1 ),..., G θ (z m )} Definition 1 (Wasserstein Distance). W 1 (P, Q) = sup f L 1 E x P [f(x)] E x Q [f(x)] Here, the supremum is over functions f where f is 1-Lipschitz continuous function with respect to metric d. For simplicity, we denote this as f L 1. Recall that the function f is 1-Lipschitz continuous if f(x) f(y) d(x, y) Most of cases, we let the metric d to be l 2 distance. Definition 2 (F-Integral Probability Metric (F-IPM)). W F (P, Q) = sup f F x, y E x P [f(x)] E x Q [f(x)] where F is a family of functions (e.g., a family of neural nets {f φ }). This week (week 6), we would like to address the following question: W F (ˆp, ˆp θ ) }{{} training loss small = W 1 (p, p θ ) small?

2 3 Duality of W 1 (Kontorovich-Rubinstein duality) Theorem 1. Let P, Q be two distributions over X (assume P and Q have bounded support) and let d be a metric on X. Then, we have W 1 (P, Q) = sup f L 1 E x P [f(x)] E x Q [f(x)] = inf P E (x,y) P [d(x, y)] (1) where P is coupling of P and Q. A coupling P is a joint distribution over X X whose marginals are P and Q. (i.e., If (x, y) P then P x = P and P y = Q.) Equivalently, we can think of coupling as if we are sending mass from X to X. Suppose we have a finite set X such that X = {v 1,..., v k }. Let P and Q be the two distributions over X as following: P = (p 1,..., p k ) where i p i = 1 and p i 0 i Q = (q 1,..., q k ) where i q i = 1 and q i 0 i Let γ to denote the cost of sending a mass from P to Q. Figure 1 shows the relationship between two distributions and γ. γ ij = Prob [x = v i, y = v j ] where (x, y) P Note that k i=1 γ ij = p i and k j=1 γ ij = q i. Using this interpretation, we can think of the Wasserstein distance of P and Q as the cost of transportation. We can write it in terms of following mathematical notation: W 1 (P, Q) = inf P E (x,y) P [d(x, y)] = i,j γ ij d(v i, v j ) and we want to minimize this cost. Figure 1: Diagram representation of coupling 2

3 Last lecture, we shortly discussed why other distance measures (e.g., Total Variation (TV) distance and Kullback-Leibler(KL) divergence) are not useful in our setting compared to Wasserstein distance. Before we move on to the formal proof of Theorem 1, here we discuss and compare Wasserstein distance with TV distance and KL divergence. Definition 3 (Total Variation distance). TV(P, Q) = sup = f 1 E x P [f(x)] E x Q [f(x)] (All cases) p(x) q(x) dx where P and Q have densities p and q (Continuous case) Definition 4 (Kullback-Leibler divergence). [ KL(P, Q) = E x P log p(x) ] = P (x) log P (x) q(x) Q(x) x ( ) p(x) = p(x) log dx q(x) (Discrete case) (Continuous case) Example 1. Suppose P is uniform distribution over sphere and Q is also uniform distribution over sphere which is shifted by ɛ on only one coordinate. Let the metric d = l 2. Figure 2 shows two distributions of P and Q. Figure 2: Distributions of P and Q In this case, TV(P, Q) = 1 because we can construct the function f as following: f(x) = 1 if X support(p ) f(x) = 0 if X support(q) Note that KL(P, Q) is not well-defined in this setting because not both P and Q defined on the same probability space. On the other hand, W 1 (P, Q) ɛ. To see why this is the case, we first construct a coupling as following sampling technique: [ ] 1 (x, y) P x P, y = x + ɛ 0 3

4 Note that the marginal distribution would be P and Q. Here, we achieve W 1 (P, Q) = E (x,y) P [d(x, y)] = E (x,y) P [ x y 2 ] Example 2. We state the following claim: = ɛ since x y = ɛ Claim 1. TV(P, Q) = W 1 (P, Q) with respect to the trivial metric d(x, y) = I {x y} (Note: This claim explains why TV distance does not take into account geometry.) Proof. f is 1-Lipschitz with respect to the metric d(x, y) = I {x y}. We claim that [ ] 1 0 f(x) f(y) I {x y} sup x X f(x) inf(f) 1 To see this, first note that from f(x) f(y) I{x y}, we get that f(x) f(y) 1 if x y. Thus, if the elements obtaining the sup and inf are not the same, then the difference is bounded by 1, otherwise the difference will be 0. Furthermore, if sup x X f(x) inf(f) 1, it immediately follows that f(x) f(y) 1 x, y. This establishes the equivalence. From here, we denote E x P [f(x)] = E x P f and E x Q [f(x)] = E x Q f for simplicity. Then, W 1 (P, Q) = sup E x P f E x Q f f:sup f inf f 1 }{{} shift invariant = sup f:0 f 1 E x P f E x Q f = max f:0 f 1, 0 1 f 1 {E x P f E x Q f, E x P (1 f) E x Q (1 f)} = sup E x P f E x Q f 0 f 1 = T V (P, Q) where the second equality follows from the fact that E x P f E x Q f is shift invariant (i.e., E P [f + c] E Q [f + c] = E P f + c E Q f c = E P f E Q f), and thus we can subtract inf f. The fourth equality is again using the fact that E x P (1 f) E x Q (1 f) is shift invariant and thus we can remove 1 and have the second term become E P ( f) E Q ( f). 3.1 Proof of Kontorovich-Rubinstein duality We use LP duality in fuctional space with f as variable to prove Theorem 1. Assume X is finite: X = {v 1,..., v k } and f(v 1 ),..., f(v k ) (f 1,..., f k ) and recall that P = (p 1,..., p k ), Q = (q 1,..., q k ). As usual, we assume p i > 0, q i > 0 for all i. We construct the following linear programming problems: 4

5 Program (A): maximize f=(f 1,...,f k ) k p i f i i=1 k q i f i (OPT 1 ) i=1 subject to f(x) f(y) d(x, y) x, y X (i.e., f(v i ) f(v j ) d(v i, v j ) f i f j d ij ) Note that OPT 1 is the expanded version of the objective function: max f E P f E Q f. Program (P): minimize γ subject to γ ij d ij (OPT 2 ) i,j γ ij = p i i (dual: f i ) j γ ij = q j j (dual: g j ) i γ ij 0 i, j (dual: w ij ) Goal. We want to prove that OPT 1 = OPT 2. Lemma 1 (OPT 1 OPT 2 ). Proof. Let f, γ be optimal solutions to (A) and (P) respectively. Then, OPT 1 = i (p i q i )f i = i ( j γ ij )f i i ( j γ ji )f i (by definition of p i and q i ) = i ( j γ ij )f i j ( i γ ij )f j = i,j i,j γ ij (f i f j ) γ ij d ij = OPT 2. Next, we construct the dual of (P) using dual variables specified in parentheses in (P): Program (D): maximize p i f i q i g i (OPT 3 ) f,g i i subject to f i g j + w ij = d ij i, j i, j because w ij is non- Note that the constraint can be re-written as f i g j d ij negative by definition of dual variable. 5

6 Lemma 2 (OPT 1 OPT 3 ). Proof. The main idea here is to use the fact that d is a metric. Let f, g, γ be optimal solutions to (P) and (D). Then, by complementary slackness 1, f i g j = d ij if γ ij > 0 i, j. First, we show that g j g t d jt j, t by following j, pick i such that γ ij > 0 = f i g j = d ij (1) Also, by constraint: i, t f i g t d it (2) (1) (2) yields: g j g t d it d ij d jt where the last inequality is from triangle inequality as d is a metric. Therefore, g is a feasible solution of (A). Note that from (D) s constraint, f i g i d ii and d ii = 0 since it s a metric. Therefore, f i g i for all i. Using this, we obtain: OPT 1 i i p i g i i p i f i i q i g i (feasibility of g) q i g i (f i g i ) = OPT 3. Finally, by Lemma 1 and Lemma 2 and the fact that OPT 2 = OPT 2 3, we conclude OPT 1 = OPT 2 = OPT 3. 4 Plans for future lectures Recall that the main question we aim to address is: W F (ˆp, ˆp θ ) }{{} training loss small = W 1 (p, p θ ) small? Our plan for addressing this question is following: Plan: W F (ˆp, ˆp θ ) generalization W F (p, p θ ) approx.w 1 W F W1 (p, p θ ). Hence, in the next lecture we will show three observations: 1. If F is complex = generalization is bad 2. If F has small complexity = generalization is good 3. If F has small complexity = approximation may not be good and explore ways to overcome this dilemma. 1 If you are not familiar with this part, please refer to Chapter 5 of Convex Optimization [?] and/or EE 364A course lecture slide 17 on 2 Since OPT 3 is dual of OPT 2 and strong duality holds for linear programming, OPT 2 = OPT 3. 6

Lecture 35: December The fundamental statistical distances

Lecture 35: December The fundamental statistical distances 36-705: Intermediate Statistics Fall 207 Lecturer: Siva Balakrishnan Lecture 35: December 4 Today we will discuss distances and metrics between distributions that are useful in statistics. I will be lose

More information

Nishant Gurnani. GAN Reading Group. April 14th, / 107

Nishant Gurnani. GAN Reading Group. April 14th, / 107 Nishant Gurnani GAN Reading Group April 14th, 2017 1 / 107 Why are these Papers Important? 2 / 107 Why are these Papers Important? Recently a large number of GAN frameworks have been proposed - BGAN, LSGAN,

More information

Wasserstein GAN. Juho Lee. Jan 23, 2017

Wasserstein GAN. Juho Lee. Jan 23, 2017 Wasserstein GAN Juho Lee Jan 23, 2017 Wasserstein GAN (WGAN) Arxiv submission Martin Arjovsky, Soumith Chintala, and Léon Bottou A new GAN model minimizing the Earth-Mover s distance (Wasserstein-1 distance)

More information

ECE598: Information-theoretic methods in high-dimensional statistics Spring 2016

ECE598: Information-theoretic methods in high-dimensional statistics Spring 2016 ECE598: Information-theoretic methods in high-dimensional statistics Spring 06 Lecture : Mutual Information Method Lecturer: Yihong Wu Scribe: Jaeho Lee, Mar, 06 Ed. Mar 9 Quick review: Assouad s lemma

More information

Hands-On Learning Theory Fall 2016, Lecture 3

Hands-On Learning Theory Fall 2016, Lecture 3 Hands-On Learning Theory Fall 016, Lecture 3 Jean Honorio jhonorio@purdue.edu 1 Information Theory First, we provide some information theory background. Definition 3.1 (Entropy). The entropy of a discrete

More information

Lecture 3 January 28

Lecture 3 January 28 EECS 28B / STAT 24B: Advanced Topics in Statistical LearningSpring 2009 Lecture 3 January 28 Lecturer: Pradeep Ravikumar Scribe: Timothy J. Wheeler Note: These lecture notes are still rough, and have only

More information

Lecture 17: Density Estimation Lecturer: Yihong Wu Scribe: Jiaqi Mu, Mar 31, 2016 [Ed. Apr 1]

Lecture 17: Density Estimation Lecturer: Yihong Wu Scribe: Jiaqi Mu, Mar 31, 2016 [Ed. Apr 1] ECE598: Information-theoretic methods in high-dimensional statistics Spring 06 Lecture 7: Density Estimation Lecturer: Yihong Wu Scribe: Jiaqi Mu, Mar 3, 06 [Ed. Apr ] In last lecture, we studied the minimax

More information

Lecture #21. c T x Ax b. maximize subject to

Lecture #21. c T x Ax b. maximize subject to COMPSCI 330: Design and Analysis of Algorithms 11/11/2014 Lecture #21 Lecturer: Debmalya Panigrahi Scribe: Samuel Haney 1 Overview In this lecture, we discuss linear programming. We first show that the

More information

21.1 Lower bounds on minimax risk for functional estimation

21.1 Lower bounds on minimax risk for functional estimation ECE598: Information-theoretic methods in high-dimensional statistics Spring 016 Lecture 1: Functional estimation & testing Lecturer: Yihong Wu Scribe: Ashok Vardhan, Apr 14, 016 In this chapter, we will

More information

1 Review of The Learning Setting

1 Review of The Learning Setting COS 5: Theoretical Machine Learning Lecturer: Rob Schapire Lecture #8 Scribe: Changyan Wang February 28, 208 Review of The Learning Setting Last class, we moved beyond the PAC model: in the PAC model we

More information

Introduction to Statistical Learning Theory

Introduction to Statistical Learning Theory Introduction to Statistical Learning Theory In the last unit we looked at regularization - adding a w 2 penalty. We add a bias - we prefer classifiers with low norm. How to incorporate more complicated

More information

2.1 Optimization formulation of k-means

2.1 Optimization formulation of k-means MGMT 69000: Topics in High-dimensional Data Analysis Falll 2016 Lecture 2: k-means Clustering Lecturer: Jiaming Xu Scribe: Jiaming Xu, September 2, 2016 Outline Optimization formulation of k-means Convergence

More information

Lecture 21: Minimax Theory

Lecture 21: Minimax Theory Lecture : Minimax Theory Akshay Krishnamurthy akshay@cs.umass.edu November 8, 07 Recap In the first part of the course, we spent the majority of our time studying risk minimization. We found many ways

More information

Application of Information Theory, Lecture 7. Relative Entropy. Handout Mode. Iftach Haitner. Tel Aviv University.

Application of Information Theory, Lecture 7. Relative Entropy. Handout Mode. Iftach Haitner. Tel Aviv University. Application of Information Theory, Lecture 7 Relative Entropy Handout Mode Iftach Haitner Tel Aviv University. December 1, 2015 Iftach Haitner (TAU) Application of Information Theory, Lecture 7 December

More information

Series 7, May 22, 2018 (EM Convergence)

Series 7, May 22, 2018 (EM Convergence) Exercises Introduction to Machine Learning SS 2018 Series 7, May 22, 2018 (EM Convergence) Institute for Machine Learning Dept. of Computer Science, ETH Zürich Prof. Dr. Andreas Krause Web: https://las.inf.ethz.ch/teaching/introml-s18

More information

AdaGAN: Boosting Generative Models

AdaGAN: Boosting Generative Models AdaGAN: Boosting Generative Models Ilya Tolstikhin ilya@tuebingen.mpg.de joint work with Gelly 2, Bousquet 2, Simon-Gabriel 1, Schölkopf 1 1 MPI for Intelligent Systems 2 Google Brain Radford et al., 2015)

More information

Lecture Support Vector Machine (SVM) Classifiers

Lecture Support Vector Machine (SVM) Classifiers Introduction to Machine Learning Lecturer: Amir Globerson Lecture 6 Fall Semester Scribe: Yishay Mansour 6.1 Support Vector Machine (SVM) Classifiers Classification is one of the most important tasks in

More information

14. Duality. ˆ Upper and lower bounds. ˆ General duality. ˆ Constraint qualifications. ˆ Counterexample. ˆ Complementary slackness.

14. Duality. ˆ Upper and lower bounds. ˆ General duality. ˆ Constraint qualifications. ˆ Counterexample. ˆ Complementary slackness. CS/ECE/ISyE 524 Introduction to Optimization Spring 2016 17 14. Duality ˆ Upper and lower bounds ˆ General duality ˆ Constraint qualifications ˆ Counterexample ˆ Complementary slackness ˆ Examples ˆ Sensitivity

More information

topics about f-divergence

topics about f-divergence topics about f-divergence Presented by Liqun Chen Mar 16th, 2018 1 Outline 1 f-gan: Training Generative Neural Samplers using Variational Experiments 2 f-gans in an Information Geometric Nutshell Experiments

More information

MATH 51H Section 4. October 16, Recall what it means for a function between metric spaces to be continuous:

MATH 51H Section 4. October 16, Recall what it means for a function between metric spaces to be continuous: MATH 51H Section 4 October 16, 2015 1 Continuity Recall what it means for a function between metric spaces to be continuous: Definition. Let (X, d X ), (Y, d Y ) be metric spaces. A function f : X Y is

More information

CS Lecture 8 & 9. Lagrange Multipliers & Varitional Bounds

CS Lecture 8 & 9. Lagrange Multipliers & Varitional Bounds CS 6347 Lecture 8 & 9 Lagrange Multipliers & Varitional Bounds General Optimization subject to: min ff 0() R nn ff ii 0, h ii = 0, ii = 1,, mm ii = 1,, pp 2 General Optimization subject to: min ff 0()

More information

Optimal Transport Methods in Operations Research and Statistics

Optimal Transport Methods in Operations Research and Statistics Optimal Transport Methods in Operations Research and Statistics Jose Blanchet (based on work with F. He, Y. Kang, K. Murthy, F. Zhang). Stanford University (Management Science and Engineering), and Columbia

More information

Topic: Primal-Dual Algorithms Date: We finished our discussion of randomized rounding and began talking about LP Duality.

Topic: Primal-Dual Algorithms Date: We finished our discussion of randomized rounding and began talking about LP Duality. CS787: Advanced Algorithms Scribe: Amanda Burton, Leah Kluegel Lecturer: Shuchi Chawla Topic: Primal-Dual Algorithms Date: 10-17-07 14.1 Last Time We finished our discussion of randomized rounding and

More information

Posterior Regularization

Posterior Regularization Posterior Regularization 1 Introduction One of the key challenges in probabilistic structured learning, is the intractability of the posterior distribution, for fast inference. There are numerous methods

More information

Lecture 3: Lower Bounds for Bandit Algorithms

Lecture 3: Lower Bounds for Bandit Algorithms CMSC 858G: Bandits, Experts and Games 09/19/16 Lecture 3: Lower Bounds for Bandit Algorithms Instructor: Alex Slivkins Scribed by: Soham De & Karthik A Sankararaman 1 Lower Bounds In this lecture (and

More information

MGMT 69000: Topics in High-dimensional Data Analysis Falll 2016

MGMT 69000: Topics in High-dimensional Data Analysis Falll 2016 MGMT 69000: Topics in High-dimensional Data Analysis Falll 2016 Lecture 14: Information Theoretic Methods Lecturer: Jiaming Xu Scribe: Hilda Ibriga, Adarsh Barik, December 02, 2016 Outline f-divergence

More information

Math 140A - Fall Final Exam

Math 140A - Fall Final Exam Math 140A - Fall 2014 - Final Exam Problem 1. Let {a n } n 1 be an increasing sequence of real numbers. (i) If {a n } has a bounded subsequence, show that {a n } is itself bounded. (ii) If {a n } has a

More information

21.2 Example 1 : Non-parametric regression in Mean Integrated Square Error Density Estimation (L 2 2 risk)

21.2 Example 1 : Non-parametric regression in Mean Integrated Square Error Density Estimation (L 2 2 risk) 10-704: Information Processing and Learning Spring 2015 Lecture 21: Examples of Lower Bounds and Assouad s Method Lecturer: Akshay Krishnamurthy Scribes: Soumya Batra Note: LaTeX template courtesy of UC

More information

Lagrangian Duality and Convex Optimization

Lagrangian Duality and Convex Optimization Lagrangian Duality and Convex Optimization David Rosenberg New York University February 11, 2015 David Rosenberg (New York University) DS-GA 1003 February 11, 2015 1 / 24 Introduction Why Convex Optimization?

More information

Lecture 6: September 19

Lecture 6: September 19 36-755: Advanced Statistical Theory I Fall 2016 Lecture 6: September 19 Lecturer: Alessandro Rinaldo Scribe: YJ Choe Note: LaTeX template courtesy of UC Berkeley EECS dept. Disclaimer: These notes have

More information

1 Review and Overview

1 Review and Overview DRAFT a final version will be posted shortly CS229T/STATS231: Statistical Learning Theory Lecturer: Tengyu Ma Lecture # 16 Scribe: Chris Cundy, Ananya Kumar November 14, 2018 1 Review and Overview Last

More information

Introduction to Mathematical Programming IE406. Lecture 10. Dr. Ted Ralphs

Introduction to Mathematical Programming IE406. Lecture 10. Dr. Ted Ralphs Introduction to Mathematical Programming IE406 Lecture 10 Dr. Ted Ralphs IE406 Lecture 10 1 Reading for This Lecture Bertsimas 4.1-4.3 IE406 Lecture 10 2 Duality Theory: Motivation Consider the following

More information

8. Geometric problems

8. Geometric problems 8. Geometric problems Convex Optimization Boyd & Vandenberghe extremal volume ellipsoids centering classification placement and facility location 8 1 Minimum volume ellipsoid around a set Löwner-John ellipsoid

More information

9. Geometric problems

9. Geometric problems 9. Geometric problems EE/AA 578, Univ of Washington, Fall 2016 projection on a set extremal volume ellipsoids centering classification 9 1 Projection on convex set projection of point x on set C defined

More information

LECTURE 10 LECTURE OUTLINE

LECTURE 10 LECTURE OUTLINE LECTURE 10 LECTURE OUTLINE Min Common/Max Crossing Th. III Nonlinear Farkas Lemma/Linear Constraints Linear Programming Duality Convex Programming Duality Optimality Conditions Reading: Sections 4.5, 5.1,5.2,

More information

h(x) lim H(x) = lim Since h is nondecreasing then h(x) 0 for all x, and if h is discontinuous at a point x then H(x) > 0. Denote

h(x) lim H(x) = lim Since h is nondecreasing then h(x) 0 for all x, and if h is discontinuous at a point x then H(x) > 0. Denote Real Variables, Fall 4 Problem set 4 Solution suggestions Exercise. Let f be of bounded variation on [a, b]. Show that for each c (a, b), lim x c f(x) and lim x c f(x) exist. Prove that a monotone function

More information

Lecture 5 Channel Coding over Continuous Channels

Lecture 5 Channel Coding over Continuous Channels Lecture 5 Channel Coding over Continuous Channels I-Hsiang Wang Department of Electrical Engineering National Taiwan University ihwang@ntu.edu.tw November 14, 2014 1 / 34 I-Hsiang Wang NIT Lecture 5 From

More information

8. Geometric problems

8. Geometric problems 8. Geometric problems Convex Optimization Boyd & Vandenberghe extremal volume ellipsoids centering classification placement and facility location 8 Minimum volume ellipsoid around a set Löwner-John ellipsoid

More information

Theory of Probability Fall 2008

Theory of Probability Fall 2008 MIT OpenCourseWare http://ocw.mit.edu 8.75 Theory of Probability Fall 008 For information about citing these materials or our Terms of Use, visit: http://ocw.mit.edu/terms. Section Prekopa-Leindler inequality,

More information

Linear Programming Duality

Linear Programming Duality Summer 2011 Optimization I Lecture 8 1 Duality recap Linear Programming Duality We motivated the dual of a linear program by thinking about the best possible lower bound on the optimal value we can achieve

More information

1 Topology Definition of a topology Basis (Base) of a topology The subspace topology & the product topology on X Y 3

1 Topology Definition of a topology Basis (Base) of a topology The subspace topology & the product topology on X Y 3 Index Page 1 Topology 2 1.1 Definition of a topology 2 1.2 Basis (Base) of a topology 2 1.3 The subspace topology & the product topology on X Y 3 1.4 Basic topology concepts: limit points, closed sets,

More information

CSCI5654 (Linear Programming, Fall 2013) Lecture-8. Lecture 8 Slide# 1

CSCI5654 (Linear Programming, Fall 2013) Lecture-8. Lecture 8 Slide# 1 CSCI5654 (Linear Programming, Fall 2013) Lecture-8 Lecture 8 Slide# 1 Today s Lecture 1. Recap of dual variables and strong duality. 2. Complementary Slackness Theorem. 3. Interpretation of dual variables.

More information

Information theoretic perspectives on learning algorithms

Information theoretic perspectives on learning algorithms Information theoretic perspectives on learning algorithms Varun Jog University of Wisconsin - Madison Departments of ECE and Mathematics Shannon Channel Hangout! May 8, 2018 Jointly with Adrian Tovar-Lopez

More information

Policy Gradient. U(θ) = E[ R(s t,a t );π θ ] = E[R(τ);π θ ] (1) 1 + e θ φ(s t) E[R(τ);π θ ] (3) = max. θ P(τ;θ)R(τ) (6) P(τ;θ) θ log P(τ;θ)R(τ) (9)

Policy Gradient. U(θ) = E[ R(s t,a t );π θ ] = E[R(τ);π θ ] (1) 1 + e θ φ(s t) E[R(τ);π θ ] (3) = max. θ P(τ;θ)R(τ) (6) P(τ;θ) θ log P(τ;θ)R(τ) (9) CS294-40 Learning for Robotics and Control Lecture 16-10/20/2008 Lecturer: Pieter Abbeel Policy Gradient Scribe: Jan Biermeyer 1 Recap Recall: H U() = E[ R(s t,a ;π ] = E[R();π ] (1) Here is a sample path

More information

Definition 6.1. A metric space (X, d) is complete if every Cauchy sequence tends to a limit in X.

Definition 6.1. A metric space (X, d) is complete if every Cauchy sequence tends to a limit in X. Chapter 6 Completeness Lecture 18 Recall from Definition 2.22 that a Cauchy sequence in (X, d) is a sequence whose terms get closer and closer together, without any limit being specified. In the Euclidean

More information

Lecture 1: Entropy, convexity, and matrix scaling CSE 599S: Entropy optimality, Winter 2016 Instructor: James R. Lee Last updated: January 24, 2016

Lecture 1: Entropy, convexity, and matrix scaling CSE 599S: Entropy optimality, Winter 2016 Instructor: James R. Lee Last updated: January 24, 2016 Lecture 1: Entropy, convexity, and matrix scaling CSE 599S: Entropy optimality, Winter 2016 Instructor: James R. Lee Last updated: January 24, 2016 1 Entropy Since this course is about entropy maximization,

More information

AN ELEMENTARY PROOF OF THE TRIANGLE INEQUALITY FOR THE WASSERSTEIN METRIC

AN ELEMENTARY PROOF OF THE TRIANGLE INEQUALITY FOR THE WASSERSTEIN METRIC PROCEEDINGS OF THE AMERICAN MATHEMATICAL SOCIETY Volume 136, Number 1, January 2008, Pages 333 339 S 0002-9939(07)09020- Article electronically published on September 27, 2007 AN ELEMENTARY PROOF OF THE

More information

3. If a choice is broken down into two successive choices, the original H should be the weighted sum of the individual values of H.

3. If a choice is broken down into two successive choices, the original H should be the weighted sum of the individual values of H. Appendix A Information Theory A.1 Entropy Shannon (Shanon, 1948) developed the concept of entropy to measure the uncertainty of a discrete random variable. Suppose X is a discrete random variable that

More information

EC 521 MATHEMATICAL METHODS FOR ECONOMICS. Lecture 1: Preliminaries

EC 521 MATHEMATICAL METHODS FOR ECONOMICS. Lecture 1: Preliminaries EC 521 MATHEMATICAL METHODS FOR ECONOMICS Lecture 1: Preliminaries Murat YILMAZ Boğaziçi University In this lecture we provide some basic facts from both Linear Algebra and Real Analysis, which are going

More information

Duality of LPs and Applications

Duality of LPs and Applications Lecture 6 Duality of LPs and Applications Last lecture we introduced duality of linear programs. We saw how to form duals, and proved both the weak and strong duality theorems. In this lecture we will

More information

THE INVERSE FUNCTION THEOREM

THE INVERSE FUNCTION THEOREM THE INVERSE FUNCTION THEOREM W. PATRICK HOOPER The implicit function theorem is the following result: Theorem 1. Let f be a C 1 function from a neighborhood of a point a R n into R n. Suppose A = Df(a)

More information

Lecture Notes on Metric Spaces

Lecture Notes on Metric Spaces Lecture Notes on Metric Spaces Math 117: Summer 2007 John Douglas Moore Our goal of these notes is to explain a few facts regarding metric spaces not included in the first few chapters of the text [1],

More information

Lecture 2: August 31

Lecture 2: August 31 0-704: Information Processing and Learning Fall 206 Lecturer: Aarti Singh Lecture 2: August 3 Note: These notes are based on scribed notes from Spring5 offering of this course. LaTeX template courtesy

More information

COMPSCI 650 Applied Information Theory Jan 21, Lecture 2

COMPSCI 650 Applied Information Theory Jan 21, Lecture 2 COMPSCI 650 Applied Information Theory Jan 21, 2016 Lecture 2 Instructor: Arya Mazumdar Scribe: Gayane Vardoyan, Jong-Chyi Su 1 Entropy Definition: Entropy is a measure of uncertainty of a random variable.

More information

Lecture 6: Conic Optimization September 8

Lecture 6: Conic Optimization September 8 IE 598: Big Data Optimization Fall 2016 Lecture 6: Conic Optimization September 8 Lecturer: Niao He Scriber: Juan Xu Overview In this lecture, we finish up our previous discussion on optimality conditions

More information

d(x n, x) d(x n, x nk ) + d(x nk, x) where we chose any fixed k > N

d(x n, x) d(x n, x nk ) + d(x nk, x) where we chose any fixed k > N Problem 1. Let f : A R R have the property that for every x A, there exists ɛ > 0 such that f(t) > ɛ if t (x ɛ, x + ɛ) A. If the set A is compact, prove there exists c > 0 such that f(x) > c for all x

More information

Constrained Optimization and Lagrangian Duality

Constrained Optimization and Lagrangian Duality CIS 520: Machine Learning Oct 02, 2017 Constrained Optimization and Lagrangian Duality Lecturer: Shivani Agarwal Disclaimer: These notes are designed to be a supplement to the lecture. They may or may

More information

Midterm 1. Every element of the set of functions is continuous

Midterm 1. Every element of the set of functions is continuous Econ 200 Mathematics for Economists Midterm Question.- Consider the set of functions F C(0, ) dened by { } F = f C(0, ) f(x) = ax b, a A R and b B R That is, F is a subset of the set of continuous functions

More information

6.1 Variational representation of f-divergences

6.1 Variational representation of f-divergences ECE598: Information-theoretic methods in high-dimensional statistics Spring 2016 Lecture 6: Variational representation, HCR and CR lower bounds Lecturer: Yihong Wu Scribe: Georgios Rovatsos, Feb 11, 2016

More information

Stability results for Logarithmic Sobolev inequality

Stability results for Logarithmic Sobolev inequality Stability results for Logarithmic Sobolev inequality Daesung Kim (joint work with Emanuel Indrei) Department of Mathematics Purdue University September 20, 2017 Daesung Kim (Purdue) Stability for LSI Probability

More information

The fundamental theorem of linear programming

The fundamental theorem of linear programming The fundamental theorem of linear programming Michael Tehranchi June 8, 2017 This note supplements the lecture notes of Optimisation The statement of the fundamental theorem of linear programming and the

More information

Convex Optimization and Support Vector Machine

Convex Optimization and Support Vector Machine Convex Optimization and Support Vector Machine Problem 0. Consider a two-class classification problem. The training data is L n = {(x 1, t 1 ),..., (x n, t n )}, where each t i { 1, 1} and x i R p. We

More information

Example Problem. Linear Program (standard form) CSCI5654 (Linear Programming, Fall 2013) Lecture-7. Duality

Example Problem. Linear Program (standard form) CSCI5654 (Linear Programming, Fall 2013) Lecture-7. Duality CSCI5654 (Linear Programming, Fall 013) Lecture-7 Duality Lecture 7 Slide# 1 Lecture 7 Slide# Linear Program (standard form) Example Problem maximize c 1 x 1 + + c n x n s.t. a j1 x 1 + + a jn x n b j

More information

Measure and Integration: Solutions of CW2

Measure and Integration: Solutions of CW2 Measure and Integration: s of CW2 Fall 206 [G. Holzegel] December 9, 206 Problem of Sheet 5 a) Left (f n ) and (g n ) be sequences of integrable functions with f n (x) f (x) and g n (x) g (x) for almost

More information

Machine Learning. Support Vector Machines. Fabio Vandin November 20, 2017

Machine Learning. Support Vector Machines. Fabio Vandin November 20, 2017 Machine Learning Support Vector Machines Fabio Vandin November 20, 2017 1 Classification and Margin Consider a classification problem with two classes: instance set X = R d label set Y = { 1, 1}. Training

More information

The Moment Method; Convex Duality; and Large/Medium/Small Deviations

The Moment Method; Convex Duality; and Large/Medium/Small Deviations Stat 928: Statistical Learning Theory Lecture: 5 The Moment Method; Convex Duality; and Large/Medium/Small Deviations Instructor: Sham Kakade The Exponential Inequality and Convex Duality The exponential

More information

Math 127C, Spring 2006 Final Exam Solutions. x 2 ), g(y 1, y 2 ) = ( y 1 y 2, y1 2 + y2) 2. (g f) (0) = g (f(0))f (0).

Math 127C, Spring 2006 Final Exam Solutions. x 2 ), g(y 1, y 2 ) = ( y 1 y 2, y1 2 + y2) 2. (g f) (0) = g (f(0))f (0). Math 27C, Spring 26 Final Exam Solutions. Define f : R 2 R 2 and g : R 2 R 2 by f(x, x 2 (sin x 2 x, e x x 2, g(y, y 2 ( y y 2, y 2 + y2 2. Use the chain rule to compute the matrix of (g f (,. By the chain

More information

Bayesian Machine Learning - Lecture 7

Bayesian Machine Learning - Lecture 7 Bayesian Machine Learning - Lecture 7 Guido Sanguinetti Institute for Adaptive and Neural Computation School of Informatics University of Edinburgh gsanguin@inf.ed.ac.uk March 4, 2015 Today s lecture 1

More information

Generative Adversarial Networks

Generative Adversarial Networks Generative Adversarial Networks Stefano Ermon, Aditya Grover Stanford University Lecture 10 Stefano Ermon, Aditya Grover (AI Lab) Deep Generative Models Lecture 10 1 / 17 Selected GANs https://github.com/hindupuravinash/the-gan-zoo

More information

Optimal Transport in Risk Analysis

Optimal Transport in Risk Analysis Optimal Transport in Risk Analysis Jose Blanchet (based on work with Y. Kang and K. Murthy) Stanford University (Management Science and Engineering), and Columbia University (Department of Statistics and

More information

Optimality, Duality, Complementarity for Constrained Optimization

Optimality, Duality, Complementarity for Constrained Optimization Optimality, Duality, Complementarity for Constrained Optimization Stephen Wright University of Wisconsin-Madison May 2014 Wright (UW-Madison) Optimality, Duality, Complementarity May 2014 1 / 41 Linear

More information

a) Let x,y be Cauchy sequences in some metric space. Define d(x, y) = lim n d (x n, y n ). Show that this function is well-defined.

a) Let x,y be Cauchy sequences in some metric space. Define d(x, y) = lim n d (x n, y n ). Show that this function is well-defined. Problem 3) Remark: for this problem, if I write the notation lim x n, it should always be assumed that I mean lim n x n, and similarly if I write the notation lim x nk it should always be assumed that

More information

Consistency of the maximum likelihood estimator for general hidden Markov models

Consistency of the maximum likelihood estimator for general hidden Markov models Consistency of the maximum likelihood estimator for general hidden Markov models Jimmy Olsson Centre for Mathematical Sciences Lund University Nordstat 2012 Umeå, Sweden Collaborators Hidden Markov models

More information

Math 328 Course Notes

Math 328 Course Notes Math 328 Course Notes Ian Robertson March 3, 2006 3 Properties of C[0, 1]: Sup-norm and Completeness In this chapter we are going to examine the vector space of all continuous functions defined on the

More information

Solving Dual Problems

Solving Dual Problems Lecture 20 Solving Dual Problems We consider a constrained problem where, in addition to the constraint set X, there are also inequality and linear equality constraints. Specifically the minimization problem

More information

Techinical Proofs for Nonlinear Learning using Local Coordinate Coding

Techinical Proofs for Nonlinear Learning using Local Coordinate Coding Techinical Proofs for Nonlinear Learning using Local Coordinate Coding 1 Notations and Main Results Denition 1.1 (Lipschitz Smoothness) A function f(x) on R d is (α, β, p)-lipschitz smooth with respect

More information

The extreme points of symmetric norms on R^2

The extreme points of symmetric norms on R^2 Graduate Theses and Dissertations Iowa State University Capstones, Theses and Dissertations 2008 The extreme points of symmetric norms on R^2 Anchalee Khemphet Iowa State University Follow this and additional

More information

Basic Properties of Metric and Normed Spaces

Basic Properties of Metric and Normed Spaces Basic Properties of Metric and Normed Spaces Computational and Metric Geometry Instructor: Yury Makarychev The second part of this course is about metric geometry. We will study metric spaces, low distortion

More information

Linear Programming: Chapter 5 Duality

Linear Programming: Chapter 5 Duality Linear Programming: Chapter 5 Duality Robert J. Vanderbei September 30, 2010 Slides last edited on October 5, 2010 Operations Research and Financial Engineering Princeton University Princeton, NJ 08544

More information

Variational Inference (11/04/13)

Variational Inference (11/04/13) STA561: Probabilistic machine learning Variational Inference (11/04/13) Lecturer: Barbara Engelhardt Scribes: Matt Dickenson, Alireza Samany, Tracy Schifeling 1 Introduction In this lecture we will further

More information

Lecture 1: Introduction, Entropy and ML estimation

Lecture 1: Introduction, Entropy and ML estimation 0-704: Information Processing and Learning Spring 202 Lecture : Introduction, Entropy and ML estimation Lecturer: Aarti Singh Scribes: Min Xu Disclaimer: These notes have not been subjected to the usual

More information

16.1 Bounding Capacity with Covering Number

16.1 Bounding Capacity with Covering Number ECE598: Information-theoretic methods in high-dimensional statistics Spring 206 Lecture 6: Upper Bounds for Density Estimation Lecturer: Yihong Wu Scribe: Yang Zhang, Apr, 206 So far we have been mostly

More information

Information geometry for bivariate distribution control

Information geometry for bivariate distribution control Information geometry for bivariate distribution control C.T.J.Dodson + Hong Wang Mathematics + Control Systems Centre, University of Manchester Institute of Science and Technology Optimal control of stochastic

More information

Note 3: LP Duality. If the primal problem (P) in the canonical form is min Z = n (1) then the dual problem (D) in the canonical form is max W = m (2)

Note 3: LP Duality. If the primal problem (P) in the canonical form is min Z = n (1) then the dual problem (D) in the canonical form is max W = m (2) Note 3: LP Duality If the primal problem (P) in the canonical form is min Z = n j=1 c j x j s.t. nj=1 a ij x j b i i = 1, 2,..., m (1) x j 0 j = 1, 2,..., n, then the dual problem (D) in the canonical

More information

Convex Optimization Boyd & Vandenberghe. 5. Duality

Convex Optimization Boyd & Vandenberghe. 5. Duality 5. Duality Convex Optimization Boyd & Vandenberghe Lagrange dual problem weak and strong duality geometric interpretation optimality conditions perturbation and sensitivity analysis examples generalized

More information

Convex Optimization & Lagrange Duality

Convex Optimization & Lagrange Duality Convex Optimization & Lagrange Duality Chee Wei Tan CS 8292 : Advanced Topics in Convex Optimization and its Applications Fall 2010 Outline Convex optimization Optimality condition Lagrange duality KKT

More information

Econ Slides from Lecture 1

Econ Slides from Lecture 1 Econ 205 Sobel Econ 205 - Slides from Lecture 1 Joel Sobel August 23, 2010 Warning I can t start without assuming that something is common knowledge. You can find basic definitions of Sets and Set Operations

More information

A : k n. Usually k > n otherwise easily the minimum is zero. Analytical solution:

A : k n. Usually k > n otherwise easily the minimum is zero. Analytical solution: 1-5: Least-squares I A : k n. Usually k > n otherwise easily the minimum is zero. Analytical solution: f (x) =(Ax b) T (Ax b) =x T A T Ax 2b T Ax + b T b f (x) = 2A T Ax 2A T b = 0 Chih-Jen Lin (National

More information

1 Review and Overview

1 Review and Overview CS229T/STATS231: Statistical Learig Theory Lecturer: Tegyu Ma Lecture #12 Scribe: Garrett Thomas, Pega Liu October 31, 2018 1 Review a Overview Recall the GAN setup: we have iepeet samples x 1,..., x raw

More information

June 21, Peking University. Dual Connections. Zhengchao Wan. Overview. Duality of connections. Divergence: general contrast functions

June 21, Peking University. Dual Connections. Zhengchao Wan. Overview. Duality of connections. Divergence: general contrast functions Dual Peking University June 21, 2016 Divergences: Riemannian connection Let M be a manifold on which there is given a Riemannian metric g =,. A connection satisfying Z X, Y = Z X, Y + X, Z Y (1) for all

More information

Quantitative Biology II Lecture 4: Variational Methods

Quantitative Biology II Lecture 4: Variational Methods 10 th March 2015 Quantitative Biology II Lecture 4: Variational Methods Gurinder Singh Mickey Atwal Center for Quantitative Biology Cold Spring Harbor Laboratory Image credit: Mike West Summary Approximate

More information

EE364a Review Session 5

EE364a Review Session 5 EE364a Review Session 5 EE364a Review announcements: homeworks 1 and 2 graded homework 4 solutions (check solution to additional problem 1) scpd phone-in office hours: tuesdays 6-7pm (650-723-1156) 1 Complementary

More information

CS-E4830 Kernel Methods in Machine Learning

CS-E4830 Kernel Methods in Machine Learning CS-E4830 Kernel Methods in Machine Learning Lecture 3: Convex optimization and duality Juho Rousu 27. September, 2017 Juho Rousu 27. September, 2017 1 / 45 Convex optimization Convex optimisation This

More information

Convex Optimization and SVM

Convex Optimization and SVM Convex Optimization and SVM Problem 0. Cf lecture notes pages 12 to 18. Problem 1. (i) A slab is an intersection of two half spaces, hence convex. (ii) A wedge is an intersection of two half spaces, hence

More information

Distance-Divergence Inequalities

Distance-Divergence Inequalities Distance-Divergence Inequalities Katalin Marton Alfréd Rényi Institute of Mathematics of the Hungarian Academy of Sciences Motivation To find a simple proof of the Blowing-up Lemma, proved by Ahlswede,

More information

Applications of Linear Programming

Applications of Linear Programming Applications of Linear Programming lecturer: András London University of Szeged Institute of Informatics Department of Computational Optimization Lecture 9 Non-linear programming In case of LP, the goal

More information

converges as well if x < 1. 1 x n x n 1 1 = 2 a nx n

converges as well if x < 1. 1 x n x n 1 1 = 2 a nx n Solve the following 6 problems. 1. Prove that if series n=1 a nx n converges for all x such that x < 1, then the series n=1 a n xn 1 x converges as well if x < 1. n For x < 1, x n 0 as n, so there exists

More information

Linear and Combinatorial Optimization

Linear and Combinatorial Optimization Linear and Combinatorial Optimization The dual of an LP-problem. Connections between primal and dual. Duality theorems and complementary slack. Philipp Birken (Ctr. for the Math. Sc.) Lecture 3: Duality

More information

LINEAR PROGRAMMING II

LINEAR PROGRAMMING II LINEAR PROGRAMMING II LP duality strong duality theorem bonus proof of LP duality applications Lecture slides by Kevin Wayne Last updated on 7/25/17 11:09 AM LINEAR PROGRAMMING II LP duality Strong duality

More information

Exponentiated Gradient Descent

Exponentiated Gradient Descent CSE599s, Spring 01, Online Learning Lecture 10-04/6/01 Lecturer: Ofer Dekel Exponentiated Gradient Descent Scribe: Albert Yu 1 Introduction In this lecture we review norms, dual norms, strong convexity,

More information