CS229T/STATS231: Statistical Learning Theory. Lecturer: Tengyu Ma Lecture 11 Scribe: Jongho Kim, Jamie Kang October 29th, 2018
|
|
- Magdalen Leonard
- 5 years ago
- Views:
Transcription
1 CS229T/STATS231: Statistical Learning Theory Lecturer: Tengyu Ma Lecture 11 Scribe: Jongho Kim, Jamie Kang October 29th, Overview This lecture mainly covers Recall the statistical theory of GANs from the last lecture Introduce Kontorovich-Rubinstein duality of Wasserstein distance and provide the proof 2 Review Recall the following notations and two definitions from the last lecture: z 1,..., z n i.i.d Z where Z N (0, I k k ) X = G θ (Z) : G θ is the generator R k R d x 1,..., x n i.i.d P : training examples ˆp : Uniform distribution over {x 1,..., x n } p θ : distribution of X = G θ (Z) ˆp θ : uniform distribution over {G θ (z 1 ),..., G θ (z m )} Definition 1 (Wasserstein Distance). W 1 (P, Q) = sup f L 1 E x P [f(x)] E x Q [f(x)] Here, the supremum is over functions f where f is 1-Lipschitz continuous function with respect to metric d. For simplicity, we denote this as f L 1. Recall that the function f is 1-Lipschitz continuous if f(x) f(y) d(x, y) Most of cases, we let the metric d to be l 2 distance. Definition 2 (F-Integral Probability Metric (F-IPM)). W F (P, Q) = sup f F x, y E x P [f(x)] E x Q [f(x)] where F is a family of functions (e.g., a family of neural nets {f φ }). This week (week 6), we would like to address the following question: W F (ˆp, ˆp θ ) }{{} training loss small = W 1 (p, p θ ) small?
2 3 Duality of W 1 (Kontorovich-Rubinstein duality) Theorem 1. Let P, Q be two distributions over X (assume P and Q have bounded support) and let d be a metric on X. Then, we have W 1 (P, Q) = sup f L 1 E x P [f(x)] E x Q [f(x)] = inf P E (x,y) P [d(x, y)] (1) where P is coupling of P and Q. A coupling P is a joint distribution over X X whose marginals are P and Q. (i.e., If (x, y) P then P x = P and P y = Q.) Equivalently, we can think of coupling as if we are sending mass from X to X. Suppose we have a finite set X such that X = {v 1,..., v k }. Let P and Q be the two distributions over X as following: P = (p 1,..., p k ) where i p i = 1 and p i 0 i Q = (q 1,..., q k ) where i q i = 1 and q i 0 i Let γ to denote the cost of sending a mass from P to Q. Figure 1 shows the relationship between two distributions and γ. γ ij = Prob [x = v i, y = v j ] where (x, y) P Note that k i=1 γ ij = p i and k j=1 γ ij = q i. Using this interpretation, we can think of the Wasserstein distance of P and Q as the cost of transportation. We can write it in terms of following mathematical notation: W 1 (P, Q) = inf P E (x,y) P [d(x, y)] = i,j γ ij d(v i, v j ) and we want to minimize this cost. Figure 1: Diagram representation of coupling 2
3 Last lecture, we shortly discussed why other distance measures (e.g., Total Variation (TV) distance and Kullback-Leibler(KL) divergence) are not useful in our setting compared to Wasserstein distance. Before we move on to the formal proof of Theorem 1, here we discuss and compare Wasserstein distance with TV distance and KL divergence. Definition 3 (Total Variation distance). TV(P, Q) = sup = f 1 E x P [f(x)] E x Q [f(x)] (All cases) p(x) q(x) dx where P and Q have densities p and q (Continuous case) Definition 4 (Kullback-Leibler divergence). [ KL(P, Q) = E x P log p(x) ] = P (x) log P (x) q(x) Q(x) x ( ) p(x) = p(x) log dx q(x) (Discrete case) (Continuous case) Example 1. Suppose P is uniform distribution over sphere and Q is also uniform distribution over sphere which is shifted by ɛ on only one coordinate. Let the metric d = l 2. Figure 2 shows two distributions of P and Q. Figure 2: Distributions of P and Q In this case, TV(P, Q) = 1 because we can construct the function f as following: f(x) = 1 if X support(p ) f(x) = 0 if X support(q) Note that KL(P, Q) is not well-defined in this setting because not both P and Q defined on the same probability space. On the other hand, W 1 (P, Q) ɛ. To see why this is the case, we first construct a coupling as following sampling technique: [ ] 1 (x, y) P x P, y = x + ɛ 0 3
4 Note that the marginal distribution would be P and Q. Here, we achieve W 1 (P, Q) = E (x,y) P [d(x, y)] = E (x,y) P [ x y 2 ] Example 2. We state the following claim: = ɛ since x y = ɛ Claim 1. TV(P, Q) = W 1 (P, Q) with respect to the trivial metric d(x, y) = I {x y} (Note: This claim explains why TV distance does not take into account geometry.) Proof. f is 1-Lipschitz with respect to the metric d(x, y) = I {x y}. We claim that [ ] 1 0 f(x) f(y) I {x y} sup x X f(x) inf(f) 1 To see this, first note that from f(x) f(y) I{x y}, we get that f(x) f(y) 1 if x y. Thus, if the elements obtaining the sup and inf are not the same, then the difference is bounded by 1, otherwise the difference will be 0. Furthermore, if sup x X f(x) inf(f) 1, it immediately follows that f(x) f(y) 1 x, y. This establishes the equivalence. From here, we denote E x P [f(x)] = E x P f and E x Q [f(x)] = E x Q f for simplicity. Then, W 1 (P, Q) = sup E x P f E x Q f f:sup f inf f 1 }{{} shift invariant = sup f:0 f 1 E x P f E x Q f = max f:0 f 1, 0 1 f 1 {E x P f E x Q f, E x P (1 f) E x Q (1 f)} = sup E x P f E x Q f 0 f 1 = T V (P, Q) where the second equality follows from the fact that E x P f E x Q f is shift invariant (i.e., E P [f + c] E Q [f + c] = E P f + c E Q f c = E P f E Q f), and thus we can subtract inf f. The fourth equality is again using the fact that E x P (1 f) E x Q (1 f) is shift invariant and thus we can remove 1 and have the second term become E P ( f) E Q ( f). 3.1 Proof of Kontorovich-Rubinstein duality We use LP duality in fuctional space with f as variable to prove Theorem 1. Assume X is finite: X = {v 1,..., v k } and f(v 1 ),..., f(v k ) (f 1,..., f k ) and recall that P = (p 1,..., p k ), Q = (q 1,..., q k ). As usual, we assume p i > 0, q i > 0 for all i. We construct the following linear programming problems: 4
5 Program (A): maximize f=(f 1,...,f k ) k p i f i i=1 k q i f i (OPT 1 ) i=1 subject to f(x) f(y) d(x, y) x, y X (i.e., f(v i ) f(v j ) d(v i, v j ) f i f j d ij ) Note that OPT 1 is the expanded version of the objective function: max f E P f E Q f. Program (P): minimize γ subject to γ ij d ij (OPT 2 ) i,j γ ij = p i i (dual: f i ) j γ ij = q j j (dual: g j ) i γ ij 0 i, j (dual: w ij ) Goal. We want to prove that OPT 1 = OPT 2. Lemma 1 (OPT 1 OPT 2 ). Proof. Let f, γ be optimal solutions to (A) and (P) respectively. Then, OPT 1 = i (p i q i )f i = i ( j γ ij )f i i ( j γ ji )f i (by definition of p i and q i ) = i ( j γ ij )f i j ( i γ ij )f j = i,j i,j γ ij (f i f j ) γ ij d ij = OPT 2. Next, we construct the dual of (P) using dual variables specified in parentheses in (P): Program (D): maximize p i f i q i g i (OPT 3 ) f,g i i subject to f i g j + w ij = d ij i, j i, j because w ij is non- Note that the constraint can be re-written as f i g j d ij negative by definition of dual variable. 5
6 Lemma 2 (OPT 1 OPT 3 ). Proof. The main idea here is to use the fact that d is a metric. Let f, g, γ be optimal solutions to (P) and (D). Then, by complementary slackness 1, f i g j = d ij if γ ij > 0 i, j. First, we show that g j g t d jt j, t by following j, pick i such that γ ij > 0 = f i g j = d ij (1) Also, by constraint: i, t f i g t d it (2) (1) (2) yields: g j g t d it d ij d jt where the last inequality is from triangle inequality as d is a metric. Therefore, g is a feasible solution of (A). Note that from (D) s constraint, f i g i d ii and d ii = 0 since it s a metric. Therefore, f i g i for all i. Using this, we obtain: OPT 1 i i p i g i i p i f i i q i g i (feasibility of g) q i g i (f i g i ) = OPT 3. Finally, by Lemma 1 and Lemma 2 and the fact that OPT 2 = OPT 2 3, we conclude OPT 1 = OPT 2 = OPT 3. 4 Plans for future lectures Recall that the main question we aim to address is: W F (ˆp, ˆp θ ) }{{} training loss small = W 1 (p, p θ ) small? Our plan for addressing this question is following: Plan: W F (ˆp, ˆp θ ) generalization W F (p, p θ ) approx.w 1 W F W1 (p, p θ ). Hence, in the next lecture we will show three observations: 1. If F is complex = generalization is bad 2. If F has small complexity = generalization is good 3. If F has small complexity = approximation may not be good and explore ways to overcome this dilemma. 1 If you are not familiar with this part, please refer to Chapter 5 of Convex Optimization [?] and/or EE 364A course lecture slide 17 on 2 Since OPT 3 is dual of OPT 2 and strong duality holds for linear programming, OPT 2 = OPT 3. 6
Lecture 35: December The fundamental statistical distances
36-705: Intermediate Statistics Fall 207 Lecturer: Siva Balakrishnan Lecture 35: December 4 Today we will discuss distances and metrics between distributions that are useful in statistics. I will be lose
More informationNishant Gurnani. GAN Reading Group. April 14th, / 107
Nishant Gurnani GAN Reading Group April 14th, 2017 1 / 107 Why are these Papers Important? 2 / 107 Why are these Papers Important? Recently a large number of GAN frameworks have been proposed - BGAN, LSGAN,
More informationWasserstein GAN. Juho Lee. Jan 23, 2017
Wasserstein GAN Juho Lee Jan 23, 2017 Wasserstein GAN (WGAN) Arxiv submission Martin Arjovsky, Soumith Chintala, and Léon Bottou A new GAN model minimizing the Earth-Mover s distance (Wasserstein-1 distance)
More informationECE598: Information-theoretic methods in high-dimensional statistics Spring 2016
ECE598: Information-theoretic methods in high-dimensional statistics Spring 06 Lecture : Mutual Information Method Lecturer: Yihong Wu Scribe: Jaeho Lee, Mar, 06 Ed. Mar 9 Quick review: Assouad s lemma
More informationHands-On Learning Theory Fall 2016, Lecture 3
Hands-On Learning Theory Fall 016, Lecture 3 Jean Honorio jhonorio@purdue.edu 1 Information Theory First, we provide some information theory background. Definition 3.1 (Entropy). The entropy of a discrete
More informationLecture 3 January 28
EECS 28B / STAT 24B: Advanced Topics in Statistical LearningSpring 2009 Lecture 3 January 28 Lecturer: Pradeep Ravikumar Scribe: Timothy J. Wheeler Note: These lecture notes are still rough, and have only
More informationLecture 17: Density Estimation Lecturer: Yihong Wu Scribe: Jiaqi Mu, Mar 31, 2016 [Ed. Apr 1]
ECE598: Information-theoretic methods in high-dimensional statistics Spring 06 Lecture 7: Density Estimation Lecturer: Yihong Wu Scribe: Jiaqi Mu, Mar 3, 06 [Ed. Apr ] In last lecture, we studied the minimax
More informationLecture #21. c T x Ax b. maximize subject to
COMPSCI 330: Design and Analysis of Algorithms 11/11/2014 Lecture #21 Lecturer: Debmalya Panigrahi Scribe: Samuel Haney 1 Overview In this lecture, we discuss linear programming. We first show that the
More information21.1 Lower bounds on minimax risk for functional estimation
ECE598: Information-theoretic methods in high-dimensional statistics Spring 016 Lecture 1: Functional estimation & testing Lecturer: Yihong Wu Scribe: Ashok Vardhan, Apr 14, 016 In this chapter, we will
More information1 Review of The Learning Setting
COS 5: Theoretical Machine Learning Lecturer: Rob Schapire Lecture #8 Scribe: Changyan Wang February 28, 208 Review of The Learning Setting Last class, we moved beyond the PAC model: in the PAC model we
More informationIntroduction to Statistical Learning Theory
Introduction to Statistical Learning Theory In the last unit we looked at regularization - adding a w 2 penalty. We add a bias - we prefer classifiers with low norm. How to incorporate more complicated
More information2.1 Optimization formulation of k-means
MGMT 69000: Topics in High-dimensional Data Analysis Falll 2016 Lecture 2: k-means Clustering Lecturer: Jiaming Xu Scribe: Jiaming Xu, September 2, 2016 Outline Optimization formulation of k-means Convergence
More informationLecture 21: Minimax Theory
Lecture : Minimax Theory Akshay Krishnamurthy akshay@cs.umass.edu November 8, 07 Recap In the first part of the course, we spent the majority of our time studying risk minimization. We found many ways
More informationApplication of Information Theory, Lecture 7. Relative Entropy. Handout Mode. Iftach Haitner. Tel Aviv University.
Application of Information Theory, Lecture 7 Relative Entropy Handout Mode Iftach Haitner Tel Aviv University. December 1, 2015 Iftach Haitner (TAU) Application of Information Theory, Lecture 7 December
More informationSeries 7, May 22, 2018 (EM Convergence)
Exercises Introduction to Machine Learning SS 2018 Series 7, May 22, 2018 (EM Convergence) Institute for Machine Learning Dept. of Computer Science, ETH Zürich Prof. Dr. Andreas Krause Web: https://las.inf.ethz.ch/teaching/introml-s18
More informationAdaGAN: Boosting Generative Models
AdaGAN: Boosting Generative Models Ilya Tolstikhin ilya@tuebingen.mpg.de joint work with Gelly 2, Bousquet 2, Simon-Gabriel 1, Schölkopf 1 1 MPI for Intelligent Systems 2 Google Brain Radford et al., 2015)
More informationLecture Support Vector Machine (SVM) Classifiers
Introduction to Machine Learning Lecturer: Amir Globerson Lecture 6 Fall Semester Scribe: Yishay Mansour 6.1 Support Vector Machine (SVM) Classifiers Classification is one of the most important tasks in
More information14. Duality. ˆ Upper and lower bounds. ˆ General duality. ˆ Constraint qualifications. ˆ Counterexample. ˆ Complementary slackness.
CS/ECE/ISyE 524 Introduction to Optimization Spring 2016 17 14. Duality ˆ Upper and lower bounds ˆ General duality ˆ Constraint qualifications ˆ Counterexample ˆ Complementary slackness ˆ Examples ˆ Sensitivity
More informationtopics about f-divergence
topics about f-divergence Presented by Liqun Chen Mar 16th, 2018 1 Outline 1 f-gan: Training Generative Neural Samplers using Variational Experiments 2 f-gans in an Information Geometric Nutshell Experiments
More informationMATH 51H Section 4. October 16, Recall what it means for a function between metric spaces to be continuous:
MATH 51H Section 4 October 16, 2015 1 Continuity Recall what it means for a function between metric spaces to be continuous: Definition. Let (X, d X ), (Y, d Y ) be metric spaces. A function f : X Y is
More informationCS Lecture 8 & 9. Lagrange Multipliers & Varitional Bounds
CS 6347 Lecture 8 & 9 Lagrange Multipliers & Varitional Bounds General Optimization subject to: min ff 0() R nn ff ii 0, h ii = 0, ii = 1,, mm ii = 1,, pp 2 General Optimization subject to: min ff 0()
More informationOptimal Transport Methods in Operations Research and Statistics
Optimal Transport Methods in Operations Research and Statistics Jose Blanchet (based on work with F. He, Y. Kang, K. Murthy, F. Zhang). Stanford University (Management Science and Engineering), and Columbia
More informationTopic: Primal-Dual Algorithms Date: We finished our discussion of randomized rounding and began talking about LP Duality.
CS787: Advanced Algorithms Scribe: Amanda Burton, Leah Kluegel Lecturer: Shuchi Chawla Topic: Primal-Dual Algorithms Date: 10-17-07 14.1 Last Time We finished our discussion of randomized rounding and
More informationPosterior Regularization
Posterior Regularization 1 Introduction One of the key challenges in probabilistic structured learning, is the intractability of the posterior distribution, for fast inference. There are numerous methods
More informationLecture 3: Lower Bounds for Bandit Algorithms
CMSC 858G: Bandits, Experts and Games 09/19/16 Lecture 3: Lower Bounds for Bandit Algorithms Instructor: Alex Slivkins Scribed by: Soham De & Karthik A Sankararaman 1 Lower Bounds In this lecture (and
More informationMGMT 69000: Topics in High-dimensional Data Analysis Falll 2016
MGMT 69000: Topics in High-dimensional Data Analysis Falll 2016 Lecture 14: Information Theoretic Methods Lecturer: Jiaming Xu Scribe: Hilda Ibriga, Adarsh Barik, December 02, 2016 Outline f-divergence
More informationMath 140A - Fall Final Exam
Math 140A - Fall 2014 - Final Exam Problem 1. Let {a n } n 1 be an increasing sequence of real numbers. (i) If {a n } has a bounded subsequence, show that {a n } is itself bounded. (ii) If {a n } has a
More information21.2 Example 1 : Non-parametric regression in Mean Integrated Square Error Density Estimation (L 2 2 risk)
10-704: Information Processing and Learning Spring 2015 Lecture 21: Examples of Lower Bounds and Assouad s Method Lecturer: Akshay Krishnamurthy Scribes: Soumya Batra Note: LaTeX template courtesy of UC
More informationLagrangian Duality and Convex Optimization
Lagrangian Duality and Convex Optimization David Rosenberg New York University February 11, 2015 David Rosenberg (New York University) DS-GA 1003 February 11, 2015 1 / 24 Introduction Why Convex Optimization?
More informationLecture 6: September 19
36-755: Advanced Statistical Theory I Fall 2016 Lecture 6: September 19 Lecturer: Alessandro Rinaldo Scribe: YJ Choe Note: LaTeX template courtesy of UC Berkeley EECS dept. Disclaimer: These notes have
More information1 Review and Overview
DRAFT a final version will be posted shortly CS229T/STATS231: Statistical Learning Theory Lecturer: Tengyu Ma Lecture # 16 Scribe: Chris Cundy, Ananya Kumar November 14, 2018 1 Review and Overview Last
More informationIntroduction to Mathematical Programming IE406. Lecture 10. Dr. Ted Ralphs
Introduction to Mathematical Programming IE406 Lecture 10 Dr. Ted Ralphs IE406 Lecture 10 1 Reading for This Lecture Bertsimas 4.1-4.3 IE406 Lecture 10 2 Duality Theory: Motivation Consider the following
More information8. Geometric problems
8. Geometric problems Convex Optimization Boyd & Vandenberghe extremal volume ellipsoids centering classification placement and facility location 8 1 Minimum volume ellipsoid around a set Löwner-John ellipsoid
More information9. Geometric problems
9. Geometric problems EE/AA 578, Univ of Washington, Fall 2016 projection on a set extremal volume ellipsoids centering classification 9 1 Projection on convex set projection of point x on set C defined
More informationLECTURE 10 LECTURE OUTLINE
LECTURE 10 LECTURE OUTLINE Min Common/Max Crossing Th. III Nonlinear Farkas Lemma/Linear Constraints Linear Programming Duality Convex Programming Duality Optimality Conditions Reading: Sections 4.5, 5.1,5.2,
More informationh(x) lim H(x) = lim Since h is nondecreasing then h(x) 0 for all x, and if h is discontinuous at a point x then H(x) > 0. Denote
Real Variables, Fall 4 Problem set 4 Solution suggestions Exercise. Let f be of bounded variation on [a, b]. Show that for each c (a, b), lim x c f(x) and lim x c f(x) exist. Prove that a monotone function
More informationLecture 5 Channel Coding over Continuous Channels
Lecture 5 Channel Coding over Continuous Channels I-Hsiang Wang Department of Electrical Engineering National Taiwan University ihwang@ntu.edu.tw November 14, 2014 1 / 34 I-Hsiang Wang NIT Lecture 5 From
More information8. Geometric problems
8. Geometric problems Convex Optimization Boyd & Vandenberghe extremal volume ellipsoids centering classification placement and facility location 8 Minimum volume ellipsoid around a set Löwner-John ellipsoid
More informationTheory of Probability Fall 2008
MIT OpenCourseWare http://ocw.mit.edu 8.75 Theory of Probability Fall 008 For information about citing these materials or our Terms of Use, visit: http://ocw.mit.edu/terms. Section Prekopa-Leindler inequality,
More informationLinear Programming Duality
Summer 2011 Optimization I Lecture 8 1 Duality recap Linear Programming Duality We motivated the dual of a linear program by thinking about the best possible lower bound on the optimal value we can achieve
More information1 Topology Definition of a topology Basis (Base) of a topology The subspace topology & the product topology on X Y 3
Index Page 1 Topology 2 1.1 Definition of a topology 2 1.2 Basis (Base) of a topology 2 1.3 The subspace topology & the product topology on X Y 3 1.4 Basic topology concepts: limit points, closed sets,
More informationCSCI5654 (Linear Programming, Fall 2013) Lecture-8. Lecture 8 Slide# 1
CSCI5654 (Linear Programming, Fall 2013) Lecture-8 Lecture 8 Slide# 1 Today s Lecture 1. Recap of dual variables and strong duality. 2. Complementary Slackness Theorem. 3. Interpretation of dual variables.
More informationInformation theoretic perspectives on learning algorithms
Information theoretic perspectives on learning algorithms Varun Jog University of Wisconsin - Madison Departments of ECE and Mathematics Shannon Channel Hangout! May 8, 2018 Jointly with Adrian Tovar-Lopez
More informationPolicy Gradient. U(θ) = E[ R(s t,a t );π θ ] = E[R(τ);π θ ] (1) 1 + e θ φ(s t) E[R(τ);π θ ] (3) = max. θ P(τ;θ)R(τ) (6) P(τ;θ) θ log P(τ;θ)R(τ) (9)
CS294-40 Learning for Robotics and Control Lecture 16-10/20/2008 Lecturer: Pieter Abbeel Policy Gradient Scribe: Jan Biermeyer 1 Recap Recall: H U() = E[ R(s t,a ;π ] = E[R();π ] (1) Here is a sample path
More informationDefinition 6.1. A metric space (X, d) is complete if every Cauchy sequence tends to a limit in X.
Chapter 6 Completeness Lecture 18 Recall from Definition 2.22 that a Cauchy sequence in (X, d) is a sequence whose terms get closer and closer together, without any limit being specified. In the Euclidean
More informationLecture 1: Entropy, convexity, and matrix scaling CSE 599S: Entropy optimality, Winter 2016 Instructor: James R. Lee Last updated: January 24, 2016
Lecture 1: Entropy, convexity, and matrix scaling CSE 599S: Entropy optimality, Winter 2016 Instructor: James R. Lee Last updated: January 24, 2016 1 Entropy Since this course is about entropy maximization,
More informationAN ELEMENTARY PROOF OF THE TRIANGLE INEQUALITY FOR THE WASSERSTEIN METRIC
PROCEEDINGS OF THE AMERICAN MATHEMATICAL SOCIETY Volume 136, Number 1, January 2008, Pages 333 339 S 0002-9939(07)09020- Article electronically published on September 27, 2007 AN ELEMENTARY PROOF OF THE
More information3. If a choice is broken down into two successive choices, the original H should be the weighted sum of the individual values of H.
Appendix A Information Theory A.1 Entropy Shannon (Shanon, 1948) developed the concept of entropy to measure the uncertainty of a discrete random variable. Suppose X is a discrete random variable that
More informationEC 521 MATHEMATICAL METHODS FOR ECONOMICS. Lecture 1: Preliminaries
EC 521 MATHEMATICAL METHODS FOR ECONOMICS Lecture 1: Preliminaries Murat YILMAZ Boğaziçi University In this lecture we provide some basic facts from both Linear Algebra and Real Analysis, which are going
More informationDuality of LPs and Applications
Lecture 6 Duality of LPs and Applications Last lecture we introduced duality of linear programs. We saw how to form duals, and proved both the weak and strong duality theorems. In this lecture we will
More informationTHE INVERSE FUNCTION THEOREM
THE INVERSE FUNCTION THEOREM W. PATRICK HOOPER The implicit function theorem is the following result: Theorem 1. Let f be a C 1 function from a neighborhood of a point a R n into R n. Suppose A = Df(a)
More informationLecture Notes on Metric Spaces
Lecture Notes on Metric Spaces Math 117: Summer 2007 John Douglas Moore Our goal of these notes is to explain a few facts regarding metric spaces not included in the first few chapters of the text [1],
More informationLecture 2: August 31
0-704: Information Processing and Learning Fall 206 Lecturer: Aarti Singh Lecture 2: August 3 Note: These notes are based on scribed notes from Spring5 offering of this course. LaTeX template courtesy
More informationCOMPSCI 650 Applied Information Theory Jan 21, Lecture 2
COMPSCI 650 Applied Information Theory Jan 21, 2016 Lecture 2 Instructor: Arya Mazumdar Scribe: Gayane Vardoyan, Jong-Chyi Su 1 Entropy Definition: Entropy is a measure of uncertainty of a random variable.
More informationLecture 6: Conic Optimization September 8
IE 598: Big Data Optimization Fall 2016 Lecture 6: Conic Optimization September 8 Lecturer: Niao He Scriber: Juan Xu Overview In this lecture, we finish up our previous discussion on optimality conditions
More informationd(x n, x) d(x n, x nk ) + d(x nk, x) where we chose any fixed k > N
Problem 1. Let f : A R R have the property that for every x A, there exists ɛ > 0 such that f(t) > ɛ if t (x ɛ, x + ɛ) A. If the set A is compact, prove there exists c > 0 such that f(x) > c for all x
More informationConstrained Optimization and Lagrangian Duality
CIS 520: Machine Learning Oct 02, 2017 Constrained Optimization and Lagrangian Duality Lecturer: Shivani Agarwal Disclaimer: These notes are designed to be a supplement to the lecture. They may or may
More informationMidterm 1. Every element of the set of functions is continuous
Econ 200 Mathematics for Economists Midterm Question.- Consider the set of functions F C(0, ) dened by { } F = f C(0, ) f(x) = ax b, a A R and b B R That is, F is a subset of the set of continuous functions
More information6.1 Variational representation of f-divergences
ECE598: Information-theoretic methods in high-dimensional statistics Spring 2016 Lecture 6: Variational representation, HCR and CR lower bounds Lecturer: Yihong Wu Scribe: Georgios Rovatsos, Feb 11, 2016
More informationStability results for Logarithmic Sobolev inequality
Stability results for Logarithmic Sobolev inequality Daesung Kim (joint work with Emanuel Indrei) Department of Mathematics Purdue University September 20, 2017 Daesung Kim (Purdue) Stability for LSI Probability
More informationThe fundamental theorem of linear programming
The fundamental theorem of linear programming Michael Tehranchi June 8, 2017 This note supplements the lecture notes of Optimisation The statement of the fundamental theorem of linear programming and the
More informationConvex Optimization and Support Vector Machine
Convex Optimization and Support Vector Machine Problem 0. Consider a two-class classification problem. The training data is L n = {(x 1, t 1 ),..., (x n, t n )}, where each t i { 1, 1} and x i R p. We
More informationExample Problem. Linear Program (standard form) CSCI5654 (Linear Programming, Fall 2013) Lecture-7. Duality
CSCI5654 (Linear Programming, Fall 013) Lecture-7 Duality Lecture 7 Slide# 1 Lecture 7 Slide# Linear Program (standard form) Example Problem maximize c 1 x 1 + + c n x n s.t. a j1 x 1 + + a jn x n b j
More informationMeasure and Integration: Solutions of CW2
Measure and Integration: s of CW2 Fall 206 [G. Holzegel] December 9, 206 Problem of Sheet 5 a) Left (f n ) and (g n ) be sequences of integrable functions with f n (x) f (x) and g n (x) g (x) for almost
More informationMachine Learning. Support Vector Machines. Fabio Vandin November 20, 2017
Machine Learning Support Vector Machines Fabio Vandin November 20, 2017 1 Classification and Margin Consider a classification problem with two classes: instance set X = R d label set Y = { 1, 1}. Training
More informationThe Moment Method; Convex Duality; and Large/Medium/Small Deviations
Stat 928: Statistical Learning Theory Lecture: 5 The Moment Method; Convex Duality; and Large/Medium/Small Deviations Instructor: Sham Kakade The Exponential Inequality and Convex Duality The exponential
More informationMath 127C, Spring 2006 Final Exam Solutions. x 2 ), g(y 1, y 2 ) = ( y 1 y 2, y1 2 + y2) 2. (g f) (0) = g (f(0))f (0).
Math 27C, Spring 26 Final Exam Solutions. Define f : R 2 R 2 and g : R 2 R 2 by f(x, x 2 (sin x 2 x, e x x 2, g(y, y 2 ( y y 2, y 2 + y2 2. Use the chain rule to compute the matrix of (g f (,. By the chain
More informationBayesian Machine Learning - Lecture 7
Bayesian Machine Learning - Lecture 7 Guido Sanguinetti Institute for Adaptive and Neural Computation School of Informatics University of Edinburgh gsanguin@inf.ed.ac.uk March 4, 2015 Today s lecture 1
More informationGenerative Adversarial Networks
Generative Adversarial Networks Stefano Ermon, Aditya Grover Stanford University Lecture 10 Stefano Ermon, Aditya Grover (AI Lab) Deep Generative Models Lecture 10 1 / 17 Selected GANs https://github.com/hindupuravinash/the-gan-zoo
More informationOptimal Transport in Risk Analysis
Optimal Transport in Risk Analysis Jose Blanchet (based on work with Y. Kang and K. Murthy) Stanford University (Management Science and Engineering), and Columbia University (Department of Statistics and
More informationOptimality, Duality, Complementarity for Constrained Optimization
Optimality, Duality, Complementarity for Constrained Optimization Stephen Wright University of Wisconsin-Madison May 2014 Wright (UW-Madison) Optimality, Duality, Complementarity May 2014 1 / 41 Linear
More informationa) Let x,y be Cauchy sequences in some metric space. Define d(x, y) = lim n d (x n, y n ). Show that this function is well-defined.
Problem 3) Remark: for this problem, if I write the notation lim x n, it should always be assumed that I mean lim n x n, and similarly if I write the notation lim x nk it should always be assumed that
More informationConsistency of the maximum likelihood estimator for general hidden Markov models
Consistency of the maximum likelihood estimator for general hidden Markov models Jimmy Olsson Centre for Mathematical Sciences Lund University Nordstat 2012 Umeå, Sweden Collaborators Hidden Markov models
More informationMath 328 Course Notes
Math 328 Course Notes Ian Robertson March 3, 2006 3 Properties of C[0, 1]: Sup-norm and Completeness In this chapter we are going to examine the vector space of all continuous functions defined on the
More informationSolving Dual Problems
Lecture 20 Solving Dual Problems We consider a constrained problem where, in addition to the constraint set X, there are also inequality and linear equality constraints. Specifically the minimization problem
More informationTechinical Proofs for Nonlinear Learning using Local Coordinate Coding
Techinical Proofs for Nonlinear Learning using Local Coordinate Coding 1 Notations and Main Results Denition 1.1 (Lipschitz Smoothness) A function f(x) on R d is (α, β, p)-lipschitz smooth with respect
More informationThe extreme points of symmetric norms on R^2
Graduate Theses and Dissertations Iowa State University Capstones, Theses and Dissertations 2008 The extreme points of symmetric norms on R^2 Anchalee Khemphet Iowa State University Follow this and additional
More informationBasic Properties of Metric and Normed Spaces
Basic Properties of Metric and Normed Spaces Computational and Metric Geometry Instructor: Yury Makarychev The second part of this course is about metric geometry. We will study metric spaces, low distortion
More informationLinear Programming: Chapter 5 Duality
Linear Programming: Chapter 5 Duality Robert J. Vanderbei September 30, 2010 Slides last edited on October 5, 2010 Operations Research and Financial Engineering Princeton University Princeton, NJ 08544
More informationVariational Inference (11/04/13)
STA561: Probabilistic machine learning Variational Inference (11/04/13) Lecturer: Barbara Engelhardt Scribes: Matt Dickenson, Alireza Samany, Tracy Schifeling 1 Introduction In this lecture we will further
More informationLecture 1: Introduction, Entropy and ML estimation
0-704: Information Processing and Learning Spring 202 Lecture : Introduction, Entropy and ML estimation Lecturer: Aarti Singh Scribes: Min Xu Disclaimer: These notes have not been subjected to the usual
More information16.1 Bounding Capacity with Covering Number
ECE598: Information-theoretic methods in high-dimensional statistics Spring 206 Lecture 6: Upper Bounds for Density Estimation Lecturer: Yihong Wu Scribe: Yang Zhang, Apr, 206 So far we have been mostly
More informationInformation geometry for bivariate distribution control
Information geometry for bivariate distribution control C.T.J.Dodson + Hong Wang Mathematics + Control Systems Centre, University of Manchester Institute of Science and Technology Optimal control of stochastic
More informationNote 3: LP Duality. If the primal problem (P) in the canonical form is min Z = n (1) then the dual problem (D) in the canonical form is max W = m (2)
Note 3: LP Duality If the primal problem (P) in the canonical form is min Z = n j=1 c j x j s.t. nj=1 a ij x j b i i = 1, 2,..., m (1) x j 0 j = 1, 2,..., n, then the dual problem (D) in the canonical
More informationConvex Optimization Boyd & Vandenberghe. 5. Duality
5. Duality Convex Optimization Boyd & Vandenberghe Lagrange dual problem weak and strong duality geometric interpretation optimality conditions perturbation and sensitivity analysis examples generalized
More informationConvex Optimization & Lagrange Duality
Convex Optimization & Lagrange Duality Chee Wei Tan CS 8292 : Advanced Topics in Convex Optimization and its Applications Fall 2010 Outline Convex optimization Optimality condition Lagrange duality KKT
More informationEcon Slides from Lecture 1
Econ 205 Sobel Econ 205 - Slides from Lecture 1 Joel Sobel August 23, 2010 Warning I can t start without assuming that something is common knowledge. You can find basic definitions of Sets and Set Operations
More informationA : k n. Usually k > n otherwise easily the minimum is zero. Analytical solution:
1-5: Least-squares I A : k n. Usually k > n otherwise easily the minimum is zero. Analytical solution: f (x) =(Ax b) T (Ax b) =x T A T Ax 2b T Ax + b T b f (x) = 2A T Ax 2A T b = 0 Chih-Jen Lin (National
More information1 Review and Overview
CS229T/STATS231: Statistical Learig Theory Lecturer: Tegyu Ma Lecture #12 Scribe: Garrett Thomas, Pega Liu October 31, 2018 1 Review a Overview Recall the GAN setup: we have iepeet samples x 1,..., x raw
More informationJune 21, Peking University. Dual Connections. Zhengchao Wan. Overview. Duality of connections. Divergence: general contrast functions
Dual Peking University June 21, 2016 Divergences: Riemannian connection Let M be a manifold on which there is given a Riemannian metric g =,. A connection satisfying Z X, Y = Z X, Y + X, Z Y (1) for all
More informationQuantitative Biology II Lecture 4: Variational Methods
10 th March 2015 Quantitative Biology II Lecture 4: Variational Methods Gurinder Singh Mickey Atwal Center for Quantitative Biology Cold Spring Harbor Laboratory Image credit: Mike West Summary Approximate
More informationEE364a Review Session 5
EE364a Review Session 5 EE364a Review announcements: homeworks 1 and 2 graded homework 4 solutions (check solution to additional problem 1) scpd phone-in office hours: tuesdays 6-7pm (650-723-1156) 1 Complementary
More informationCS-E4830 Kernel Methods in Machine Learning
CS-E4830 Kernel Methods in Machine Learning Lecture 3: Convex optimization and duality Juho Rousu 27. September, 2017 Juho Rousu 27. September, 2017 1 / 45 Convex optimization Convex optimisation This
More informationConvex Optimization and SVM
Convex Optimization and SVM Problem 0. Cf lecture notes pages 12 to 18. Problem 1. (i) A slab is an intersection of two half spaces, hence convex. (ii) A wedge is an intersection of two half spaces, hence
More informationDistance-Divergence Inequalities
Distance-Divergence Inequalities Katalin Marton Alfréd Rényi Institute of Mathematics of the Hungarian Academy of Sciences Motivation To find a simple proof of the Blowing-up Lemma, proved by Ahlswede,
More informationApplications of Linear Programming
Applications of Linear Programming lecturer: András London University of Szeged Institute of Informatics Department of Computational Optimization Lecture 9 Non-linear programming In case of LP, the goal
More informationconverges as well if x < 1. 1 x n x n 1 1 = 2 a nx n
Solve the following 6 problems. 1. Prove that if series n=1 a nx n converges for all x such that x < 1, then the series n=1 a n xn 1 x converges as well if x < 1. n For x < 1, x n 0 as n, so there exists
More informationLinear and Combinatorial Optimization
Linear and Combinatorial Optimization The dual of an LP-problem. Connections between primal and dual. Duality theorems and complementary slack. Philipp Birken (Ctr. for the Math. Sc.) Lecture 3: Duality
More informationLINEAR PROGRAMMING II
LINEAR PROGRAMMING II LP duality strong duality theorem bonus proof of LP duality applications Lecture slides by Kevin Wayne Last updated on 7/25/17 11:09 AM LINEAR PROGRAMMING II LP duality Strong duality
More informationExponentiated Gradient Descent
CSE599s, Spring 01, Online Learning Lecture 10-04/6/01 Lecturer: Ofer Dekel Exponentiated Gradient Descent Scribe: Albert Yu 1 Introduction In this lecture we review norms, dual norms, strong convexity,
More information