On A-distance and Relative A-distance


ADAPTIVE COMMUNICATIONS AND SIGNAL PROCESSING LABORATORY
CORNELL UNIVERSITY, ITHACA, NY

On A-distance and Relative A-distance

Ting He and Lang Tong

Technical Report No. ACSP-TR
August 2004

I. INTRODUCTION

We give a method to measure the distance between two probability distributions and, based on this distance measure, we bound the probability that the distance between the empirical distribution and the actual distribution exceeds a certain level. The direct implication of our result is that, for a large sample size, one can replace the actual probability with the corresponding empirical probability at an arbitrarily small error. The proofs of the theorems are based on the Vapnik-Chervonenkis theory [6] and Anthony and Shawe-Taylor's extension of the Vapnik-Chervonenkis theory [4].

II. DISTANCE MEASURE

A-distance: Fix a measure space and let $\mathcal{A}$ be a collection of measurable sets. Let $P_1$ and $P_2$ be probability distributions over this space. The A-distance between $P_1$ and $P_2$ is defined as
$$d_{\mathcal{A}}(P_1,P_2) = \sup_{A \in \mathcal{A}} |P_1(A) - P_2(A)|.$$
For finite sample sets $S_1$ and $S_2$, $d_{\mathcal{A}}(S_1,S_2)$ is defined similarly by replacing $P_i(A)$ with $S_i(A) = |S_i \cap A| / |S_i|$.

The following notion of relative A-distance offers a way to take the relative magnitude of a change into account.

Relative A-distance: Let $P_1, P_2$ be two probability distributions over the same measure space, let $\mathcal{A}$ denote a family of measurable subsets of that space, and let $A$ be a set in $\mathcal{A}$. The relative A-distance between $P_1$ and $P_2$ is defined as
$$\phi_{\mathcal{A}}(P_1,P_2) = \sup_{A \in \mathcal{A}} \frac{|P_1(A) - P_2(A)|}{\sqrt{(P_1(A) + P_2(A))/2}}.$$
For empirical distances, simply replace $P_i(A)$ with the empirical measure $S_i(A) = |S_i \cap A| / |S_i|$.

It is easy to see that the A-distance is a metric. For the proof that the relative A-distance is a metric, see [5].
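To make the empirical quantities concrete, here is a minimal sketch (not from the report) that computes $d_{\mathcal{A}}(S_1,S_2)$ and $\phi_{\mathcal{A}}(S_1,S_2)$ for a finite family of half-lines $\mathcal{A} = \{(-\infty, t] : t \in T\}$. The threshold grid, the test distributions, and the $\sqrt{(\cdot+\cdot)/2}$ normalization coded in `relative_a_distance` follow the reconstruction above and are illustrative assumptions, not code from the report.

```python
import numpy as np

def empirical_measures(sample, thresholds):
    """S(A) = |S ∩ A| / |S| for each half-line A = (-inf, t], t in thresholds."""
    sample = np.asarray(sample, dtype=float)
    return np.array([(sample <= t).mean() for t in thresholds])

def a_distance(s1, s2, thresholds):
    """Empirical A-distance: sup over A of |S1(A) - S2(A)| (a max over the finite family)."""
    m1 = empirical_measures(s1, thresholds)
    m2 = empirical_measures(s2, thresholds)
    return np.max(np.abs(m1 - m2))

def relative_a_distance(s1, s2, thresholds):
    """Empirical relative A-distance: sup over A of |S1(A)-S2(A)| / sqrt((S1(A)+S2(A))/2),
    with the convention 0/0 = 0 when both empirical measures vanish."""
    m1 = empirical_measures(s1, thresholds)
    m2 = empirical_measures(s2, thresholds)
    num = np.abs(m1 - m2)
    den = np.sqrt((m1 + m2) / 2.0)
    ratio = np.divide(num, den, out=np.zeros_like(num), where=den > 0)
    return np.max(ratio)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    s1 = rng.normal(0.0, 1.0, size=2000)   # sample from P1
    s2 = rng.normal(0.3, 1.0, size=2000)   # sample from P2 (shifted mean)
    ts = np.linspace(-4.0, 4.0, 81)        # finite threshold grid T
    print("d_A(S1, S2)   =", a_distance(s1, s2, ts))
    print("phi_A(S1, S2) =", relative_a_distance(s1, s2, ts))
```

Because the family here is finite, the suprema reduce to maxima over the threshold grid; for richer families one would need a cover or an explicit parametrization to evaluate them.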

III. VC BOUNDS

The following theorems, derived from [6] and [4], guarantee the rate at which the empirical distance converges to the underlying distance for both distance notions.

For the A-distance, we have the following theorems.

Theorem 3.1 (Vapnik-Chervonenkis Inequality): Let $P$ be a probability distribution over a domain $X$ and let $S$ be a collection of $n$ i.i.d. samples drawn from $P$. Then for a family $\mathcal{A}$ of subsets of $X$ and a constant $\epsilon \in (0,1)$,
$$P^n\{\sup_{A \in \mathcal{A}} |S(A) - P(A)| > \epsilon\} \le 4\Pi_{\mathcal{A}}(n)\, e^{-n\epsilon^2/8},$$
where $\Pi_{\mathcal{A}}(n)$ is the shatter coefficient [3]. If $\mathcal{A}$ has a finite VC dimension $d$, then by Sauer's Lemma, $\Pi_{\mathcal{A}}(n) \le (n+1)^d$ for all $n$.

Using Theorem 3.1, it is easy to derive the following corollary.

Corollary 3.2: Let $P_1$, $P_2$ be any probability distributions over some domain $X$, let $\mathcal{A}$ be a family of subsets of $X$, and let $\epsilon \in (0,1)$. If $S_1$, $S_2$ are i.i.d. $n$-samples drawn from $P_1$, $P_2$ respectively, then
$$P^n\big[\exists A \in \mathcal{A}: \big|\, |P_1(A) - P_2(A)| - |S_1(A) - S_2(A)| \,\big| \ge \epsilon\big] < 8\Pi_{\mathcal{A}}(n)\, e^{-n\epsilon^2/32},$$
where $P^n$ in the above inequality is the probability over the pairs of samples $(S_1,S_2)$ induced by the sample-generating distributions $(P_1,P_2)$.

Proof: Simple algebra yields the result:
$$\Pr\{\exists A \in \mathcal{A}: \big|\, |P_1(A) - P_2(A)| - |S_1(A) - S_2(A)| \,\big| \ge \epsilon\}$$
$$\le \Pr\{\sup_{A \in \mathcal{A}} |P_1(A) - P_2(A) - S_1(A) + S_2(A)| \ge \epsilon\} \quad (1)$$
$$\le \Pr\{\sup_{A \in \mathcal{A}} \big(|P_1(A) - S_1(A)| + |P_2(A) - S_2(A)|\big) \ge \epsilon\} \quad (2)$$
$$\le \Pr\{\{\sup_{A \in \mathcal{A}} |P_1(A) - S_1(A)| \ge \epsilon/2\} \cup \{\sup_{A \in \mathcal{A}} |P_2(A) - S_2(A)| \ge \epsilon/2\}\} \quad (3)$$
$$\le \Pr\{\sup_{A \in \mathcal{A}} |P_1(A) - S_1(A)| \ge \epsilon/2\} + \Pr\{\sup_{A \in \mathcal{A}} |P_2(A) - S_2(A)| \ge \epsilon/2\} \quad (4)$$
$$\le 8\Pi_{\mathcal{A}}(n)\, e^{-n\epsilon^2/32}, \quad (5)$$
where the last inequality comes from Theorem 3.1.
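As a rough numerical companion to Corollary 3.2 (not part of the report), the sketch below plugs Sauer's Lemma $\Pi_{\mathcal{A}}(n) \le (n+1)^d$ into the right-hand side and searches for the smallest $n$ at which $8(n+1)^d e^{-n\epsilon^2/32}$ drops below a target failure probability $\delta$. The function names and the particular values of $d$, $\epsilon$, and $\delta$ are arbitrary choices for the example.

```python
import math

def corollary_3_2_bound(n, d, eps):
    """Right-hand side of Corollary 3.2 after Sauer's Lemma: 8 (n+1)^d exp(-n eps^2 / 32)."""
    return 8.0 * (n + 1) ** d * math.exp(-n * eps ** 2 / 32.0)

def sample_size(d, eps, delta, n_max=10**9):
    """Smallest n with corollary_3_2_bound(n, d, eps) <= delta, by doubling then bisection."""
    assert 0 < delta < 1  # for delta < 1 the bound exceeds delta at n = 1, so the search is valid
    hi = 2
    while corollary_3_2_bound(hi, d, eps) > delta:
        hi *= 2
        if hi > n_max:
            raise ValueError("bound does not reach delta within n_max samples")
    lo = 1
    while lo < hi:
        mid = (lo + hi) // 2
        if corollary_3_2_bound(mid, d, eps) <= delta:
            hi = mid
        else:
            lo = mid + 1
    return lo

if __name__ == "__main__":
    # Example: VC dimension d = 1 (half-lines), deviation eps = 0.1, failure probability delta = 0.05.
    n = sample_size(d=1, eps=0.1, delta=0.05)
    print("n =", n, " bound at n =", corollary_3_2_bound(n, 1, 0.1))
```

The resulting $n$ is large (tens of thousands for these settings) because the constants in VC-type bounds are loose; the content of the corollary is the exponential dependence on $n\epsilon^2$ rather than the absolute numbers.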

We thus have a way to bound the probability that the empirical A-distance deviates from the true A-distance in either direction. The theoretical guarantee can be improved by considering the relative A-distance: we can obtain results similar to Theorem 3.1 and Corollary 3.2 for the metric $\phi_{\mathcal{A}}(P_1,P_2)$. We start with the following result of Anthony and Shawe-Taylor [4].

Lemma 3.3: Let $\mathcal{A}$ be a family of subsets of the domain $X$ and let $P$ be any probability distribution over $X$. If $S_1$ and $S_2$ are two collections of $n$ samples each, drawn i.i.d. from $P$, then
$$P^n(\phi_{\mathcal{A}}(S_1,S_2) > \epsilon) \le \Pi_{\mathcal{A}}(n)\, e^{-n\epsilon^2/4}$$
(where $P^n$ is the probability that $P$ induces over the choice of samples).

In [4], Anthony and Shawe-Taylor proved that
$$\Pr\Big\{\sup_{A \in \mathcal{A}} \frac{S_1(A) - S_2(A)}{\sqrt{(S_1(A) + S_2(A))/2}} > \epsilon\Big\} \le \Pi_{\mathcal{A}}(n)\, e^{-n\epsilon^2/4}.$$
By the symmetry of $S_1(A)$ and $S_2(A)$, the result in Lemma 3.3 holds.

Theorem 3.4: Let $\mathcal{A}$ be a family of subsets of the domain $X$, let $P$ be any probability distribution over $X$, and let $S$ be a set of $n$ samples, each drawn i.i.d. from $P$. Then
$$P^n(\phi_{\mathcal{A}}(S,P) > \epsilon) \le 8\Pi_{\mathcal{A}}(n)\, e^{-n\epsilon^2/4}$$
(where $P^n$ is the $n$-th power of $P$, the probability that $P$ induces over the choice of samples). The proof of this theorem is similar to the proof in [4].

Proof: Define
$$Q = \Big\{S \in X^n : \exists A \in \mathcal{A} \text{ s.t. } \frac{P(A) - S(A)}{\sqrt{(P(A) + S(A))/2}} > \epsilon\Big\},$$
$$R = \Big\{S S' \in X^{2n} : \exists A \in \mathcal{A} \text{ s.t. } \frac{|S(A) - S'(A)|}{\sqrt{(S(A) + S'(A))/2}} > \epsilon\Big\},$$
where $S$, $S'$ are two sets of $n$ samples each, drawn i.i.d. from $P$.

We claim that $\Pr(Q) \le 4\Pr(R)$ for $n > 4/\epsilon^2$. This is true because of the following. Suppose $S \in Q$, so there is $C \in \mathcal{A}$ such that $(P(C) - S(C))/\sqrt{(P(C) + S(C))/2} > \epsilon$. Hence
$$S(C) < P(C) + \frac{\epsilon^2}{4} - \epsilon\sqrt{\frac{\epsilon^2}{16} + P(C)}.$$
Noting that $S(C) \ge 0$, some simple calculation shows that $P(C) > \epsilon^2/2$.

If we draw another set of $n$ samples $S'$, each drawn i.i.d. from $P$, and define
$$F = \frac{S'(C) - S(C)}{\sqrt{(S'(C) + S(C))/2}},$$
we have $F > \epsilon$ if $S'(C) > P(C)$. This is because the function $f(x,y) = (x - y)/\sqrt{(x + y)/2}$ is monotonically increasing in $x$ and monotonically decreasing in $y$ for $x, y \in (0,1)$ (taking derivatives easily verifies this). So the infimum of $F$ is achieved when $S(C) = P(C) + \epsilon^2/4 - \epsilon\sqrt{\epsilon^2/16 + P(C)}$ and $S'(C) = P(C)$. Plugging in yields the value $\epsilon$, and the strict inequality follows from the strict inequalities on $S(C)$ and $S'(C)$.

The random variable $nS'(C)$ has the binomial distribution $B(n, P(C))$. Since $P(C) > \epsilon^2/2$, for $n > 4/\epsilon^2$ we have $nP(C) > 2$, and then $nS'(C) > nP(C)$ with probability at least $1/4$ ([4]). Therefore, for $n > 4/\epsilon^2$, we have $\Pr(Q) \le 4\Pr(R)$.

In [4], it is proved that $\Pr(R) \le \Pi_{\mathcal{A}}(n)\, e^{-n\epsilon^2/4}$. Thus
$$P^n\Big\{\sup_{A \in \mathcal{A}} \frac{P(A) - S(A)}{\sqrt{(P(A) + S(A))/2}} > \epsilon\Big\} \le 4\Pi_{\mathcal{A}}(n)\, e^{-n\epsilon^2/4}.$$
Note that this inequality is trivially satisfied if $n \le 4/\epsilon^2$. By symmetry, we have
$$P^n(\phi_{\mathcal{A}}(S,P) > \epsilon) \le 8\Pi_{\mathcal{A}}(n)\, e^{-n\epsilon^2/4}.$$
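The algebra behind this proof can be checked numerically. The short sketch below (mine, not the report's) confirms that $S(C) = P(C) + \epsilon^2/4 - \epsilon\sqrt{\epsilon^2/16 + P(C)}$ is exactly the boundary value at which $(P(C)-S(C))/\sqrt{(P(C)+S(C))/2} = \epsilon$, and that $f(x,y) = (x-y)/\sqrt{(x+y)/2}$ is increasing in $x$ and decreasing in $y$; it assumes the $\sqrt{(\cdot+\cdot)/2}$ normalization used throughout this reconstruction.

```python
import math
import random

def rel(p, s):
    """One-sided relative discrepancy (p - s) / sqrt((p + s) / 2), as reconstructed above."""
    return (p - s) / math.sqrt((p + s) / 2.0)

def boundary_s(p, eps):
    """Smaller root S(C) = P(C) + eps^2/4 - eps * sqrt(eps^2/16 + P(C)) from the proof."""
    return p + eps ** 2 / 4.0 - eps * math.sqrt(eps ** 2 / 16.0 + p)

# f(x, y) = (x - y) / sqrt((x + y) / 2) is increasing in x and decreasing in y.
x, y, h = 0.4, 0.2, 1e-6
assert rel(x + h, y) > rel(x, y) and rel(x, y + h) < rel(x, y)

random.seed(1)
for _ in range(5):
    eps = random.uniform(0.05, 0.5)
    p = random.uniform(eps ** 2 / 2.0 + 0.01, 0.9)  # the proof requires P(C) > eps^2/2
    s = boundary_s(p, eps)
    assert s > 0.0                                   # the smaller root is positive here
    # At the boundary the one-sided discrepancy equals eps (up to rounding), so any
    # S(C) below this value gives a discrepancy strictly greater than eps.
    assert abs(rel(p, s) - eps) < 1e-9
    print(f"eps={eps:.3f}  P(C)={p:.3f}  S(C)={s:.4f}  rel(P,S)={rel(p, s):.6f}")
```

A Monte Carlo experiment along the same lines could also be used to probe Theorem 3.4 empirically for a small VC class, though the constants make the bound loose at moderate $n$.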

Similar to Corollary 3.2, we have the following corollary of Theorem 3.4, which bounds the probability that the empirical relative A-distance deviates from the true relative A-distance.

Corollary 3.5: Let $P_1$, $P_2$ be any probability distributions over some domain $X$, let $\mathcal{A}$ be a family of subsets of $X$, and let $\epsilon \in (0,1)$. If $S_1$, $S_2$ are two collections of $n$ samples each, drawn i.i.d. from $P_1$, $P_2$ respectively, then
$$P^n\big[\, |\phi_{\mathcal{A}}(P_1,P_2) - \phi_{\mathcal{A}}(S_1,S_2)| > \epsilon \,\big] \le 16\Pi_{\mathcal{A}}(n)\, e^{-n\epsilon^2/16},$$
where $P^n$ in the above inequality is the probability over the pairs of samples $(S_1,S_2)$ induced by the sample-generating distributions $(P_1,P_2)$.

Proof: Because $\phi_{\mathcal{A}}(\cdot,\cdot)$ is a metric ([5]), we have
$$\phi_{\mathcal{A}}(P_1,P_2) \le \phi_{\mathcal{A}}(P_1,S_1) + \phi_{\mathcal{A}}(S_1,S_2) + \phi_{\mathcal{A}}(S_2,P_2)$$
and
$$\phi_{\mathcal{A}}(S_1,S_2) \le \phi_{\mathcal{A}}(S_1,P_1) + \phi_{\mathcal{A}}(P_1,P_2) + \phi_{\mathcal{A}}(P_2,S_2).$$
Therefore,
$$|\phi_{\mathcal{A}}(P_1,P_2) - \phi_{\mathcal{A}}(S_1,S_2)| \le \phi_{\mathcal{A}}(P_1,S_1) + \phi_{\mathcal{A}}(S_2,P_2),$$
and
$$\Pr\{|\phi_{\mathcal{A}}(P_1,P_2) - \phi_{\mathcal{A}}(S_1,S_2)| > \epsilon\} \le \Pr\{\phi_{\mathcal{A}}(P_1,S_1) + \phi_{\mathcal{A}}(P_2,S_2) > \epsilon\} \quad (6)$$
$$\le \Pr\{\phi_{\mathcal{A}}(P_1,S_1) > \epsilon/2\} + \Pr\{\phi_{\mathcal{A}}(P_2,S_2) > \epsilon/2\} \quad (7)$$
$$\le 16\Pi_{\mathcal{A}}(n)\, e^{-n\epsilon^2/16}, \quad (8)$$
where the last inequality comes from Theorem 3.4.

REFERENCES

[1] B. Brodsky and B. Darkovsky, Non-Parametric Methods in Change-Point Problems, Kluwer Academic, The Netherlands.
[2] J. Shao, Mathematical Statistics, Springer.
[3] L. Gyorfi, Principles of Nonparametric Learning, Springer Wien New York, 2002.
[4] M. Anthony and J. Shawe-Taylor, "A result of Vapnik with applications," Discrete Applied Mathematics, vol. 47, 1993.
[5] S. Ben-David, J. Gehrke and D. Kifer, "Detecting Change in Data Streams," in Proc. 2004 VLDB Conference, Toronto, Canada, 2004.
[6] V. N. Vapnik and A. Ya. Chervonenkis, "On the uniform convergence of relative frequencies of events to their probabilities," Theory of Probability and its Applications, vol. 16, pp. 264-280, 1971.
