
1 Learning Theory
Sridhar Mahadevan
University of Massachusetts

2 Topics
- Probability theory meets machine learning
- Concentration inequalities: Chebyshev, Chernoff, Hoeffding, and Markov
- Bounds on generalization error
- Structural risk minimization
- Growth functions and VC dimension
- The VC theorem: the most important theorem in machine learning

3 Generalization
The essence of learning is generalization. This requires making inferences about a population from a sample.
Example: In the class poll, 8 out of 21 students said they could make the class on Wednesday. What inferences can we make for the entire class?

4 Polling
In the 2012 Presidential election, the New York Times 538 political analysis column predicted President Obama would be reelected with 98% probability and win 303 electoral college votes. President Obama was reelected with 332 electoral college votes.
On November 3rd, 1948, most news organizations incorrectly predicted that President Truman would be defeated by challenger Dewey.

5 Incorrect Polling (figure only)

6 The Law of Averages
If you toss a coin a large number of times, which of the following statements is true?
- If you get a lot of heads, then tails should start coming up.
- The number of heads should gradually get closer to the number of tails.
- The chance error (the difference between the number of heads and half the number of tosses) increases.
- The difference between the percentage of heads and 50% decreases.

7 Borel-Cantelli Lemma
The most famous theorem in probability theory!
Theorem: If $\{A_n, n \geq 1\}$ is a sequence of events such that
$\sum_{n=1}^{\infty} P(A_n) < \infty$
then it follows that
$P(A_n \text{ i.o.}) = 0$

8 Law of Large Numbers
Let $X_n$ be a sequence of random variables, and $S_n = \sum_{i=1}^{n} X_i$. Then
$\frac{S_n - E(S_n)}{n} \to 0$
Bernoulli (1713): Let the $X_i$ be Bernoulli distributed random variables with success probability $p$. For $\epsilon > 0$,
$\lim_{n \to \infty} P\left(\left|\frac{S_n}{n} - p\right| \geq \epsilon\right) = 0$
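The convergence of $S_n/n$ is easy to see empirically. A minimal sketch in Python (numpy assumed; the success probability is set to the class-poll proportion purely as an example):

```python
import numpy as np

rng = np.random.default_rng(0)
p = 8 / 21  # success probability, e.g. the class-poll proportion

# Simulate a long sequence of Bernoulli trials and track the running mean S_n / n.
flips = rng.binomial(1, p, size=100_000)
running_mean = np.cumsum(flips) / np.arange(1, len(flips) + 1)

for n in [10, 100, 1_000, 10_000, 100_000]:
    print(f"n = {n:>6}:  S_n/n = {running_mean[n - 1]:.4f}   (p = {p:.4f})")
```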

9 Chebyshev Inequality
Let $X$ be a random variable and $g(x)$ be any non-negative function. Then, for any $r > 0$:
$P(g(X) \geq r) \leq \frac{E(g(X))}{r}$

10 Chebyshev Inequality
Proof:
$E[g(X)] = \int g(x) f_X(x)\,dx \geq \int_{x: g(x) \geq r} g(x) f_X(x)\,dx \geq r \int_{x: g(x) \geq r} f_X(x)\,dx = r\, P(g(X) \geq r)$

11 Example
Let $X$ be a random variable with mean $E(X) = \mu$ and variance $Var(X) = \sigma^2$. Then
$P\left(\frac{(X-\mu)^2}{\sigma^2} \geq t^2\right) \leq \frac{1}{t^2}$
$P(|X - \mu| \geq t\sigma) \leq \frac{1}{t^2}$
$P(|X - \mu| < 2\sigma) \geq \frac{3}{4}$
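A quick numerical check of the second inequality (numpy assumed; the exponential distribution is just an illustrative choice, with mean and standard deviation both equal to 1):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.exponential(scale=1.0, size=1_000_000)   # mean 1, std 1
mu, sigma = x.mean(), x.std()

for t in [1.5, 2.0, 3.0]:
    empirical = np.mean(np.abs(x - mu) >= t * sigma)  # P(|X - mu| >= t*sigma)
    print(f"t = {t}: empirical tail = {empirical:.4f},  Chebyshev bound 1/t^2 = {1 / t**2:.4f}")
```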

12 Markov Inequality
Let $X$ be a nonnegative random variable. Then for any value $r > 0$, it follows that
$P(X \geq r) \leq \frac{E(X)}{r}$
Proof: Straightforward consequence of Chebyshev's inequality (take $g(x) = x$).

13 Chernoff Bounds
In a seminal paper in 1952, Chernoff introduced a powerful approach to proving concentration inequalities.
Chernoff bounds have been used widely in computer science (e.g., probabilistic algorithms, large graphs).
The most important example of a Chernoff bound is the Hoeffding inequality.

14 Chernoff Bounds
Let $X_i$, $1 \leq i \leq n$, be a set of mutually independent random variables such that
$P(X_i = +1) = P(X_i = -1) = \frac{1}{2}$
Let $S_n = \sum_i X_i$. Then, it follows that for any $a > 0$:
$P(S_n > a) < e^{-a^2/2n}$

15 Chernoff's Method
First, note a trivial consequence of the Markov inequality:
$P(X > \alpha E(X)) < \frac{1}{\alpha}$
Exponentiate what is to be proved and use Markov's inequality:
$P(S_n > a) = P(e^{\lambda S_n} > e^{\lambda a}) < \frac{E(e^{\lambda S_n})}{e^{\lambda a}}$
for some arbitrary $\lambda > 0$ (to be selected later to tighten the bound).

16 Chernoff's Method
Note that
$E(e^{\lambda X_i}) = \frac{e^{\lambda} + e^{-\lambda}}{2} = \cosh(\lambda)$
Hence, it follows that
$E(e^{\lambda S_n}) = E(e^{\lambda \sum_i X_i}) = \prod_i E(e^{\lambda X_i}) = \cosh^n(\lambda) \leq e^{n\lambda^2/2}$
because $\cosh(\lambda) \leq e^{\lambda^2/2}$.
Finally, we get
$P(S_n > a) < \frac{e^{n\lambda^2/2}}{e^{\lambda a}} = e^{n\lambda^2/2 - \lambda a}$
Crucial last step: optimize $\lambda$ to tighten the bound.

17 Chernoff's Method
Recap: here is what we have so far:
$P(S_n > a) < \frac{e^{n\lambda^2/2}}{e^{\lambda a}} = e^{n\lambda^2/2 - \lambda a}$
Select $\lambda > 0$ to make the bound as tight as possible. Find the gradient:
$\frac{d}{d\lambda}\left(\frac{n\lambda^2}{2} - \lambda a\right) = n\lambda - a = 0$
which gives $\lambda = \frac{a}{n}$. Putting this value of $\lambda$ above, we get our final bound:
$P(S_n > a) < e^{-a^2/2n}$
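A Monte Carlo sanity check of this final bound for sums of fair ±1 coin flips; a sketch assuming numpy:

```python
import numpy as np

rng = np.random.default_rng(2)
n, trials = 100, 200_000

# S_n is a sum of n independent variables, each +1 or -1 with probability 1/2.
signs = rng.choice([-1, 1], size=(trials, n))
s = signs.sum(axis=1)

for a in [10, 20, 30]:
    empirical = np.mean(s > a)
    bound = np.exp(-a**2 / (2 * n))
    print(f"a = {a}: P(S_n > a) ~ {empirical:.4f}   Chernoff bound = {bound:.4f}")
```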

18 Hoeffding Inequality
Let $X_i$, $i = 1,\ldots,n$, be independent samples with $a_i \leq X_i \leq b_i$. Let $S_n = \sum_i X_i$. Then:
$P(S_n - E S_n \geq \epsilon) \leq e^{-2\epsilon^2 / \sum_i (b_i - a_i)^2}$
$P(S_n - E S_n \leq -\epsilon) \leq e^{-2\epsilon^2 / \sum_i (b_i - a_i)^2}$

19 Proof Sketch
As before, we use the Chernoff bounding method and Markov's inequality:
$P(S_n - E S_n \geq \epsilon) = P(e^{\lambda(S_n - E S_n)} \geq e^{\lambda\epsilon}) \leq \frac{\prod_i E(e^{\lambda(X_i - E X_i)})}{e^{\lambda\epsilon}}$
We also use the following inequality (Hoeffding's lemma): for a zero-mean random variable $X$ with $a \leq X \leq b$,
$E(e^{\lambda X}) \leq e^{\lambda^2 (b - a)^2 / 8}$
This follows from the convexity of the exponential function.
Finally, we tune $\lambda$ above to tighten the bound as much as possible.

20 Union Bound
Given a set of (not necessarily independent) events $A_1, \ldots, A_k$:
$P(A_1 \cup \cdots \cup A_k) \leq P(A_1) + \cdots + P(A_k)$
This is very useful in converting one-sided bounds into two-sided bounds:
$P(|S_n - E S_n| \geq \epsilon) \leq 2 e^{-2\epsilon^2 / \sum_i (b_i - a_i)^2}$
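A tiny numerical illustration (numpy assumed): the two events below are highly dependent, and the union bound still holds; it is simply loose.

```python
import numpy as np

rng = np.random.default_rng(5)
z = rng.normal(size=1_000_000)

a1 = z > 1.0           # event A1
a2 = z > 1.5           # event A2 (a subset of A1, so strongly dependent)

print("P(A1 or A2)   ~", np.mean(a1 | a2))
print("P(A1) + P(A2) ~", np.mean(a1) + np.mean(a2))   # union bound: always an upper bound
```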

21 Hoeffding Inequality
Let $X_i$, $i = 1,\ldots,n$, be i.i.d. samples from a Bernoulli distribution with probability of success $p$. Let $\hat{p} = \frac{1}{n} \sum_i X_i$.
Theorem: In this special case, the Hoeffding inequality states that for any $\epsilon > 0$
$P(|p - \hat{p}| > \epsilon) \leq 2 e^{-2\epsilon^2 n}$
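The bound can be inverted to ask how many samples are needed before $\hat{p}$ is reliable. A small sketch in plain Python (the helper names are illustrative):

```python
import math

def hoeffding_bound(n, eps):
    """Upper bound on P(|p - p_hat| > eps) after n Bernoulli samples."""
    return 2 * math.exp(-2 * eps**2 * n)

def samples_needed(eps, delta):
    """Smallest n guaranteeing P(|p - p_hat| > eps) <= delta, from 2 exp(-2 eps^2 n) <= delta."""
    return math.ceil(math.log(2 / delta) / (2 * eps**2))

print(hoeffding_bound(n=21, eps=0.1))        # class poll: 21 students, +/- 0.1 accuracy
print(samples_needed(eps=0.05, delta=0.05))  # n needed for 5% error at 95% confidence
```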

22 Class Poll Revisited
(figure: plot of probability versus error, comparing the class poll with the Hoeffding bound)

23 Classification Error
Given a hypothesis $h \in H$ approximating some true target function $f$, we can define its training error $\hat{\epsilon}(h)$ on a dataset $D$ as
$\hat{\epsilon}(h) = \frac{1}{|D|} \sum_{(x,y) \in D} \mathbf{1}(h(x) \neq y)$
The true error of $h$ given any (unknown) distribution $P$ on the entire (unseen) data space $X$ is given as
$\epsilon(h) = P_{(x,y) \sim P}(h(x) \neq y)$
How does training error relate to true error?
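To make the two definitions concrete, a small sketch (numpy assumed; the stump h and target f below are purely illustrative): the training error is an average over the finite sample D, while the true error can only be approximated by drawing a very large fresh sample from P.

```python
import numpy as np

rng = np.random.default_rng(3)

def h(x):
    """An illustrative hypothesis: a threshold (decision stump) on a 1-D input."""
    return np.where(x > 0.5, 1, -1)

def f(x):
    """The (unknown) target function, used here only to generate labels."""
    return np.where(x > 0.6, 1, -1)

def error(hyp, x, y):
    """Fraction of examples on which hyp disagrees with the label."""
    return np.mean(hyp(x) != y)

x_train = rng.uniform(0, 1, size=50)          # small training set D
x_big = rng.uniform(0, 1, size=1_000_000)     # stand-in for the whole distribution P

print("training error:", error(h, x_train, f(x_train)))
print("approx. true error:", error(h, x_big, f(x_big)))   # should be close to 0.1
```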

24 Generalization Error
From the Hoeffding inequality, it follows that
$P(|\epsilon(h) - \hat{\epsilon}(h)| > \epsilon) \leq 2 e^{-2\epsilon^2 n}$
How do we generalize this over all possible hypotheses in our hypothesis space $H$?
For finite hypothesis spaces, we can use the union bound:
$P(A_1 \cup \cdots \cup A_k) \leq P(A_1) + \cdots + P(A_k)$
if $A_1, \ldots, A_k$ are $k$ different events (which may not be independent).

25 Generalization Error
Let us define $A_i$ as the event that $|\epsilon(h_i) - \hat{\epsilon}(h_i)| > \epsilon$.
So, the probability that some hypothesis in our space of hypotheses has unacceptable error is bounded by
$P(\exists h \in H: |\epsilon(h) - \hat{\epsilon}(h)| > \epsilon) \leq \sum_{i=1}^{k} P(A_i) \leq 2k\, e^{-2\epsilon^2 n}$
Here, $|H| = k$ (our space of hypotheses is finite).

26 Number of Examples Needed
If we set the reliability at $1 - \delta$, for some $\delta \in (0,1)$, then equating $\delta = 2k\, e^{-2\epsilon^2 n}$, we can compute the number of examples needed for reliable learning as
$n \geq \frac{1}{2\epsilon^2} \ln \frac{2|H|}{\delta}$
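A small sketch of this sample-size formula in plain Python (the helper name is illustrative); note the logarithmic dependence on |H| and the 1/ε² dependence on the accuracy:

```python
import math

def samples_for_finite_class(h_size, eps, delta):
    """n >= (1 / (2 eps^2)) * ln(2 |H| / delta), from delta = 2 |H| exp(-2 eps^2 n)."""
    return math.ceil(math.log(2 * h_size / delta) / (2 * eps**2))

# Increasing |H| by a factor of 1000 only adds a logarithmic amount to n;
# halving eps quadruples n.
for h_size in [10, 1_000, 1_000_000]:
    print(h_size, samples_for_finite_class(h_size, eps=0.05, delta=0.05))
```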

27 Generalization Error Bound
For fixed reliability $1 - \delta$ and hypothesis space $H$, we can compute the generalization error as
$|\epsilon(h) - \hat{\epsilon}(h)| \leq \sqrt{\frac{1}{2n} \ln \frac{2|H|}{\delta}}$
Note that this only holds for finite hypothesis spaces, where $|H| < \infty$.
For the infinite case, which is more common in machine learning, we need to introduce the concept of growth functions.

28 Finding Consistent Hypotheses
Class             Example                                   Complexity
Conjunctions      A_1 ∧ A_2                                 Polynomial
k-DNF             (A_1 ∧ A_3) ∨ (A_4 ∧ A_2) ∨ ...           Polynomial
k-CNF             (A_1 ∨ A_3) ∧ (A_4 ∨ A_2) ∧ ...           Polynomial
3-layer NN        Feedforward net                           NP-hard
k-Decision List   Simple decision tree                      Polynomial
LTU               Perceptron                                Polynomial
k-term DNF        Term_1 ∨ Term_2                           NP-hard
For polynomial-time learning, we also need to take into account the complexity of finding a low-error (or zero-training-error) hypothesis.

29 PAC Results for Concept Classes
Class         Sample complexity   PAC-time learnable?
Pure conj.    Polynomial          Polynomial
k-CNF         Polynomial          Polynomial
k-term DNF    Polynomial          No, unless NP = RP
Interesting fact: k-term DNF ⊆ k-CNF. So, learning a larger set can be easier than learning a smaller set!

30 Structural Risk Minimization
With probability $1 - \delta$, the true error of a learned classifier $\hat{h}$ can be bounded as
$\epsilon(\hat{h}) \leq \left(\min_{h \in H} \epsilon(h)\right) + 2\sqrt{\frac{1}{2n} \ln \frac{2|H|}{\delta}}$
This follows from the Hoeffding bound, because
$\epsilon(\hat{h}) \leq \hat{\epsilon}(\hat{h}) + \epsilon \leq \hat{\epsilon}(h^*) + \epsilon \leq \epsilon(h^*) + 2\epsilon$
where $h^* = \arg\min_{h \in H} \epsilon(h)$ and $\hat{h}$ minimizes the training error.

31 Structural Risk Minimization
With probability $1 - \delta$, the true error of a learned classifier $\hat{h}$ can be bounded as
$\epsilon(\hat{h}) \leq \left(\min_{h \in H} \epsilon(h)\right) + 2\sqrt{\frac{1}{2n} \ln \frac{2|H|}{\delta}}$
If we switch from a less expressive hypothesis space to a more expressive one, then the bias error due to the first term may decrease. However, the variance error due to the second term increases!

32 Dichotomies
How do we extend the previous analysis to infinite hypothesis spaces (e.g., hyperplanes, polynomial functions, etc.)?
Let us assume a hypothesis $h \in H$ maps each example $x_1, \ldots, x_n \in D$ to the set $\{-1, +1\}$. The dichotomies defined by $H$ on $D$ are
$H(x_1, \ldots, x_n) = \{ (h(x_1), \ldots, h(x_n)) : h \in H \}$
The growth function $m_H(n)$ is defined as the maximum number of dichotomies generated by $H$ on any dataset $X$ of size $n$.

33 Examples of Growth Functions
- Consider a dataset $X$ of $n$ points in the plane $\mathbb{R}^2$ and the hypothesis space $H$ of all lines in $\mathbb{R}^2$. What is $m_H(3)$?
- What is the growth function $m_H(4)$ for the space of hypotheses defined by perceptrons on the Euclidean space $\mathbb{R}^2$?
- Let $H$ be defined as $h : \mathbb{R} \to \{-1,+1\}$ of the form $h(x) = \text{sgn}(x - a)$, where $\text{sgn}(x) = -1$ if $x < 0$ and $+1$ otherwise. What is $m_H(n)$? (A brute-force check appears below.)
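The third example can be checked by brute force: for the 1-D thresholds $h(x) = \text{sgn}(x - a)$, one candidate threshold per gap between the sorted points is enough to enumerate every dichotomy. A sketch assuming numpy:

```python
import numpy as np

def threshold_dichotomies(points):
    """Distinct dichotomies of h_a(x) = +1 if x >= a else -1, over all thresholds a."""
    pts = np.sort(np.asarray(points, dtype=float))
    # One representative threshold below all points, one in each gap, one above all points.
    candidates = np.concatenate(([pts[0] - 1.0], (pts[:-1] + pts[1:]) / 2, [pts[-1] + 1.0]))
    return {tuple(np.where(pts >= a, 1, -1)) for a in candidates}

rng = np.random.default_rng(4)
for n in [1, 2, 3, 5, 8]:
    pts = rng.uniform(0, 1, size=n)
    print(f"n = {n}: {len(threshold_dichotomies(pts))} dichotomies  (n + 1 = {n + 1})")
```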

34 Shattering and Breakpoints
A set $X$ with $|X| = n$ is said to be shattered by $H$ if $m_H(n) = 2^n$; that is, the hypothesis space $H$ produces all possible $2^n$ dichotomies on the dataset $X$ of $n$ examples.
If no dataset $X$ of size $n$ can be shattered, $n$ is called a breakpoint for $H$.
Example: What is the breakpoint of the hypothesis space of perceptrons on the plane?

35 Tree-Structured Attributes
(figure: a tree of shape attributes)
any_shape
  convex
    regular_polygon: triangle, hexagon, square
    ellipse: circle, proper_ellipse
  non_convex: crescent, channel
What is the breakpoint of the above hypothesis space H?

36 VC Dimension
The VC dimension (for Vapnik-Chervonenkis) $VC(H)$ of a hypothesis space $H$ is defined as the largest $n$ for which the growth function $m_H(n) = 2^n$; that is, the largest $n$ for which some dataset $X$ of $n$ examples is shattered by $H$.
Equivalently, if $k$ is the smallest breakpoint of $H$ (no dataset of size $k$ can be shattered), then $VC(H) = k - 1$.
Example: What is the VC dimension of the hypothesis space of perceptrons on the plane?
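Whether a particular point set is shattered by perceptrons in the plane can be checked by brute force: for every labeling, test linear separability with a small feasibility LP. A sketch assuming numpy and scipy; note that concluding VC(H) < 4 would require checking every 4-point configuration, not just the one below.

```python
import itertools
import numpy as np
from scipy.optimize import linprog

def separable(points, labels):
    """Is there (w, b) with y_i (w . x_i + b) >= 1 for all i?  (Feasibility LP.)"""
    points, labels = np.asarray(points, float), np.asarray(labels, float)
    # Constraints: -y_i * (w . x_i + b) <= -1, variables z = (w1, w2, b), unbounded.
    a_ub = -labels[:, None] * np.hstack([points, np.ones((len(points), 1))])
    res = linprog(c=[0, 0, 0], A_ub=a_ub, b_ub=-np.ones(len(points)),
                  bounds=[(None, None)] * 3)
    return res.status == 0

def shattered(points):
    """True if perceptrons in the plane realize all 2^n labelings of these points."""
    return all(separable(points, labels)
               for labels in itertools.product([-1, 1], repeat=len(points)))

print(shattered([(0, 0), (1, 0), (0, 1)]))           # 3 points in general position: True
print(shattered([(0, 0), (1, 0), (0, 1), (1, 1)]))   # 4 corners of a square: False (XOR labeling)
```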

37 VC Bounds
Finite case: With probability $1 - \delta$, the true error of a learned classifier $\hat{h}$ can be bounded as
$\epsilon(\hat{h}) \leq \left(\min_{h \in H} \epsilon(h)\right) + 2\sqrt{\frac{1}{2n} \ln \frac{2|H|}{\delta}}$
Infinite case:
$\epsilon(\hat{h}) \leq \left(\min_{h \in H} \epsilon(h)\right) + 2\sqrt{\frac{8}{n} \ln \frac{4\, m_H(2n)}{\delta}}$
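As a rough illustration, the infinite-case bound can be evaluated for the 1-D threshold class of slide 33, whose growth function is m_H(n) = n + 1. The sketch below (plain Python) takes the constants exactly as written in the bound above.

```python
import math

def growth_threshold(n):
    """Growth function of 1-D thresholds h(x) = sgn(x - a): n + 1 dichotomies."""
    return n + 1

def vc_bound_term(n, delta, growth):
    """The extra error term 2 * sqrt((8 / n) * ln(4 * m_H(2n) / delta))."""
    return 2 * math.sqrt((8 / n) * math.log(4 * growth(2 * n) / delta))

# The bound is vacuous for small n and shrinks slowly as n grows.
for n in [100, 1_000, 10_000, 100_000]:
    print(n, round(vc_bound_term(n, delta=0.05, growth=growth_threshold), 4))
```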

38 Vapnik-Chervonenkis (VC) Theorem
We can now state the celebrated VC theorem, the most important theoretical result in machine learning!
For any hypothesis space $H$ with VC dimension $VC(H)$, given a classifier $\hat{h}$ found by training on a finite dataset of $n$ examples, its generalization error can be bounded with probability $1 - \delta$ by
$\epsilon(\hat{h}) \leq \epsilon(h^*) + O\left(\sqrt{\frac{1}{n} \ln \frac{1}{\delta} + \frac{VC(H)}{n} \log \frac{n}{VC(H)}}\right)$
where $h^* = \arg\min_{h \in H} \epsilon(h)$.
