Learning Theory Continued


Machine Learning CSE446, Carlos Guestrin, University of Washington, May 13, 2013

A simple setting
- Classification: N data points, and a finite number of possible hypotheses (e.g., decision trees of depth d)
- A learner finds a hypothesis h that is consistent with the training data, i.e., it gets zero training error: $error_{train}(h) = 0$
- What is the probability that h has more than ε true error, $error_{true}(h) \ge \epsilon$?

Generalization error in finite hypothesis spaces [Haussler '88]
- Theorem: for a finite hypothesis space H, a dataset D with N i.i.d. samples, and 0 < ε < 1, any learned hypothesis h that is consistent with the training data satisfies
  $P(error_{true}(h) > \epsilon) \le |H|\, e^{-N\epsilon}$

Limitations of the Haussler '88 bound
- It requires a consistent classifier: $P(error_{true}(h) > \epsilon) \le |H|\, e^{-N\epsilon}$
- It depends on the size of the hypothesis space |H|
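To get a feel for the bound, set $|H|\, e^{-N\epsilon} \le \delta$ and solve for N, which gives $N \ge \frac{1}{\epsilon}\left(\ln|H| + \ln\frac{1}{\delta}\right)$. A minimal Python sketch of this calculation (the particular numbers are arbitrary illustrations, not from the lecture):

```python
import math

def haussler_sample_complexity(H_size, eps, delta):
    """Smallest N with |H| * exp(-N * eps) <= delta: enough i.i.d. samples so that
    any hypothesis consistent with the training data has true error <= eps
    with probability at least 1 - delta."""
    return math.ceil((math.log(H_size) + math.log(1.0 / delta)) / eps)

# Example: a million hypotheses, true error at most 0.1 with probability at least 0.95.
print(haussler_sample_complexity(H_size=1e6, eps=0.1, delta=0.05))  # 169
```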

What if our classifier does not have zero error on the training data?
- A learner with zero training error may still make mistakes on the test set
- What about a learner with training error $error_{train}(h) > 0$?

Generalization bound for |H| hypotheses
- Theorem: for a finite hypothesis space H, a dataset D with N i.i.d. samples, and 0 < ε < 1, any learned hypothesis h satisfies
  $P(error_{true}(h) - error_{train}(h) > \epsilon) \le |H|\, e^{-2N\epsilon^2}$

PAC bound and the bias-variance tradeoff
- $P(error_{true}(h) - error_{train}(h) > \epsilon) \le |H|\, e^{-2N\epsilon^2}$
- or, after moving some terms around, with probability at least 1 - δ:
  $error_{true}(h) \le error_{train}(h) + \sqrt{\dfrac{\ln|H| + \ln\frac{1}{\delta}}{2N}}$
- Important: the PAC bound holds for all h, but it doesn't guarantee that the algorithm finds the best h!!!

What about the size of the hypothesis space?
- Solving the bound for the number of samples: $N \ge \dfrac{\ln|H| + \ln\frac{1}{\delta}}{2\epsilon^2}$
- How large is the hypothesis space?
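To see how these quantities scale, the sketch below evaluates the confidence term and the corresponding sample-size requirement; the particular |H|, N, ε, and δ are arbitrary illustrations:

```python
import math

def generalization_gap(H_size, N, delta):
    """sqrt((ln|H| + ln(1/delta)) / (2N)), the term added to the training error."""
    return math.sqrt((math.log(H_size) + math.log(1 / delta)) / (2 * N))

def samples_needed(H_size, eps, delta):
    """Smallest N that makes the gap above at most eps."""
    return math.ceil((math.log(H_size) + math.log(1 / delta)) / (2 * eps ** 2))

print(generalization_gap(H_size=1e6, N=10_000, delta=0.05))  # about 0.029
print(samples_needed(H_size=1e6, eps=0.1, delta=0.05))       # 841
```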

Boolean formulas with m binary features
- $N \ge \dfrac{\ln|H| + \ln\frac{1}{\delta}}{2\epsilon^2}$

Number of decision trees of depth k
- Recursive solution, given m attributes. Let $H_k$ be the number of decision trees of depth k:
  $H_0 = 2$
  $H_{k+1}$ = (# choices of root attribute) × (# possible left subtrees) × (# possible right subtrees) = $m \cdot H_k \cdot H_k$
- Write $L_k = \log_2 H_k$:
  $L_0 = 1$, $L_{k+1} = \log_2 m + 2 L_k$, so $L_k = (2^k - 1)(1 + \log_2 m) + 1$
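A short sketch that checks the closed form against the recursion (the choice of m and the depth range are arbitrary illustrations):

```python
import math

m = 10  # number of attributes (arbitrary)

# Recursion H_0 = 2, H_{k+1} = m * H_k^2, tracked as L_k = log2(H_k) to avoid huge integers.
L = [1.0]  # L_0 = log2(2) = 1
for _ in range(8):
    L.append(math.log2(m) + 2 * L[-1])

for k, Lk in enumerate(L):
    closed_form = (2 ** k - 1) * (1 + math.log2(m)) + 1
    assert abs(Lk - closed_form) < 1e-6
    print(f"depth {k}: log2 |H_k| = {Lk:.1f}")
```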

PAC bound for decision trees of depth k
- $N \ge \dfrac{2^k \ln m + \ln\frac{1}{\delta}}{2\epsilon^2}$
- Bad!!! The number of points needed is exponential in the depth!
- But for N data points the decision tree can't get too big: the number of leaves is never more than the number of data points

Number of decision trees with k leaves
- The number of decision trees of depth k is really, really big: $\ln|H|$ is about $2^k \ln m$
- Decision trees with up to k leaves: $|H|$ is about $m^k k^{2k}$
- A very loose bound

PAC bound for decision trees with k leaves: bias-variance revisited
- $\ln|H_{k\text{ leaves}}| \le 2k(\ln m + \ln k)$
- Plugging into $error_{true}(h) \le error_{train}(h) + \sqrt{\dfrac{\ln|H| + \ln\frac{1}{\delta}}{2N}}$ gives
  $error_{true}(h) \le error_{train}(h) + \sqrt{\dfrac{2k(\ln m + \ln k) + \ln\frac{1}{\delta}}{2N}}$

What did we learn from decision trees?
- The bias-variance tradeoff, formalized:
  $error_{true}(h) \le error_{train}(h) + \sqrt{\dfrac{2k(\ln m + \ln k) + \ln\frac{1}{\delta}}{2N}}$
- Moral of the story: the complexity of learning is measured not by the size of the hypothesis space, but by the maximum number of points that allows consistent classification
- Complexity of N points: no bias, lots of variance. Lower than N: some bias, less variance
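To make the tradeoff concrete, the sketch below evaluates this bound's confidence term for a few tree sizes; m, N, and δ are arbitrary illustrations:

```python
import math

def dt_gap(k, m, N, delta):
    """Confidence term of the PAC bound for decision trees with k leaves over m features."""
    return math.sqrt((2 * k * (math.log(m) + math.log(k)) + math.log(1 / delta)) / (2 * N))

m, N, delta = 20, 5_000, 0.05
for k in (2, 8, 32, 128, 512):
    print(f"k = {k:3d} leaves: gap <= {dt_gap(k, m, N, delta):.3f}")
```

More leaves shrink the training error but inflate the confidence term, which is exactly the bias-variance tradeoff stated above.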

What about continuous hypothesis spaces?
- $error_{true}(h) \le error_{train}(h) + \sqrt{\dfrac{\ln|H| + \ln\frac{1}{\delta}}{2N}}$
- A continuous hypothesis space has $|H| = \infty$. Infinite variance???
- As with decision trees, we only care about the maximum number of points that can be classified exactly! This is called the VC dimension (see the readings for details)

What you need to know
- Finite hypothesis spaces: derive the results by counting the number of hypotheses, with and without mistakes on the training data
- The complexity of the classifier depends on the number of points that can be classified exactly: the finite case (decision trees) and the infinite case (VC dimension)
- The bias-variance tradeoff in learning theory
- Remember: will your algorithm find the best classifier?

Clustering: K-means
Machine Learning CSE446, Carlos Guestrin, University of Washington, May 13, 2013

Clustering images
- A set of images [Goldberger et al.]

Clustering web search results

Some data

K-means
1. Ask the user how many clusters they'd like (e.g., k = 5).
2. Randomly guess k cluster center locations.
3. Each data point finds out which center it's closest to (thus each center "owns" a set of data points).
4. Each center finds the centroid of the points it owns...
5. ...and jumps there.
6. Repeat until terminated!

K-means
- Randomly initialize k centers: $\mu^{(0)} = \mu_1^{(0)}, \dots, \mu_k^{(0)}$
- Classify: assign each point $j \in \{1, \dots, N\}$ to the nearest center
- Recenter: $\mu_i$ becomes the centroid of its points; equivalently, $\mu_i$ is the average of its points
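A minimal NumPy sketch of the algorithm just described; the toy data, the choice of k, and the stopping rule (stop when assignments no longer change) are arbitrary illustrations, not the lecture's code:

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    """Plain K-means: alternate the classify and recenter steps until assignments stop changing."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)].copy()  # random initialization
    assignments = None
    for _ in range(n_iters):
        # Classify: assign each point to its nearest center.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        new_assignments = dists.argmin(axis=1)
        if assignments is not None and np.array_equal(new_assignments, assignments):
            break  # assignments stopped changing, so the potential can no longer decrease
        assignments = new_assignments
        # Recenter: each center moves to the centroid (average) of the points it owns.
        for i in range(k):
            if np.any(assignments == i):
                centers[i] = X[assignments == i].mean(axis=0)
    return centers, assignments

# Toy data: two Gaussian blobs, one around (0, 0) and one around (4, 4).
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, size=(150, 2)), rng.normal(4, 1, size=(150, 2))])
centers, labels = kmeans(X, k=2)
print(centers)
```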

What is K-means optimizing?
- A potential function F(µ, C) of the centers µ and the point allocations C:
  $F(\mu, C) = \sum_{j=1}^{N} \left\| \mu_{C(j)} - x_j \right\|^2$
- Optimal K-means: $\min_{\mu} \min_{C} F(\mu, C)$

Does K-means converge??? Part 1
- Optimize the potential function: fix µ, optimize C

Does K-means converge??? Part 2
- Optimize the potential function: fix C, optimize µ

Coordinate descent algorithms
- Want: $\min_a \min_b F(a, b)$
- Coordinate descent: fix a, minimize over b; fix b, minimize over a; repeat
- Converges!!! (if F is bounded) to an (often good) local optimum
  - as we saw in the applet (play with it!)
  - (for LASSO it converged to the optimum)
- K-means is a coordinate descent algorithm!
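A tiny sketch of the coordinate descent pattern on a two-variable quadratic; the function, its closed-form coordinate minimizers, and the iteration count are arbitrary illustrations:

```python
def coordinate_descent(f, argmin_a, argmin_b, a, b, n_iters=50):
    """Alternately minimize f over one coordinate while holding the other fixed."""
    for _ in range(n_iters):
        a = argmin_a(b)          # fix b, minimize over a
        b = argmin_b(a)          # fix a, minimize over b
    return a, b, f(a, b)

# F(a, b) = (a - 1)^2 + (b + 2)^2 + a*b has closed-form coordinate minimizers.
f = lambda a, b: (a - 1) ** 2 + (b + 2) ** 2 + a * b
argmin_a = lambda b: (2 - b) / 2      # from dF/da = 2(a - 1) + b = 0
argmin_b = lambda a: (-4 - a) / 2     # from dF/db = 2(b + 2) + a = 0
print(coordinate_descent(f, argmin_a, argmin_b, a=0.0, b=0.0))
```

In K-means the two "coordinates" are the allocations C and the centers µ, and each of the two partial minimizations has the closed form from the slides: nearest center, then centroid.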

How many points can a linear boundary classify exactly? (1-D)

How many points can a linear boundary classify exactly? (2-D)

How many points can a linear boundary classify exactly? (d-D)
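These questions were worked through on the board. One way to experiment with the 2-D case is to check, for a small point set, whether every labeling is achievable by some linear boundary; strict linear separability can be tested exactly with a feasibility LP. A minimal sketch (the point sets and the use of scipy are illustrative assumptions, not the lecture's material):

```python
import itertools
import numpy as np
from scipy.optimize import linprog

def linearly_separable(X, y):
    """Is there (w, b) with y_i * (w . x_i + b) >= 1 for all i?  (Feasibility LP.)"""
    n, d = X.shape
    # Variables: [w (d entries), b].  Constraint rows: -y_i * (w . x_i + b) <= -1.
    A_ub = -y[:, None] * np.hstack([X, np.ones((n, 1))])
    res = linprog(c=np.zeros(d + 1), A_ub=A_ub, b_ub=-np.ones(n),
                  bounds=[(None, None)] * (d + 1), method="highs")
    return res.success

def shattered(X):
    """True if every +/-1 labeling of the rows of X is linearly separable."""
    return all(linearly_separable(X, np.array(labels))
               for labels in itertools.product([-1, 1], repeat=len(X)))

three = np.array([[0, 0], [1, 0], [0, 1]])            # 3 points in general position
four  = np.array([[0, 0], [1, 1], [0, 1], [1, 0]])    # XOR configuration
print(shattered(three))  # True: 3 points in general position can be shattered in 2-D
print(shattered(four))   # False: no set of 4 points in 2-D can be shattered
```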

PAC bound using VC dimension
- The number of training points that can be classified exactly is the VC dimension!!!
- It measures the relevant size of the hypothesis space, as with decision trees with k leaves

Shattering a set of points

VC dimension

PAC bound using VC dimension
- The number of training points that can be classified exactly is the VC dimension!!!
- It measures the relevant size of the hypothesis space, as with decision trees with k leaves
- Bound for infinite-dimensional hypothesis spaces:
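A standard form of this bound (the exact constants vary across sources, so treat this as a representative statement rather than the slide's exact formula) is:

$$error_{true}(h) \le error_{train}(h) + \sqrt{\dfrac{VC(H)\left(\ln\frac{2N}{VC(H)} + 1\right) + \ln\frac{4}{\delta}}{N}}$$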

Examples of VC dimension
- Linear classifiers: VC(H) = d + 1, for d features plus the constant term b
- Neural networks: VC(H) = # of parameters (but local minima mean NNs will probably not find the best parameters)
- 1-Nearest neighbor?

Another VC dimension example: what can we shatter?
- What's the VC dimension of decision stumps in 2-D?
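One way to explore this question is to enumerate every labeling a 2-D decision stump (pick an axis, a threshold, and a sign) can produce on a small point set and check whether all $2^n$ labelings appear. A minimal sketch; the point set is an arbitrary illustration:

```python
import itertools
import numpy as np

def stump_labelings(X):
    """All distinct +/-1 labelings that decision stumps (axis, threshold, sign) can produce on X."""
    n, d = X.shape
    labelings = set()
    for axis in range(d):
        vals = np.unique(X[:, axis])
        # thresholds between consecutive values, plus one below and one above everything
        thresholds = np.concatenate([[vals[0] - 1], (vals[:-1] + vals[1:]) / 2, [vals[-1] + 1]])
        for t in thresholds:
            for sign in (-1, 1):
                labelings.add(tuple(sign * np.where(X[:, axis] > t, 1, -1)))
    return labelings

def shattered_by_stumps(X):
    return len(stump_labelings(X)) == 2 ** len(X)

three = np.array([[1.0, 3.0], [2.0, 1.0], [3.0, 2.0]])  # different orderings along x and y
print(shattered_by_stumps(three))
```

Trying a few 4-point sets the same way shows which labelings a single stump cannot realize.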

Another VC dimension example: what can't we shatter?
- What's the VC dimension of decision stumps in 2-D?

What you need to know
- Finite hypothesis spaces: derive the results by counting the number of hypotheses, with and without mistakes on the training data
- The complexity of the classifier depends on the number of points that can be classified exactly: the finite case (decision trees) and the infinite case (VC dimension)
- The bias-variance tradeoff in learning theory
- Remember: will your algorithm find the best classifier?