CS 2750 Machine Learning. Lecture 23. Concept Learning

Lecture 23: Concept learning
Milos Hauskrecht, milos@cs.pitt.edu

Concept Learning. Outline:
- Learning boolean functions
- Most general and most specific consistent hypothesis
- Mitchell's version space algorithm
- Probably approximately correct (PAC) learning
- Sample complexity for PAC
- Vapnik-Chervonenkis (VC) dimension
- Improved sample complexity bounds

Learning concepts

Assume objects (examples) described in terms of attributes:

  Sky    AirTemp  Humidity  Wind    Water  Forecast  EnjoySport
  Sunny  Warm     Normal    Strong  Warm   Same      yes
  Rainy  Cold     Normal    Strong  Warm   Change    no

Concept = a set of objects.
Concept learning: given a sample of labeled objects, we want to learn a boolean mapping from objects to T/F identifying an underlying concept, e.g. the EnjoySport concept.
Concept (hypothesis) space H: a restriction on the boolean description of concepts.

Learning concepts

Object (instance) space X, concept (hypothesis) space H ⊆ 2^X.
Assume n binary attributes (e.g. true/false, warm/cold):
- Instance space X: 2^n different objects.
- Concept space H: possible concepts = all possible subsets of objects = 2^(2^n) concepts!
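To make the counting concrete, here is a small Python sketch (not from the lecture) that evaluates these two sizes for a few values of n:

```python
# Sizes of the instance space and of the unrestricted concept space
# for n binary attributes (the counting argument from the slide above).

def space_sizes(n: int):
    """Return (|X|, |H|) for n binary attributes and an unrestricted H."""
    num_instances = 2 ** n              # each attribute is true or false
    num_concepts = 2 ** num_instances   # every subset of X is a possible concept
    return num_instances, num_concepts

for n in (2, 4, 6):
    x_size, h_size = space_sizes(n)
    print(f"n={n}: |X| = {x_size}, |H| = 2^{x_size} = {h_size}")
```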

Learning concepts

Problem: the concept space is too large.
Solution: a restricted hypothesis space H.
Example: conjunctive concepts, e.g. (Sky = Sunny) ∧ (Weather = Cold); there are 3^n possible conjunctive concepts. Why? (Each attribute can be required true, required false, or left out.)
Other restricted spaces:
- 3-CNF (or k-CNF): (a1 ∨ a3 ∨ a7) ∧ (...)
- 3-DNF (or k-DNF): (a1 ∧ a5 ∧ a9) ∨ (...)

Learning concepts

After seeing k examples, the hypothesis space (even if restricted) can still contain many consistent concept hypotheses.
Consistent hypothesis: a concept c that evaluates to T on all positive examples and to F on all negatives.
What to learn?
- General to specific learning: start from all true and refine, keeping the maximal (consistent) generalization.
- Specific to general learning: start from all false and refine, keeping the most restrictive specialization.
- Version space learning: keep all consistent hypotheses around; a combination of the above two cases.

Specific to general learning (for conjunctive concepts)

Assume two hypotheses:
  h1 = (Sunny, ?, ?, Strong, ?, ?)
  h2 = (Sunny, ?, ?, ?, ?, ?)
Then we say that h2 is more general than h1, and h1 is a special case (specialization) of h2. The symbol "?" stands for an arbitrary value.
Specific to general learning: start from the all-false hypothesis h0 = (∅, ∅, ∅, ∅, ∅, ∅); by scanning the samples, gradually refine the hypothesis (make it more general) whenever it does not satisfy the new sample seen (keep the most restrictive specialization consistent with the positives).

Specific to general learning. Example

Conjunctive concepts; the target is a conjunctive concept.
  h = (∅, ∅, ∅, ∅, ∅, ∅)   ... all false
  (Sunny, Warm, Normal, Strong, Warm, Same)   T  →  h = (Sunny, Warm, Normal, Strong, Warm, Same)
  (Rainy, Cold, Normal, Strong, Warm, Change) F  →  h = (Sunny, Warm, Normal, Strong, Warm, Same)
  (Sunny, Warm, High, Strong, Warm, Same)     T  →  h = (Sunny, Warm, ?, Strong, Warm, Same)
  (Sunny, Warm, High, Strong, Cool, Same)     T  →  h = (Sunny, Warm, ?, Strong, ?, Same)
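The update used in this trace can be written as a short program. The following is a minimal FIND-S style sketch (not from the lecture); the names `generalize`, `specific_to_general` and the `EMPTY` marker are illustrative.

```python
# Specific-to-general learning of a conjunctive concept (a FIND-S style sketch).
# EMPTY marks the all-false slot (accepts nothing); "?" accepts any value.

EMPTY = "0"

def generalize(h, x):
    """Minimally generalize hypothesis h so that it covers positive example x."""
    new_h = []
    for h_val, x_val in zip(h, x):
        if h_val == EMPTY:        # first positive example fixes the value
            new_h.append(x_val)
        elif h_val == x_val:      # constraint already satisfied
            new_h.append(h_val)
        else:                     # conflicting value -> relax to "don't care"
            new_h.append("?")
    return tuple(new_h)

def specific_to_general(samples):
    """samples: list of (attribute_tuple, label); negative examples are ignored."""
    h = tuple(EMPTY for _ in samples[0][0])   # start from the all-false hypothesis
    for x, positive in samples:
        if positive:
            h = generalize(h, x)
    return h

data = [
    (("Sunny", "Warm", "Normal", "Strong", "Warm", "Same"), True),
    (("Rainy", "Cold", "Normal", "Strong", "Warm", "Change"), False),
    (("Sunny", "Warm", "High", "Strong", "Warm", "Same"), True),
    (("Sunny", "Warm", "High", "Strong", "Cool", "Same"), True),
]
print(specific_to_general(data))   # -> ('Sunny', 'Warm', '?', 'Strong', '?', 'Same')
```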

General to specific learning

The dual problem to specific to general learning. Start from the all-true hypothesis h0 = (?, ?, ?, ?, ?, ?) and refine the concept description so that all samples are consistent (keep the maximal possible generalization).
  h = (?, ?, ?, ?, ?, ?)
  (Sunny, Warm, Normal, Strong, Warm, Same)   T  →  h = (?, ?, ?, ?, ?, ?)
  (Sunny, Warm, High, Strong, Warm, Same)     T  →  h = (?, ?, ?, ?, ?, ?)
  (Rainy, Cold, Normal, Strong, Warm, Change) F  →  h ∈ { (Sunny, ?, ?, ?, ?, ?), (?, Warm, ?, ?, ?, ?), (?, ?, ?, ?, ?, Same) }

Mitchell's version space algorithm

Keeps the space of consistent hypotheses, bounded by two fringes:
- Most general rule: the upper bound (fringe), pushed down by (negative) examples.
- Most specific rule: the lower bound (fringe), pushed up by (positive) examples.
The version space lies between the two fringes.
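The refinement step after the negative example can likewise be sketched in code. The function below is a simplified illustration (not the full candidate-elimination G-boundary update): it specializes a single all-general hypothesis using the attribute values of a previously seen positive example.

```python
# Minimal specializations of the all-general hypothesis against a negative example
# (a simplified sketch of the general-to-specific refinement step above).

def minimal_specializations(h, negative, positive):
    """Replace one '?' at a time by the value from a positive example, keeping
    only those specializations that exclude the negative example."""
    result = []
    for i, h_val in enumerate(h):
        if h_val == "?" and positive[i] != negative[i]:
            s = list(h)
            s[i] = positive[i]      # this constraint rules out the negative example
            result.append(tuple(s))
    return result

h0 = ("?",) * 6
pos = ("Sunny", "Warm", "Normal", "Strong", "Warm", "Same")
neg = ("Rainy", "Cold", "Normal", "Strong", "Warm", "Change")
print(minimal_specializations(h0, neg, pos))
# -> specializations on Sky, AirTemp and Forecast, as in the trace above
```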

Mitchell's version space algorithm

Keeps and refines the fringes of the version space. Converges to the target concept whenever the target is a member of the hypothesis space H.
Assumption: no noise in the data samples (the same example always has the same label).
The hope is that the fringes always stay small. Is this correct?

Exponential fringe set example

Conjunctive concepts, upper fringe (general to specific). Samples over n binary attributes:
  (true, true, true, true, ..., true)     T
  (false, false, true, true, ..., true)   F
  (true, true, false, false, ..., true)   F
  ...
  (true, true, true, ..., false, false)   F
Maximal generalizations: 2^(n/2) different hypotheses we need to remember:
  (true, ?, true, ?, ..., true, ?)
  (?, true, true, ?, ..., true, ?)
  (true, ?, ?, true, ..., true, ?)
  ...
  (?, true, ?, true, ..., ?, true)

Learning concepts

The version space algorithm may require a large number of samples to converge to the target concept:
- In the worst case we must see all concepts before converging to it.
- The samples may come from different distributions; it may take a very long time to see all examples.
- The fringe can grow exponential in the number of attributes.
Alternative solution: select a hypothesis that is consistent after some number of labeled samples has been seen by our algorithm.
Can we tell how far we are from the solution? Yes!!! The PAC framework develops criteria for measuring the accuracy of our choice in probabilistic terms.

Valiant's framework

- There is a probability distribution from which samples are drawn.
- There is an error permitted in assigning the labels to examples: the concept learned does not have to be perfect, but it should not be very far from the target concept.
- Notation: c_T ... target concept, c ... learned concept, x ... next sample from the distribution.
  Error(c_T, c) = P(x ∈ c ∧ x ∉ c_T) + P(x ∉ c ∧ x ∈ c_T)
- ε ... accuracy parameter. We would like to have a concept such that Error(c_T, c) ≤ ε.
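As a concrete illustration of this error definition, the following sketch estimates Error(c_T, c) by sampling; the uniform distribution and the two concepts below are assumptions for the example only, not part of the lecture.

```python
# Monte Carlo sketch of Error(c_T, c): the probability mass on which the target
# and the learned concept disagree, under an assumed distribution D.

import random

def estimate_error(c_target, c_learned, draw_x, m=100_000):
    """Estimate P(x belongs to exactly one of c_target, c_learned)."""
    return sum(c_target(x) != c_learned(x) for x in (draw_x() for _ in range(m))) / m

n = 6
draw_x = lambda: tuple(random.random() < 0.5 for _ in range(n))   # assumed uniform D

c_target = lambda x: x[0] and x[2]    # illustrative target concept: a1 AND a3
c_learned = lambda x: x[0]            # illustrative learned concept: a1

print(estimate_error(c_target, c_learned, draw_x))   # close to 0.25
```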

PAC learning

To get the error smaller than the accuracy parameter ε in all cases may be hard: some examples may be very rare, and to see them may require a large number of samples. Instead we choose:
  P(Error(c_T, c) ≤ ε) ≥ 1 − δ
where δ is a confidence factor.
Probably approximately correct (PAC) learning: with probability at least 1 − δ, a concept with an error of not more than ε is found.

Sample complexity of PAC learning

How many samples do we need to see to satisfy the PAC criterion?
Assume: we saw m independent samples drawn from the distribution, and h is a hypothesis that is consistent with all m examples but its error is larger than ε, Error(c_T, h) > ε. Then:
  P(a sample is consistent with such an h) ≤ (1 − ε)
  P(m samples are consistent with such an h) ≤ (1 − ε)^m
There are at most |H| hypotheses in the space, so
  P(any bad hypothesis survives m samples) ≤ |H| (1 − ε)^m

Sample complexity of PAC learning

  P(any bad hypothesis survives m samples) ≤ |H| (1 − ε)^m ≤ |H| e^(−εm)
In the PAC framework we want to bound this probability by the confidence factor δ:
  |H| e^(−εm) ≤ δ
Expressing for m:
  m ≥ (1/ε) (ln(1/δ) + ln|H|)
After m samples satisfying the above inequality, any consistent hypothesis satisfies the PAC criterion.

Efficient PAC learnability

The concept is efficiently PAC learnable if the time it takes to output the concept is polynomial in n, 1/ε, 1/δ.
Two aspects:
- Sample complexity: the number of examples needed to learn the concept satisfying the PAC criterion. A prerequisite to efficient PAC learnability.
- Time complexity: the time it takes to find the concept. Even if the sample complexity is OK, the learning procedure may not be efficient (e.g. an exponential fringe).

Efficient PAC learnability

The sample complexity depends on the hypothesis space we use:
- Conjunctive concepts: 3^n possible concepts,
    m ≥ (1/ε) (ln(1/δ) + ln 3^n) = (1/ε) (ln(1/δ) + n ln 3)   ... efficient
- All possible concepts (unbiased hypothesis space): 2^(2^n) concepts,
    m ≥ (1/ε) (ln(1/δ) + ln 2^(2^n)) = (1/ε) (ln(1/δ) + 2^n ln 2)   ... inefficient

Efficient PAC learnability

Polynomial sample complexity is necessary but not sufficient; the algorithm should also work in polynomial time. Some types of concepts (hypotheses) can be learned efficiently. Example: conjunctive concepts.
- Specific to general learning: keeps one hypothesis around, the most specific description of all positive examples. Can be done in polynomial time.
- General to specific learning: we need to keep the complete upper fringe, which can be exponential. Cannot be done in polynomial time.
Other concept (hypothesis) spaces with polynomial sample complexity:
- k-DNF: cannot be PAC learned in polynomial time.
- k-CNF: has a polynomial time solution.
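A short Python sketch evaluating the bound m ≥ (1/ε)(ln(1/δ) + ln|H|) for the two hypothesis spaces above; the parameter values are illustrative assumptions.

```python
# Evaluating the PAC sample-complexity bound m >= (1/eps)*(ln(1/delta) + ln|H|)
# for the conjunctive space (|H| = 3^n) and the unbiased space (|H| = 2^(2^n)).

import math

def pac_sample_bound(eps, delta, log_H):
    """Smallest m with |H|*exp(-eps*m) <= delta, where log_H = ln|H|."""
    return math.ceil((math.log(1.0 / delta) + log_H) / eps)

eps, delta, n = 0.1, 0.05, 10
print("conjunctive (3^n):  ", pac_sample_bound(eps, delta, n * math.log(3)))
print("unbiased (2^(2^n)): ", pac_sample_bound(eps, delta, (2 ** n) * math.log(2)))
```

For these values the conjunctive space needs on the order of a hundred samples, while the unbiased space already needs thousands, and the gap grows exponentially with n.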

Learning conjunctive concepts

Learning conjunctive concepts with specific to general learning: it is sufficient to keep one hypothesis around, the most specific description of all positive examples. Can be done in polynomial time. How?
Initial hypothesis (all false): a1 ∧ ¬a1 ∧ a2 ∧ ¬a2 ∧ ... ∧ ak ∧ ¬ak
When a positive instance is seen, we remove the inconsistent literals from the conjunction.
  Positive instance: a1, ¬a2, ..., ak
  Updated hypothesis: a1 ∧ ¬a2 ∧ ... ∧ ak   (the literals falsified by the instance are deleted)
We keep doing this for m steps.

Learning 3-CNF

Sample complexity is polynomial for both k-CNF and k-DNF, but:
- k-DNF cannot be learned efficiently.
- k-CNF can be learned efficiently. How?
Assume a 3-CNF: (a1 ∨ a3 ∨ a7) ∧ (a2 ∨ a4 ∨ a5) ∧ ...
There is only a polynomial number of clauses with at most 3 variables: O(n^3).
Algorithm (specific to general learning):
- Start with the conjunction of all possible clauses (always false).
- On a positive example, any clause that is not true is deleted.
- On negative examples, do nothing.
Interesting: any k-DNF can be converted into a k-CNF.
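A minimal sketch of this clause-elimination algorithm for 3-CNF (the encoding of literals and the helper names are illustrative, not from the lecture):

```python
# Clause-elimination sketch for learning a 3-CNF: start with the conjunction of
# all clauses with at most 3 literals, and on each positive example delete every
# clause that the example falsifies. Negative examples are ignored.

from itertools import combinations

def all_clauses(n, k=3):
    """All clauses (frozensets of literals) of size <= k over variables 1..n.
    Literal +i stands for x_i and literal -i for NOT x_i."""
    literals = [lit for v in range(1, n + 1) for lit in (v, -v)]
    return {frozenset(c) for size in range(1, k + 1) for c in combinations(literals, size)}

def clause_true(clause, x):
    """x is a tuple of booleans; x[i-1] is the value of variable i."""
    return any((lit > 0) == x[abs(lit) - 1] for lit in clause)

def learn_3cnf(samples, n):
    hypothesis = all_clauses(n)        # conjunction of all clauses: initially always false
    for x, positive in samples:
        if positive:
            hypothesis = {c for c in hypothesis if clause_true(c, x)}
    return hypothesis

# Illustrative positive examples of the target (x1 OR x2) AND x3:
samples = [((True, False, True), True),
           ((False, True, True), True),
           ((True, True, True), True)]
print(len(learn_3cnf(samples, n=3)), "clauses remain in the learned conjunction")
```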

Quantifying inductive bias

During learning only a small fraction of the possible samples is seen, so we need to generalize to unseen examples. The choice of the hypothesis space restricts our learning options: it biases our learning. Other biases: a preference towards simpler hypotheses, smaller degrees of freedom.
Questions:
- How do we measure the bias?
- To what extent do our biases affect our learning capabilities?
- Can we learn even if the hypothesis space is infinite? (The bound m ≥ (1/ε)(ln(1/δ) + ln|H|) no longer helps there.)

Vapnik-Chervonenkis dimension

Measures the bias of the concept space. It allows us to:
- obtain a better sample complexity bound;
- extend the analysis to attributes with infinite value spaces.
VC idea: do not measure the size of the space, but the number of distinct instances that can be completely discriminated using H.
Example: H is the space of (axis-parallel) rectangles; discrimination of the labelings of 3 points with rectangles.

Shattering of a set of instances

A set of instances S ⊆ X: H shatters S if for every dichotomy (combination of labels) of S there is a hypothesis h ∈ H consistent with the dichotomy.
Example: H is the space of rectangles; take a set of 3 instances (the most flexible choice of points). Dichotomy 1, dichotomy 2, ..., dichotomy 2^3: there are 2^3 = 8 different dichotomies, and a hypothesis (rectangle) for each of them.

Vapnik-Chervonenkis dimension

The VC dimension of a hypothesis space H is the size of the largest subset of instances that is shattered by H.
Example: rectangles (VC dimension at least 3).
- Try 4: can be shattered (for the most flexible choice of 4 points), so the VC dimension is at least 4.
- Try 5: no set of 5 points can be shattered, thus the VC dimension is 4.
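A small sketch that tests shattering by axis-aligned rectangles directly; it relies on the observation that a dichotomy is realizable iff the bounding box of its positive points excludes all negative points (the point sets below are assumptions for the example):

```python
# Shattering check for axis-aligned rectangles in the plane: a dichotomy is
# realizable iff the bounding box of its positive points contains no negative
# point, so it suffices to test that box for every labeling.

from itertools import product

def in_box(p, box):
    (x0, x1), (y0, y1) = box
    return x0 <= p[0] <= x1 and y0 <= p[1] <= y1

def shattered(points):
    for labels in product([False, True], repeat=len(points)):
        pos = [p for p, lab in zip(points, labels) if lab]
        neg = [p for p, lab in zip(points, labels) if not lab]
        if not pos:
            continue                     # the empty rectangle realizes the all-negative labeling
        box = ((min(x for x, _ in pos), max(x for x, _ in pos)),
               (min(y for _, y in pos), max(y for _, y in pos)))
        if any(in_box(p, box) for p in neg):
            return False                 # this dichotomy cannot be realized
    return True

print(shattered([(0, 1), (1, 0), (2, 2), (1, 3)]))           # True: these 4 points are shattered
print(shattered([(0, 0), (1, 1), (2, 2), (3, 3), (4, 4)]))   # False: 5 points on a line are not
```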

VC dimension and sample complexity

One can derive a sample complexity bound for PAC learning using the VC dimension instead of the hypothesis space size (we won't do it here):
  m ≥ (1/ε) (4 log2(2/δ) + 8 VCdim(H) log2(13/ε))

Adding noise

We have a target concept, but there is a chance of mislabeling the examples seen. Can we PAC-learn also in this case?
Blumer (1986): if h is a hypothesis that agrees with at least m ≥ (1/ε) ln(|H|/δ) samples drawn from the distribution, then P(error(h, c_T) > ε) ≤ δ.
Mitchell gives the sample complexity bound for the choice of the hypothesis with the best training error.
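For illustration, the VC-based bound can be evaluated numerically; the parameter values below are assumptions for the example, not from the lecture.

```python
# Numerical evaluation of the VC-based sample-complexity bound
# m >= (1/eps) * (4*log2(2/delta) + 8*VCdim(H)*log2(13/eps)).

import math

def vc_sample_bound(eps, delta, vc_dim):
    return math.ceil((4 * math.log2(2 / delta) + 8 * vc_dim * math.log2(13 / eps)) / eps)

# Axis-aligned rectangles in the plane have VC dimension 4 (see the slides above).
print(vc_sample_bound(eps=0.1, delta=0.05, vc_dim=4))   # on the order of a few thousand samples
```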

Summary

- Learning boolean functions
- Most general and most specific consistent hypothesis
- Mitchell's version space algorithm
- Probably approximately correct (PAC) learning
- Sample complexity for PAC
- Vapnik-Chervonenkis (VC) dimension
- Improved sample complexity bounds
- Adding noise